Object Storage (S3)

Object storage is the backbone of modern data infrastructure — not just for user uploads, but for data lakes, ML training sets, backups, and static asset delivery at any scale. S3 is the canonical implementation; understanding its design explains why it works differently from block or file storage.

Core Model: Flat Namespace

Object storage has no real directory hierarchy. A bucket is a flat keyspace. The apparent “folder” structure (images/2024/jan/photo.jpg) is just a key string — the / delimiter is cosmetic. There are no directories to create or delete.

Bucket: my-app-assets
  Key: images/2024/jan/photo.jpg    ← one atomic object
  Key: images/2024/feb/photo.jpg    ← another atomic object
  Key: videos/intro.mp4

# "images/2024/" is not a directory — it is a key prefix.
# Listing objects with prefix "images/2024/jan/" scans the keyspace.
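
To make the "prefix, not directory" point concrete, here is a minimal in-memory sketch of how a prefix/delimiter listing can be emulated over a flat set of keys. The function name `list_objects` and its return shape are illustrative assumptions, not the S3 API; the real LIST call is paginated and returns richer records.

```python
def list_objects(keys, prefix="", delimiter=None):
    """Emulate a prefix/delimiter listing over a flat set of keys.

    Returns (contents, common_prefixes), both sorted.
    """
    contents = []
    common_prefixes = set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Roll everything up to the next delimiter into one "folder" entry.
            head = rest.split(delimiter, 1)[0]
            common_prefixes.add(prefix + head + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common_prefixes)


keys = [
    "images/2024/jan/photo.jpg",
    "images/2024/feb/photo.jpg",
    "videos/intro.mp4",
]

# "Folders" under images/2024/ are just grouped key prefixes.
print(list_objects(keys, prefix="images/2024/", delimiter="/"))
# Without a delimiter, the listing returns the matching keys themselves.
print(list_objects(keys, prefix="images/2024/jan/"))
```

The "folders" only appear when a delimiter is supplied; nothing in the keyspace itself stores them.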

Three components per object:

  • Key — the unique identifier within a bucket (max 1,024 bytes UTF-8)
  • Data — the object body (bytes, up to 5 TB per object)
  • Metadata — system metadata (Content-Type, ETag, Last-Modified) + up to 2 KB of user-defined key-value pairs; stored separately from data, returned on HEAD requests without fetching the body

Objects are immutable: you cannot append to or partially update an object in place. Every write replaces the whole object (or adds a new version when versioning is enabled). This immutability is what makes consistency and replication tractable at global scale.

Internal Architecture

                        ┌─────────────────────────────┐
Client ──── HTTPS ──────►      S3 Frontend (API layer)  │
                        └────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │   Metadata Service               │
                    │   (bucket → object key → data    │
                    │    node location mapping)        │
                    └────────────────┬────────────────┘
                                     │ locate data nodes
           ┌─────────────────────────▼──────────────────────────┐
           │              Data Layer (distributed storage nodes)  │
           │   Node A     Node B     Node C    Node D    Node E  │
           │  [chunk 1]  [chunk 2]  [chunk 3] [parity1] [parity2]│
           └─────────────────────────────────────────────────────┘

The metadata service (a distributed key-value store internally) maps (bucket, key) → object location on data nodes. The data layer stores the actual bytes, split into chunks and protected with erasure coding. A read requires one metadata lookup followed by data fetches from the appropriate nodes.

Object Operations and Semantics

Operation     | HTTP                              | Behavior
------------- | --------------------------------- | --------
PUT Object    | PUT /bucket/key                   | Upload object (up to 5 GB in one request; use multipart above 100 MB)
GET Object    | GET /bucket/key                   | Download full object; use Range header for partial fetch
HEAD Object   | HEAD /bucket/key                  | Returns metadata only, with no body transfer; use for existence checks
DELETE Object | DELETE /bucket/key                | With versioning: adds a delete marker. Without: permanent deletion
COPY Object   | PUT with x-amz-copy-source        | Server-side copy: no data passes through the client, but copy time still grows with object size
LIST Objects  | GET /bucket?prefix=&delimiter=    | Paginated key listing (up to 1,000 keys per page); expensive on large buckets

Byte-range fetches: Download a specific byte range of a large object without fetching the whole thing. Enables parallel download of large files — split into N ranges, fetch concurrently, reassemble.

GET /bucket/large-file.parquet
Range: bytes=0-10485759      # first 10 MB

GET /bucket/large-file.parquet
Range: bytes=10485760-20971519  # second 10 MB

This is how Spark and Athena read Parquet files: they fetch only the byte ranges of the column chunks they need, not the full file.
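
As a small illustration of splitting a download into ranges, the sketch below computes the `Range` header values for an object of known size. The helper name `byte_ranges` is an assumption made for this example; the only protocol detail it relies on is that HTTP range ends are inclusive.

```python
def byte_ranges(total_size, chunk_size):
    """Return Range header values that together cover total_size bytes."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size > 0")
    ranges = []
    for start in range(0, total_size, chunk_size):
        # HTTP byte ranges are inclusive, hence the -1.
        end = min(start + chunk_size, total_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


MIB = 1024 * 1024
for header in byte_ranges(25 * MIB, 10 * MIB):
    print(header)
```

Each header can then be sent on its own GET request, with the responses concatenated in order to rebuild the object.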

Multipart Upload

For objects larger than 100 MB, multipart upload is recommended (required above 5 GB). It enables parallel upload, reduced retry scope on failures, and streaming uploads where total size is unknown.

sequenceDiagram
    participant C as Client
    participant S as S3

    C->>S: POST /bucket/key?uploads (Initiate)
    S->>C: UploadId: "abc123"

    par Upload parts in parallel (any order)
        C->>S: PUT /key?partNumber=1&uploadId=abc123
        S->>C: ETag: "e1"
        C->>S: PUT /key?partNumber=2&uploadId=abc123
        S->>C: ETag: "e2"
        C->>S: PUT /key?partNumber=3&uploadId=abc123
        S->>C: ETag: "e3"
    end

    C->>S: POST /key?uploadId=abc123 (Complete, with ETag list)
    Note over S: Assembles parts server-side
    S->>C: 200 OK — object now readable

    Note over C,S: On failure: DELETE /key?uploadId=abc123 to abort

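Part sizing: Minimum 5 MB per part (except the last). Maximum 10,000 parts per upload. With 5 MiB parts the ceiling is 10,000 × 5 MiB, about 52.4 GB (48.8 GiB), so a 50 GB file sits right at the limit and anything larger needs bigger parts. For 100 GB files, 10 MB parts are again the bare minimum; choose 16 MB or more to leave headroom.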
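
The part-size arithmetic can be sketched as a small helper. The function names and the choice to round up to whole MiB are assumptions for illustration; the only limits used are the two stated above (5 MiB minimum part, 10,000 parts maximum).

```python
MIB = 1024 * 1024
GIB = 1024 * MIB
MIN_PART = 5 * MIB
MAX_PARTS = 10_000


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def choose_part_size(object_size):
    """Smallest whole-MiB part size (>= 5 MiB) that keeps the upload within 10,000 parts."""
    needed_bytes = ceil_div(object_size, MAX_PARTS)
    needed_mib = ceil_div(needed_bytes, MIB) * MIB
    return max(MIN_PART, needed_mib)


def part_count(object_size, part_size):
    return ceil_div(object_size, part_size)


for size in (1 * GIB, 50 * GIB, 100 * GIB):
    part = choose_part_size(size)
    print(f"{size // GIB} GiB -> {part // MIB} MiB parts, {part_count(size, part)} parts")
```

In practice a larger part size than this minimum is often preferable, since fewer parts mean fewer requests.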

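Resume on failure: Each part is independently checksummed (for parts not encrypted with SSE-KMS or SSE-C, the part's ETag is the MD5 of its bytes). If part 7 fails, only part 7 is re-uploaded. The other parts remain staged on S3 until the upload is completed or aborted.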
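
The retry-only-the-failed-part behavior can be sketched without touching S3 at all. In the example below, `upload_part` is a hypothetical callable standing in for the real part-upload request, and `FlakyUploader` is a test double that fails once on part 2; parts are uploaded sequentially here for simplicity, although real clients usually upload them in parallel.

```python
import hashlib


def upload_parts(parts, upload_part, max_attempts=3):
    """Upload each part in order, retrying only the part that failed.

    parts: list of bytes objects.
    upload_part: hypothetical callable(part_number, data) -> ETag string,
                 raising IOError on a transient failure.
    Returns [(part_number, etag), ...], the list needed to complete the upload.
    """
    completed = []
    for number, data in enumerate(parts, start=1):
        for attempt in range(1, max_attempts + 1):
            try:
                etag = upload_part(number, data)
                break
            except IOError:
                if attempt == max_attempts:
                    raise  # give up; the caller should abort the upload
        completed.append((number, etag))
    return completed


class FlakyUploader:
    """Stand-in for S3: fails the first attempt at part 2, then succeeds."""

    def __init__(self):
        self.calls = []

    def __call__(self, part_number, data):
        self.calls.append(part_number)
        if part_number == 2 and self.calls.count(2) == 1:
            raise IOError("simulated network glitch")
        return hashlib.md5(data).hexdigest()


uploader = FlakyUploader()
result = upload_parts([b"aaa", b"bbb", b"ccc"], uploader)
print(uploader.calls)  # part 2 is attempted twice, parts 1 and 3 once
```

Parts 1 and 3 are never re-sent, which is the whole point of splitting the transfer.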

Upload directly from client (presigned URL): Generate a presigned PUT URL server-side; the client uploads directly to S3 without routing bytes through your application servers.

# Server generates a presigned URL (valid for 15 minutes)
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-bucket", "Key": "uploads/photo.jpg",
            "ContentType": "image/jpeg"},
    ExpiresIn=900,  # seconds
)
# → https://my-bucket.s3.amazonaws.com/uploads/photo.jpg?X-Amz-Signature=...

# Client uploads directly (the Content-Type must match the signed value):
PUT https://my-bucket.s3.amazonaws.com/uploads/photo.jpg?X-Amz-Signature=...
Content-Type: image/jpeg
[binary body]

This offloads upload bandwidth from your servers entirely. A 4K video upload goes directly from the user’s browser to S3.

Consistency Model

S3 provides strong read-after-write consistency for all operations (since December 2020):

  • A successful PUT is immediately visible to subsequent GET, HEAD, and LIST requests
  • After a successful DELETE, the object immediately stops appearing in GET and LIST results
  • After an overwrite PUT, reads return the new object, never the old one

This replaced the previous eventual consistency model for overwrites and deletes. There is no longer a window where a stale object or missing key could be returned after a successful write.

One exception: cross-region replication lag. If you configure cross-region replication (CRR), the replicated bucket in the secondary region is eventually consistent with the source — there is a replication lag (typically seconds to minutes).

Durability and Replication

S3 Standard is designed for 99.999999999% (11 nines) durability, so losing an object to hardware failure is vanishingly unlikely in practice. This is achieved through erasure coding rather than simple full-copy replication.

Erasure coding (Reed-Solomon): An object is split into data chunks and parity chunks. S3 Standard uses an erasure coding scheme across multiple Availability Zones. Even if an entire AZ fails (losing several nodes), the object can be fully reconstructed from the remaining data and parity chunks.

Object → split into chunks
         [D1] [D2] [D3] [D4]   ← data chunks (AZ-a, AZ-b, AZ-c, AZ-d)
         [P1] [P2]             ← parity chunks (different AZs)

AZ failure → lose D2
         [D1] [  ] [D3] [D4] [P1] [P2]
→ reconstruct D2 from remaining chunks — object fully recoverable
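
The reconstruction idea can be shown with the simplest possible code: a single XOR parity chunk. This is deliberately not Reed-Solomon and not what S3 runs internally; it can repair only one missing chunk, whereas Reed-Solomon generalizes the same idea to several parity chunks. The chunk contents and function names are made up for the example, and all chunks are assumed to have equal length.

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(chunks):
    """XOR all equal-length data chunks into one parity chunk."""
    parity = bytes(len(chunks[0]))
    for chunk in chunks:
        parity = xor_bytes(parity, chunk)
    return parity


def reconstruct(chunks, parity):
    """Rebuild the single missing chunk (marked None) from the survivors."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) != 1:
        raise ValueError("single XOR parity can repair exactly one missing chunk")
    rebuilt = parity
    for chunk in chunks:
        if chunk is not None:
            rebuilt = xor_bytes(rebuilt, chunk)
    repaired = list(chunks)
    repaired[missing[0]] = rebuilt
    return repaired


data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # D1..D4
parity = make_parity(data)                     # P1
damaged = [data[0], None, data[2], data[3]]    # D2 lost
print(reconstruct(damaged, parity) == data)
```

The storage overhead here is one extra chunk for four data chunks, compared with two or three full copies under plain replication.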

Availability: S3 Standard is designed for 99.99% availability (about 52 minutes of downtime per year); the SLA itself commits to 99.9%. This is different from durability: availability is “can I access it right now?” while durability is “will the bytes still exist?”
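
The downtime figure follows directly from the availability percentage. A quick sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes per year for a given availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR


for availability in (0.999, 0.9999):
    minutes = downtime_minutes_per_year(availability)
    print(f"{availability:.2%} available -> {minutes:.1f} minutes/year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why 99.9% and 99.99% differ so much in practice.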

Storage Classes

Objects are not all equally hot. S3 offers tiered storage classes with different cost and retrieval tradeoffs.

Storage Class                 | Use case                                  | Retrieval            | Storage cost         | Retrieval cost
----------------------------- | ----------------------------------------- | -------------------- | -------------------- | --------------
S3 Standard                   | Actively accessed data                    | Milliseconds         | Highest              | None
S3 Intelligent-Tiering        | Unknown or changing access pattern        | Milliseconds         | Variable (auto-tier) | None
S3 Standard-IA                | Infrequently accessed, rapid retrieval    | Milliseconds         | Lower                | Per-GB retrieval fee
S3 One Zone-IA                | IA in a single AZ; not resilient to AZ loss | Milliseconds       | Lower                | Per-GB retrieval fee
S3 Glacier Instant Retrieval  | Archive with instant retrieval            | Milliseconds         | Low                  | Higher
S3 Glacier Flexible Retrieval | Archive, retrieval in minutes to hours    | Minutes to 12 hours  | Very low             | Retrieval fee
S3 Glacier Deep Archive       | Long-term archive, rarely accessed        | 12–48 hours          | Lowest               | Retrieval fee

Lifecycle policies automate transitions. The rule below also expires (deletes) objects after 2,555 days, roughly 7 years:

{
  "Rules": [{
    "Status": "Enabled",
    "Filter": { "Prefix": "logs/" },
    "Transitions": [
      { "Days": 30,  "StorageClass": "STANDARD_IA" },
      { "Days": 90,  "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "Expiration": { "Days": 2555 }
  }]
}
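
To see what this rule means for a single object, the sketch below maps an object's age to the storage class the rule would eventually place it in. It only models the day thresholds from the policy above; the function name is an assumption, and real lifecycle actions run asynchronously, so an actual transition can lag the threshold.

```python
# Thresholds copied from the lifecycle rule above.
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]
EXPIRATION_DAYS = 2555


def storage_class_at(age_days):
    """Storage class for an object of the given age, or None once it has expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    current = "STANDARD"
    for days, storage_class in TRANSITIONS:
        if age_days >= days:
            current = storage_class
    return current


for age in (0, 30, 90, 400, 2555):
    print(age, storage_class_at(age))
```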

Event-Driven Patterns

S3 can trigger downstream systems on object creation, deletion, or restore completion.

S3 event (e.g. PUT) → Lambda              (thumbnail generation)
                    → SQS                 (async processing queue)
                    → SNS / EventBridge   (fan-out, data pipeline trigger)

Common pattern — async media processing:

sequenceDiagram
    participant C as Client (Browser)
    participant App as App Server
    participant S as S3
    participant Q as SQS
    participant W as Worker (Lambda)

    C->>App: Request upload URL
    App->>S: Generate presigned PUT URL (expires 15min)
    App->>C: Presigned URL
    C->>S: PUT video.mp4 (direct upload, bypasses App)
    S->>Q: S3 Event Notification (object created)
    Q->>W: Deliver message
    W->>S: GET video.mp4 (fetch original)
    Note over W: Transcode to HLS segments
    W->>S: PUT seg-001.ts, seg-002.ts, manifest.m3u8
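
On the worker side, the first step is pulling the bucket and key out of the notification. The sketch below assumes S3 delivers directly to SQS, so the message body is the S3 event JSON itself (a notification routed through SNS would be wrapped in an extra envelope). The function name and the sample payload are illustrative; the one detail worth knowing is that object keys in these events are URL-encoded.

```python
import json
from urllib.parse import unquote_plus


def created_objects(message_body):
    """Yield (bucket, key, size) for each ObjectCreated record in a message body."""
    event = json.loads(message_body)
    # Test events sent when a notification is first configured carry no Records.
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        # Keys arrive URL-encoded, with spaces sent as "+".
        key = unquote_plus(s3["object"]["key"])
        yield s3["bucket"]["name"], key, s3["object"].get("size")


sample = json.dumps({"Records": [{
    "eventName": "ObjectCreated:Put",
    "s3": {"bucket": {"name": "my-bucket"},
           "object": {"key": "uploads/my+video.mp4", "size": 1048576}},
}]})
print(list(created_objects(sample)))
```

Decoding the key before the follow-up GET avoids "key not found" errors for names containing spaces or special characters.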

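S3 Select: Executes SQL-like queries directly against a single S3 object (CSV, JSON, or Parquet) and returns only the matching rows, without downloading the full file. Useful for filtering large log files or sampling Parquet columns. Note that AWS closed S3 Select to new customers in July 2024; Athena covers similar use cases.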

SELECT * FROM S3Object WHERE status = 'error' LIMIT 1000

Object Storage vs Block Storage vs File Storage

Aspect     | Object Storage (S3)                       | Block Storage (EBS)                 | File Storage (EFS/NFS)
---------- | ----------------------------------------- | ----------------------------------- | ----------------------
Access     | HTTP API (GET/PUT)                        | Raw block device (mounted as disk)  | POSIX filesystem (mount, read, write, seek)
Mutability | Immutable (replace whole object)          | In-place read/write at any offset   | In-place read/write, append
Latency    | Tens to hundreds of ms                    | Sub-ms (NVMe SSD)                   | Low ms (network-attached)
Throughput | Very high (parallel objects)              | Limited per volume                  | Limited per mount
Scale      | Unlimited (exabytes)                      | Up to 64 TB per volume              | Petabytes
Cost       | Very low (~$0.023/GB-month)               | Higher (~$0.10/GB-month SSD)        | Higher (~$0.30/GB-month)
Best for   | Media, backups, data lakes, static assets | Database volumes, OS disks          | Shared config, CMS, legacy apps needing POSIX

Use Cases and Anti-Patterns

Good fits:

  • User-uploaded media — photos, videos, documents; presigned URLs for direct uploads
  • Static asset delivery — JS bundles, images; serve via CloudFront CDN
  • Data lake — raw and processed Parquet/ORC files; queried by Athena/Spark
  • Backups and snapshots — database dumps, VM snapshots with Glacier lifecycle
  • ML training data — petabyte-scale datasets; S3 → training cluster via high-throughput parallel reads
  • Software distribution — firmware, installers; versioned and globally available

Poor fits:

  • Database storage — S3 latency (ms) vs block storage (μs); databases need in-place writes
  • Frequently mutated files — each update replaces the entire object; overhead for small changes
  • POSIX-dependent applications — file locks, atomic rename, directory traversal — S3 doesn’t support these
  • Small objects at massive volume — millions of tiny objects (< 1 KB) incur high per-request API cost relative to storage cost; consider batching into archives or using a database instead

The key design principle: size objects for your read unit. If you always read a day’s worth of logs together, store them as one object per day — not one object per log line. The per-request API cost and latency make many tiny objects expensive; fewer large objects with byte-range fetches is almost always more efficient.

Interview tip: When the design involves user uploads, media, or a data lake, I’d say: “I’d put it in S3 with presigned PUT URLs so client uploads bypass our app servers entirely: direct browser-to-S3, with no bandwidth through us. For large files I’d use multipart upload (5 MB minimum part size, 10K part limit) so a failed part doesn’t restart the whole transfer. I’d push back on using S3 as a database, because its latency is tens of ms versus microseconds for block storage and there’s no in-place update, but I’d use it as the durable backing for data lakes (Parquet with byte-range fetches), backups, and ML training sets. For cost I’d use lifecycle policies to transition cold data to Glacier, and I’d call out that strong read-after-write consistency has been the default since December 2020, so we don’t need to design around eventual consistency anymore, except for cross-region replication.”

Test Your Understanding

A client uploads a 5GB file to S3 using a single PUT request. The upload fails at 4.8GB due to a network glitch. The entire upload must restart. How do you prevent this?

Multipart upload. Split the file into parts (minimum 5 MB, up to 10,000 parts). Upload each part independently — if one fails, retry only that part. After all parts succeed, call CompleteMultipartUpload to assemble them. Parts can also upload in parallel for faster throughput.

For browser-to-S3 uploads, generate a pre-signed PUT URL so the client uploads directly to S3, bypassing your application servers entirely.

Your data lake stores 10 billion small objects (100 bytes each) in S3. Listing and querying is extremely slow and expensive. What’s wrong?

Per-object overhead dominates. Each S3 API call has per-request cost and ~50-100ms latency. 10 billion tiny objects means 10M LIST calls just to enumerate (1000 objects/page), and sequential GET is impossibly slow.

Fix: Batch small records into larger objects — one per day/hour in Parquet format. Use byte-range fetches to read specific sections. Query engines (Athena, Spark) read Parquet column statistics to skip irrelevant row groups without scanning.
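
The numbers behind this answer are easy to check. The sketch below uses only figures from the scenario (10 billion objects, 1,000 keys per LIST page, and the low end of the quoted GET latency); the batch size of 10 million records per object is an assumption chosen for illustration.

```python
OBJECTS = 10_000_000_000
KEYS_PER_LIST_PAGE = 1_000
GET_LATENCY_S = 0.05          # 50 ms, the low end of the range quoted above
SECONDS_PER_YEAR = 365 * 24 * 3600


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


list_calls = ceil_div(OBJECTS, KEYS_PER_LIST_PAGE)
sequential_get_years = OBJECTS * GET_LATENCY_S / SECONDS_PER_YEAR


def objects_after_batching(total_records, records_per_object):
    """How many objects remain once records are packed into larger files."""
    return ceil_div(total_records, records_per_object)


print(f"LIST calls to enumerate: {list_calls:,}")
print(f"Sequential GETs: {sequential_get_years:.1f} years")
print(f"Objects after batching: {objects_after_batching(OBJECTS, 10_000_000):,}")
```

At 100 bytes per record, 10 million records per object gives files of roughly 1 GB, a size that query engines handle comfortably.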

S3 provides strong read-after-write consistency (since Dec 2020). Does this mean you can use S3 as a database?

No. Strong consistency means a GET after PUT returns the latest data. But S3 still lacks: atomic read-modify-write, in-place updates (must rewrite entire objects), low latency (50-200ms vs microseconds), transactions, and query indexes. S3 is durable, cheap, infinitely scalable storage: ideal for data lakes and backups, not for transactional workloads.

You use S3 lifecycle policies to move objects to Glacier after 90 days. A user requests an archived object. What happens?

Glacier is not instant-access storage. Retrieval takes minutes to hours: Expedited (1-5 min), Standard (3-5 hours), Bulk (5-12 hours). A plain GET does not block and wait; it fails with an InvalidObjectState error until the object has been restored.

Your application must handle this asynchronously: accept the request, initiate a restore, notify the user when available. Alternatives: S3 Intelligent-Tiering (automatic tier movement, no retrieval delays) or Glacier Instant Retrieval (millisecond access for archival data accessed ~quarterly).
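
One piece of the asynchronous flow is checking whether a restore has finished. A HEAD request on an archived object reports this in the `x-amz-restore` header (exposed as `Restore` in boto3's `head_object` response). The sketch below parses that header value; the function name and the three state labels are assumptions for this example, and it presumes the caller already knows the object is in an archive class.

```python
import re


def parse_restore_header(value):
    """Classify the x-amz-restore header from a HEAD response.

    Returns "not_requested", "in_progress", or "available".
    """
    if not value:
        # Header absent: no restore has been requested for this archived object.
        return "not_requested"
    match = re.search(r'ongoing-request="(true|false)"', value)
    if match is None:
        raise ValueError(f"unrecognized restore header: {value!r}")
    return "in_progress" if match.group(1) == "true" else "available"


print(parse_restore_header(None))
print(parse_restore_header('ongoing-request="true"'))
print(parse_restore_header(
    'ongoing-request="false", expiry-date="Fri, 21 Dec 2012 00:00:00 GMT"'))
```

A worker can poll this state, or better, subscribe to the restore-completed event notification, and notify the user once the object becomes available.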