Object Storage (S3)

Object storage is the backbone of modern data infrastructure — not just for user uploads, but for data lakes, ML training sets, backups, and static asset delivery at any scale. S3 is the canonical implementation; understanding its design explains why it works differently from block or file storage.

Core Model: Flat Namespace

Object storage has no real directory hierarchy. A bucket is a flat keyspace. The apparent “folder” structure (images/2024/jan/photo.jpg) is just a key string — the / delimiter is cosmetic. There are no directories to create or delete.

Bucket: my-app-assets
  Key: images/2024/jan/photo.jpg    ← one atomic object
  Key: images/2024/feb/photo.jpg    ← another atomic object
  Key: videos/intro.mp4

# "images/2024/" is not a directory — it is a key prefix.
# Listing objects with prefix "images/2024/jan/" scans the keyspace.
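
To make the "prefix, not directory" idea concrete, here is a minimal sketch of how a prefix-plus-delimiter listing can be emulated over a flat set of keys. The function name `list_with_delimiter` is hypothetical and this is a local model of the behavior, not S3's actual implementation.

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Emulate a prefix + delimiter listing over a flat keyspace.

    Returns (objects, common_prefixes): keys directly "under" the prefix,
    and the rolled-up pseudo-folders one level below it.
    """
    objects = []
    common_prefixes = set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Everything up to the next delimiter is reported as one "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


keys = [
    "images/2024/jan/photo.jpg",
    "images/2024/feb/photo.jpg",
    "videos/intro.mp4",
]
objects, folders = list_with_delimiter(keys, prefix="images/2024/")
```

The "folders" exist only as a by-product of string matching on keys; nothing is created or deleted when they appear or disappear.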

Three components per object:

Key — the unique identifier within a bucket (max 1,024 bytes UTF-8)
Data — the object body (bytes, up to 5 TB per object)
Metadata — system metadata (Content-Type, ETag, Last-Modified) + up to 2 KB of user-defined key-value pairs; stored separately from data, returned on HEAD requests without fetching the body

Objects are immutable: you cannot append to an object or update part of it in place. Every write replaces the whole object or, with versioning enabled, creates a new version alongside the old one. This immutability is what makes consistency and replication tractable at global scale.

Internal Architecture

                        ┌─────────────────────────────┐
Client ──── HTTPS ──────►      S3 Frontend (API layer)  │
                        └────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │   Metadata Service               │
                    │   (bucket → object key → data    │
                    │    node location mapping)        │
                    └────────────────┬────────────────┘
                                     │ locate data nodes
           ┌─────────────────────────▼──────────────────────────┐
           │              Data Layer (distributed storage nodes)  │
           │   Node A     Node B     Node C    Node D    Node E  │
           │  [chunk 1]  [chunk 2]  [chunk 3] [parity1] [parity2]│
           └─────────────────────────────────────────────────────┘

The metadata service (a distributed key-value store internally) maps (bucket, key) → object location on data nodes. The data layer stores the actual bytes, split into chunks and protected with erasure coding. A read requires one metadata lookup followed by data fetches from the appropriate nodes.
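
The read path described above can be sketched as a toy model with two dictionaries. Node names, chunk IDs and the `read_object` helper are all hypothetical; the real service's internals are not public, so treat this purely as an illustration of "one metadata lookup, then data fetches".

```python
# Toy metadata service: (bucket, key) -> ordered list of (node, chunk_id).
metadata = {
    ("my-app-assets", "videos/intro.mp4"): [("node-a", "c1"), ("node-b", "c2")],
}

# Toy data layer: node -> {chunk_id: bytes}.
data_nodes = {
    "node-a": {"c1": b"hello "},
    "node-b": {"c2": b"world"},
}


def read_object(bucket, key):
    """One metadata lookup, then one fetch per chunk from the data nodes."""
    locations = metadata[(bucket, key)]
    return b"".join(data_nodes[node][chunk_id] for node, chunk_id in locations)


body = read_object("my-app-assets", "videos/intro.mp4")
```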

Object Operations and Semantics

Operation	HTTP	Behavior
`PUT Object`	`PUT /bucket/key`	Upload object (up to 5 GB in one request; use multipart above 100 MB)
`GET Object`	`GET /bucket/key`	Download full object; use `Range` header for partial fetch
`HEAD Object`	`HEAD /bucket/key`	Returns metadata only — no body transfer; use for existence checks
`DELETE Object`	`DELETE /bucket/key`	With versioning: adds a delete marker. Without: permanent deletion
`COPY Object`	`PUT` with `x-amz-copy-source`	Server-side copy; bytes are not routed through the client. Up to 5 GB per request; larger objects need a multipart copy
`LIST Objects`	`GET /bucket?prefix=&delimiter=`	Paginated key listing; expensive on large buckets

Byte-range fetches: Download a specific byte range of a large object without fetching the whole thing. Enables parallel download of large files — split into N ranges, fetch concurrently, reassemble.

GET /bucket/large-file.parquet
Range: bytes=0-10485759      # first 10 MB

GET /bucket/large-file.parquet
Range: bytes=10485760-20971519  # second 10 MB

This is how Spark and Athena read Parquet files: they fetch only the byte ranges of the column chunks they need, not the full file.
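
As a small sketch of the "split into N ranges" step, the helper below computes the `Range` header values for a given object size. The name `byte_ranges` is hypothetical; fetching the ranges concurrently and reassembling them is left to the caller.

```python
def byte_ranges(total_size, chunk_size):
    """Return HTTP Range header values covering [0, total_size) in order."""
    ranges = []
    for start in range(0, total_size, chunk_size):
        # Range end offsets are inclusive, hence the -1.
        end = min(start + chunk_size, total_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


MIB = 1024 * 1024
headers = byte_ranges(25 * MIB, 10 * MIB)
```

For a 25 MiB object and 10 MiB chunks this yields three ranges, the last one shorter than the others.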

Multipart Upload

For objects larger than 100 MB, multipart upload is recommended (required above 5 GB). It enables parallel upload, reduced retry scope on failures, and streaming uploads where total size is unknown.

sequenceDiagram
    participant C as Client
    participant S as S3

    C->>S: POST /bucket/key?uploads (Initiate)
    S->>C: UploadId: "abc123"

    par Upload parts in parallel (any order)
        C->>S: PUT /key?partNumber=1&uploadId=abc123
        S->>C: ETag: "e1"
        C->>S: PUT /key?partNumber=2&uploadId=abc123
        S->>C: ETag: "e2"
        C->>S: PUT /key?partNumber=3&uploadId=abc123
        S->>C: ETag: "e3"
    end

    C->>S: POST /key?uploadId=abc123 (Complete, with ETag list)
    Note over S: Assembles parts server-side
    S->>C: 200 OK — object now readable

    Note over C,S: On failure: DELETE /key?uploadId=abc123 to abort

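Part sizing: Minimum 5 MB per part (except the last), maximum 5 GB per part, maximum 10,000 parts per upload. With 5 MB parts the ceiling is 10,000 × 5 MB = 50 GB, so a 50 GB file sits exactly at the limit with no headroom. A 100 GB file with 10 MB parts is again exactly at the 10,000-part limit, so pick something larger (for example 16 MB or more) to leave margin.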
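
The part-size arithmetic can be sketched as a small helper. The function name and the 8 MiB preferred size are assumptions chosen for illustration; the 5 MiB minimum and 10,000-part cap are the limits stated above.

```python
import math

MIN_PART_SIZE = 5 * 1024 ** 2   # 5 MiB minimum for every part except the last
MAX_PARTS = 10_000              # maximum number of parts per upload


def choose_part_size(object_size, preferred=8 * 1024 ** 2):
    """Pick a part size that keeps the upload within the part-count limit.

    Returns (part_size_bytes, part_count).
    """
    smallest_allowed = math.ceil(object_size / MAX_PARTS)
    part_size = max(preferred, MIN_PART_SIZE, smallest_allowed)
    part_count = max(1, math.ceil(object_size / part_size))
    return part_size, part_count


one_gib = choose_part_size(1 * 1024 ** 3)
hundred_gib = choose_part_size(100 * 1024 ** 3)
```

For small files the preferred size wins; for very large files the part size grows just enough to stay at or below 10,000 parts.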

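Resume on failure: Each part is acknowledged independently with its own ETag (typically the MD5 of the part bytes when SSE-KMS or SSE-C encryption is not in use). If part 7 fails, only part 7 is re-uploaded. The other parts remain staged on S3 until the upload is completed or aborted.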
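
A minimal sketch of per-part retry, assuming a caller-supplied `send_part(part_number, data)` function that returns the part's ETag and raises `ConnectionError` on a network failure. Both `upload_parts` and `send_part` are hypothetical names; the fake sender below only simulates one failed attempt.

```python
def upload_parts(parts, send_part, max_attempts=3):
    """Upload each part, retrying only the part that failed.

    parts: dict mapping part number -> bytes.
    Returns the part list needed to complete the upload.
    """
    completed = []
    for number in sorted(parts):
        for attempt in range(1, max_attempts + 1):
            try:
                etag = send_part(number, parts[number])
                break
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # give up; the caller may abort the upload
        completed.append({"PartNumber": number, "ETag": etag})
    return completed


attempts = []


def flaky_send(number, data):
    """Fake sender: part 2 fails once, then succeeds."""
    attempts.append(number)
    if number == 2 and attempts.count(2) == 1:
        raise ConnectionError("simulated network glitch")
    return f"etag-{number}"


manifest = upload_parts({1: b"a", 2: b"b", 3: b"c"}, flaky_send)
```

Parts 1 and 3 are sent once each; only part 2 is sent twice.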

Upload directly from client (presigned URL): Generate a presigned PUT URL server-side; the client uploads directly to S3 without routing bytes through your application servers.

import boto3

s3 = boto3.client("s3")

# Server generates a presigned PUT URL (valid for 15 minutes).
# ContentType is part of what gets signed, so the client must send the same header.
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-bucket", "Key": "uploads/photo.jpg",
            "ContentType": "image/jpeg"},
    ExpiresIn=900,
)
# → https://my-bucket.s3.amazonaws.com/uploads/photo.jpg?X-Amz-Signature=...

# Client uploads directly:
PUT https://my-bucket.s3.amazonaws.com/uploads/photo.jpg?X-Amz-Signature=...
Content-Type: image/jpeg
[binary body]

This offloads upload bandwidth from your servers entirely. A 4K video upload goes directly from the user’s browser to S3.
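
On the client side, the upload is an ordinary HTTP PUT. The sketch below builds that request with the standard library without sending it. The helper name `build_presigned_put` and the fake body bytes are illustrative, and the URL is the elided placeholder from the example above.

```python
import urllib.request


def build_presigned_put(url, body, content_type):
    """Build (but do not send) the PUT request for a presigned URL."""
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        # Must match the ContentType used when the URL was signed.
        headers={"Content-Type": content_type},
    )


request = build_presigned_put(
    "https://my-bucket.s3.amazonaws.com/uploads/photo.jpg?X-Amz-Signature=...",
    b"\xff\xd8 placeholder image bytes",
    "image/jpeg",
)
# To actually upload: urllib.request.urlopen(request)
```

No AWS credentials are needed on the client; the signature in the query string carries the authorization.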

Consistency Model

S3 provides strong read-after-write consistency for all operations (since December 2020):

A successful PUT is immediately visible to subsequent GET and LIST requests
A successful DELETE is immediately not visible
LIST after PUT reflects the new object

This replaced the previous eventual consistency model for overwrites and deletes. There is no longer a window where a stale object or missing key could be returned after a successful write.

One exception: cross-region replication lag. If you configure cross-region replication (CRR), the replicated bucket in the secondary region is eventually consistent with the source — there is a replication lag (typically seconds to minutes).

Durability and Replication

S3 Standard is designed for 99.999999999% (11 nines) durability. AWS illustrates this as follows: if you store 10,000,000 objects, you can expect to lose a single object about once every 10,000 years on average. This is achieved through erasure coding, not raw replication.

Erasure coding (Reed-Solomon): An object is split into data chunks and parity chunks. S3 Standard spreads this redundancy across at least three Availability Zones. Even if an entire AZ fails (losing several nodes), the object can be fully reconstructed from the remaining data and parity chunks. The exact chunk counts and placement are not published, so the layout below is illustrative.

Object → split into chunks
         [D1] [D2] [D3] [D4]   ← data chunks (AZ-a, AZ-b, AZ-c, AZ-d)
         [P1] [P2]             ← parity chunks (different AZs)

AZ failure → lose D2
         [D1] [  ] [D3] [D4] [P1] [P2]
→ reconstruct D2 from remaining chunks — object fully recoverable
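
To show the reconstruction idea in runnable form, here is a deliberately simplified sketch that uses a single XOR parity chunk. This is not Reed-Solomon and not S3's actual scheme: XOR parity can only recover one lost chunk, whereas Reed-Solomon with multiple parity chunks tolerates several. Function names are hypothetical and all chunks are assumed to be the same length.

```python
def xor_parity(chunks):
    """XOR equal-length chunks together into one parity chunk."""
    parity = bytes(len(chunks[0]))
    for chunk in chunks:
        parity = bytes(a ^ b for a, b in zip(parity, chunk))
    return parity


def recover_missing(chunks, parity):
    """Rebuild the single missing chunk (marked None) from survivors + parity."""
    survivors = [c for c in chunks if c is not None]
    return xor_parity(survivors + [parity])


data = [b"D1D1", b"D2D2", b"D3D3", b"D4D4"]
parity = xor_parity(data)

# Simulate losing the node that held the second chunk.
damaged = [data[0], None, data[2], data[3]]
rebuilt = recover_missing(damaged, parity)
```

Because XOR is its own inverse, combining the surviving chunks with the parity cancels everything except the lost chunk.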

Availability: S3 Standard is designed for 99.99% availability (about 52 minutes of downtime per year); the SLA itself commits to 99.9%. This is different from durability: availability is “can I access it right now?” while durability is “will the bytes still exist?”
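
The downtime figure follows directly from the availability percentage. A quick sketch of the arithmetic, assuming a 365-day year (the helper name is hypothetical):

```python
def downtime_minutes_per_year(availability_percent):
    """Convert an availability percentage into allowed downtime per year."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return (1 - availability_percent / 100) * minutes_per_year


four_nines = downtime_minutes_per_year(99.99)
three_nines = downtime_minutes_per_year(99.9)
```

Each additional nine cuts the allowed downtime by a factor of ten: roughly 52.6 minutes per year at 99.99% versus roughly 8.8 hours at 99.9%.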

Storage Classes

Objects are not all equally hot. S3 offers tiered storage classes with different cost and retrieval tradeoffs.

Storage Class	Use case	Retrieval	Storage cost	Retrieval cost
S3 Standard	Actively accessed data	Milliseconds	Highest	None
S3 Intelligent-Tiering	Unknown or changing access pattern	Milliseconds	Variable (auto-tier)	None
S3 Standard-IA	Infrequently accessed, rapid retrieval	Milliseconds	Lower	Per-GB retrieval fee
S3 One Zone-IA	IA but single AZ only — lower durability	Milliseconds	Lower	Per-GB retrieval fee
S3 Glacier Instant Retrieval	Archive with instant retrieval	Milliseconds	Low	Higher
S3 Glacier Flexible Retrieval	Archive, retrieval in minutes to hours	Minutes to 12 hours	Very low	Retrieval fee
S3 Glacier Deep Archive	Long-term archive, rarely accessed	12–48 hours	Lowest	Retrieval fee

Lifecycle policies automate transitions:

{
  "Rules": [{
    "ID": "logs-tiering-delete-after-7-years",
    "Status": "Enabled",
    "Filter": { "Prefix": "logs/" },
    "Transitions": [
      { "Days": 30,  "StorageClass": "STANDARD_IA" },
      { "Days": 90,  "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "Expiration": { "Days": 2555 }
  }]
}
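
To see what this rule does to a single object over time, the sketch below maps an object's age in days to the storage class the policy would place it in. It is a local model of the thresholds in the rule above, not an AWS API, and the function name is hypothetical.

```python
# (minimum age in days, storage class), mirroring the rule above.
TRANSITIONS = [
    (30, "STANDARD_IA"),
    (90, "GLACIER"),
    (365, "DEEP_ARCHIVE"),
]
EXPIRATION_DAYS = 2555  # roughly 7 years


def storage_class_for_age(age_days):
    """Return the expected storage class, or None once the object has expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    current = "STANDARD"
    for min_age, storage_class in TRANSITIONS:
        if age_days >= min_age:
            current = storage_class
    return current


timeline = [storage_class_for_age(d) for d in (10, 45, 100, 400, 3000)]
```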

Event-Driven Patterns

S3 can trigger downstream systems on object creation, deletion, or restore completion.

S3 PUT → EventBridge / SNS → Lambda (thumbnail generation)
                           → SQS   (async processing queue)
                           → SQS   (data pipeline trigger)

Common pattern — async media processing:

sequenceDiagram
    participant C as Client (Browser)
    participant App as App Server
    participant S as S3
    participant Q as SQS
    participant W as Worker (Lambda)

    C->>App: Request upload URL
    App->>S: Generate presigned PUT URL (expires 15min)
    App->>C: Presigned URL
    C->>S: PUT video.mp4 (direct upload, bypasses App)
    S->>Q: S3 Event Notification (object created)
    Q->>W: Deliver message
    W->>S: GET video.mp4 (fetch original)
    Note over W: Transcode to HLS segments
    W->>S: PUT seg-001.ts, seg-002.ts, manifest.m3u8

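S3 Select: Execute SQL-like queries directly on S3 objects (CSV, JSON, Parquet) without downloading the full file. Useful for filtering large log files or sampling Parquet columns. Note that AWS stopped offering S3 Select to new customers in 2024, so check availability before designing around it; Athena covers the same use case.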

SELECT * FROM S3Object WHERE status = 'error' LIMIT 1000

Object Storage vs Block Storage vs File Storage

	Object Storage (S3)	Block Storage (EBS)	File Storage (EFS/NFS)
Access	HTTP API (GET/PUT)	Raw block device (mounted as disk)	POSIX filesystem (mount, read, write, seek)
Mutability	Immutable (replace whole object)	In-place read/write at any offset	In-place read/write, append
Latency	Tens to hundreds of ms	Sub-ms (NVMe SSD)	Low ms (network-attached)
Throughput	Very high (parallel objects)	Limited per volume	Limited per mount
Scale	Unlimited (exabytes)	Up to 64 TB per volume	Petabytes
Cost	Very low (~$0.023/GB-month)	Higher (~$0.10/GB-month SSD)	Higher (~$0.30/GB-month)
Best for	Media, backups, data lakes, static assets	Database volumes, OS disks	Shared config, CMS, legacy apps needing POSIX

Use Cases and Anti-Patterns

Good fits:

User-uploaded media — photos, videos, documents; presigned URLs for direct uploads
Static asset delivery — JS bundles, images; serve via CloudFront CDN
Data lake — raw and processed Parquet/ORC files; queried by Athena/Spark
Backups and snapshots — database dumps, VM snapshots with Glacier lifecycle
ML training data — petabyte-scale datasets; S3 → training cluster via high-throughput parallel reads
Software distribution — firmware, installers; versioned and globally available

Poor fits:

Database storage — S3 latency (ms) vs block storage (μs); databases need in-place writes
Frequently mutated files — each update replaces the entire object; overhead for small changes
POSIX-dependent applications — file locks, atomic rename, directory traversal — S3 doesn’t support these
Small objects at massive volume — millions of tiny objects (< 1 KB) incur high per-request API cost relative to storage cost; consider batching into archives or using a database instead

The key design principle: size objects for your read unit. If you always read a day’s worth of logs together, store them as one object per day — not one object per log line. The per-request API cost and latency make many tiny objects expensive; fewer large objects with byte-range fetches is almost always more efficient.
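
As a sketch of "size objects for your read unit", the helper below groups individual log lines into one object body per day. The `logs/<day>.log` key layout and the function name are assumptions for illustration; records are assumed to carry ISO 8601 timestamps.

```python
from collections import defaultdict


def batch_by_day(records):
    """Group (iso_timestamp, line) records into one object body per day.

    Returns {object_key: body_bytes}.
    """
    groups = defaultdict(list)
    for timestamp, line in records:
        day = timestamp[:10]  # "YYYY-MM-DD" prefix of an ISO 8601 timestamp
        groups[day].append(line)
    return {
        f"logs/{day}.log": ("\n".join(lines) + "\n").encode("utf-8")
        for day, lines in groups.items()
    }


records = [
    ("2024-01-01T10:00:00Z", "a"),
    ("2024-01-01T11:00:00Z", "b"),
    ("2024-01-02T00:00:01Z", "c"),
]
daily_objects = batch_by_day(records)
```

Three log lines become two objects, so a reader that wants one day's logs issues a single GET instead of one request per line.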

Interview tip: When the design involves user uploads, media, or a data lake, I’d say: “I’d put it in S3 with presigned PUT URLs so client uploads bypass our app servers entirely: direct browser-to-S3 with no bandwidth through us. For large files I’d use multipart upload (5 MB minimum part size, 10K part limit) so a failed part doesn’t restart the whole transfer. I’d push back on using S3 as a database, since its latency is tens of ms versus microseconds for block storage and there’s no in-place update, but I’d use it as the durable backing for data lakes (Parquet with byte-range fetches), backups, and ML training sets. For cost I’d use lifecycle policies to transition cold data to Glacier, and I’d call out that strong read-after-write consistency has been the default since December 2020, so we don’t need to design around eventual consistency anymore, except for cross-region replication.”

Test Your Understanding

A client uploads a 5 GB file to S3 using a single PUT request. The upload fails at 4.8 GB due to a network glitch. The entire upload must restart. How do you prevent this?

Multipart upload. Split the file into parts (minimum 5 MB, up to 10,000 parts). Upload each part independently — if one fails, retry only that part. After all parts succeed, call CompleteMultipartUpload to assemble them. Parts can also upload in parallel for faster throughput.

For browser-to-S3 uploads, generate a pre-signed PUT URL so the client uploads directly to S3, bypassing your application servers entirely.

Your data lake stores 10 billion small objects (100 bytes each) in S3. Listing and querying is extremely slow and expensive. What’s wrong?

Per-object overhead dominates. Each S3 API call has per-request cost and ~50-100ms latency. 10 billion tiny objects means 10M LIST calls just to enumerate (1000 objects/page), and sequential GET is impossibly slow.

Fix: Batch small records into larger objects — one per day/hour in Parquet format. Use byte-range fetches to read specific sections. Query engines (Athena, Spark) read Parquet column statistics to skip irrelevant row groups without scanning.

S3 provides strong read-after-write consistency (since Dec 2020). Does this mean you can use S3 as a database?

No. Strong consistency means a GET after PUT returns the latest data. But S3 still lacks: atomic read-modify-write, in-place updates (must rewrite entire objects), low latency (50-200ms vs microseconds), transactions, and query indexes. S3 is durable, cheap, infinitely scalable storage — ideal for data lakes and backups, not transactional workloads.

You use S3 lifecycle policies to move objects to Glacier after 90 days. A user requests an archived object. What happens?

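Glacier Flexible Retrieval is not instant-access storage. Retrieval takes minutes to hours: Expedited (1-5 min), Standard (3-5 hours), Bulk (5-12 hours). A plain GET on the archived object does not block; it fails with an InvalidObjectState error until a restore has completed.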

Your application must handle this asynchronously: accept the request, initiate a restore, notify the user when available. Alternatives: S3 Intelligent-Tiering (automatic tier movement, no retrieval delays) or Glacier Instant Retrieval (millisecond access for archival data accessed ~quarterly).
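
The asynchronous flow boils down to a small decision: is the object readable now, is a restore already running, or does one need to be started? The sketch below makes that decision from two values an application would typically read from a HEAD request: the storage class and the `Restore` header. The function name and the returned state labels are hypothetical.

```python
ARCHIVE_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def restore_state(storage_class, restore_header):
    """Classify an object as 'readable', 'restoring' or 'needs_restore'.

    restore_header is the value of the Restore header, or None if absent,
    e.g. 'ongoing-request="true"' while a restore is in progress.
    """
    if storage_class not in ARCHIVE_CLASSES:
        return "readable"
    if restore_header is None:
        return "needs_restore"   # initiate a restore, then notify the user later
    if 'ongoing-request="true"' in restore_header:
        return "restoring"       # already requested; keep waiting
    return "readable"            # a temporary restored copy is available


states = [
    restore_state("STANDARD", None),
    restore_state("GLACIER", None),
    restore_state("GLACIER", 'ongoing-request="true"'),
    restore_state("GLACIER", 'ongoing-request="false", expiry-date="Fri, 21 Dec 2012 00:00:00 GMT"'),
]
```

Only the `needs_restore` branch should trigger a new restore request; the other two either serve the object or keep the user waiting on the existing one.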
