Object Storage (S3)

Object storage is the backbone of modern data infrastructure — not just for user uploads, but for data lakes, ML training sets, backups, and static asset delivery at any scale. S3 is the canonical implementation; understanding its design explains why it works differently from block or file storage.

Core Model: Flat Namespace

Object storage has no real directory hierarchy. A bucket is a flat keyspace. The apparent “folder” structure (images/2024/jan/photo.jpg) is just a key string — the / delimiter is cosmetic. There are no directories to create or delete.

Bucket: my-app-assets
  Key: images/2024/jan/photo.jpg    ← one atomic object
  Key: images/2024/feb/photo.jpg    ← another atomic object
  Key: videos/intro.mp4

# "images/2024/" is not a directory — it is a key prefix.
# Listing objects with prefix "images/2024/jan/" scans the keyspace.
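To make the prefix idea concrete, here is a minimal sketch that imitates how a listing with a prefix and delimiter groups a flat set of keys. It is a pure-Python simulation over an in-memory list, not a call to the S3 API; `list_keys` is a hypothetical helper introduced only for illustration.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Imitate a prefix/delimiter listing over a flat, sorted keyspace."""
    contents, common_prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Roll everything up to the next delimiter into one "folder" entry.
            rolled = prefix + rest.split(delimiter, 1)[0] + delimiter
            if rolled not in common_prefixes:
                common_prefixes.append(rolled)
        else:
            contents.append(key)
    return contents, common_prefixes


bucket = [
    "images/2024/jan/photo.jpg",
    "images/2024/feb/photo.jpg",
    "videos/intro.mp4",
]
objects, folders = list_keys(bucket, prefix="images/2024/")
```

The "folders" that come back are derived from the key strings at listing time; nothing named `images/2024/jan/` is stored anywhere.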

Three components per object:

  • Key — the unique identifier within a bucket (max 1,024 bytes UTF-8)
  • Data — the object body (bytes, up to 5 TB per object)
  • Metadata — system metadata (Content-Type, ETag, Last-Modified) + up to 2 KB of user-defined key-value pairs; stored separately from data, returned on HEAD requests without fetching the body

Objects are immutable: you cannot append to or partially update one. Every write replaces the entire object (with versioning enabled, the write is stored as a new version and the previous one is retained). This immutability is what makes consistency and replication tractable at global scale.

Internal Architecture

                        ┌─────────────────────────────┐
Client ──── HTTPS ─────►│   S3 Frontend (API layer)   │
                        └────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │   Metadata Service              │
                    │   (bucket → object key → data   │
                    │    node location mapping)       │
                    └────────────────┬────────────────┘
                                     │ locate data nodes
           ┌─────────────────────────▼───────────────────────────┐
           │       Data Layer (distributed storage nodes)        │
           │   Node A     Node B     Node C    Node D    Node E  │
           │  [chunk 1]  [chunk 2]  [chunk 3] [parity1] [parity2]│
           └─────────────────────────────────────────────────────┘

The metadata service (a distributed key-value store internally) maps (bucket, key) → object location on data nodes. The data layer stores the actual bytes, split into chunks and protected with erasure coding. A read requires one metadata lookup followed by data fetches from the appropriate nodes.
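The read path described above can be sketched with a toy in-memory model. This is an illustration of the two-step lookup only, under heavy simplifying assumptions: plain dictionaries stand in for the metadata service and data nodes, chunks are placed round-robin, and parity is omitted. All names here are hypothetical and do not reflect S3's real internals.

```python
# Toy model: metadata maps (bucket, key) to chunk locations; nodes hold bytes.
data_nodes = {"A": {}, "B": {}, "C": {}}
metadata = {}


def put_object(bucket, key, body, chunk_size=4):
    """Split body into chunks, place them round-robin, record locations."""
    node_names = sorted(data_nodes)
    locations = []
    for offset in range(0, len(body), chunk_size):
        index = offset // chunk_size
        node = node_names[index % len(node_names)]
        chunk_id = f"{bucket}/{key}#{index}"
        data_nodes[node][chunk_id] = body[offset:offset + chunk_size]
        locations.append((node, chunk_id))
    metadata[(bucket, key)] = locations  # replaces any previous mapping


def get_object(bucket, key):
    """One metadata lookup, then one fetch per chunk."""
    locations = metadata[(bucket, key)]
    return b"".join(data_nodes[node][chunk_id] for node, chunk_id in locations)


put_object("my-app-assets", "videos/intro.mp4", b"0123456789")
body = get_object("my-app-assets", "videos/intro.mp4")
```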

Object Operations and Semantics

| Operation | HTTP | Behavior |
|---|---|---|
| PUT Object | PUT /bucket/key | Upload object (up to 5 GB in one request; use multipart above 100 MB) |
| GET Object | GET /bucket/key | Download full object; use Range header for partial fetch |
| HEAD Object | HEAD /bucket/key | Returns metadata only, with no body transfer; use for existence checks |
| DELETE Object | DELETE /bucket/key | With versioning: adds a delete marker. Without: permanent deletion |
| COPY Object | PUT with x-amz-copy-source | Server-side copy: no data leaves AWS or passes through the client; up to 5 GB per request (multipart copy above that) |
| LIST Objects | GET /bucket?prefix=&delimiter= | Paginated key listing (up to 1,000 keys per page); expensive on large buckets |

Byte-range fetches: Download a specific byte range of a large object without fetching the whole thing. Enables parallel download of large files — split into N ranges, fetch concurrently, reassemble.

GET /bucket/large-file.parquet
Range: bytes=0-10485759      # first 10 MB

GET /bucket/large-file.parquet
Range: bytes=10485760-20971519  # second 10 MB

This is how Spark and Athena read Parquet files: after reading the footer metadata, they fetch only the byte ranges of the column chunks a query needs, not the full file.
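Splitting an object into ranges for parallel download is mostly arithmetic. The sketch below computes the Range header values for a given object size; `byte_ranges` is a hypothetical helper, and issuing the actual requests concurrently is left out.

```python
def byte_ranges(object_size, range_size):
    """Return Range header values covering [0, object_size) in order."""
    if object_size <= 0 or range_size <= 0:
        raise ValueError("sizes must be positive")
    ranges = []
    for start in range(0, object_size, range_size):
        end = min(start + range_size, object_size) - 1  # Range ends are inclusive
        ranges.append(f"bytes={start}-{end}")
    return ranges


ten_mb = 10 * 1024 * 1024
headers = byte_ranges(25 * 1024 * 1024, ten_mb)
```

Because range ends are inclusive, the last range is clamped to `object_size - 1` rather than running past the end of the object.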

Multipart Upload

For objects larger than 100 MB, multipart upload is recommended (required above 5 GB). It enables parallel upload, reduced retry scope on failures, and streaming uploads where total size is unknown.

sequenceDiagram
    participant C as Client
    participant S as S3

    C->>S: POST /bucket/key?uploads (Initiate)
    S->>C: UploadId: "abc123"

    par Upload parts in parallel (any order)
        C->>S: PUT /key?partNumber=1&uploadId=abc123
        S->>C: ETag: "e1"
        C->>S: PUT /key?partNumber=2&uploadId=abc123
        S->>C: ETag: "e2"
        C->>S: PUT /key?partNumber=3&uploadId=abc123
        S->>C: ETag: "e3"
    end

    C->>S: POST /key?uploadId=abc123 (Complete, with ETag list)
    Note over S: Assembles parts server-side
    S->>C: 200 OK — object now readable

    Note over C,S: On failure: DELETE /key?uploadId=abc123 to abort

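Part sizing: Minimum 5 MiB per part (except the last), maximum 5 GiB per part, and at most 10,000 parts per upload. With 5 MiB parts the ceiling is 10,000 × 5 MiB ≈ 48.8 GiB, so a 50 GiB file already needs slightly larger parts. For a 100 GiB file, 10 MiB parts fall just short of the limit (10,000 × 10 MiB ≈ 97.7 GiB); pick a size with headroom, such as 16 MiB.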
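The part-size arithmetic can be sketched as a small helper. This assumes binary units (1 MiB = 1,048,576 bytes) and the 5 MiB minimum and 10,000-part maximum stated above; `choose_part_size`, `part_count`, and the 8 MiB default are hypothetical choices for illustration.

```python
MIB = 1024 * 1024
MIN_PART = 5 * MIB
MAX_PARTS = 10_000


def choose_part_size(object_size, preferred=8 * MIB):
    """Pick a whole-MiB part size that keeps the upload within 10,000 parts."""
    needed = (object_size + MAX_PARTS - 1) // MAX_PARTS  # integer ceiling
    part = max(MIN_PART, preferred, needed)
    return ((part + MIB - 1) // MIB) * MIB  # round up to a whole MiB


def part_count(object_size, part_size):
    """Number of parts, counting a shorter final part."""
    return (object_size + part_size - 1) // part_size


size_100_gib = 100 * 1024 * MIB
part_size = choose_part_size(size_100_gib)
parts = part_count(size_100_gib, part_size)
```

Rounding up to a whole MiB is only a convenience; any part size between the minimum and the per-part maximum works as long as the part count stays within the limit.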

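Resume on failure: Each part is independently checksummed (for parts not encrypted with SSE-KMS or SSE-C, the ETag is the MD5 of the part bytes). If part 7 fails, only part 7 is re-uploaded. The other parts remain staged on S3, and are billed as storage until the upload is completed or aborted.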
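A per-part retry loop might look like the sketch below. It is transport-agnostic: `upload_part` is a hypothetical callable you would back with a real client, and the ETag comparison assumes the ETag equals the MD5 of the part bytes, which does not hold for every encryption mode. Parts are uploaded sequentially here to keep the example short.

```python
import hashlib


def upload_parts(parts, upload_part, max_attempts=3):
    """Upload each part, retrying only the part that failed.

    parts: dict of part_number -> bytes
    upload_part: callable(part_number, data) -> ETag string (hypothetical)
    """
    etags = {}
    for number, data in sorted(parts.items()):
        expected = hashlib.md5(data).hexdigest()
        for _ in range(max_attempts):
            try:
                etag = upload_part(number, data).strip('"')
            except ConnectionError:
                continue  # retry this part only; earlier parts stay staged
            if etag == expected:
                etags[number] = etag
                break
        else:
            # Runs only when no attempt succeeded.
            raise RuntimeError(f"part {number} failed after {max_attempts} attempts")
    return etags
```

The returned mapping of part numbers to ETags is what the Complete call needs as its ETag list.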

Upload directly from client (presigned URL): Generate a presigned PUT URL server-side; the client uploads directly to S3 without routing bytes through your application servers.

# Server generates presigned URL (valid for 15 minutes)
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-bucket", "Key": "uploads/photo.jpg",
            "ContentType": "image/jpeg"},
    ExpiresIn=900,  # seconds
)
# → https://my-bucket.s3.amazonaws.com/uploads/photo.jpg?X-Amz-Signature=...

# Client uploads directly:
PUT https://my-bucket.s3.amazonaws.com/uploads/photo.jpg?X-Amz-Signature=...
Content-Type: image/jpeg
[binary body]

This offloads upload bandwidth from your servers entirely. A 4K video upload goes directly from the user’s browser to S3.
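On the client side, the upload is an ordinary HTTP PUT. The sketch below builds that request with Python's standard library without sending it; `build_presigned_put` is a hypothetical helper, the URL is a placeholder, and a browser client would use `fetch` or a form upload instead.

```python
import urllib.request


def build_presigned_put(url, body, content_type):
    """Build (but do not send) the PUT request for a presigned URL."""
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        # Should match the ContentType used when the URL was signed;
        # a mismatch typically fails signature validation.
        headers={"Content-Type": content_type},
    )


# Placeholder URL; a real one comes from generate_presigned_url on the server.
request = build_presigned_put(
    "https://my-bucket.s3.amazonaws.com/uploads/photo.jpg?X-Amz-Signature=...",
    b"\xff\xd8\xff",  # stand-in bytes, not a real image
    "image/jpeg",
)
# To send it: urllib.request.urlopen(request)
```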

Consistency Model

S3 provides strong read-after-write consistency for all operations (since December 2020):

  • A successful PUT is immediately visible to subsequent GET and LIST requests
  • After a successful DELETE, subsequent GET and LIST requests immediately stop returning the object
  • LIST after PUT reflects the new object

This replaced the previous eventual consistency model for overwrites and deletes. There is no longer a window where a stale object or missing key could be returned after a successful write.

One exception: cross-region replication lag. If you configure cross-region replication (CRR), the replicated bucket in the secondary region is eventually consistent with the source — there is a replication lag (typically seconds to minutes).

Durability and Replication

S3 Standard is designed for 99.999999999% (11 nines) durability. AWS's own illustration: store 10,000,000 objects and you can expect to lose one, on average, every 10,000 years. This is achieved through erasure coding, not raw replication.

Erasure coding (Reed-Solomon): An object is split into data chunks and parity chunks. S3 Standard uses an erasure coding scheme across multiple Availability Zones. Even if an entire AZ fails (losing several nodes), the object can be fully reconstructed from the remaining data and parity chunks.

Object → split into chunks
         [D1] [D2] [D3] [D4]   ← data chunks (AZ-a, AZ-b, AZ-c, AZ-d)
         [P1] [P2]             ← parity chunks (different AZs)

AZ failure → lose D2
         [D1] [  ] [D3] [D4] [P1] [P2]
→ reconstruct D2 from remaining chunks — object fully recoverable
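The reconstruction step can be demonstrated with a deliberately simplified scheme. The sketch below uses a single XOR parity chunk, which can repair exactly one lost chunk; it is not Reed-Solomon (which tolerates multiple losses) and is not S3's actual coding scheme. All function names are hypothetical.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(chunks):
    """Single XOR parity chunk over equal-length data chunks."""
    parity = bytes(len(chunks[0]))
    for chunk in chunks:
        parity = xor_bytes(parity, chunk)
    return parity


def reconstruct(chunks, parity):
    """Rebuild the one missing chunk (marked None) from survivors plus parity."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one missing chunk")
    rebuilt = parity
    for chunk in chunks:
        if chunk is not None:
            rebuilt = xor_bytes(rebuilt, chunk)
    repaired = list(chunks)
    repaired[missing[0]] = rebuilt
    return repaired


data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]  # D1..D4
parity = make_parity(data)
damaged = [data[0], None, data[2], data[3]]  # lose D2
restored = reconstruct(damaged, parity)
```

XOR-ing the parity with every surviving chunk cancels them out and leaves the missing chunk, which is the same idea Reed-Solomon generalizes to several parity chunks.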

Availability: S3 Standard is designed for 99.99% availability (about 52 minutes of downtime/year); the SLA that triggers service credits is lower, at 99.9%. This is different from durability: availability is “can I access it right now?” while durability is “will the bytes still exist?”
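Both percentages translate into concrete numbers with simple arithmetic. The sketch below treats them as annual rates, which is a simplification; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


def expected_losses_per_year(object_count, durability):
    """Expected number of objects lost per year at a given durability."""
    return object_count * (1 - durability)


downtime = downtime_minutes_per_year(0.9999)                  # about 52.6 minutes
losses = expected_losses_per_year(10_000_000, 0.99999999999)  # about 0.0001 objects
```

An expected loss of roughly 0.0001 objects per year across ten million objects is the same as losing one object about every 10,000 years.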

Storage Classes

Objects are not all equally hot. S3 offers tiered storage classes with different cost and retrieval tradeoffs.

| Storage Class | Use case | Retrieval | Storage cost | Retrieval cost |
|---|---|---|---|---|
| S3 Standard | Actively accessed data | Milliseconds | Highest | None |
| S3 Intelligent-Tiering | Unknown or changing access pattern | Milliseconds | Variable (auto-tier) | None (a per-object monitoring fee applies instead) |
| S3 Standard-IA | Infrequently accessed, rapid retrieval | Milliseconds | Lower | Per-GB retrieval fee |
| S3 One Zone-IA | IA but single AZ only; data is lost if that AZ is destroyed | Milliseconds | Lower | Per-GB retrieval fee |
| S3 Glacier Instant Retrieval | Archive with instant retrieval | Milliseconds | Low | Higher per-GB retrieval fee |
| S3 Glacier Flexible Retrieval | Archive, retrieval in minutes to hours | Minutes (expedited) to 12 hours (bulk) | Very low | Retrieval fee |
| S3 Glacier Deep Archive | Long-term archive, rarely accessed | 12–48 hours | Lowest | Retrieval fee |

Lifecycle policies automate transitions. The rule below moves objects under logs/ to cheaper classes as they age and deletes them after 2,555 days (about 7 years):

{
  "Rules": [{
    "Status": "Enabled",
    "Filter": { "Prefix": "logs/" },
    "Transitions": [
      { "Days": 30,  "StorageClass": "STANDARD_IA" },
      { "Days": 90,  "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "Expiration": { "Days": 2555 }
  }]
}
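To reason about what such a rule does to a given object, it can help to model it locally. The sketch below mirrors the day thresholds from the policy above and reports which class an object would be in at a given age; it is an approximation, since S3 applies lifecycle rules asynchronously and real transitions can lag the threshold. `storage_class_at` is a hypothetical helper.

```python
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]
EXPIRATION_DAYS = 2555


def storage_class_at(age_days):
    """Storage class an object under logs/ would be in at a given age."""
    if age_days >= EXPIRATION_DAYS:
        return None  # expired (deleted)
    current = "STANDARD"
    for days, storage_class in TRANSITIONS:
        if age_days >= days:
            current = storage_class
    return current


timeline = {age: storage_class_at(age) for age in (0, 30, 100, 400, 2555)}
```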

Event-Driven Patterns

S3 can trigger downstream systems on object creation, deletion, or restore completion.

S3 PUT → EventBridge / SNS → Lambda (thumbnail generation)
                           → SQS   (async processing queue, data pipeline trigger)

Common pattern — async media processing:

sequenceDiagram
    participant C as Client (Browser)
    participant App as App Server
    participant S as S3
    participant Q as SQS
    participant W as Worker (Lambda)

    C->>App: Request upload URL
    App->>S: Generate presigned PUT URL (expires 15min)
    App->>C: Presigned URL
    C->>S: PUT video.mp4 (direct upload, bypasses App)
    S->>Q: S3 Event Notification (object created)
    Q->>W: Deliver message
    W->>S: GET video.mp4 (fetch original)
    Note over W: Transcode to HLS segments
    W->>S: PUT seg-001.ts, seg-002.ts, manifest.m3u8
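The worker's first job in this flow is to pull the bucket and key out of the notification. The sketch below parses an S3 event notification delivered as an SQS message body; the sample payload is trimmed to the fields used here, and `created_objects` is a hypothetical helper. It assumes S3 publishes directly to SQS; if the event is routed through SNS first, the payload is wrapped in an extra envelope.

```python
import json
from urllib.parse import unquote_plus


def created_objects(message_body):
    """Extract (bucket, key) pairs for ObjectCreated events from one message."""
    event = json.loads(message_body)
    results = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in notifications (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append((bucket, key))
    return results


sample = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "my-bucket"},
            "object": {"key": "uploads/my+video.mp4", "size": 1048576},
        },
    }]
})
pairs = created_objects(sample)
```

Messages without a `Records` list, such as the test event S3 sends when a notification is first configured, yield an empty result instead of raising.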

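S3 Select: Execute SQL-like queries directly on S3 objects (CSV, JSON, Parquet) without downloading the full file. Useful for filtering large log files or sampling Parquet columns. Note that AWS closed S3 Select to new customers in July 2024, so check availability for your account before relying on it; Athena covers similar use cases.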

SELECT * FROM S3Object WHERE status = 'error' LIMIT 1000

Object Storage vs Block Storage vs File Storage

| Attribute | Object Storage (S3) | Block Storage (EBS) | File Storage (EFS/NFS) |
|---|---|---|---|
| Access | HTTP API (GET/PUT) | Raw block device (mounted as disk) | POSIX filesystem (mount, read, write, seek) |
| Mutability | Immutable (replace whole object) | In-place read/write at any offset | In-place read/write, append |
| Latency | Tens to hundreds of ms | Sub-ms (NVMe SSD) | Low ms (network-attached) |
| Throughput | Very high (parallel objects) | Limited per volume | Limited per mount |
| Scale | Unlimited (exabytes) | Up to 64 TiB per volume | Petabytes |
| Cost | Very low (~$0.023/GB-month) | Higher (~$0.10/GB-month SSD) | Higher (~$0.30/GB-month) |
| Best for | Media, backups, data lakes, static assets | Database volumes, OS disks | Shared config, CMS, legacy apps needing POSIX |

Use Cases and Anti-Patterns

Good fits:

  • User-uploaded media — photos, videos, documents; presigned URLs for direct uploads
  • Static asset delivery — JS bundles, images; serve via CloudFront CDN
  • Data lake — raw and processed Parquet/ORC files; queried by Athena/Spark
  • Backups and snapshots — database dumps, VM snapshots with Glacier lifecycle
  • ML training data — petabyte-scale datasets; S3 → training cluster via high-throughput parallel reads
  • Software distribution — firmware, installers; versioned and globally available

Poor fits:

  • Database storage — S3 latency (ms) vs block storage (μs); databases need in-place writes
  • Frequently mutated files — each update replaces the entire object; overhead for small changes
  • POSIX-dependent applications — file locks, atomic rename, directory traversal — S3 doesn’t support these
  • Small objects at massive volume — millions of tiny objects (< 1 KB) incur high per-request API cost relative to storage cost; consider batching into archives or using a database instead

The key design principle: size objects for your read unit. If you always read a day’s worth of logs together, store them as one object per day, not one object per log line. Per-request API cost and latency make many tiny objects expensive; fewer, larger objects read with byte-range fetches are almost always more efficient.
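As a small illustration of sizing objects for the read unit, the sketch below batches individual log lines into one object body per day. The `logs/YYYY-MM-DD.log` key layout and the `batch_by_day` helper are assumptions made for this example; uploading the resulting bodies is left out.

```python
from collections import defaultdict


def batch_by_day(log_lines):
    """Group (iso_timestamp, message) pairs into one object body per day."""
    batches = defaultdict(list)
    for timestamp, message in log_lines:
        day = timestamp[:10]  # "YYYY-MM-DD" prefix of an ISO 8601 timestamp
        batches[f"logs/{day}.log"].append(f"{timestamp} {message}")
    return {key: "\n".join(lines).encode() for key, lines in batches.items()}


objects = batch_by_day([
    ("2024-01-01T00:00:01Z", "started"),
    ("2024-01-01T09:30:00Z", "error: timeout"),
    ("2024-01-02T00:00:05Z", "started"),
])
```

Three log lines become two objects here; at production volume the same grouping turns millions of tiny PUTs and GETs into a handful of larger ones.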