ML Platform Design
A company has 30 ML models in production: a fraud classifier, a recommender, a search ranker, a churn predictor, a dozen of risk scores, several LLM-based summarizers. Each was built by a different team. One uses Jupyter notebooks for training; another uses a custom Airflow DAG. One serves through a Flask app on EC2; another through SageMaker; another through a Kubernetes deployment of FastAPI. Each team rebuilt monitoring, deployment, model versioning, and feature pipelines from scratch. Now a regulator asks: “Show me which version of which model produced the prediction that denied this loan three months ago, and which training data it used.” Nobody can answer. An ML platform is the shared infrastructure that turns ad-hoc model development into a repeatable, governed, observable process — so that training, serving, monitoring, and rollout look the same regardless of which team or model.
What an ML Platform Provides
flowchart LR
subgraph "Without ML Platform"
Team1[Team A
Jupyter + Flask]
Team2[Team B
SageMaker]
Team3[Team C
K8s + custom serving]
Team1 -.-> Chaos[Inconsistent monitoring
no shared lineage
no governance
30× operational burden]
Team2 -.-> Chaos
Team3 -.-> Chaos
end
subgraph "With ML Platform"
Pipe[Training pipelines]
Reg[Model registry]
Serve[Serving infra]
Mon[Monitoring]
Pipe & Reg & Serve & Mon --> Out[Consistent: deploy,
rollback, audit, monitor
across all 30 models]
endA platform provides shared infrastructure for the lifecycle stages every model goes through:
| Stage | What the platform provides |
|---|---|
| Data → features | Feature store with shared definitions, point-in-time correctness |
| Training | Pipeline orchestration, distributed compute, experiment tracking, reproducibility |
| Registration | Versioned model registry with metadata, lineage, approval workflows |
| Deployment | Standard serving runtimes (online, batch, streaming) with canary, shadow, rollback |
| Monitoring | Drift detection, performance tracking, alerting — same dashboards for every model |
| Governance | Lineage graphs, approval gates, audit logs for compliance |
Training Pipeline
A reproducible, automated path from data to a trained model artifact.
flowchart LR
Source[(Data sources
warehouse / lake)] --> Ingest[Ingest
filter, sample]
Ingest --> FE[Feature engineering
via feature store]
FE --> Split[Train / val / test split]
Split --> Train[Distributed training
GPU cluster]
Train --> Eval[Evaluate
metrics, fairness, slices]
Eval -->|pass thresholds| Reg[Register in
model registry]
Eval -->|fail| Stop[Stop & alert]Orchestration
| Tool | Strengths | Best for |
|---|---|---|
| Airflow | Mature, huge ecosystem, simple DAG model | Generic data + ML pipelines; what most companies start with |
| Kubeflow Pipelines | Kubernetes-native, ML-aware (artifact tracking, caching), supports parallel trials | Kubernetes shops; large training jobs |
| Metaflow (Netflix) | Pythonic, opinionated, excellent UX for data scientists | Small/medium teams wanting low friction |
| Argo Workflows | Lightweight Kubernetes-native DAG engine | Teams already on K8s without Kubeflow’s complexity |
| Vertex AI Pipelines / SageMaker Pipelines | Managed; integrate with cloud ML services | Cloud-native teams |
Reproducibility Requirements
A trained model is reproducible only if all of these are versioned:
For a single training run, the registry records:
- Code version (git commit hash + branch)
- Data version (data lake snapshot or query as-of date)
- Feature definitions version (from feature store registry)
- Hyperparameters
- Random seed
- Container image (ML libraries, CUDA version, OS)
- Hardware (GPU type, count)
- Output: model artifact + evaluation metrics + run logsWithout all of these, “rerun this training a year from now and get the same model” is impossible — and impossible reproduction means impossible debugging.
Experiment Tracking
| Tool | Strengths |
|---|---|
| MLflow | Open source, language-agnostic, integrates with anything; tracks runs, metrics, artifacts |
| Weights & Biases | Polished UX, strong visualization, hosted SaaS |
| Neptune.ai | Similar to W&B; flexible self-hosting |
| Comet | Strong on real-time dashboards |
The pattern is consistent: log every experiment’s hyperparameters, metrics, and artifacts; compare runs visually; promote winners.
Model Registry
The single source of truth for every model artifact in the organization.
flowchart TB
subgraph "Model Registry"
M1[fraud-classifier
v3 → staging
v2 → production
v1 → archived]
M2[search-ranker
v8 → production
v7 → archived]
M3[churn-predictor
v5 → canary
v4 → production]
end
subgraph "For each model version"
Art[Artifact: weights, config
S3 / GCS]
Meta[Metadata: hyperparams,
metrics, training data hash]
Lin[Lineage: data sources,
feature versions, code commit]
App[Approvals: who reviewed,
signed off, when]
end
M1 -.-> Art & Meta & Lin & AppWhat a Registry Tracks
| Field | Why it matters |
|---|---|
| Model name + version | Stable identifier downstream services pin to |
| Stage | staging, canary, production, archived — gates deployment |
| Artifact location | S3/GCS path to weights + config |
| Metrics | Offline: AUC, RMSE, fairness slices. Online: ongoing performance |
| Lineage | Which dataset, which feature versions, which code commit |
| Hyperparameters | Full training config |
| Owner / approver | Who built it, who approved promotion |
| Container image | Reproducible inference environment |
| Schemas | Input feature schema + output schema (so consumers can validate) |
Promotion Workflow
Newly trained → staging
↓ (offline metrics pass + automated checks pass)
staging → canary (1–5% of traffic)
↓ (online metrics match offline; no guardrail regression)
canary → production (full traffic)
↓ (replaced by next version)
production → archived (kept for audit, not served)Every transition is logged with a reason and approver, generating an audit trail. For regulated domains (lending, healthcare, hiring) this audit trail is a hard requirement.
Tools
| Tool | Notes |
|---|---|
| MLflow Model Registry | Open source, paired with MLflow tracking; common starting point |
| Vertex AI Model Registry | Managed on GCP; tight Vertex Pipelines integration |
| SageMaker Model Registry | Managed on AWS; integrated with SageMaker pipelines and endpoints |
| In-house | Many large orgs build a custom registry on top of Postgres + S3 to fit their governance needs |
Model Serving
The runtime that turns a registered model into served predictions. Three serving modes cover almost every production model.
Online Serving (Real-Time Inference)
Per-request prediction behind a REST or gRPC endpoint. Used for recommendations, fraud scoring, search, ads — anything user-facing.
flowchart LR
Client --> LB[Load Balancer]
LB --> R1[Model Replica 1
GPU/CPU]
LB --> R2[Model Replica 2]
LB --> R3[Model Replica N]
R1 & R2 & R3 --> FS[(Feature Store
online)]
R1 & R2 & R3 --> Logs[Prediction logs
Kafka]| Concern | Approach |
|---|---|
| Latency target | p99 typically <100ms end-to-end; tighter for ads, looser for batch-like calls |
| Throughput | Horizontal scaling: more replicas; autoscale on QPS or queue depth |
| Hardware | CPU is fine for trees and small models; GPU for DNNs; specialized accelerators (TPU, Inferentia, Triton) for heavy DL |
| Batching | Dynamic micro-batching: collect requests for ~5–20ms, batch through one GPU forward pass — 10× higher throughput at slightly higher tail latency |
| Frameworks | TensorFlow Serving, TorchServe, NVIDIA Triton Inference Server, Seldon Core, KServe, BentoML, Ray Serve |
| Caching | Cache predictions for stable inputs; useful for content recos where the same user requests the same items repeatedly |
Batch Serving (Offline Scoring)
Score millions of records on a schedule. Used for: nightly user segment refreshes, lookalike audience generation, ML-derived columns in the warehouse.
# Conceptual: Spark batch scoring
df = spark.read.parquet("s3://data/users/")
df_with_features = df.join(feature_table, on="user_id")
predictions = model.transform(df_with_features)
predictions.write.parquet("s3://predictions/churn_score/dt=2025-05-02/")| Concern | Approach |
|---|---|
| Latency | Hours acceptable |
| Throughput | Spark / Beam parallelism over partitions |
| Cost | Spot / preemptible instances; GPU only if model truly needs it |
| Output | Written to warehouse / lake; downstream services read snapshots |
Streaming Serving (Per-Event Inference)
A streaming consumer applies the model to each event as it arrives. Used for: real-time fraud, content moderation triage, real-time personalization.
flowchart LR
Kafka[Kafka
events] --> Flink[Flink job
loads model
scores per event]
Flink --> KafkaOut[Kafka
scored events]
Flink --> FS[(Feature store
per-event features)]| Concern | Approach |
|---|---|
| Latency | Single-digit to tens of ms per event |
| Model loading | Loaded once into the streaming worker; refreshed on a schedule or via control message |
| Scaling | Partition events; scale streaming workers with partitions |
| State | Streaming engine maintains feature state (windowed aggregates) co-located with scoring |
Choosing the Right Mode
| Need | Use |
|---|---|
| User-facing prediction during a request | Online |
| Score-everyone, refresh occasionally | Batch |
| React to events with very low latency, every event scored | Streaming |
| Hybrid (most cases) | Online + Batch — batch produces stable snapshots; online layers personalization on top |
Deployment Strategies
Replacing a production model is a high-risk operation. Four strategies, in order of safety:
1. Shadow Mode
flowchart LR
Request --> Old[Old model
returns response]
Request -.->|"copy"| New[New model
response logged, not returned]
Old --> User[User sees response]
New --> Compare[Offline comparison
old vs new predictions]Run the new model in parallel; discard its output. Log predictions and compare distributions to the production model. Catches:
- Latency regressions (new model too slow)
- Distributional drift (new model predicts very differently than old)
- Crashes / errors that wouldn’t appear in offline eval
Costs 2× compute but is the safest first step. Run for a few days before any user-facing test.
2. Canary
Route 1–5% of traffic to the new model; the rest to the old one. Monitor:
- Online metrics (CTR, conversion, revenue per request) match offline expectations
- Guardrails (latency p99, error rate, fairness slices) don’t regress
- Sample ratio is what was configured (sanity check)
Gradually increase: 5% → 10% → 25% → 50% → 100% over hours or days, depending on traffic volume needed for statistical significance.
3. A/B Test
Like canary, but with a longer evaluation window designed for hypothesis testing on a north-star metric. Typical: 50/50 split for 1–2 weeks.
| Strategy | Goal | Duration | Sample size |
|---|---|---|---|
| Shadow | “Does it work at all in prod?” | Hours–days | All traffic, output discarded |
| Canary | “Does it cause acute regressions?” | Hours–days | 1–10% |
| A/B test | “Does it move the north-star metric?” | 1–4 weeks | 50/50 typical |
| Full rollout | After A/B win | Permanent | 100% |
4. Rollback
Every promotion path must have a fast rollback. The previous model version stays warm in the registry; a single config change returns serving traffic to it.
Rollback is the most underinvested part of ML deployment. Teams build elaborate canary frameworks but discover at 3am that rolling back a model takes 30 minutes because the previous artifact isn’t cached on the serving nodes. Rollback should be one config flip and effective within the time it takes a feature flag to propagate (seconds).
Monitoring: Three Kinds of Drift
Models in production silently degrade. Detecting why requires monitoring three different things.
flowchart TB
subgraph "What can go wrong"
D1[Data drift
input distribution
changes]
D2[Concept drift
relationship between
inputs and outputs
changes]
D3[Performance drift
actual labels show
accuracy is dropping]
end
D1 -->|"early warning"| Alert
D2 -->|"requires labels"| Alert
D3 -->|"ground truth"| Alert1. Data Drift (Input Distribution Drift)
The features the model sees at inference time differ from what it trained on. Doesn’t require labels — just compare distributions.
| Metric | What it detects | Tools |
|---|---|---|
| PSI (Population Stability Index) | Numeric / categorical feature distribution shift; >0.2 is significant | Standard in finance; easy to compute |
| KL divergence / JS divergence | Distribution drift, more sensitive | More math; same idea |
| Mean / variance / null rate per feature | Cheap and catches obvious upstream breakage | Always log, always alert |
| Schema check | New column, missing column, type change | Catch immediately at the API boundary |
Example: PSI on `user_country` feature
Training: US 60%, IN 20%, BR 10%, other 10%
Production (week 4): US 30%, IN 50%, BR 5%, other 15%
PSI = sum over buckets of (prod% - train%) × ln(prod% / train%) ≈ 0.4
Alert: model trained mostly on US users now serves mostly IN users2. Concept Drift (Output Drift)
The relationship between features and the label has changed — even with the same input distribution, the right answer is now different. Examples: a fraud pattern evolves, a price-elasticity model sees post-pandemic shopping behavior, an LLM’s expected response changes after a UX update.
Detection requires labels, which usually arrive late. Proxies:
- Drift in the predicted label distribution (e.g., fraud rate predicted by model rises 3×)
- Drift in prediction confidence (model becomes less confident)
- A/B-tested online metrics (CTR, conversion) decline
3. Performance Drift (Ground Truth Available)
Once labels arrive (clicks, fraud confirmations, returns), measure accuracy directly.
| Metric | Use |
|---|---|
| Accuracy / AUC / RMSE over rolling window | Direct performance |
| Calibration | Are predicted probabilities accurate? P(model says 80% → actually true 80%)? |
| Sliced metrics | Performance on segments (region, age, device) — catches fairness regressions |
| Latency p50/p95/p99 | Serving health |
Alerting Discipline
| Alert | Severity |
|---|---|
| Schema mismatch / null rate spike | P0 — page immediately |
| Latency p99 regression | P1 — page during business hours |
| Data drift on critical feature (PSI > 0.2) | P2 — investigate within day |
| Sliced accuracy drop on protected group | P0 (legal/compliance) |
| Slow accuracy decay | P3 — retrain queue |
End-to-End Architecture
flowchart TB
subgraph "Sources"
Data[(Data warehouse /
data lake)]
Stream[Event streams
Kafka]
end
subgraph "Feature Platform"
FS[(Feature store
online + offline)]
end
subgraph "Training"
Pipe[Training pipeline
Kubeflow / Airflow]
Track[Experiment tracking
MLflow]
Pipe --> Track
end
subgraph "Registry"
Reg[Model registry
versions + lineage
+ approvals]
end
subgraph "Serving"
Online[Online serving
KServe / Triton]
Batch[Batch scoring
Spark]
Stream2[Streaming scoring
Flink]
end
subgraph "Observability"
Mon[Drift + perf
monitoring]
Logs[Prediction logs
+ feature logs]
end
Data --> FS
Stream --> FS
FS --> Pipe
Pipe --> Reg
Reg --> Online
Reg --> Batch
Reg --> Stream2
Online --> Logs
Batch --> Logs
Stream2 --> Logs
Logs --> Mon
FS --> Mon
Mon -->|retrain trigger| PipeThe closed loop matters: monitoring feeds back into retraining. When drift is detected, a new training run is automatically triggered (or queued for human approval), and the cycle repeats.
Build vs. Buy
| Component | Build it yourself | Use a managed service |
|---|---|---|
| Feature store | At very large scale or unusual semantics | Most teams: Feast or Tecton |
| Training orchestration | Rarely worth it | Airflow / Kubeflow / SageMaker Pipelines |
| Model registry | When governance is custom (regulated industries) | MLflow / Vertex AI / SageMaker |
| Serving | When latency / cost / hardware needs are extreme | KServe, Triton, BentoML, SageMaker Endpoints |
| Monitoring | When you need very specific metrics | Evidently, Arize, WhyLabs, Fiddler |
Rule of thumb: for the first 5–10 models, lean entirely on managed/open-source pieces wired together. Build custom platform components only when a specific pain point can’t be solved by configuration of existing tools.
Production Examples
| Company | Platform notes |
|---|---|
| Uber Michelangelo | Pioneer of the in-house ML platform — feature store, training, registry, serving, monitoring as one product |
| Airbnb Bighead / Chronon | Strong feature consistency; Chronon enforces single-definition batch + streaming |
| Meta FBLearner Flow | Workflow-graph training; one of the earliest internal ML platforms |
| Netflix Metaflow + custom serving | Very pythonic training UX; custom Lambda/EC2 serving for personalization |
| Spotify Hendrix | Built on Kubeflow + Beam; emphasis on standardized monitoring |
| Google Vertex AI / TFX | Managed equivalent of internal “TFX” used inside Google |
Interview tip: Frame the ML platform around what can go wrong without one and the lifecycle stages it standardizes. Say: “An ML platform turns one-off model development into a repeatable governed process. The key components are a training pipeline (orchestrated in Kubeflow or Airflow) that pulls features from a feature store and produces a versioned artifact, a model registry that records the artifact along with its lineage — code commit, data snapshot, feature versions, hyperparameters, and approver — so any production prediction can be traced back to its origin, and a serving layer that supports online (real-time, KServe / Triton), batch (Spark scoring jobs), and streaming (Flink consumers) modes from the same registered artifact. Deployment uses shadow mode first to catch latency or distributional regressions, then canary at 1–5% of traffic, then a 1–2 week A/B test before full rollout, with one-click rollback. Monitoring tracks three kinds of drift: data drift via PSI on input features, concept drift via prediction distribution shifts, and performance drift via accuracy on labels once they arrive — with alerting that pages on schema breakage or fairness regressions and feeds back into automated retraining triggers. The platform earns its complexity once you have several production models — one team’s bespoke setup is fine; thirty teams’ bespoke setups are an audit and outage waiting to happen.”
Test Your Understanding
Your ML platform deploys a new fraud detection model directly to production. Within 2 hours, false positive rate triples — legitimate transactions are being blocked. The team didn’t run a shadow deployment. What should the deployment pipeline enforce, and how does shadow mode work?
Shadow mode (dark launch) runs the new model in parallel with the current production model, scoring every request with both, but only serving predictions from the current model. The new model’s predictions are logged but not acted upon.
After 1-3 days in shadow mode, compare:
- Prediction distribution: Does the new model’s output distribution match the old model’s? A sudden shift in the score distribution signals a problem.
- Latency: Does the new model meet the P99 latency SLA?
- Disagreement rate: How often do the two models disagree? High disagreement on known-good cases = regression.
The deployment pipeline should enforce: shadow mode (1-3 days) → canary (1-5% traffic, 1-2 days) → A/B test (50/50, 1-2 weeks with statistical significance) → full rollout. Each stage has automated gates: if the false positive rate exceeds threshold, the pipeline halts and rolls back automatically.
Your recommendation model in production suddenly starts recommending irrelevant items. The model hasn’t been retrained in 2 weeks. The data team confirms no code changes. Prediction accuracy dropped 15% over 14 days. What happened, and how should the platform detect this?
Data drift or concept drift. Two common causes:
Data drift: The distribution of input features has shifted. Example: a new marketing campaign drives a different user demographic; feature distributions (age, location, device) no longer match what the model was trained on. Detect via Population Stability Index (PSI) on input features — alert when PSI > 0.2.
Concept drift: The relationship between features and labels has changed. Example: user preferences shifted (seasonal trends, cultural events). The same features now predict different outcomes. Detect by monitoring prediction-to-label agreement once labels arrive (e.g., clicks within 24 hours of recommendation).
Platform requirements:
- Compute PSI daily on all input features; alert on drift.
- Track prediction distribution shifts (mean, variance, percentiles) in real-time.
- Trigger automated retraining when drift exceeds thresholds.
- Maintain a “freshness SLA” per model (e.g., retrain weekly for recommendation, daily for fraud).
A production model starts returning errors for 0.5% of requests. Investigation reveals: a new column was added to the feature table, shifting column indices. The feature extraction code in the serving pipeline expected column 7 to be ‘user_age’ but now it’s ‘user_country’. The model silently consumed the wrong features. How do you prevent this?
Schema validation at the serving boundary. The platform must enforce:
- Named features, not positional. Never reference features by column index. Use a feature store or feature schema that maps feature names to values. The serving pipeline requests
user_ageby name, not by position. - Schema contract in the model registry. When a model is registered, its expected input schema (feature names, types, ranges) is stored as metadata. At serving time, the infrastructure validates that the incoming feature vector matches the registered schema before passing it to the model.
- Type and range checks. If
user_ageis expected to be an integer in [0, 150] and the serving pipeline sends the string “US”, the request should fail loudly, not silently produce garbage predictions.
The one-click rollback the platform provides is the immediate mitigation: revert to the previous model version while the schema issue is fixed. Without it, the 0.5% error rate continues for hours while engineers debug.