AI Platform Engineering & MLOps Series · Part 5 of 34
The ML lifecycle, end to end, in production
Walk the eight canonical stages from problem framing to model retirement — and see exactly what changes at each stage when you move from a notebook to a production system.
A model that goes from training to serving without ever looping back to retraining is not an ML system in production — it is a one-shot batch job. The distinction matters because data distributions shift, user behaviour evolves, and the world the model was trained on drifts away from the world the model is asked to predict in. The discipline of closing that loop, reliably and repeatably, is what separates an MLOps practice from a notebook-to-API pipeline.
This article walks the eight canonical stages of the ML lifecycle — problem framing, data preparation, training, evaluation, registry, serving, monitoring, and retraining — and names, for each stage, the input it consumes, the output it produces, the failure mode that most commonly kills it, and the one decision that defines its quality. It closes with four lifecycle anti-patterns and how to recognise them before they cost you.
The lifecycle as a closed loop
Google's architecture guidance formalises three automation properties: continuous integration (CI) of code and data, continuous delivery (CD) of trained models to serving, and continuous training (CT) — the property unique to ML systems that automatically retrains and re-evaluates models when data conditions change [1]. CI and CD are familiar from software delivery; CT is the leg most organisations build last and break first. The monitoring-to-retraining arc — stages seven and eight — requires three independently functioning surfaces: a monitoring layer that detects drift, a labelling or data-refresh mechanism, and a retraining pipeline that still passes evaluation after months of quiescence. Each of those surfaces fails in its own way.
The failure to close the loop is the most common reason ML systems degrade without triggering an explicit alert. Sculley et al.'s foundational survey of ML technical debt [2] identifies feedback loops, undeclared consumers, and pipeline rot as the structural sources of this degradation — all of them downstream of a monitoring stage that was never given an actionable escalation path.
The explorer below lets you click any of the eight stages and toggle between “In a Notebook” and “In Production” — making the production additions concrete for each stage.
Lifecycle Journey Explorer
Stage 1: Problem Framing (In a Notebook)
A rough hypothesis and a dataset.
- Stated goal in a README or notebook header
- Single metric chosen informally
- No baseline comparison
- Ad-hoc success criteria
Primary failure mode
Skipping the baseline. A sophisticated model benchmarked only against itself — no proof it beats a simple rule.
Quality decision
Does the team have a measurable success criterion that a non-technical stakeholder can verify independently of the engineering team?
Stages 1–4: from problem to evaluated model
Stage 1 — Problem framing
Input: a business objective stated in natural language. Output: a measurable ML problem with a defined target metric, a baseline (typically a simple heuristic or the current rule-based system), and an explicit decision on whether ML is warranted at all.
Failure mode: skipping the baseline. A team trains a sophisticated model, benchmarks it against itself, and ships it — never establishing whether a rules-based system or a simple regression would have served equally well at a fraction of the ongoing operational cost. Without a baseline, you cannot know whether the model is adding value or merely adding complexity.
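The baseline comparison described above can be sketched as a small gate. This is a minimal illustration, not a prescribed method: the majority-class heuristic stands in for whatever simple rule the team already has, and the function names, the `min_lift` threshold, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def majority_class_baseline(train_labels, n):
    """Trivial baseline: always predict the most common training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n


def beats_baseline(model_preds, test_labels, train_labels, min_lift=0.05):
    """Return (passed, model_acc, baseline_acc) for a hypothetical lift gate."""
    baseline_preds = majority_class_baseline(train_labels, len(test_labels))
    model_acc = accuracy(model_preds, test_labels)
    baseline_acc = accuracy(baseline_preds, test_labels)
    return model_acc - baseline_acc >= min_lift, model_acc, baseline_acc


# Toy data, invented for illustration only.
train_labels = [0, 0, 0, 1, 0, 0, 1, 0]
test_labels = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
model_preds = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

passed, model_acc, baseline_acc = beats_baseline(model_preds, test_labels, train_labels)
print(f"model={model_acc:.2f} baseline={baseline_acc:.2f} passed={passed}")
```

The point of the design is that the baseline is scored on the same held-out labels as the model, so the lift is a like-for-like number a stakeholder can check.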
Stage 2 — Data preparation
Input: raw data sources. Output: versioned, validated train/validation/test splits with a documented schema and transformation logic.
Failure mode: training-serving skew. The feature transformation applied at training time is not identical to the transformation applied at inference time. This is among the most insidious failure modes because it produces a model that evaluates well offline and underperforms silently in production: offline tests run on features built by the training-time transformation, so they cannot see the gap. Sculley et al. [2] name this class of problem explicitly as a source of ML-specific technical debt arising from data dependencies.
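One way to guard against skew is to keep a single transform that both paths import, and to run a parity check against any serving-side reimplementation. The sketch below assumes hypothetical feature names (`amount`, `country`) and a deliberately broken `hand_copied_serving_transform` to show what the check catches.

```python
import math


def transform(raw):
    """Single source of truth for feature logic, meant to be imported by both paths."""
    return [
        math.log1p(max(raw["amount"], 0.0)),     # compress heavy-tailed amounts
        1.0 if raw["country"] == "GB" else 0.0,  # toy indicator feature
    ]


def hand_copied_serving_transform(raw):
    """A re-implemented serving version that silently lost the log transform."""
    return [
        max(raw["amount"], 0.0),
        1.0 if raw["country"] == "GB" else 0.0,
    ]


def parity_check(records, training_transform, serving_transform, tolerance=1e-9):
    """Return True only if both transforms agree on every record."""
    for record in records:
        train_row = training_transform(record)
        serve_row = serving_transform(record)
        if len(train_row) != len(serve_row):
            return False
        if any(abs(a - b) > tolerance for a, b in zip(train_row, serve_row)):
            return False
    return True


# Toy records, invented for illustration only.
records = [{"amount": 120.0, "country": "GB"}, {"amount": 8.5, "country": "FR"}]
print(parity_check(records, transform, transform))
print(parity_check(records, transform, hand_copied_serving_transform))
```

Run as a test against sampled production requests, a check of this kind surfaces skew that offline evaluation alone would miss.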
Stage 3 — Training
Input: versioned dataset, experiment configuration. Output: a trained model artefact with tracked metadata — hyperparameters, evaluation metrics, dataset version, and a pointer back to the experiment run.
Failure mode: experiment debt. Hundreds of runs tracked inconsistently — or not tracked at all — make it impossible to reproduce the model that scored best or to understand what changed between versions. The fix is treating the experiment tracker as a first-class system of record from day one, not retrofitting it after the team has accumulated entropy across six months of ad-hoc notebooks.
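As a rough sketch of what "system of record" means at the level of a single run, the snippet below ties hyperparameters and metrics to a content hash of the dataset. It uses only the standard library and is not the API of any particular experiment tracker; `RunRecord`, `dataset_fingerprint`, and the values shown are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def dataset_fingerprint(rows):
    """Stable content hash of the dataset, used here as its version identifier."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    run_id: str
    dataset_version: str
    hyperparameters: dict
    metrics: dict


def record_run(run_id, rows, hyperparameters, metrics):
    """Bundle everything needed to identify and reproduce a run."""
    return RunRecord(run_id, dataset_fingerprint(rows), hyperparameters, metrics)


# Toy values, invented for illustration only.
rows = [{"x": 1.0, "y": 0}, {"x": 2.5, "y": 1}]
run = record_run("run-001", rows, {"learning_rate": 0.1}, {"auc": 0.81})
print(json.dumps(asdict(run), sort_keys=True))
```

Because the dataset version is derived from content rather than typed by hand, two runs on different data cannot accidentally share a version label.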
For distributed training on Kubernetes, the platform substrate — gang scheduling, distributed training operators, GPU quota enforcement — becomes relevant here. The training stage is where workload shape (single-node vs multi-node, GPU-bound vs CPU-bound) most directly constrains platform design. Those infrastructure choices are covered in Part 3 of this series.
Stage 4 — Evaluation
Input: trained model artefact, held-out test set. Output: a signed-off evaluation report covering aggregate metrics, slice analysis, fairness checks, and adversarial probing — plus a model card documenting the results. Breck et al.'s ML Test Score [3] provides 28 specific tests across four categories — data tests, model tests, ML infrastructure tests, and monitoring tests — as a structured rubric for what a production-ready evaluation suite must cover.
Failure mode: aggregate metric tunnelling. A team optimises a single headline metric (accuracy, AUC, F1) and never examines slices. A model that achieves 92% overall accuracy while performing at 61% on a minority demographic slice will pass any automated gate that checks only the aggregate, and it will not survive an ethical review. Slice analysis is not optional for systems whose outputs affect people.
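A slice-aware gate can be sketched in a few lines. The example below is illustrative only: the slice names, the `min_slice_accuracy` threshold, and the toy data are assumptions, and a real gate would cover more metrics than accuracy.

```python
from collections import defaultdict


def slice_accuracy(predictions, labels, slices):
    """Accuracy per slice; `slices` holds one slice name per example."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, name in zip(predictions, labels, slices):
        total[name] += 1
        correct[name] += int(pred == label)
    return {name: correct[name] / total[name] for name in total}


def slice_gate(predictions, labels, slices, min_slice_accuracy=0.8):
    """Pass only if every slice clears the (hypothetical) minimum accuracy."""
    per_slice = slice_accuracy(predictions, labels, slices)
    failing = sorted(n for n, acc in per_slice.items() if acc < min_slice_accuracy)
    return len(failing) == 0, per_slice, failing


# Toy data: 90% aggregate accuracy, but the small slice "B" sits at 50%.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predictions = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
slices = ["A"] * 8 + ["B"] * 2

gate_passed, per_slice, failing = slice_gate(predictions, labels, slices)
print(gate_passed, per_slice, failing)
```

The aggregate number here would clear a 0.8 threshold on its own; only the per-slice view exposes the failing group.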
Stages 5–8: from registry to closed loop
Stage 5 — Registry
Input: evaluated model artefact with attached metadata. Output: a versioned, registered model with a defined lifecycle state (experimental → staging → production → archived) and a promotion gate that must be passed before a model reaches production.
Failure mode: model-of-record drift. Production is running a model that cannot be identified in the registry, whose training run metadata has been lost, and whose training data version is unknown. This is the most dangerous silent failure in the lifecycle — it means you cannot answer the four questions a regulator or incident responder will ask: what model is serving, where did it come from, who approved it, and is the served artefact the artefact that was evaluated?
The registry also serves as the GitOps trigger: when a model transitions to the Production state, an automated handoff writes an updated serving manifest to the GitOps repository. This seam — registry promotion to infrastructure reconciliation — is the most under-documented in the standard lifecycle. Part 4 of this series covers registry patterns, lifecycle states, and the curation-policy-as-code pattern in depth.
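The lifecycle states and promotion gate can be modelled as a small state machine. This is a sketch under stated assumptions rather than the behaviour of any specific registry product: the transition table (including allowing archival from any live state), the `evaluation_passed` flag, and the `promote` function are all hypothetical.

```python
from enum import Enum


class Stage(Enum):
    EXPERIMENTAL = "experimental"
    STAGING = "staging"
    PRODUCTION = "production"
    ARCHIVED = "archived"


# Assumed transition table: forward one step at a time, or straight to archived.
ALLOWED = {
    Stage.EXPERIMENTAL: {Stage.STAGING, Stage.ARCHIVED},
    Stage.STAGING: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.ARCHIVED},
    Stage.ARCHIVED: set(),
}


class PromotionError(Exception):
    """Raised when a transition is not allowed or the gate is not satisfied."""


def promote(model, target):
    """Return a copy of the model record in the target stage, or raise."""
    current = model["stage"]
    if target not in ALLOWED[current]:
        raise PromotionError(f"{current.value} -> {target.value} is not allowed")
    if target is Stage.PRODUCTION and not model.get("evaluation_passed"):
        raise PromotionError("promotion gate: evaluation has not been signed off")
    return {**model, "stage": target}


candidate = {"name": "fraud-scorer", "stage": Stage.EXPERIMENTAL, "evaluation_passed": True}
staged = promote(candidate, Stage.STAGING)
live = promote(staged, Stage.PRODUCTION)
print(live["stage"].value)
```

In a GitOps setup, the successful transition to `Stage.PRODUCTION` is the natural place to emit the event that updates the serving manifest.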
Stage 6 — Serving
Input: a promoted model artefact from the registry, a serving configuration. Output: a containerised inference endpoint with defined SLAs, a deployment strategy (canary, blue-green, or rolling), and a rollback path.
Failure mode: shadow debt. A model is deployed manually — via a direct kubectl command or a one-off script — and exists outside any GitOps loop. The next release has no safe rollback path because the baseline state was never declared as code. Shadow deployments accumulate silently: engineers move on, the original deployer forgets, and the model is effectively orphaned with no known owner and no documented rollback procedure.
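Deployments that live outside the GitOps loop can be detected by comparing declared state with observed state. The sketch below assumes both are available as simple name-to-version mappings; how those mappings are obtained (from the repository and from the cluster) is outside its scope, and the model names and versions are invented.

```python
def find_shadow_deployments(declared, running):
    """Compare declared (GitOps) state with observed serving state.

    Both arguments map a deployment name to a model version.
    Returns (undeclared, version_mismatch) as sorted lists of names.
    """
    undeclared = sorted(set(running) - set(declared))
    version_mismatch = sorted(
        name
        for name in set(declared) & set(running)
        if declared[name] != running[name]
    )
    return undeclared, version_mismatch


# Hypothetical snapshots, for illustration only.
declared = {"fraud-scorer": "v12", "churn-model": "v4"}
running = {"fraud-scorer": "v13", "churn-model": "v4", "pricing-hotfix": "v1"}

undeclared, version_mismatch = find_shadow_deployments(declared, running)
print("undeclared:", undeclared)
print("version mismatch:", version_mismatch)
```

Either list being non-empty means the next rollback would start from a state nobody has written down.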
Stage 7 — Monitoring
Input: live prediction requests and outcomes, ground-truth labels (where available), and system telemetry. Output: drift alerts, performance degradation signals, and — critically — a retraining trigger. Gama et al.'s comprehensive survey of concept drift adaptation [4] distinguishes three drift types the monitoring layer must handle separately: covariate shift (input distribution changes, relationship holds), concept drift (relationship between input and target changes), and label drift (target label distribution shifts). Each requires a different detection strategy and a different remediation response.
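For the covariate-shift case, one commonly used detector is the Population Stability Index (PSI) over a binned feature. The sketch below is one possible implementation, not the method of the cited survey; the bin edges, the `epsilon` smoothing, and the toy samples are assumptions. It says nothing about concept drift, which needs labels to detect.

```python
import math


def histogram(values, edges):
    """Share of values in each bin; values outside the edges are clamped."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        index = 0
        while index < len(counts) - 1 and v >= edges[index + 1]:
            index += 1
        counts[index] += 1
    return [c / len(values) for c in counts]


def population_stability_index(reference, live, edges, epsilon=1e-6):
    """PSI = sum over bins of (live - ref) * ln(live / ref)."""
    ref_share = histogram(reference, edges)
    live_share = histogram(live, edges)
    psi = 0.0
    for r, l in zip(ref_share, live_share):
        r, l = max(r, epsilon), max(l, epsilon)  # avoid log(0) on empty bins
        psi += (l - r) * math.log(l / r)
    return psi


# Toy samples: the live data has shifted mass from the first bin to the last.
edges = [0.0, 1.0, 2.0, 3.0]
reference = [0.5] * 5 + [1.5] * 3 + [2.5] * 2
live = [0.5] * 2 + [1.5] * 3 + [2.5] * 5

print(population_stability_index(reference, reference, edges))
print(population_stability_index(reference, live, edges))
```

The alert threshold applied to the score is a policy decision for the monitoring owner; the code only produces the number.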
Failure mode: alert-only monitoring without an actionable response. Alerts fire, no one owns the on-call rotation for ML quality, the alert is silenced, and the model continues degrading. A monitoring layer without a defined owner, an escalation path, and a retraining trigger is logging theatre — it generates the appearance of observability without the operational capability to act on it.
A critical constraint in this stage is ground truth lag: the delay between prediction and true label arrival. For fraud detection it may be hours; for long-horizon forecasting it may be months. The monitoring strategy must account for this lag or it will fire on statistical noise rather than genuine degradation.
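Accounting for ground truth lag can be as simple as scoring only predictions old enough for their label to have arrived. The following sketch assumes each prediction carries a `predicted_at` timestamp and a `label` that is `None` until ground truth lands; the field names, dates, and three-day lag are hypothetical.

```python
from datetime import datetime, timedelta


def matured_predictions(predictions, now, label_lag):
    """Keep only predictions old enough for their true label to have arrived."""
    cutoff = now - label_lag
    return [p for p in predictions if p["predicted_at"] <= cutoff]


def matured_accuracy(predictions, now, label_lag):
    """Accuracy over matured, labelled predictions; None if there are none yet."""
    scored = [
        p for p in matured_predictions(predictions, now, label_lag)
        if p["label"] is not None
    ]
    if not scored:
        return None
    return sum(p["prediction"] == p["label"] for p in scored) / len(scored)


# Toy log, invented for illustration only.
now = datetime(2025, 1, 10)
log = [
    {"predicted_at": datetime(2025, 1, 1), "prediction": 1, "label": 1},
    {"predicted_at": datetime(2025, 1, 5), "prediction": 0, "label": 1},
    {"predicted_at": datetime(2025, 1, 9), "prediction": 1, "label": None},
]

print(matured_accuracy(log, now, timedelta(days=3)))
```

Excluding the immature window keeps recent, still-unlabelled predictions from being counted as errors or silently ignored.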
Stage 8 — Retraining
Input: a retraining trigger (scheduled, drift-triggered, or manual), refreshed data. Output: a new candidate model that has passed the same evaluation suite as the original and been promoted through the registry.
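The three trigger types named above can be combined into one decision function. This is a sketch with assumed defaults: the precedence order (manual, then drift, then schedule), the 30-day `max_age`, and the `drift_threshold` are illustrative choices, not recommendations.

```python
from datetime import datetime, timedelta


def retraining_trigger(
    now,
    last_trained,
    drift_score,
    manual_request=False,
    max_age=timedelta(days=30),
    drift_threshold=0.25,
):
    """Return the reason to retrain ('manual', 'drift', 'scheduled') or None."""
    if manual_request:
        return "manual"
    if drift_score >= drift_threshold:
        return "drift"
    if now - last_trained >= max_age:
        return "scheduled"
    return None


# Hypothetical timestamps and scores, for illustration only.
now = datetime(2025, 3, 1)
print(retraining_trigger(now, datetime(2025, 2, 20), drift_score=0.05))
print(retraining_trigger(now, datetime(2025, 2, 20), drift_score=0.40))
print(retraining_trigger(now, datetime(2025, 1, 1), drift_score=0.05))
```

Returning the reason rather than a bare boolean lets the retraining run record why it was started, which helps when auditing how often each trigger type fires.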
Failure mode: pipeline rot. The retraining pipeline was written during the original project, never maintained as a production service, and fails silently when triggered months later because a dependency has changed, a data source has moved, or the infrastructure configuration has drifted from the environment the pipeline was written for. The retraining pipeline must be treated as a production service — with tests, versioning, and on-call ownership — not as a script that worked once.
The retraining pipeline should be the same artefact as the training pipeline, not a parallel script. If retraining requires a separate code path, that path will diverge from the original and the divergence will be discovered at the worst possible moment: when a production model needs to be replaced urgently. The retirement path, which routes a model to end-of-life, is also managed at this stage via the registry's archived state.
The tracer below animates how a production signal propagates backwards through the lifecycle — demonstrating why the lifecycle is a cycle, not a line.
Feedback Loop Tracer
Monitoring detects covariate shift: the input distribution has drifted from training. The signal propagates backwards — Monitoring → Retraining → Training → Evaluation → Registry → Serving.
How the lifecycle shifts across deployment contexts
The eight-stage lifecycle is universal. What changes across the deployment-context spectrum — pure-cloud, on-premises, hybrid, air-gapped — is where each stage executes, who operates it, and what constraints apply. In a pure-cloud context, most pipeline infrastructure is managed; in an on-premises or air-gapped context, every runner, registry, and monitoring backend is self-hosted and self-maintained. The lifecycle itself does not change; the operational burden at each stage does.
Two stages are most visibly affected by deployment context. Data preparation splits along data-residency lines in hybrid and regulated environments — some features may only be computed on the on-premises side, creating a pipeline that spans an interconnect boundary. Monitoring is affected in air-gapped environments because telemetry cannot leave the perimeter, so every observability backend — metrics, logs, traces, drift detection — must run inside the perimeter.
Pure-cloud: managed pipelines, serverless training, hosted registries. Focus is on cost governance and egress control.
On-premises: every component is self-hosted. Operational burden is highest, often offset by data-residency or latency requirements.
Hybrid: some features computed on-prem, others in cloud. The interconnect boundary is a seam to manage explicitly.
Air-gapped: all telemetry stays inside the perimeter. Monitoring and drift detection must be entirely self-contained.
Four lifecycle anti-patterns
These four anti-patterns appear consistently in ML systems that fail in production. Recognising them early is cheaper than diagnosing them after a degradation incident.
1. The open loop
The model is deployed and the team moves on. There is no monitoring, no drift detection, and no retraining trigger. The model degrades silently until a business stakeholder notices that something has gone wrong — typically months after the model started failing. This is the most common lifecycle anti-pattern and the easiest to prevent: deploy monitoring at the same time as the model, not afterwards.
2. The frozen pipeline
Monitoring is deployed but the retraining pipeline has not been maintained. Drift alerts fire, the on-call engineer acknowledges them, and then discovers that the retraining pipeline fails for an unrelated reason — a broken dependency, a changed data schema, a rotated credential. The fix is continuous smoke-testing of the retraining pipeline on a schedule, independent of whether a drift signal has been received.
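A scheduled smoke test can be sketched as a wrapper that runs the pipeline on a tiny fixture and reports failure instead of crashing. The pipelines below are stand-ins: `rotted_pipeline` simulates a renamed column, and the fixture, function names, and result shape are assumptions for the example.

```python
def smoke_test(pipeline, fixture_rows):
    """Run the pipeline on a small fixture; return (ok, message)."""
    try:
        result = pipeline(fixture_rows)
    except Exception as exc:  # any failure should page the owner, not vanish
        return False, f"{type(exc).__name__}: {exc}"
    if "model" not in result:
        return False, "pipeline returned no model artefact"
    return True, "ok"


def healthy_pipeline(rows):
    """Stand-in pipeline whose 'model' is the mean of column x."""
    return {"model": sum(r["x"] for r in rows) / len(rows)}


def rotted_pipeline(rows):
    """Stand-in pipeline that still expects a column that has been renamed."""
    return {"model": sum(r["amount_gbp"] for r in rows) / len(rows)}


fixture = [{"x": 1.0}, {"x": 3.0}]
print(smoke_test(healthy_pipeline, fixture))
print(smoke_test(rotted_pipeline, fixture))
```

Scheduling this independently of drift signals means a broken dependency is found on a quiet day rather than during an incident.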
3. The unregistered deployment
A model is deployed outside the registry — directly to a serving endpoint, via a manual script, or by copying an artefact from a shared drive. The registry state and the serving state diverge. The next engineer to investigate a production issue cannot determine which model version is running or trace it back to a training run. This anti-pattern often originates from a well-intentioned hotfix that was never formalised.
4. The dual codepath
The training pipeline and the retraining pipeline are separate scripts that share no code. The transformation logic diverges between them over time. The model trained by the retraining pipeline produces different outputs than the model trained by the original pipeline on the same data — not because the model has been intentionally changed, but because the two codepaths have silently drifted apart. The fix is a single pipeline with a parameter that controls whether the run is an initial training run or a retraining run.
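The single-pipeline fix can be illustrated with one entry point and a `mode` parameter. In this sketch the toy `prepare` and `fit` steps, the version-numbering rule, and the parameter names are hypothetical; what matters is that both modes run exactly the same transformation and fitting code.

```python
def prepare(row):
    """Shared feature step (toy example: scale column x)."""
    return float(row["x"]) * 2.0


def fit(features):
    """Toy 'model': the mean of the prepared feature."""
    return sum(features) / len(features)


def run_pipeline(rows, mode="initial", previous_version=0):
    """One code path for both initial training and retraining."""
    if mode not in ("initial", "retrain"):
        raise ValueError(f"unknown mode: {mode!r}")
    features = [prepare(r) for r in rows]
    model = fit(features)
    # The mode changes bookkeeping only, never the training logic.
    version = 1 if mode == "initial" else previous_version + 1
    return {"version": version, "mode": mode, "model": model}


rows = [{"x": 1}, {"x": 3}]
first = run_pipeline(rows, mode="initial")
again = run_pipeline(rows, mode="retrain", previous_version=3)
print(first)
print(again)
```

Given the same data, the two modes produce the same model by construction, which is the property the dual-codepath anti-pattern loses.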
What this series carries forward
The eight stages and their failure modes are the shared vocabulary for the rest of this series. Part 2 continues with the organisational patterns for owning the lifecycle — because the lifecycle's failure modes do not all arise from technical choices. Many arise from unclear ownership at the stage boundaries: who owns the monitoring-to-retraining handoff, who owns the registry-to-serving handoff, and what happens when a stage has no named owner.
Part 3 goes deep on training workloads on Kubernetes and Part 4 covers registry patterns and lifecycle state management in depth.
Stage 1, Problem framing: Business objective → measurable ML problem + baseline
Stage 2, Data preparation: Versioned, validated splits with shared transform artefact
Stage 3, Training: Tracked experiments; reproducible from tracker alone
Stage 4, Evaluation: Aggregate + slice + fairness; automated repeatable gate
Stage 5, Registry: Lifecycle states; promotion gate; GitOps trigger
Stage 6, Serving: GitOps-declared; canary strategy; tested rollback path
Stage 7, Monitoring: Covariate, concept, label drift; named owner; trigger
Stage 8, Retraining: Same pipeline; smoke-tested on schedule; retirement path
References
- [1] Google Cloud Architecture Center. “MLOps: Continuous delivery and automation pipelines in machine learning.” Google Cloud Documentation, 2020 (updated 2024).
- [2] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, D. Dennison. “Hidden Technical Debt in Machine Learning Systems.” Advances in Neural Information Processing Systems 28 (NeurIPS), 2015.
- [3] E. Breck, S. Cai, E. Nielsen, M. Salib, D. Sculley. “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction.” IEEE Big Data 2017.
- [4] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia. “A survey on concept drift adaptation.” ACM Computing Surveys, 46(4), Article 44, 2014. DOI: 10.1145/2523813.