AI Platform Engineering & MLOps Series  ·  Part 5 of 34

The ML lifecycle, end to end, in production

Walk the eight canonical stages from problem framing to model retirement — and see exactly what changes at each stage when you move from a notebook to a production system.

[Figure: the ML lifecycle as a closed loop of eight stages: 1 Problem Framing, 2 Data Prep, 3 Training, 4 Evaluation, 5 Registry, 6 Serving, 7 Monitoring, 8 Retraining, plus Retirement.]

A model that goes from training to serving without ever looping back to retraining is not an ML system in production — it is a one-shot batch job. The distinction matters because data distributions shift, user behaviour evolves, and the world the model was trained on drifts away from the world the model is asked to predict in. The discipline of closing that loop, reliably and repeatably, is what separates an MLOps practice from a notebook-to-API pipeline.

This article walks the eight canonical stages of the ML lifecycle (problem framing, data preparation, training, evaluation, registry, serving, monitoring, and retraining) and names, for each stage, the input it consumes, the output it produces, the failure mode that most commonly kills it, and the one decision that defines its quality. It closes with four lifecycle anti-patterns and how to recognise them before they cost you.

The lifecycle as a closed loop

Google's architecture guidance formalises three automation properties: continuous integration (CI) of code and data, continuous delivery (CD) of trained models to serving, and continuous training (CT) — the property unique to ML systems that automatically retrains and re-evaluates models when data conditions change [1]. CI and CD are familiar from software delivery; CT is the leg most organisations build last and break first. The monitoring-to-retraining arc — stages seven and eight — requires three independently functioning surfaces: a monitoring layer that detects drift, a labelling or data-refresh mechanism, and a retraining pipeline that still passes evaluation after months of quiescence. Each of those surfaces fails in its own way.

The failure to close the loop is the most common reason ML systems degrade without triggering an explicit alert. Sculley et al.'s foundational paper on ML technical debt [2] identifies hidden feedback loops, undeclared consumers, and pipeline jungles as structural sources of this degradation, all of them downstream of a monitoring stage that was never given an actionable escalation path.

The explorer below lets you click any of the eight stages and toggle between “In a Notebook” and “In Production” — making the production additions concrete for each stage.

Lifecycle Journey Explorer

Select a stage to see what it looks like in a notebook versus in a production MLOps system.

Stage 1: Problem Framing

A rough hypothesis and a dataset.

  • Stated goal in a README or notebook header
  • Single metric chosen informally
  • No baseline comparison
  • Ad-hoc success criteria

Stages 1–4: from problem to evaluated model

Stage 1 — Problem framing

Input: a business objective stated in natural language. Output: a measurable ML problem with a defined target metric, a baseline (typically a simple heuristic or the current rule-based system), and an explicit decision on whether ML is warranted at all.

Failure mode: skipping the baseline. A team trains a sophisticated model, benchmarks it against itself, and ships it — never establishing whether a rules-based system or a simple regression would have served equally well at a fraction of the ongoing operational cost. Without a baseline, you cannot know whether the model is adding value or merely adding complexity.

Quality decision: Does the team have a measurable success criterion that a non-technical stakeholder can verify independently of the engineering team's claims?
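The baseline check described above can be sketched in a few lines. This is a minimal illustration, not a prescribed gate: the metric (accuracy), the `min_lift` threshold, and the sample predictions are all hypothetical choices made for the example.

```python
from typing import Sequence

def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def beats_baseline(y_true, model_pred, baseline_pred, min_lift=0.02) -> bool:
    """True only if the model beats the baseline by at least `min_lift`."""
    lift = accuracy(y_true, model_pred) - accuracy(y_true, baseline_pred)
    return lift >= min_lift

# Hypothetical held-out labels and predictions, for illustration only.
y_true        = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
model_pred    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]  # 9 of 10 correct
baseline_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]  # 8 of 10 correct (simple rule)

if __name__ == "__main__":
    print("model beats baseline:", beats_baseline(y_true, model_pred, baseline_pred))
```

Because the function takes the baseline's predictions as an explicit argument, the comparison cannot be skipped silently: a run with no baseline has nothing to pass in.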

Stage 2 — Data preparation

Input: raw data sources. Output: versioned, validated train/validation/test splits with a documented schema and transformation logic.

Failure mode: training-serving skew. The feature transformation applied at training time is not identical to the transformation applied at inference time. This is among the most insidious failure modes because it produces a model that evaluates well offline and underperforms silently in production — the performance delta is invisible to any test that runs on the training distribution. Sculley et al. name this class of problem explicitly as a source of ML-specific technical debt arising from data dependencies.

Quality decision: Is the transformation code that runs at training time the same artefact that runs at inference time, verifiably, or are there two codepaths that are assumed to be equivalent?
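One way to make "the same artefact, verifiably" concrete is to have training and serving import a single transform function and to record a fingerprint of it alongside the model. The sketch below assumes a hypothetical two-feature transform and a hypothetical metadata dictionary; fingerprinting the compiled function is only one of several possible checks, and it is comparable only between processes on the same Python version.

```python
import hashlib
import math

def transform(raw: dict) -> list:
    """The one feature transformation; training and serving both import it."""
    return [math.log1p(raw["amount"]), float(raw["country"] == "GB")]

def transform_fingerprint() -> str:
    """Hash of the compiled transform (bytecode plus constants).

    Assumption: both sides run the same Python version, otherwise the
    bytecode differs even when the source is identical.
    """
    code = transform.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()

def train_metadata() -> dict:
    """Recorded next to the model artefact when training finishes."""
    return {"transform_fingerprint": transform_fingerprint()}

def serve(model_metadata: dict, raw: dict) -> list:
    """Refuse to build features if the transform differs from training."""
    if model_metadata["transform_fingerprint"] != transform_fingerprint():
        raise RuntimeError("training-serving skew: transform has changed")
    return transform(raw)
```

A package version pin or a container image digest would serve the same purpose; the point is that equivalence is checked at serving time rather than assumed.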

Stage 3 — Training

Input: versioned dataset, experiment configuration. Output: a trained model artefact with tracked metadata — hyperparameters, evaluation metrics, dataset version, and a pointer back to the experiment run.

Failure mode: experiment debt. Hundreds of runs tracked inconsistently — or not tracked at all — make it impossible to reproduce the model that scored best or to understand what changed between versions. The fix is treating the experiment tracker as a first-class system of record from day one, not retrofitting it after the team has accumulated entropy across six months of ad-hoc notebooks.

For distributed training on Kubernetes, the platform substrate — gang scheduling, distributed training operators, GPU quota enforcement — becomes relevant here. The training stage is where workload shape (single-node vs multi-node, GPU-bound vs CPU-bound) most directly constrains platform design. Those infrastructure choices are covered in Part 3 of this series.

Quality decision: Can any engineer on the team reproduce the best model from a cold start, using only the experiment tracker as the source of truth?
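As a rough sketch of what "the tracker as the source of truth" implies, a run record needs to capture every input required to reproduce the run. The field names below are hypothetical and do not correspond to any particular experiment tracker's schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RunRecord:
    dataset_version: str
    code_commit: str
    hyperparameters: dict
    seed: int
    metrics: dict

    def run_id(self) -> str:
        """Deterministic ID derived from the inputs needed to reproduce the run."""
        payload = asdict(self)
        payload.pop("metrics")  # metrics are results, not inputs
        blob = json.dumps(payload, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

# Hypothetical values, for illustration only.
run_a = RunRecord("ds-v7", "3f2c1ab", {"lr": 0.01, "depth": 6}, seed=42,
                  metrics={"auc": 0.91})
run_b = RunRecord("ds-v7", "3f2c1ab", {"depth": 6, "lr": 0.01}, seed=42,
                  metrics={"auc": 0.90})
```

Here `run_a` and `run_b` share an ID because their inputs are identical; if their metrics differ materially, that is a signal that something affecting the result is not being tracked.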

Stage 4 — Evaluation

Input: trained model artefact, held-out test set. Output: a signed-off evaluation report covering aggregate metrics, slice analysis, fairness checks, and adversarial probing — plus a model card documenting the results. Breck et al.'s ML Test Score [3] provides 28 specific tests across four categories — data tests, model tests, ML infrastructure tests, and monitoring tests — as a structured rubric for what a production-ready evaluation suite must cover.

Failure mode: aggregate metric tunnelling. A team optimises a single headline metric (accuracy, AUC, F1) and never examines slices. A model that achieves 92% overall accuracy while performing at 61% on a minority demographic slice will pass any automated gate that checks only the aggregate, and it will fail an ethical review. Slice analysis is not optional for systems whose outputs affect people.

Quality decision: Does the evaluation report include slice analysis broken down by the dimensions that matter for the use case, or only an aggregate score?
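A minimal sketch of a slice-aware gate follows. The slice names, the rows, and the 0.75 threshold are hypothetical; a real evaluation suite would choose slices and thresholds per use case.

```python
from collections import defaultdict

def accuracy_by_slice(rows):
    """rows: iterable of (slice_name, y_true, y_pred). Returns {slice: accuracy}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for name, y_true, y_pred in rows:
        totals[name] += 1
        hits[name] += int(y_true == y_pred)
    return {name: hits[name] / totals[name] for name in totals}

def failing_slices(rows, min_accuracy):
    """Names of slices whose accuracy falls below the gate threshold."""
    scores = accuracy_by_slice(rows)
    return sorted(name for name, acc in scores.items() if acc < min_accuracy)

# Hypothetical evaluation rows: slice A has 10 examples, slice B has 5.
rows = ([("A", 1, 1)] * 9 + [("A", 1, 0)] * 1 +
        [("B", 1, 1)] * 3 + [("B", 1, 0)] * 2)

if __name__ == "__main__":
    overall = sum(t == p for _, t, p in rows) / len(rows)
    print("overall:", overall)
    print("failing slices:", failing_slices(rows, min_accuracy=0.75))
```

In this toy data the overall accuracy clears the 0.75 threshold while slice B does not, which is exactly the case an aggregate-only gate lets through.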

Stages 5–8: from registry to closed loop

Stage 5 — Registry

Input: evaluated model artefact with attached metadata. Output: a versioned, registered model with a defined lifecycle state (experimental → staging → production → archived) and a promotion gate that must be passed before a model reaches production.

Failure mode: model-of-record drift. Production is running a model that cannot be identified in the registry, whose training run metadata has been lost, and whose training data version is unknown. This is the most dangerous silent failure in the lifecycle — it means you cannot answer the four questions a regulator or incident responder will ask: what model is serving, where did it come from, who approved it, and is the served artefact the artefact that was evaluated?

The registry also serves as the GitOps trigger: when a model transitions to the Production state, an automated handoff writes an updated serving manifest to the GitOps repository. This seam — registry promotion to infrastructure reconciliation — is the most under-documented in the standard lifecycle. Part 4 of this series covers registry patterns, lifecycle states, and the curation-policy-as-code pattern in depth.

Quality decision: Can you trace a running model in production back to its exact training dataset version, training run hash, and the person who approved its promotion — in under five minutes, from a cold start?
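The lifecycle states and the promotion gate can be sketched as a small state machine. The transition table below is an assumption (in particular, allowing any state to move directly to archived), and the field names are hypothetical rather than any registry product's API.

```python
from dataclasses import dataclass, field

# Assumed transition table based on experimental -> staging -> production -> archived.
ALLOWED = {
    "experimental": {"staging", "archived"},
    "staging": {"production", "archived"},
    "production": {"archived"},
    "archived": set(),
}

@dataclass
class ModelVersion:
    name: str
    version: int
    dataset_version: str
    run_id: str
    evaluation_passed: bool = False
    state: str = "experimental"
    history: list = field(default_factory=list)

def promote(model: ModelVersion, target: str, approver: str) -> None:
    """Move a model to `target`, enforcing transitions and the production gate."""
    if target not in ALLOWED[model.state]:
        raise ValueError(f"illegal transition {model.state} -> {target}")
    if target == "production" and not model.evaluation_passed:
        raise PermissionError("promotion gate: evaluation has not passed")
    # The history answers "who approved it" without leaving the registry.
    model.history.append((model.state, target, approver))
    model.state = target
```

Keeping the dataset version, run ID, and approval history on the same record is what makes the five-minute traceability question answerable from one lookup.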

Stage 6 — Serving

Input: a promoted model artefact from the registry, a serving configuration. Output: a containerised inference endpoint with defined SLAs, a deployment strategy (canary, blue-green, or rolling), and a rollback path.

Failure mode: shadow debt. A model is deployed manually — via a direct kubectl command or a one-off script — and exists outside any GitOps loop. The next release has no safe rollback path because the baseline state was never declared as code. Shadow deployments accumulate silently: engineers move on, the original deployer forgets, and the model is effectively orphaned with no known owner and no documented rollback procedure.

Quality decision: Is every production model deployment declared as code in a GitOps repository, with a documented and tested rollback path?
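Shadow debt can be detected by comparing what the GitOps repository declares with what is actually running. The sketch below assumes both sides have already been reduced to a mapping of endpoint name to model version; how those mappings are collected (parsing manifests, querying the cluster) is deliberately left out, and the endpoint names are hypothetical.

```python
def find_shadow_deployments(declared: dict, running: dict) -> dict:
    """Compare declared state (from Git) with observed state (from the cluster)."""
    declared_names, running_names = set(declared), set(running)
    return {
        # Running but never declared as code: the shadow deployments.
        "undeclared": sorted(running_names - declared_names),
        # Declared and running, but at a different model version.
        "version_mismatch": sorted(
            name for name in declared_names & running_names
            if declared[name] != running[name]
        ),
        # Declared but not running anywhere.
        "not_running": sorted(declared_names - running_names),
    }

# Hypothetical endpoint -> model version mappings.
declared = {"fraud-scorer": "v12", "churn": "v4", "ranker": "v9"}
running = {"fraud-scorer": "v12", "churn": "v5", "pricing-hotfix": "v1"}

if __name__ == "__main__":
    print(find_shadow_deployments(declared, running))
```

Running a comparison like this on a schedule turns an orphaned manual deployment into a visible finding rather than something discovered during the next release.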

Stage 7 — Monitoring

Input: live prediction requests and outcomes, ground-truth labels (where available), and system telemetry. Output: drift alerts, performance degradation signals, and — critically — a retraining trigger. Gama et al.'s comprehensive survey of concept drift adaptation [4] distinguishes three drift types the monitoring layer must handle separately: covariate shift (input distribution changes, relationship holds), concept drift (relationship between input and target changes), and label drift (target label distribution shifts). Each requires a different detection strategy and a different remediation response.
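For the first of these drift types, covariate shift on a single numeric feature, one commonly used statistic is the population stability index (PSI). PSI is a choice made for this sketch, not a method taken from the survey cited above; the equal-width binning, the `eps` floor for empty bins, and the sample data are all assumptions. Values above roughly 0.25 are often treated as a significant shift, but that threshold is a rule of thumb, not a standard.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values: training reference and a shifted live window.
reference = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in reference]

if __name__ == "__main__":
    print("no shift:", psi(reference, reference))
    print("shifted: ", psi(reference, shifted))
```

Concept drift cannot be detected this way, because the input distribution may stay the same while the input-target relationship changes; it needs labelled outcomes.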

Failure mode: alert-only monitoring without an actionable response. Alerts fire, no one owns the on-call rotation for ML quality, the alert is silenced, and the model continues degrading. A monitoring layer without a defined owner, an escalation path, and a retraining trigger is logging theatre — it generates the appearance of observability without the operational capability to act on it.

A critical constraint in this stage is ground truth lag: the delay between prediction and true label arrival. For fraud detection it may be hours; for long-horizon forecasting it may be months. The monitoring strategy must account for this lag or it will fire on statistical noise rather than genuine degradation.
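One way to account for ground truth lag is to score only predictions old enough for their labels to have arrived, and to report label coverage alongside the metric. The record layout and the three-day lag below are hypothetical.

```python
from datetime import datetime, timedelta

def matured_accuracy(predictions, labels, now, label_lag):
    """Score only predictions made at least `label_lag` ago.

    predictions: list of (prediction_id, predicted_at, y_pred)
    labels:      {prediction_id: y_true} for labels that have arrived
    Returns (accuracy or None, coverage of matured predictions with labels).
    """
    cutoff = now - label_lag
    matured = [(pid, y) for pid, ts, y in predictions if ts <= cutoff]
    scored = [(labels[pid], y) for pid, y in matured if pid in labels]
    coverage = len(scored) / len(matured) if matured else 0.0
    accuracy = sum(t == p for t, p in scored) / len(scored) if scored else None
    return accuracy, coverage

# Hypothetical data: labels take about three days to arrive.
now = datetime(2024, 1, 10)
predictions = [
    ("a", datetime(2024, 1, 1), 1),
    ("b", datetime(2024, 1, 5), 0),
    ("c", datetime(2024, 1, 9), 1),  # too recent, excluded from scoring
]
labels = {"a": 1, "b": 1}
```

Low coverage is itself a signal: if matured predictions are missing labels, the accuracy figure rests on a partial and possibly biased sample.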

Quality decision: Does every drift alert have a named owner, a defined escalation path, and a retraining trigger — or do alerts accumulate in a dashboard that no one is on call to read?

Stage 8 — Retraining

Input: a retraining trigger (scheduled, drift-triggered, or manual), refreshed data. Output: a new candidate model that has passed the same evaluation suite as the original and been promoted through the registry.

Failure mode: pipeline rot. The retraining pipeline was written during the original project, never maintained as a production service, and fails silently when triggered months later because a dependency has changed, a data source has moved, or the infrastructure configuration has drifted from the environment the pipeline was written for. The retraining pipeline must be treated as a production service — with tests, versioning, and on-call ownership — not as a script that worked once.

The retraining pipeline should be the same artefact as the training pipeline, not a parallel script. If retraining requires a separate code path, that path will diverge from the original, and the divergence will be discovered at the worst possible moment: when a production model needs to be replaced urgently. The retirement path (routing a model to end-of-life) is also managed at this stage via the registry's archived state.

Quality decision: Is the retraining pipeline tested on a schedule independently of whether a retraining trigger has fired — so that pipeline rot is detected before it matters?
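A scheduled smoke test can be as simple as running the pipeline on a small fixture and checking that it still returns what downstream stages expect. In the sketch below, `toy_pipeline` is a stand-in for the real pipeline, and the schema and output contract are hypothetical.

```python
from collections import Counter

REQUIRED_COLUMNS = {"feature", "label"}  # hypothetical input schema

def toy_pipeline(rows):
    """Stand-in for the real pipeline: validate, 'train', evaluate."""
    for row in rows:
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            raise KeyError(f"schema changed, missing columns: {sorted(missing)}")
    majority = Counter(r["label"] for r in rows).most_common(1)[0][0]
    accuracy = sum(r["label"] == majority for r in rows) / len(rows)
    return {"model": {"predict": majority}, "metrics": {"accuracy": accuracy}}

def smoke_test(pipeline, fixture_rows):
    """Run the pipeline end to end on a fixture; return (ok, message)."""
    try:
        result = pipeline(fixture_rows)
    except Exception as exc:  # any failure counts as pipeline rot
        return False, f"{type(exc).__name__}: {exc}"
    missing = {"model", "metrics"} - set(result)
    if missing:
        return False, f"output contract broken, missing: {sorted(missing)}"
    return True, "ok"

fixture = [{"feature": 0.1, "label": 1}, {"feature": 0.7, "label": 1},
           {"feature": 0.4, "label": 0}]
renamed = [{"feat": 0.1, "label": 1}]  # simulates an upstream schema change
```

Run from whatever scheduler the platform already provides, a failing result should page the pipeline's owner in the same way a failing production service would.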

The tracer below animates how a production signal propagates backwards through the lifecycle — demonstrating why the lifecycle is a cycle, not a line.

Feedback Loop Tracer

Monitoring detects covariate shift: the input distribution has drifted from training. The signal propagates backwards: Monitoring → Retraining → Training → Evaluation → Registry → Serving.

How the lifecycle shifts across deployment contexts

The eight-stage lifecycle is universal. What changes across the deployment-context spectrum — pure-cloud, on-premises, hybrid, air-gapped — is where each stage executes, who operates it, and what constraints apply. In a pure-cloud context, most pipeline infrastructure is managed; in an on-premises or air-gapped context, every runner, registry, and monitoring backend is self-hosted and self-maintained. The lifecycle itself does not change; the operational burden at each stage does.

Two stages are most visibly affected by deployment context. Data preparation splits along data-residency lines in hybrid and regulated environments — some features may only be computed on the on-premises side, creating a pipeline that spans an interconnect boundary. Monitoring is affected in air-gapped environments because telemetry cannot leave the perimeter, so every observability backend — metrics, logs, traces, drift detection — must run inside the perimeter.

  • Pure Cloud (operational burden: low). Managed pipelines, serverless training, hosted registries. The focus is on cost governance and egress control.
  • On-Premises (operational burden: high). Every component is self-hosted. The burden is often offset by data-residency or latency requirements.
  • Hybrid (operational burden: medium). Some features are computed on-premises, others in the cloud. The interconnect boundary is a seam to manage explicitly.
  • Air-Gapped (operational burden: very high). All telemetry stays inside the perimeter. Monitoring and drift detection must be entirely self-contained.

Four lifecycle anti-patterns

These four anti-patterns appear consistently in ML systems that fail in production. Recognising them early is cheaper than diagnosing them after a degradation incident.

1. The open loop

The model is deployed and the team moves on. There is no monitoring, no drift detection, and no retraining trigger. The model degrades silently until a business stakeholder notices that something has gone wrong — typically months after the model started failing. This is the most common lifecycle anti-pattern and the easiest to prevent: deploy monitoring at the same time as the model, not afterwards.

2. The frozen pipeline

Monitoring is deployed but the retraining pipeline has not been maintained. Drift alerts fire, the on-call engineer acknowledges them, and then discovers that the retraining pipeline fails for an unrelated reason — a broken dependency, a changed data schema, a rotated credential. The fix is continuous smoke-testing of the retraining pipeline on a schedule, independent of whether a drift signal has been received.

3. The unregistered deployment

A model is deployed outside the registry — directly to a serving endpoint, via a manual script, or by copying an artefact from a shared drive. The registry state and the serving state diverge. The next engineer to investigate a production issue cannot determine which model version is running or trace it back to a training run. This anti-pattern often originates from a well-intentioned hotfix that was never formalised.

4. The dual codepath

The training pipeline and the retraining pipeline are separate scripts that share no code. The transformation logic diverges between them over time. The model trained by the retraining pipeline produces different outputs than the model trained by the original pipeline on the same data — not because the model has been intentionally changed, but because the two codepaths have silently drifted apart. The fix is a single pipeline with a parameter that controls whether the run is an initial training run or a retraining run.
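The single-pipeline fix can be sketched as one entry point whose mode parameter changes only lineage metadata, never the transformation or fitting logic. The transform, the toy "training" step, and the field names below are hypothetical.

```python
from enum import Enum
from statistics import mean

class RunMode(Enum):
    INITIAL = "initial"
    RETRAIN = "retrain"

def transform(value: float) -> float:
    """Shared feature transformation (hypothetical: simple rescaling)."""
    return value / 100.0

def run_pipeline(raw_values, mode, parent_version=None):
    """One code path for both modes; `mode` affects lineage metadata only."""
    if mode is RunMode.RETRAIN and parent_version is None:
        raise ValueError("a retraining run must name the version it replaces")
    features = [transform(v) for v in raw_values]
    model = {"threshold": mean(features)}  # toy stand-in for model fitting
    return {
        "model": model,
        "mode": mode.value,
        "parent_version": parent_version if mode is RunMode.RETRAIN else None,
    }

data = [10.0, 20.0, 30.0]
first = run_pipeline(data, RunMode.INITIAL)
second = run_pipeline(data, RunMode.RETRAIN, parent_version=1)
```

Because both runs pass through the same `transform` and the same fitting step, identical data yields an identical model in either mode, which is the property the dual codepath loses.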

Practical Implication: Anti-patterns 1 and 2 are often discovered in the same incident: the model has been degrading (anti-pattern 1), the team deploys monitoring, drift alerts fire, and then the retraining pipeline fails (anti-pattern 2). The cost is a degraded model in production for weeks while two separate problems are debugged simultaneously.

What this series carries forward

The eight stages and their failure modes are the shared vocabulary for the rest of this series. Part 2 continues with the organisational patterns for owning the lifecycle — because the lifecycle's failure modes do not all arise from technical choices. Many arise from unclear ownership at the stage boundaries: who owns the monitoring-to-retraining handoff, who owns the registry-to-serving handoff, and what happens when a stage has no named owner.

Part 3 goes deep on training workloads on Kubernetes and Part 4 covers registry patterns and lifecycle state management in depth.

  1. Problem Framing: business objective → measurable ML problem + baseline
  2. Data Preparation: versioned, validated splits with shared transform artefact
  3. Training: tracked experiments; reproducible from tracker alone
  4. Evaluation: aggregate + slice + fairness; automated repeatable gate
  5. Registry: lifecycle states; promotion gate; GitOps trigger
  6. Serving: GitOps-declared; canary strategy; tested rollback path
  7. Monitoring: covariate, concept, label drift; named owner; trigger
  8. Retraining: same pipeline; smoke-tested on schedule; retirement path

References

  [1] Google Cloud Architecture Center. “MLOps: Continuous delivery and automation pipelines in machine learning.” Google Cloud Documentation, 2020 (updated 2024).
  [2] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, D. Dennison. “Hidden Technical Debt in Machine Learning Systems.” Advances in Neural Information Processing Systems 28 (NeurIPS), 2015.
  [3] E. Breck, S. Cai, E. Nielsen, M. Salib, D. Sculley. “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction.” IEEE Big Data, 2017.
  [4] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia. “A survey on concept drift adaptation.” ACM Computing Surveys, 46(4), Article 44, 2014. DOI: 10.1145/2523813.
