AI Platform Engineering & MLOps · Part XXII of 34

Golden paths for ML

Paved-road templates that survive contact with users — how platform teams define, template, and evolve the three canonical ML golden paths, and why the deprecation contract matters as much as the path itself.

Platform teams are in the business of removing decisions. Every time a data scientist has to figure out which container registry to push to, which experiment-tracking endpoint to configure, or which serving framework to use, they are spending cognitive budget on infrastructure rather than on the model. The golden path pattern, first articulated publicly by Spotify’s engineering team and formalised in Skelton and Pais’s team interaction model, addresses this directly: define a paved, opinionated workflow for the common cases, pre-wire the integrations, and let product teams walk the path without understanding what is under it.

Spotify’s 2020 engineering blog post “How We Use Golden Paths to Solve Fragmentation in Our Software Ecosystem” describes the paved-road metaphor precisely: a path is not a mandate. A team that needs to diverge can do so, but they leave the path and take on the maintenance burden of whatever they build instead. The CNCF TAG App Delivery Platforms Whitepaper (2023) formalises this at the industry level, describing the platform’s job as offering “a bundle often described as a golden path” accompanied by an initial project template and documentation. Both framings share a key discipline: a golden path is only golden if it is kept up to date. A stale path is worse than no path, because it channels teams into known-bad configurations.

This article defines the three canonical golden paths for an ML platform, the mechanism used to stamp them out as templates, the governance gates wired into each path, and — critically — the deprecation contract that keeps a path trustworthy over time.

What makes a path “golden”

Skelton and Pais’s Team Topologies (IT Revolution Press, 2019) frames the platform team as a stream-aligned team’s internal supplier. The platform team’s primary output is not running services — it is reducing the cognitive load on product teams. The paved road is the primary mechanism: an opinionated, tested, integrated path that a product team can follow without needing to understand the platform in depth.

Three properties distinguish a golden path from a mere tutorial:

Scaffolded, not described. A team starting the path runs one command (or clicks one button in an internal developer portal) and receives a working repository skeleton, pre-wired CI, and pre-configured integrations. Documentation exists, but the path does not require the team to read it before getting started.
Enforces a gate. At some point in the path — typically at promotion time or deployment time — an automated gate runs checks (eval scores, model card completeness, latency regressions, security scans). A path without a gate is a convenience; a path with a gate is a quality mechanism.
Versioned and deprecatable. Platform teams evolve their stack. A path is a contract with its consumers. Consumers deserve a defined notice window — typically measured in weeks to months — and a migration script when a path is deprecated. Without this contract, teams fear using the path at all.

The three canonical ML golden paths

Three paths address the bulk of ML workloads on a modern AI platform. Each is described by its trigger (what causes a team to walk this path), its key inputs and outputs, and the gate it enforces.

Batch inference pipeline. Scheduled predictions on a corpus.
Model serving: real-time inference. Request-time predictions for a live service.
GenAI feature with vector index. RAG-backed search or Q&A surface.

Path 1 — Batch inference pipeline

Trigger: an ML team has a trained model that produces predictions on a schedule — nightly fraud scores, weekly recommendations, monthly risk ratings — rather than in real time.

Input: a trained model artefact promoted to the model registry’s staging stage, and a data source reference (a feature store view, a data-lake partition, or a streaming-snapshot export).

Output: predictions written to an output store (object storage, a database table, or a downstream event stream), with row-count and schema assertions confirming the run succeeded.

Gate: an output-validation step inside the pipeline (schema check, row-count assertion, and a lightweight quality metric check) that blocks promotion of the batch job to the model registry’s production stage if any assertion fails. A GitOps controller (e.g. Argo CD or Flux) then detects the production-stage promotion and syncs the scheduled-job manifest to the cluster. Downstream systems see only production-stage outputs.
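The output-validation step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the platform's actual implementation: the function name `validate_batch_output`, the `score` column, and the null-rate quality metric are all hypothetical choices standing in for whatever checks a team configures.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    passed: bool
    failures: list


def validate_batch_output(rows, expected_schema, min_rows, max_null_rate,
                          score_column="score"):
    """Run schema, row-count, and a simple quality check on batch predictions.

    rows: list of dicts, one per prediction.
    expected_schema: dict of column name -> expected Python type.
    score_column: hypothetical name of the prediction column.
    """
    failures = []

    # Row-count assertion: an empty or truncated run must not be promoted.
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below minimum {min_rows}")

    # Schema check: stop at the first row with wrong columns or types.
    for i, row in enumerate(rows):
        if set(row) != set(expected_schema):
            failures.append(f"row {i}: columns {sorted(row)} do not match schema")
            break
        bad = [col for col, typ in expected_schema.items()
               if row[col] is not None and not isinstance(row[col], typ)]
        if bad:
            failures.append(f"row {i}: wrong type for {bad}")
            break

    # Lightweight quality metric (an assumption): share of null predictions.
    if rows:
        null_rate = sum(1 for r in rows if r.get(score_column) is None) / len(rows)
        if null_rate > max_null_rate:
            failures.append(
                f"null score rate {null_rate:.2%} exceeds {max_null_rate:.2%}")

    return ValidationResult(passed=not failures, failures=failures)
```

A pipeline step would call this on the run's output and refuse to request production-stage promotion unless `passed` is true, logging `failures` for the owning team.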

The pipeline definition lives in a scaffolded Git repository. The scaffold (produced by an internal developer portal template or a Backstage Software Template) pre-wires the experiment tracker, the model registry credential, the output-store path convention, and the CI pipeline that validates the pipeline definition itself before it runs in production.

Path 2 — Model serving: real-time inference

Trigger: an ML team has a model that must produce predictions at request time — fraud detection on a payment, ranking on a search query, content moderation on a submitted post.

Input: a model artefact in the registry, plus a serving manifest (an InferenceService definition for a serving runtime such as KServe, BentoML, or Seldon Core) authored by the ML engineer and committed to a deployment Git repository.

Output: a stable, versioned prediction endpoint consumed by application engineers. The endpoint URI does not change across model revisions — only the model revision behind it changes.

Gate: a CI gate on the deployment repository PR that runs three checks: (1) model-card completeness: the model card must document intended use, training data provenance, and known limitations; (2) eval-score threshold: the model’s offline evaluation score must exceed the team’s configured minimum; (3) latency regression test: shadow inference against a canary endpoint must show P95 latency within the configured tolerance of the current production model.
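As a rough sketch of how those three checks might combine into a single pass/fail result, consider the following. The section keys, the function names, and the 10% default tolerance are assumptions made for illustration; a real gate would read them from the team's configuration.

```python
# Hypothetical model-card keys for the three documented sections.
REQUIRED_CARD_SECTIONS = (
    "intended_use", "training_data_provenance", "known_limitations")


def missing_card_sections(card):
    """Return required model-card sections that are absent or blank."""
    return [s for s in REQUIRED_CARD_SECTIONS
            if not str(card.get(s) or "").strip()]


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def deployment_gate(card, eval_score, min_eval_score,
                    candidate_ms, production_ms, latency_tolerance=0.10):
    """Run the three checks; return (passed, reasons for failure)."""
    reasons = []

    missing = missing_card_sections(card)
    if missing:
        reasons.append("model card missing: " + ", ".join(missing))

    # Treating a score equal to the minimum as a pass is an assumption.
    if eval_score < min_eval_score:
        reasons.append(f"eval score {eval_score} below minimum {min_eval_score}")

    # Compare candidate P95 against production P95 plus the tolerance.
    cand, prod = p95(candidate_ms), p95(production_ms)
    if cand > prod * (1 + latency_tolerance):
        reasons.append(
            f"P95 latency {cand} ms exceeds {prod} ms "
            f"by more than {latency_tolerance:.0%}")

    return not reasons, reasons
```

Returning every failed reason, rather than stopping at the first, lets the PR author fix all problems in one iteration.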

After the gate passes and the PR is merged, the GitOps controller syncs the InferenceService manifest. Traffic is initially split — for example, 5% to the new revision, 95% to the previous. A progressive-delivery controller (e.g. Argo Rollouts or Flagger) watches prediction latency, error rate, and prediction-quality metrics. If metrics stay within bounds across a configurable observation window, traffic advances to 100% for the new revision. If metrics breach bounds, the rollout is automatically aborted and the previous revision retakes full traffic. The Argo Rollouts project documents the AnalysisRun and Rollout resource types that implement this pattern.
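The advance-or-abort logic can be illustrated with a small simulation. This is not the Argo Rollouts or Flagger API; it is a hypothetical sketch of the decision the controller makes at each observation window, using latency and error rate as the only signals.

```python
def run_canary(steps, observations, max_p95_ms, max_error_rate):
    """Simulate a progressive rollout.

    steps: canary traffic percentages held in order, e.g. [5, 25, 50].
    observations: one dict per step with 'p95_ms' and 'error_rate',
                  measured while the canary held that step's weight.
    Returns (final canary weight, history of (weight, verdict)).
    """
    if len(steps) != len(observations):
        raise ValueError("need exactly one observation window per step")

    history = []
    for weight, obs in zip(steps, observations):
        healthy = (obs["p95_ms"] <= max_p95_ms
                   and obs["error_rate"] <= max_error_rate)
        history.append((weight, "ok" if healthy else "breach"))
        if not healthy:
            # Abort: the previous revision retakes full traffic.
            return 0, history

    # Every window stayed within bounds: promote the new revision fully.
    return 100, history
```

For example, a breach at the second step returns a final weight of 0, with a history showing that the 5% window passed and the 25% window failed.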

Path 3 — GenAI feature with a vector index

Trigger: an application team wants to add retrieval-augmented generation (RAG) to a product: a search surface, a Q&A interface, a document assistant.

Input: a document corpus with a defined data access credential (scoped read-only), and a choice of embedding endpoint from the platform’s model catalogue.

Output: a running RAG feature backed by a scheduled indexing pipeline and a vector store query endpoint (e.g. pgvector, Qdrant, Weaviate, or Milvus). The LLM inference endpoint is provided by the platform — either self-hosted or a proxied external API — so the application team does not manage model serving directly.

Gate: an offline evaluation harness that runs on the indexing pipeline’s output — measuring retrieval recall on a ground-truth question-answer set — and an online evaluation surface (explicit feedback signals captured in the application layer). The offline recall gate blocks promotion to production if recall falls below threshold; the online gate feeds a monitoring dashboard rather than blocking deployment, since production traffic is the only source of real query distribution.
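The offline recall gate reduces to a short computation over the ground-truth set. The sketch below assumes recall@k as the retrieval metric and uses hypothetical names throughout (`recall_gate`, question and document IDs as plain strings); the article itself does not prescribe a specific recall definition.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant document IDs found in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each question needs at least one relevant document")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(set(relevant))


def mean_recall_at_k(results, ground_truth, k):
    """Average recall@k over every question in the ground-truth set.

    results: dict question -> ranked list of retrieved document IDs.
    ground_truth: dict question -> list of relevant document IDs.
    """
    scores = [recall_at_k(results.get(question, []), docs, k)
              for question, docs in ground_truth.items()]
    return sum(scores) / len(scores)


def recall_gate(results, ground_truth, k, threshold):
    """Return (passed, measured recall); promotion is blocked when not passed."""
    recall = mean_recall_at_k(results, ground_truth, k)
    return recall >= threshold, recall
```

A question with no retrieved results scores zero rather than being skipped, so an indexing run that silently drops part of the corpus lowers the measured recall instead of hiding the gap.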

The templating mechanism

A golden path is not a document; it is an executable template. The CNCF Platforms Whitepaper describes this as offering an “initial project template and documentation, a bundle often described as a golden path.” The Backstage Software Templates specification (API version scaffolder.backstage.io/v1beta3) is one widely adopted mechanism: a YAML Template document with a spec.parameters section (the inputs the user provides: project name, team, data source reference) and a spec.steps section (the actions the scaffolder runs: fetch a skeleton, render files from a template, publish a repository, register the new component in the catalog).

scaffolder-template-batch-model.yaml

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: batch-model-pipeline
  title: Batch Model Pipeline
  description: Golden path for scheduling batch inference jobs
spec:
  owner: platform-team
  type: ml-pipeline

  parameters:
    - title: Project details
      required: [modelName, teamSlug, outputStorePrefix]
      properties:
        modelName:
          type: string
          description: Name of the model (must match registry slug)
        teamSlug:
          type: string
          description: Your team identifier for RBAC and labelling
        outputStorePrefix:
          type: string
          description: Object-store prefix for batch output (e.g. s3://data/predictions/)

  steps:
    - id: fetch-skeleton
      name: Fetch pipeline skeleton
      action: fetch:template
      input:
        url: ./skeleton
        values:
          modelName: ${{ parameters.modelName }}
          teamSlug: ${{ parameters.teamSlug }}
          outputStorePrefix: ${{ parameters.outputStorePrefix }}

    - id: publish
      name: Create Git repository
      action: publish:github
      input:
        repoUrl: github.com?owner=${{ parameters.teamSlug }}&repo=${{ parameters.modelName }}-pipeline

    - id: register
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}

For teams not running an internal developer portal with a scaffolding engine, a comparable result on the deployment side is achievable with Argo CD ApplicationSets using the Cluster generator: a single ApplicationSet template is parameterised from registered cluster Secrets, stamping out one Application per cluster (or per environment) without manual duplication. The Argo CD documentation describes the Cluster generator as the way to generate Applications from the clusters registered with Argo CD. Kustomize base-plus-overlay provides the per-environment patch layer in both cases: a base directory holds the canonical manifest, and an overlay directory for each environment (dev, staging, production) holds only the values that differ.

applicationset-batch-model.yaml

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: batch-model-pipeline
  namespace: argocd
spec:
  generators:
    # One Application per cluster registered with Argo CD.
    # Each cluster Secret must carry an `env` label (dev, staging, production),
    # because the source path below is derived from it.
    - clusters: {}
  template:
    metadata:
      name: '{{name}}-batch-model'
    spec:
      project: ml-workloads
      source:
        repoURL: https://git.example.com/platform/batch-model-base
        targetRevision: HEAD
        path: 'overlays/{{metadata.labels.env}}'
      destination:
        server: '{{server}}'
        namespace: ml-inference
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

The choice of mechanism (portal scaffold, ApplicationSet, or Kustomize overlay) depends on what the platform already operates. The key discipline is the same in all cases: the template is the canonical source of truth for the path. A team that modifies the generated skeleton is drifting from the path; the platform team’s tooling detects that drift via GitOps sync-status checks and surfaces it to both teams.

Governance gates wired into the path

A golden path is valuable because it enforces good defaults automatically. The gates that matter most for ML workloads sit at three points:

1. Registry promotion gate. Before a model artefact moves from staging to production in the model registry, it must pass automated checks: minimum eval score, model card completeness, and (for regulated industries) an explicit reviewer signoff. The model registry's webhook or event integration triggers the CI gate; the gate's pass/fail result is written back to the registry as a metadata annotation. This makes the gate auditable: any downstream system can query whether a given model version passed all gates.
2. Deployment PR gate. When an ML engineer opens a PR against the deployment repository, the CI pipeline runs the model-card check, eval-score threshold, and latency regression test (Path 2) or recall-on-ground-truth check (Path 3). This gate runs in the CI system, not in the cluster, so it fails fast, before the GitOps controller ever sees the manifest.
3. Runtime rollout gate. After deployment, the progressive-delivery controller observes live metrics. For serving models (Path 2), this means request latency and error rate from the serving layer's metrics endpoint, plus any model-quality signal the application emits. For batch models (Path 1), this means the output-validation step in the pipeline itself. The rollout gate is the safety net for the cases the CI gate did not catch: distribution shift detected only under real traffic, or latency regression that appears only at production request volumes.

Gate	Where it runs	What it checks	Applies to
Registry promotion	CI (registry webhook)	Eval score, model card, reviewer signoff	All paths
Deployment PR	CI (PR pipeline)	Model card, eval threshold, latency regression / recall	Path 2, Path 3
Runtime rollout	Cluster (progressive delivery)	Latency, error rate, prediction quality, output validation	All paths
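The auditability claim for the registry promotion gate can be made concrete with a small sketch. It assumes gate results are stored as annotations under a hypothetical `gate/<name>` key on each model version; a plain dictionary stands in for the registry, and no specific registry API is implied.

```python
from datetime import datetime, timezone

# Hypothetical gate names matching the registry promotion checks.
REQUIRED_GATES = ("eval_score", "model_card", "reviewer_signoff")


def record_gate_result(annotations, gate, passed, detail=""):
    """Return a copy of a model version's annotations with one result added."""
    updated = dict(annotations)
    updated[f"gate/{gate}"] = {
        "passed": passed,
        "detail": detail,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }
    return updated


def passed_all_gates(annotations, required=REQUIRED_GATES):
    """True only if every required gate has a recorded passing result."""
    return all(
        annotations.get(f"gate/{gate}", {}).get("passed") is True
        for gate in required
    )
```

A missing annotation counts as a failure, so a model version that skipped a gate is indistinguishable from one that failed it, which is the conservative default for an audit query.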

The deprecation contract

A golden path that cannot be deprecated safely becomes technical debt. Platform teams that skip the deprecation contract find themselves maintaining old path versions indefinitely — because consumers are stuck on them, because no migration tooling was provided, because the notice window was too short. The pattern for a trustworthy deprecation contract has four steps:

Announce with a defined notice window. Consumers of the path get a notice period — typically measured in weeks to months — before the old path is removed. No standard mandates a specific number; the appropriate window depends on the consumer's release cadence and the complexity of the migration.
Provide a migration script or automated PR. The platform team does not announce a deprecation and leave consumers to figure out the migration themselves. Platform tooling opens automated PRs against consumer repositories: replacing old template references with the new version, updating dependency pins, adjusting CI configuration. Backstage's scaffolder offers a building block for this in its publish:github:pull-request action, which opens a pull request against an existing repository; fanning that out across every consumer repository is orchestration the platform team still has to supply.
Track adoption. The platform team maintains an inventory of which repositories are on which path version — sourced from the IDP catalog or from Git metadata. Deprecation is not complete until every consumer has migrated or has been deliberately granted an extension.
Remove on schedule. The old path version is removed at the end of the notice window. Exceptions are tracked explicitly and have an expiry date. An exception that has no expiry date is a permanent fork — the condition that the deprecation contract exists to prevent.
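The tracking and removal steps above can be sketched as a report over the path-version inventory. The data shapes here are assumptions: `inventory` maps each repository to the path version it uses, and `exceptions` maps a repository to its exception expiry date, with `None` modelling the exception with no expiry that the contract exists to prevent.

```python
from datetime import date


def deprecation_report(inventory, deprecated_version, removal_date,
                       exceptions, today):
    """Classify every consumer of a path by its migration state."""
    report = {"migrated": [], "pending": [], "excepted": [], "blocking": []}
    for repo, version in sorted(inventory.items()):
        if version != deprecated_version:
            # Assumes any other version is an acceptable target.
            report["migrated"].append(repo)
        elif repo in exceptions:
            expiry = exceptions[repo]
            # No expiry date, or a lapsed one, means the exception
            # no longer counts and the repository blocks removal.
            valid = expiry is not None and expiry >= today
            report["excepted" if valid else "blocking"].append(repo)
        elif today < removal_date:
            report["pending"].append(repo)
        else:
            report["blocking"].append(repo)
    return report


def deprecation_complete(report):
    """Complete once every consumer has migrated or holds a valid exception."""
    return not report["pending"] and not report["blocking"]
```

Running the report on a schedule gives the platform team a list of repositories to chase before the removal date, and a single boolean that says whether the old version can be removed.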

The deprecation contract is also the primary argument for investing in golden paths at all. A team that is not confident the platform will maintain its paths will build their own infrastructure, defeating the consolidation goal. Trust in the path’s stability is a prerequisite for adoption.

The off-ramp and when to take it

Golden paths address the majority of workloads, not all of them. Teams encounter off-ramps when their requirements exceed the path’s design envelope:

Path 1 off-ramps include multi-node distributed training jobs, non-standard output destinations, and pipeline dependencies on systems the platform does not yet integrate with.
Path 2 off-ramps include streaming inference (event-triggered prediction), multi-model ensembles, and custom pre/post-processing pipelines that do not fit the serving runtime's transformer abstraction.
Path 3 off-ramps include hybrid search (keyword plus semantic), custom re-ranking pipelines, and multi-turn agent loops with tool use — which extend beyond simple RAG into agentic infrastructure.

The discipline at the off-ramp matters more than the path itself. When a team hits an off-ramp, the platform team has three options: extend the path (add the capability to the template), document the divergence pattern (add it to an extension catalogue), or accept the team building independently (with explicit acknowledgement that they own the maintenance). Which option applies depends on how many teams share the need. A one-team requirement is a candidate for independent build; a requirement shared by three or more teams is a candidate for path extension.

1 team needs it: accept independent build. Explicit acknowledgement of maintenance ownership.
2 teams need it: document divergence pattern. Add to extension catalogue; watch for a third team.
3+ teams need it: extend the path. Add the capability to the template itself.
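The decision rule above is simple enough to encode directly, which can be useful when off-ramp requests are tracked in a ticketing system or the portal catalog. The function name and return strings are illustrative only.

```python
def off_ramp_decision(teams_needing_capability):
    """Map the number of teams sharing a need to the platform's response."""
    if teams_needing_capability < 1:
        raise ValueError("at least one team must need the capability")
    if teams_needing_capability == 1:
        return "accept independent build"
    if teams_needing_capability == 2:
        return "document divergence pattern"
    return "extend the path"
```

The thresholds are the ones stated in the text; a platform team with many more consumers might reasonably raise them.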

Connecting the three paths to the broader platform

The three paths are built on top of platform capabilities described elsewhere in this series. The toolchain that makes path scaffolding possible — experiment trackers, model registries, serving runtimes, vector stores — is covered in the composable AI toolchain article. The GitOps machinery that makes the deployment step in Paths 1 and 2 work — the controller, the manifest conventions, the sync policies — is covered in the CI/CD and GitOps article. A golden path is not a platform feature in isolation — it is the orchestrated composition of several platform capabilities into an end-to-end workflow a product team can actually use.

The discoverability of the paths is equally important. A golden path that is not surfaced in the internal developer portal is a golden path that most teams will not find. The IDP catalog — whether Backstage-based or another portal — should surface the available templates, the version each team is on, and the status of any active deprecations. Discoverability is not a UX concern; it is a platform adoption concern.

References

[1] Spotify Engineering. “How We Use Golden Paths to Solve Fragmentation in Our Software Ecosystem.” 2020. engineering.atspotify.com
[2] Skelton, M. & Pais, M. Team Topologies: Organizing Business and Technology Teams for Fast Flow. IT Revolution Press, 2019. teamtopologies.com
[3] CNCF TAG App Delivery. “Platforms Whitepaper.” 2023. tag-app-delivery.cncf.io
[4] Backstage.io. “Writing Software Templates” (scaffolder.backstage.io/v1beta3). Backstage documentation. backstage.io/docs
[5] Argo CD project. “ApplicationSet Cluster Generator.” Argo CD documentation. argo-cd.readthedocs.io
