AI Platform Engineering & MLOps · Part II of 34

What an AI Platform Team Actually Owns

And what it does not. A working definition, the three-layer ownership stack, the accountability boundary, and how the team hands off to the MLOps engineer.

Ask five engineers what an AI Platform team does and you will get five different job descriptions. Some say it is a DevOps team that also manages GPUs. Others say it is the team that deploys Kubeflow. A few say it is whoever fields the data scientists' infrastructure tickets. None of those are wrong, but none are precise enough to tell you what the team should refuse to do — and that boundary is the whole point.

This article builds a working definition, draws the accountability boundary against the central platform team, the MLOps function, and the data org, and shows the three-layer ownership stack the team maintains. The previous article in this series established what MLOps is; this article establishes who owns the platform that MLOps practice runs on. The next article maps the six roles and skill sets across the full team.

A working definition

An AI Platform team is the internal team that makes it cheap and safe for other teams to ship machine-learning systems on the organisation's infrastructure. It is a platform team in the Team Topologies sense: its product is internal, its customers are other engineers, and its primary interaction mode with those internal customers is X-as-a-Service — publishing consumable APIs, templates, and managed services that product teams consume without deep collaboration on every use case.

Skelton and Pais define four fundamental team types in Team Topologies (IT Revolution Press, 2019): stream-aligned, enabling, complicated-subsystem, and platform. Platform teams create the internal services that reduce cognitive load on stream-aligned teams — those building directly for end users. The AI Platform team is a platform team whose product domain is ML primitives: GPU scheduling, model registries, serving runtimes, evaluation harnesses, and training pipeline templates, built on top of whatever the central platform team provides as the compute and networking substrate.

A platform team is explicitly not a shared services team that builds things on request. The distinction matters: a shared services model creates a bottleneck and reduces the platform to consultancy-on-payroll. A platform team publishes self-service capabilities and measures adoption. Product teams choose the paved road because it is faster and safer than building the infrastructure themselves — not because leadership mandates it.

The platform-as-a-product model

The CNCF Platforms Working Group's 2023 whitepaper, Platforms for Cloud Native Computing, identifies treating the platform as a product as the number-one quality for a successful internal platform. The same group's Platform Engineering Maturity Model (November 2023) operationalises this: platform engineering is the practice of offering an internal capability as a product through investment in people, processes, policy, and technology. Applied to an AI platform, this means:

The team publishes a paved road — defined, opinionated ways to train, evaluate, register, and serve a model — and invests in those paths rather than fielding one-off requests.
The team maintains an internal roadmap, names design partners, and measures adoption of the paved road as a leading indicator. Pain points become backlog items.
The team explicitly accepts that some teams will deviate from the paved road. The cost of deviation is paid by the deviating team. The platform team studies deviations to inform the next roadmap iteration.
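One way to make "adoption as a leading indicator" concrete is to count the share of production ML services that deploy through the paved road. The following is a minimal sketch, assuming a hypothetical service inventory in which each entry records the deployment path it used; the `Service` fields and the `"paved-road"` / `"custom"` labels are illustrative and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    team: str
    deploy_path: str  # hypothetical labels: "paved-road" or "custom"


def adoption_rate(services: list[Service]) -> float:
    """Fraction of services that deploy through the paved road."""
    if not services:
        return 0.0
    paved = sum(1 for s in services if s.deploy_path == "paved-road")
    return paved / len(services)


def deviating_teams(services: list[Service]) -> set[str]:
    """Teams with at least one off-road service: candidates for a roadmap interview."""
    return {s.team for s in services if s.deploy_path != "paved-road"}


# Illustrative inventory; the names are made up.
inventory = [
    Service("ranker", "search", "paved-road"),
    Service("fraud-scorer", "payments", "paved-road"),
    Service("summariser", "support", "custom"),
    Service("embedder", "search", "paved-road"),
]

print(adoption_rate(inventory))    # share of services on the paved road
print(deviating_teams(inventory))  # teams worth talking to before the next roadmap
```

The second function reflects the point about deviations: the list of deviating teams is an input to the roadmap, not a compliance report.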

The DORA Accelerate State of DevOps Report 2024 quantified the impact of internal developer platforms: teams using a well-designed IDP reported +8% individual productivity, +10% team performance, and +6% organisational software delivery performance. The same report documented a counter-intuitive cost: an 8% decrease in throughput and a 14% decrease in change stability during the adoption curve. This is not a reason to avoid building a platform. It is a signal that platforms behave like any organisational transformation: they create drag before they compound returns, and the compounding only arrives if the team runs product discipline throughout.

The three-layer ownership stack

An AI Platform team's responsibilities organise into three layers. Understanding the layers clarifies which work belongs to the platform team, which belongs to consuming teams, and which belongs to the central platform team below.

┌─────────────────────────────────────────────────────┐
│          Layer 3 — Application Support              │
│  Eval harness · rollout machinery · LLMOps tooling  │
│  Observability for model quality · cost attribution  │
├─────────────────────────────────────────────────────┤
│          Layer 2 — ML Primitives                    │
│  Model registry · serving runtimes · training       │
│  pipeline templates · feature store boundary        │
├─────────────────────────────────────────────────────┤
│          Layer 1 — Substrate                        │
│  GPU scheduling · multi-tenant namespacing          │
│  GPU Operator · quota enforcement · storage         │
├─────────────────────────────────────────────────────┤
│     Central platform team (consumed, not owned)     │
│  Kubernetes control plane · networking · identity   │
│  supply chain · base observability backplane        │
└─────────────────────────────────────────────────────┘

Layer 1 — Substrate

The substrate layer is the AI Platform team's lowest ownership boundary. It covers GPU node pools and GPU Operator configuration, multi-tenant namespace and quota enforcement, gang-scheduled batch admission (via schedulers such as Kueue or Volcano), storage backends for model artifacts and datasets, and the AI gateway that fronts all outbound LLM API traffic. The substrate is the layer that translates raw cluster capacity — provided by the central platform team — into a safe, multi-tenant compute surface that ML workloads can use without contending for resources.

The substrate does NOT include the Kubernetes control plane, the base networking stack, the certificate authority, or the identity provider. Those belong to the central platform team. The AI Platform team consumes them. Small organisations often merge the two roles; even then, the accountabilities stay separate so they can be split when the organisation scales.
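The combination of quota enforcement and gang-scheduled admission described for the substrate layer can be illustrated with a small sketch. It is not the Kueue or Volcano API; `TenantQuota`, `Job`, and `try_admit` are hypothetical names, and the model assumes a single pool of interchangeable GPUs. The point it shows is the all-or-nothing rule: a distributed job is admitted only if every worker fits, otherwise none of it starts.

```python
from dataclasses import dataclass


@dataclass
class TenantQuota:
    gpu_limit: int       # GPUs this tenant may hold at once
    gpu_in_use: int = 0  # GPUs currently reserved by admitted jobs


@dataclass(frozen=True)
class Job:
    tenant: str
    workers: int          # workers that must all start together
    gpus_per_worker: int


def try_admit(job: Job, quotas: dict[str, TenantQuota], free_gpus: int) -> bool:
    """Admit the whole job or nothing (gang semantics).

    Returns True and reserves GPUs against the tenant's quota only if every
    worker fits both the quota and the cluster's free capacity. The caller is
    responsible for tracking free_gpus across calls.
    """
    needed = job.workers * job.gpus_per_worker
    quota = quotas[job.tenant]
    if quota.gpu_in_use + needed > quota.gpu_limit:
        return False  # would exceed the tenant's quota
    if needed > free_gpus:
        return False  # a partial start would strand GPUs, so admit nothing
    quota.gpu_in_use += needed
    return True


quotas = {"search": TenantQuota(gpu_limit=8)}
print(try_admit(Job("search", workers=4, gpus_per_worker=2), quotas, free_gpus=16))
print(try_admit(Job("search", workers=1, gpus_per_worker=1), quotas, free_gpus=8))
```

In this run the first job takes the tenant's full quota, so the second is refused even though the cluster still has free GPUs.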

Layer 2 — ML Primitives

ML primitives are the shared services the platform team deploys once so no individual product team has to build them from scratch. The canonical set includes: a model registry (versioned artifact storage with stage gates — development, staging, production — and promotion logic tied to evaluation results), serving runtimes (deployed as managed services; illustrative options include KServe, vLLM, Triton, and BentoML, chosen and packaged by the platform team rather than each product team), and reusable training pipeline templates (e.g., a Kubeflow or Argo Workflows template a product team forks rather than authors from scratch). For organisations running LLM workloads, the ML primitives layer also includes a prompt registry and alias-resolution mechanism.

The model registry deserves emphasis because it is the most common casualty of an under-resourced AI Platform team. It gets stood up once and then abandoned — nobody owns the lifecycle policy, stage promotions stop being enforced, and consumers stop trusting it. Product teams then fork their own registries and the shared primitive collapses. The registry must be treated as a live product with an owner and a lifecycle, not a one-time deployment artefact.
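To make "stage gates with promotion logic tied to evaluation results" concrete, here is a minimal in-memory sketch. It assumes the three stages named above and a single boolean evaluation outcome per version; `ModelVersion`, `promote`, and `PromotionError` are hypothetical names, not the interface of any specific registry product.

```python
from dataclasses import dataclass, field

STAGES = ["development", "staging", "production"]


class PromotionError(Exception):
    """Raised when a promotion would violate the stage gate."""


@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "development"
    eval_passed: bool = False  # set by the evaluation harness, not by hand
    history: list[str] = field(default_factory=list)


def promote(mv: ModelVersion) -> ModelVersion:
    """Move a version exactly one stage forward, enforcing the gate."""
    idx = STAGES.index(mv.stage)
    if idx == len(STAGES) - 1:
        raise PromotionError(f"{mv.name} v{mv.version} is already in production")
    if not mv.eval_passed:
        raise PromotionError(f"{mv.name} v{mv.version} has no passing evaluation")
    mv.history.append(mv.stage)  # keep an audit trail of stages passed through
    mv.stage = STAGES[idx + 1]
    return mv


candidate = ModelVersion("ranker", version=3)
candidate.eval_passed = True
promote(candidate)
print(candidate.stage, candidate.history)
```

The lifecycle-ownership problem described above shows up here as a question of who is allowed to set `eval_passed` and who keeps `promote` as the only path between stages.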

Layer 3 — Application Support

The application support layer is the part of the stack that the Sculley et al. Hidden Technical Debt in Machine Learning Systems (NeurIPS 2015) paper documented so precisely: the model code is a small fraction of a production ML system. The surrounding infrastructure — configuration management, data validation, serving pipelines, monitoring, and rollout control — constitutes the bulk of the operational surface. The AI Platform team owns the shared portions of that surface:

The evaluation harness and rollout gate: the shared mechanism that decides whether a new model version clears automated quality checks before receiving production traffic.
ML-layer observability: drift detection dashboards, prediction quality signals, GPU utilisation tracking, and — for LLM workloads — per-request token cost attribution and hallucination sampling infrastructure.
LLMOps tooling: guardrail enforcement on the AI gateway, the workflows built around the prompt registry, and the token-aware autoscaling configuration for LLM serving workloads.

The platform team provides these capabilities as shared infrastructure. It does not operate the product team's specific model, write the product team's evaluation rubric, or decide what constitutes an acceptable accuracy threshold for a particular use case. Those are product team decisions.
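The split between a platform-owned gate and product-owned thresholds can be sketched as a function that enforces whatever minimums it is handed. This is an illustration under stated assumptions: metrics are "higher is better" floats, and `evaluate_gate` and `GateResult` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failures: list[str]


def evaluate_gate(metrics: dict[str, float], thresholds: dict[str, float]) -> GateResult:
    """Platform-owned mechanism: compare measured metrics to minimum thresholds.

    The thresholds are supplied by the product team; the platform only
    enforces them. A required metric that was not reported counts as a failure.
    """
    failures = []
    for metric, minimum in thresholds.items():
        value = metrics.get(metric)
        if value is None:
            failures.append(f"{metric}: not reported")
        elif value < minimum:
            failures.append(f"{metric}: {value} < {minimum}")
    return GateResult(passed=not failures, failures=failures)


# Thresholds come from the product team's rubric; the numbers are made up.
product_thresholds = {"accuracy": 0.90, "recall": 0.75}
result = evaluate_gate({"accuracy": 0.91, "recall": 0.70}, product_thresholds)
print(result.passed, result.failures)
```

Nothing in the function knows what an acceptable accuracy is. That knowledge lives entirely in the dictionary the product team supplies.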

What the team owns and what it does not

The accountability boundary is more important than the responsibilities list. The most common failure mode for an AI Platform team is accepting work that belongs elsewhere until the team is too thin to maintain the primitives it already owns.

Three things the team owns

The GPU and compute substrate. Scheduling policy, quota enforcement, operator lifecycle, and the runtime environment for all ML workloads in the organisation. Without a named owner, GPU resources are allocated by whoever submits tickets first, utilisation collapses, and training jobs block serving workloads.
The shared ML primitives. The model registry, the blessed serving runtimes, and the reusable training templates are the platform team's core product. If they are not owned by a named team, each product team builds its own version and the organisation accrues duplicated, inconsistent, unauditable tooling.
The evaluation and rollout machinery. The gate that promotes a model from staging to production, and the traffic-shifting mechanism that gets it there safely, must be owned by one team with a single quality bar. Without a shared rollout mechanism, each product team implements its own cutover logic — some with canaries, some without, some with automated rollback, some without.
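A shared rollout mechanism with a canary and automated rollback might look like the following sketch. It assumes traffic is shifted in fixed percentage steps and that a single error-rate signal decides whether to continue; `run_canary` and its parameters are hypothetical, and a real implementation would also wait for a soak period at each step.

```python
from typing import Callable


def run_canary(
    steps: list[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> tuple[str, int]:
    """Shift traffic through `steps` (percent routed to the new version).

    After each step, read the observed error rate. If it exceeds the limit,
    roll back to 0% and stop. Returns (outcome, final_percent).
    """
    if not steps:
        raise ValueError("at least one traffic step is required")
    for percent in steps:
        observed = error_rate_at(percent)
        if observed > max_error_rate:
            return ("rolled_back", 0)
    return ("promoted", steps[-1])


# Stand-in for a metrics query: the new version degrades once it sees real load.
def fake_error_rate(percent: int) -> float:
    return 0.05 if percent >= 25 else 0.01


print(run_canary([5, 25, 100], fake_error_rate, max_error_rate=0.02))
```

Owning this in one place means every product team gets the same rollback behaviour, rather than some cutovers having it and some not.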

Three things the team does not own

The models themselves. Architecture choice, training data curation, hyperparameter decisions, and fine-tuning strategy are all product team calls. The platform team provides the compute, the templates, and the registry; it does not choose the model. Crossing this boundary converts the platform team into an ML consultancy, which prevents it from maintaining shared infrastructure at scale.
The business outcome. Whether the model is improving the right product metric is the product team's accountability. The platform team's KPIs are infrastructure reliability, paved-road adoption rate, and cycle time from experiment to production, not whether the recommendation model lifts conversion.
The base cluster. The Kubernetes control plane, the base networking fabric, the certificate authority, and the identity provider belong to the central platform team. The AI Platform team consumes these as a foundation. Conflating the two creates an accountability void: when the base cluster has an outage, nobody knows whether it is an AI Platform issue or a central platform issue.

Org-chart positioning

Where the AI Platform team sits in the organisational structure shapes what it gets to do. Three patterns are common:

Inside central platform engineering. The AI Platform team is a sub-team of the broader platform org, peers with the SRE team and the developer-experience team. This maximises shared primitives (one Kubernetes story, one identity story) but risks ML-specific concerns being under-prioritised against general platform work.
Inside the data or ML organisation. The team reports through the head of ML or data, creating tight feedback loops with data scientists. The risk is drift away from the rest of platform engineering — the team reinvents networking and identity in subtly incompatible ways.
A standalone enabling team. Reports to engineering leadership, bridging platform engineering and the ML organisation. This works best when both the central platform team and the ML org are large enough to produce the coordination overhead that a bridge team resolves.

The signal that the placement is wrong is two parallel paved roads: one from the central platform team for applications, one from the AI Platform team for ML workloads, with no shared primitives between them. When identity, networking, observability, and supply-chain security are running as two separate stacks side by side, the placement needs to change.

How the team interfaces with its neighbours

Team Topologies describes three interaction modes: X-as-a-Service (self-service consumption of a platform capability), Collaboration (time-boxed joint work to discover a new pattern), and Facilitation (an enabling team helping a stream-aligned team build a new skill). A mature AI Platform team uses all three — predominantly X-as-a-Service, shifting temporarily into Collaboration when onboarding a new use case, and into Facilitation via office hours for teams building their first ML workload.

With the central platform team, the AI Platform team is a downstream consumer: it takes the Kubernetes substrate, the identity story, the supply-chain attestation, and the base observability backplane as given, then extends them with ML-specific primitives. Joint roadmap reviews keep the extensions consistent with the base platform's direction.

With data engineering, the boundary is usually the feature store or the data warehouse. Both teams co-own the data-quality contract for ML workloads: drift detection sits on the AI Platform side, but pipeline reliability and schema governance sit with the data engineering team.
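As an illustration of the drift-detection half of that contract, the sketch below computes the Population Stability Index (PSI) between a reference sample and a current sample of one feature. PSI is one common drift statistic, chosen here as an assumption; the article does not prescribe a method. The bin edges and the epsilon used for empty bins are also illustrative choices.

```python
import math


def psi(reference: list[float], current: list[float], edges: list[float]) -> float:
    """Population Stability Index between two samples over shared bin edges.

    PSI = sum over bins of (cur_share - ref_share) * ln(cur_share / ref_share).
    It is 0 when the binned distributions match and grows as they diverge.
    """
    def shares(values: list[float]) -> list[float]:
        if not values:
            raise ValueError("sample must not be empty")
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below v.
            counts[sum(1 for e in edges if v >= e)] += 1
        eps = 1e-6  # floor for empty bins, avoids log(0)
        return [max(c / len(values), eps) for c in counts]

    ref_p = shares(reference)
    cur_p = shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


training_sample = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
serving_sample = [0.6, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9]
print(psi(training_sample, training_sample, edges=[0.5]))  # identical samples
print(psi(training_sample, serving_sample, edges=[0.5]))   # shifted sample
```

The platform side would run a check like this on a schedule and surface the number. Deciding why the upstream data shifted, and fixing the pipeline, stays with data engineering.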

With product teams, the AI Platform team is the provider. The clearest tell that this boundary has a leak is product teams calling the model's underlying serving runtime directly, bypassing the platform-managed endpoint. When that happens, the platform loses its ability to enforce rollout controls, rate limits, and cost attribution — and the product team carries maintenance burden for a capability that belongs to the platform.
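Cost attribution is a good example of what is lost when traffic bypasses the platform-managed endpoint: the computation is simple, but it only works if every request passes through one place that records it. A minimal sketch, assuming the gateway logs one record per request; the `RequestRecord` fields, the model names, and the prices are all hypothetical and do not reflect any vendor's pricing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    team: str
    model: str
    input_tokens: int
    output_tokens: int


# Hypothetical prices in USD per 1,000 tokens: (input, output).
PRICES = {"model-a": (0.50, 1.50), "model-b": (0.10, 0.30)}


def attribute_costs(records: list[RequestRecord]) -> dict[str, float]:
    """Sum per-request token cost by team."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        in_price, out_price = PRICES[r.model]
        cost = (r.input_tokens * in_price + r.output_tokens * out_price) / 1000
        totals[r.team] += cost
    return dict(totals)


gateway_log = [
    RequestRecord("search", "model-a", input_tokens=1000, output_tokens=2000),
    RequestRecord("search", "model-b", input_tokens=1000, output_tokens=0),
    RequestRecord("support", "model-b", input_tokens=2000, output_tokens=1000),
]
print(attribute_costs(gateway_log))
```

A request that reaches the serving runtime directly never appears in `gateway_log`, so its cost ends up unattributed.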

Minimum disciplines on the team

DORA's platform engineering capability research (2025) found that 76% of organisations now have a dedicated platform team, and that the platform capability most correlated with a positive developer experience is giving clear, timely feedback on the outcome of tasks — not the breadth of services offered. A small, focused team that closes the feedback loop reliably outperforms a large team with a sprawling catalogue that product teams cannot predict.

The minimum set of disciplines needed to cover the three ownership layers without critical gaps:

Platform and infrastructure depth: Kubernetes, GPU Operator, storage, and networking at the substrate layer.
ML systems depth: model registry lifecycle, serving runtime operations, training pipeline authoring and maintenance.
Data path fluency: the boundary with data engineering, feature store operations if present, and the lineage contract between data assets and model artifacts.
Evaluation and observability: the eval harness, rollout gate, drift monitoring, and — for LLM workloads — token cost tracking, latency distribution, and hallucination sampling.

The detailed role-by-role breakdown — what each of the six roles on a mature team does, what skills each requires, and what the anti-patterns look like when a role is mis-hired — is the subject of the next article in this series.

The hand-off to the MLOps engineer

The AI Platform team and the MLOps function operate at different layers of abstraction. Conflating them is the single most common structural mistake in organisations building their first ML production capability.

The AI Platform team owns the platform across all three layers: the GPU Operator, the scheduler, the model registry deployment, the serving runtime, the eval harness infrastructure. Their output is infrastructure that other ML teams consume. Their measure of success is paved-road adoption rate and infrastructure reliability.

The MLOps engineer owns the practice: the training pipelines for specific models, the retraining cadence, the data lineage chain for this product team's use case, the deployment of this model version. Their output is production models behaving correctly and reliably. Their measure of success is model-system reliability and cycle time from experiment to production.

The hand-off happens at a defined contract: the AI Platform team publishes a model registry API, a serving endpoint contract, an eval harness interface, and a set of rollout templates. The MLOps engineer on a product team consumes those contracts. They do not operate the registry or the serving runtime themselves — they submit to it, read from it, and rely on the platform team to maintain it.
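The contract can be sketched as a pair of interfaces that the platform team publishes and the MLOps engineer codes against. Everything here is an assumption for illustration: the method names, the `ship` flow, the in-memory stand-ins, and the artifact URI are hypothetical and do not describe a real registry or harness API.

```python
from typing import Protocol


class ModelRegistry(Protocol):
    """Published by the platform team; operated by them, not by consumers."""
    def register(self, name: str, artifact_uri: str) -> int: ...
    def stage_of(self, name: str, version: int) -> str: ...


class EvalHarness(Protocol):
    def run(self, name: str, version: int) -> bool: ...


def ship(name: str, artifact_uri: str, registry: ModelRegistry, harness: EvalHarness) -> str:
    """MLOps-side flow: submit to the registry, ask the harness, report the outcome."""
    version = registry.register(name, artifact_uri)
    if not harness.run(name, version):
        return f"{name} v{version}: blocked by eval gate"
    return f"{name} v{version}: eligible for rollout from {registry.stage_of(name, version)}"


class InMemoryRegistry:
    """Stand-in for the platform-operated registry, for illustration only."""
    def __init__(self) -> None:
        self._latest: dict[str, int] = {}

    def register(self, name: str, artifact_uri: str) -> int:
        self._latest[name] = self._latest.get(name, 0) + 1
        return self._latest[name]

    def stage_of(self, name: str, version: int) -> str:
        return "staging"


class AlwaysPass:
    """Stand-in harness that approves every version."""
    def run(self, name: str, version: int) -> bool:
        return True


print(ship("ranker", "s3://example-bucket/ranker", InMemoryRegistry(), AlwaysPass()))
```

The `ship` function never constructs or configures a registry. It receives one, which mirrors the rule that the MLOps engineer submits to and reads from the platform's services without operating them.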

In a small organisation, one person may carry both sets of accountabilities. That is a staffing constraint, not a definition. The accountabilities remain separate even when carried by one person — because when the organisation scales, the split point is already clear.

With a platform team, the MLOps engineer consumes shared primitives at each step. Their focus is the model and its business outcome — not the infrastructure it runs on. The platform team maintains the shared surface for every product team in the organisation.

References

Skelton, M. & Pais, M. (2019). Team Topologies: Organizing Business and Technology Teams for Fast Flow. IT Revolution Press. ISBN 978-1942788812.
CNCF TAG App Delivery. (2023, April). Platforms for Cloud Native Computing (whitepaper).
CNCF TAG App Delivery. (2023, November). Platform Engineering Maturity Model.
DORA / Google. (2024). Accelerate State of DevOps Report 2024.
DORA. (2025). Capabilities: Platform Engineering.
Sculley, D., Holt, G., Golovin, D., et al. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015, pp. 2503–2511.
Google Cloud. (2021). MLOps: Continuous delivery and automation pipelines in machine learning. Cloud Architecture Center.
