AI Platform Engineering & MLOps · Part III of 34
The six roles on an AI platform
What each role does, what each is fluent in, and the anti-patterns that appear when an organisation hires for one role and uses the person as another.
Role titles in AI and ML infrastructure are not standardised. The same responsibilities appear under “AI Platform Engineer”, “ML Infra Engineer”, “AI Reliability Engineer”, and half a dozen other labels depending on the company. LinkedIn’s 2026 Jobs on the Rise report found four of the five fastest-growing roles in the U.S. were AI-related — and AI/ML job postings surged 163% from 2024 to 2025, reaching roughly 49,000 open positions in the U.S. alone. That volume of demand, met with non-standardised titling, produces systematic mis-hiring.
This article maps the six roles that appear on a working AI platform team. Four are core: Platform Engineer (AI Platform), MLOps Engineer, ML Engineer, and Data Scientist. Two are habitually conflated with the core four: DevOps/SRE and Data Engineer. For each role this article gives the job-to-be-done, the day-one and senior skill bar, the tools expected at each level, and — the most useful part — the anti-patterns that appear when an organisation hires for one role and uses the person as another.
Why role clarity matters more than title uniformity
The problem is not that companies use different titles — it is that they conflate distinct optimisation functions. A DevOps engineer asked to “also handle MLOps” is being asked to operate a model lifecycle on top of a general infrastructure role, which means one of the two halves rots. An ML engineer hired to run the cluster is being asked to spend cognitive budget on GPU driver versions and network policy instead of on model architecture. The skill overlap is real; the optimisation function is not the same.
The SFIA 9 framework (Skills Framework for the Information Age, published by BCS and the SFIA Foundation) is the most widely used international standard for mapping digital skills to responsibility levels. It defines the Machine Learning skill (code MLNG) across seven levels and explicitly separates the “building and training models” competency from the “operationalising ML pipelines” competency. That separation is the clearest industry-framework signal that MLOps and ML engineering are distinct practices, not two names for the same thing. The LinkedIn Jobs on the Rise 2026 report similarly treats MLOps engineers and AI infrastructure engineers as separate categories in the fastest-growing roles list, as does the CNCF Platforms White Paper, which distinguishes platform capability providers from platform users.
Skills matter more than titles. The six sections below use industry-generic role labels. Your org chart may say something different; what matters is whether the optimisation function described here matches the person you hired.
Role Matrix Explorer
Platform Engineer (AI Platform)
Owns the Kubernetes-and-GPU substrate. Stands up and maintains the GPU operator, job scheduler, model registry deployment, serving runtime, and observability.
Skill bar
- Day one: Senior Kubernetes admin level, covering Helm, kubectl, RBAC, network policy, and production cluster experience.
- Month three: Shipped a GPU job-queueing setup. Owns NVIDIA GPU Operator. Working model registry with consumers.
- Senior: Designed multi-tenant fairness across shared GPU pool. Rolled out a CNI-level change without downtime. Can debug NCCL from packet capture.
Day-one tools
NVIDIA GPU Operator; Volcano (gang scheduling); Kueue; Argo CD / Flux CD; Helm; Cilium (eBPF CNI)
Anti-pattern
Asking them to write training code. That is the ML Engineer's responsibility. They will leave or produce poor model code.
The Core Four
Role 1: Platform Engineer (AI Platform)
Job-to-be-done
Owns the Kubernetes-and-GPU substrate that the rest of the ML stack runs on. Stands up and maintains the GPU operator, the job scheduler, the model registry deployment, the serving runtime, and the observability that watches all of it. Measure of success: paved-road adoption and substrate reliability.
Skill bar
- Day one: Senior Kubernetes admin level. Comfortable with Helm, kubectl, RBAC, and network policy. Has run a production cluster beyond a local development environment.
- Month three: Has shipped a GPU job-queueing setup on the team’s GPU pool. Owns the NVIDIA GPU Operator deployment. Has a working model registry deployed with consumers connecting from at least one cluster.
- Senior: Has designed the multi-tenant fairness story across competing teams sharing the same GPU pool. Has rolled out a CNI-level change without downtime. Can debug NCCL on a multi-node training job from packet capture.
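The multi-tenant fairness problem in the senior bullet can be illustrated with max-min fair sharing, one common way to divide a fixed GPU pool among competing teams. The following is a minimal sketch under stated assumptions, not how Kueue or Volcano implement quota: the function name, team names, and demand figures are all hypothetical.

```python
def max_min_fair_share(capacity: int, demands: dict[str, int]) -> dict[str, int]:
    """Split `capacity` whole GPUs across teams using max-min fairness.

    Teams that ask for less than an equal share get their full demand;
    the leftover is re-divided among the teams that still want more.
    """
    allocation = {team: 0 for team in demands}
    remaining = capacity
    unsatisfied = {team for team, demand in demands.items() if demand > 0}

    while remaining > 0 and unsatisfied:
        share = remaining // len(unsatisfied)
        if share == 0:
            # Fewer GPUs left than waiting teams: hand out one each, in name order.
            for team in sorted(unsatisfied)[:remaining]:
                allocation[team] += 1
            break
        for team in sorted(unsatisfied):
            grant = min(share, demands[team] - allocation[team])
            allocation[team] += grant
            remaining -= grant
            if allocation[team] == demands[team]:
                unsatisfied.discard(team)
    return allocation


# Hypothetical pool of 12 GPUs shared by three teams.
shares = max_min_fair_share(12, {"research": 10, "search": 4, "ads": 2})
print(shares)
```

In this example the two smaller teams receive everything they asked for and the largest team absorbs the shortfall, which is the property that makes max-min sharing a reasonable starting point for a shared pool. A real scheduler would also have to handle preemption, borrowing, and gang scheduling, none of which this sketch attempts.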
Core skills
- Kubernetes, deep — operators, CRDs, custom controllers
- GPU resource model — NVIDIA device plugin, MIG, MPS, Dynamic Resource Allocation (DRA)
- Helm and Kustomize for release packaging
- Kubernetes networking — CNI, network policy, eBPF-based network observability
- GitOps (e.g. Argo CD, Flux CD) — reconciliation model, ApplicationSet, drift detection
- Linux performance tuning for GPU workloads — NUMA topology, PCIe bandwidth, GPU-direct RDMA
Tool fluency expected
NVIDIA GPU Operator; a gang-scheduling solution (e.g. Volcano) for multi-node training; a cluster-level queue and quota system (e.g. Kueue); a GitOps controller (e.g. Argo CD, Flux CD); Helm; an eBPF-based CNI (e.g. Cilium). Awareness of serving runtimes and registry tooling as substrates they make available to the rest of the org.
Role 2: MLOps Engineer
Job-to-be-done
Owns the lifecycle of production models on the platform — the training pipelines, the model registry contract, the deployment promotion gates, the retraining triggers, and the rollback path. Measure of success: model-system reliability — uptime, prediction quality, retraining cadence.
Skill bar
- Day one: Has shipped at least one production ML pipeline. Comfortable with Python, Docker, a workflow engine, and a model registry.
- Month three: Has built or significantly extended a training pipeline on the team’s cluster. Owns the registry contract (versioning, lifecycle states) and has promoted at least one model end-to-end via GitOps.
- Senior: Has built the eval-in-the-loop rollout gate. Owns the drift-detection story end-to-end. Has rolled back a bad model without taking the serving layer down.
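An eval-in-the-loop rollout gate, as named in the senior bullet, can be as simple as a function that compares a candidate's evaluation results against the current production baseline and blocks promotion on regression. The sketch below is illustrative only: the `EvalResult` fields, the threshold values, and the function name are assumptions, and a real gate would read these metrics from the team's evaluation pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    accuracy: float        # higher is better
    p95_latency_ms: float  # lower is better


def promotion_gate(
    candidate: EvalResult,
    baseline: EvalResult,
    max_accuracy_drop: float = 0.005,
    max_latency_increase: float = 0.10,
) -> tuple[bool, list[str]]:
    """Return (passed, reasons). An empty reasons list means the gate passed."""
    reasons = []
    if candidate.accuracy < baseline.accuracy - max_accuracy_drop:
        reasons.append(
            f"accuracy {candidate.accuracy:.3f} is below baseline "
            f"{baseline.accuracy:.3f} by more than {max_accuracy_drop}"
        )
    latency_limit = baseline.p95_latency_ms * (1 + max_latency_increase)
    if candidate.p95_latency_ms > latency_limit:
        reasons.append(
            f"p95 latency {candidate.p95_latency_ms:.1f} ms exceeds "
            f"limit {latency_limit:.1f} ms"
        )
    return (not reasons, reasons)


# Hypothetical numbers: the candidate is more accurate and only slightly slower.
baseline = EvalResult(accuracy=0.91, p95_latency_ms=40.0)
passed, reasons = promotion_gate(EvalResult(0.92, 42.0), baseline)
print(passed, reasons)
```

Returning the reasons alongside the verdict keeps the gate auditable: the CI job or progressive delivery tool that calls it can log exactly why a promotion was blocked.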
Core skills
- Python at production quality — typing, error handling, testing
- Workflow orchestration — DAG construction, retry logic, parameter sweeps
- Model registry mechanics — versioning, artifact storage, lifecycle state machine
- Container image promotion and CI/CD integration
- Basic Kubernetes at consumer level — enough to write a Job manifest and interpret pod logs, not to operate the cluster
- Evaluation methodology — offline metrics, canary analysis, data drift detection
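The "lifecycle state machine" in the registry bullet can be made concrete with a small transition table. This is a sketch with assumed stage names (`registered`, `staging`, `production`, `archived`) and assumed allowed transitions; it is not the API of MLflow or any other registry, and a given team's contract may define different states.

```python
from enum import Enum


class Stage(Enum):
    REGISTERED = "registered"
    STAGING = "staging"
    PRODUCTION = "production"
    ARCHIVED = "archived"


# Hypothetical contract: which stage changes the registry will accept.
ALLOWED_TRANSITIONS = {
    Stage.REGISTERED: {Stage.STAGING, Stage.ARCHIVED},
    Stage.STAGING: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.STAGING, Stage.ARCHIVED},  # demotion is the rollback path
    Stage.ARCHIVED: set(),                              # terminal state
}


class ModelVersion:
    def __init__(self, name: str, version: int) -> None:
        self.name = name
        self.version = version
        self.stage = Stage.REGISTERED
        self.history = [Stage.REGISTERED]

    def transition(self, target: Stage) -> None:
        """Move to `target`, or raise ValueError if the contract forbids it."""
        if target not in ALLOWED_TRANSITIONS[self.stage]:
            raise ValueError(
                f"{self.name} v{self.version}: "
                f"{self.stage.value} -> {target.value} is not allowed"
            )
        self.stage = target
        self.history.append(target)


model = ModelVersion("churn-model", 3)  # hypothetical model name
model.transition(Stage.STAGING)
model.transition(Stage.PRODUCTION)
print(model.stage.value)
```

Encoding the transitions as data rather than scattered conditionals makes the contract reviewable in one place, which matters when several pipelines promote models through the same registry.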
Tool fluency expected
A workflow orchestrator (e.g. Argo Workflows, Kubeflow Pipelines, Airflow); an experiment and model registry (e.g. MLflow, Weights & Biases); a GitOps controller as a deployment consumer; a progressive delivery tool (e.g. Argo Rollouts, Flagger) for canary and rollback; a CI system.
Role 3: ML Engineer (a.k.a. ML Software Engineer)
Job-to-be-done
Owns the model code — training scripts, inference handlers, evaluation harnesses, and the production-grade Python that turns research into a deployable artefact. Sits between the data scientist and the platform.
Skill bar
- Day one: Senior Python developer. Has shipped a model into production. Comfortable with PyTorch or TensorFlow, distributed training basics (DDP, FSDP), and at least one serving runtime.
- Month three: Has refactored a notebook prototype into production code with proper logging, error handling, and registry integration. Has run at least one distributed training job.
- Senior: Has built a custom inference handler. Has optimised a training job for throughput — DDP tuning, FSDP sharding, mixed precision. Can read a paper and ship a working implementation.
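One reason FSDP sharding and mixed precision appear together in the senior bullet is memory arithmetic. A widely used rule of thumb for mixed-precision training with Adam is about 16 bytes of model state per parameter: 2 for half-precision weights, 2 for half-precision gradients, and 12 for the full-precision master weights and the two Adam moments. The sketch below applies that rule; it deliberately ignores activations, buffers, and temporarily gathered layers, so treat the result as a lower bound rather than a sizing tool. The function name is hypothetical.

```python
# Rule-of-thumb bytes per parameter for mixed-precision Adam:
# 2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 weights, momentum, variance).
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gib(num_params: float, world_size: int, fully_sharded: bool) -> float:
    """Approximate per-GPU model-state memory in GiB.

    With plain data parallelism every GPU holds a full replica; with full
    sharding the same state is split evenly across `world_size` GPUs.
    """
    total_bytes = num_params * BYTES_PER_PARAM
    if fully_sharded:
        total_bytes /= world_size
    return total_bytes / 2**30


replicated = model_state_gib(7e9, world_size=8, fully_sharded=False)
sharded = model_state_gib(7e9, world_size=8, fully_sharded=True)
print(f"replicated: {replicated:.1f} GiB per GPU, sharded: {sharded:.1f} GiB per GPU")
```

For a hypothetical 7-billion-parameter model on 8 GPUs, this works out to roughly 104 GiB per GPU when replicated and roughly 13 GiB per GPU when fully sharded, which is why a model that cannot train under plain data parallelism may fit once its state is sharded.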
Core skills
- Deep Python — typing, performance profiling, production error handling
- Deep learning frameworks — PyTorch primarily, TensorFlow secondarily
- Distributed training — DDP, FSDP, and pipeline parallelism patterns
- Portable inference formats — ONNX, TorchScript, and serialisation trade-offs
- GPU mechanics at the code level — memory layout, mixed precision, kernel selection
- Evaluation methodology — offline metrics, ablation design, statistical validity
Tool fluency expected
PyTorch; a training-loop abstraction library (e.g. Hugging Face Accelerate, PyTorch Lightning); a Kubernetes distributed training operator (e.g. Kubeflow Training Operator — PyTorchJob/MPIJob); the model registry client (e.g. MLflow client); an LLM serving runtime (e.g. vLLM, Triton Inference Server) at deployer depth; an experiment tracking tool (e.g. MLflow, Weights & Biases).
Role 4: Data Scientist
Job-to-be-done
Owns model design and selection — research the right approach, build the prototype, prove the metric, document the experiment. Hands off to ML Engineering for productionisation or to MLOps for deployment. Measure of success: the quality and defensibility of the model decision.
Skill bar
- Day one: Strong statistics and ML fundamentals. Comfortable with Python, Jupyter, and the major framework ecosystem. Has shipped at least one model that influenced a business decision.
- Month three: Owns the modelling decisions for at least one product. Has run a rigorous experimental design — offline evaluation, A/B test, or equivalent. Knows when to use a tree model and when to use a neural net.
- Senior: Designs evaluation methodology that distinguishes “the model is better” from “the test set is leaking”. Can challenge a product team on whether ML is the right solution for the problem.
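One narrow but cheap check for the "test set is leaking" failure mode is to measure how many test rows also appear verbatim in the training data. The sketch below fingerprints each row after dropping the label column. It only catches exact duplicates, so it says nothing about temporal or target leakage; the column names and the `ignore` default are hypothetical.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """Stable hash of a row, independent of key order."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def leaked_fraction(
    train_rows: list[dict],
    test_rows: list[dict],
    ignore: tuple[str, ...] = ("label",),
) -> float:
    """Fraction of test rows whose features also appear in the training set."""
    if not test_rows:
        return 0.0

    def features(row: dict) -> dict:
        return {k: v for k, v in row.items() if k not in ignore}

    train_prints = {row_fingerprint(features(r)) for r in train_rows}
    leaked = sum(row_fingerprint(features(r)) in train_prints for r in test_rows)
    return leaked / len(test_rows)


# Hypothetical data: one of the two test rows duplicates a training row.
train = [{"user": 1, "amount": 10, "label": 0}, {"user": 2, "amount": 20, "label": 1}]
test = [{"user": 2, "amount": 20, "label": 1}, {"user": 3, "amount": 5, "label": 0}]
print(leaked_fraction(train, test))
```

A non-zero result does not prove the reported metric is inflated, but it is a prompt to re-split the data and re-run the evaluation before claiming the model is better.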
Core skills
- Statistics and ML theory — distributions, hypothesis testing, overfitting analysis
- Python — production-comfortable but not necessarily production-quality
- Experimental design — treatment/control framing, power analysis, confound identification
- Data exploration — pandas, polars, SQL at analytical depth
- Model interpretation — SHAP, partial dependence plots, ablations
- Domain expertise in at least one problem area
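The power analysis mentioned under experimental design can be sketched with the standard normal-approximation formula for comparing two proportions: the required sample size per arm is (z for 1 - alpha/2 plus z for the target power), squared, multiplied by p1(1 - p1) + p2(1 - p2), and divided by the squared difference p1 - p2. The example below uses only the Python standard library; the conversion rates are hypothetical, and the approximation assumes a two-sided test with equal-sized arms.

```python
import math
from statistics import NormalDist


def samples_per_arm(
    p_control: float,
    p_treatment: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate sample size per arm for a two-sided two-proportion z-test."""
    effect = p_treatment - p_control
    if effect == 0:
        raise ValueError("the two rates must differ to size an experiment")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect**2)


# Hypothetical A/B test: detect a lift from a 10% to a 12% conversion rate.
print(samples_per_arm(0.10, 0.12))
```

With these inputs the answer is in the region of 3,800 users per arm, and halving the detectable lift roughly quadruples that figure. That scaling is often the practical argument for or against running the experiment at all.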
Tool fluency expected
Jupyter (or equivalent notebook environment); pandas and/or polars; scikit-learn; PyTorch; the model registry tracking client (e.g. MLflow client-side, Weights & Biases) for experiment logging.
The Adjacent Two (related, not the same)
Role 5: DevOps / SRE
Job-to-be-done
Owns the general infrastructure — CI runners, application platforms, networking, identity, and on-call rotation for non-ML services. Conflated with AI Platform Engineering because both deal with Kubernetes; distinct because the AI Platform Engineer carries GPU-specific depth that a general SRE does not.
Skill bar
- Day one: Standard SRE skill set — Linux, Kubernetes, infrastructure-as-code, CI/CD, on-call experience.
- Month three: Owns the application platform that the rest of the organisation consumes.
- Senior: Designs the failure-mode story across the application platform — incident response, runbook coverage, SLO definition.
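SLO definition, named in the senior bullet, usually comes with an error budget: the amount of unavailability the target permits over a window. The arithmetic is simple enough to sketch. The function names and the 30-day window are assumptions, and the downtime figure is hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% target over 30 days allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
# Hypothetical incident: 21.6 minutes of downtime spends half the budget.
print(round(budget_remaining(0.999, 21.6), 2))
```

Expressing reliability as a budget gives the on-call team a shared number to negotiate with: when the remaining fraction approaches zero, risky changes wait.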
Core skills
- Linux — systemd, cgroups, kernel namespaces
- Kubernetes at admin level — not GPU-specialised, but cluster operations and RBAC depth
- Infrastructure as code — Terraform or equivalent
- CI/CD systems
- Observability — metrics, logs, traces using open standards (e.g. Prometheus, OpenTelemetry)
- Incident response and identity (OIDC, federation)
Tool fluency expected
Terraform or an equivalent IaC tool; a GitOps controller; a metrics stack (e.g. Prometheus, Grafana); a log aggregation tool; an identity provider integration.
Role 6: Data Engineer
Job-to-be-done
Owns the data pipelines — ETL/ELT into the warehouse, feature store population, and the data-quality contracts that downstream consumers depend on. Conflated with MLOps because both run pipelines; distinct because the output artefact is data, not a model.
Skill bar
- Day one: SQL and data-modelling depth. Comfortable with a warehouse engine and at least one orchestration tool.
- Month three: Owns a meaningful slice of the data layer feeding ML workloads.
- Senior: Designs the data-quality contract and SLA across producers and consumers. Understands model training data requirements well enough to write the schema contract.
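The schema contract in the senior bullet can be pictured as a list of column rules that every batch must satisfy before downstream training jobs consume it. The following sketch is a hand-rolled illustration, not the API of dbt tests or Great Expectations; the `Column` class, the function name, and the column names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: type
    nullable: bool = False


def contract_violations(rows: list[dict], contract: list[Column]) -> list[str]:
    """Return a human-readable message for every rule a batch breaks."""
    violations = []
    for i, row in enumerate(rows):
        for col in contract:
            if col.name not in row:
                violations.append(f"row {i}: missing column '{col.name}'")
                continue
            value = row[col.name]
            if value is None:
                if not col.nullable:
                    violations.append(f"row {i}: '{col.name}' must not be null")
            elif not isinstance(value, col.dtype):
                violations.append(
                    f"row {i}: '{col.name}' expected {col.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
    return violations


# Hypothetical feature table contract and a batch with three problems.
contract = [Column("user_id", int), Column("spend_30d", float, nullable=True)]
batch = [
    {"user_id": 1, "spend_30d": 9.5},
    {"user_id": None, "spend_30d": None},
    {"user_id": "3"},
]
for message in contract_violations(batch, contract):
    print(message)
```

Collecting every violation instead of failing on the first one gives the producing team a complete picture of what to fix, which is the point of treating the schema as a contract between producers and consumers.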
Core skills
- SQL — deep analytical and data modelling depth (Kimball or data-vault patterns)
- Warehouse engine — at least one of Snowflake, BigQuery, Databricks, or equivalent
- Pipeline orchestration — Airflow, dbt, Dagster, or equivalent
- Data quality tooling — dbt tests, Great Expectations, or equivalent contract testing
- Streaming fundamentals — Kafka and, at senior levels, Flink or Spark Structured Streaming
Tool fluency expected
dbt; Airflow or Dagster; the team’s warehouse; a data quality tool. The lakeFS 2025 State of Data and AI Engineering report notes the Data Engineer skill mix is shifting toward real-time pipeline patterns and governance work that supports AI at scale — feature store population and ML data contract ownership are now common responsibilities.
Team Composer
Set your team size and platform maturity to see a recommended role mix and staffing gaps.
Platform maturity: Live models serving production traffic. Reliability matters.
Recommended role mix (6 of 6 headcount allocated)
- On-call substrate reliability.
- Production lifecycle: pipelines, registry, rollout gates.
- Model code, training optimisation, inference handlers.
- Ongoing model improvement and experiment design.
- Quality gate: eval-in-the-loop before every deployment.
Staffing gaps flagged
- No dedicated data engineer: the ML team rebuilds data pipelines ad-hoc. The feature store has no schema contract or SLA.
The most common conflation: MLOps vs ML Engineering
The role most frequently mis-hired is MLOps Engineer, because the title suggests both model code and operations. Multiple industry analyses of job postings in 2025 and 2026 note that hiring managers conflate the two, and that job descriptions routinely bundle model development, deployment, and infrastructure under a single title, even though SFIA 9 explicitly places these competencies in different clusters.
The operational split:
- MLOps Engineer owns the lifecycle of the model on the platform — pipelines, registry, rollouts, retraining triggers, observability. They may never write a training loop.
- ML Engineer owns the production-grade model code — training scripts, inference handlers, evaluation harnesses, optimisation. They may never touch GitOps.
In a small team one person does both. In any team larger than five, splitting them is what makes both halves work. Myticas Consulting’s 2026 analysis frames this practically: hire an MLOps Engineer when the organisation is ready to move models into production at scale; hire an ML Engineer when the organisation is in active model development. Both needs exist in a mature team simultaneously.
The role-by-role matrix in summary
Each entry: role / job-to-be-done / primary skill domain / tool category / the mistake to avoid.
What comes next in this series
The next article, The deployment-context spectrum, introduces the four deployment contexts — pure-cloud, on-prem, hybrid, and air-gapped — that every role described here will encounter. The tools change across those contexts; the role definitions above hold across all four.
References
- 1. LinkedIn News — LinkedIn Jobs on the Rise 2026: The 25 fastest-growing roles in the U.S. (LinkedIn, 2026). Source for AI/ML job-posting growth figures (163% surge 2024–2025; 49,200 U.S. positions).
- 2. SFIA Foundation — Machine Learning (MLNG) skill definition, SFIA 9 (BCS / SFIA Foundation, 2024).
- 3. SFIA Foundation — SFIA: a framework for AI skills (BCS / SFIA Foundation, 2024).
- 4. Myticas Consulting — MLOps vs ML Engineer: Which Should You Hire in 2026? (Myticas Consulting, 2026).
- 5. lakeFS — The State of Data and AI Engineering 2025 (lakeFS, May 2025).
- 6. CNCF TAG App Delivery — Platforms Definition White Paper (CNCF, 2023).
- 7. Turkovic, I. — AI Job Titles in 2026: A CTO’s Guide to the Naming Chaos (April 2026).