AI Platform Engineering & MLOps · Part III of 34

The six roles on an AI platform

What each role does, what each is fluent in, and the anti-patterns that appear when an organisation hires for one role and uses the person as another.


Platform Engineer · MLOps Engineer · ML Engineer · Data Scientist · DevOps / SRE · Data Engineer

Role titles in AI and ML infrastructure are not standardised. The same responsibilities appear under “AI Platform Engineer”, “ML Infra Engineer”, “AI Reliability Engineer”, and half a dozen other labels depending on the company. LinkedIn’s 2026 Jobs on the Rise report found that four of the five fastest-growing roles in the U.S. were AI-related, and that AI/ML job postings surged 163% from 2024 to 2025, reaching roughly 49,200 open positions in the U.S. alone. That volume of demand, met with non-standardised titling, produces systematic mis-hiring.

This article maps the six roles that appear on a working AI platform team. Four are core: Platform Engineer (AI Platform), MLOps Engineer, ML Engineer, and Data Scientist. Two are habitually conflated with the core four: DevOps/SRE and Data Engineer. For each role this article gives the job-to-be-done, the day-one and senior skill bar, the tools expected at each level, and — the most useful part — the anti-patterns that appear when an organisation hires for one role and uses the person as another.

Why role clarity matters more than title uniformity

The problem is not that companies use different titles — it is that they conflate distinct optimisation functions. A DevOps engineer asked to “also handle MLOps” is being asked to operate a model lifecycle on top of a general infrastructure role, which means one of the two halves rots. An ML engineer hired to run the cluster is being asked to spend cognitive budget on GPU driver versions and network policy instead of on model architecture. The skill overlap is real; the optimisation function is not the same.

The SFIA 9 framework (Skills Framework for the Information Age, published by BCS and the SFIA Foundation) is a widely used international framework for mapping digital skills to responsibility levels. It defines the Machine Learning skill (code MLNG) within SFIA's seven-level responsibility model and explicitly separates the “building and training models” competency from the “operationalising ML pipelines” competency. That separation is the clearest industry-framework signal that MLOps and ML engineering are distinct practices, not two names for the same thing. The LinkedIn Jobs on the Rise 2026 report similarly treats MLOps engineers and AI infrastructure engineers as separate categories in its fastest-growing roles list. The CNCF Platforms White Paper draws a parallel line between platform capability providers and platform users.

Skills matter more than titles. The six sections below use industry-generic role labels. Your org chart may say something different; what matters is whether the optimisation function described here matches the person you hired.

The Core Four

Role 1: Platform Engineer (AI Platform)

Job-to-be-done

Owns the Kubernetes-and-GPU substrate that the rest of the ML stack runs on. Stands up and maintains the GPU operator, the job scheduler, the model registry deployment, the serving runtime, and the observability that watches all of it. Measure of success: paved-road adoption and substrate reliability.

Skill bar

Day one: Senior Kubernetes admin level. Comfortable with Helm, kubectl, RBAC, and network policy. Has run a production cluster beyond a local development environment.
Month three: Has shipped a GPU job-queueing setup on the team’s GPU pool. Owns the NVIDIA GPU Operator deployment. Has a working model registry deployed with consumers connecting from at least one cluster.
Senior: Has designed the multi-tenant fairness story across competing teams sharing the same GPU pool. Has rolled out a CNI-level change without downtime. Can debug NCCL on a multi-node training job from packet capture.

Core skills

Kubernetes, deep — operators, CRDs, custom controllers
GPU resource model — NVIDIA device plugin, MIG, MPS, Dynamic Resource Allocation (DRA)
Helm and Kustomize for release packaging
Kubernetes networking — CNI, network policy, eBPF-based network observability
GitOps (e.g. Argo CD, Flux CD) — reconciliation model, ApplicationSet, drift detection
Linux performance tuning for GPU workloads — NUMA topology, PCIe bandwidth, GPU-direct RDMA

Tool fluency expected

NVIDIA GPU Operator; a gang-scheduling solution (e.g. Volcano) for multi-node training; a cluster-level queue and quota system (e.g. Kueue); a GitOps controller (e.g. Argo CD, Flux CD); Helm; an eBPF-based CNI (e.g. Cilium). Awareness of serving runtimes and registry tooling as substrates they make available to the rest of the org.

Anti-patterns to avoid: Hiring an AI Platform Engineer to write training code. They will leave or produce poor model code. Training code is an ML Engineer’s responsibility. Also: hiring someone who has not operated a multi-node GPU cluster. GPU concerns — NCCL, topology, MIG partition planning, driver version matrix — are not the same as CPU concerns.
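The "multi-tenant fairness story" in the senior skill bar can be made concrete with a small allocation sketch. The version below uses max-min fairness, a generic textbook technique chosen here for illustration; it is not how Kueue, Volcano, or any specific scheduler implements quota. The team names and GPU counts are hypothetical.

```python
from typing import Dict


def max_min_fair_share(demands: Dict[str, int], capacity: int) -> Dict[str, int]:
    """Split `capacity` GPUs across teams using max-min fairness.

    Teams that ask for less than an equal share receive their full demand;
    whatever they leave unused is redistributed among the remaining teams.
    """
    allocation = {team: 0 for team in demands}
    unsatisfied = {team for team, want in demands.items() if want > 0}
    remaining = capacity

    while unsatisfied and remaining > 0:
        share = remaining // len(unsatisfied)
        if share == 0:
            # Fewer GPUs than waiting teams: hand out one each in a stable order.
            for team in sorted(unsatisfied)[:remaining]:
                allocation[team] += 1
            break
        for team in sorted(unsatisfied):
            grant = min(share, demands[team] - allocation[team])
            allocation[team] += grant
            remaining -= grant
        unsatisfied = {t for t in unsatisfied if allocation[t] < demands[t]}

    return allocation


# Hypothetical pool of 16 GPUs shared by three teams.
shares = max_min_fair_share({"search": 2, "ranking": 8, "vision": 8}, capacity=16)
print(shares)
```

In this example the small team gets everything it asked for and the two large teams split the rest evenly. A real platform would layer priorities, borrowing, and preemption on top of a rule like this.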

Role 2: MLOps Engineer

Job-to-be-done

Owns the lifecycle of production models on the platform — the training pipelines, the model registry contract, the deployment promotion gates, the retraining triggers, and the rollback path. Measure of success: model-system reliability — uptime, prediction quality, retraining cadence.

Skill bar

Day one: Has shipped at least one production ML pipeline. Comfortable with Python, Docker, a workflow engine, and a model registry.
Month three: Has built or significantly extended a training pipeline on the team’s cluster. Owns the registry contract (versioning, lifecycle states) and has promoted at least one model end-to-end via GitOps.
Senior: Has built the eval-in-the-loop rollout gate. Owns the drift-detection story end-to-end. Has rolled back a bad model without taking the serving layer down.

Core skills

Python at production quality — typing, error handling, testing
Workflow orchestration — DAG construction, retry logic, parameter sweeps
Model registry mechanics — versioning, artifact storage, lifecycle state machine
Container image promotion and CI/CD integration
Basic Kubernetes at consumer level — enough to write a Job manifest and interpret pod logs, not to operate the cluster
Evaluation methodology — offline metrics, canary analysis, data drift detection
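As one illustration of the data drift detection listed above, the sketch below computes a Population Stability Index (PSI) between a reference sample and a current sample of a single feature. PSI is one common choice among several, not a method the article prescribes. The equal-width binning, the `1e-6` floor for empty bins, and the function names are assumptions made for this example.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: List[float]) -> List[float]:
    """Fraction of values per bin; out-of-range values fall into the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    total = len(values)
    # A small floor keeps the logarithm defined when a bin is empty.
    return [max(c / total, 1e-6) for c in counts]


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], bins: int = 10
) -> float:
    """PSI over equal-width bins derived from the reference sample.

    Both samples must be non-empty. Zero means identical binned distributions;
    larger values mean the current sample has moved away from the reference.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical feature values: training-time sample vs. a shifted live sample.
training = [i / 100 for i in range(100)]
live = [0.5 + i / 200 for i in range(100)]
print(round(population_stability_index(training, live), 3))
```

A commonly cited rule of thumb treats PSI above roughly 0.25 as a significant shift, but the threshold that should trigger retraining is a decision for the team that owns the model.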

Tool fluency expected

A workflow orchestrator (e.g. Argo Workflows, Kubeflow Pipelines, Airflow); an experiment and model registry (e.g. MLflow, Weights & Biases); a GitOps controller as a deployment consumer; a progressive delivery tool (e.g. Argo Rollouts, Flagger) for canary and rollback; a CI system.

Anti-patterns to avoid: Hiring an MLOps Engineer to write training code — MLOps Engineers operate the model lifecycle; ML Engineers write the model code. Also: hiring someone whose background is exclusively data pipelines (ETL, dbt, Airflow) with no model lifecycle exposure. They can ship a pipeline but not a registry contract or a production rollout gate.
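The registry contract and the eval-in-the-loop rollout gate described in this role can be sketched as a small state machine. The stage names, the transition table, and the "higher is better" metric convention below are hypothetical choices for illustration; they do not reproduce the lifecycle model of MLflow or any other registry.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Stage(Enum):
    REGISTERED = "registered"
    STAGING = "staging"
    PRODUCTION = "production"
    ARCHIVED = "archived"


# Allowed lifecycle transitions; anything not listed here is rejected.
ALLOWED = {
    Stage.REGISTERED: {Stage.STAGING, Stage.ARCHIVED},
    Stage.STAGING: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.ARCHIVED},
    Stage.ARCHIVED: set(),
}


@dataclass
class ModelVersion:
    name: str
    version: int
    metrics: Dict[str, float]
    stage: Stage = Stage.REGISTERED
    history: List[Stage] = field(default_factory=list)


def passes_eval_gate(candidate: Dict[str, float], baseline: Dict[str, float]) -> bool:
    """Every baseline metric must be matched or beaten (higher is better)."""
    return all(candidate.get(name, float("-inf")) >= value
               for name, value in baseline.items())


def promote(model: ModelVersion, target: Stage, baseline: Dict[str, float]) -> ModelVersion:
    """Move a model version to `target`, enforcing transitions and the eval gate."""
    if target not in ALLOWED[model.stage]:
        raise ValueError(f"{model.stage.value} -> {target.value} is not allowed")
    if target is Stage.PRODUCTION and not passes_eval_gate(model.metrics, baseline):
        raise ValueError("eval gate failed: candidate does not match the baseline")
    model.history.append(model.stage)
    model.stage = target
    return model


# Hypothetical candidate compared against the metrics of the serving model.
candidate = ModelVersion("churn", version=3, metrics={"auc": 0.91})
promote(candidate, Stage.STAGING, baseline={})
promote(candidate, Stage.PRODUCTION, baseline={"auc": 0.90})
print(candidate.stage, candidate.history)
```

Keeping the transition table as data makes the contract reviewable in a pull request, which fits the GitOps promotion flow mentioned in the skill bar.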

Role 3: ML Engineer (a.k.a. ML Software Engineer)

Job-to-be-done

Owns the model code — training scripts, inference handlers, evaluation harnesses, and the production-grade Python that turns research into a deployable artefact. Sits between the data scientist and the platform.

Skill bar

Day one: Senior Python developer. Has shipped a model into production. Comfortable with PyTorch or TensorFlow, distributed training basics (DDP, FSDP), and at least one serving runtime.
Month three: Has refactored a notebook prototype into production code with proper logging, error handling, and registry integration. Has run at least one distributed training job.
Senior: Has built a custom inference handler. Has optimised a training job for throughput — DDP tuning, FSDP sharding, mixed precision. Can read a paper and ship a working implementation.

Core skills

Deep Python — typing, performance profiling, production error handling
Deep learning frameworks — PyTorch primarily, TensorFlow secondarily
Distributed training — DDP, FSDP, and pipeline parallelism patterns
Portable inference formats — ONNX, TorchScript, and serialisation trade-offs
GPU mechanics at the code level — memory layout, mixed precision, kernel selection
Evaluation methodology — offline metrics, ablation design, statistical validity

Tool fluency expected

PyTorch; a training-loop abstraction library (e.g. Hugging Face Accelerate, PyTorch Lightning); a Kubernetes distributed training operator (e.g. Kubeflow Training Operator — PyTorchJob/MPIJob); the model registry client (e.g. MLflow client); an LLM serving runtime (e.g. vLLM, Triton Inference Server) at deployer depth; an experiment tracking tool (e.g. MLflow, Weights & Biases).

Anti-patterns to avoid: Hiring an ML Engineer who is also expected to operate the cluster. They will write good model code on a fragile substrate. Also: hiring a data scientist into an ML Engineer role. Data scientists optimise for what model to build; ML engineers optimise for what model to ship.
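To illustrate what "production code with proper logging, error handling" can look like once a notebook prototype is refactored, here is a minimal inference-handler sketch. It uses only the standard library and treats the model as a plain callable; the class names, the `InvalidRequest` exception, and the feature-count check are hypothetical, and registry integration is left out.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Sequence

logger = logging.getLogger("inference_handler")


class InvalidRequest(ValueError):
    """The payload does not match the shape the model expects."""


@dataclass(frozen=True)
class Prediction:
    score: float
    model_version: str


class InferenceHandler:
    """Wraps a model callable with input validation, logging, and error handling."""

    def __init__(self, model: Callable[[Sequence[float]], float],
                 n_features: int, model_version: str) -> None:
        self._model = model
        self._n_features = n_features
        self._model_version = model_version

    def predict(self, features: Sequence[float]) -> Prediction:
        if len(features) != self._n_features:
            raise InvalidRequest(
                f"expected {self._n_features} features, got {len(features)}"
            )
        if not all(isinstance(x, (int, float)) for x in features):
            raise InvalidRequest("all features must be numeric")
        try:
            score = float(self._model(features))
        except Exception:
            # Log with traceback, then re-raise so the serving layer can respond.
            logger.exception("model call failed (version=%s)", self._model_version)
            raise
        logger.info("prediction served (version=%s)", self._model_version)
        return Prediction(score=score, model_version=self._model_version)


# Stand-in model: the built-in sum. A real handler would wrap a loaded model.
handler = InferenceHandler(model=sum, n_features=3, model_version="v1")
print(handler.predict([1.0, 2.0, 3.0]))
```

The point of the sketch is the boundary: bad input fails fast with a typed error, model failures are logged with the version that produced them, and the response carries the version for later debugging.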

Role 4: Data Scientist

Job-to-be-done

Owns model design and selection: researching the right approach, building the prototype, proving the metric, and documenting the experiment. Hands off to ML Engineering for productionisation or to MLOps for deployment. Measure of success: the quality and defensibility of the model decision.

Skill bar

Day one: Strong statistics and ML fundamentals. Comfortable with Python, Jupyter, and the major framework ecosystem. Has shipped at least one model that influenced a business decision.
Month three: Owns the modelling decisions for at least one product. Has run a rigorous experimental design — offline evaluation, A/B test, or equivalent. Knows when to use a tree model and when to use a neural net.
Senior: Designs evaluation methodology that distinguishes “the model is better” from “the test set is leaking”. Can challenge a product team on whether ML is the right solution for the problem.

Core skills

Statistics and ML theory — distributions, hypothesis testing, overfitting analysis
Python — production-comfortable but not necessarily production-quality
Experimental design — treatment/control framing, power analysis, confound identification
Data exploration — pandas, polars, SQL at analytical depth
Model interpretation — SHAP, partial dependence plots, ablations
Domain expertise in at least one problem area

Tool fluency expected

Jupyter (or equivalent notebook environment); pandas and/or polars; scikit-learn; PyTorch; the model registry tracking client (e.g. MLflow client-side, Weights & Biases) for experiment logging.

Anti-patterns to avoid: Expecting a Data Scientist to own the production pipeline. They produce the model; the system around the model belongs to ML or MLOps Engineering. Also: hiring a Data Scientist with no statistics background and only ML library experience. They can run a notebook; they cannot tell you whether the result is significant.
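The power analysis listed under experimental design is what separates "the result is significant" from "the test was too small to tell". As a rough sketch, the function below estimates the per-arm sample size for an A/B test on a conversion rate, using the standard normal-approximation formula for comparing two proportions. The baseline and target rates are hypothetical, and the function name is an assumption of this example.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(
    p_control: float, p_treatment: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate users needed in each arm of a two-sided two-proportion test.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2.
    """
    if p_control == p_treatment:
        raise ValueError("the two rates must differ to define an effect size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical experiment: detect a lift from 10% to 12% conversion.
print(sample_size_per_arm(0.10, 0.12))
```

Smaller expected lifts drive the required sample up quickly, which is often the argument a senior data scientist needs when a product team wants to call a test after a few days.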

The Adjacent Two (related, not the same)

Role 5: DevOps / SRE

Job-to-be-done

Owns the general infrastructure — CI runners, application platforms, networking, identity, and on-call rotation for non-ML services. Conflated with AI Platform Engineering because both deal with Kubernetes; distinct because the AI Platform Engineer carries GPU-specific depth that a general SRE does not.

Skill bar

Day one: Standard SRE skill set — Linux, Kubernetes, infrastructure-as-code, CI/CD, on-call experience.
Month three: Owns the application platform that the rest of the organisation consumes.
Senior: Designs the failure-mode story across the application platform — incident response, runbook coverage, SLO definition.

Core skills

Linux — systemd, cgroups, kernel namespaces
Kubernetes at admin level — not GPU-specialised, but cluster operations and RBAC depth
Infrastructure as code — Terraform or equivalent
CI/CD systems
Observability — metrics, logs, traces using open standards (e.g. Prometheus, OpenTelemetry)
Incident response and identity (OIDC, federation)

Tool fluency expected

Terraform or an equivalent IaC tool; a GitOps controller; a metrics stack (e.g. Prometheus, Grafana); a log aggregation tool; an identity provider integration.

Anti-patterns to avoid: Treating an AI Platform Engineer as “the SRE who knows about GPUs”. A senior SRE without GPU experience needs meaningful ramp time — they are not a drop-in replacement. Also: routing ML drift alerts to the SRE on-call. Drift is an MLOps concern; mixing the two alert queues burns out the SRE on signals they cannot action.
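The alert-queue separation described above can be expressed as an explicit routing table. The sketch below is illustrative only: the category names and queue names are hypothetical, and a real setup would express the same idea in the routing configuration of whatever alerting tool the team runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    category: str  # hypothetical categories: see ROUTES below


# One owner per alert category, so no queue receives signals it cannot action.
ROUTES = {
    "infrastructure": "sre-oncall",
    "gpu_substrate": "ai-platform-oncall",
    "model_quality": "mlops-oncall",
}


def route(alert: Alert) -> str:
    """Return the on-call queue that owns this alert's category."""
    try:
        return ROUTES[alert.category]
    except KeyError:
        # An unowned category is a gap in the ownership model, not an SRE page.
        raise ValueError(f"no owner defined for category {alert.category!r}") from None


print(route(Alert("feature drift on churn model", "model_quality")))
print(route(Alert("ingress 5xx rate high", "infrastructure")))
```

Raising on unknown categories, instead of defaulting to the SRE queue, forces the ownership question to be answered when a new alert type is introduced.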

Role 6: Data Engineer

Job-to-be-done

Owns the data pipelines — ETL/ELT into the warehouse, feature store population, and the data-quality contracts that downstream consumers depend on. Conflated with MLOps because both run pipelines; distinct because the output artefact is data, not a model.

Skill bar

Day one: SQL and data-modelling depth. Comfortable with a warehouse engine and at least one orchestration tool.
Month three: Owns a meaningful slice of the data layer feeding ML workloads.
Senior: Designs the data-quality contract and SLA across producers and consumers. Understands model training data requirements well enough to write the schema contract.

Core skills

SQL — deep analytical and data modelling depth (Kimball or data-vault patterns)
Warehouse engine — at least one of Snowflake, BigQuery, Databricks, or equivalent
Pipeline orchestration — Airflow, dbt, Dagster, or equivalent
Data quality tooling — dbt tests, Great Expectations, or equivalent contract testing
Streaming fundamentals — Kafka and, at senior levels, Flink or Spark Structured Streaming

Tool fluency expected

dbt; Airflow or Dagster; the team’s warehouse; a data quality tool. The lakeFS 2025 State of Data and AI Engineering report notes the Data Engineer skill mix is shifting toward real-time pipeline patterns and governance work that supports AI at scale — feature store population and ML data contract ownership are now common responsibilities.

Anti-patterns to avoid: Treating Data Engineering as upstream of MLOps with no shared contract. Either the Data Engineer is part of the ML feedback loop or the ML team rebuilds the data pipeline. The shared contract is the feature store or the feature view — something with a schema and an SLA. Also: hiring a Data Engineer to do MLOps. Data pipelines optimise for data freshness and schema stability; model pipelines optimise for training reproducibility and rollout safety.
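The shared contract with "a schema and an SLA" can be sketched as a small data structure plus a check that both sides run. This is a hedged illustration using only the standard library: the `FeatureContract` type, the column names, and the six-hour freshness window are hypothetical, and a real team would more likely encode the same rules in its data quality tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class FeatureContract:
    name: str
    schema: Dict[str, type]   # column name -> expected Python type
    max_staleness: timedelta  # freshness SLA agreed with consumers


def check_contract(
    contract: FeatureContract,
    rows: List[Dict[str, Any]],
    last_updated: datetime,
    now: datetime,
) -> List[str]:
    """Return human-readable violations; an empty list means the contract holds."""
    violations = []
    if now - last_updated > contract.max_staleness:
        violations.append(f"{contract.name}: data older than {contract.max_staleness}")
    for i, row in enumerate(rows):
        for column in sorted(contract.schema.keys() - row.keys()):
            violations.append(f"row {i}: missing column {column!r}")
        for column, expected in contract.schema.items():
            if column in row and not isinstance(row[column], expected):
                violations.append(f"row {i}: {column!r} is not {expected.__name__}")
    return violations


# Hypothetical feature view consumed by a training pipeline.
contract = FeatureContract(
    name="user_features",
    schema={"user_id": int, "avg_spend": float},
    max_staleness=timedelta(hours=6),
)
now = datetime(2025, 1, 1, 12, tzinfo=timezone.utc)
rows = [{"user_id": 1, "avg_spend": 9.5}, {"user_id": "2"}]
print(check_contract(contract, rows, last_updated=now - timedelta(hours=7), now=now))
```

The Data Engineer runs a check like this before publishing and the MLOps pipeline runs it before training, so a breach is caught at the boundary instead of surfacing later as a model-quality incident.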

Team Composer

Set your team size and platform maturity to see a recommended role mix and staffing gaps.

Team size: 6

Platform maturity: Live models serving production traffic. Reliability matters.

Recommended role mix (6 of 6 headcount allocated)

Platform Eng. ×1: On-call substrate reliability.
MLOps ×1: Production lifecycle (pipelines, registry, rollout gates).
ML Engineer ×2: Model code, training optimisation, inference handlers.
Data Scientist ×1: Ongoing model improvement and experiment design.
Eval Eng. ×1: Quality gate, with eval-in-the-loop before every deployment.

Staffing gaps flagged

Data Engineer (medium): No dedicated data engineer, so the ML team rebuilds data pipelines ad hoc. The feature store has no schema contract or SLA.

The most common conflation: MLOps vs ML Engineering

The role most frequently mis-hired is MLOps Engineer, because the title suggests both model code and operations. Multiple industry analyses of job postings in 2025 and 2026 note that hiring managers conflate the two, and that job descriptions routinely bundle model development, deployment, and infrastructure under a single title, even though SFIA 9 places those responsibilities in separate competency clusters.

The operational split:

MLOps Engineer owns the lifecycle of the model on the platform — pipelines, registry, rollouts, retraining triggers, observability. They may never write a training loop.
ML Engineer owns the production-grade model code — training scripts, inference handlers, evaluation harnesses, optimisation. They may never touch GitOps.

In a small team one person does both. In any team larger than five, splitting them is what makes both halves work. Myticas Consulting’s 2026 analysis frames this practically: hire an MLOps Engineer when the organisation is ready to move models into production at scale; hire an ML Engineer when the organisation is in active model development. Both needs exist in a mature team simultaneously.

The role-by-role matrix in summary

Each entry lists the role, its job-to-be-done, primary skill domain, tool category, and the mistake to avoid.

Platform Engineer (AI Platform)
Job-to-be-done: substrate reliability
Primary skill domain: Kubernetes + GPU operations
Tool category: cluster scheduling + GitOps + CNI
Mistake to avoid: do not ask them to write model code

MLOps Engineer
Job-to-be-done: model lifecycle
Primary skill domain: Python + pipeline orchestration + registry
Tool category: workflow engine + model registry + progressive delivery
Mistake to avoid: do not conflate with ML Engineering or data engineering

ML Engineer
Job-to-be-done: model code
Primary skill domain: deep Python + ML frameworks + distributed training
Tool category: training framework + serving runtime + experiment tracker
Mistake to avoid: do not ask them to run the cluster

Data Scientist
Job-to-be-done: model design
Primary skill domain: statistics + ML theory + experimental design
Tool category: notebook + data exploration + experiment tracker
Mistake to avoid: do not expect them to own the production pipeline

DevOps / SRE
Job-to-be-done: general infrastructure
Primary skill domain: Linux + Kubernetes + IaC + incident response
Tool category: Terraform + GitOps + observability stack
Mistake to avoid: not a drop-in for AI Platform Engineering without GPU ramp time

Data Engineer
Job-to-be-done: data pipelines
Primary skill domain: SQL + data modelling + pipeline orchestration
Tool category: warehouse + dbt + data quality tooling
Mistake to avoid: not a substitute for MLOps without model lifecycle exposure

What comes next in this series

The next article, The deployment-context spectrum, introduces the four deployment contexts — pure-cloud, on-prem, hybrid, and air-gapped — that every role described here will encounter. The tools change across those contexts; the role definitions above hold across all four.

References

1. LinkedIn News — LinkedIn Jobs on the Rise 2026: The 25 fastest-growing roles in the U.S. (LinkedIn, 2026). Source for AI/ML job-posting growth figures (163% surge 2024–2025; 49,200 U.S. positions).
2. SFIA Foundation — Machine Learning (MLNG) skill definition, SFIA 9 (BCS / SFIA Foundation, 2024).
3. SFIA Foundation — SFIA: a framework for AI skills (BCS / SFIA Foundation, 2024).
4. Myticas Consulting — MLOps vs ML Engineer: Which Should You Hire in 2026? (Myticas Consulting, 2026).
5. lakeFS — The State of Data and AI Engineering 2025 (lakeFS, May 2025).
6. CNCF TAG App Delivery — Platforms Definition White Paper (CNCF, 2023).
7. Turkovic, I. — AI Job Titles in 2026: A CTO’s Guide to the Naming Chaos (April 2026).
