AI Platform Engineering & MLOps · Part 20 of 34
RAG and agent observability: scoring retrieval and generation separately, then the trajectory
A RAG system that looks healthy on a single end-to-end accuracy number can be silently failing in two independent ways. This article covers the measurement vocabulary for both — and adds a third axis for agent systems: trajectory correctness.
A retrieval-augmented generation (RAG) system that looks healthy on a single end-to-end accuracy number can still be silently failing in two completely independent ways: it might be retrieving the wrong context, or it might be ignoring good context while hallucinating an answer. Conflating those two failure modes in a single metric is operationally useless — fixing retrieval when generation is the bottleneck wastes engineering cycles, and vice versa.
Agent systems add a third axis: trajectory correctness. An agent that reaches the right final answer via a sequence of incorrect or unnecessary tool calls is not a reliable system — it got lucky. Observing agents means grading intermediate steps, not just outcomes. This article covers the measurement vocabulary for both concerns, grounded in the peer-reviewed frameworks and the vendor-neutral OpenTelemetry GenAI semantic conventions that let you capture the signals regardless of which model or orchestrator you run.
The RAG architecture and why it needs decomposed evaluation
Lewis et al. introduced the RAG architecture in “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (NeurIPS 2020) [1] as a way to combine the parametric memory of a language model with non-parametric memory in a retrieval index. The model attends to retrieved documents at generation time rather than trying to recall facts purely from weights — a design that naturally separates the knowledge store from the reasoning component.
That separation is the key architectural fact for observability. Because retrieval and generation are distinct pipeline stages, their failure modes are also distinct. Treating them as one system collapses information you need to debug it.
Three RAG failure modes and their detection signals
Every RAG failure traces back to one of three root causes. Naming them precisely is the first step toward instrumenting them.
1. Retrieval failure
The retrieval stage returns chunks that are not relevant to the question, or returns too few passages to cover the answer. The model then generates against poor grounding material. Detection signal: low context precision and low context recall [2]. Context precision measures whether the retrieved chunks are actually relevant to the query (noise ratio); context recall measures whether the correct evidence was retrieved at all (coverage ratio). A typical starting point for a relevance threshold is in the 0.7–0.8 cosine-similarity range, but this is domain-dependent and should be calibrated on a held-out evaluation set.
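The two retrieval signals can be sketched as simple set ratios. This is a minimal illustration that assumes you already have relevance labels for each query (from a golden set or a judge model); the function names and chunk IDs are hypothetical, and the RAGAS library's own implementations may differ, for example by taking rank into account.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are relevant (the inverse of the noise ratio)."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Share of the needed evidence that was actually retrieved (coverage ratio)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing was required, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


# Hypothetical labels for one query: four chunks retrieved, three chunks needed.
retrieved = ["c1", "c2", "c3", "c4"]
relevant = ["c2", "c4", "c9"]

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 needed chunks were found
```

Keeping the two values separate matters: here half the retrieved material is noise and one needed chunk is missing, which are two different tuning problems.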
2. Context-integration failure
The retriever returns relevant chunks, but the model ignores or misreads them — producing an answer that is not grounded in the provided context. Detection signal: low faithfulness score [2]. Faithfulness is defined as the fraction of claims in the generated answer that can be attributed to the retrieved context. A high-faithfulness answer contains no unsupported claims; a low-faithfulness answer hallucinates even when the evidence is present. Because the context was retrieved correctly, the fault here is a generation problem, not a retrieval problem.
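The definition above reduces to a ratio once each claim has a verdict. The sketch below assumes a separate step (typically an LLM judge) has already split the answer into claims and decided whether each one is supported by the retrieved context; the `ClaimVerdict` type and the example claims are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # True if the judge found support in the retrieved context


def faithfulness(verdicts: List[ClaimVerdict]) -> Optional[float]:
    """Fraction of answer claims attributable to the retrieved context."""
    if not verdicts:
        return None  # no claims extracted, so there is nothing to grade
    return sum(1 for v in verdicts if v.supported) / len(verdicts)


# Hypothetical judge output for one generated answer.
verdicts = [
    ClaimVerdict("The service scales on queue depth.", supported=True),
    ClaimVerdict("Scaling is configured per deployment.", supported=True),
    ClaimVerdict("The default cooldown is five minutes.", supported=False),
]
score = faithfulness(verdicts)  # 2 supported claims out of 3
```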
3. Corpus freshness failure
The index itself is stale. Documents have been updated, deprecated, or superseded, but the retrieval corpus has not been re-indexed. The retriever faithfully returns the best match it has — which happens to be wrong. Detection signal: this failure mode does not show up in per-query metrics; it appears as a drift signal. Track answer correctness on a reference test set on a scheduled cadence. A progressive decline in correctness against a pinned golden dataset, without a corresponding decline in context precision or faithfulness, points to corpus staleness rather than a model or retrieval tuning issue.
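A drift check of this kind can be sketched as a comparison of early and recent runs on the golden set. The window size, the 10% drop and the 3% tolerance below are illustrative assumptions, not recommended values, and the weekly scores are made up for the example.

```python
from statistics import mean


def relative_drop(series, window=3):
    """Relative decline from the first `window` runs to the last `window` runs."""
    baseline = mean(series[:window])
    recent = mean(series[-window:])
    if baseline == 0:
        return 0.0
    return (baseline - recent) / baseline


def looks_like_corpus_staleness(correctness, precision, faithfulness,
                                drop=0.10, tolerance=0.03):
    """Correctness declines while the per-query metrics stay roughly flat."""
    return (
        relative_drop(correctness) >= drop
        and relative_drop(precision) < tolerance
        and relative_drop(faithfulness) < tolerance
    )


# Hypothetical weekly scores on a pinned golden dataset.
correctness = [0.90, 0.89, 0.90, 0.86, 0.82, 0.79, 0.76]
precision = [0.81, 0.80, 0.82, 0.81, 0.80, 0.81, 0.82]
faithful = [0.93, 0.92, 0.93, 0.92, 0.93, 0.93, 0.92]

stale = looks_like_corpus_staleness(correctness, precision, faithful)

# If context precision falls as well, the pattern points at retrieval instead.
falling_precision = [0.81, 0.80, 0.82, 0.75, 0.70, 0.66, 0.62]
not_stale = looks_like_corpus_staleness(correctness, falling_precision, faithful)
```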
The three failure modes are independent and require different mitigations: retrieval failure → tune the embedding model, rerank, or adjust chunking strategy; context-integration failure → adjust prompting, context window, or model; corpus freshness failure → automate re-indexing and add corpus-age monitoring to your platform’s alerting.
RAGAS: the four core metrics
Es et al. formalised the evaluation framework in “RAGAS: Automated Evaluation of Retrieval Augmented Generation” (EACL 2024) [2]. RAGAS defines four metrics that span both retrieval and generation quality, each measurable without human annotation for every query:
- Faithfulness — fraction of answer claims supported by the retrieved context. Low faithfulness with high context precision = context-integration failure.
- Answer relevance — how directly the generated answer addresses the question. Measures whether the model stayed on topic.
- Context precision — proportion of retrieved chunks that are relevant to the question. Low context precision = noisy retrieval.
- Context recall — whether all necessary evidence was present in the retrieved set. Low recall = retrieval failure (coverage gap).
The TruLens project frames the same decomposition as the “RAG Triad”: context relevance, groundedness (equivalent to faithfulness), and answer relevance [3]. The naming differs; the decomposition is the same. Whether you adopt RAGAS, TruLens, or a comparable framework (e.g. DeepEval), the important design decision is that you instrument every dimension and treat each as a separate signal in your observability pipeline rather than averaging them into a single score.
The example below comes from the article's interactive RAG Trace Inspector and shows how the pattern of retrieval and generation scores identifies the failure mode. (The three retrieved chunks and the numeric RAGAS scores from the widget are not preserved in this text.)

Query: What is the recommended GPU memory for serving a 70B LLM?

Generated answer: “Based on the retrieved context, training large models typically requires 80 GB GPUs. For serving, memory requirements depend on batch size.”

Diagnosis: Low context precision and recall with high faithfulness: the model faithfully used what it was given, but the retriever returned training-oriented chunks instead of serving-specific ones. The answer conflates training and serving GPU requirements.

Recommended fix: Tune the embedding model or chunking strategy so that serving-specific chunks rank above training chunks for serving queries. Consider query expansion or metadata filtering by topic.
OpenTelemetry GenAI semantic conventions: the instrumentation vocabulary
The OpenTelemetry semantic conventions for generative AI systems [4] define a standardised attribute namespace — gen_ai.* — that lets you capture LLM and agent signals in a vendor-neutral way across any OTel-compatible collector. This is the layer that converts per-vendor SDKs into a common observability substrate.
Key attributes from the GenAI namespace that every RAG and agent pipeline should emit:
- gen_ai.system: identifies the model provider (e.g. openai, anthropic, vertex_ai). Enables cross-provider comparison in dashboards.
- gen_ai.request.model / gen_ai.response.model: the model requested and the model that actually produced the response. Compare the pair to detect model routing or fallback events.
- gen_ai.usage.input_tokens / gen_ai.usage.output_tokens: token counts per call. Essential for cost attribution per RAG query or agent run.
- gen_ai.operation.name: the operation type (chat, text_completion, embeddings). Lets you separate embedding calls (retrieval) from generation calls within the same trace.
- gen_ai.tool.name / gen_ai.tool.call.id: the name and call identifier of a tool invoked by the model. The primary grading unit for agent trajectory evaluation.
- gen_ai.agent.id / gen_ai.agent.name: defined in the agent spans specification [4], these identify the agent instance so you can correlate tool calls back to a named agent across a multi-agent trace.
Concretely, a RAG query should emit at minimum two child spans: one for the embedding call (operation.name = embeddings, carrying the query text and token count) and one for the generation call (operation.name = chat, carrying the model, token counts, and the retrieved-context hash or identifier). A span hierarchy of this shape, collected by any OTel-compatible pipeline and stored in a traces backend (e.g. Tempo, Jaeger, or a managed provider), gives you the raw material to compute all four RAGAS dimensions without instrumenting proprietary SDK events.
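As a rough sketch of that span shape, the helpers below build the attribute dictionaries for the two child spans. They use plain Python dictionaries rather than an OTel SDK so the example stays self-contained; in a real pipeline you would set these attributes on spans created by your tracer. The `rag.context.fingerprint` key is a hypothetical custom attribute, not part of the gen_ai.* namespace, and the token counts and generation model names are placeholders.

```python
import hashlib


def context_fingerprint(chunk_ids):
    """Short, stable identifier for the retrieved context, in retrieval order."""
    joined = "\n".join(chunk_ids)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


def embedding_span_attrs(provider, model, input_tokens):
    return {
        "gen_ai.system": provider,
        "gen_ai.operation.name": "embeddings",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
    }


def generation_span_attrs(provider, request_model, response_model,
                          input_tokens, output_tokens, chunk_ids):
    return {
        "gen_ai.system": provider,
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical custom attribute: links the answer to the exact context used.
        "rag.context.fingerprint": context_fingerprint(chunk_ids),
    }


embed_attrs = embedding_span_attrs("openai", "text-embedding-3-small", 12)
gen_attrs = generation_span_attrs(
    "openai", "example-chat-model", "example-chat-model-v2",
    input_tokens=1840, output_tokens=210, chunk_ids=["doc-7#3", "doc-2#1"],
)
```

Recording a fingerprint rather than the chunk text keeps span payloads small while still letting an offline evaluator join each answer to the context it was generated from.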
Agent trajectory evaluation: grading intermediate steps
An agent system executes a plan across multiple steps: it calls tools, processes results, updates its state, and eventually produces a final output. End-to-end outcome evaluation (“did it produce the right answer?”) misses a class of reliability problems that only surface in the trajectory — the sequence of states, tool calls, and reasoning steps the agent took to get there.
What trajectory grading measures
A well-instrumented trajectory evaluation grades three properties independently:
1. Tool-use correctness: did the agent call the right tools, with the right arguments, in the right order? A trajectory that calls a search tool twice when once is sufficient, or calls a write tool before reading the current state, is wasteful or unsafe even if the final output is correct.
2. Plan validity: does the sequence of steps reflect a coherent plan for the stated goal? This catches agents that succeed by brute-force retrying rather than by reasoning correctly, and agents that abandon a valid plan mid-execution.
3. Intermediate-step coherence: at each step, is the model's reasoning consistent with the evidence it has accumulated? This is the per-step analogue of faithfulness: the model should not assert facts at step N that contradict the tool output it received at step N-1.
Three evaluation frameworks and what each grades
AgentBench (Liu et al., ICLR 2024) [5] is a multi-environment benchmark that evaluates LLMs as agents across eight distinct task environments — operating system shell tasks, database manipulation, knowledge graph navigation, web browsing, and others. Its contribution is demonstrating that agent capability is environment-specific: a model that performs well on web tasks does not necessarily perform well on OS-level tasks. AgentBench grades task-completion rate, step efficiency, and error recovery across those environments.
Inspect (UK AI Security Institute) [6] is an open-source evaluation framework designed for structured task grading. It supports custom solvers, multi-turn agent evaluation, and scorer composition — meaning you can define a trajectory scorer that checks intermediate steps, not just the final answer. Inspect is notable for its provenance: it is maintained by a government safety body, which gives it a degree of independence from vendor evaluation incentives.
LangSmith is a managed tracing and evaluation platform that captures agent runs as typed traces, enabling human annotation of individual steps alongside automated LLM-as-judge scoring. It is vendor-backed (LangChain); comparable open alternatives include Weave (Weights & Biases) and Phoenix (Arize). The evaluation pattern (capturing step-level traces, defining a scorer per step type, and aggregating to a run-level report) is portable across these tools.
The choice of framework is secondary to the commitment to capturing trajectory data. If your agent emits OTel spans with the gen_ai.tool.name and gen_ai.tool.call.id attributes on every tool invocation, and gen_ai.agent.id on every agent span, you have the raw data to feed any of these evaluation frameworks — or to build a lightweight custom scorer tuned to your specific task domain.
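To make that concrete, the sketch below turns exported spans into per-agent tool-call sequences that a scorer can consume. It assumes a simplified span record (a dict with `span_id`, `parent_id`, `start_ns` and `attributes`), which is a hypothetical export shape rather than any particular backend's format, and it assumes tool spans are descendants of the agent span that carries gen_ai.agent.id.

```python
from collections import defaultdict


def nearest_agent_id(span, spans_by_id):
    """Walk up the parent chain until a span carrying gen_ai.agent.id is found."""
    current = span
    while current is not None:
        agent_id = current["attributes"].get("gen_ai.agent.id")
        if agent_id is not None:
            return agent_id
        current = spans_by_id.get(current["parent_id"])
    return None


def tool_calls_by_agent(spans):
    """Group tool calls by owning agent, ordered by span start time."""
    spans_by_id = {span["span_id"]: span for span in spans}
    grouped = defaultdict(list)
    for span in sorted(spans, key=lambda s: s["start_ns"]):
        attrs = span["attributes"]
        if "gen_ai.tool.name" not in attrs:
            continue  # not a tool invocation
        agent_id = nearest_agent_id(span, spans_by_id) or "unattributed"
        grouped[agent_id].append(
            (attrs.get("gen_ai.tool.call.id"), attrs["gen_ai.tool.name"])
        )
    return dict(grouped)


# Hypothetical exported spans, deliberately out of order.
spans = [
    {"span_id": "a", "parent_id": None, "start_ns": 0,
     "attributes": {"gen_ai.agent.id": "agent-1"}},
    {"span_id": "c", "parent_id": "a", "start_ns": 20,
     "attributes": {"gen_ai.tool.name": "read_config", "gen_ai.tool.call.id": "call-2"}},
    {"span_id": "b", "parent_id": "a", "start_ns": 10,
     "attributes": {"gen_ai.tool.name": "search_kb", "gen_ai.tool.call.id": "call-1"}},
    {"span_id": "d", "parent_id": None, "start_ns": 30,
     "attributes": {"gen_ai.tool.name": "search_kb", "gen_ai.tool.call.id": "call-3"}},
]
trajectories = tool_calls_by_agent(spans)
```

Tool calls that cannot be traced back to an agent span land in an "unattributed" bucket, which is itself a useful signal that instrumentation is incomplete.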
The example below is one step from the article's interactive Agent Trajectory Explorer, which walks through pre-canned agent runs (one well-formed and one problematic) and gives a per-step verdict on each of the three trajectory dimensions.

Tool called: search_kb({ "query": "KEDA autoscaling vLLM", "top_k": 5 })

Tool output: Found 5 relevant chunks: vllm:num_requests_waiting metric, KEDA ScaledObject config, …

Model reasoning: “The user asked about autoscaling vLLM. I should first search the knowledge base for relevant documentation before answering.”

Trajectory verdicts:
- Tool-use correctness: Correct tool for the task, appropriate arguments, called once.
- Plan validity: Search-before-answer is the correct plan shape for a knowledge-retrieval task.
- Step coherence: Reasoning is consistent with the user request and leads logically to a search call.
Connecting evaluation to the observability pipeline
Evaluation scores are only operationally useful if they flow into the same observability pipeline as infrastructure metrics. The pattern is to emit RAGAS and trajectory scores as OTel metrics (a histogram for the score distribution, optionally with a counter for evaluations that fall below a threshold) tagged with the pipeline name, model version, and query category. This makes them queryable alongside latency and token-cost metrics in the same dashboards and alerting rules, so you can correlate, for example, a spike in context-integration failures with a model rollout, or a drop in context precision with an index rebuild.
A minimal RAG observability instrumentation checklist:

    # Minimum OTel attributes per RAG query span
    embedding_span:
      attributes:
        - gen_ai.system               # e.g. "openai"
        - gen_ai.operation.name       # "embeddings"
        - gen_ai.request.model        # e.g. "text-embedding-3-small"
        - gen_ai.usage.input_tokens
    generation_span:
      attributes:
        - gen_ai.system
        - gen_ai.operation.name       # "chat"
        - gen_ai.request.model
        - gen_ai.response.model       # detect routing/fallback
        - gen_ai.usage.input_tokens
        - gen_ai.usage.output_tokens
    # Custom evaluation metrics (emit as OTel metrics, not spans)
    evaluation_metrics:
      - rag.faithfulness              # float 0-1
      - rag.context_precision         # float 0-1
      - rag.context_recall            # float 0-1
      - rag.answer_relevance          # float 0-1
      - agent.tool_call_correctness   # float 0-1, per agent run
      - agent.step_coherence          # float 0-1, per step

Storing these as OTel metrics with a Prometheus-compatible scrape endpoint means you can write alerting rules of the form “alert when rag.faithfulness p50 drops below 0.7 over a 1-hour window”, the same pattern you would use for API error-rate alerts. Evaluation quality becomes a first-class operational signal, not an offline analysis step.
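The logic behind a rule like the quoted one can be sketched in a few lines. In production this would normally live in your metrics backend's rule language rather than in application code; the Python below only illustrates the evaluation, and the `min_samples` guard is an added assumption to avoid alerting on a near-empty window.

```python
from statistics import median


def faithfulness_alert(samples, now_s, window_s=3600, threshold=0.7, min_samples=5):
    """True when the median score in the trailing window falls below the threshold.

    `samples` is a list of (unix_timestamp_seconds, score) pairs.
    """
    recent = [score for ts, score in samples if now_s - window_s < ts <= now_s]
    if len(recent) < min_samples:
        return False  # too little data to judge
    return median(recent) < threshold


now = 10_000
degraded = [
    (now - 4000, 0.95),  # outside the 1-hour window, ignored
    (now - 3000, 0.90),
    (now - 2400, 0.62),
    (now - 1800, 0.66),
    (now - 1200, 0.71),
    (now - 600, 0.58),
    (now - 60, 0.64),
]
healthy = [(now - 600 * i, 0.90) for i in range(1, 6)]

fire = faithfulness_alert(degraded, now)
quiet = faithfulness_alert(healthy, now)
```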
Separating the failure modes in practice
A useful diagnostic pattern when a RAG system degrades is to compute all four RAGAS metrics on the failing queries together and read the pattern. The table summarises each pattern; the notes after it spell out the reasoning:
| Score pattern | Failure mode | Fix |
|---|---|---|
| Low context precision + low recall, high faithfulness | Retrieval failure | Embedding model, chunking strategy, or query expansion |
| High context precision + high recall, low faithfulness | Context-integration failure | Prompt engineering, context window ordering, or model choice |
| All metrics stable, answer correctness drifting on golden set | Corpus freshness failure | Re-index corpus, add corpus-age monitoring |
| Low answer relevance with high faithfulness + precision | Query understanding / prompt-framing | Query rewriting or prompt adjustment for the decision framing |
- Low context precision + low context recall, high faithfulness → retrieval failure. The model is faithfully using what it was given; it was just given the wrong material. Fix: embedding model, chunking strategy, or query expansion.
- High context precision + high context recall, low faithfulness → context-integration failure. Good context was retrieved; the model ignored it. Fix: prompt engineering, reducing context window noise, or model choice.
- All metrics stable, but answer correctness drifting on a golden test set → corpus freshness failure. Re-index the corpus and update staleness monitoring.
- Low answer relevance with high faithfulness and precision → the model is staying grounded but answering a different question than the one asked. Usually a query understanding or prompt-framing issue.
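The same lookup can be expressed as a small rule function, which is handy for tagging failing queries automatically. This sketch assumes a single threshold of 0.7 separates "low" from "high" for every metric and treats "all metrics stable" as "all above threshold"; both are simplifications, and the thresholds should be calibrated per pipeline.

```python
def diagnose(scores, golden_set_drifting=False, threshold=0.7):
    """Map a pattern of RAGAS scores to the most likely failure mode."""
    low = {name: value < threshold for name, value in scores.items()}
    if low["context_precision"] and low["context_recall"] and not low["faithfulness"]:
        return "retrieval failure"
    if (not low["context_precision"] and not low["context_recall"]
            and low["faithfulness"]):
        return "context-integration failure"
    if (low["answer_relevance"] and not low["faithfulness"]
            and not low["context_precision"]):
        return "query understanding / prompt framing"
    if golden_set_drifting and not any(low.values()):
        return "corpus freshness failure"
    return "no single pattern matched"


# Hypothetical score patterns for four failing queries.
retrieval = diagnose({"faithfulness": 0.92, "answer_relevance": 0.80,
                      "context_precision": 0.35, "context_recall": 0.40})
integration = diagnose({"faithfulness": 0.41, "answer_relevance": 0.75,
                        "context_precision": 0.88, "context_recall": 0.90})
framing = diagnose({"faithfulness": 0.90, "answer_relevance": 0.45,
                    "context_precision": 0.85, "context_recall": 0.80})
freshness = diagnose({"faithfulness": 0.91, "answer_relevance": 0.86,
                      "context_precision": 0.84, "context_recall": 0.82},
                     golden_set_drifting=True)
```

A result of "no single pattern matched" is expected for mixed failures, where more than one stage is degraded and the query needs manual inspection.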
What this means for platform design
For platform engineers supporting RAG and agent workloads, the practical implications are:
1. Require OTel GenAI instrumentation at the workload level. RAG pipelines and agent frameworks that do not emit gen_ai.* spans cannot be observed. Make it a platform onboarding requirement, not an optional add-on.
2. Provide evaluation pipelines as platform primitives. Running RAGAS or Inspect on a scheduled basis against a golden test set should be a platform-managed job, similar to how integration tests run in CI, not something each application team re-implements independently.
3. Store trajectory data for offline analysis. Agent span traces are large; store the full trace for a sample fraction (e.g. 10%) and aggregate metrics for the rest. The full traces are needed when you have to debug a failure mode that only appears in multi-step sequences.
4. Treat corpus freshness as an infrastructure metric, not an application concern. Index age, re-indexing job success rate, and document-staleness histograms belong in the same SLO framework as database replication lag.
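For the trace-sampling point above, one simple approach is to decide retention from a hash of the trace ID, so every component makes the same keep-or-drop decision for a given trace without coordination. This is a sketch under that assumption, not a description of any collector's built-in sampler; the function name and trace ID format are hypothetical.

```python
import hashlib


def keep_full_trace(trace_id: str, fraction: float = 0.10) -> bool:
    """Deterministically keep roughly `fraction` of traces, keyed on the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(keep_full_trace(trace_id) for trace_id in trace_ids)
kept_rate = kept / len(trace_ids)
```

Because the decision depends only on the trace ID, re-running the function on the same trace always gives the same answer, which keeps all spans of a sampled trace together.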
References
- [1] Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arXiv:2005.11401
- [2] Es, S. et al. (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 (demo track). ACL Anthology: 2024.eacl-demo.16
- [3] TruLens / TruEra. RAG Triad — core concepts documentation. trulens.org
- [4] OpenTelemetry. Semantic conventions for generative AI systems (gen_ai.* namespace). opentelemetry.io/docs/specs/semconv/gen-ai/ (and agent spans: gen-ai-agent-spans/)
- [5] Liu, X. et al. (2024). AgentBench: Evaluating LLMs as Agents. ICLR 2024. arXiv:2308.03688
- [6] UK AI Security Institute. Inspect: LLM Evaluation Framework. inspect.aisi.org.uk — GitHub: UKGovernmentBEIS/inspect_ai