AI Platform Engineering & MLOps · Part 20 of 34

RAG and agent observability: scoring retrieval and generation separately, then the trajectory

A RAG system that looks healthy on a single end-to-end accuracy number can be silently failing in two independent ways. This article covers the measurement vocabulary for both — and adds a third axis for agent systems: trajectory correctness.



A retrieval-augmented generation (RAG) system that looks healthy on a single end-to-end accuracy number can still be silently failing in two completely independent ways: it might be retrieving the wrong context, or it might be ignoring good context while hallucinating an answer. Conflating those two failure modes in a single metric is operationally useless — fixing retrieval when generation is the bottleneck wastes engineering cycles, and vice versa.

Agent systems add a third axis: trajectory correctness. An agent that reaches the right final answer via a sequence of incorrect or unnecessary tool calls is not a reliable system; it got lucky. Observing agents means grading intermediate steps, not just outcomes. This article covers the measurement vocabulary for both RAG pipelines and agent trajectories, grounded in the peer-reviewed frameworks and the vendor-neutral OpenTelemetry GenAI semantic conventions that let you capture the signals regardless of which model or orchestrator you run.

The RAG architecture and why it needs decomposed evaluation

Lewis et al. introduced the RAG architecture in “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (NeurIPS 2020) [1] as a way to combine the parametric memory of a language model with non-parametric memory in a retrieval index. The model attends to retrieved documents at generation time rather than trying to recall facts purely from weights — a design that naturally separates the knowledge store from the reasoning component.

That separation is the key architectural fact for observability. Because retrieval and generation are distinct pipeline stages, their failure modes are also distinct. Treating them as one system collapses information you need to debug it.

Three RAG failure modes and their detection signals

Every RAG failure traces back to one of three root causes. Naming them precisely is the first step toward instrumenting them.

1. Retrieval failure

The retrieval stage returns chunks that are not relevant to the question, or returns too few passages to cover the answer. The model then generates against poor grounding material. Detection signal: low context precision and low context recall [2]. Context precision measures whether the retrieved chunks are actually relevant to the query (noise ratio); context recall measures whether the correct evidence was retrieved at all (coverage ratio). A typical starting point for a relevance threshold is in the 0.7–0.8 cosine-similarity range, but this is domain-dependent and should be calibrated on a held-out evaluation set.
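The two retrieval metrics can be sketched with plain set arithmetic, assuming you already have a labelled set of relevant chunk IDs for each query. The function names and chunk IDs below are illustrative, and this is a simplification: evaluation libraries typically judge relevance with an LLM and may weight precision by rank.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant (1.0 means no noise)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the needed evidence that was actually retrieved."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    found = len(relevant.intersection(retrieved))
    return found / len(relevant)


# Hypothetical labelled example: four chunks retrieved, three were needed.
retrieved = ["c1", "c7", "c9", "c4"]
relevant = {"c1", "c2", "c4"}
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 needed were retrieved
```

Keeping the two functions separate mirrors the point of the section: a retriever can be noisy without missing evidence, or clean while missing it.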

2. Context-integration failure

The retriever returns relevant chunks, but the model ignores or misreads them — producing an answer that is not grounded in the provided context. Detection signal: low faithfulness score [2]. Faithfulness is defined as the fraction of claims in the generated answer that can be attributed to the retrieved context. A high-faithfulness answer contains no unsupported claims; a low-faithfulness answer hallucinates even when the evidence is present. Because the context was retrieved correctly, the fault here is a generation problem, not a retrieval problem.
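As a minimal sketch of that definition, assume a judge (an LLM or a human) has already split the answer into claims and marked each one as supported or not. The `ClaimVerdict` type and the example claims are hypothetical, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # True if the claim can be attributed to the retrieved context


def faithfulness(verdicts: list[ClaimVerdict]) -> Optional[float]:
    """Fraction of answer claims supported by the retrieved context."""
    if not verdicts:
        return None  # no claims extracted, so the answer cannot be scored
    supported = sum(1 for v in verdicts if v.supported)
    return supported / len(verdicts)


verdicts = [
    ClaimVerdict("Training large models typically needs 80 GB GPUs.", True),
    ClaimVerdict("Serving memory depends on batch size.", True),
    ClaimVerdict("A 70B model always fits on a single GPU.", False),
    ClaimVerdict("KV cache grows with sequence length.", True),
]
score = faithfulness(verdicts)  # 3 of 4 claims are supported
```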

3. Corpus freshness failure

The index itself is stale. Documents have been updated, deprecated, or superseded, but the retrieval corpus has not been re-indexed. The retriever faithfully returns the best match it has — which happens to be wrong. Detection signal: this failure mode does not show up in per-query metrics; it appears as a drift signal. Track answer correctness on a reference test set on a scheduled cadence. A progressive decline in correctness against a pinned golden dataset, without a corresponding decline in context precision or faithfulness, points to corpus staleness rather than a model or retrieval tuning issue.
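One way to express that drift check is to compare the earliest and latest scheduled runs on the golden set. The sketch below assumes one score per scheduled run; the window size, the `drop` and `tolerance` thresholds, and the sample series are all hypothetical and would need calibrating on your own data.

```python
from statistics import mean


def decline(series: list[float], window: int = 3) -> float:
    """Drop from the mean of the first `window` runs to the mean of the last `window`."""
    return mean(series[:window]) - mean(series[-window:])


def suspect_stale_corpus(correctness: list[float],
                         context_precision: list[float],
                         faithfulness: list[float],
                         drop: float = 0.05,
                         tolerance: float = 0.02) -> bool:
    """Flag staleness when correctness falls while the other two metrics hold steady."""
    return (decline(correctness) > drop
            and decline(context_precision) <= tolerance
            and decline(faithfulness) <= tolerance)


# Hypothetical weekly runs against a pinned golden dataset.
correctness = [0.90, 0.89, 0.90, 0.86, 0.82, 0.80, 0.78]
precision = [0.81, 0.80, 0.82, 0.81, 0.80, 0.81, 0.82]
faith = [0.90, 0.91, 0.90, 0.90, 0.91, 0.90, 0.91]
stale = suspect_stale_corpus(correctness, precision, faith)
```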

The three failure modes are independent and require different mitigations: retrieval failure → tune the embedding model, rerank, or adjust chunking strategy; context-integration failure → adjust prompting, context window, or model; corpus freshness failure → automate re-indexing and add corpus-age monitoring to your platform’s alerting.

RAGAS: the four core metrics

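Es et al. formalised the evaluation framework in “RAGAS: Automated Evaluation of Retrieval Augmented Generation” (EACL 2024) [2]. The paper defines faithfulness, answer relevance, and context relevance; the open-source RAGAS library that grew out of it splits the retrieval side into context precision and context recall. The four metrics used in this article span both retrieval and generation quality, and each can be scored by an LLM judge without per-query human annotation (context recall additionally needs a reference answer, so it is typically computed on a golden set):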

Faithfulness — fraction of answer claims supported by the retrieved context. Low faithfulness with high context precision = context-integration failure.
Answer relevance — how directly the generated answer addresses the question. Measures whether the model stayed on topic.
Context precision — proportion of retrieved chunks that are relevant to the question. Low context precision = noisy retrieval.
Context recall — whether all necessary evidence was present in the retrieved set. Low recall = retrieval failure (coverage gap).

The TruLens project frames the same decomposition as the “RAG Triad”: context relevance, groundedness (equivalent to faithfulness), and answer relevance [3]. The naming differs; the decomposition is the same. Whether you adopt RAGAS, TruLens, or a comparable framework (e.g. DeepEval), the important design decision is that you instrument each dimension as a separate signal in your observability pipeline rather than averaging them into a single score.

The inspector below lets you examine pre-canned RAG trace examples and see how each of the four RAGAS dimensions scores — and what failure mode that pattern implies. Select a trace to walk through the diagnosis.

RAG Trace Inspector

Select a trace to inspect retrieval and generation scores separately. See how the pattern of scores identifies the failure mode.

Query

What is the recommended GPU memory for serving a 70B LLM?

Retrieved Chunks (3)

#1 GPU memory requirements vary by workload type and batch size.

#2 Training jobs typically need 80 GB GPUs for large models.

#3 KV cache grows with sequence length and concurrent requests.

Generated Answer

“Based on the retrieved context, training large models typically requires 80 GB GPUs. For serving, memory requirements depend on batch size.”

RAGAS Scores

Context Precision: 0.31

Context Recall: 0.28

Faithfulness: 0.88

Answer Relevance: 0.52

Failure mode: Retrieval failure

Diagnosis

Low context precision and recall with high faithfulness: the model faithfully used what it was given, but the retriever returned training-oriented chunks instead of serving-specific ones. The answer conflates training and serving GPU requirements.

Recommended Fix

Tune the embedding model or chunking strategy so that serving-specific chunks rank above training chunks for serving queries. Consider query expansion or metadata filtering by topic.

OpenTelemetry GenAI semantic conventions: the instrumentation vocabulary

The OpenTelemetry semantic conventions for generative AI systems [4] define a standardised attribute namespace — gen_ai.* — that lets you capture LLM and agent signals in a vendor-neutral way across any OTel-compatible collector. This is the layer that converts per-vendor SDKs into a common observability substrate.

Key attributes from the GenAI namespace that every RAG and agent pipeline should emit:

gen_ai.system: identifies the model provider (e.g. openai, anthropic, vertex_ai). Enables cross-provider comparison in dashboards. Recent releases of the conventions rename this attribute to gen_ai.provider.name, so check which version your instrumentation targets.
gen_ai.request.model / gen_ai.response.model: the model name that was requested and the model name that actually produced the response. Compare the pair to detect model routing or fallback events.
gen_ai.usage.input_tokens / gen_ai.usage.output_tokens: token counts per call. Essential for cost attribution per RAG query or agent run.
gen_ai.operation.name: the operation type (chat, text_completion, embeddings). Lets you separate embedding calls (retrieval) from generation calls within the same trace.
gen_ai.tool.name / gen_ai.tool.call.id: the name and call identifier of a tool invoked by the model. The primary grading unit for agent trajectory evaluation.
gen_ai.agent.id / gen_ai.agent.name: defined in the agent spans specification [4], these identify the agent instance so you can correlate tool calls back to a named agent across a multi-agent trace.

Concretely, a RAG query should emit at minimum two child spans: one for the embedding call (operation.name = embeddings, carrying the query text and token count) and one for the generation call (operation.name = chat, carrying the model, token counts, and the retrieved-context hash or identifier). A span hierarchy of this shape, collected by any OTel-compatible pipeline and stored in a traces backend (e.g. Tempo, Jaeger, or a managed provider), gives you the raw material to compute all four RAGAS dimensions without instrumenting proprietary SDK events.
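The span shape described above can be sketched without any tracing SDK, using a small stand-in `Span` dataclass. This is an illustration of the hierarchy only, not the OpenTelemetry API: the `rag.query` span name, the `rag.context.hash` attribute, and the `example-chat-model` name are assumptions introduced here.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def rag_query_trace(query_tokens, prompt_tokens, answer_tokens, chunk_ids):
    """Build the parent span and its two child spans for one RAG query."""
    # Stable identifier for the retrieved context (hypothetical attribute name).
    joined = "|".join(sorted(chunk_ids)).encode("utf-8")
    context_hash = hashlib.sha256(joined).hexdigest()[:16]
    embedding = Span("embeddings text-embedding-3-small", {
        "gen_ai.operation.name": "embeddings",
        "gen_ai.request.model": "text-embedding-3-small",
        "gen_ai.usage.input_tokens": query_tokens,
    })
    generation = Span("chat example-chat-model", {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "example-chat-model",  # placeholder model name
        "gen_ai.usage.input_tokens": prompt_tokens,
        "gen_ai.usage.output_tokens": answer_tokens,
        "rag.context.hash": context_hash,
    })
    return Span("rag.query", children=[embedding, generation])


def input_tokens_by_operation(root):
    """Sum input tokens per operation so retrieval and generation cost stay separate."""
    totals = {}
    for child in root.children:
        op = child.attributes["gen_ai.operation.name"]
        totals[op] = totals.get(op, 0) + child.attributes["gen_ai.usage.input_tokens"]
    return totals


trace = rag_query_trace(query_tokens=14, prompt_tokens=820, answer_tokens=96,
                        chunk_ids=["doc-7#2", "doc-3#0", "doc-7#5"])
tokens = input_tokens_by_operation(trace)
```

In a real pipeline the same attributes would be set on spans created by your tracing SDK; the point of the sketch is that `gen_ai.operation.name` alone is enough to split retrieval cost from generation cost.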

Agent trajectory evaluation: grading intermediate steps

An agent system executes a plan across multiple steps: it calls tools, processes results, updates its state, and eventually produces a final output. End-to-end outcome evaluation (“did it produce the right answer?”) misses a class of reliability problems that only surface in the trajectory — the sequence of states, tool calls, and reasoning steps the agent took to get there.

What trajectory grading measures

A well-instrumented trajectory evaluation grades three properties independently:

1. Tool-use correctness: did the agent call the right tools, with the right arguments, in the right order? A trajectory that calls a search tool twice when once is sufficient, or calls a write tool before reading the current state, is wasteful or unsafe even if the final output is correct.
2. Plan validity: does the sequence of steps reflect a coherent plan for the stated goal? This catches agents that succeed by brute-force retrying rather than by reasoning correctly, and agents that abandon a valid plan mid-execution.
3. Intermediate-step coherence: at each step, is the model's reasoning consistent with the evidence it has accumulated? This is the per-step analogue of faithfulness: the model should not assert facts at step N that contradict the tool output it received at step N-1.
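The two tool-use problems named in the first property (a repeated call and a write before any read) are simple enough to check with rules. The sketch below is one possible rule-based scorer under stated assumptions: `ToolCall`, `tool_use_issues`, and the `update_config` tool are hypothetical, and which tools count as reads or writes is something you would declare per agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: tuple  # hashable (key, value) pairs so identical calls compare equal


def tool_use_issues(calls, read_tools, write_tools):
    """Return a list of rule violations found in an ordered list of tool calls."""
    issues = []
    seen = set()
    has_read = False
    for index, call in enumerate(calls, start=1):
        if call in seen:
            issues.append(f"step {index}: repeated call to {call.name} with identical arguments")
        seen.add(call)
        if call.name in read_tools:
            has_read = True
        if call.name in write_tools and not has_read:
            issues.append(f"step {index}: {call.name} wrote before any read")
    return issues


search = ToolCall("search_kb", (("query", "KEDA autoscaling vLLM"),))
write = ToolCall("update_config", (("replicas", 4),))  # hypothetical write tool

clean_run = tool_use_issues([search, write], {"search_kb"}, {"update_config"})
bad_run = tool_use_issues([write, search, search], {"search_kb"}, {"update_config"})
```

Plan validity and step coherence usually need a judge model or human review; rules like these only cover the mechanical part of the trajectory.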

Three evaluation frameworks and what each grades

AgentBench (Liu et al., ICLR 2024) [5] is a multi-environment benchmark that evaluates LLMs as agents across eight distinct task environments, including operating system shell tasks, database manipulation, knowledge graph navigation, and web browsing. Its results show that agent capability is environment-specific: a model that performs well on web tasks does not necessarily perform well on OS-level tasks. AgentBench scores each environment with a task-specific metric such as success rate, and its failure analysis groups unsuccessful runs by cause, for example invalid actions or exceeding the allowed number of rounds.

Inspect (UK AI Security Institute) [6] is an open-source evaluation framework designed for structured task grading. It supports custom solvers, multi-turn agent evaluation, and scorer composition — meaning you can define a trajectory scorer that checks intermediate steps, not just the final answer. Inspect is notable for its provenance: it is maintained by a government safety body, which gives it a degree of independence from vendor evaluation incentives.

LangSmith is a managed tracing and evaluation platform that captures agent runs as typed traces, enabling human annotation of individual steps alongside automated LLM-as-judge scoring. It is vendor-backed (LangChain); comparable open alternatives include Weave (Weights & Biases) and Phoenix (Arize). The evaluation pattern (capturing step-level traces, defining a scorer per step type, and aggregating to a run-level report) is portable across these tools.

The choice of framework is secondary to the commitment to capturing trajectory data. If your agent emits OTel spans with the gen_ai.tool.name and gen_ai.tool.call.id attributes on every tool invocation, and gen_ai.agent.id on every agent span, you have the raw data to feed any of these evaluation frameworks — or to build a lightweight custom scorer tuned to your specific task domain.
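As a sketch of what "raw data" means here, the function below turns exported span records into one ordered tool-call sequence per agent. The record shape (`span_id`, `parent_id`, `start`, `attributes`) is an assumption about how your backend exports spans, and the `summarise` tool and the IDs are made up for the example.

```python
def trajectories(spans):
    """Group tool-call spans by owning agent, ordered by start time."""
    by_id = {span["span_id"]: span for span in spans}

    def owning_agent(span):
        # Walk up the parent chain until a span carrying gen_ai.agent.id is found.
        current = span
        while current is not None:
            agent_id = current["attributes"].get("gen_ai.agent.id")
            if agent_id is not None:
                return agent_id
            current = by_id.get(current.get("parent_id"))
        return None

    result = {}
    for span in sorted(spans, key=lambda s: s["start"]):
        tool = span["attributes"].get("gen_ai.tool.name")
        if tool is None:
            continue  # not a tool-call span
        agent = owning_agent(span)
        if agent is not None:
            call_id = span["attributes"].get("gen_ai.tool.call.id")
            result.setdefault(agent, []).append((tool, call_id))
    return result


spans = [
    {"span_id": "a1", "parent_id": None, "start": 0.0,
     "attributes": {"gen_ai.agent.id": "agent-docs", "gen_ai.agent.name": "docs-helper"}},
    {"span_id": "t2", "parent_id": "a1", "start": 2.0,
     "attributes": {"gen_ai.tool.name": "summarise", "gen_ai.tool.call.id": "call-2"}},
    {"span_id": "t1", "parent_id": "a1", "start": 1.0,
     "attributes": {"gen_ai.tool.name": "search_kb", "gen_ai.tool.call.id": "call-1"}},
]
runs = trajectories(spans)
```

The resulting per-agent sequences are the input a trajectory scorer needs, whichever evaluation framework ends up consuming them.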

The explorer below steps through pre-canned agent runs — one well-formed and one problematic. For each step you can see the tool call, the output, and the per-step verdict across all three trajectory dimensions.

Agent Trajectory Explorer

Step through a multi-step agent run and see per-step verdicts across three trajectory dimensions: tool-use correctness, plan validity, and step coherence.

Step 1 of 2: Search knowledge base

Tool Called

search_kb({ "query": "KEDA autoscaling vLLM", "top_k": 5 })

Tool Output

Found 5 relevant chunks: vllm:num_requests_waiting metric, KEDA ScaledObject config, …

Model Reasoning

“The user asked about autoscaling vLLM. I should first search the knowledge base for relevant documentation before answering.”

Trajectory Verdicts

Tool-use correctness: ✓ 0.96. Correct tool for the task, appropriate arguments, called once.

Plan validity: ✓ 0.93. Search-before-answer is the correct plan shape for a knowledge-retrieval task.

Step coherence: ✓ 0.95. Reasoning is consistent with the user request and leads logically to a search call.

Connecting evaluation to the observability pipeline

Evaluation scores are only operationally useful if they flow into the same observability pipeline as infrastructure metrics. The pattern is to emit RAGAS and trajectory scores as OTel metrics (using a histogram or gauge instrument, since a score is a sampled value rather than a monotonically increasing count) tagged with the pipeline name, model version, and query category. This makes them queryable alongside latency and token-cost metrics in the same dashboards and alerting rules, so you can correlate, for example, a spike in context-integration failures with a model rollout, or a drop in context precision with an index rebuild.

A minimal RAG observability instrumentation checklist:

rag-otel-checklist.yaml

# Minimum OTel attributes per RAG query span
embedding_span:
  attributes:
    - gen_ai.system          # e.g. "openai"
    - gen_ai.operation.name  # "embeddings"
    - gen_ai.request.model   # e.g. "text-embedding-3-small"
    - gen_ai.usage.input_tokens

generation_span:
  attributes:
    - gen_ai.system
    - gen_ai.operation.name  # "chat"
    - gen_ai.request.model
    - gen_ai.response.model  # detect routing/fallback
    - gen_ai.usage.input_tokens
    - gen_ai.usage.output_tokens

# Custom evaluation metrics (emit as OTel metrics, not spans)
evaluation_metrics:
  - rag.faithfulness          # float 0-1
  - rag.context_precision     # float 0-1
  - rag.context_recall        # float 0-1
  - rag.answer_relevance      # float 0-1
  - agent.tool_call_correctness   # float 0-1, per agent run
  - agent.step_coherence          # float 0-1, per step

Storing these as OTel metrics with a Prometheus-compatible scrape endpoint means you can write alerting rules of the form “alert when rag.faithfulness p50 drops below 0.7 over a 1-hour window”, the same pattern you would use for API error-rate alerts. Evaluation quality becomes a first-class operational signal, not an offline analysis step.
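The logic of such a rule can be sketched in a few lines, independent of any alerting backend. The function below assumes scores arrive as `(unix_timestamp, score)` pairs; the function name and the sample data are illustrative, and in production the equivalent query would live in your alerting system rather than in application code.

```python
from statistics import median


def faithfulness_alert(samples, now, window_s=3600.0, threshold=0.7):
    """Return True when the median score inside the window is below the threshold."""
    recent = [score for ts, score in samples if now - window_s <= ts <= now]
    if not recent:
        return False  # no data in the window; handle absence with a separate rule
    return median(recent) < threshold


now = 10_000.0
samples = [
    (now - 7200, 0.91),  # outside the 1-hour window, ignored
    (now - 3000, 0.72),
    (now - 1800, 0.61),
    (now - 600, 0.58),
]
firing = faithfulness_alert(samples, now)  # median of 0.72, 0.61, 0.58 is 0.61
```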

Separating the failure modes in practice

A useful diagnostic pattern when a RAG system degrades is to compute all four RAGAS metrics on the failing queries simultaneously and read the pattern:

Score pattern	Failure mode	Fix
Low context precision + low recall, high faithfulness	Retrieval failure	Embedding model, chunking strategy, or query expansion
High context precision + high recall, low faithfulness	Context-integration failure	Prompt engineering, context window ordering, or model choice
All metrics stable, answer correctness drifting on golden set	Corpus freshness failure	Re-index corpus, add corpus-age monitoring
Low answer relevance with high faithfulness + precision	Query understanding / prompt-framing	Query rewriting or prompt adjustment so the answer targets the question as asked

Low context precision + low context recall, high faithfulness → retrieval failure. The model is faithfully using what it was given; it was just given the wrong material. Fix: embedding model, chunking strategy, or query expansion.
High context precision + high context recall, low faithfulness → context-integration failure. Good context was retrieved; the model ignored it. Fix: prompt engineering, reducing context window noise, or model choice.
All metrics stable, but answer correctness drifting on a golden test set → corpus freshness failure. Re-index the corpus and update staleness monitoring.
Low answer relevance with high faithfulness and precision → the model is staying grounded but answering a different question than the one asked. Usually a query understanding or prompt-framing issue.

This diagnostic matrix is only possible because the metrics are separated. A single composite RAG-quality score would mask all four patterns.
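The matrix translates directly into a small classifier. In this sketch the `low` and `high` cut-offs are hypothetical and should be calibrated per domain, and "answer correctness drifting" is passed in as a flag because it is a property of a time series rather than of a single set of scores.

```python
def diagnose(precision, recall, faithfulness, relevance,
             correctness_drifting=False, low=0.5, high=0.8):
    """Map a pattern of separated scores to the most likely failure mode."""
    if precision < low and recall < low and faithfulness >= high:
        return "retrieval failure"
    if precision >= high and recall >= high and faithfulness < low:
        return "context-integration failure"
    if relevance < low and faithfulness >= high and precision >= high:
        return "query understanding / prompt-framing"
    if correctness_drifting:
        return "corpus freshness failure"
    return "no single pattern matched"


# Scores from the retrieval-failure trace shown in the inspector example.
mode = diagnose(precision=0.31, recall=0.28, faithfulness=0.88, relevance=0.52)
```

A composite score would feed one number into this function and every branch would collapse into the last one.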

What this means for platform design

For platform engineers supporting RAG and agent workloads, the practical implications are:

1. Require OTel GenAI instrumentation at the workload level. RAG pipelines and agent frameworks that do not emit gen_ai.* spans cannot be observed. Make it a platform onboarding requirement, not an optional add-on.
2. Provide evaluation pipelines as platform primitives. Running RAGAS or Inspect on a scheduled basis against a golden test set should be a platform-managed job, similar to how integration tests run in CI, rather than something each application team re-implements independently.
3. Store trajectory data for offline analysis. Agent span traces are large; store the full trace for a sample fraction (e.g. 10%) and aggregate metrics for the rest. The full traces are needed when you need to debug a failure mode that only appears in multi-step sequences.
4. Treat corpus freshness as an infrastructure metric, not an application concern. Index age, re-indexing job success rate, and document-staleness histograms belong in the same SLO framework as database replication lag.
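For the sampling in point 3, one common approach is to decide from a hash of the trace ID, so that every service handling the same trace makes the same keep-or-drop decision. The sketch below is one way to do that under that assumption; the function name is hypothetical and the 10% default simply mirrors the example figure above.

```python
import hashlib


def keep_full_trace(trace_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically keep roughly `sample_rate` of traces, keyed on trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


decision = keep_full_trace("trace-42")
same_again = keep_full_trace("trace-42")  # identical input, identical decision
```

Hash-based sampling avoids shared state between collectors, at the cost of not being able to favour "interesting" traces; keeping all traces that contain a failed evaluation is a reasonable addition on top.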

The core principle: retrieval and generation are distinct pipeline stages with distinct failure modes. Instrument them as such. Add a third instrumentation layer for agent trajectories. Emit all three as OTel metrics so that quality degradation triggers operational alerts the same way a spike in API errors would.

References

[1] Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arXiv:2005.11401
[2] Es, S. et al. (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 (demo track). ACL Anthology: 2024.eacl-demo.16
[3] TruLens / TruEra. RAG Triad — core concepts documentation. trulens.org
[4] OpenTelemetry. Semantic conventions for generative AI systems — gen_ai.* namespace. opentelemetry.io/docs/specs/semconv/gen-ai/ (and agent spans: gen-ai-agent-spans/)
[5] Liu, X. et al. (2024). AgentBench: Evaluating LLMs as Agents. ICLR 2024. arXiv:2308.03688
[6] UK AI Security Institute. Inspect: LLM Evaluation Framework. inspect.aisi.org.uk — GitHub: UKGovernmentBEIS/inspect_ai
