RAG Pipeline Observability with OpenTelemetry
RAG pipeline observability is the practice of tracing every stage of a Retrieval-Augmented Generation system — from query embedding to vector search to LLM generation — so you can diagnose failures, measure latency, and track quality in production. Without it, a pipeline can return empty or truncated answers and produce no errors, no logs, and no signal that anything went wrong.
Standard APM tools cover HTTP requests, database queries, and service latency. They are not designed for the multi-stage data flow inside a RAG pipeline: a question passes through an embedding model, a vector database, an optional reranker, a context assembly step, and finally an LLM. Each stage can fail independently. Standard APM sees only the outer request; it cannot tell you which stage took 800ms or why the LLM received 0 retrieved chunks. For that you need distributed tracing with span-level attributes specific to retrieval-augmented generation.
For OpenAI instrumentation with OpenTelemetry, the story is simpler — one model, one API call. RAG pipelines add retrieval complexity that requires a different instrumentation strategy.
If you're new to LLM instrumentation, start with our OpenTelemetry for AI systems guide first — this guide assumes you're already familiar with gen_ai.* spans and focuses specifically on the retrieval stages.
Why RAG Pipelines Fail Silently
RAG systems have five stages where failures produce no exception and no non-200 HTTP status:
1. Embedding. The embedding model changes between versions or is swapped for a cheaper alternative. Semantic similarity scores drift. Retrieval quality degrades slowly. No error fires.
2. Vector search. The query returns zero results because the index is stale, the similarity threshold is too strict, or the query was phrased unusually. The pipeline continues with an empty context and the LLM either hallucinates or returns "I don't know."
3. Reranking. The reranker scores all candidates below a cutoff. The result set collapses from ten candidates to zero. No exception, no warning.
4. Context assembly. Retrieved chunks exceed the model's context window. The assembly step silently truncates. The LLM receives an incomplete prompt and generates a partial or misleading answer.
5. LLM generation. The response contains finish_reason = "length" — the model hit the max_tokens limit before completing its answer. Most applications discard this field and return the truncated text to the user as if it were complete.
None of these stages raise exceptions by default. Without explicit instrumentation and attribute capture at each step, these bugs are invisible in production.
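To make that concrete, here is a minimal sketch of stages 2 and 5 failing silently. The names vector_db, client, query, and query_vector are placeholders for your own vector store and an OpenAI-style client, not APIs from this guide:

chunks = vector_db.search(query_vector, top_k=5)   # returns [], no exception raised
context = "\n\n".join(c["text"] for c in chunks)   # context is now an empty string
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"{context}\n\nQuestion: {query}"}],
    max_tokens=256,
)
answer = response.choices[0].message.content       # returned to the user as-is
if response.choices[0].finish_reason == "length":
    pass  # stage 5: the answer is truncated, but most code never checks this field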
The RAG Trace: What Each Span Should Capture
A well-instrumented RAG pipeline produces a trace with five child spans inside a parent rag.query span. Below are the attributes each span should carry.
Note on standards: The OTel GenAI semantic conventions (including gen_ai.* attributes) carry Development stability, meaning they may change in future releases. To opt into experimental GenAI attributes before they stabilize, set OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental. The rag.* attributes used in this guide are custom attributes chosen for clarity; they are not part of any official specification. OpenInference (used by LlamaIndex auto-instrumentation) emits a different attribute schema; see the auto-instrumentation section below for details.
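If you prefer to set the opt-in from code rather than the shell, one option is the sketch below. It assumes the line runs before the SDK and instrumentations initialize, and that the instrumentation you use actually honors the flag:

import os

os.environ.setdefault("OTEL_SEMCONV_STABILITY_OPT_IN", "gen_ai_latest_experimental")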
Span 1: Query Embedding
| Attribute | Example | Purpose |
|---|---|---|
| rag.query.text | "What is chunking?" | The raw user query |
| rag.embedding.model | text-embedding-3-small | Model used for embedding |
| rag.embedding.duration_ms | 42 | Latency for the embed call |
Span 2: Vector Search
| Attribute | Example | Purpose |
|---|---|---|
| rag.retrieval.query | "What is chunking?" | Query sent to the vector DB |
| rag.retrieval.top_k | 5 | Number of results requested |
| rag.retrieval.results_count | 0 | Number of results returned |
| rag.retrieval.empty_result | true | Boolean flag for empty retrieval |
The rag.retrieval.empty_result boolean is the most important attribute in the entire trace. It lets you filter for all requests where retrieval failed silently.
Span 3: Reranking (optional)
| Attribute | Example | Purpose |
|---|---|---|
| rag.reranking.model | cohere-rerank-v3 | Reranker model |
| rag.reranking.scores | [0.91, 0.74, 0.52] | Score list for retrieved chunks |
Span 4: Context Assembly
| Attribute | Example | Purpose |
|---|---|---|
| rag.context.token_count | 3840 | Total tokens in assembled context |
| rag.context.truncated | true | Whether context was cut |
Span 5: LLM Generation
| Attribute | Example | Purpose |
|---|---|---|
| gen_ai.request.model | gpt-4o | Requested model |
| gen_ai.usage.input_tokens | 4096 | Tokens in the prompt |
| gen_ai.usage.output_tokens | 128 | Tokens in the response |
| gen_ai.response.finish_reasons | ["length"] | Why generation stopped (string array) |
When gen_ai.response.finish_reasons contains "length", the LLM hit max_tokens and the answer is incomplete. This value correlates directly with rag.context.truncated = true upstream.
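If you instrument the generation call by hand, it helps to surface that condition directly on the span. A minimal sketch, assuming an OpenAI-style response object; the event name generation.truncated is a suggestion, not part of any convention:

from opentelemetry import trace

def record_finish_reason(response) -> None:
    span = trace.get_current_span()
    finish_reason = response.choices[0].finish_reason
    span.set_attribute("gen_ai.response.finish_reasons", [finish_reason])
    if finish_reason == "length":
        # Make truncated generations easy to find without parsing response bodies
        span.add_event("generation.truncated", {"requested_max_tokens_reached": True})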
Auto-Instrumentation: LlamaIndex + OpenTelemetry
LlamaIndex has first-class OpenTelemetry support. Two packages are available:
- openinference-instrumentation-llama-index, maintained by Arize/OpenInference and recommended in the official LlamaIndex docs
- opentelemetry-instrumentation-llamaindex, from Traceloop/OpenLLMetry (Traceloop was acquired by ServiceNow in March 2026; the library continues under Apache 2.0)
Both produce OpenInference-compatible spans. The examples below use the Arize package:
pip install openinference-instrumentation-llama-index opentelemetry-exporter-otlp
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
def setup_tracing():
    exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
    )
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    # Instruments LLM calls, retrievers, and embeddings automatically
    LlamaIndexInstrumentor().instrument()
setup_tracing()
# Your existing LlamaIndex code works unchanged
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is context window truncation?")
LlamaIndexInstrumentor().instrument() is a single call. After it runs, every LlamaIndex operation — VectorStoreRetriever, OpenAIEmbedding, OpenAI LLM, and the query engine itself — produces OpenInference-compatible spans with attributes like retrieval.documents, document.score, embedding.model_name, llm.token_count.prompt, and llm.token_count.completion. Note that OpenInference uses its own attribute names — not the rag.* custom attributes described earlier in this guide. The rag.* attributes are for manual instrumentation of custom pipelines; if you use auto-instrumentation, filter and alert on the OpenInference attribute names instead.
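If you also want the rag.* attributes, one option is to wrap the auto-instrumented query in a manual parent span and set them there. A sketch, assuming setup_tracing() from above has already run and that response.source_nodes (the standard LlamaIndex response field) holds the retrieved chunks:

from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")

def query_with_rag_attributes(query_engine, question: str):
    with tracer.start_as_current_span("rag.query") as span:
        span.set_attribute("rag.query.text", question)
        # Auto-instrumented retriever/embedding/LLM spans become children of this span
        response = query_engine.query(question)
        nodes = getattr(response, "source_nodes", []) or []
        span.set_attribute("rag.retrieval.results_count", len(nodes))
        span.set_attribute("rag.retrieval.empty_result", len(nodes) == 0)
        return response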
PII control: To prevent query text and retrieved documents from appearing in traces, set:
OPENINFERENCE_HIDE_INPUTS=true
OPENINFERENCE_HIDE_OUTPUTS=true
These are the OpenInference privacy controls. You can also set OPENINFERENCE_HIDE_EMBEDDING_VECTORS=true and OPENINFERENCE_HIDE_INPUT_TEXT=true for more granular control. This suppresses input/output content in span attributes while preserving token counts, latency, and structural attributes.
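A small sketch of setting the same switches from code; they need to be in place before LlamaIndexInstrumentor().instrument() runs so the instrumentation picks them up:

import os

os.environ.setdefault("OPENINFERENCE_HIDE_INPUTS", "true")
os.environ.setdefault("OPENINFERENCE_HIDE_OUTPUTS", "true")
os.environ.setdefault("OPENINFERENCE_HIDE_EMBEDDING_VECTORS", "true")

setup_tracing()  # instrument only after the variables are set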
Auto-Instrumentation: LangChain + OpenTelemetry
LangChain uses a similar pattern via the OpenLLMetry instrumentation package. Traceloop, the maintainer of opentelemetry-instrumentation-langchain, was acquired by ServiceNow in March 2026; the library continues under Apache 2.0.
pip install opentelemetry-instrumentation-langchain opentelemetry-exporter-otlp
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.langchain import LangChainInstrumentor
def setup_tracing():
    exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
    )
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    LangChainInstrumentor().instrument()
setup_tracing()
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# allow_dangerous_deserialization is required by recent langchain_community releases when loading a locally pickled index
vectorstore = FAISS.load_local("./faiss_index", embeddings, allow_dangerous_deserialization=True)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = qa_chain.invoke({"query": "Explain vector search latency."})
The LangChainInstrumentor wraps the RetrievalQA chain and produces a span tree that shows:
- The root chain invocation with overall latency
- A retriever child span with the number of documents retrieved
- An llm child span with gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.finish_reasons
The gap between auto-instrumentation and complete observability is rag.retrieval.empty_result and rag.context.truncated — the LangChain instrumentation does not currently set these. Add them with manual spans, shown in the next section.
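Before that, a minimal sketch of the idea in LangChain terms: ask the chain to return its source documents (return_source_documents=True is a standard RetrievalQA option) and set rag.retrieval.empty_result on a manual parent span; rag.context.truncated still needs a hook inside your own prompt-assembly code. Variable names match the example above:

from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=retriever, return_source_documents=True
)

def answer(question: str) -> str:
    with tracer.start_as_current_span("rag.query") as span:
        span.set_attribute("rag.query.text", question)
        result = qa_chain.invoke({"query": question})
        docs = result.get("source_documents", [])
        span.set_attribute("rag.retrieval.results_count", len(docs))
        span.set_attribute("rag.retrieval.empty_result", len(docs) == 0)
        return result["result"]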
Manual Instrumentation for Custom RAG Pipelines
Auto-instrumentation covers the standard LlamaIndex and LangChain call paths. For custom retrievers, custom rerankers, or any logic outside those frameworks, instrument manually:
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
exporter = OTLPSpanExporter(
endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag.pipeline")
def embed_query(query: str) -> list[float]:
    # Replace with your actual embedding call
    import time
    start = time.monotonic()
    vector = your_embedding_model.encode(query)
    duration_ms = (time.monotonic() - start) * 1000
    span = trace.get_current_span()
    span.set_attribute("rag.query.text", query)
    span.set_attribute("rag.embedding.model", "text-embedding-3-small")
    span.set_attribute("rag.embedding.duration_ms", round(duration_ms))
    return vector

def retrieve(vector: list[float], top_k: int = 5) -> list[dict]:
    results = your_vector_db.search(vector, top_k=top_k)
    span = trace.get_current_span()
    span.set_attribute("rag.retrieval.top_k", top_k)
    span.set_attribute("rag.retrieval.results_count", len(results))
    span.set_attribute("rag.retrieval.empty_result", len(results) == 0)
    return results

def assemble_context(chunks: list[dict], max_tokens: int = 3000) -> tuple[str, bool]:
    text = "\n\n".join(c["text"] for c in chunks)
    token_count = len(text.split())  # simplified; use a real tokenizer
    truncated = token_count > max_tokens
    if truncated:
        text = " ".join(text.split()[:max_tokens])
    span = trace.get_current_span()
    span.set_attribute("rag.context.token_count", token_count)
    span.set_attribute("rag.context.truncated", truncated)
    return text, truncated

def run_rag_pipeline(query: str) -> str:
    with tracer.start_as_current_span("rag.query") as root_span:
        root_span.set_attribute("rag.query.text", query)

        with tracer.start_as_current_span("rag.embed"):
            vector = embed_query(query)

        with tracer.start_as_current_span("rag.retrieve"):
            chunks = retrieve(vector)

        if not chunks:
            root_span.set_attribute("rag.retrieval.empty_result", True)
            return "No relevant documents found."

        with tracer.start_as_current_span("rag.assemble"):
            context, truncated = assemble_context(chunks)

        with tracer.start_as_current_span("rag.generate"):
            response = your_llm.generate(context=context, query=query)
            span = trace.get_current_span()
            span.set_attribute("gen_ai.request.model", "gpt-4o")
            span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
            span.set_attribute("gen_ai.response.finish_reasons", [response.choices[0].finish_reason])

        return response.choices[0].message.content
The key instrumentation points are in retrieve() — setting rag.retrieval.empty_result — and in assemble_context() — setting rag.context.truncated. These two booleans power the most important production alerts.
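The len(text.split()) heuristic in assemble_context() undercounts real tokens. A more accurate sketch using tiktoken, assuming the model name maps to a known encoding and falling back to cl100k_base otherwise:

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # fallback for unknown model names
    return len(encoding.encode(text))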
Sending RAG Traces to Uptrace
Uptrace, an OpenTelemetry-native APM, accepts traces via OTLP and indexes all span attributes as queryable fields. The exporter configuration is the same across all three approaches:
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
exporter = OTLPSpanExporter(
endpoint="https://api.uptrace.dev:4317",
headers={"uptrace-dsn": "https://<secret>@api.uptrace.dev?grpc=4317"},
)
After traces arrive, distributed tracing in Uptrace lets you:
- Filter by rag.retrieval.empty_result = true to find all requests where retrieval returned nothing
- Group by gen_ai.request.model to compare token usage and latency across model versions
- Filter by gen_ai.response.finish_reasons containing "length" to find truncated responses
- Sort by span duration to identify which pipeline stage is the latency bottleneck
The rag.* attributes are stored as custom attributes in Uptrace and are immediately searchable without any schema configuration.
What to Alert On
Four alert conditions cover the most common RAG failure modes. Tie them to SLA/SLO monitoring requirements to define acceptable thresholds per environment.
| Condition | Suggested threshold | What it means |
|---|---|---|
| rag.retrieval.empty_result rate | > 5% of requests | Index staleness, query distribution shift |
| gen_ai.response.finish_reasons contains "length" rate | > 2% of requests | Context too large, max_tokens too low |
| p95 vector search latency | > 500ms | Vector DB performance degradation |
| p95 end-to-end RAG latency | > 3s | Compound latency across all stages |
Empty retrieval above 5% typically indicates the vector index is stale or the embedding model was swapped without re-indexing. A finish_reasons containing "length" rate above 2% means users are consistently receiving incomplete answers — investigate rag.context.truncated in the same traces to confirm.
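If your backend alerts on metrics rather than trace queries, one option is to emit counters alongside the span attributes and alert on their rates. A minimal sketch using the OTel metrics API; a MeterProvider must be configured separately, and the metric names here are suggestions, not part of any convention:

from opentelemetry import metrics

meter = metrics.get_meter("rag.pipeline")
empty_retrievals = meter.create_counter("rag.retrieval.empty_result.count")
truncated_generations = meter.create_counter("rag.generation.length_stop.count")

def record_outcome(empty_result: bool, finish_reason: str) -> None:
    if empty_result:
        empty_retrievals.add(1)
    if finish_reason == "length":
        truncated_generations.add(1)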
FAQ
What is RAG observability?
RAG observability is the practice of tracing and measuring every stage of a Retrieval-Augmented Generation pipeline — embedding, vector search, reranking, context assembly, and LLM generation. The goal is to make failures visible in production, because most RAG failure modes (empty retrieval, context truncation, incomplete generation) produce no exception and no error log by default.
How do I trace a RAG pipeline with OpenTelemetry?
Use LlamaIndexInstrumentor().instrument() or LangChainInstrumentor().instrument() for framework-based pipelines. For custom pipelines, use tracer.start_as_current_span() to create a span per stage and call span.set_attribute() to record attributes like rag.retrieval.results_count and rag.context.truncated. Export spans via OTLP to any compatible backend.
Does LlamaIndex support OpenTelemetry?
Yes. Install openinference-instrumentation-llama-index (recommended, maintained by Arize/OpenInference) or opentelemetry-instrumentation-llamaindex (from OpenLLMetry/ServiceNow) and call LlamaIndexInstrumentor().instrument() once at startup. Both automatically instrument LLM calls, embedding calls, and vector store retrievers using the OpenInference semantic conventions. Spans are exported via the standard OTLP exporter.
Does LangChain support OpenTelemetry tracing?
Yes. Install opentelemetry-instrumentation-langchain and call LangChainInstrumentor().instrument(). The instrumentation wraps LangChain chains and produces spans for each component. It captures gen_ai.* attributes on LLM calls and retriever metadata, though some RAG-specific attributes like rag.retrieval.empty_result require manual instrumentation.
What attributes should I add to RAG spans?
The most operationally useful attributes are: rag.retrieval.empty_result (boolean), rag.retrieval.results_count (integer), rag.context.truncated (boolean), rag.context.token_count (integer), and gen_ai.response.finish_reasons (string array). These attributes are sufficient to detect and diagnose the most common silent failures in a RAG pipeline.
How is RAG observability different from standard LLM observability?
LLM observability focuses on a single model interaction: token usage, latency, cost, and finish reason. RAG observability adds the retrieval layer — you need to trace what was retrieved, how much, whether the retrieval was empty, and whether the context was truncated before it reached the model. An LLM call with zero retrieved context looks identical to one with ten chunks unless you instrument the retrieval stages explicitly.
Conclusion
RAG pipelines fail in ways that standard observability tools cannot detect: empty vector search results, silent context truncation, and finish_reasons containing "length" on the LLM response. The fix is span-level instrumentation at each stage with attributes that expose these states directly.
For LlamaIndex and LangChain, a single instrumentation call handles the most common spans automatically — note that auto-instrumentation libraries emit their own attribute schemas (OpenInference for LlamaIndex, OpenLLMetry for LangChain), which differ from the custom rag.* attributes used in manual instrumentation. For custom pipelines, manual spans with rag.retrieval.empty_result and rag.context.truncated cover the two most critical failure modes. Export spans to Uptrace or any OTLP-compatible backend and filter on these attributes to find failures immediately rather than through user complaints.
The OTel GenAI SIG is working toward official RAG semantic conventions. Until that specification stabilizes, the custom rag.* attributes used in manual instrumentation and the OpenInference attributes emitted by auto-instrumentation are both supported by Uptrace's attribute indexing.