RAG Pipeline Observability with OpenTelemetry
RAG pipeline observability is the practice of tracing every stage of a Retrieval-Augmented Generation system — from query embedding to vector search to LLM generation — so you can diagnose failures, measure latency, and track quality in production. Without it, a pipeline can return empty or truncated answers and produce no errors, no logs, and no signal that anything went wrong.
Standard APM tools cover HTTP requests, database queries, and service latency. They are not designed for the multi-stage data flow inside a RAG pipeline: a question passes through an embedding model, a vector database, an optional reranker, a context assembly step, and finally an LLM. Each stage can fail independently. Standard APM sees only the outer request; it cannot tell you which stage took 800ms or why the LLM received 0 retrieved chunks. For that you need distributed tracing with span-level attributes specific to retrieval-augmented generation.
For OpenAI instrumentation with OpenTelemetry, the story is simpler — one model, one API call. RAG pipelines add retrieval complexity that requires a different instrumentation strategy.
If you're new to LLM instrumentation, start with our OpenTelemetry for AI systems guide first — this guide assumes you're already familiar with gen_ai.* spans and focuses specifically on the retrieval stages.
Why RAG Pipelines Fail Silently
RAG systems have five stages where failures produce no exception and no non-200 HTTP status:
1. Embedding. The embedding model changes between versions or is swapped for a cheaper alternative. Semantic similarity scores drift. Retrieval quality degrades slowly. No error fires.
2. Vector search. The query returns zero results because the index is stale, the similarity threshold is too strict, or the query was phrased unusually. The pipeline continues with an empty context and the LLM either hallucinates or returns "I don't know."
3. Reranking. The reranker scores all candidates below a cutoff. The result set collapses from ten candidates to zero. No exception, no warning.
4. Context assembly. Retrieved chunks exceed the model's context window. The assembly step silently truncates. The LLM receives an incomplete prompt and generates a partial or misleading answer.
5. LLM generation. The response contains finish_reason = "length" — the model hit the max_tokens limit before completing its answer. Most applications discard this field and return the truncated text to the user as if it were complete.
None of these stages raise exceptions by default. Without explicit instrumentation and attribute capture at each step, these bugs are invisible in production.
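To make that concrete, here is a minimal sketch of stages 2 and 5 failing silently. The names vector_db, client, query, and query_vector are placeholders for your own vector store and an OpenAI-style client, not APIs from this guide:

chunks = vector_db.search(query_vector, top_k=5)   # returns [], no exception raised
context = "\n\n".join(c["text"] for c in chunks)   # context is now an empty string
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"{context}\n\nQuestion: {query}"}],
    max_tokens=256,
)
answer = response.choices[0].message.content       # returned to the user as-is
if response.choices[0].finish_reason == "length":
    pass  # stage 5: the answer is truncated, but most code never checks this field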
The RAG Trace: What Each Span Should Capture
A well-instrumented RAG pipeline produces a trace with five child spans inside a parent rag.query span. Below are the attributes each span should carry.
Note on standards: The OTel GenAI semantic conventions (including gen_ai.* attributes) carry Development stability, meaning they may change in future releases. To opt into experimental GenAI attributes before they stabilize, set OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental. The rag.* attributes used in this guide are custom attributes chosen for clarity; they are not part of any official specification. OpenInference (used by LlamaIndex auto-instrumentation) emits a different attribute schema; see the auto-instrumentation section below for details.
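If you prefer to set the opt-in from code rather than the shell, one option is the sketch below. It assumes the line runs before the SDK and instrumentations initialize, and that the instrumentation you use actually honors the flag:

import os

os.environ.setdefault("OTEL_SEMCONV_STABILITY_OPT_IN", "gen_ai_latest_experimental")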
Span 1: Query Embedding
| Attribute | Example | Purpose |
|---|---|---|
| rag.query.text | "What is chunking?" | The raw user query |
| rag.embedding.model | text-embedding-3-small | Model used for embedding |
| rag.embedding.duration_ms | 42 | Latency for the embed call |
Span 2: Vector Search
| Attribute | Example | Purpose |
|---|---|---|
| rag.retrieval.query | "What is chunking?" | Query sent to the vector DB |
| rag.retrieval.top_k | 5 | Number of results requested |
| rag.retrieval.results_count | 0 | Number of results returned |
| rag.retrieval.empty_result | true | Boolean flag for empty retrieval |
The rag.retrieval.empty_result boolean is the most important attribute in the entire trace. It lets you filter for all requests where retrieval failed silently.
Span 3: Reranking (optional)
| Attribute | Example | Purpose |
|---|---|---|
| rag.reranking.model | cohere-rerank-v3 | Reranker model |
| rag.reranking.scores | [0.91, 0.74, 0.52] | Score list for retrieved chunks |
Span 4: Context Assembly
| Attribute | Example | Purpose |
|---|---|---|
| rag.context.token_count | 3840 | Total tokens in assembled context |
| rag.context.truncated | true | Whether context was cut |
Span 5: LLM Generation
| Attribute | Example | Purpose |
|---|---|---|
| gen_ai.request.model | gpt-4o | Requested model |
| gen_ai.usage.input_tokens | 4096 | Tokens in the prompt |
| gen_ai.usage.output_tokens | 128 | Tokens in the response |
| gen_ai.response.finish_reasons | ["length"] | Why generation stopped (string array) |
When gen_ai.response.finish_reasons contains "length", the LLM hit max_tokens and the answer is incomplete. This value correlates directly with rag.context.truncated = true upstream.
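If you instrument the generation call by hand, it helps to surface that condition directly on the span. A minimal sketch, assuming an OpenAI-style response object; the event name generation.truncated is a suggestion, not part of any convention:

from opentelemetry import trace

def record_finish_reason(response) -> None:
    span = trace.get_current_span()
    finish_reason = response.choices[0].finish_reason
    span.set_attribute("gen_ai.response.finish_reasons", [finish_reason])
    if finish_reason == "length":
        # Make truncated generations easy to find without parsing response bodies
        span.add_event("generation.truncated", {"requested_max_tokens_reached": True})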
Auto-Instrumentation: LlamaIndex + OpenTelemetry
LlamaIndex has first-class OpenTelemetry support. Two packages are available:
- openinference-instrumentation-llama-index, maintained by Arize/OpenInference and recommended in the official LlamaIndex docs
- opentelemetry-instrumentation-llamaindex, from Traceloop/OpenLLMetry (Traceloop was acquired by ServiceNow in March 2026; the library continues under Apache 2.0)
Both produce OpenInference-compatible spans. The examples below use the Arize package:
pip install openinference-instrumentation-llama-index opentelemetry-exporter-otlp
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
def setup_tracing():
    exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
    )
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    # Instruments LLM calls, retrievers, and embeddings automatically
    LlamaIndexInstrumentor().instrument()
setup_tracing()
# Your existing LlamaIndex code works unchanged
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is context window truncation?")
LlamaIndexInstrumentor().instrument() is a single call. After it runs, every LlamaIndex operation — VectorStoreRetriever, OpenAIEmbedding, OpenAI LLM, and the query engine itself — produces OpenInference-compatible spans with attributes like retrieval.documents, document.score, embedding.model_name, llm.token_count.prompt, and llm.token_count.completion. Note that OpenInference uses its own attribute names — not the rag.* custom attributes described earlier in this guide. The rag.* attributes are for manual instrumentation of custom pipelines; if you use auto-instrumentation, filter and alert on the OpenInference attribute names instead.
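If you also want the rag.* attributes, one option is to wrap the auto-instrumented query in a manual parent span and set them there. A sketch, assuming setup_tracing() from above has already run and that response.source_nodes (the standard LlamaIndex response field) holds the retrieved chunks:

from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")

def query_with_rag_attributes(query_engine, question: str):
    with tracer.start_as_current_span("rag.query") as span:
        span.set_attribute("rag.query.text", question)
        # Auto-instrumented retriever/embedding/LLM spans become children of this span
        response = query_engine.query(question)
        nodes = getattr(response, "source_nodes", []) or []
        span.set_attribute("rag.retrieval.results_count", len(nodes))
        span.set_attribute("rag.retrieval.empty_result", len(nodes) == 0)
        return response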
PII control: To prevent query text and retrieved documents from appearing in traces, set:
OPENINFERENCE_HIDE_INPUTS=true
OPENINFERENCE_HIDE_OUTPUTS=true
These are the OpenInference privacy controls. You can also set OPENINFERENCE_HIDE_EMBEDDING_VECTORS=true and OPENINFERENCE_HIDE_INPUT_TEXT=true for more granular control. This suppresses input/output content in span attributes while preserving token counts, latency, and structural attributes.
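A small sketch of setting the same switches from code; they need to be in place before LlamaIndexInstrumentor().instrument() runs so the instrumentation picks them up:

import os

os.environ.setdefault("OPENINFERENCE_HIDE_INPUTS", "true")
os.environ.setdefault("OPENINFERENCE_HIDE_OUTPUTS", "true")
os.environ.setdefault("OPENINFERENCE_HIDE_EMBEDDING_VECTORS", "true")

setup_tracing()  # instrument only after the variables are set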
Auto-Instrumentation: LangChain + OpenTelemetry
LangChain uses a similar pattern via the OpenLLMetry instrumentation package. Traceloop, the maintainer of opentelemetry-instrumentation-langchain, was acquired by ServiceNow in March 2026; the library continues under Apache 2.0.
pip install opentelemetry-instrumentation-langchain opentelemetry-exporter-otlp
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.langchain import LangChainInstrumentor
def setup_tracing():
    exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
    )
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    LangChainInstrumentor().instrument()
setup_tracing()
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# allow_dangerous_deserialization is required by recent langchain_community releases when loading a locally pickled index
vectorstore = FAISS.load_local("./faiss_index", embeddings, allow_dangerous_deserialization=True)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = qa_chain.invoke({"query": "Explain vector search latency."})
The LangChainInstrumentor wraps the RetrievalQA chain and produces a span tree that shows:
- The root chain invocation with overall latency
- A retriever child span with the number of documents retrieved
- An llm child span with gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.finish_reasons
The gap between auto-instrumentation and complete observability is rag.retrieval.empty_result and rag.context.truncated — the LangChain instrumentation does not currently set these. Add them with manual spans, shown in the next section.
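Before that, a minimal sketch of the idea in LangChain terms: ask the chain to return its source documents (return_source_documents=True is a standard RetrievalQA option) and set rag.retrieval.empty_result on a manual parent span; rag.context.truncated still needs a hook inside your own prompt-assembly code. Variable names match the example above:

from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=retriever, return_source_documents=True
)

def answer(question: str) -> str:
    with tracer.start_as_current_span("rag.query") as span:
        span.set_attribute("rag.query.text", question)
        result = qa_chain.invoke({"query": question})
        docs = result.get("source_documents", [])
        span.set_attribute("rag.retrieval.results_count", len(docs))
        span.set_attribute("rag.retrieval.empty_result", len(docs) == 0)
        return result["result"]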
Manual Instrumentation for Custom RAG Pipelines
Auto-instrumentation covers the standard LlamaIndex and LangChain call paths. For custom retrievers, custom rerankers, or any logic outside those frameworks, instrument manually:
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
exporter = OTLPSpanExporter(
endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag.pipeline")
def embed_query(query: str) -> list[float]:
    # Replace with your actual embedding call
    import time
    start = time.monotonic()
    vector = your_embedding_model.encode(query)
    duration_ms = (time.monotonic() - start) * 1000
    span = trace.get_current_span()
    span.set_attribute("rag.query.text", query)
    span.set_attribute("rag.embedding.model", "text-embedding-3-small")
    span.set_attribute("rag.embedding.duration_ms", round(duration_ms))
    return vector

def retrieve(vector: list[float], top_k: int = 5) -> list[dict]:
    results = your_vector_db.search(vector, top_k=top_k)
    span = trace.get_current_span()
    span.set_attribute("rag.retrieval.top_k", top_k)
    span.set_attribute("rag.retrieval.results_count", len(results))
    span.set_attribute("rag.retrieval.empty_result", len(results) == 0)
    return results

def assemble_context(chunks: list[dict], max_tokens: int = 3000) -> tuple[str, bool]:
    text = "\n\n".join(c["text"] for c in chunks)
    token_count = len(text.split())  # simplified; use a real tokenizer
    truncated = token_count > max_tokens
    if truncated:
        text = " ".join(text.split()[:max_tokens])
    span = trace.get_current_span()
    span.set_attribute("rag.context.token_count", token_count)
    span.set_attribute("rag.context.truncated", truncated)
    return text, truncated

def run_rag_pipeline(query: str) -> str:
    with tracer.start_as_current_span("rag.query") as root_span:
        root_span.set_attribute("rag.query.text", query)

        with tracer.start_as_current_span("rag.embed"):
            vector = embed_query(query)

        with tracer.start_as_current_span("rag.retrieve"):
            chunks = retrieve(vector)

        if not chunks:
            root_span.set_attribute("rag.retrieval.empty_result", True)
            return "No relevant documents found."

        with tracer.start_as_current_span("rag.assemble"):
            context, truncated = assemble_context(chunks)

        with tracer.start_as_current_span("rag.generate"):
            response = your_llm.generate(context=context, query=query)
            span = trace.get_current_span()
            span.set_attribute("gen_ai.request.model", "gpt-4o")
            span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
            span.set_attribute("gen_ai.response.finish_reasons", [response.choices[0].finish_reason])

        return response.choices[0].message.content
The key instrumentation points are in retrieve() — setting rag.retrieval.empty_result — and in assemble_context() — setting rag.context.truncated. These two booleans power the most important production alerts.
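The len(text.split()) heuristic in assemble_context() undercounts real tokens. A more accurate sketch using tiktoken, assuming the model name maps to a known encoding and falling back to cl100k_base otherwise:

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # fallback for unknown model names
    return len(encoding.encode(text))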
Sending RAG Traces to Uptrace
Uptrace, an OpenTelemetry-native APM, accepts traces via OTLP and indexes all span attributes as queryable fields. The exporter configuration is the same across all three approaches:
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
exporter = OTLPSpanExporter(
endpoint="https://api.uptrace.dev:4317",
headers={"uptrace-dsn": "https://<secret>@api.uptrace.dev?grpc=4317"},
)
After traces arrive, distributed tracing in Uptrace lets you:
- Filter by rag.retrieval.empty_result = true to find all requests where retrieval returned nothing
- Group by gen_ai.request.model to compare token usage and latency across model versions
- Filter by gen_ai.response.finish_reasons containing "length" to find truncated responses
- Sort by span duration to identify which pipeline stage is the latency bottleneck
The rag.* attributes are stored as custom attributes in Uptrace and are immediately searchable without any schema configuration.
What to Alert On
Four alert conditions cover the most common RAG failure modes. Tie them to SLA/SLO monitoring requirements to define acceptable thresholds per environment.
| Condition | Suggested threshold | What it means |
|---|---|---|
| rag.retrieval.empty_result rate | > 5% of requests | Index staleness, query distribution shift |
| gen_ai.response.finish_reasons contains "length" rate | > 2% of requests | Context too large, max_tokens too low |
| p95 vector search latency | > 500ms | Vector DB performance degradation |
| p95 end-to-end RAG latency | > 3s | Compound latency across all stages |
Empty retrieval above 5% typically indicates the vector index is stale or the embedding model was swapped without re-indexing. A finish_reasons containing "length" rate above 2% means users are consistently receiving incomplete answers — investigate rag.context.truncated in the same traces to confirm.
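If your backend alerts on metrics rather than trace queries, one option is to emit counters alongside the span attributes and alert on their rates. A minimal sketch using the OTel metrics API; a MeterProvider must be configured separately, and the metric names here are suggestions, not part of any convention:

from opentelemetry import metrics

meter = metrics.get_meter("rag.pipeline")
empty_retrievals = meter.create_counter("rag.retrieval.empty_result.count")
truncated_generations = meter.create_counter("rag.generation.length_stop.count")

def record_outcome(empty_result: bool, finish_reason: str) -> None:
    if empty_result:
        empty_retrievals.add(1)
    if finish_reason == "length":
        truncated_generations.add(1)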
FAQ
What is RAG observability?
RAG observability is the practice of tracing and measuring every stage of a Retrieval-Augmented Generation pipeline — embedding, vector search, reranking, context assembly, and LLM generation. The goal is to make failures visible in production, because most RAG failure modes (empty retrieval, context truncation, incomplete generation) produce no exception and no error log by default.
How do I trace a RAG pipeline with OpenTelemetry?
Use LlamaIndexInstrumentor().instrument() or LangChainInstrumentor().instrument() for framework-based pipelines. For custom pipelines, use tracer.start_as_current_span() to create a span per stage and call span.set_attribute() to record attributes like rag.retrieval.results_count and rag.context.truncated. Export spans via OTLP to any compatible backend.
Does LlamaIndex support OpenTelemetry?
Yes. Install openinference-instrumentation-llama-index (recommended, maintained by Arize/OpenInference) or opentelemetry-instrumentation-llamaindex (from OpenLLMetry/ServiceNow) and call LlamaIndexInstrumentor().instrument() once at startup. Both automatically instrument LLM calls, embedding calls, and vector store retrievers using the OpenInference semantic conventions. Spans are exported via the standard OTLP exporter.
Does LangChain support OpenTelemetry tracing?
Yes. Install opentelemetry-instrumentation-langchain and call LangChainInstrumentor().instrument(). The instrumentation wraps LangChain chains and produces spans for each component. It captures gen_ai.* attributes on LLM calls and retriever metadata, though some RAG-specific attributes like rag.retrieval.empty_result require manual instrumentation.
What attributes should I add to RAG spans?
The most operationally useful attributes are: rag.retrieval.empty_result (boolean), rag.retrieval.results_count (integer), rag.context.truncated (boolean), rag.context.token_count (integer), and gen_ai.response.finish_reasons (string array). These attributes are sufficient to detect and diagnose the most common silent failures in a RAG pipeline.
How is RAG observability different from standard LLM observability?
LLM observability focuses on a single model interaction: token usage, latency, cost, and finish reason. RAG observability adds the retrieval layer — you need to trace what was retrieved, how much, whether the retrieval was empty, and whether the context was truncated before it reached the model. An LLM call with zero retrieved context looks identical to one with ten chunks unless you instrument the retrieval stages explicitly.
Conclusion
RAG pipelines fail in ways that standard observability tools cannot detect: empty vector search results, silent context truncation, and finish_reasons containing "length" on the LLM response. The fix is span-level instrumentation at each stage with attributes that expose these states directly.
For LlamaIndex and LangChain, a single instrumentation call handles the most common spans automatically — note that auto-instrumentation libraries emit their own attribute schemas (OpenInference for LlamaIndex, OpenLLMetry for LangChain), which differ from the custom rag.* attributes used in manual instrumentation. For custom pipelines, manual spans with rag.retrieval.empty_result and rag.context.truncated cover the two most critical failure modes. Export spans to Uptrace or any OTLP-compatible backend and filter on these attributes to find failures immediately rather than through user complaints.
The OTel GenAI SIG is working toward official RAG semantic conventions. Until that specification stabilizes, the custom rag.* attributes used in manual instrumentation and the OpenInference attributes emitted by auto-instrumentation are both supported by Uptrace's attribute indexing.