# RAG Pipeline Observability with OpenTelemetry

> How to trace every stage of a RAG pipeline with OpenTelemetry — embedding, vector search, reranking, context assembly, and LLM call — using LlamaIndex, LangChain, and Uptrace.

RAG pipeline observability is the practice of tracing every stage of a Retrieval-Augmented Generation system — from query embedding to vector search to LLM generation — so you can diagnose failures, measure latency, and track quality in production. Without it, a pipeline can return empty or truncated answers and produce no errors, no logs, and no signal that anything went wrong.

Standard APM tools cover HTTP requests, database queries, and service latency. They are not designed for the multi-stage data flow inside a RAG pipeline: a question passes through an embedding model, a vector database, an optional reranker, a context assembly step, and finally an LLM. Each stage can fail independently. Standard APM sees only the outer request; it cannot tell you which stage took 800ms or why the LLM received 0 retrieved chunks. For that you need distributed tracing with span-level attributes specific to retrieval-augmented generation.

For [OpenAI instrumentation with OpenTelemetry](/guides/opentelemetry-openai), the story is simpler — one model, one API call. RAG pipelines add retrieval complexity that requires a different instrumentation strategy.

If you're new to LLM instrumentation, start with our [OpenTelemetry for AI systems guide](/blog/opentelemetry-ai-systems) first — this guide assumes you're already familiar with `gen_ai.*` spans and focuses specifically on the retrieval stages.

## Why RAG Pipelines Fail Silently

Each of the five RAG stages can fail without raising an exception or returning an error HTTP status:

**1. Embedding.** The embedding model changes between versions or is swapped for a cheaper alternative. Semantic similarity scores drift. Retrieval quality degrades slowly. No error fires.

**2. Vector search.** The query returns zero results because the index is stale, the similarity threshold is too strict, or the query was phrased unusually. The pipeline continues with an empty context and the LLM either hallucinates or returns "I don't know."

**3. Reranking.** The reranker scores all candidates below a cutoff. The result set collapses from ten candidates to zero. No exception, no warning.

**4. Context assembly.** Retrieved chunks exceed the model's context window. The assembly step silently truncates. The LLM receives an incomplete prompt and generates a partial or misleading answer.

**5. LLM generation.** The response contains `finish_reason = "length"` — the model hit the `max_tokens` limit before completing its answer. Most applications discard this field and return the truncated text to the user as if it were complete.

None of these stages raise exceptions by default. Without explicit instrumentation and attribute capture at each step, these bugs are invisible in production.
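Four of these failure modes can be detected with plain conditionals once each stage reports its outcome. The sketch below assumes a hypothetical `StageOutcome` record that your pipeline fills in per request; the class, its field names, and the returned labels are illustrative and do not come from any library. Embedding drift is left out because it needs an offline quality evaluation rather than a per-request check.

```python
from dataclasses import dataclass


@dataclass
class StageOutcome:
    """Hypothetical per-request summary; field names are illustrative."""
    retrieved_count: int      # results returned by vector search
    reranked_count: int       # candidates left after the reranker cutoff
    context_truncated: bool   # True if context assembly dropped text
    finish_reason: str        # e.g. "stop" or "length" from the LLM response


def silent_failures(outcome: StageOutcome) -> list[str]:
    """Return the names of failure modes that raised no exception."""
    found = []
    if outcome.retrieved_count == 0:
        found.append("empty_retrieval")
    elif outcome.reranked_count == 0:
        # Retrieval worked, but the reranker discarded every candidate.
        found.append("reranker_dropped_all")
    if outcome.context_truncated:
        found.append("context_truncated")
    if outcome.finish_reason == "length":
        found.append("generation_truncated")
    return found
```

Each label corresponds to one boolean or counter that the span attributes in the next section make queryable.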

## The RAG Trace: What Each Span Should Capture

A well-instrumented RAG pipeline produces a trace with five child spans inside a parent `rag.query` span. Below are the attributes each span should carry.

> **Note on standards:** The OTel GenAI semantic conventions (including `gen_ai.*` attributes) carry **Development** stability — they may change in future releases. To opt into experimental GenAI attributes before they stabilize, set `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental`. The `rag.*` attributes used in this guide are **custom attributes** chosen for clarity; they are not part of any official specification. OpenInference (used by LlamaIndex auto-instrumentation) emits a different attribute schema — see the auto-instrumentation section below for details.

### Span 1: Query Embedding

<table>
<thead>
  <tr>
    <th>
      Attribute
    </th>
    
    <th>
      Example
    </th>
    
    <th>
      Purpose
    </th>
  </tr>
</thead>

<tbody>
  <tr>
    <td>
      <code>
        rag.query.text
      </code>
    </td>
    
    <td>
      <code>
        "What is chunking?"
      </code>
    </td>
    
    <td>
      The raw user query
    </td>
  </tr>
  
  <tr>
    <td>
      <code>
        rag.embedding.model
      </code>
    </td>
    
    <td>
      <code>
        text-embedding-3-small
      </code>
    </td>
    
    <td>
      Model used for embedding
    </td>
  </tr>
  
  <tr>
    <td>
      <code>
        rag.embedding.duration_ms
      </code>
    </td>
    
    <td>
      <code>
        42
      </code>
    </td>
    
    <td>
      Latency for the embed call
    </td>
  </tr>
</tbody>
</table>

### Span 2: Vector Search

<table>
<thead>
  <tr>
    <th>
      Attribute
    </th>
    
    <th>
      Example
    </th>
    
    <th>
      Purpose
    </th>
  </tr>
</thead>

<tbody>
  <tr>
    <td>
      <code>
        rag.retrieval.query
      </code>
    </td>
    
    <td>
      <code>
        "What is chunking?"
      </code>
    </td>
    
    <td>
      Query sent to the vector DB
    </td>
  </tr>
  
  <tr>
    <td>
      <code>
        rag.retrieval.top_k
      </code>
    </td>
    
    <td>
      <code>
        5
      </code>
    </td>
    
    <td>
      Number of results requested
    </td>
  </tr>
  
  <tr>
    <td>
      <code>
        rag.retrieval.results_count
      </code>
    </td>
    
    <td>
      <code>
        0
      </code>
    </td>
    
    <td>
      Number of results returned
    </td>
  </tr>
  
  <tr>
    <td>
      <code>
        rag.retrieval.empty_result
      </code>
    </td>
    
    <td>
      <code>
        true
      </code>
    </td>
    
    <td>
      Boolean flag for empty retrieval
    </td>
  </tr>
</tbody>
</table>

The `rag.retrieval.empty_result` boolean is the most important attribute in the entire trace. It lets you filter for all requests where retrieval failed silently.

### Span 3: Reranking (optional)

<table>
<thead>
  <tr>
    <th>
      Attribute
    </th>
    
    <th>
      Example
    </th>
    
    <th>
      Purpose
    </th>
  </tr>
</thead>

<tbody>
  <tr>
    <td>
      <code>
        rag.reranking.model
      </code>
    </td>
    
    <td>
      <code>
        cohere-rerank-v3
      </code>
    </td>
    
    <td>
      Reranker model
    </td>
  </tr>
  
  <tr>
    <td>
      <code>
        rag.reranking.scores
      </code>
    </td>
    
    <td>
      <code>
        [0.91, 0.74, 0.52]
      </code>
    </td>
    
    <td>
      Score list for retrieved chunks
    </td>
  </tr>
</tbody>
</table>
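As a rough illustration of how the reranking attributes might be computed, the helper below builds them as a plain dictionary. `reranking_attributes` is a hypothetical function, and `rag.reranking.cutoff` and `rag.reranking.kept_count` are assumed additions that are not in the table above; they are included because a kept count of zero is the signal for the collapsed-result-set failure described earlier.

```python
def reranking_attributes(model: str, scores: list[float], cutoff: float) -> dict:
    """Build span attributes for a reranking step.

    `rag.reranking.cutoff` and `rag.reranking.kept_count` are hypothetical
    additions to the attribute table, used here for illustration only.
    """
    kept = [s for s in scores if s >= cutoff]
    return {
        "rag.reranking.model": model,
        # Span attribute values may be homogeneous lists of primitives.
        "rag.reranking.scores": [float(s) for s in scores],
        "rag.reranking.cutoff": float(cutoff),
        "rag.reranking.kept_count": len(kept),
    }


attrs = reranking_attributes("cohere-rerank-v3", [0.91, 0.74, 0.52], cutoff=0.6)
```

Inside a reranking span, the dictionary can be attached in one call with `span.set_attributes(attrs)`.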

### Span 4: Context Assembly

<table>
<thead>
  <tr>
    <th>
      Attribute
    </th>
    
    <th>
      Example
    </th>
    
    <th>
      Purpose
    </th>
  </tr>
</thead>

<tbody>
  <tr>
    <td>
      <code>
        rag.context.token_count
      </code>
    </td>
    
    <td>
      <code>
        3840
      </code>
    </td>
    
    <td>
      Total tokens in assembled context
    </td>
  </tr>
  
  <tr>
    <td>
      <code>
        rag.context.truncated
      </code>
    </td>
    
    <td>
      <code>
        true
      </code>
    </td>
    
    <td>
      Whether context was cut
    </td>
  </tr>
</tbody>
</table>

### Span 5: LLM Generation

<table>
<thead>
  <tr>
    <th>
      Attribute
    </th>
    
    <th>
      Example
    </th>
    
    <th>
      Purpose
    </th>
  </tr>
</thead>

<tbody>
  <tr>
    <td>
      <code>
        gen_ai.request.model
      </code>
    </td>
    
    <td>
      <code>
        gpt-4o
      </code>
    </td>
    
    <td>
      Requested model
    </td>
  </tr>
  
  <tr>
    <td>
      <code>
        gen_ai.usage.input_tokens
      </code>
    </td>
    
    <td>
      <code>
        4096
      </code>
    </td>
    
    <td>
      Tokens in the prompt
    </td>
  </tr>
  
  <tr>
    <td>
      <code>
        gen_ai.usage.output_tokens
      </code>
    </td>
    
    <td>
      <code>
        128
      </code>
    </td>
    
    <td>
      Tokens in the response
    </td>
  </tr>
  
  <tr>
    <td>
      <code>
        gen_ai.response.finish_reasons
      </code>
    </td>
    
    <td>
      <code>
        ["length"]
      </code>
    </td>
    
    <td>
      Why generation stopped (string array)
    </td>
  </tr>
</tbody>
</table>

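When `gen_ai.response.finish_reasons` contains `"length"`, the LLM hit `max_tokens` and the answer is incomplete. Check this value alongside `rag.context.truncated` on the same trace: an oversized prompt can trigger both, but each flag can also appear on its own, so neither one replaces the other.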

## Auto-Instrumentation: LlamaIndex + OpenTelemetry

LlamaIndex has first-class OpenTelemetry support. Two packages are available:

- `openinference-instrumentation-llama-index` — maintained by Arize/OpenInference, [recommended in the official LlamaIndex docs](https://docs.llamaindex.ai/en/stable/module_guides/observability/)
- `opentelemetry-instrumentation-llamaindex` — from Traceloop/OpenLLMetry (Traceloop was acquired by ServiceNow in March 2026; the library continues under Apache 2.0)

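The two packages emit different attribute schemas: the Arize package follows the OpenInference conventions, while the Traceloop package follows OpenLLMetry's conventions. The examples below use the Arize package: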

```bash
pip install openinference-instrumentation-llama-index opentelemetry-exporter-otlp
```

```python
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

def setup_tracing():
    exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
    )
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # Instruments LLM calls, retrievers, and embeddings automatically
    LlamaIndexInstrumentor().instrument()

setup_tracing()

# Your existing LlamaIndex code works unchanged
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What is context window truncation?")
```

`LlamaIndexInstrumentor().instrument()` is a single call. After it runs, LlamaIndex operations such as `VectorIndexRetriever`, `OpenAIEmbedding`, the `OpenAI` LLM, and the query engine itself produce OpenInference-compatible spans with attributes like `retrieval.documents`, `document.score`, `embedding.model_name`, `llm.token_count.prompt`, and `llm.token_count.completion`. Note that OpenInference uses its own attribute names, not the `rag.*` custom attributes described earlier in this guide. The `rag.*` attributes are for manual instrumentation of custom pipelines; if you use auto-instrumentation, filter and alert on the OpenInference attribute names instead.

**PII control:** To prevent query text and retrieved documents from appearing in traces, set:

```bash
OPENINFERENCE_HIDE_INPUTS=true
OPENINFERENCE_HIDE_OUTPUTS=true
```

These are the OpenInference privacy controls. For more granular control, set `OPENINFERENCE_HIDE_EMBEDDING_VECTORS=true` or `OPENINFERENCE_HIDE_INPUT_TEXT=true` instead of hiding all inputs. These settings suppress input and output content in span attributes while preserving token counts, latency, and structural attributes.
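If the variables cannot be set in the shell or deployment manifest, one option is to set them from Python before instrumenting. This is a minimal sketch that assumes the instrumentor reads the variables when it is initialized, so the call has to come first; `apply_privacy_defaults` and `PRIVACY_ENV` are hypothetical names.

```python
import os
from collections.abc import MutableMapping

# OpenInference privacy switches named in this section.
PRIVACY_ENV = {
    "OPENINFERENCE_HIDE_INPUTS": "true",
    "OPENINFERENCE_HIDE_OUTPUTS": "true",
    "OPENINFERENCE_HIDE_EMBEDDING_VECTORS": "true",
}


def apply_privacy_defaults(env: MutableMapping = os.environ) -> None:
    """Set each switch unless the deployment already configured it."""
    for name, value in PRIVACY_ENV.items():
        env.setdefault(name, value)


apply_privacy_defaults()
# Call LlamaIndexInstrumentor().instrument() only after this point.
```

Using `setdefault` keeps any value an operator has already set, so a deliberate override in one environment is not silently replaced.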

## Auto-Instrumentation: LangChain + OpenTelemetry

LangChain uses a similar pattern via the OpenLLMetry instrumentation package. Traceloop, the maintainer of `opentelemetry-instrumentation-langchain`, was acquired by ServiceNow in March 2026; the library continues under Apache 2.0.

```bash
pip install opentelemetry-instrumentation-langchain opentelemetry-exporter-otlp
```

```python
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.langchain import LangChainInstrumentor

def setup_tracing():
    exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
    )
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    LangChainInstrumentor().instrument()

setup_tracing()

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
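# Recent langchain_community releases refuse to load a pickled index without
# this flag; enable it only for index files you created yourself.
vectorstore = FAISS.load_local(
    "./faiss_index", embeddings, allow_dangerous_deserialization=True
)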
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

result = qa_chain.invoke({"query": "Explain vector search latency."})
```

The `LangChainInstrumentor` wraps the `RetrievalQA` chain and produces a span tree that shows:

- The root chain invocation with overall latency
- A `retriever` child span with the number of documents retrieved
- An `llm` child span with `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, and `gen_ai.response.finish_reasons`

The gap between auto-instrumentation and complete observability is `rag.retrieval.empty_result` and `rag.context.truncated` — the LangChain instrumentation does not currently set these. Add them with manual spans, shown in the next section.
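One way to close part of that gap without changing the chain is to wrap the retriever call so that every invocation reports the missing flags. The sketch below is framework-agnostic: `with_retrieval_flags` is a hypothetical helper, `retrieve` is any callable that takes a query and returns a sequence of documents, and `record` is any callable that accepts an attribute dictionary.

```python
from typing import Any, Callable, Sequence


def with_retrieval_flags(
    retrieve: Callable[[str], Sequence[Any]],
    record: Callable[[dict], None],
    top_k: int,
) -> Callable[[str], Sequence[Any]]:
    """Wrap a retriever callable so every call reports the rag.* flags."""

    def wrapped(query: str) -> Sequence[Any]:
        docs = retrieve(query)
        record({
            "rag.retrieval.top_k": top_k,
            "rag.retrieval.results_count": len(docs),
            "rag.retrieval.empty_result": len(docs) == 0,
        })
        return docs

    return wrapped
```

With LangChain, `retrieve` could be `retriever.invoke` and `record` could be `lambda attrs: trace.get_current_span().set_attributes(attrs)`. The lambda matters: it looks up the current span at call time rather than when the wrapper is built.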

## Manual Instrumentation for Custom RAG Pipelines

Auto-instrumentation covers the standard LlamaIndex and LangChain call paths. For custom retrievers, custom rerankers, or any logic outside those frameworks, we instrument manually:

```python
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("rag.pipeline")


def embed_query(query: str) -> list[float]:
    # Replace with your actual embedding call
    import time
    start = time.monotonic()
    vector = your_embedding_model.encode(query)
    duration_ms = (time.monotonic() - start) * 1000

    span = trace.get_current_span()
    span.set_attribute("rag.query.text", query)
    span.set_attribute("rag.embedding.model", "text-embedding-3-small")
    span.set_attribute("rag.embedding.duration_ms", round(duration_ms))
    return vector


def retrieve(vector: list[float], top_k: int = 5) -> list[dict]:
    results = your_vector_db.search(vector, top_k=top_k)
    span = trace.get_current_span()
    span.set_attribute("rag.retrieval.top_k", top_k)
    span.set_attribute("rag.retrieval.results_count", len(results))
    span.set_attribute("rag.retrieval.empty_result", len(results) == 0)
    return results


def assemble_context(chunks: list[dict], max_tokens: int = 3000) -> tuple[str, bool]:
    text = "\n\n".join(c["text"] for c in chunks)
    token_count = len(text.split())  # simplified; use a real tokenizer
    truncated = token_count > max_tokens
    if truncated:
        text = " ".join(text.split()[:max_tokens])

    span = trace.get_current_span()
    span.set_attribute("rag.context.token_count", token_count)
    span.set_attribute("rag.context.truncated", truncated)
    return text, truncated


def run_rag_pipeline(query: str) -> str:
    with tracer.start_as_current_span("rag.query") as root_span:
        root_span.set_attribute("rag.query.text", query)

        with tracer.start_as_current_span("rag.embed"):
            vector = embed_query(query)

        with tracer.start_as_current_span("rag.retrieve"):
            chunks = retrieve(vector)

        if not chunks:
            root_span.set_attribute("rag.retrieval.empty_result", True)
            return "No relevant documents found."

        with tracer.start_as_current_span("rag.assemble"):
            context, truncated = assemble_context(chunks)

        with tracer.start_as_current_span("rag.generate"):
            response = your_llm.generate(context=context, query=query)
            span = trace.get_current_span()
            span.set_attribute("gen_ai.request.model", "gpt-4o")
            span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
            span.set_attribute("gen_ai.response.finish_reasons", [response.choices[0].finish_reason])

        return response.choices[0].message.content
```

The key instrumentation points are in `retrieve()` — setting `rag.retrieval.empty_result` — and in `assemble_context()` — setting `rag.context.truncated`. These two booleans power the most important production alerts.
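The `assemble_context()` above cuts the joined text at a word count, which can split a chunk in the middle. A possible variant, sketched here under stated assumptions, keeps whole chunks until the budget is spent and accepts a pluggable token counter. The `count_tokens` parameter and the `rag.context.dropped_chunks` attribute are hypothetical additions, and `rag.context.token_count` here means the tokens actually kept rather than the pre-truncation total.

```python
from typing import Callable


def assemble_context(
    chunks: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> tuple[str, dict]:
    """Keep whole chunks in order until the token budget is spent."""
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # this chunk and all later ones are dropped
        kept.append(chunk)
        used += cost
    attributes = {
        "rag.context.token_count": used,
        "rag.context.truncated": len(kept) < len(chunks),
        # Hypothetical extra attribute, not defined in the guide's tables.
        "rag.context.dropped_chunks": len(chunks) - len(kept),
    }
    return "\n\n".join(kept), attributes


text, attrs = assemble_context(["a b c", "d e", "f g h i"], max_tokens=5)
```

The default counter splits on whitespace, as in the original function; passing the tokenizer that matches your model gives counts that line up with `gen_ai.usage.input_tokens`.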

## Sending RAG Traces to Uptrace

Uptrace, an [OpenTelemetry-native APM](/opentelemetry/apm), accepts traces via OTLP and indexes all span attributes as queryable fields. The exporter configuration is the same across all three approaches:

```python
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="https://api.uptrace.dev:4317",
    headers={"uptrace-dsn": "https://<secret>@api.uptrace.dev?grpc=4317"},
)
```

After traces arrive, [distributed tracing in Uptrace](/product/tracing) lets you:

- **Filter by rag.retrieval.empty_result = true** to find all requests where retrieval returned nothing
- **Group by gen_ai.request.model** to compare token usage and latency across model versions
- **Filter by gen_ai.response.finish_reasons containing "length"** to find truncated responses
- **Sort by span duration** to identify which pipeline stage is the latency bottleneck

The `rag.*` attributes are stored as custom attributes in Uptrace and are immediately searchable without any schema configuration.

## What to Alert On

Four alert conditions cover the most common RAG failure modes. Tie them to [SLA/SLO monitoring requirements](/blog/sla-slo-monitoring-requirements) to define acceptable thresholds per environment.

<table>
<thead>
  <tr>
    <th>
      Condition
    </th>
    
    <th>
      Suggested threshold
    </th>
    
    <th>
      What it means
    </th>
  </tr>
</thead>

<tbody>
  <tr>
    <td>
      <code>
        rag.retrieval.empty_result
      </code>
      
       rate
    </td>
    
    <td>
      > 5% of requests
    </td>
    
    <td>
      Index staleness, query distribution shift
    </td>
  </tr>
  
  <tr>
    <td>
      <code>
        gen_ai.response.finish_reasons
      </code>
      
       contains <code>
        "length"
      </code>
      
       rate
    </td>
    
    <td>
      > 2% of requests
    </td>
    
    <td>
      Context too large, <code>
        max_tokens
      </code>
      
       too low
    </td>
  </tr>
  
  <tr>
    <td>
      p95 vector search latency
    </td>
    
    <td>
      > 500ms
    </td>
    
    <td>
      Vector DB performance degradation
    </td>
  </tr>
  
  <tr>
    <td>
      p95 end-to-end RAG latency
    </td>
    
    <td>
      > 3s
    </td>
    
    <td>
      Compound latency across all stages
    </td>
  </tr>
</tbody>
</table>

Empty retrieval above 5% typically indicates the vector index is stale or the embedding model was swapped without re-indexing. When more than 2% of responses have `finish_reasons` containing `"length"`, users are consistently receiving incomplete answers; check `rag.context.truncated` in the same traces to see whether oversized context is involved.
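To make the four conditions concrete, the sketch below evaluates them over a batch of request records. It assumes each record is a dictionary with hypothetical keys `empty_result`, `finish_reasons`, `search_ms`, and `total_ms` that you have exported from your traces; in practice the tracing backend would usually run these queries for you. The percentile uses the nearest-rank method, which is one of several common definitions.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def evaluate_alerts(requests: list[dict]) -> dict:
    """Return which of the four suggested alert conditions currently fire."""
    total = len(requests)
    empty_rate = sum(r["empty_result"] for r in requests) / total
    length_rate = sum("length" in r["finish_reasons"] for r in requests) / total
    return {
        "empty_retrieval": empty_rate > 0.05,
        "truncated_generation": length_rate > 0.02,
        "slow_vector_search": p95([r["search_ms"] for r in requests]) > 500,
        "slow_end_to_end": p95([r["total_ms"] for r in requests]) > 3000,
    }
```

The thresholds are the suggested values from the table and are meant to be tuned per environment.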

## FAQ

**What is RAG observability?**

RAG observability is the practice of tracing and measuring every stage of a Retrieval-Augmented Generation pipeline — embedding, vector search, reranking, context assembly, and LLM generation. The goal is to make failures visible in production, because most RAG failure modes (empty retrieval, context truncation, incomplete generation) produce no exception and no error log by default.

**How do I trace a RAG pipeline with OpenTelemetry?**

Use `LlamaIndexInstrumentor().instrument()` or `LangChainInstrumentor().instrument()` for framework-based pipelines. For custom pipelines, use `tracer.start_as_current_span()` to create a span per stage and call `span.set_attribute()` to record attributes like `rag.retrieval.results_count` and `rag.context.truncated`. Export spans via OTLP to any compatible backend.

**Does LlamaIndex support OpenTelemetry?**

Yes. Install `openinference-instrumentation-llama-index` (recommended, maintained by Arize/OpenInference) or `opentelemetry-instrumentation-llamaindex` (from OpenLLMetry/ServiceNow) and call `LlamaIndexInstrumentor().instrument()` once at startup. Both automatically instrument LLM calls, embedding calls, and vector store retrievers. The Arize package follows the OpenInference semantic conventions; the OpenLLMetry package uses its own attribute schema. Spans are exported via the standard OTLP exporter.

**Does LangChain support OpenTelemetry tracing?**

Yes. Install `opentelemetry-instrumentation-langchain` and call `LangChainInstrumentor().instrument()`. The instrumentation wraps LangChain chains and produces spans for each component. It captures `gen_ai.*` attributes on LLM calls and retriever metadata, though some RAG-specific attributes like `rag.retrieval.empty_result` require manual instrumentation.

**What attributes should I add to RAG spans?**

The most operationally useful attributes are: `rag.retrieval.empty_result` (boolean), `rag.retrieval.results_count` (integer), `rag.context.truncated` (boolean), `rag.context.token_count` (integer), and `gen_ai.response.finish_reasons` (string array). These attributes are sufficient to detect and diagnose the most common silent failures in a RAG pipeline.

**How is RAG observability different from standard LLM observability?**

[LLM observability](/glossary/llm-observability) focuses on a single model interaction: token usage, latency, cost, and finish reason. RAG observability adds the retrieval layer — you need to trace what was retrieved, how much, whether the retrieval was empty, and whether the context was truncated before it reached the model. An LLM call with zero retrieved context looks identical to one with ten chunks unless you instrument the retrieval stages explicitly.

## Conclusion

RAG pipelines fail in ways that standard observability tools cannot detect: empty vector search results, silent context truncation, and `finish_reasons` containing `"length"` on the LLM response. The fix is span-level instrumentation at each stage with attributes that expose these states directly.

For LlamaIndex and LangChain, a single instrumentation call handles the most common spans automatically — note that auto-instrumentation libraries emit their own attribute schemas (OpenInference for LlamaIndex, OpenLLMetry for LangChain), which differ from the custom `rag.*` attributes used in manual instrumentation. For custom pipelines, manual spans with `rag.retrieval.empty_result` and `rag.context.truncated` cover the two most critical failure modes. Export spans to Uptrace or any OTLP-compatible backend and filter on these attributes to find failures immediately rather than through user complaints.

The OTel GenAI SIG is working toward official RAG semantic conventions. Until that specification stabilizes, the custom `rag.*` attributes used in manual instrumentation and the OpenInference attributes emitted by auto-instrumentation are both supported by Uptrace's attribute indexing.