OpenTelemetry for AI Systems: LLM and Agent Observability (2026)

Alexandr Bandurchin
April 02, 2026
8 min read

Quick Answer: OpenTelemetry instruments LLM applications by wrapping API calls in spans with standardized gen_ai.* attributes — model name, token counts, finish reason — defined by the OpenTelemetry GenAI semantic conventions. For agents, each tool call, LLM invocation, and retrieval step becomes a child span, producing a full trace of the reasoning chain. Auto-instrumentation packages exist for OpenAI, Anthropic, LangChain, and LlamaIndex.

| What you get | How |
| --- | --- |
| Per-request token cost | gen_ai.usage.input_tokens + gen_ai.usage.output_tokens on each span |
| End-to-end agent latency | Trace waterfall across all tool calls and LLM calls |
| Error root cause | Span status + exception events with stack traces |
| Prompt and completion content | Span events (not attributes — for privacy and size control) |
| Cross-service context | W3C traceparent propagated through every HTTP call |

Why AI Systems Need Specialized Observability

Traditional APM tracks deterministic code paths where the same input reliably produces the same output. LLM-based systems break this assumption in several ways:

  • Non-determinism: The same prompt produces different outputs. You can't reproduce an issue without capturing the exact input, model parameters, and temperature at the time of the call.
  • Token-based cost model: Latency and cost correlate with token counts, not request count. A single "slow" request may consume 10× more budget than normal. Request rate monitoring is blind to this.
  • Multi-step agent chains: An agent handling one user request may make 5 LLM calls, 3 tool calls, and 2 vector DB lookups — each a potential failure point. A single error metric can't tell you which step failed.
  • Prompt sensitivity: Prompts often contain PII, confidential business data, or medical information. Sending them to observability backends without sanitization creates compliance risk.
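The token-based cost point above is easy to make concrete once input/output counts are recorded on each span. A minimal sketch (the per-million-token prices here are illustrative placeholders, not real rates):

```python
# Illustrative prices in USD per 1M tokens -- placeholders, not actual provider pricing
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one LLM call, computed from span token attributes."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

Two requests with identical latency can differ in cost by an order of magnitude, which is exactly what request-count dashboards hide.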

| Signal | Traditional app | LLM / AI agent |
| --- | --- | --- |
| Latency driver | CPU, I/O, network | Token count, model size, context window |
| Cost unit | Requests/second | Tokens consumed |
| Failure mode | Exception, timeout | Hallucination, context overflow, tool error |
| Debug artifact | Stack trace | Prompt + completion + reasoning chain |
| Cardinality risk | URL, user ID in span names | Full prompt text in attributes |

OpenTelemetry GenAI Semantic Conventions

The OpenTelemetry GenAI Special Interest Group defines standard gen_ai.* attribute names for LLM instrumentation. Using them ensures your telemetry works with any OTel-compatible backend without custom parsing rules.

Span Attributes

| Attribute | Type | Description | Example |
| --- | --- | --- | --- |
| gen_ai.system | string | LLM provider | openai, anthropic, google_vertex_ai |
| gen_ai.operation.name | string | Operation type | chat, text_completion, embeddings |
| gen_ai.request.model | string | Requested model | gpt-4o, claude-3-7-sonnet |
| gen_ai.response.model | string | Model actually used (may differ) | gpt-4o-2024-11-20 |
| gen_ai.request.temperature | float | Sampling temperature | 0.7 |
| gen_ai.request.max_tokens | int | Max tokens requested | 1024 |
| gen_ai.usage.input_tokens | int | Prompt tokens consumed | 512 |
| gen_ai.usage.output_tokens | int | Completion tokens generated | 128 |
| gen_ai.response.finish_reasons | string[] | Why generation stopped | ["stop"], ["length"] |

Span Events for Prompt and Completion Content

Storing full prompt text in span attributes is an anti-pattern: attributes are always indexed, have size limits, and expose PII in your backend. The GenAI conventions store content in span events instead — events can be filtered or dropped at the Collector level without touching your application code.

| Event name | When emitted | Key event attributes |
| --- | --- | --- |
| gen_ai.system.message | Before LLM call | gen_ai.prompt.role, gen_ai.prompt.content |
| gen_ai.user.message | Before LLM call | gen_ai.prompt.role, gen_ai.prompt.content |
| gen_ai.assistant.message | After LLM call | gen_ai.completion.role, gen_ai.completion.content |
| gen_ai.tool.message | After tool result | gen_ai.completion.role, gen_ai.completion.content |

Auto-Instrumentation

The fastest path to LLM observability. These packages automatically create spans for every API call, attach gen_ai.* attributes, and capture token usage without any changes to business logic.

bash
# OpenAI and Anthropic
pip install opentelemetry-instrumentation-openai
pip install opentelemetry-instrumentation-anthropic

# LangChain and LlamaIndex
pip install opentelemetry-instrumentation-langchain
pip install opentelemetry-instrumentation-llamaindex
python
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor
from opentelemetry.instrumentation.langchain import LangChainInstrumentor

# Call once at startup, before creating any clients
OpenAIInstrumentor().instrument()
AnthropicInstrumentor().instrument()
LangChainInstrumentor().instrument()

# All subsequent API calls are traced automatically
import openai
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this document..."}],
)
# → span: gen_ai.usage.input_tokens=245, gen_ai.usage.output_tokens=89, duration=1.2s

By default, prompt and completion content is not captured. To enable it (only in environments where you control data residency), opt in at instrument time; some packages gate this through the OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT environment variable instead, so check your package's documentation:

python
OpenAIInstrumentor().instrument(capture_message_content=True)

Manual Instrumentation

Use manual instrumentation when you need spans that auto-instrumentation can't create: custom retrieval steps, evaluation scores, business logic, or multi-model orchestration.

LLM Call

python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
import openai

tracer = trace.get_tracer("ai-service")
openai_client = openai.OpenAI()

def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attributes({
            "gen_ai.system": "openai",
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": model,
            "gen_ai.request.temperature": 0.7,
        })

        # Record prompt as event (not attribute) to control size and privacy
        span.add_event("gen_ai.user.message", {
            "gen_ai.prompt.role": "user",
            "gen_ai.prompt.content": prompt[:500],  # cap length
        })

        try:
            response = openai_client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            completion = response.choices[0].message.content

            span.set_attributes({
                "gen_ai.response.model": response.model,
                "gen_ai.usage.input_tokens": response.usage.prompt_tokens,
                "gen_ai.usage.output_tokens": response.usage.completion_tokens,
                "gen_ai.response.finish_reasons": [response.choices[0].finish_reason],
            })

            span.add_event("gen_ai.assistant.message", {
                "gen_ai.completion.role": "assistant",
                "gen_ai.completion.content": completion[:500],
            })

            return completion

        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise

RAG Retrieval Steps

python
def retrieve_context(query: str, top_k: int = 5) -> list[str]:
    with tracer.start_as_current_span("gen_ai.embeddings") as embed_span:
        embed_span.set_attributes({
            "gen_ai.system": "openai",
            "gen_ai.operation.name": "embeddings",
            "gen_ai.request.model": "text-embedding-3-small",
        })
        embedding = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=query,
        )
        embed_span.set_attribute(
            "gen_ai.usage.input_tokens",
            embedding.usage.prompt_tokens,
        )
        vector = embedding.data[0].embedding

    with tracer.start_as_current_span("db.vector_search") as search_span:
        search_span.set_attributes({
            "db.system": "chromadb",
            "db.operation.name": "query",
            "db.vector.top_k": top_k,
        })
        results = vector_db.query(query_embeddings=[vector], n_results=top_k)
        search_span.set_attribute(
            "db.vector.results_count",
            len(results["documents"][0]),
        )
        return results["documents"][0]

Tracing AI Agents

Each step in an agent's execution becomes a child span. The resulting trace waterfall shows exactly where time and tokens are spent across the full reasoning chain.

python
def run_agent(task: str) -> str:
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attributes({
            "gen_ai.system": "openai",
            "agent.name": "research-agent",
            "agent.task": task[:200],
        })

        context_docs = retrieve_context(task)  # creates its own child spans

        messages = [
            {"role": "system", "content": "You are a research assistant."},
            {"role": "user", "content": task},
        ]

        tool_calls_count = 0

        while True:
            response = openai_client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                tools=TOOLS,
            )
            choice = response.choices[0]

            if choice.finish_reason == "tool_calls":
                for tool_call in choice.message.tool_calls:
                    with tracer.start_as_current_span("agent.tool_call") as tool_span:
                        tool_span.set_attributes({
                            "agent.tool.name": tool_call.function.name,
                            "agent.tool.call_id": tool_call.id,
                        })
                        result = execute_tool(tool_call)
                        tool_span.set_attribute(
                            "agent.tool.result_length",
                            len(str(result)),
                        )
                    messages.append({
                        "role": "tool",
                        "content": str(result),
                        "tool_call_id": tool_call.id,
                    })
                    tool_calls_count += 1
            else:
                span.set_attribute("agent.tool_calls_total", tool_calls_count)
                return choice.message.content

The resulting trace:

text
agent.run (1850ms)
├── gen_ai.embeddings (38ms)             ← text-embedding-3-small
├── db.vector_search (45ms)              ← ChromaDB query
├── gen_ai.chat (920ms)                  ← gpt-4o, decides to use tools
├── agent.tool_call: web_search (310ms)
├── agent.tool_call: calculator (8ms)
└── gen_ai.chat (530ms)                  ← gpt-4o, final answer

Metrics for AI Workloads

Traces show individual requests. Metrics show aggregate trends — token burn rate, p99 latency, error rate — that drive cost and reliability decisions.

The GenAI semantic conventions define standard metric names:

| Metric name | Type | Description |
| --- | --- | --- |
| gen_ai.client.token.usage | Counter | Tokens used, by model and token type |
| gen_ai.client.operation.duration | Histogram | LLM call duration in seconds |
| gen_ai.server.request.duration | Histogram | Server-side request duration |
| gen_ai.server.time_to_first_token | Histogram | Streaming: time to first chunk |

python
from opentelemetry import metrics

meter = metrics.get_meter("ai-service")

token_counter = meter.create_counter(
    "gen_ai.client.token.usage",
    unit="token",
    description="Token usage by model and token type",
)

duration_histogram = meter.create_histogram(
    "gen_ai.client.operation.duration",
    unit="s",
    description="LLM call duration",
)

def tracked_llm_call(prompt: str, model: str) -> str:
    import time
    attrs = {"gen_ai.system": "openai", "gen_ai.request.model": model}
    start = time.time()

    try:
        response = openai_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        token_counter.add(
            response.usage.prompt_tokens,
            {**attrs, "gen_ai.token.type": "input"},
        )
        token_counter.add(
            response.usage.completion_tokens,
            {**attrs, "gen_ai.token.type": "output"},
        )
        return response.choices[0].message.content
    finally:
        duration_histogram.record(time.time() - start, attrs)

Key metrics to alert on:

| Metric | Alert condition | Why |
| --- | --- | --- |
| gen_ai.client.token.usage rate | > 2× baseline over 10 min | Runaway loop or prompt injection |
| gen_ai.client.operation.duration p99 | > 30s | Model overloaded or context too large |
| Error rate (spans with error status) | > 2% over 5 min | Rate limiting or quota exhaustion |
| Input/output token ratio | > 10:1 consistently | System prompt too large, optimize it |

Privacy and Data Security

Prompts routinely contain names, emails, financial data, and medical information. Before sending telemetry to any backend:

Option 1: Disable content capture (default)

Auto-instrumentation packages don't capture prompt or completion content by default. For manual instrumentation, truncate and sanitize before adding span events.
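A minimal sanitizer sketch for the manual path, assuming simple regex redaction is sufficient for your data (real deployments often need a dedicated PII-detection library; the patterns here are illustrative):

```python
import re

# Illustrative patterns -- extend for phone numbers, card numbers, etc.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def sanitize(text: str, max_len: int = 500) -> str:
    """Redact obvious PII patterns and cap length before attaching to a span event."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SSN.sub("[SSN]", text)
    return text[:max_len]
```

Pass the result, rather than the raw prompt, as the gen_ai.prompt.content event attribute.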

Option 2: Redact at the Collector

Drop or hash prompt content in the pipeline — no application changes required:

yaml
processors:
  transform:
    trace_statements:
      - context: spanevent
        statements:
          # Remove prompt and completion content from all events
          - delete_matching_keys(attributes, "gen_ai.prompt.content")
          - delete_matching_keys(attributes, "gen_ai.completion.content")

  # Hash for correlation without exposure
  attributes:
    actions:
      - key: gen_ai.prompt.content
        action: hash
        hash_salt: ${env:HASH_SALT}

Option 3: Tail-based filtering

Use the Collector's tail sampling processor to drop entire traces from specific users or sessions before export.

Sampling Strategies for LLM Traffic

LLM calls are slow (100ms–30s) and produce large spans. Apply different rates to different scenarios:

| Scenario | Strategy | Rate |
| --- | --- | --- |
| Development | AlwaysOn | 100% |
| Production, successful calls | TraceIdRatioBased | 5–10% |
| Production, errors | Tail-based | 100% |
| High-token requests (> 2K tokens) | Tail-based attribute filter | 100% |
| Agent runs | Tail-based | 100% (rare and high-value) |

python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of root spans; child spans inherit the decision
sampler = ParentBased(root=TraceIdRatioBased(0.10))

# Pass resource=... here as well if you configure one
provider = TracerProvider(sampler=sampler)

For tail-based sampling — capturing all errors while sampling successful calls — configure the OpenTelemetry Collector tail sampling processor.
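A sketch of such a tail_sampling configuration (policy names are arbitrary; decision_wait should exceed your longest expected trace duration):

```yaml
processors:
  tail_sampling:
    decision_wait: 30s        # LLM calls can run long; wait before deciding
    policies:
      # Keep every trace that contains an error span
      - name: keep-all-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Sample 10% of everything else
      - name: sample-successes
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```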

Visualizing AI Traces with Uptrace

Once your LLM app or agent is instrumented, you need a backend to store and query the data. Uptrace is an open source OpenTelemetry APM built on ClickHouse that understands gen_ai.* semantic conventions out of the box — no custom parsing or dashboard setup required.

Uptrace automatically groups spans by gen_ai.request.model and gen_ai.system, so you can see token usage, latency, and error rates broken down by model and provider without any additional configuration. You can try Uptrace without an account, or self-host it for free.

Best Practices

| Practice | Do | Don't |
| --- | --- | --- |
| Prompt content | Store as span events with a size cap | Store as span attributes (always indexed and exported) |
| Span naming | gen_ai.chat, gen_ai.embeddings | call_gpt4, llm_request_user_123 |
| Token tracking | Separate input/output counters | Single "total" counter (hides cost structure) |
| Model attribute | gen_ai.request.model (standard) | Custom llm_model attribute |
| Error handling | span.record_exception(e) + StatusCode.ERROR | Swallow exceptions, set no status |
| Agent steps | One span per tool call | One span for the entire agent run |
| Sampling | Tail-based, 100% errors | Head-based only (misses important errors) |

FAQ

Do I need to instrument every LLM call manually?
No. Use auto-instrumentation packages for OpenAI, Anthropic, LangChain, or LlamaIndex. Manual instrumentation is for custom logic, retrieval pipelines, evaluation scores, or frameworks without an existing instrumentor.

Should I store prompts in span attributes?
No — store them as span events. Events can be filtered or dropped at the Collector level without touching application code. Attributes are always indexed and always exported. Cap content at 500–1000 characters to avoid large payloads.

How much overhead does OpenTelemetry add to LLM calls?
Under 1ms per call. LLM API latency (100ms–30s) dominates entirely. The main cost is telemetry storage volume, which sampling controls.

Can I trace streaming responses?
Yes. Start the span before the stream, accumulate token counts from chunks, record the completion event after the final chunk, and end the span. Most providers expose total token counts in the final chunk or a separate usage API.
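The accumulation pattern from the answer above can be sketched independently of any SDK. The chunk shape here is an assumption modeled on OpenAI-style streaming (content deltas, with usage totals in the final chunk):

```python
import time

def consume_stream(chunks):
    """Accumulate streamed chunks into full text, usage totals, and time to first token.

    Start the span before calling this; end it (and record gen_ai.usage.*
    attributes plus the completion event) after it returns.
    """
    parts = []
    usage = {}
    first_token_at = None
    start = time.monotonic()
    for chunk in chunks:
        content = chunk.get("content")
        if content:
            if first_token_at is None:
                first_token_at = time.monotonic() - start  # time to first token
            parts.append(content)
        if chunk.get("usage"):
            usage = chunk["usage"]  # providers typically send totals in the last chunk
    return "".join(parts), usage, first_token_at
```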

How do I correlate a trace to the user session?
Add user.id and session.id as span attributes on the root span. They propagate to all child spans automatically through the active context.
