OpenTelemetry for AI Systems: LLM and Agent Observability (2026)
Quick Answer: OpenTelemetry instruments LLM applications by wrapping API calls in spans with standardized gen_ai.* attributes — model name, token counts, finish reason — defined by the OpenTelemetry GenAI semantic conventions. For agents, each tool call, LLM invocation, and retrieval step becomes a child span, producing a full trace of the reasoning chain. Auto-instrumentation packages exist for OpenAI, Anthropic, LangChain, and LlamaIndex.
| What you get | How |
|---|---|
| Per-request token cost | gen_ai.usage.input_tokens + gen_ai.usage.output_tokens on each span |
| End-to-end agent latency | Trace waterfall across all tool calls and LLM calls |
| Error root cause | Span status + exception events with stack traces |
| Prompt and completion content | Span events (not attributes — for privacy and size control) |
| Cross-service context | W3C traceparent propagated through every HTTP call |
Why AI Systems Need Specialized Observability
Traditional APM tracks deterministic code paths where the same input reliably produces the same output. LLM-based systems break this assumption in several ways:
- Non-determinism: The same prompt produces different outputs. You can't reproduce an issue without capturing the exact input, model parameters, and temperature at the time of the call.
- Token-based cost model: Latency and cost correlate with token counts, not request count. A single "slow" request may consume 10× more budget than normal. Request rate monitoring is blind to this.
- Multi-step agent chains: An agent handling one user request may make 5 LLM calls, 3 tool calls, and 2 vector DB lookups — each a potential failure point. A single error metric can't tell you which step failed.
- Prompt sensitivity: Prompts often contain PII, confidential business data, or medical information. Sending them to observability backends without sanitization creates compliance risk.
| Signal | Traditional app | LLM / AI agent |
|---|---|---|
| Latency driver | CPU, I/O, network | Token count, model size, context window |
| Cost unit | Requests/second | Tokens consumed |
| Failure mode | Exception, timeout | Hallucination, context overflow, tool error |
| Debug artifact | Stack trace | Prompt + completion + reasoning chain |
| Cardinality risk | URL, user ID in span names | Full prompt text in attributes |
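Because cost scales with tokens rather than requests, per-request spend can be derived directly from the gen_ai.usage.* span attributes. A minimal sketch; the price table is hypothetical (real per-token pricing varies by model and changes over time):

```python
# Hypothetical prices in USD per 1M tokens -- illustration only;
# real pricing varies by model and changes over time
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def span_cost_usd(span_attrs: dict) -> float:
    """Estimate the cost of one LLM call from its gen_ai.* span attributes."""
    prices = PRICES[span_attrs["gen_ai.request.model"]]
    input_cost = span_attrs["gen_ai.usage.input_tokens"] / 1_000_000 * prices["input"]
    output_cost = span_attrs["gen_ai.usage.output_tokens"] / 1_000_000 * prices["output"]
    return input_cost + output_cost

cost = span_cost_usd({
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.usage.input_tokens": 512,
    "gen_ai.usage.output_tokens": 128,
})
# 512/1M * 2.50 + 128/1M * 10.00 = 0.00128 + 0.00128 = 0.00256 USD
```

Running this aggregation over exported spans (grouped by gen_ai.request.model) is what turns traces into a cost report.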
OpenTelemetry GenAI Semantic Conventions
The OpenTelemetry GenAI Special Interest Group defines standard gen_ai.* attribute names for LLM instrumentation. Using them ensures your telemetry works with any OTel-compatible backend without custom parsing rules.
Span Attributes
| Attribute | Type | Description | Example |
|---|---|---|---|
| gen_ai.system | string | LLM provider | openai, anthropic, google_vertex_ai |
| gen_ai.operation.name | string | Operation type | chat, text_completion, embeddings |
| gen_ai.request.model | string | Requested model | gpt-4o, claude-3-7-sonnet |
| gen_ai.response.model | string | Model actually used (may differ) | gpt-4o-2024-11-20 |
| gen_ai.request.temperature | float | Sampling temperature | 0.7 |
| gen_ai.request.max_tokens | int | Max tokens requested | 1024 |
| gen_ai.usage.input_tokens | int | Prompt tokens consumed | 512 |
| gen_ai.usage.output_tokens | int | Completion tokens generated | 128 |
| gen_ai.response.finish_reasons | string[] | Why generation stopped | ["stop"], ["length"] |
Span Events for Prompt and Completion Content
Storing full prompt text in span attributes is an anti-pattern: attributes are always indexed, have size limits, and expose PII in your backend. The GenAI conventions store content in span events instead — events can be filtered or dropped at the Collector level without touching your application code.
| Event name | When emitted | Key event attributes |
|---|---|---|
| gen_ai.system.message | Before LLM call | gen_ai.prompt.role, gen_ai.prompt.content |
| gen_ai.user.message | Before LLM call | gen_ai.prompt.role, gen_ai.prompt.content |
| gen_ai.assistant.message | After LLM call | gen_ai.completion.role, gen_ai.completion.content |
| gen_ai.tool.message | After tool result | gen_ai.completion.role, gen_ai.completion.content |
Auto-Instrumentation
Auto-instrumentation is the fastest path to LLM observability: these packages automatically create spans for every API call, attach gen_ai.* attributes, and capture token usage without any changes to business logic.
```shell
# OpenAI and Anthropic
pip install opentelemetry-instrumentation-openai
pip install opentelemetry-instrumentation-anthropic

# LangChain and LlamaIndex
pip install opentelemetry-instrumentation-langchain
pip install opentelemetry-instrumentation-llamaindex
```
```python
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor
from opentelemetry.instrumentation.langchain import LangChainInstrumentor

# Call once at startup, before creating any clients
OpenAIInstrumentor().instrument()
AnthropicInstrumentor().instrument()
LangChainInstrumentor().instrument()

# All subsequent API calls are traced automatically
import openai

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this document..."}],
)
# → span: gen_ai.usage.input_tokens=245, gen_ai.usage.output_tokens=89, duration=1.2s
```
By default, prompt and completion content is not captured. To enable it (only in environments where you control data residency):
```python
OpenAIInstrumentor().instrument(capture_message_content=True)
```
Manual Instrumentation
Use manual instrumentation when you need spans that auto-instrumentation can't create: custom retrieval steps, evaluation scores, business logic, or multi-model orchestration.
LLM Call
```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("ai-service")

def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attributes({
            "gen_ai.system": "openai",
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": model,
            "gen_ai.request.temperature": 0.7,
        })
        # Record prompt as event (not attribute) to control size and privacy
        span.add_event("gen_ai.user.message", {
            "gen_ai.prompt.role": "user",
            "gen_ai.prompt.content": prompt[:500],  # cap length
        })
        try:
            response = openai_client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            completion = response.choices[0].message.content
            span.set_attributes({
                "gen_ai.response.model": response.model,
                "gen_ai.usage.input_tokens": response.usage.prompt_tokens,
                "gen_ai.usage.output_tokens": response.usage.completion_tokens,
                "gen_ai.response.finish_reasons": [response.choices[0].finish_reason],
            })
            span.add_event("gen_ai.assistant.message", {
                "gen_ai.completion.role": "assistant",
                "gen_ai.completion.content": completion[:500],
            })
            return completion
        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
```
Embeddings and Vector Search
```python
def retrieve_context(query: str, top_k: int = 5) -> list[str]:
    with tracer.start_as_current_span("gen_ai.embeddings") as embed_span:
        embed_span.set_attributes({
            "gen_ai.system": "openai",
            "gen_ai.operation.name": "embeddings",
            "gen_ai.request.model": "text-embedding-3-small",
        })
        embedding = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=query,
        )
        embed_span.set_attribute(
            "gen_ai.usage.input_tokens",
            embedding.usage.prompt_tokens,
        )
        vector = embedding.data[0].embedding

    with tracer.start_as_current_span("db.vector_search") as search_span:
        search_span.set_attributes({
            "db.system": "chromadb",
            "db.operation.name": "query",
            "db.vector.top_k": top_k,
        })
        results = vector_db.query(query_embeddings=[vector], n_results=top_k)
        search_span.set_attribute(
            "db.vector.results_count",
            len(results["documents"][0]),
        )
        return results["documents"][0]
```
Tracing AI Agents
Each step in an agent's execution becomes a child span. The resulting trace waterfall shows exactly where time and tokens are spent across the full reasoning chain.
```python
def run_agent(task: str) -> str:
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attributes({
            "gen_ai.system": "openai",
            "agent.name": "research-agent",
            "agent.task": task[:200],
        })
        context_docs = retrieve_context(task)  # creates its own child spans
        messages = [
            {"role": "system",
             "content": "You are a research assistant. Context:\n"
                        + "\n".join(context_docs)},
            {"role": "user", "content": task},
        ]
        tool_calls_count = 0
        while True:
            response = openai_client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                tools=TOOLS,
            )
            choice = response.choices[0]
            if choice.finish_reason == "tool_calls":
                # The assistant message requesting the tools must precede
                # the tool-role results in the conversation history
                messages.append(choice.message)
                for tool_call in choice.message.tool_calls:
                    with tracer.start_as_current_span("agent.tool_call") as tool_span:
                        tool_span.set_attributes({
                            "agent.tool.name": tool_call.function.name,
                            "agent.tool.call_id": tool_call.id,
                        })
                        result = execute_tool(tool_call)
                        tool_span.set_attribute(
                            "agent.tool.result_length",
                            len(str(result)),
                        )
                        messages.append({
                            "role": "tool",
                            "content": str(result),
                            "tool_call_id": tool_call.id,
                        })
                        tool_calls_count += 1
            else:
                span.set_attribute("agent.tool_calls_total", tool_calls_count)
                return choice.message.content
```
The resulting trace:
```
agent.run (1850ms)
├── gen_ai.embeddings (38ms)       ← text-embedding-3-small
├── db.vector_search (45ms)        ← ChromaDB query
├── gen_ai.chat (920ms)            ← gpt-4o, decides to use tools
├── agent.tool_call: web_search (310ms)
├── agent.tool_call: calculator (8ms)
└── gen_ai.chat (530ms)            ← gpt-4o, final answer
```
Metrics for AI Workloads
Traces show individual requests. Metrics show aggregate trends — token burn rate, p99 latency, error rate — that drive cost and reliability decisions.
The GenAI semantic conventions define standard metric names:
| Metric name | Type | Description |
|---|---|---|
| gen_ai.client.token.usage | Counter | Tokens used, by model and token type |
| gen_ai.client.operation.duration | Histogram | LLM call duration in seconds |
| gen_ai.server.request.duration | Histogram | Server-side request duration |
| gen_ai.server.time_to_first_token | Histogram | Streaming: time to first chunk |
```python
import time

from opentelemetry import metrics

meter = metrics.get_meter("ai-service")

token_counter = meter.create_counter(
    "gen_ai.client.token.usage",
    unit="token",
    description="Token usage by model and token type",
)
duration_histogram = meter.create_histogram(
    "gen_ai.client.operation.duration",
    unit="s",
    description="LLM call duration",
)

def tracked_llm_call(prompt: str, model: str) -> str:
    attrs = {"gen_ai.system": "openai", "gen_ai.request.model": model}
    start = time.time()
    try:
        response = openai_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        token_counter.add(
            response.usage.prompt_tokens,
            {**attrs, "gen_ai.token.type": "input"},
        )
        token_counter.add(
            response.usage.completion_tokens,
            {**attrs, "gen_ai.token.type": "output"},
        )
        return response.choices[0].message.content
    finally:
        duration_histogram.record(time.time() - start, attrs)
```
Key metrics to alert on:
| Metric | Alert condition | Why |
|---|---|---|
| gen_ai.client.token.usage rate | > 2× baseline over 10 min | Runaway loop or prompt injection |
| gen_ai.client.operation.duration p99 | > 30s | Model overloaded or context too large |
| Error rate (spans with error status) | > 2% over 5 min | Rate limiting or quota exhaustion |
| Input/output token ratio | > 10:1 consistently | System prompt too large, optimize it |
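The first alert condition above is simple rate arithmetic over the token counter. A self-contained sketch of the check (the 2× factor and 10-minute window come from the table; in practice your backend's alerting engine evaluates this over the exported metric):

```python
def token_rate_alert(window_tokens: int, window_minutes: float,
                     baseline_tokens_per_min: float, factor: float = 2.0) -> bool:
    """True when the token rate over the window exceeds factor x baseline."""
    rate = window_tokens / window_minutes
    return rate > factor * baseline_tokens_per_min

# Baseline 10K tokens/min; 250K tokens in 10 min = 25K/min > 20K/min -> alert
fires = token_rate_alert(250_000, 10, 10_000)
# 90K tokens in 10 min = 9K/min -> no alert
quiet = token_rate_alert(90_000, 10, 10_000)
```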
Privacy and Data Security
Prompts routinely contain names, emails, financial data, and medical information. Before sending telemetry to any backend:
Option 1: Disable content capture (default)
Auto-instrumentation packages don't capture prompt or completion content by default. For manual instrumentation, truncate and sanitize before adding span events.
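For manual instrumentation, a small helper can mask obvious PII and truncate content before it ever reaches a span event. A minimal sketch; the regexes for emails and card-like numbers are illustrative only, not a complete PII solution:

```python
import re

# Illustrative patterns only -- real PII detection needs a dedicated library
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def sanitize(content: str, max_len: int = 500) -> str:
    """Mask obvious PII, then truncate, before recording as a span event."""
    content = EMAIL.sub("[EMAIL]", content)
    content = CARD.sub("[CARD]", content)
    return content[:max_len]

clean = sanitize("Contact alice@example.com about card 4111 1111 1111 1111")
# → "Contact [EMAIL] about card [CARD]"
```

Call it on the value passed to span.add_event, e.g. `"gen_ai.prompt.content": sanitize(prompt)`.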
Option 2: Redact at the Collector
Drop or hash prompt content in the pipeline — no application changes required:
```yaml
processors:
  transform:
    trace_statements:
      - context: spanevent
        statements:
          # Remove prompt and completion content from all events
          - delete_key(attributes, "gen_ai.prompt.content")
          - delete_key(attributes, "gen_ai.completion.content")
          # Alternative to deleting: hash for correlation without exposure
          # - set(attributes["gen_ai.prompt.content"],
          #       SHA256(attributes["gen_ai.prompt.content"]))
```

Note that content lives in span event attributes, so the transform processor with a spanevent context is the right tool; the attributes processor only operates on span attributes and would not reach event content.
Option 3: Tail-based filtering
Use the Collector's tail sampling processor to drop entire traces from specific users or sessions before export.
Sampling Strategies for LLM Traffic
LLM calls are slow (100ms–30s) and produce large spans. Apply different rates to different scenarios:
| Scenario | Strategy | Rate |
|---|---|---|
| Development | AlwaysOn | 100% |
| Production, successful calls | TraceIdRatioBased | 5–10% |
| Production, errors | Tail-based | 100% |
| High-token requests (> 2K tokens) | Tail-based attribute filter | 100% |
| Agent runs | Tail-based | 100% (rare and high-value) |
```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of root spans; child spans inherit the decision
sampler = ParentBased(root=TraceIdRatioBased(0.10))
provider = TracerProvider(sampler=sampler, resource=resource)  # resource: your service Resource
```
For tail-based sampling — capturing all errors while sampling successful calls — configure the OpenTelemetry Collector tail sampling processor.
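A sketch of such a policy set for the Collector's tail_sampling processor, matching the table above: keep every error trace and every high-token trace, sample the rest probabilistically (names, window, and percentages are illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans until the trace is complete
    policies:
      # Keep 100% of traces containing an error span
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep traces that consumed many tokens
      - name: keep-high-token
        type: numeric_attribute
        numeric_attribute:
          key: gen_ai.usage.input_tokens
          min_value: 2000
      # Sample the remaining traffic
      - name: sample-success
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```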
Visualizing AI Traces with Uptrace
Once your LLM app or agent is instrumented, you need a backend to store and query the data. Uptrace is an open-source OpenTelemetry APM built on ClickHouse that understands gen_ai.* semantic conventions out of the box, with no custom parsing rules or dashboard setup required.
Uptrace automatically groups spans by gen_ai.request.model and gen_ai.system, so you can see token usage, latency, and error rates broken down by model and provider without any additional configuration. You can try Uptrace without an account, or self-host it for free.
Best Practices
| Practice | Do | Don't |
|---|---|---|
| Prompt content | Store as span events with a size cap | Store as span attributes (always indexed, always exported) |
| Span naming | gen_ai.chat, gen_ai.embeddings | call_gpt4, llm_request_user_123 |
| Token tracking | Separate input/output counters | Single "total" counter (hides cost structure) |
| Model attribute | gen_ai.request.model (standard) | Custom llm_model attribute |
| Error handling | span.record_exception(e) + StatusCode.ERROR | Swallow exceptions, set no status |
| Agent steps | One span per tool call | One span for the entire agent run |
| Sampling | Tail-based, 100% errors | Head-based only (misses important errors) |
FAQ
Do I need to instrument every LLM call manually?
No. Use auto-instrumentation packages for OpenAI, Anthropic, LangChain, or LlamaIndex. Manual instrumentation is for custom logic, retrieval pipelines, evaluation scores, or frameworks without an existing instrumentor.
Should I store prompts in span attributes?
No — store them as span events. Events can be filtered or dropped at the Collector level without touching application code. Attributes are always indexed and always exported. Cap content at 500–1000 characters to avoid large payloads.
How much overhead does OpenTelemetry add to LLM calls?
Under 1ms per call. LLM API latency (100ms–30s) dominates entirely. The main cost is telemetry storage volume, which sampling controls.
Can I trace streaming responses?
Yes. Start the span before the stream, accumulate token counts from chunks, record the completion event after the final chunk, and end the span. Most providers expose total token counts in the final chunk or a separate usage API.
How do I correlate a trace to the user session?
Add user.id and session.id as span attributes on the root span. They propagate to all child spans automatically through the active context.