LLM Cost Monitoring with OpenTelemetry

Alexandr Bandurchin
April 07, 2026
8 min read

Teams running LLM applications in production face a cost problem that traditional APM tools were never designed to solve. CPU and memory costs are relatively predictable — a web service processing 1,000 requests per second costs roughly the same week over week. LLM API costs are not. A single user session can cost $0.01 or $5 depending on prompt length, model choice, conversation history, and how many retries happen inside your chain. Without instrumentation, cost anomalies are invisible until the monthly invoice.

The failure pattern is familiar: a team launches a feature using GPT-5, everything looks fine in staging, and then production traffic reveals that a small percentage of requests trigger long multi-turn conversations costing 50× more than the average. By the time the bill arrives, the money is already spent.

OpenTelemetry's GenAI semantic conventions solve this at the instrumentation layer. The gen_ai.usage.input_tokens and gen_ai.usage.output_tokens attributes are captured automatically per API call, giving you token-level visibility that you can turn into dollar figures, per-request cost breakdowns, and budget alerts — using the same observability stack you already have.

Why Standard APM Misses LLM Costs

Traditional APM tracks latency, error rates, and throughput. These metrics are meaningful for LLM applications too, but they say nothing about financial cost. A request that takes 3 seconds and costs $0.002 looks identical in APM to one that takes 3 seconds and costs $0.40. Both have the same latency. Only token counts tell you the difference.

Three things make LLM costs hard to track without dedicated instrumentation:

Token consumption is buried inside SDK calls. Unless you manually read response.usage after every API call and record it somewhere, the data never appears in your traces or metrics. Most applications don't do this consistently.

Costs happen across chained calls. A LangChain agent might make 8 OpenAI calls to answer a single user question. The cost of the full interaction is the sum of all 8, but standard tracing only shows individual requests — not their aggregate cost under a parent operation.

Model prices vary widely and change. GPT-5.4 costs 12.5× more per input token than gpt-5.4-nano ($2.50 vs $0.20 per 1M tokens). Reasoning models like o3 and o4-mini bill internal "thinking" tokens that never appear in the response but still cost money. If your application conditionally uses different models, you need model-level attribution to understand your cost structure.

LLM Pricing Reference

Current pricing for the most common models (April 2026 — always verify against provider docs as prices change):

| Model             | Input (per 1M tokens) | Output (per 1M tokens) | Notes                               |
|-------------------|-----------------------|------------------------|-------------------------------------|
| gpt-5.4           | $2.50                 | $15.00                 | OpenAI flagship (Mar 2026)          |
| gpt-5             | $1.25                 | $10.00                 | Good balance of cost and capability |
| gpt-5.4-mini      | $0.75                 | $4.50                  | Mid-tier, good for most tasks       |
| gpt-5.4-nano      | $0.20                 | $1.25                  | Lowest cost in GPT-5.4 family       |
| o3                | $2.00                 | $8.00                  | Reasoning model — see note below    |
| o4-mini           | $1.10                 | $4.40                  | Compact reasoning model             |
| claude-sonnet-4.6 | $3.00                 | $15.00                 | Anthropic recommended               |
| claude-haiku-4.5  | $1.00                 | $5.00                  | Anthropic budget tier               |
| gemini-2.5-pro    | $1.25                 | $10.00                 | Contexts under 200K tokens          |

Reasoning models (o3, o4-mini) require special handling. These models use internal "reasoning tokens" during inference that are billed as output tokens but not returned in the response. gen_ai.usage.output_tokens includes these hidden tokens, so actual cost can be significantly higher than visible completion length suggests. Set conservative alert thresholds for o-series models and treat output token counts as an upper bound on reasoning effort.
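The gap can be quantified with the o3 rates from the table above. The token counts below are illustrative, not taken from a real response:

```python
# Illustrative figures: a short visible answer backed by heavy internal reasoning
O3_OUTPUT_RATE = 8.00  # USD per 1M output tokens (from the table above)

visible_completion_tokens = 120    # what the user actually sees
billed_output_tokens = 1_480       # usage-reported output, reasoning tokens included

naive_cost = visible_completion_tokens * O3_OUTPUT_RATE / 1_000_000
actual_cost = billed_output_tokens * O3_OUTPUT_RATE / 1_000_000

print(f"naive: ${naive_cost:.5f}, actual: ${actual_cost:.5f}")
```

Here the billed output is over 12× the visible completion length, exactly the kind of gap that makes completion-length-based cost estimates unreliable for o-series models.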

Output tokens are consistently more expensive than input tokens — 4–8× for most models. Applications generating long completions (code, detailed explanations) have very different cost profiles from those producing short factual answers.
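A quick comparison using the gpt-5 rates above makes the asymmetry concrete (token counts are illustrative):

```python
IN_RATE, OUT_RATE = 1.25, 10.00  # gpt-5, USD per 1M tokens (from the table above)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * IN_RATE + output_tokens * OUT_RATE) / 1_000_000

summarization = request_cost(5_000, 300)   # long document in, short summary out
generation = request_cost(300, 5_000)      # short prompt in, long completion out

print(f"summarization: ${summarization:.5f}, generation: ${generation:.5f}")
```

Same total token volume, but the generation-heavy request costs roughly 5× more.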

Capturing Token Usage with OpenTelemetry

The opentelemetry-instrumentation-openai-v2 package automatically records gen_ai.usage.input_tokens and gen_ai.usage.output_tokens on every span. No manual response parsing required:

python
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter())
)
trace.set_tracer_provider(provider)

OpenAIInstrumentor().instrument()

# All subsequent OpenAI calls are automatically traced with token counts
from openai import OpenAI
client = OpenAI()

Each span now carries the token breakdown:

text
Span: gen_ai.operation.name = "chat"
  gen_ai.system              = "openai"
  gen_ai.request.model       = "gpt-5"
  gen_ai.usage.input_tokens  = 312
  gen_ai.usage.output_tokens = 87
  gen_ai.response.finish_reason = "stop"

For Anthropic, the equivalent package is opentelemetry-instrumentation-anthropic. Both emit the same gen_ai.* attributes, so your queries and dashboards work across providers. For full setup instructions and available options, see the OpenAI instrumentation guide.

Calculating Cost Per Request

With token counts on spans, cost calculation is straightforward. Add it as a custom span attribute so it's queryable alongside everything else:

python
from opentelemetry import trace

# Keep pricing in one place — update when providers change rates
MODEL_PRICING = {
    "gpt-5.4":              {"input": 2.50,  "output": 15.00},
    "gpt-5":                {"input": 1.25,  "output": 10.00},
    "gpt-5.4-mini":         {"input": 0.75,  "output": 4.50},
    "gpt-5.4-nano":         {"input": 0.20,  "output": 1.25},
    "o3":                   {"input": 2.00,  "output": 8.00},
    "o4-mini":              {"input": 1.10,  "output": 4.40},
    "claude-sonnet-4-6":    {"input": 3.00,  "output": 15.00},
    "claude-haiku-4-5":     {"input": 1.00,  "output": 5.00},
    "gemini-2.5-pro":       {"input": 1.25,  "output": 10.00},
}

def calculate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    # Unknown models silently fall back to zero cost — log a warning here in
    # production so gaps in the pricing table don't hide real spend
    pricing = MODEL_PRICING.get(model, {"input": 0.0, "output": 0.0})
    return (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000

tracer = trace.get_tracer(__name__)

def chat_with_cost(prompt: str, model: str = "gpt-5") -> str:
    with tracer.start_as_current_span("llm.chat") as span:
        span.set_attribute("gen_ai.request.model", model)

        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )

        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        cost = calculate_cost_usd(model, input_tokens, output_tokens)

        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        span.set_attribute("llm.cost.usd", cost)

        return response.choices[0].message.content

The llm.cost.usd attribute is now queryable in your observability backend: filter by model, sum over time ranges, group by service or user.

Tracking Total Cost per Agent Run

When a single user operation triggers multiple LLM calls — a LangChain agent, a multi-step chain, any orchestrated workflow — you want the total cost of the full interaction, not just individual calls. Wrap the operation in a parent span and aggregate:

python
def run_research_agent(question: str, user_id: str) -> str:
    with tracer.start_as_current_span("agent.run") as parent_span:
        parent_span.set_attribute("app.user_id", user_id)
        parent_span.set_attribute("app.operation", "research")

        total_cost = 0.0
        total_input_tokens = 0
        total_output_tokens = 0

        # Step 1: decompose the question (cheap model)
        with tracer.start_as_current_span("agent.decompose") as span:
            response = client.chat.completions.create(
                model="gpt-5.4-nano",
                messages=[{"role": "user", "content": f"Break this into sub-questions: {question}"}]
            )
            step_cost = calculate_cost_usd(
                "gpt-5.4-nano",
                response.usage.prompt_tokens,
                response.usage.completion_tokens
            )
            span.set_attribute("llm.cost.usd", step_cost)
            total_cost += step_cost
            total_input_tokens += response.usage.prompt_tokens
            total_output_tokens += response.usage.completion_tokens
            sub_questions = response.choices[0].message.content

        # Step 2: answer each sub-question (full model)
        with tracer.start_as_current_span("agent.answer") as span:
            response = client.chat.completions.create(
                model="gpt-5",
                messages=[{"role": "user", "content": sub_questions}]
            )
            step_cost = calculate_cost_usd(
                "gpt-5",
                response.usage.prompt_tokens,
                response.usage.completion_tokens
            )
            span.set_attribute("llm.cost.usd", step_cost)
            total_cost += step_cost
            total_input_tokens += response.usage.prompt_tokens
            total_output_tokens += response.usage.completion_tokens
            answer = response.choices[0].message.content

        # Record totals on the parent span
        parent_span.set_attribute("llm.cost.usd", total_cost)
        parent_span.set_attribute("llm.total_input_tokens", total_input_tokens)
        parent_span.set_attribute("llm.total_output_tokens", total_output_tokens)

        return answer

With this structure you can query both individual step costs and total operation cost from the same trace.

Recording Cost as an OpenTelemetry Metric

Spans are good for per-request cost. For aggregate spend over time — daily cost, cost by model, cost rate anomalies — OpenTelemetry metrics are the right tool. A counter accumulates continuously and can be queried for any time window:

python
from opentelemetry import metrics

meter = metrics.get_meter(__name__)

# Counter for total cost in USD
cost_counter = meter.create_counter(
    name="llm.cost.usd",
    description="Cumulative LLM API cost in USD",
    unit="USD",
)

# Histogram for per-request cost distribution
cost_histogram = meter.create_histogram(
    name="llm.cost.per_request.usd",
    description="Cost distribution per LLM request",
    unit="USD",
)

def tracked_completion(model: str, messages: list) -> str:
    response = client.chat.completions.create(model=model, messages=messages)

    cost = calculate_cost_usd(
        model,
        response.usage.prompt_tokens,
        response.usage.completion_tokens
    )

    labels = {"gen_ai.request.model": model, "service.name": "my-service"}
    cost_counter.add(cost, labels)
    cost_histogram.record(cost, labels)

    return response.choices[0].message.content

The counter gives you cumulative spend that you can diff over any window. The histogram shows your cost distribution — whether you have occasional expensive outlier requests or a uniformly expensive workload. For broader AI metrics patterns — GPU utilization, inference latency histograms, sampling strategies for high-volume workloads — see OpenTelemetry for AI Systems.

Cost Visibility in LangChain Applications

For LangChain chains and agents, LangChainInstrumentor captures spans for each chain step. Combine it with per-call cost attribution using the pattern above. For a deeper walkthrough of LangChain-specific monitoring patterns including silent failure detection, see the LangChain observability guide.

python
from opentelemetry.instrumentation.langchain import LangChainInstrumentor
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor

# Both instrumentors together: LangChain provides chain structure,
# OpenAI instrumentation provides token counts on each LLM call
LangChainInstrumentor().instrument()
OpenAIInstrumentor().instrument()

With both active, your trace shows the chain as the parent span and individual LLM calls — with gen_ai.usage.* attributes — as children. You can sum token counts across children to derive chain-level cost.
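If you post-process exported traces rather than aggregating in application code, chain-level cost is a group-by over trace IDs. A minimal sketch over plain span dicts (the field names here are assumptions for illustration, not the SDK's export schema):

```python
from collections import defaultdict

def cost_per_trace(spans: list[dict]) -> dict[str, float]:
    """Sum each span's llm.cost.usd attribute by trace_id."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        cost = span.get("attributes", {}).get("llm.cost.usd")
        if cost is not None:
            totals[span["trace_id"]] += cost
    return dict(totals)

spans = [
    {"trace_id": "a1", "attributes": {"llm.cost.usd": 0.0012}},  # decompose step
    {"trace_id": "a1", "attributes": {"llm.cost.usd": 0.0187}},  # answer step
    {"trace_id": "b2", "attributes": {}},                        # non-LLM span
]
print(cost_per_trace(spans))
```

Observability backends do the same aggregation server-side; the point is that per-call cost attributes compose into chain-level cost with no extra instrumentation.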

Cost Dashboards and Alerts in Uptrace

Uptrace stores gen_ai.usage.* and custom llm.cost.usd attributes as queryable numeric fields in ClickHouse. Once traces and metrics are flowing, useful queries include:

Daily cost by model:
Group llm.cost.usd metric by gen_ai.request.model, sum over 24h. This shows which model drives the most spend and whether usage shifted after a deployment.

P99 cost per agent run:
Filter parent spans with app.operation = "research", take the 99th percentile of llm.cost.usd. High P99 means a small percentage of runs is generating disproportionate cost.

Cost rate alert:
Alert when the rate of llm.cost.usd counter exceeds your threshold — for example, if hourly spend exceeds $50 when normal is under $10. This catches runaway loops or unexpected traffic spikes before they compound.

Configure your Uptrace DSN and OTLP endpoint via the getting started guide to begin streaming telemetry.

Cost Optimization Signals

Collected cost data reveals optimization opportunities that are invisible without instrumentation:

Model downgrade candidates. Compare response quality versus cost across models for your specific use cases. If gpt-5.4-mini handles 80% of your requests acceptably at roughly 3× lower cost than gpt-5.4, routing those requests to the cheaper model has an immediate impact. gpt-5.4-nano reduces input cost by 12.5× for simple classification or extraction tasks.

Prompt length outliers. High gen_ai.usage.input_tokens on specific endpoints points to prompts that have grown with accumulated context or system prompt bloat. Trimming 200 tokens from a system prompt that runs on every request saves proportionally.

Retry amplification. If your error handling retries failed LLM calls, each retry doubles or triples the cost of that request. Token-level tracing makes retry patterns visible — high input token counts on spans with errors often indicate retry loops.
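To make retry amplification measurable, accumulate cost across attempts instead of recording only the final successful call. A sketch with a stand-in call function (the `BillableError.cost` field is a hypothetical way to carry token cost out of a failed attempt):

```python
class BillableError(Exception):
    """A failed LLM call that still consumed billable tokens (e.g. a timeout
    after generation started)."""
    def __init__(self, message: str, cost: float = 0.0):
        super().__init__(message)
        self.cost = cost

def call_with_retries(call_fn, max_attempts: int = 3):
    """Return (result, total_cost, attempts); cost includes failed attempts."""
    total_cost = 0.0
    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            result, cost = call_fn()
            return result, total_cost + cost, attempt
        except BillableError as err:
            total_cost += err.cost  # failed attempts still cost money
            last_err = err
    raise last_err
```

Recording the returned total cost and attempt count on the request span makes a retried call show 2–3× the cost of a clean one instead of looking identical.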

Conversation history accumulation. Chat applications that include full conversation history in every prompt pay linearly more as conversations grow. Seeing gen_ai.usage.input_tokens increase monotonically across a session identifies this pattern.
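A common mitigation is a rolling token budget on history. The sketch below approximates token counts as `len(text) // 4`; in practice, use the model's tokenizer:

```python
def trim_history(messages: list[dict], max_tokens: int = 4_000) -> list[dict]:
    """Keep the most recent messages that fit a rough token budget."""
    kept, budget = [], max_tokens
    for msg in reversed(messages):             # walk newest to oldest
        estimate = len(msg["content"]) // 4    # crude chars-to-tokens heuristic
        if estimate > budget:
            break
        kept.append(msg)
        budget -= estimate
    return list(reversed(kept))                # restore chronological order
```

With a fixed budget, per-session cost plateaus instead of growing linearly, and the gen_ai.usage.input_tokens trend across a session flattens accordingly.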