What is Distributed Tracing? Concepts & OpenTelemetry Implementation

Distributed tracing is an observability technique that tracks requests as they flow through distributed systems, providing visibility into how different services interact to fulfill user requests. It creates a complete view of a request's journey across microservices, APIs, and databases, recording timing, dependencies, and failures along the way.

With distributed tracing, you can analyze the timing of each operation, monitor logs and errors as they occur in real time, and identify bottlenecks across your entire system. This technique is particularly valuable in microservices architectures, where applications consist of multiple independent services working together.

How Distributed Tracing Works

Modern applications built on microservices or serverless architectures rely on multiple services interacting to fulfill a single user request. This complexity makes it challenging to identify performance bottlenecks, diagnose issues, and analyze overall system behavior.

Distributed tracing addresses these challenges by creating a trace: a representation of a single request's journey through various services and components. Each trace consists of interconnected spans, where each span represents an individual operation within a specific service or component.

When a request enters a service, the trace context travels with it in trace headers, allowing downstream services to participate in the same trace. As the request flows through the system, each service creates its own span, records the operation's duration and metadata on that span, and passes the updated trace context on to the services it calls.

flowchart LR
  browser(Browser) --- webapp(Web App)
  mobile(Mobile App) --- gateway(API Gateway)
  gateway --- service1 & service2 & service3
  webapp --- service1 & service2 & service3
  service1(Account service) --> db1[(Account DB)]
  service2(Inventory service) --> db2[(Inventory DB)]
  service3(Shipping service) --> db3[(Shipping DB)]

Distributed tracing tools use the generated trace data to provide visibility into system behavior, identify performance issues, assist with debugging, and help ensure the reliability and scalability of distributed applications.

Getting Started with OpenTelemetry Tracing

The easiest way to get started is to choose an OpenTelemetry APM and follow its documentation. Many vendors offer pre-configured OpenTelemetry distributions that simplify the setup process.

Some vendors, such as Uptrace and SkyWalking, allow you to try their products without creating an account.

Uptrace is an open source APM for OpenTelemetry with an intuitive query builder, rich dashboards, automatic alerts, and integrations for most languages and frameworks. It helps developers and operators gain insight into the latency, errors, and dependencies of their distributed applications, identify performance bottlenecks, debug problems, and optimize overall system performance.

You can get started with Uptrace by downloading a DEB/RPM package or a pre-compiled Go binary.

Core Concepts

Spans

A span represents a unit of work in a trace, such as a remote procedure call (RPC), database query, or in-process function call. Each span contains:

A span name (operation name)
A parent span ID (except for root spans)
A span kind
Start and end timestamps
A status indicating success or failure
Key-value attributes describing the operation
A timeline of events
Links to other spans
A span context that propagates trace ID and other data between services

A trace is a tree of spans showing the path of a request through an application. The root span is the first span in a trace.
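
To make the data model concrete, here is a minimal sketch of spans forming a trace tree. It uses plain Python dataclasses rather than the OpenTelemetry SDK; the `Span` class, its field names, and the `render_tree` helper are hypothetical and chosen only for illustration.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record; not the OpenTelemetry SDK class."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None  # None marks the root span
    kind: str = "internal"
    status: str = "unset"
    attributes: dict = field(default_factory=dict)

    def child(self, name: str, **kwargs) -> "Span":
        # A child shares the trace ID and points at its parent's span ID.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id, **kwargs)


def render_tree(spans: list, parent_id: Optional[str] = None, depth: int = 0) -> list:
    """Return indented span names, depth-first, starting from the root."""
    lines = []
    for span in spans:
        if span.parent_id == parent_id:
            lines.append("  " * depth + span.name)
            lines.extend(render_tree(spans, span.span_id, depth + 1))
    return lines


root = Span(name="GET /projects/:id", trace_id=secrets.token_hex(16), kind="server")
query = root.child("SELECT * FROM projects WHERE id = ?", kind="client")
trace = [root, query]
print("\n".join(render_tree(trace)))
```

The only link between the two spans is the shared trace ID plus the child's parent span ID, which is why a backend can rebuild the tree even when spans arrive out of order.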

flowchart TD
  client(User Browser) --> webapp(Web App)
  webapp --> service1 & service2
  service1(Account service) --> db1[(Account DB)]
  db1 --> q1(SELECT * FROM accounts)
  service2(Inventory service) --> db2[(Inventory DB)]
  db2 --> q2(SELECT * FROM inventories)

Span Names

OpenTelemetry backends use span names and attributes to group similar spans together. To ensure proper grouping, use short names that contain no variable data. Keep the total number of unique span names below 1,000 to avoid creating excessive span groups that can degrade performance.

Good span names (short, distinctive, and groupable):

Span name	Comment
`GET /projects/:id`	Route name with parameter placeholders
`select_project`	Function name without arguments
`SELECT * FROM projects WHERE id = ?`	Database query with placeholders

Poor span names (contain variable parameters):

Span name	Comment
`GET /projects/42`	Contains variable parameter `42`
`select_project(42)`	Contains variable argument `42`
`SELECT * FROM projects WHERE id = 42`	Contains variable value `42`
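
When a framework does not expose the route template, one option is to derive a groupable name from the raw path. The sketch below is an assumption-laden illustration: the `normalize_span_name` helper and its ID patterns (all-digit segments and UUIDs) are hypothetical, and a real route template is preferable whenever it is available.

```python
import re

# Path segments that look like numeric IDs or UUIDs (assumed patterns).
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_span_name(method: str, path: str) -> str:
    """Replace variable path segments with the ':id' placeholder."""
    segments = [":id" if _ID_SEGMENT.match(seg) else seg for seg in path.split("/")]
    return f"{method.upper()} {'/'.join(segments)}"


print(normalize_span_name("get", "/projects/42"))  # GET /projects/:id
```

The concrete value is not lost by doing this; it can still be recorded as a span attribute, where it does not affect grouping.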

Span Kind

Span kind must be one of the following values:

server – Server-side operations (e.g., HTTP server handler)
client – Client-side operations (e.g., HTTP client requests)
producer – Message producers (e.g., Kafka producer)
consumer – Message consumers and async operations (e.g., Kafka consumer)
internal – Internal operations

Status Code

Status code indicates whether an operation succeeded or failed:

ok – Success
error – Failure
unset – Default value, allowing backends to assign status

Attributes

Attributes provide contextual information about spans. For example, an HTTP endpoint might have attributes like http.method = GET and http.route = /projects/:id.

While you can name attributes freely, use semantic attribute conventions for common operations to ensure consistency across systems.

Events

Events are timestamped annotations with attributes that lack an end time (and therefore no duration). They typically represent exceptions, errors, logs, and messages, though you can create custom events as well.

Context

Span context carries information about a span as it propagates through different components and services. It includes:

Trace ID: Globally unique identifier for the entire trace (shared by all spans in the trace)
Span ID: Unique identifier for a specific span within a trace
Trace flags: Properties such as sampling status
Trace state: Optional vendor-specific or application-specific data

Context maintains continuity and correlation of spans within a distributed system, allowing services to associate their spans with the correct trace and providing end-to-end visibility.

Context Propagation

Context propagation ensures that trace IDs, span IDs, and other metadata consistently propagate across services and components. OpenTelemetry handles both in-process and distributed propagation.

In-Process Propagation

Implicit: Automatic storage in thread-local or equivalent execution-scoped variables (Java, Python, Ruby, Node.js)
Explicit: Manual passing of context as function arguments (Go)
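
The difference between the two styles can be sketched with Python's standard `contextvars` module. This is not the OpenTelemetry API; the variable and function names below are hypothetical, and a plain dict stands in for a real context object.

```python
import contextvars

# Implicit propagation: the active span ID lives in a context variable.
_current_span_id = contextvars.ContextVar("current_span_id", default=None)


def query_database_implicit():
    # No context argument: the parent span is looked up from ambient state.
    return _current_span_id.get()


def query_database_explicit(ctx: dict):
    # Explicit style: the caller hands the context over as an argument.
    return ctx.get("span_id")


def handle_request(span_id: str) -> tuple:
    token = _current_span_id.set(span_id)
    try:
        return (
            query_database_implicit(),
            query_database_explicit({"span_id": span_id}),
        )
    finally:
        # Restore the previous value so the span does not leak into later work.
        _current_span_id.reset(token)


print(handle_request("279fa73bc935cc05"))
```

Implicit propagation keeps function signatures clean, while explicit propagation makes the data flow visible at every call site; both return the same parent span here.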

Distributed Propagation

OpenTelemetry supports several protocols for serializing and passing context data:

W3C Trace Context (recommended, enabled by default): Uses traceparent header
Example: traceparent: 00-84b54e9330faae5350f0dd8673c98146-279fa73bc935cc05-01 (version, trace ID, parent span ID, and trace flags, separated by hyphens)
B3 Zipkin: Uses headers starting with x-b3-
Example: X-B3-TraceId
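
To illustrate what the traceparent header carries, the following sketch parses and formats a version 00 value using only the standard library. The `SpanContext` tuple and both function names are hypothetical; instrumented applications normally leave this work to the OpenTelemetry propagator, and this sketch accepts only the four-field layout.

```python
import re
from typing import NamedTuple, Optional

# version - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse a traceparent value; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid in W3C Trace Context.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return SpanContext(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: SpanContext) -> str:
    """Serialize a span context as a version 00 traceparent value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


header = "00-84b54e9330faae5350f0dd8673c98146-279fa73bc935cc05-01"
print(parse_traceparent(header))
```

A service that receives this header would create its span with the same trace ID, use the received span ID as the parent, and send its own span ID downstream.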

Baggage

Baggage propagates custom key-value pairs between services, similar to span context. It allows you to associate contextual information (such as user IDs or session IDs) with requests or transactions.

Baggage provides a standardized way to pass relevant data throughout the system, enabling better observability and analysis without relying on ad hoc mechanisms or manual instrumentation.
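
On the wire, baggage travels as a comma-separated list of key=value pairs in a `baggage` header. The sketch below shows one way to encode and decode such a value with the standard library; the function names are hypothetical, keys are assumed to be simple tokens, and optional `;property` metadata is discarded rather than preserved.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict) -> str:
    """Serialize key-value pairs into a baggage header value."""
    # Values are percent-encoded so commas, spaces, and '=' cannot break parsing.
    return ",".join(f"{key}={quote(str(value), safe='')}" for key, value in entries.items())


def decode_baggage(header: str) -> dict:
    """Parse a baggage header value, ignoring any ';property' suffixes."""
    entries = {}
    for member in header.split(","):
        if "=" not in member:
            continue  # skip empty or malformed list members
        key, _, value = member.partition("=")
        value = value.split(";", 1)[0]
        entries[key.strip()] = unquote(value.strip())
    return entries


header = encode_baggage({"user.id": "42", "session.id": "a b"})
print(header)  # user.id=42,session.id=a%20b
print(decode_baggage(header))
```

Because baggage is forwarded to every downstream service, it is worth keeping entries small and free of sensitive data.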

Instrumentation

OpenTelemetry instrumentations are plugins for popular frameworks and libraries that use the OpenTelemetry API to record important operations such as HTTP requests, database queries, logs, and errors.

What to Instrument

Focus instrumentation efforts on operations that provide the most value:

Network operations: HTTP requests, RPC calls
Filesystem operations: Reading and writing files
Database queries: Combined network and filesystem operations
Errors and logs: Using structured logging

Troubleshooting

Missing Spans

Problem: Expected spans do not appear in the backend.

Common Causes:

SDK not initialized before application startup
Instrumentation libraries misconfigured
Overly aggressive sampling
Export endpoint unreachable

Solutions:

Verify initialization order
Check auto-instrumentation package installation
Temporarily set sampling to 100% for debugging
Test backend connectivity and credentials
Enable debug logging

Broken Context Propagation

Problem: Spans appear disconnected or traces fragment across services.

Common Causes:

Context not propagated between services
Uninstrumented custom protocols
Async operations breaking context
Missing trace headers

Solutions:

Verify HTTP client/server instrumentation
Manually manage context for custom protocols
Use explicit context management for async operations
Confirm trace headers are present in requests
Configure propagation for all communication protocols
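
Work handed to another thread is a common place for context to get lost. The sketch below uses the standard `contextvars` and `threading` modules to show explicit context management; the names are hypothetical and a bare trace ID stands in for a full span context. Whether a plain thread inherits the caller's context depends on the Python version and build, so the sketch does not rely on it.

```python
import contextvars
import threading

current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def read_trace_id(results: dict, key: str) -> None:
    # Record whatever trace ID is visible from the executing thread.
    results[key] = current_trace_id.get()


def run_in_threads(trace_id: str) -> dict:
    results = {}
    token = current_trace_id.set(trace_id)
    try:
        # A plain thread may start with an empty context, losing the trace ID.
        plain = threading.Thread(target=read_trace_id, args=(results, "plain"))

        # Copying the context and running the target inside it keeps the trace ID.
        ctx = contextvars.copy_context()
        copied = threading.Thread(target=ctx.run, args=(read_trace_id, results, "copied"))

        for thread in (plain, copied):
            thread.start()
        for thread in (plain, copied):
            thread.join()
    finally:
        current_trace_id.reset(token)
    return results


print(run_in_threads("84b54e9330faae5350f0dd8673c98146"))
```

The same idea applies to task queues and custom protocols: capture the context where the work is scheduled and restore it where the work runs.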

Performance Overhead

Problem: Application performance degrades after enabling tracing.

Common Causes:

Over-instrumentation
Synchronous export blocking threads
Large attributes or excessive events
High sampling rates

Solutions:

Use asynchronous batch exporters
Implement appropriate sampling (1-5% for high-traffic applications)
Remove unnecessary spans
Limit attribute sizes
Consider tail-based sampling
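
Head sampling can be made consistent across services by deriving the decision from the trace ID instead of a fresh random number. The function below is a simplified sketch of that idea, not the algorithm of any particular SDK sampler; `should_sample` is a hypothetical name, and it assumes trace IDs are uniformly random 32-character hex strings.

```python
def should_sample(trace_id: str, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below the ratio threshold."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # The last 16 hex characters are the low 64 bits of the 128-bit trace ID.
    value = int(trace_id[-16:], 16)
    return value < int(ratio * (1 << 64))


trace_id = "84b54e9330faae5350f0dd8673c98146"
print(should_sample(trace_id, 0.05))
```

Every service that evaluates the same trace ID with the same ratio reaches the same decision, so a trace is either recorded in full or not at all.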

High Cardinality Issues

Problem: Too many unique span names or attribute values cause storage issues.

Common Causes:

Variable data in span names
Unlimited attribute values
Auto-generated unique identifiers

Solutions:

Use parameterized span names
Normalize or bucket attribute values
Follow semantic conventions for naming
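
Bucketing replaces an unbounded set of attribute values with a small fixed one. The helpers below are a minimal sketch; the function names and the size thresholds are assumptions chosen for illustration.

```python
def bucket_status_code(status_code: int) -> str:
    """Collapse individual HTTP status codes into classes such as '2xx' or '5xx'."""
    return f"{status_code // 100}xx"


def bucket_size(num_bytes: int) -> str:
    """Map a payload size to one of three labels (thresholds are assumptions)."""
    for limit, label in ((1_024, "<1KiB"), (1_048_576, "<1MiB")):
        if num_bytes < limit:
            return label
    return ">=1MiB"


print(bucket_status_code(404), bucket_size(2_048))  # 4xx <1MiB
```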

Export Failures

Problem: Spans are generated but never reach the backend.

Common Causes:

Network connectivity issues
Authentication problems
Backend unavailability
Buffer overflow

Solutions:

Monitor exporter metrics and logs
Implement retry with exponential backoff
Verify endpoints and authentication
Adjust batch size and timeout settings
Set up export failure alerts
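
Retry with exponential backoff can be sketched as follows. This is a generic illustration rather than the behavior of any specific exporter: `export_with_retry`, the `send` callable, and the default delays are hypothetical, and only `ConnectionError` is treated as retryable here.

```python
import random
import time
from typing import Callable, Sequence


def export_with_retry(
    send: Callable[[Sequence[dict]], None],
    batch: Sequence[dict],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send a batch, backing off exponentially between failed attempts."""
    for attempt in range(max_attempts):
        try:
            send(batch)
            return True
        except ConnectionError:
            if attempt == max_attempts - 1:
                break  # out of attempts; do not sleep after the last failure
            # The delay doubles with each attempt, capped at max_delay, with jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))
    return False
```

The jitter spreads retries from many clients over time, so a backend that has just recovered is not hit by all of them at once. A `False` return value is the point at which a caller would drop the batch and increment a failure metric.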

Memory Issues

Problem: Memory leaks or unexpectedly high memory usage after enabling tracing.

Common Causes:

Spans not properly exported
Data accumulation in buffers
Long-running spans holding references

Solutions:

Ensure proper span lifecycle management
Configure appropriate export intervals
Review attribute sizes
Monitor buffer sizes
Implement resource cleanup
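
One way to keep buffered spans from growing without limit is a fixed-size queue that counts what it drops. The class below is a minimal sketch with hypothetical names; it evicts the oldest span when full, which is an assumption, since SDKs differ in their drop policy.

```python
from collections import deque


class BoundedSpanBuffer:
    """Holds finished spans until export, dropping the oldest when full."""

    def __init__(self, max_size: int = 2048) -> None:
        self._spans = deque(maxlen=max_size)
        self.dropped = 0

    def add(self, span: dict) -> None:
        if len(self._spans) == self._spans.maxlen:
            # deque(maxlen=...) evicts the oldest item on append; count the loss.
            self.dropped += 1
        self._spans.append(span)

    def drain(self, batch_size: int) -> list:
        """Remove and return up to batch_size spans, oldest first."""
        batch = []
        while self._spans and len(batch) < batch_size:
            batch.append(self._spans.popleft())
        return batch


buffer = BoundedSpanBuffer(max_size=2)
for i in range(3):
    buffer.add({"id": i})
print(buffer.dropped, buffer.drain(10))
```

Exposing the `dropped` counter as a metric makes buffer pressure visible before it shows up as missing traces.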

Next Steps

Distributed tracing provides valuable insights for understanding end-to-end application behavior, identifying performance issues, and optimizing system resources.

Explore the OpenTelemetry tracing API for your programming language:

Architecture: OpenTelemetry provides a flexible and extensible architecture for collecting observability data from applications.

Timeseries Metrics: OpenTelemetry Metrics is an open standard for collecting, aggregating, and sending metrics to APM tools. It is a popular alternative to Prometheus.
