What is Distributed Tracing? Concepts & OpenTelemetry Implementation

What is Distributed Tracing?

Distributed tracing is an observability technique that tracks requests as they flow through distributed systems, providing visibility into how different services interact to fulfill user requests. It creates a complete view of a request's journey across microservices, APIs, and databases, recording timing, dependencies, and failures along the way.

With distributed tracing, you can analyze the timing of each operation, monitor logs and errors as they occur in real time, and identify bottlenecks across your entire system.

OpenTelemetry Tracing is particularly valuable in the context of microservices architecture, where applications are composed of multiple independent services working together to fulfill user requests.

How does tracing work?

In modern applications, especially those based on microservices or serverless architectures, different services often interact with each other to fulfill a single user request. This makes it challenging to identify performance bottlenecks, diagnose issues, and analyze the overall system behavior.

Distributed tracing aims to address these challenges by creating a trace, which is a representation of a single user request's journey through the various services and components. Each trace consists of a series of interconnected spans, where each span represents an individual operation or activity within a specific service or component.

When a request enters a service, the trace context is propagated along with the request. This usually involves injecting trace headers into the request, allowing downstream services to participate in the same trace.

As the request flows through the system, each service generates its own span and updates the trace context with information about its operation's duration, metadata, and any relevant context.
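The hand-off between two services can be sketched in a few lines of Python. This is a minimal illustration, not the OpenTelemetry SDK: the function names `start_trace`, `inject`, and `extract_and_start_child` are hypothetical, spans are plain dictionaries, and the header layout follows the W3C `traceparent` format described in the context propagation section.

```python
import secrets


def start_trace():
    """Create a root span with a fresh trace ID (hypothetical helper)."""
    return {
        "trace_id": secrets.token_hex(16),  # 32 hex characters
        "span_id": secrets.token_hex(8),    # 16 hex characters
        "parent_id": None,
    }


def inject(span, headers):
    """Upstream service: write the trace context into outgoing request headers."""
    headers["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-01"


def extract_and_start_child(headers):
    """Downstream service: read the context and start a span in the same trace."""
    _version, trace_id, parent_span_id, _flags = headers["traceparent"].split("-")
    return {
        "trace_id": trace_id,            # same trace as the caller
        "span_id": secrets.token_hex(8),  # new ID for this service's own span
        "parent_id": parent_span_id,      # links back to the caller's span
    }


root = start_trace()
outgoing_headers = {}
inject(root, outgoing_headers)
child = extract_and_start_child(outgoing_headers)
print(child["trace_id"] == root["trace_id"], child["parent_id"] == root["span_id"])
```

Both spans end up with the same trace ID, and the child's parent ID points at the caller's span, which is what lets a backend reassemble the request's path.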

flowchart LR
  browser(Browser) --- webapp(Web App)
  mobile(Mobile App) --- gateway(API Gateway)
  gateway --- service1 & service2 & service3
  webapp --- service1 & service2 & service3
  service1(Account service) --> db1[(Account DB)]
  service2(Inventory service) --> db2[(Inventory DB)]
  service3(Shipping service) --> db3[(Shipping DB)]

Distributed tracing tools use the generated trace data to provide visibility into system behavior, aid in identifying performance issues, assist with debugging, and help ensure the reliability and scalability of distributed applications.

How to start using OpenTelemetry Tracing?

The easiest way to get started with tracing is to pick an OpenTelemetry APM and follow the documentation. Many vendors offer pre-configured OpenTelemetry distributions that can simplify the process and enhance your experience.

Some vendors, such as Uptrace and SkyWalking, allow users to try their products without creating an account.

Uptrace is an open source APM for OpenTelemetry with an intuitive query builder, rich dashboards, automatic alerts, and integrations for most languages and frameworks.

Using Uptrace, developers and operators can gain insight into the latency, errors, and dependencies of their distributed applications. Uptrace helps identify performance bottlenecks, debug problems, and optimize the overall system.

You can get started with Uptrace by downloading a DEB/RPM package or a pre-compiled Go binary.

Spans

A span represents an operation (unit of work) in a trace. A span could be a remote procedure call (RPC), a database query, or an in-process function call. A span has:

A span name (operation name).
A parent span.
A span kind.
Start and end time.
A status that reports whether the operation succeeded or failed.
A set of key-value attributes describing the operation.
A timeline of events.
A list of links to other spans.
A span context that propagates trace ID and other data between different services.

A trace is a tree of spans that shows the path that a request makes through an app. The root span is the first span in a trace.
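To make the tree structure concrete, here is a small sketch that models spans as records and prints a trace as an indented tree. The `Span` dataclass, its field names, and the helpers `find_root` and `render_tree` are illustrative assumptions, not the OpenTelemetry SDK types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Toy span record; field names are illustrative only."""
    name: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    kind: str = "internal"
    start_ns: int = 0
    end_ns: int = 0
    status: str = "unset"
    attributes: dict = field(default_factory=dict)


def find_root(spans):
    """Return the single span that has no parent."""
    roots = [s for s in spans if s.parent_id is None]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root span, found {len(roots)}")
    return roots[0]


def render_tree(spans):
    """Return indented lines showing each span under its parent."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines = []

    def walk(span, depth):
        duration_ms = (span.end_ns - span.start_ns) / 1_000_000
        lines.append(f"{'  ' * depth}{span.name} ({duration_ms:.1f} ms)")
        for child in sorted(children.get(span.span_id, []), key=lambda c: c.start_ns):
            walk(child, depth + 1)

    walk(find_root(spans), 0)
    return lines


trace = [
    Span("GET /projects/:id", "a1", None, kind="server", start_ns=0, end_ns=30_000_000),
    Span("select_project", "b2", "a1", start_ns=2_000_000, end_ns=27_000_000),
    Span("SELECT * FROM projects WHERE id = ?", "c3", "b2", kind="client",
         start_ns=5_000_000, end_ns=25_000_000),
]
print("\n".join(render_tree(trace)))
```

Each level of indentation corresponds to one parent-child link, so the first printed line is always the root span.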

flowchart TD
  client(User Browser) --> webapp(Web App)
  webapp --> service1 & service2
  service1(Account service) --> db1[(Account DB)]
  db1 --> q1(SELECT * FROM accounts)
  service2(Inventory service) --> db2[(Inventory DB)]
  db2 --> q2(SELECT * FROM inventories)

Span names

OpenTelemetry backends use span names and some attributes to group similar spans together. To group spans properly, give them short and concise names. The total number of unique span names should be less than 1000. Otherwise, you will have too many span groups and your experience may suffer.

The following names are good because they are short, distinctive, and help group similar spans together:

Span name	Comment
`GET /projects/:id`	Good. A route name with param names.
`select_project`	Good. A function name without arguments.
`SELECT * FROM projects WHERE id = ?`	Good. A database query with placeholders.

The following names are bad because they contain variable params and args:

Span name	Comment
`GET /projects/42`	Bad. Contains a variable param `42`.
`select_project(42)`	Bad. Contains a variable `42`.
`SELECT * FROM projects WHERE id = 42`	Bad. Contains a variable arg `42`.
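When the framework does not expose the matched route template, one fallback is to normalize the raw path before using it as a span name. The sketch below is a heuristic under stated assumptions: the function name `low_cardinality_name` is hypothetical, and it only treats purely numeric segments and UUIDs as variable parameters. Prefer the real route template whenever it is available.

```python
import re

# Assumed heuristics: numeric segments and UUIDs are treated as variable params.
_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def low_cardinality_name(method: str, path: str) -> str:
    """Build a span name from an HTTP method and path, masking variable segments."""
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.match(segment) or _UUID.match(segment):
            segments.append(":id")  # placeholder instead of the concrete value
        else:
            segments.append(segment)
    return f"{method.upper()} {'/'.join(segments)}"


print(low_cardinality_name("get", "/projects/42"))  # GET /projects/:id
```

The concrete value that was masked out of the name can still be recorded as a span attribute, where high cardinality does not affect grouping.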

Span kind

Span kind must have one of the following values:

server for server operations, for example, HTTP server handler.
client for client operations, for example, HTTP client requests.
producer for message producers, for example, a Kafka producer.
consumer for message consumers and async functions, for example, a Kafka consumer.
internal for internal operations.

Status code

Status code indicates whether an operation succeeded or failed. It must have one of the following values:

ok - success.
error - failure.
unset - the default value which allows backends to assign the status.

Attributes

To record contextual information, you can annotate spans with attributes that carry information specific to the operation. For example, an HTTP endpoint may have such attributes as http.method = GET and http.route = /projects/:id.

You can name attributes as you wish, but for common operations you should follow the semantic attribute conventions, which define a list of common attribute keys with their meaning and possible values.

Events

You can also annotate spans with events that have a start time and an arbitrary number of attributes. The main difference between events and spans is that events don't have an end time (and therefore no duration).

Events usually represent exceptions, errors, logs, and messages (such as in RPC), but you can create custom events as well.
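A rough sketch of how events attach to a span follows. The classes `Event` and `ToySpan` are hypothetical stand-ins, not SDK types. The event name `exception` and the keys `exception.type` and `exception.message` follow the OpenTelemetry semantic conventions. Setting the span status to `error` inside `record_exception` is a simplification made here for brevity.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    timestamp_ns: int  # a single point in time; there is no end time
    attributes: dict = field(default_factory=dict)


@dataclass
class ToySpan:
    name: str
    status: str = "unset"
    events: list = field(default_factory=list)

    def add_event(self, name, attributes=None):
        """Append a timestamped event with optional attributes."""
        self.events.append(Event(name, time.time_ns(), dict(attributes or {})))

    def record_exception(self, exc):
        """Record an exception as an event and mark the span as failed."""
        self.add_event("exception", {
            "exception.type": type(exc).__name__,
            "exception.message": str(exc),
        })
        self.status = "error"  # simplification: coupled here for illustration


span = ToySpan("select_project")
span.add_event("cache.miss", {"key": "project:42"})  # custom event
try:
    raise ValueError("project not found")
except ValueError as exc:
    span.record_exception(exc)
print(span.status, [e.name for e in span.events])
```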

Context

Span context carries information about that span as it propagates through different components and services.

Trace/span context is request-scoped data such as:

Trace ID. A globally unique identifier that represents the entire trace or query. All spans within a trace have the same trace ID.
Span ID. A unique identifier for the specific span within a trace. Each span within a trace has a different span ID.
Trace flags. Flags that indicate various properties of the trace, such as whether it's sampled or not. Sampling refers to the process of determining which spans should be recorded and reported to the observability backend.
Trace State. An optional field that contains additional vendor or application-specific data related to the trace.

Span context is important for maintaining the continuity and correlation of spans within a distributed system. It allows different services and components to associate their spans with the correct trace and provides end-to-end visibility into the flow of requests or transactions.

Span context is typically propagated using headers or metadata of the communication protocols between services, similar to how baggage data is propagated. This ensures that when a service receives a request, it can extract the span context and associate the incoming span with the correct trace.

You can use data from the context for span correlation or sampling. For example, you can use the trace ID to determine which spans belong to which trace.

Context propagation

Context propagation ensures that relevant contextual data, such as trace IDs, span IDs, and other metadata, is propagated consistently across different services and components of an application.

OpenTelemetry propagates context between functions within a process (in-process propagation) and from one service to another (distributed propagation).

In-process propagation can be implicit or explicit depending on the programming language you are using. Implicit propagation is done automatically by storing the active context in thread-local variables or a similar per-execution store (Java, Python, Ruby, Node.js). Explicit propagation requires passing the active context from function to function as an argument (Go).

For distributed context propagation, OpenTelemetry supports several protocols that define how to serialize and pass context data:

W3C trace context in traceparent header, for example, traceparent=00-84b54e9330faae5350f0dd8673c98146-279fa73bc935cc05-01.
B3 Zipkin in headers that start with x-b3-, for example, X-B3-TraceId.

W3C trace context is the recommended propagator that is enabled by default.
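As an illustration of the W3C format, the following sketch parses and formats a `traceparent` value using only the standard library. The names `SpanContext`, `parse_traceparent`, and `format_traceparent` are hypothetical. The parser is deliberately strict and only accepts the four-field layout of version `00`, which is an assumption made for brevity.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags, all lowercase hex
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[SpanContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # The lowest bit of the flags field is the "sampled" flag.
    return SpanContext(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: SpanContext) -> str:
    """Serialize a context back into a version-00 traceparent value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


ctx = parse_traceparent("00-84b54e9330faae5350f0dd8673c98146-279fa73bc935cc05-01")
print(ctx)
```

Returning `None` for a malformed header lets the receiving service fall back to starting a new trace instead of failing the request.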

Baggage

Baggage works similarly to a span context and allows you to propagate custom key:value pairs (attributes) from one service to another. In the gRPC world, a similar concept is called gRPC metadata.

Baggage allows you to associate key-value pairs with requests or transactions. These key-value pairs represent contextual information that may be relevant to the processing of the request or transaction, such as user IDs, session IDs, or other application-specific metadata.

Baggage helps maintain and correlate contextual information across distributed systems, enabling better observability and analysis of application behavior. It provides a standardized way to pass relevant data throughout the system without relying on ad hoc mechanisms or manual instrumentation.
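On the wire, the W3C Baggage specification carries these pairs in a `baggage` header as a comma-separated list of `key=value` members with percent-encoded values. The sketch below shows one way to encode and decode such a header. The helper names `encode_baggage` and `decode_baggage` are hypothetical, keys are assumed to be plain tokens that need no encoding, and optional member properties after `;` are simply dropped.

```python
from urllib.parse import quote, unquote


def encode_baggage(items: dict) -> str:
    """Serialize key-value pairs into a baggage header value."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in items.items()
    )


def decode_baggage(header: str) -> dict:
    """Parse a baggage header value, skipping malformed members."""
    items = {}
    for member in header.split(","):
        member = member.strip()
        if not member or "=" not in member:
            continue
        key, _, rest = member.partition("=")
        value = rest.split(";", 1)[0]  # ignore optional properties
        items[key.strip()] = unquote(value.strip())
    return items


header = encode_baggage({"user.id": "42", "session": "a b"})
print(header)  # user.id=42,session=a%20b
print(decode_baggage(header))
```

Because baggage travels with every downstream request, it is best kept small and free of sensitive values.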

Instrumentations

OpenTelemetry instrumentations are plugins for popular frameworks and libraries that use OpenTelemetry API to record important operations, for example, HTTP requests, DB queries, logs, errors, and more.

An instrumentation library is the library that performs the instrumentation itself, not the target of the instrumentation. You specify an instrumentation library name when you create a tracer.
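Conceptually, an instrumentation wraps an operation, times it, and records the outcome. The decorator below is a toy sketch of that idea using only the standard library. `traced`, `FINISHED_SPANS`, and the instrumentation name `example.manual` are hypothetical, and the list stands in for a real exporter.

```python
import functools
import time

FINISHED_SPANS = []  # stand-in for an exporter


def traced(span_name, instrumentation_name="example.manual"):
    """Wrap a function so that every call is recorded as a span-like dict."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record = {
                "name": span_name,
                "instrumentation": instrumentation_name,
                "status": "unset",
            }
            start = time.perf_counter_ns()
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                record["status"] = "error"
                record["error"] = repr(exc)
                raise  # the caller still sees the original exception
            finally:
                record["duration_ns"] = time.perf_counter_ns() - start
                FINISHED_SPANS.append(record)
        return wrapper
    return decorator


@traced("select_project")
def select_project(project_id):
    if project_id < 0:
        raise ValueError("project_id must be non-negative")
    return {"id": project_id}


select_project(42)
print(FINISHED_SPANS[-1]["name"], FINISHED_SPANS[-1]["status"])
```

Note that the span name is the fixed string `select_project` and does not include the argument, in line with the span naming guidance above.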

What to instrument?

You don't need to instrument every operation to get the most out of tracing. It can take a lot of time and is usually not necessary. Consider prioritizing the following operations:

Network operations, for example, HTTP requests or RPC calls.
Filesystem operations, for example, reading/writing to files.
Database queries which combine network and filesystem operations.
Errors and logs, for example, using structured logging.

Best practices

To effectively implement OpenTelemetry tracing, it is important to follow best practices that ensure accurate and insightful collection of telemetry data.

Initialize Early. Be sure to initialize OpenTelemetry and any relevant variables before using libraries that require instrumentation. This will ensure that all traces are captured accurately.

Automatic vs. Manual. OpenTelemetry provides both automatic and manual instrumentation options. While automatic instrumentation is a good starting point, manual instrumentation may be necessary for certain scenarios that require more control.

Instrument Key Components. Instrument critical components of your application, such as HTTP handlers, database queries, RPC calls, and background jobs. Focus on areas that are critical for performance, reliability, or user experience. Be selective about what you instrument to avoid unnecessary overhead.

Use Semantic Conventions. When adding instrumentation to your code, it is important to follow semantic conventions. This means using standardized attribute names and values, span names, and span kinds as defined by the OpenTelemetry specification. Doing so ensures consistency and interoperability across different instrumentation libraries and backends.

Tail-based Sampling. Consider using tail-based sampling, where the decision to keep a trace is made after the trace has completed, based on the whole trace (for example, whether it contains errors or is unusually slow). This helps manage the volume of trace data while ensuring you capture critical traces.
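A tail-based sampling decision is typically made once all spans of a trace are available, so it can take errors and total latency into account. The following is a minimal sketch of such a policy, not a production sampler: the function `keep_trace`, the dictionary keys, and the default thresholds are all assumptions chosen for illustration.

```python
def keep_trace(spans, latency_threshold_ms=500.0, baseline_ratio=0.1):
    """Decide whether to keep a completed trace (list of span dicts)."""
    # Rule 1: always keep traces that contain a failed span.
    if any(s["status"] == "error" for s in spans):
        return True

    # Rule 2: always keep traces that are slow end to end.
    start = min(s["start_ms"] for s in spans)
    end = max(s["end_ms"] for s in spans)
    if end - start >= latency_threshold_ms:
        return True

    # Rule 3: keep a deterministic fraction of the remaining traces,
    # derived from the trace ID so every collector makes the same choice.
    trace_id = spans[0]["trace_id"]
    bucket = int(trace_id[-8:], 16) / 0xFFFFFFFF
    return bucket < baseline_ratio


slow_trace = [
    {"trace_id": "84b54e9330faae5350f0dd8673c98146", "status": "unset",
     "start_ms": 0.0, "end_ms": 820.0},
]
print(keep_trace(slow_trace))
```

Deriving the baseline decision from the trace ID rather than a random number keeps the outcome reproducible for a given trace.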

By following these best practices, you can effectively use OpenTelemetry tracing to gain insights into your application's behavior and improve its performance, reliability, and observability.

What's next?

Distributed tracing is valuable for understanding the end-to-end behavior of complex applications, identifying performance issues, optimizing system resources, and providing insights for better decision making in terms of architectural improvements or optimizations.

Next, learn about the OpenTelemetry tracing API for your programming language:

Architecture. OpenTelemetry provides a flexible and extensible architecture for collecting observability data from applications.

Timeseries Metrics. OpenTelemetry Metrics is an open standard on how to collect, aggregate, and send metrics to APM tools. It is a popular alternative to Prometheus.