OpenTelemetry Distributed Tracing
Distributed tracing allows you to see how a request progresses through different services and systems, the timing of each operation, and any logs and errors as they occur.
OpenTelemetry Tracing is particularly valuable in the context of microservices architecture, where applications are composed of multiple independent services working together to fulfill user requests.
How does tracing work?
In modern applications, especially those based on microservices or serverless architectures, different services often interact with each other to fulfill a single user request. This makes it challenging to identify performance bottlenecks, diagnose issues, and analyze the overall system behavior.
Distributed tracing aims to address these challenges by creating a trace, which is a representation of a single user request's journey through the various services and components. Each trace consists of a series of interconnected spans, where each span represents an individual operation or activity within a specific service or component.
When a service calls another service, the trace context is propagated along with the request. This usually involves injecting trace headers into the outgoing request, allowing downstream services to participate in the same trace.
As the request flows through the system, each service creates its own span, records the operation's duration, metadata, and any relevant context in that span, and passes the updated trace context downstream.
Distributed tracing tools use the generated trace data to provide visibility into system behavior, aid in identifying performance issues, assist with debugging, and help ensure the reliability and scalability of distributed applications.
A span represents an operation (unit of work) in a trace. A span could be a remote procedure call (RPC), a database query, or an in-process function call. A span has:
- A span name (operation name).
- A parent span.
- A span kind.
- Start and end time.
- A status that reports whether the operation succeeded or failed.
- A set of key-value attributes describing the operation.
- A timeline of events.
- A list of links to other spans.
- A span context that propagates trace ID and other data between different services.
A trace is a tree of spans that shows the path that a request takes through an app. The root span is the first span in a trace and has no parent.
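To make the span fields and the tree structure concrete, here is a minimal sketch in plain Python. The `Span` class and the `build_tree` helper are hypothetical and exist only for illustration; they are not the OpenTelemetry SDK types, and the span names are made up.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record; NOT the OpenTelemetry SDK Span class."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None          # None marks the root span
    kind: str = "internal"
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None
    status: str = "unset"
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    links: list = field(default_factory=list)

    def child(self, name, kind="internal"):
        # A child shares the trace ID and points back at its parent.
        return Span(name=name, trace_id=self.trace_id,
                    parent_id=self.span_id, kind=kind)

    def end(self, status="ok"):
        self.end_ns = time.time_ns()
        self.status = status


def build_tree(spans):
    """Return the root span and a parent_id -> children mapping."""
    root, children = None, {}
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)
    return root, children
```

A backend that receives spans in arbitrary order can rebuild the trace this way, because every span carries its trace ID and its parent's span ID.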
OpenTelemetry backends use span names and some attributes to group similar spans together. To group spans properly, give them short and concise names. The total number of unique span names should be fewer than 1000. Otherwise, you will have too many span groups to browse and compare effectively.
The following kinds of names are good because they are short, distinctive, and help group similar spans together:
- Good. A route name with param names.
- Good. A function name without arguments.
- Good. A database query with placeholders.
The following kinds of names are bad because they contain variable params and args:
- Bad. Contains a variable param.
- Bad. Contains a variable.
- Bad. Contains a variable arg.
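As an illustration of the naming guidance, the sketch below turns a concrete request path into a low-cardinality span name. The `normalize_span_name` function, its rule set (numeric and UUID-like segments are treated as variable), and the example paths are all assumptions made for this example. In practice it is usually better to take the route template directly from your web framework.

```python
import re

# Hypothetical rule set: purely numeric or UUID-like segments are variable.
_NUMERIC = re.compile(r"\d+")
_UUID = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def normalize_span_name(method, path):
    """Replace variable path segments with a placeholder param name."""
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.fullmatch(segment) or _UUID.fullmatch(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return method + " " + "/".join(segments)


# Both requests map to the same span name, so they land in one group.
print(normalize_span_name("GET", "/projects/42"))
print(normalize_span_name("GET", "/projects/7"))
```

The variable value itself is not lost: it belongs in a span attribute rather than in the span name.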
Span kind must have one of the following values:
- server: for server operations, for example, an HTTP server handler.
- client: for client operations, for example, HTTP client requests.
- producer: for message producers, for example, a Kafka producer.
- consumer: for message consumers and async processing in general, for example, a Kafka consumer.
- internal: for internal operations.
Status code indicates whether an operation succeeded or failed. It must have one of the following values:
- ok: the operation succeeded.
- error: the operation failed.
- unset: the default value, which allows backends to assign the status.
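The sketch below models span kind and status as enumerations and shows one common pattern: leave the status unset on success and set it to error when an exception escapes. OpenTelemetry defines three status codes (unset, ok, and error). The dictionary-based span and the `record_status` helper are hypothetical stand-ins for illustration, not SDK APIs.

```python
from contextlib import contextmanager
from enum import Enum


class SpanKind(Enum):
    SERVER = "server"
    CLIENT = "client"
    PRODUCER = "producer"
    CONSUMER = "consumer"
    INTERNAL = "internal"


class StatusCode(Enum):
    UNSET = "unset"   # default; the backend may decide
    OK = "ok"
    ERROR = "error"


@contextmanager
def record_status(span):
    """Mark the span as failed if the wrapped block raises."""
    try:
        yield span
    except Exception as exc:
        span["status"] = StatusCode.ERROR
        span["status_description"] = str(exc)
        raise  # tracing must not swallow the error
```

Leaving the status unset on success is a deliberate choice in this sketch: it lets the backend or a higher-level instrumentation decide whether the operation counts as successful.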
To record contextual information, you can annotate spans with attributes that carry information specific to the operation. For example, an HTTP endpoint may have attributes such as http.method = GET and http.route = /projects/:id.
You can name attributes as you want, but for common operations you should follow the semantic attribute conventions, which define a list of common attribute keys with their meaning and possible values.
You can also annotate spans with events that have a timestamp and an arbitrary number of attributes. The main difference between events and spans is that events don't have an end time (and therefore no duration).
Events usually represent exceptions, errors, logs, and messages (such as in RPC), but you can create custom events as well.
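The following sketch shows how attributes and events relate to a span. It uses a plain dictionary instead of a real SDK span, and the `add_event` helper is hypothetical. The `exception.type` and `exception.message` keys follow the semantic conventions for exception events.

```python
import time


def add_event(span, name, attributes=None):
    """Append a point-in-time event: a timestamp, but no end time."""
    event = {
        "name": name,
        "time_ns": time.time_ns(),
        "attributes": dict(attributes or {}),
    }
    span["events"].append(event)
    return event


span = {
    "name": "GET /projects/:id",
    "attributes": {"http.method": "GET", "http.route": "/projects/:id"},
    "events": [],
}

try:
    int("not-a-number")
except ValueError as exc:
    # Record the error as an event on the span instead of losing it.
    add_event(span, "exception", {
        "exception.type": type(exc).__name__,
        "exception.message": str(exc),
    })
```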
Span context carries information about that span as it propagates through different components and services.
Trace/span context is request-scoped data such as:
- Trace ID. A globally unique identifier that represents the entire trace or query. All spans within a trace have the same trace ID.
- Span ID. A unique identifier for the specific span within a trace. Each span within a trace has a different span ID.
- Trace flags. Flags that indicate various properties of the trace, such as whether it's sampled or not. Sampling refers to the process of determining which spans should be recorded and reported to the observability backend.
- Trace State. An optional field that contains additional vendor or application-specific data related to the trace.
Span context is important for maintaining the continuity and correlation of spans within a distributed system. It allows different services and components to associate their spans with the correct trace and provides end-to-end visibility into the flow of requests or transactions.
Span context is typically propagated using headers or metadata of the communication protocols between services, similar to how baggage data is propagated. This ensures that when a service receives a request, it can extract the span context and associate the incoming span with the correct trace.
You can use data from the context for span correlation or sampling. For example, you can use the trace ID to determine which spans belong to which trace.
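As a rough illustration, the span context can be modeled as a small immutable record. The `SpanContext` class and `group_by_trace` helper below are hypothetical, and the IDs in the usage are made up. The sampled flag is the lowest bit of the trace flags, as defined by W3C Trace Context.

```python
from collections import defaultdict
from dataclasses import dataclass

SAMPLED_FLAG = 0x01  # lowest bit of the trace flags


@dataclass(frozen=True)
class SpanContext:
    """Illustrative record; not the OpenTelemetry SDK class."""
    trace_id: str
    span_id: str
    trace_flags: int = 0
    trace_state: str = ""

    @property
    def sampled(self):
        return bool(self.trace_flags & SAMPLED_FLAG)


def group_by_trace(contexts):
    """Correlate spans: map each trace ID to the span IDs it contains."""
    traces = defaultdict(list)
    for ctx in contexts:
        traces[ctx.trace_id].append(ctx.span_id)
    return dict(traces)
```

Making the record immutable mirrors the idea that a span's identity does not change once the span has been created.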
Context propagation ensures that relevant contextual data, such as trace IDs, span IDs, and other metadata, is propagated consistently across different services and components of an application.
OpenTelemetry propagates context between functions within a process (in-process propagation) and from one service to another (distributed propagation).
In-process propagation can be implicit or explicit depending on the programming language you are using. Implicit propagation is done automatically by storing the active context in thread-local or similar context-local storage (Java, Python, Ruby, Node.js). Explicit propagation requires passing the active context from function to function as an argument (Go).
For distributed context propagation, OpenTelemetry supports several protocols that define how to serialize and pass context data:
- W3C trace context in the traceparent header.
- B3 Zipkin in headers that start with x-b3-.
W3C trace context is the recommended propagator that is enabled by default.
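To show what distributed propagation looks like on the wire, here is a minimal sketch of injecting and extracting a W3C traceparent header. The header has four dash-separated, lowercase hex fields: version, trace ID (32 characters), parent span ID (16 characters), and trace flags (2 characters). The `inject` and `extract` functions are hypothetical helpers that handle only version 00; a real application should rely on the propagators shipped with OpenTelemetry.

```python
import re

# version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def inject(headers, trace_id, span_id, sampled):
    """Write the current span context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Read the span context from incoming headers, or return None."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", "").strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid; this sketch only understands version 00.
    if version != "00" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Illustrative IDs only.
outgoing = inject({}, "0af7651916cd43dd8448eb211c80319c",
                  "b7ad6b7169203331", sampled=True)
incoming = extract(outgoing)
```

The downstream service uses the extracted trace ID for its own spans and the extracted span ID as their parent, which is what links the two services into one trace.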
Baggage allows you to associate key-value pairs with requests or transactions. These key-value pairs represent contextual information that may be relevant to the processing of the request or transaction, such as user IDs, session IDs, or other application-specific metadata.
Baggage helps maintain and correlate contextual information across distributed systems, enabling better observability and analysis of application behavior. It provides a standardized way to pass relevant data throughout the system without relying on ad hoc mechanisms or manual instrumentation.
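Baggage travels between services in much the same way as the trace context. The sketch below encodes key-value pairs into a single header value in the style of the W3C baggage header (comma-separated key=value members with percent-encoded values). The `encode_baggage` and `decode_baggage` helpers and the example keys are hypothetical. The sketch assumes keys are plain tokens and ignores optional member properties.

```python
from urllib.parse import quote, unquote


def encode_baggage(items):
    """Serialize a dict as 'k1=v1,k2=v2' with percent-encoded values."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in items.items()
    )


def decode_baggage(header):
    """Parse a baggage header value back into a dict of strings."""
    items = {}
    for member in header.split(","):
        key, sep, rest = member.strip().partition("=")
        if not sep or not key:
            continue  # skip malformed members
        value = rest.split(";", 1)[0]  # drop optional properties
        items[key.strip()] = unquote(value.strip())
    return items


# Hypothetical keys chosen for illustration.
header = encode_baggage({"user.id": "42", "session": "a b"})
restored = decode_baggage(header)
```

Because baggage is sent with every downstream request, it is best kept small and free of sensitive data.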
OpenTelemetry instrumentations are plugins for popular frameworks and libraries that use the OpenTelemetry API to record important operations, for example, HTTP requests, DB queries, logs, errors, and more.
An instrumentation library is the library that performs the instrumentation itself, not the target of the instrumentation. You specify an instrumentation library name when you create a tracer.
What to instrument?
You don't need to instrument every operation to get the most out of tracing. It can take a lot of time and is usually not necessary. Consider prioritizing the following operations:
- Network operations, for example, HTTP requests or RPC calls.
- Filesystem operations, for example, reading/writing to files.
- Database queries, which combine network and filesystem operations.
- Errors and logs, for example, using structured logging.
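One lightweight way to cover such operations is a decorator that wraps a function in a span. The sketch below is a simplified stand-in that records finished spans in a list. The `traced` decorator, the `FINISHED_SPANS` list, and the `read_config` function are hypothetical; with OpenTelemetry you would create the span through a tracer instead.

```python
import functools
import time

FINISHED_SPANS = []  # stand-in for an exporter


def traced(name):
    """Record duration and failure of the wrapped function as a span."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            span = {"name": name, "status": "unset"}
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                span["status"] = "error"
                span["error"] = repr(exc)
                raise
            finally:
                span["duration_s"] = time.perf_counter() - start
                FINISHED_SPANS.append(span)
        return wrapper
    return decorator


@traced("read_config")  # a function name without arguments
def read_config(path):
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

The span name is fixed at decoration time, so calls with different arguments still fall into the same span group.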
How to start using OpenTelemetry Tracing?
The easiest way to get started with tracing is to pick an OpenTelemetry APM and follow the documentation. Most vendors provide pre-configured OpenTelemetry distributions that allow you to skip some steps and can significantly improve your experience.
Uptrace is an open source APM for OpenTelemetry with an intuitive query builder, rich dashboards, automatic alerts, and integrations for most languages and frameworks.
Using Uptrace, developers and operators can gain insight into the latency, errors, and dependencies of their distributed applications. Uptrace helps identify performance bottlenecks, debug problems, and optimize the overall system.
You can get started with Uptrace by downloading a DEB/RPM package or a pre-compiled Go binary.
Distributed tracing is valuable for understanding the end-to-end behavior of complex applications, identifying performance issues, optimizing system resources, and providing insights for better decision making in terms of architectural improvements or optimizations.
Next, learn about the OpenTelemetry tracing API for your programming language.