OpenTelemetry Distributed Tracing
Distributed tracing lets you see how a request progresses through different services and systems, how long each operation takes, and which logs and errors occur along the way.
Distributed tracing tools provide visibility into system behavior, aid in identifying performance issues, assist with debugging, and help ensure the reliability and scalability of distributed applications.
Distributed tracing is particularly valuable in the context of microservices architecture, where applications are composed of multiple independent services working together to fulfill user requests.
Spans
A span represents an operation (unit of work) in a trace. A span could be a remote procedure call (RPC), a database query, or an in-process function call. A span has:
- A parent span (absent for the root span).
- A span name (operation name).
- A span kind.
- Start and end time.
- A status that reports whether the operation succeeded or failed.
- A set of key-value attributes describing the operation.
- A timeline of events.
- A list of links to other spans.
- A span context that propagates trace ID and other data between different services.
A trace is a tree of spans that shows the path a request takes through an app. The root span is the first span in a trace.
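To make the span fields and the tree shape concrete, here is a minimal sketch of a span record in plain Python. The `Span` class, its `child` and `end` methods, and the ID generation are assumptions made for illustration; this is not the OpenTelemetry SDK class, and it omits events, links, and context propagation.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Span:
    """Illustrative span record (hypothetical, not the OpenTelemetry SDK type)."""

    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent: Optional["Span"] = None  # None marks the root span
    kind: str = "internal"
    status: str = "unset"
    attributes: Dict[str, Any] = field(default_factory=dict)
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None

    def child(self, name: str, **attributes: Any) -> "Span":
        # A child shares the trace ID and points back at its parent.
        return Span(name=name, trace_id=self.trace_id, parent=self, attributes=attributes)

    def end(self, status: str = "ok") -> None:
        self.status = status
        self.end_time = time.time()


# A two-span trace: an HTTP handler (root) that runs one database query.
root = Span(name="GET /projects/:id", trace_id=secrets.token_hex(16), kind="server")
query = root.child("SELECT * FROM projects WHERE id = ?")
query.end()
root.end()
```

Following `parent` references from any span leads back to the root, which is how the tree is recovered from a flat list of spans.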
Span names
Backends use span names and some attributes to group similar spans together. To group spans properly, give them short and concise names. The total number of unique span names should be less than 1000. Otherwise, you will end up with too many span groups to browse and compare comfortably.
The following names are good because they are short, distinctive, and help group similar spans together:
Span name | Comment |
---|---|
GET /projects/:id | Good. A route name with param names. |
select_project | Good. A function name without arguments. |
SELECT * FROM projects WHERE id = ? | Good. A database query with placeholders. |
The following names are bad because they contain variable params and args:
Span name | Comment |
---|---|
GET /projects/42 | Bad. Contains a variable param 42. |
select_project(42) | Bad. Contains a variable 42. |
SELECT * FROM projects WHERE id = 42 | Bad. Contains a variable arg 42. |
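As a rough illustration of turning a bad name into a good one, the sketch below replaces numeric path segments with a placeholder. The helper name `route_span_name` and the "numeric segment means ID" rule are assumptions for this example; when your framework exposes the matched route template, that is usually the better source for the span name.

```python
import re

# Hypothetical rule: treat purely numeric path segments as variable IDs.
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def route_span_name(method: str, path: str) -> str:
    """Build a low-cardinality span name from an HTTP method and a raw path."""
    return f"{method.upper()} {_NUMERIC_SEGMENT.sub('/:id', path)}"


print(route_span_name("get", "/projects/42"))          # GET /projects/:id
print(route_span_name("GET", "/projects/42/tasks/7"))  # GET /projects/:id/tasks/:id
```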
Span kind
Span kind must have one of the following values:
- server: for server operations, for example, an HTTP server handler.
- client: for client operations, for example, HTTP client requests.
- producer: for message producers, for example, a Kafka producer.
- consumer: for message consumers and async processing in general, for example, a Kafka consumer.
- internal: for internal operations.
Status code
Status code indicates whether an operation succeeded or failed. It must have one of the following values:
- ok: success.
- error: failure.
- unset: the default value, which allows backends to assign the status.
Attributes
To record contextual information, you can annotate spans with attributes that carry information specific to the operation. For example, an HTTP endpoint may have attributes such as http.method = GET and http.route = /projects/:id.
You can name attributes as you like, but for common operations you should follow the semantic attribute conventions, which define a list of common attribute keys with their meaning and possible values.
Events
You can also annotate spans with events that have a start time (timestamp) and an arbitrary number of attributes. The main difference between events and spans is that events don't have an end time (and therefore no duration).
Events usually represent exceptions, errors, logs, and messages (such as in RPC), but you can create custom events as well.
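The sketch below shows what an exception recorded as an event might look like: a name, a single timestamp, and attributes, with no end time. The `exception_event` helper and the dictionary layout are assumptions for illustration; the `exception.type`, `exception.message`, and `exception.stacktrace` keys follow the OpenTelemetry semantic conventions for exceptions.

```python
import time
import traceback
from typing import Any, Dict, List


def exception_event(exc: BaseException) -> Dict[str, Any]:
    """Describe an exception as a point-in-time event (hypothetical helper)."""
    return {
        "name": "exception",
        "timestamp": time.time(),  # events carry one timestamp, no end time
        "attributes": {
            "exception.type": type(exc).__name__,
            "exception.message": str(exc),
            "exception.stacktrace": "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
        },
    }


events: List[Dict[str, Any]] = []
try:
    int("not a number")
except ValueError as exc:
    events.append(exception_event(exc))
```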
Context
Trace/span context is request-scoped data such as:
- trace id - unique trace identifier;
- span id - unique span identifier;
- trace flags - various flags such as sampled, deferred, and debug.
You can use data from a context for span correlation or sampling. For example, you can use the trace id to find out which spans belong to which traces.
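Correlating spans by trace id can be sketched as a simple grouping step. The span dictionaries and the `group_by_trace` helper are made up for this example; real backends do this at storage or query time.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical flat list of spans as they might arrive from several services.
spans = [
    {"trace_id": "t1", "span_id": "s1", "name": "GET /projects/:id"},
    {"trace_id": "t2", "span_id": "s2", "name": "GET /projects/:id"},
    {"trace_id": "t1", "span_id": "s3", "name": "select_project"},
]


def group_by_trace(items: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Collect span names under the trace id they belong to."""
    traces: Dict[str, List[str]] = defaultdict(list)
    for span in items:
        traces[span["trace_id"]].append(span["name"])
    return dict(traces)


traces = group_by_trace(spans)
```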
Context propagation
OpenTelemetry propagates context between functions within a process (in-process propagation) and even from one service to another (distributed propagation).
In-process propagation can be implicit or explicit depending on the programming language you are using. Implicit propagation is done automatically by storing the active context in thread-local variables or a similar mechanism (Java, Python, Ruby, Node.js). Explicit propagation requires passing the active context from function to function as an argument (Go).
For distributed context propagation, OpenTelemetry supports several protocols that define how to serialize and pass context data:
- W3C trace context in the traceparent header, for example, traceparent=00-84b54e9330faae5350f0dd8673c98146-279fa73bc935cc05-01.
- B3 Zipkin in headers that start with x-b3-, for example, X-B3-TraceId.
W3C trace context is the recommended propagator that is enabled by default.
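To show what the traceparent header carries, here is a minimal parser for the version 00 layout (version, trace id, parent span id, flags). `TraceParent`, `parse_traceparent`, and `format_traceparent` are hypothetical names for this sketch; in practice the OpenTelemetry propagators handle this for you, including cases this sketch ignores, such as future header versions and the tracestate header.

```python
import re
from typing import NamedTuple

# Version 00 layout: 2 hex - 32 hex (trace id) - 16 hex (span id) - 2 hex (flags).
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> TraceParent:
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid according to the W3C specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("traceparent contains an all-zero id")
    # The lowest flag bit marks the trace as sampled.
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(tp: TraceParent) -> str:
    flags = "01" if tp.sampled else "00"
    return f"{tp.version}-{tp.trace_id}-{tp.span_id}-{flags}"


parent = parse_traceparent("00-84b54e9330faae5350f0dd8673c98146-279fa73bc935cc05-01")
```

A service that receives this header creates its spans with the same trace id and uses the received span id as the parent of its first span.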
Baggage
Baggage works similarly to a span context and allows you to propagate user-defined key-value pairs (attributes) from one service to another. In the gRPC world, a similar concept is called gRPC metadata.
For example, you can use baggage to propagate information about the service that created the trace to all other services.
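On the wire, the W3C baggage format is a comma-separated list of key=value members with percent-encoded values. The sketch below encodes and decodes such a header with the standard library; the helper names and the `service.origin` key are assumptions for illustration, and optional member properties (after a semicolon) are simply dropped.

```python
from typing import Dict
from urllib.parse import quote, unquote


def encode_baggage(items: Dict[str, str]) -> str:
    """Serialize key-value pairs into a baggage header value."""
    return ",".join(f"{key}={quote(str(value), safe='')}" for key, value in items.items())


def decode_baggage(header: str) -> Dict[str, str]:
    """Parse a baggage header value, ignoring optional member properties."""
    items: Dict[str, str] = {}
    for member in header.split(","):
        pair = member.split(";", 1)[0].strip()  # drop ";property" suffixes
        if not pair:
            continue
        key, _, value = pair.partition("=")
        items[key.strip()] = unquote(value.strip())
    return items


# Hypothetical key naming the service that started the trace.
header = encode_baggage({"service.origin": "checkout api"})
restored = decode_baggage(header)
```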
Instrumentations
OpenTelemetry instrumentations are plugins for popular frameworks and libraries that use the OpenTelemetry API to record important operations, for example, HTTP requests, DB queries, logs, errors, and more.
An instrumentation library is the library which performs the instrumentation itself, not the target of the instrumentation. You provide an instrumentation library name when you create a tracer.
What to instrument?
You don't need to instrument all operations to get the most out of tracing. It can take a lot of time and usually is not needed. Consider prioritizing the following operations:
- Network operations, for example, HTTP requests or RPC calls.
- Filesystem operations, for example, reading/writing to files.
- Database queries, which combine network and filesystem operations.
- Errors and logs, for example, using structured logging.
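As a sketch of what instrumenting one of these operations involves, the decorator below wraps a filesystem read and records its name, duration, and status. `traced`, `recorded`, and `read_file` are hypothetical names; a real instrumentation would create spans through the OpenTelemetry API instead of appending to a list.

```python
import functools
import os
import time
from typing import Any, Callable, Dict, List

recorded: List[Dict[str, Any]] = []  # stand-in for a span exporter


def traced(name: str) -> Callable:
    """Record name, duration, and status for each call of the wrapped function."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise  # the caller still sees the original error
            finally:
                recorded.append(
                    {"name": name, "status": status, "duration_s": time.perf_counter() - start}
                )

        return wrapper

    return decorator


@traced("read_file")
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as fh:
        return fh.read()


read_file(os.devnull)  # succeeds and returns an empty string
try:
    read_file("/definitely/missing/file.txt")
except OSError:
    pass  # the failure is still recorded with status "error"
```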
How to start using OpenTelemetry Tracing?
The easiest way to get started with tracing is to pick an OpenTelemetry backend and follow the documentation. Most vendors provide pre-configured OpenTelemetry distros that allow you to skip some steps and can significantly improve your experience.
Uptrace is an OpenTelemetry backend with an intuitive query builder, rich dashboards, alerting rules, and integrations for most languages and frameworks. It can process billions of spans and metrics on a single server and lets you monitor your applications at 10x lower cost.
Uptrace uses the ClickHouse database to store traces, metrics, and logs. You can use it to monitor applications and set up automatic alerts to receive notifications via email, Slack, Telegram, and more.
You can get started with Uptrace by downloading a DEB/RPM package or a pre-compiled Go binary.
What's next?
Next, learn about the OpenTelemetry tracing API for your programming language: