OpenTelemetry: head-based and tail-based sampling, rate-limiting

Sampling reduces the cost and verbosity of tracing by reducing the number of created (sampled) spans. In terms of performance, sampling can save CPU cycles and memory required to collect, process, and export spans.

Sampling: when and where

Sampling may happen in different stages of spans processing:

Sampling probability

Sampling provides a sampling probability which enables accurate statistical counting of all spans using only a portion of sampled spans. For example, if the sampling probability is 50% and the number of sampled spans is 10, then the adjusted (total) number of spans is 10 / 50% = 20.

NameSideAdjusted countAccuracy
Head-based samplingClient-sideYes100%
Rate-limiting samplingServer-sideYes<90%
Tail-based samplingServer-sideYes<90%

Head-based sampling

Head-based sampling makes the sampling decision as early as possible and propagates it to other participants using the context. This allows saving CPU and memory resources by not collecting any telemetry data for dropped spans (operations).

Head-based sampling is the simplest, most accurate, and most reliable sampling method which you should prefer over all other methods.

A disadvantage of head-based sampling is that you can't sample spans with errors, because that information is not available when spans are created. To address that concern, you can use tail-based sampling.

Head-based sampling also does not account for traffic spikes and may collect more data than desired. This is where rate-limiting sampling becomes handy.

OpenTelemetry head-based sampling

OpenTelemetry has 2 span properties responsible for client sampling:

  • IsRecording - when false, span discards attributes, events, links etc.
  • Sampled - when false, OpenTelemetry drops the span.

You should check IsRecording property to avoid collecting expensive telemetry data.

if span.IsRecording() {
    // collect expensive data
}

Sampler is a function that accepts a root span about to be created. The function returns a sampling decision which must be one of:

  • Drop - trace is dropped. IsRecording = false, Sampled = false.
  • RecordOnly - trace is recorded but not sampled. IsRecording = true, Sampled = false.
  • RecordAndSample - trace is recorded and sampled. IsRecording = true, Sampled = true.

By default, OpenTelemetry samples all traces, but you can configure it to sample a portion of traces. In that case, backends use the sampling probability to adjust the number of spans.

OpenTelemetry samplers

AlwaysOn sampler samples every trace, for example, a new trace will be started and exported for every request.

AlwaysOff sampler samples no traces or, in other words, drops all traces. You can use this sampler to perform load testing or to temporarily disable tracing.

TraceIDRatioBased sampler uses a trace id to sample a fraction of traces, for example, 20% of traces.

Parent-based sampler is a composite sampler which behaves differently based on the parent of the span. When you start a new trace, sampler makes a decision whether or not to sample it and propagates the decision down to other services.

Rate-limiting sampling

Rate-limiting sampling happens on the server side and ensures that you don't exceed certain limits, for example, it allows to sample 10 or less traces per seconds.

Rate-limiting sampling supports adjusted counts but the accuracy is rather low. To achieve better results and improve performance, you should use rate-limiting sampling together with head-based sampling which is more efficient and accurate.

Most backends (including Uptrace) automatically apply rate-limiting sampling when necessary.

Tail-based sampling

With head-based sampling the sampling decision is made upfront and usually at random. Head-based sampling can't sample failed or unusually long operations, because that information is only available at the end of a trace.

With tail-based sampling we delay the sampling decision until all spans of a trace are available which enables better sampling decisions based on all data from the trace. For example, we can sample failed or unusually long traces.

Most backends (including Uptrace) automatically apply tail-based sampling when necessary, but you can also use OpenTelemetry Collector with tailsamplingprocessoropen in new window to configure sampling according to your needs.

Uptrace

Uptrace is an open source DataDog competitoropen in new window with an intuitive query builder, rich dashboards, alerting rules, and integrations for most languages and frameworks. It can process billions of spans and metrics on a single server and allows to monitor your applications at 10x lower cost.

Uptrace uses ClickHouse database to store traces, metrics, and logs. You can use it to monitor applications and set up automatic alerts to receive notifications via email, Slack, Telegram, and more.

You can get startedopen in new window with Uptrace by downloading a DEB/RPM package or a pre-compiled Go binary.

Last Updated: