OpenTelemetry Metrics

OpenTelemetry Metrics is a standard for collecting, aggregating, and sending metrics to OpenTelemetry APM tools such as Uptrace or Prometheus.

While defining a new standard, OpenTelemetry also aims to work with existing metrics instrumentation protocols such as Prometheus and StatsD. Furthermore, the OpenTelemetry Collector supports even more protocols, such as AWS Metrics, InfluxDB, and Chrony.

OpenTelemetry also allows you to correlate metrics and traces via exemplars, which gives you a broader picture of the state of your system.

What are metrics?

Metrics are numerical data points that represent the health and performance of your system, such as CPU utilization, network traffic, and database connections.

You can use metrics to measure, monitor, and compare performance. For example, you can measure server response time, memory utilization, error rate, and more.

Instruments

An instrument is a specific type of metric (e.g., counter, gauge, histogram) that you use to collect data about a particular aspect of your application's behavior.

You capture measurements by creating instruments that have:

  • A unique name, for example, http.server.duration.
  • An instrument kind, for example, Histogram.
  • An optional unit of measure, for example, milliseconds or bytes.
  • An optional description.

Timeseries

A single instrument can produce multiple timeseries. A timeseries is a metric with a unique set of attributes. For example, each host has a separate timeseries for the same metric name.
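To make the idea concrete, here is a minimal sketch of how a timeseries identity could be derived from an instrument name and its attributes. The seriesKey helper and the host.name attribute are hypothetical and only illustrate the concept; this is not how any particular SDK or backend stores timeseries.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a stable identity for a timeseries from the instrument
// name and its attributes. Attribute order must not matter, so keys are sorted.
func seriesKey(name string, attrs map[string]string) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(name)
	b.WriteByte('{')
	for i, k := range keys {
		if i > 0 {
			b.WriteByte(',')
		}
		b.WriteString(k + "=" + attrs[k])
	}
	b.WriteByte('}')
	return b.String()
}

func main() {
	// Hypothetical measurements of one instrument reported by two hosts.
	series := map[string]int64{}
	series[seriesKey("http.server.requests", map[string]string{"host.name": "host1"})] += 10
	series[seriesKey("http.server.requests", map[string]string{"host.name": "host2"})] += 7
	series[seriesKey("http.server.requests", map[string]string{"host.name": "host1"})] += 5

	// One instrument name, two distinct timeseries.
	fmt.Println(len(series))
}
```

Measurements with the same name and the same attribute set land in the same timeseries; changing any attribute value starts a new one.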

Additive instruments

Additive (or summable) instruments produce timeseries that, when added together, produce another meaningful and accurate timeseries. Additive instruments that measure non-decreasing numbers are also called monotonic.

For example, http.server.requests is an additive timeseries, because you can sum the number of requests from different hosts to get the total number of requests.

But system.memory.utilization (percent) is not additive, because the sum of memory utilization from different hosts is not meaningful (90% + 90% = 180%).
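The difference can be sketched in a few lines. The per-host values below are made up, and averaging is only one reasonable way to combine non-additive values; it is an assumption for illustration, not a rule from the specification.

```go
package main

import "fmt"

// sumInt64 adds values from additive timeseries, such as request counts per host.
func sumInt64(values []int64) int64 {
	var total int64
	for _, v := range values {
		total += v
	}
	return total
}

// mean averages values from non-additive timeseries, such as utilization ratios.
// It returns 0 for an empty input to avoid dividing by zero.
func mean(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	var total float64
	for _, v := range values {
		total += v
	}
	return total / float64(len(values))
}

func main() {
	requestsPerHost := []int64{120, 80}        // additive: the sum is meaningful
	utilizationPerHost := []float64{0.9, 0.9} // not additive: a sum of 1.8 means nothing

	fmt.Println(sumInt64(requestsPerHost)) // 200
	fmt.Println(mean(utilizationPerHost))
}
```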

Synchronous instruments

Synchronous instruments are invoked together with operations they are measuring. For example, to measure the number of requests, you can call counter.Add(ctx, 1) whenever there is a new request. Synchronous measurements can have an associated trace context.

For synchronous instruments, the difference between additive and grouping instruments is that additive instruments produce summable timeseries, while grouping instruments produce a histogram.

Instrument     | Properties | Aggregation       | Example
Counter        | monotonic  | sum -> delta      | number of requests, request size
UpDownCounter  | additive   | last value -> sum | number of connections
Histogram      | grouping   | histogram         | request duration, request size

Asynchronous instruments

Asynchronous instruments (observers) periodically invoke a callback function to collect measurements. For example, you can use observers to periodically measure memory or CPU usage. Asynchronous measurements can't have an associated trace context.

When choosing between UpDownCounterObserver (additive) and GaugeObserver (grouping), choose UpDownCounterObserver for summable timeseries and GaugeObserver otherwise. For example, to measure system.memory.usage (bytes), you should use UpDownCounterObserver. But to measure system.memory.utilization (percent), you should use GaugeObserver.

Instrument Name        | Properties | Aggregation            | Example
CounterObserver        | monotonic  | sum -> delta           | CPU time
UpDownCounterObserver  | additive   | last value -> sum      | Memory usage (bytes)
GaugeObserver          | grouping   | last value -> none/avg | Memory utilization (%)
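The contrast between synchronous and asynchronous instruments can be illustrated with a small standard-library sketch. The syncCounter and observer types and the collect function are hypothetical stand-ins, not OpenTelemetry types; they only show who triggers the measurement and when.

```go
package main

import "fmt"

// syncCounter is updated inline by the code being measured.
type syncCounter struct{ total int64 }

// Add records a measurement at the moment the operation happens.
func (c *syncCounter) Add(n int64) { c.total += n }

// observer holds a callback that a reader invokes on every collection cycle.
type observer struct {
	callback func() int64
}

// collect simulates one collection cycle and returns the observed values.
func collect(observers []observer) []int64 {
	out := make([]int64, 0, len(observers))
	for _, o := range observers {
		out = append(out, o.callback())
	}
	return out
}

func main() {
	// Synchronous: the measurement happens together with the operation.
	requests := &syncCounter{}
	for i := 0; i < 3; i++ {
		requests.Add(1) // called from a (hypothetical) request handler
	}

	// Asynchronous: the callback reads the current state only when asked.
	memoryInUse := int64(512)
	observers := []observer{{callback: func() int64 { return memoryInUse }}}

	fmt.Println(requests.total, collect(observers))

	memoryInUse = 640 // state changes between collection cycles
	fmt.Println(collect(observers))
}
```

Because the callback runs outside of any request, there is no operation whose trace context it could carry, which matches the restriction described above.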

Choosing instruments

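  1. If you need a histogram, a heatmap, or percentiles, use Histogram.

  2. If you want to count something by recording a delta value:
       • If the value is monotonic, use Counter.
       • Otherwise, use UpDownCounter.

  3. If you want to measure something by recording an absolute value:
       • If the value is additive and monotonic, use CounterObserver.
       • If the value is additive but not monotonic, use UpDownCounterObserver.
       • If the value is not additive, use GaugeObserver.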
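The instrument properties described in the sections above (delta vs. absolute values, additive, monotonic) can be sketched as a small decision helper. The chooseInstrument function and its parameters are hypothetical and exist only to summarize the rules of this article; they are not part of any OpenTelemetry API.

```go
package main

import "fmt"

// chooseInstrument maps the properties used in this article to an instrument name.
// recordsDelta means the code reports changes (synchronous instruments);
// otherwise it reports absolute values from a callback (asynchronous instruments).
// The additive flag is ignored when recordsDelta is true, because deltas are summable.
func chooseInstrument(needsDistribution, recordsDelta, additive, monotonic bool) string {
	switch {
	case needsDistribution:
		return "Histogram"
	case recordsDelta && monotonic:
		return "Counter"
	case recordsDelta:
		return "UpDownCounter"
	case additive && monotonic:
		return "CounterObserver"
	case additive:
		return "UpDownCounterObserver"
	default:
		return "GaugeObserver"
	}
}

func main() {
	fmt.Println(chooseInstrument(true, false, false, false))  // request latency
	fmt.Println(chooseInstrument(false, true, true, true))    // processed requests
	fmt.Println(chooseInstrument(false, true, true, false))   // active requests
	fmt.Println(chooseInstrument(false, false, true, true))   // CPU time
	fmt.Println(chooseInstrument(false, false, true, false))  // memory usage (bytes)
	fmt.Println(chooseInstrument(false, false, false, false)) // memory utilization (%)
}
```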

Counter

synchronous monotonic

Counter is a synchronous instrument that measures additive non-decreasing values, for example, the total number of:

  • processed requests
  • errors
  • received bytes
  • disk reads

Counters are used to measure the number of occurrences of an event or the accumulation of a value over time. They can only increase with time.

For Counter timeseries, backends usually compute deltas and display rate values, for example, per_min(http.server.requests) returns the number of processed requests per minute.
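As a rough illustration of what "computing deltas" means, the sketch below converts cumulative counter samples into per-interval increases. The sample values are invented, and the reset handling (treating a drop as a restart from zero) is a common convention assumed here; real backends may handle resets differently.

```go
package main

import "fmt"

// deltas converts cumulative counter samples into per-interval increases.
// A sample lower than its predecessor is treated as a counter reset
// (for example, a process restart), so the new value itself is the delta.
func deltas(cumulative []int64) []int64 {
	if len(cumulative) < 2 {
		return nil
	}
	out := make([]int64, 0, len(cumulative)-1)
	for i := 1; i < len(cumulative); i++ {
		d := cumulative[i] - cumulative[i-1]
		if d < 0 {
			d = cumulative[i]
		}
		out = append(out, d)
	}
	return out
}

func main() {
	// Hypothetical cumulative values of http.server.requests, sampled once per minute.
	samples := []int64{100, 160, 250, 30}
	fmt.Println(deltas(samples)) // [60 90 30]
}
```

With one sample per minute, each delta is directly the per-minute rate; for other sampling intervals the delta would be scaled by the interval length.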

CounterObserver

asynchronous monotonic

CounterObserver is the asynchronous version of the Counter instrument.

UpDownCounter

synchronous additive

UpDownCounter is a synchronous instrument which measures additive values that can increase or decrease with time, for example, the number of:

  • active requests
  • open connections
  • memory in use (megabytes)

For additive non-decreasing values you should use Counter or CounterObserver.

For UpDownCounter timeseries, backends usually display the last value, but different timeseries can be added together. For example, go.sql.connections_open returns the total number of open connections, and go.sql.connections_open{service.name = myservice} returns the number of open connections for one service.
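A typical use of an up-down value, such as the number of active requests, is to add 1 when an operation starts and subtract 1 when it finishes. The sketch below uses a hypothetical upDownCounter type built on sync/atomic rather than the OpenTelemetry API, purely to show the pattern.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// upDownCounter is a simplified stand-in for an UpDownCounter instrument.
type upDownCounter struct{ value int64 }

// Add applies a positive or negative delta.
func (c *upDownCounter) Add(delta int64) { atomic.AddInt64(&c.value, delta) }

// Load returns the current value.
func (c *upDownCounter) Load() int64 { return atomic.LoadInt64(&c.value) }

// track increments the counter while work runs and decrements it afterwards,
// even if work panics.
func track(c *upDownCounter, work func()) {
	c.Add(1)
	defer c.Add(-1)
	work()
}

func main() {
	active := &upDownCounter{}
	track(active, func() {
		fmt.Println("during:", active.Load())
	})
	fmt.Println("after:", active.Load())
}
```

Using defer for the decrement keeps the value from drifting upward when an operation exits early.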

UpDownCounterObserver

asynchronous additive

UpDownCounterObserver is the asynchronous version of the UpDownCounter instrument.

Histogram

synchronous grouping

Histogram is a synchronous instrument that produces a histogram from recorded values, for example:

  • request latency
  • request size

Histograms are used to measure the distribution of values over time. For Histogram timeseries, backends usually display percentiles, heatmaps, and histograms.
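To show how a histogram supports percentiles, here is a minimal sketch of explicit-bucket counting. The histogram type, the bucket bounds, and the Quantile method are assumptions made for illustration; Quantile returns only the upper bound of the bucket that contains the requested rank, which is coarser than the interpolation many backends apply.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// histogram counts values into buckets defined by sorted upper bounds.
// counts has one extra slot for values above the last bound.
type histogram struct {
	bounds []float64
	counts []uint64
	total  uint64
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// Record places v into the first bucket whose upper bound is >= v.
func (h *histogram) Record(v float64) {
	i := sort.SearchFloat64s(h.bounds, v)
	h.counts[i]++
	h.total++
}

// Quantile returns the upper bound of the bucket containing the q-th quantile,
// or +Inf if that rank falls into the overflow bucket.
func (h *histogram) Quantile(q float64) float64 {
	if h.total == 0 {
		return 0
	}
	rank := uint64(math.Ceil(q * float64(h.total)))
	if rank == 0 {
		rank = 1
	}
	var seen uint64
	for i, c := range h.counts {
		seen += c
		if seen >= rank && i < len(h.bounds) {
			return h.bounds[i]
		}
	}
	return math.Inf(1)
}

func main() {
	// Hypothetical request latencies in milliseconds.
	h := newHistogram([]float64{10, 50, 100})
	for _, v := range []float64{5, 7, 20, 30, 40, 60, 70, 80, 90, 500} {
		h.Record(v)
	}
	fmt.Println(h.counts)        // [2 3 4 1]
	fmt.Println(h.Quantile(0.5)) // 50
}
```

Only the bucket counts are kept, not the raw values, which is why the precision of percentiles depends on how the bucket bounds are chosen.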

GaugeObserver

asynchronous grouping

GaugeObserver is an asynchronous instrument that measures non-additive values for which a sum does not produce a meaningful or correct result, for example:

  • error rate
  • memory utilization
  • cache hit rate

For GaugeObserver timeseries, backends usually display the last value and don't allow summing different timeseries together.

Metrics examples

Number of emails

To measure the number of sent emails, you can create a Counter instrument and increment it whenever an email is sent:

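import "go.opentelemetry.io/otel/metric"

// meter is assumed to be a metric.Meter obtained elsewhere, for example via otel.Meter.
emailCounter, err := meter.Int64Counter(
	"some.prefix.emails",
	metric.WithDescription("Number of sent emails"),
)
if err != nil {
	panic(err)
}

// Increment the counter whenever an email is sent.
emailCounter.Add(ctx, 1)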

Later, you can add more attributes to gather detailed statistics, for example:

  • kind = welcome and kind = reset_password to measure different emails.
  • state = sent and state = bounced to measure bounced emails.

Operation latency

To measure the latency of operations, you can create a Histogram instrument and update it synchronously with the operation:

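import (
	"time"

	"go.opentelemetry.io/otel/metric"
)

opHistogram, err := meter.Int64Histogram(
	"some.prefix.duration",
	metric.WithDescription("Duration of some operation"),
	metric.WithUnit("us"), // recorded values are microseconds
)
if err != nil {
	panic(err)
}

// Time the operation and record its duration in microseconds.
t1 := time.Now()
op(ctx)
dur := time.Since(t1)

opHistogram.Record(ctx, dur.Microseconds())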

Cache hit rate

To measure the cache hit rate, you can create a CounterObserver and observe the cache statistics. The hit rate can then be derived from the number of hits and misses:

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

counter, err := meter.Int64ObservableCounter("some.prefix.cache")
if err != nil {
	panic(err)
}

// Arbitrary key/value attributes that split the counter into separate timeseries.
hits := []attribute.KeyValue{attribute.String("type", "hits")}
misses := []attribute.KeyValue{attribute.String("type", "misses")}
errors := []attribute.KeyValue{attribute.String("type", "errors")}

// The callback runs on every collection cycle and reports the current cache statistics.
if _, err := meter.RegisterCallback(
	func(ctx context.Context, o metric.Observer) error {
		stats := cache.Stats()

		o.ObserveInt64(counter, stats.Hits, metric.WithAttributes(hits...))
		o.ObserveInt64(counter, stats.Misses, metric.WithAttributes(misses...))
		o.ObserveInt64(counter, stats.Errors, metric.WithAttributes(errors...))

		return nil
	},
	counter,
); err != nil {
	panic(err)
}

See Monitoring cache stats using OpenTelemetry Metrics for details.

Error rate

To directly measure the error rate, you can create a GaugeObserver and observe the value without worrying how it is calculated:

import (
	"context"
	"math/rand"

	"go.opentelemetry.io/otel/metric"
)

errorRate, err := meter.Float64ObservableGauge("some.prefix.error_rate")
if err != nil {
	panic(err)
}

if _, err := meter.RegisterCallback(
	func(ctx context.Context, o metric.Observer) error {
		// rand.Float64() is a placeholder for the real error rate.
		o.ObserveFloat64(errorRate, rand.Float64())
		return nil
	},
	errorRate,
); err != nil {
	panic(err)
}
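The example above reports rand.Float64() as a placeholder value. In a real callback the rate would come from application state; a minimal sketch, assuming two hypothetical counts (failed and total requests) are available, might look like this:

```go
package main

import "fmt"

// errorRate returns the fraction of failed requests in the range [0, 1].
// It returns 0 when there were no requests to avoid dividing by zero.
func errorRate(failed, total int64) float64 {
	if total <= 0 {
		return 0
	}
	return float64(failed) / float64(total)
}

func main() {
	// Hypothetical counts gathered by the application.
	fmt.Println(errorRate(5, 200)) // 0.025
}
```

The result is a ratio, so it is not additive across hosts, which is why a gauge is the appropriate instrument for it.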

How to start using OpenTelemetry Metrics?

The easiest way to get started with metrics is to pick an OpenTelemetry backend and follow the documentation. Most vendors provide pre-configured OpenTelemetry distros that allow you to skip some steps and can significantly improve your experience.

Uptrace is an OpenTelemetry APM that supports distributed tracing, metrics, and logs. You can use it to monitor applications and troubleshoot issues.

Uptrace comes with an intuitive query builder, rich dashboards, alerting rules with notifications, and integrations for most languages and frameworks.

Uptrace can process billions of spans and metrics on a single server and allows you to monitor your applications at 10x lower cost.

In just a few minutes, you can try Uptrace by visiting the cloud demo (no login required) or running it locally with Docker. The source code is available on GitHub.

What's next?

Next, learn about OpenTelemetry Metrics API for your programming language:

Last Updated: 5/26/2024, 10:35:15 AM