OpenTelemetry Metrics [with examples]
OpenTelemetry Metrics is a standard for collecting, aggregating, and sending metrics to observability backends such as Uptrace or Prometheus.
While defining a new standard, OpenTelemetry also aims to work with existing metrics instrumentation protocols such as Prometheus and StatsD. Furthermore, OpenTelemetry Collector supports even more protocols like AWS Metrics, InfluxDB, Chrony, and others.
OpenTelemetry also allows you to correlate metrics and traces via exemplars, which provides a broader picture of your system's state.
Prerequisites
Before diving into OpenTelemetry Metrics, you should have a basic understanding of the following OpenTelemetry concepts:
Attributes: Key-value pairs that provide additional context about your measurements. For example, a request duration metric might include attributes like `http.method=GET` and `http.status_code=200`.
Resource: Represents the entity producing telemetry data, such as a service, host, or container. Resources are described by attributes like `service.name`, `service.version`, and `host.name`.
Meter: The entry point for creating instruments. A meter is associated with a library or service and is used to create all metric instruments for that component.
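For example, here is a minimal sketch of acquiring a meter from the globally registered provider; the instrumentation scope name is a placeholder:

```go
import "go.opentelemetry.io/otel"

// Acquire a meter from the global MeterProvider.
// "app_or_pkg_name" is a placeholder; use your own package or service name.
var meter = otel.Meter("app_or_pkg_name")
```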
What are metrics?
Metrics are numerical data points that represent the health and performance of your system, such as CPU utilization, network traffic, and database connections.
You can use metrics to measure, monitor, and compare performance. For example, you can measure server response time, memory utilization, error rate, and more.
Instruments
An instrument is a specific type of metric (e.g., counter, gauge, histogram) that you use to collect data about a particular aspect of your application's behavior.
You capture measurements by creating instruments that have:
- A unique name, for example, `http.server.duration`
- An instrument kind, for example, Histogram
- An optional unit of measure, for example, `milliseconds` or `bytes`
- An optional description
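For example, here is a sketch of creating a Histogram instrument with all of these properties, assuming a `meter` obtained as shown above:

```go
import "go.opentelemetry.io/otel/metric"

durationHistogram, err := meter.Int64Histogram(
	"http.server.duration",
	metric.WithUnit("milliseconds"),
	metric.WithDescription("Duration of inbound HTTP requests"),
)
if err != nil {
	panic(err)
}
```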
Timeseries
A single instrument can produce multiple timeseries. A timeseries is a metric with a unique set of attributes. For example, each host has a separate timeseries for the same metric name.
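For example, recording the same counter with different attribute values produces a separate timeseries per attribute set (a sketch; the metric name and attribute values are illustrative):

```go
import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

requestCounter, _ := meter.Int64Counter("some.prefix.requests")

// Each distinct attribute set becomes its own timeseries.
requestCounter.Add(ctx, 1, metric.WithAttributes(attribute.String("http.method", "GET")))
requestCounter.Add(ctx, 1, metric.WithAttributes(attribute.String("http.method", "POST")))
```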
Additive instruments
Additive or summable instruments produce timeseries that, when added together, produce another meaningful and accurate timeseries. Additive instruments that measure non-decreasing numbers are also called monotonic.
For example, `http.server.requests` is an additive timeseries because you can sum the number of requests from different hosts to get the total number of requests.
However, `system.memory.utilization` (percent) is not additive because the sum of memory utilization from different hosts is not meaningful (90% + 90% = 180%).
Synchronous instruments
Synchronous instruments are invoked together with the operations they are measuring. For example, to measure the number of requests, you can call `counter.Add(ctx, 1)` whenever there is a new request. Synchronous measurements can have an associated trace context.
For synchronous instruments, additive instruments produce summable timeseries, while grouping instruments produce a histogram.
Instrument | Properties | Aggregation | Example |
---|---|---|---|
Counter | monotonic | sum -> delta | number of requests, request size |
UpDownCounter | additive | last value -> sum | number of connections |
Histogram | grouping | histogram | request duration, request size |
Asynchronous instruments
Asynchronous instruments (observers) periodically invoke a callback function to collect measurements. For example, you can use observers to periodically measure memory or CPU usage. Asynchronous measurements cannot have an associated trace context.
When choosing between `UpDownCounterObserver` (additive) and `GaugeObserver` (grouping), choose `UpDownCounterObserver` for summable timeseries and `GaugeObserver` otherwise. For example, to measure `system.memory.usage` (bytes), you should use `UpDownCounterObserver`, but to measure `system.memory.utilization` (percent), you should use `GaugeObserver`.
Instrument Name | Properties | Aggregation | Example |
---|---|---|---|
CounterObserver | monotonic | sum -> delta | CPU time |
UpDownCounterObserver | additive | last value -> sum | Memory usage (bytes) |
GaugeObserver | grouping | last value -> none/avg | Memory utilization (%) |
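For example, here is a sketch that observes both memory metrics from a single callback; `readMemory` is a hypothetical helper that returns used bytes and utilization:

```go
import (
	"context"

	"go.opentelemetry.io/otel/metric"
)

memUsage, _ := meter.Int64ObservableUpDownCounter("system.memory.usage")
memUtil, _ := meter.Float64ObservableGauge("system.memory.utilization")

if _, err := meter.RegisterCallback(
	func(ctx context.Context, o metric.Observer) error {
		// readMemory is a hypothetical helper returning used bytes
		// and utilization as a fraction.
		usedBytes, utilization := readMemory()
		o.ObserveInt64(memUsage, usedBytes)    // summable across hosts
		o.ObserveFloat64(memUtil, utilization) // not summable
		return nil
	},
	memUsage, memUtil,
); err != nil {
	panic(err)
}
```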
Choosing instruments
- If you need a histogram, a heatmap, or percentiles, use Histogram.
- If you want to count something by recording a delta value:
  - If the value is monotonic, use Counter.
  - Otherwise, use UpDownCounter.
- If you want to measure something by recording an absolute value:
  - If the value is additive/summable:
    - If the value is monotonic, use CounterObserver.
    - Otherwise, use UpDownCounterObserver.
  - If the value is NOT additive/summable, use GaugeObserver.
Common Scenarios
The following table shows which instrument to use for common monitoring scenarios:
Scenario | Instrument | Rationale |
---|---|---|
HTTP requests count | Counter | Monotonic, additive - count increases over time |
Request duration/latency | Histogram | Need percentiles and distribution analysis |
Active database connections | UpDownCounter | Can increase/decrease, additive across instances |
CPU usage (%) | GaugeObserver | Non-additive - cannot sum percentages meaningfully |
Memory usage (bytes) | UpDownCounterObserver | Additive - can sum bytes across instances |
Queue size | UpDownCounter | Can increase/decrease as items are added/removed |
Error count | Counter | Monotonic - errors only accumulate over time |
Thread pool size | UpDownCounterObserver | Changes over time, additive across pools |
Cache hit ratio (%) | GaugeObserver | Non-additive percentage value |
Disk I/O operations | Counter | Monotonic count of operations |
Response size | Histogram | Need to analyze distribution of sizes |
Temperature readings | GaugeObserver | Non-additive current state measurement |
Network bytes sent | Counter | Monotonic, cumulative byte count |
Concurrent users | UpDownCounter | Users connect and disconnect over time |
Counter
`Counter` is a synchronous instrument that measures additive non-decreasing values, for example, the total number of:
- processed requests
- errors
- received bytes
- disk reads
Counters are used to measure the number of occurrences of an event or the accumulation of a value over time. They can only increase with time.
For `Counter` timeseries, backends usually compute deltas and display rate values, for example, `per_min(http.server.requests)` returns the number of processed requests per minute.
CounterObserver
CounterObserver is the asynchronous version of the Counter instrument.
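For example, here is a minimal sketch that periodically observes a monotonically growing value, the total number of bytes allocated by the Go runtime (the metric name is illustrative):

```go
import (
	"context"
	"runtime"

	"go.opentelemetry.io/otel/metric"
)

allocatedBytes, _ := meter.Int64ObservableCounter("some.prefix.allocated_bytes")

if _, err := meter.RegisterCallback(
	func(ctx context.Context, o metric.Observer) error {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// TotalAlloc only grows, so the timeseries is monotonic.
		o.ObserveInt64(allocatedBytes, int64(m.TotalAlloc))
		return nil
	},
	allocatedBytes,
); err != nil {
	panic(err)
}
```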
UpDownCounter
`UpDownCounter` is a synchronous instrument that measures additive values that can increase or decrease over time, for example, the number of:
- active requests
- open connections
- memory in use (megabytes)
For additive non-decreasing values, you should use Counter or CounterObserver.
For `UpDownCounter` timeseries, backends usually display the last value, but different timeseries can be added together. For example, `go.sql.connections_open` returns the total number of open connections, and `go.sql.connections_open{service.name = myservice}` returns the number of open connections for one service.
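For example, to track in-flight requests, you can increment the counter when a request starts and decrement it when the request finishes (a sketch; the metric name is illustrative):

```go
import "go.opentelemetry.io/otel/metric"

activeRequests, _ := meter.Int64UpDownCounter(
	"some.prefix.active_requests",
	metric.WithDescription("Number of in-flight requests"),
)

// When a request starts:
activeRequests.Add(ctx, 1)

// When the request finishes:
activeRequests.Add(ctx, -1)
```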
UpDownCounterObserver
`UpDownCounterObserver` is the asynchronous version of the UpDownCounter instrument.
Histogram
Histogram is a synchronous instrument that produces a histogram from recorded values, for example:
- request latency
- request size
Histograms are used to measure the distribution of values over time. For `Histogram` timeseries, backends usually display percentiles, heatmaps, and histograms.
GaugeObserver
`GaugeObserver` is an asynchronous instrument that measures non-additive values for which `sum` does not produce a meaningful or correct result, for example:
- error rate
- memory utilization
- cache hit rate
For `GaugeObserver` timeseries, backends usually display the last value and do not allow summing different timeseries together.
Metrics examples
Number of emails
To measure the number of sent emails, you can create a Counter instrument and increment it whenever an email is sent:
import "go.opentelemetry.io/otel/metric"
emailCounter, _ := meter.Int64Counter(
"some.prefix.emails",
metric.WithDescription("Number of sent emails"),
)
emailCounter.Add(ctx, 1)
Later, you can add more attributes to gather detailed statistics, for example:
- `kind = welcome` and `kind = reset_password` to measure different email types.
- `state = sent` and `state = bounced` to measure bounced emails.
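For example, a sketch of recording those attributes on each measurement:

```go
import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

emailCounter.Add(ctx, 1, metric.WithAttributes(
	attribute.String("kind", "welcome"),
	attribute.String("state", "sent"),
))
```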
Operation latency
To measure the latency of operations, you can create a Histogram instrument and update it synchronously with the operation:
import "go.opentelemetry.io/otel/metric"
opHistogram, _ := meter.Int64Histogram(
"some.prefix.duration",
metric.WithDescription("Duration of some operation"),
)
t1 := time.Now()
op(ctx)
dur := time.Since(t1)
opHistogram.Record(ctx, dur.Microseconds())
Cache hit rate
To measure cache statistics, you can create a CounterObserver and observe the values from a callback:
import "go.opentelemetry.io/otel/metric"
counter, _ := meter.Int64ObservableCounter("some.prefix.cache")
// Arbitrary key/value labels.
hits := []attribute.KeyValue{attribute.String("type", "hits")}
misses := []attribute.KeyValue{attribute.String("type", "misses")}
errors := []attribute.KeyValue{attribute.String("type", "errors")}
if _, err := meter.RegisterCallback(
func(ctx context.Context, o metric.Observer) error {
stats := cache.Stats()
o.ObserveInt64(counter, stats.Hits, metric.WithAttributes(hits...))
o.ObserveInt64(counter, stats.Misses, metric.WithAttributes(misses...))
o.ObserveInt64(counter, stats.Errors, metric.WithAttributes(errors...))
return nil
},
counter,
); err != nil {
panic(err)
}
See Monitoring cache stats using OpenTelemetry Go Metrics for details.
Error rate
To directly measure the error rate, you can create a GaugeObserver and observe the value without worrying about how it is calculated:
import "go.opentelemetry.io/otel/metric"
errorRate, _ := meter.Float64ObservableGauge("some.prefix.error_rate")
if _, err := meter.RegisterCallback(
func(ctx context.Context, o metric.Observer) error {
o.ObserveFloat64(errorRate, rand.Float64())
return nil
},
errorRate,
); err != nil {
panic(err)
}
Best Practices
Following these best practices will help you create effective, performant, and maintainable metrics instrumentation.
Naming Conventions
Use descriptive, hierarchical names: Metric names should clearly describe what is being measured and follow a hierarchical structure using dots as separators.
✅ Good: http.server.request.duration
✅ Good: database.connection.active
✅ Good: cache.operations.total
❌ Bad: requests
❌ Bad: db_conn
❌ Bad: cache_ops
Follow semantic conventions: When possible, use OpenTelemetry Semantic Conventions for consistency across applications and teams.
Use consistent units: Include units in metric names when they're not obvious, and be consistent across your application.
✅ Good: memory.usage.bytes
✅ Good: request.duration.milliseconds
✅ Good: network.throughput.bytes_per_second
❌ Bad: memory (unclear unit)
❌ Bad: latency (could be seconds, milliseconds, etc.)
Attribute Selection
Keep cardinality manageable: High-cardinality attributes (those with many unique values) can impact performance and storage costs. Avoid using unbounded values as attributes.
✅ Good attributes:
- http.method (limited values: GET, POST, etc.)
- http.status_code (limited range: 200, 404, 500, etc.)
- service.version (controlled releases)
❌ High-cardinality attributes to avoid:
- user.id (unbounded)
- request.id (unbounded)
- timestamp (unbounded)
- email.address (unbounded)
Use meaningful attribute names: Choose attribute names that are self-explanatory and follow consistent naming patterns.
✅ Good: {http.method: "GET", http.status_code: "200"}
❌ Bad: {method: "GET", code: "200"}
Prefer standardized attributes: Use well-known attribute names from semantic conventions when available.
Performance Considerations
Choose the right instrument type: Using the wrong instrument can impact both performance and the usefulness of your data.
```go
// ✅ Good: Use Counter for monotonic values
requestCounter.Add(ctx, 1)

// ❌ Bad: Using Histogram when you only need counts
requestHistogram.Record(ctx, 1) // Wastes resources on bucketing
```
Minimize synchronous instrument calls: Reduce the performance impact on your application's critical path.
```go
// ✅ Good: Batch measurements when possible
func processRequests(ctx context.Context, requests []Request) {
	start := time.Now()
	for _, req := range requests {
		processRequest(req)
	}
	// Single measurement for the batch
	batchDuration.Record(ctx, time.Since(start).Milliseconds())
	batchSize.Record(ctx, int64(len(requests)))
}

// ❌ Bad: Individual measurements for each item
func processRequests(ctx context.Context, requests []Request) {
	for _, req := range requests {
		start := time.Now()
		processRequest(req)
		requestDuration.Record(ctx, time.Since(start).Milliseconds())
	}
}
```
Use asynchronous instruments for expensive measurements: When collecting metrics requires expensive operations (like querying system resources), use observers.
```go
// ✅ Good: Asynchronous measurement of expensive operations
memoryGauge, _ := meter.Int64ObservableGauge("system.memory.usage")

meter.RegisterCallback(func(ctx context.Context, o metric.Observer) error {
	// This expensive call happens periodically, not on every request
	memStats := getMemoryStats()
	o.ObserveInt64(memoryGauge, memStats.Used)
	return nil
}, memoryGauge)
```
Control measurement frequency: Be mindful of how often metrics are collected, especially for high-frequency operations.
```go
// ✅ Good: Sample high-frequency events
if rand.Float64() < 0.01 { // Sample 1% of events
	detailedHistogram.Record(ctx, operationDuration)
}

// Always measure critical metrics
errorCounter.Add(ctx, 1)
```
Resource and Context Management
Reuse instruments: Create instruments once and reuse them throughout your application lifecycle.
```go
// ✅ Good: Create instruments at startup
var (
	requestCounter    metric.Int64Counter
	requestDuration   metric.Int64Histogram
	activeConnections metric.Int64UpDownCounter
)

func init() {
	requestCounter, _ = meter.Int64Counter("http.requests.total")
	requestDuration, _ = meter.Int64Histogram("http.request.duration")
	activeConnections, _ = meter.Int64UpDownCounter("http.connections.active")
}

// ❌ Bad: Creating instruments repeatedly
func handleRequest(ctx context.Context) {
	counter, _ := meter.Int64Counter("http.requests.total") // Expensive!
	counter.Add(ctx, 1)
}
```
Use appropriate context: Pass relevant context to measurements to enable correlation with traces.
```go
// ✅ Good: Use request context for correlation
func handleRequest(ctx context.Context) {
	requestCounter.Add(ctx, 1) // Can be correlated with trace
}

// ❌ Bad: Using background context loses correlation
func handleRequest(ctx context.Context) {
	requestCounter.Add(context.Background(), 1) // No trace correlation
}
```
Aggregation and Analysis Considerations
Design for your analysis needs: Consider how you'll use the metrics when choosing instruments and attributes.
```go
// ✅ Good: Structure for useful aggregation
requestDuration.Record(ctx, duration,
	metric.WithAttributes(
		attribute.String("http.method", method),
		attribute.String("http.route", route), // Not full path
		attribute.String("service.version", version),
	))

// This allows queries like:
// - Average latency by HTTP method
// - 95th percentile by service version
// - Error rate by route pattern
```
Balance detail with utility: More attributes provide more insight but increase complexity and resource usage.
```go
// ✅ Good: Essential attributes for analysis
attribute.String("environment", env),         // prod, staging, dev
attribute.String("service.version", version), // v1.2.3
attribute.String("http.method", method),      // GET, POST
attribute.String("http.route", route),        // /users/{id}, not /users/123

// ❌ Too detailed: Creates explosion of timeseries
attribute.String("user.id", userID),        // High cardinality
attribute.String("request.id", requestID),  // Unique per request
attribute.String("http.url.full", fullURL), // High cardinality
```
How to start using OpenTelemetry Metrics?
The easiest way to get started with metrics is to pick an OpenTelemetry backend and follow the documentation. Most vendors provide pre-configured OpenTelemetry distros that allow you to skip some steps and can significantly improve your experience.
Uptrace is an OpenTelemetry APM that supports distributed tracing, metrics, and logs. You can use it to monitor applications and troubleshoot issues.
Uptrace comes with an intuitive query builder, rich dashboards, alerting rules with notifications, and integrations for most languages and frameworks.
Uptrace can process billions of spans and metrics on a single server and allows you to monitor your applications at 10x lower cost.
In just a few minutes, you can try Uptrace by visiting the cloud demo (no login required) or running it locally with Docker. The source code is available on GitHub.
What's next?
Next, learn about OpenTelemetry Metrics API for your programming language: