OpenTelemetry Metrics [with examples]

OpenTelemetry Metrics is a standard for collecting, aggregating, and sending metrics to observability backends such as Uptrace or Prometheus.

While defining a new standard, OpenTelemetry also aims to interoperate with existing metrics protocols such as Prometheus and StatsD. Furthermore, the OpenTelemetry Collector supports even more sources, with receivers for AWS CloudWatch, InfluxDB, chrony, and others.

OpenTelemetry also allows you to correlate metrics and traces via exemplars, which provides a broader picture of your system's state.

Metric Instruments

| Instrument | Type | When to Use | Example Use Case | Properties |
|---|---|---|---|---|
| Counter | Synchronous | Counting events that only increase | HTTP requests, bytes sent | Monotonic, Additive |
| UpDownCounter | Synchronous | Values that go up and down | Active connections, queue size | Additive |
| Histogram | Synchronous | Measuring distributions | Request latency, payload size | Grouping |
| CounterObserver | Async | Monotonic values collected on demand | CPU time, total disk reads | Monotonic, Additive |
| UpDownCounterObserver | Async | Up/down values collected on demand | Memory usage (bytes) | Additive |
| GaugeObserver | Async | Non-additive values collected on call | CPU utilization (%), error rate | Grouping |

Choose synchronous instruments (Counter, UpDownCounter, Histogram) when you're measuring operations as they happen. Choose asynchronous observers when you need to poll values periodically (like system metrics).

Prerequisites

To work effectively with OpenTelemetry Metrics, make sure you're familiar with a few foundational concepts:

Attributes: Key-value pairs that provide additional context about your measurements. For example, a request duration metric might include attributes like http.method=GET and http.status_code=200.

Resource: Represents the entity producing telemetry data, such as a service, host, or container. Resources are described by attributes like service.name, service.version, and host.name.

Meter: The entry point for creating instruments. A meter is associated with a library or service and is used to create all metric instruments for that component.
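The sketch below ties these concepts together in Go. The service name checkout, the meter name, and the instrument name are placeholders, and no exporter is configured, so it only illustrates how the pieces fit:

go
package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
    "go.opentelemetry.io/otel/sdk/resource"
)

func main() {
    ctx := context.Background()

    // Resource: describes the entity producing the telemetry.
    res := resource.NewWithAttributes("",
        attribute.String("service.name", "checkout"),
        attribute.String("service.version", "1.2.3"),
    )

    // The MeterProvider attaches the resource to every meter it creates.
    provider := sdkmetric.NewMeterProvider(sdkmetric.WithResource(res))
    otel.SetMeterProvider(provider)

    // Meter: the entry point for creating instruments.
    meter := otel.Meter("checkout")

    // Attributes: extra context recorded with each measurement.
    requests, _ := meter.Int64Counter("http.server.requests")
    requests.Add(ctx, 1, metric.WithAttributes(
        attribute.String("http.method", "GET"),
        attribute.Int("http.status_code", 200),
    ))
}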

What are metrics?

Metrics are numerical data points that represent the health and performance of your system, such as CPU utilization, network traffic, and database connections.

You can use metrics to measure, monitor, and compare performance. For example, you can measure server response time, memory utilization, error rate, and more.

Instruments

An instrument is a specific type of metric (e.g., counter, gauge, histogram) that you use to collect data about a particular aspect of your application's behavior.

You capture measurements by creating instruments that have:

  • A unique name, for example, http.server.duration
  • An instrument kind, for example, Histogram
  • An optional unit of measure, for example, milliseconds or bytes
  • An optional description

Timeseries

A single instrument can produce multiple timeseries. A timeseries is a metric with a unique set of attributes. For example, each host has a separate timeseries for the same metric name.
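For example, the following Go sketch creates a single histogram instrument with a name, kind, unit, and description, and records values with different host.name attributes; each distinct host.name value becomes its own timeseries (the host names are placeholders):

go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("http-server")

// Unique name, Histogram kind, unit of measure, and description.
var duration, _ = meter.Int64Histogram("http.server.duration",
    metric.WithUnit("ms"),
    metric.WithDescription("HTTP request duration"),
)

func recordDuration(ctx context.Context, host string, ms int64) {
    // The same instrument yields a separate timeseries for every
    // distinct host.name value (e.g. "host-1", "host-2").
    duration.Record(ctx, ms, metric.WithAttributes(
        attribute.String("host.name", host),
    ))
}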

Additive instruments

Additive or summable instruments produce timeseries that, when added together, produce another meaningful and accurate timeseries. Additive instruments that measure non-decreasing numbers are also called monotonic.

For example, http.server.requests is an additive timeseries because you can sum the number of requests from different hosts to get the total number of requests.

However, system.memory.utilization (percent) is not additive because the sum of memory utilization from different hosts is not meaningful (90% + 90% = 180%).

Synchronous instruments

Synchronous instruments are invoked together with the operations they are measuring. For example, to measure the number of requests, you can call counter.Add(ctx, 1) whenever there is a new request. Synchronous measurements can have an associated trace context.

For synchronous instruments, the difference between additive and grouping instruments is that additive instruments produce summable timeseries and grouping instruments produce a histogram.

| Instrument | Properties | Aggregation | Example |
|---|---|---|---|
| Counter | monotonic | sum -> delta | number of requests, request size |
| UpDownCounter | additive | last value -> sum | number of connections |
| Histogram | grouping | histogram | request duration, request size |

Asynchronous instruments

Asynchronous instruments (observers) periodically invoke a callback function to collect measurements. For example, you can use observers to periodically measure memory or CPU usage. Asynchronous measurements cannot have an associated trace context.

When choosing between UpDownCounterObserver (additive) and GaugeObserver (grouping), choose UpDownCounterObserver for summable timeseries and GaugeObserver otherwise. For example, to measure system.memory.usage (bytes), you should use UpDownCounterObserver. But to measure system.memory.utilization (percent), you should use GaugeObserver.

| Instrument Name | Properties | Aggregation | Example |
|---|---|---|---|
| CounterObserver | monotonic | sum -> delta | CPU time |
| UpDownCounterObserver | additive | last value -> sum | Memory usage (bytes) |
| GaugeObserver | grouping | last value -> none/avg | Memory utilization (%) |
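To make this choice concrete, here is a Go sketch that registers both kinds of observers; memoryUsedBytes and memoryUtilization are hypothetical helpers standing in for real system calls:

go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/metric"
)

// Hypothetical helpers standing in for real system stats.
func memoryUsedBytes() int64     { return 512 << 20 }
func memoryUtilization() float64 { return 0.42 }

func registerMemoryObservers() error {
    meter := otel.Meter("system-stats")

    // Summable across hosts (bytes) -> UpDownCounterObserver.
    memUsage, err := meter.Int64ObservableUpDownCounter("system.memory.usage",
        metric.WithUnit("By"))
    if err != nil {
        return err
    }

    // Not summable (a ratio) -> GaugeObserver.
    memUtil, err := meter.Float64ObservableGauge("system.memory.utilization",
        metric.WithUnit("1"))
    if err != nil {
        return err
    }

    // The SDK invokes the callback periodically; your code does not call it.
    _, err = meter.RegisterCallback(
        func(ctx context.Context, o metric.Observer) error {
            o.ObserveInt64(memUsage, memoryUsedBytes())
            o.ObserveFloat64(memUtil, memoryUtilization())
            return nil
        },
        memUsage, memUtil,
    )
    return err
}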

Choosing instruments

  1. If you need a histogram, a heatmap, or percentiles, use Histogram.
  2. If you want to count something by recording a delta value:
     • If the value is monotonic (it only increases), use Counter.
     • Otherwise, use UpDownCounter.
  3. If you want to measure something by recording an absolute value in a callback:
     • If the value is additive and monotonic, use CounterObserver.
     • If the value is additive but can decrease, use UpDownCounterObserver.
     • If the value is not additive, use GaugeObserver.

Common Scenarios

The following table shows which instrument to use for common monitoring scenarios:

| Scenario | Instrument | Rationale |
|---|---|---|
| HTTP requests count | Counter | Monotonic, additive - count increases over time |
| Request duration/latency | Histogram | Need percentiles and distribution analysis |
| Active database connections | UpDownCounter | Can increase/decrease, additive across instances |
| CPU usage (%) | GaugeObserver | Non-additive - cannot sum percentages meaningfully |
| Memory usage (bytes) | UpDownCounterObserver | Additive - can sum bytes across instances |
| Queue size | UpDownCounter | Can increase/decrease as items are added/removed |
| Error count | Counter | Monotonic - errors only accumulate over time |
| Thread pool size | UpDownCounterObserver | Changes over time, additive across pools |
| Cache hit ratio (%) | GaugeObserver | Non-additive percentage value |
| Disk I/O operations | Counter | Monotonic count of operations |
| Response size | Histogram | Need to analyze distribution of sizes |
| Temperature readings | GaugeObserver | Non-additive current state measurement |
| Network bytes sent | Counter | Monotonic, cumulative byte count |
| Concurrent users | UpDownCounter | Users connect and disconnect over time |

Counter

Counter is a synchronous instrument that measures additive non-decreasing values, for example, the total number of:

  • processed requests
  • errors
  • received bytes
  • disk reads

Counters are used to measure the number of occurrences of an event or the accumulation of a value over time. They can only increase with time.

For Counter timeseries, backends usually compute deltas and display rate values, for example, per_min(http.server.requests) returns the number of processed requests per minute.

CounterObserver

CounterObserver is the asynchronous version of the Counter instrument.
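For example, a monotonically growing total that you can only read on demand, such as the number of disk reads reported by the operating system, fits CounterObserver. A minimal Go sketch, with totalDiskReads as a hypothetical stub:

go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/metric"
)

// Hypothetical stub; a real implementation would query the OS.
func totalDiskReads() int64 { return 0 }

func registerDiskReads() error {
    meter := otel.Meter("system-stats")

    diskReads, err := meter.Int64ObservableCounter("system.disk.reads")
    if err != nil {
        return err
    }

    // The callback runs periodically, not once per operation.
    _, err = meter.RegisterCallback(
        func(ctx context.Context, o metric.Observer) error {
            o.ObserveInt64(diskReads, totalDiskReads())
            return nil
        },
        diskReads,
    )
    return err
}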

UpDownCounter

UpDownCounter is a synchronous instrument that measures additive values that can increase or decrease over time, for example, the number of:

  • active requests
  • open connections
  • memory in use (megabytes)

For additive non-decreasing values, you should use Counter or CounterObserver.

For UpDownCounter timeseries, backends usually display the last value, but different timeseries can be added together. For example, go.sql.connections_open returns the total number of open connections and go.sql.connections_open{service.name = myservice} returns the number of open connections for one service.
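A minimal Go sketch: increment the counter when a connection is opened and decrement it when the connection is closed (the meter name and callbacks are placeholders):

go
import (
    "context"

    "go.opentelemetry.io/otel"
)

var meter = otel.Meter("db-pool")

var openConns, _ = meter.Int64UpDownCounter("go.sql.connections_open")

// +1 when a connection is opened.
func onConnOpened(ctx context.Context) { openConns.Add(ctx, 1) }

// -1 when a connection is closed.
func onConnClosed(ctx context.Context) { openConns.Add(ctx, -1) }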

UpDownCounterObserver

UpDownCounterObserver is the asynchronous version of the UpDownCounter instrument.

Histogram

Histogram is a synchronous instrument that produces a histogram from recorded values, for example:

  • request latency
  • request size

Histograms are used to measure the distribution of values over time. For Histogram timeseries, backends usually display percentiles, heatmaps, and histograms.

GaugeObserver

GaugeObserver is an asynchronous instrument that measures non-additive values for which sum does not produce a meaningful or correct result, for example:

  • error rate
  • memory utilization
  • cache hit rate

For GaugeObserver timeseries, backends usually display the last value and do not allow summing different timeseries together.

Aggregation Strategies

OpenTelemetry processes raw measurements using aggregation to create meaningful metrics. Understanding aggregation is crucial for choosing the right instrument and interpreting your metrics correctly.

Aggregation Types

Different instruments use different default aggregations:

| Instrument | Default Aggregation | Temporality | What it means |
|---|---|---|---|
| Counter | Sum | Delta | Change since last collection |
| UpDownCounter | Sum | Cumulative | Current total value |
| Histogram | Explicit Histogram | Delta | Distribution of values |
| CounterObserver | Sum | Delta | Change since last observation |
| UpDownCounterObserver | Sum | Cumulative | Current absolute value |
| GaugeObserver | Last Value | N/A | Most recent measurement |

Sum Aggregation

Sum aggregation adds all measurements together for a reporting period.

Delta temporality: Reports the change since the last collection. Used for counters where you want to see rate of change.

text
Collection 1: 100 requests → reports 100
Collection 2: 150 requests → reports 50 (delta)
Collection 3: 200 requests → reports 50 (delta)

Used to calculate: requests per second, error rate

Cumulative temporality: Reports the total value since process start. Some backends prefer this format.

text
Collection 1: 100 requests → reports 100
Collection 2: 150 requests → reports 150 (cumulative)
Collection 3: 200 requests → reports 200 (cumulative)

Backend calculates rate by taking derivatives

Last Value Aggregation

Records only the most recent measurement. Used for gauges and up-down counters where the current value matters more than the sum.

text
Collection 1: 50 connections → reports 50
Collection 2: 45 connections → reports 45
Collection 3: 52 connections → reports 52

Shows: current state of the system

Histogram Aggregation

Distributes measurements into configurable buckets to enable percentile calculations.

Default buckets (general-purpose):

text
[0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000, ∞]

How it works:

Given measurements: [23, 45, 67, 89, 120, 340, 890]

Results in bucket counts:

text
[0-5):      0
[5-10):     0
[10-25):    1  (23)
[25-50):    1  (45)
[50-75):    1  (67)
[75-100):   1  (89)
[100-250):  1  (120)
[250-500):  1  (340)
[500-750):  0
[750-1000): 1  (890)
[1000-∞):   0  (remaining buckets are empty)

From this, backends can estimate:

  • p50 (median): ~89ms
  • p95: ~890ms
  • p99: ~890ms
  • Average: ~225ms

Exponential Histogram Aggregation

More efficient alternative to explicit histograms with automatic bucket sizing. Provides better accuracy with fewer buckets.

Benefits:

  • Automatically adjusts to value ranges
  • More precise for percentile calculations
  • Lower storage overhead
  • Supported by newer backends

Not all backends support exponential histograms yet. Check your backend compatibility.
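If your backend does support them, the Go SDK lets you opt an instrument into exponential buckets with a View (Views are covered in the next section); a sketch using the go.opentelemetry.io/otel/sdk/metric package and assuming an http.server.duration histogram, with MaxSize and MaxScale set to their usual defaults:

go
provider := metric.NewMeterProvider(
    metric.WithView(
        metric.NewView(
            metric.Instrument{Name: "http.server.duration"},
            metric.Stream{
                // Base-2 exponential buckets, sized automatically by the SDK.
                Aggregation: metric.AggregationBase2ExponentialHistogram{
                    MaxSize:  160,
                    MaxScale: 20,
                },
            },
        ),
    ),
)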

Views

Views allow you to customize how metrics are aggregated and exported without changing your instrumentation code. They're powerful tools for controlling metric cardinality, changing aggregations, and filtering data.

Changing Histogram Buckets

Customize histogram buckets for your specific use case:

Go:
import (
    "go.opentelemetry.io/otel/sdk/metric"
)

// Create custom buckets for response times in milliseconds
customBuckets := []float64{
    0, 10, 25, 50, 75, 100, 150, 200, 300, 500, 1000, 2000, 5000,
}

// Apply to specific histogram
provider := metric.NewMeterProvider(
    metric.WithView(
        metric.NewView(
            metric.Instrument{
                Name: "http.server.duration",
                Kind: metric.InstrumentKindHistogram,
            },
            metric.Stream{
                Aggregation: metric.AggregationExplicitBucketHistogram{
                    Boundaries: customBuckets,
                },
            },
        ),
    ),
)
Python:

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View, ExplicitBucketHistogramAggregation

# Custom buckets for latency
custom_buckets = [0, 10, 25, 50, 75, 100, 150, 200, 300, 500, 1000, 2000, 5000]

provider = MeterProvider(
    views=[
        View(
            instrument_name="http.server.duration",
            aggregation=ExplicitBucketHistogramAggregation(
                boundaries=custom_buckets
            ),
        ),
    ],
)
Java:

import java.util.Arrays;
import java.util.List;

import io.opentelemetry.sdk.metrics.Aggregation;
import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;

List<Double> customBuckets = Arrays.asList(
    0.0, 10.0, 25.0, 50.0, 75.0, 100.0, 150.0, 200.0, 300.0, 500.0,
    1000.0, 2000.0, 5000.0
);

SdkMeterProvider provider = SdkMeterProvider.builder()
    .registerView(
        InstrumentSelector.builder()
            .setName("http.server.duration")
            .build(),
        View.builder()
            .setAggregation(
                Aggregation.explicitBucketHistogram(customBuckets)
            )
            .build()
    )
    .build();

Filtering Attributes

Reduce cardinality by limiting which attributes are recorded:

Go:

import (
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/sdk/metric"
)

// Only keep specific attributes
provider := metric.NewMeterProvider(
    metric.WithView(
        metric.NewView(
            metric.Instrument{Name: "http.server.duration"},
            metric.Stream{
                // Keep http.method and http.route; drop everything else,
                // including high-cardinality attributes like http.url.
                AttributeFilter: func(kv attribute.KeyValue) bool {
                    return kv.Key == "http.method" || kv.Key == "http.route"
                },
            },
        ),
    ),
)
Python:
from opentelemetry.sdk.metrics.view import View

# Drop specific high-cardinality attributes
provider = MeterProvider(
    views=[
        View(
            instrument_name="http.server.requests",
            attribute_keys={"http.method", "http.route"},  # Only keep these
        ),
    ],
)

Changing Aggregation

Convert a histogram to a counter or change aggregation type:

Go:

go
// Convert histogram to sum for high-frequency metrics
provider := metric.NewMeterProvider(
    metric.WithView(
        metric.NewView(
            metric.Instrument{Name: "request.size"},
            metric.Stream{
                Aggregation: metric.AggregationSum{},
            },
        ),
    ),
)

Dropping Metrics

Completely exclude metrics you don't need:

Go:
// Drop internal metrics
provider := metric.NewMeterProvider(
    metric.WithView(
        metric.NewView(
            metric.Instrument{Name: "internal.*"},  // Wildcard pattern
            metric.Stream{
                Aggregation: metric.AggregationDrop{},
            },
        ),
    ),
)
Python:

from opentelemetry.sdk.metrics.view import View, DropAggregation

provider = MeterProvider(
    views=[
        View(
            instrument_name="internal.*",
            aggregation=DropAggregation(),
        ),
    ],
)

Cardinality Management

Cardinality refers to the number of unique timeseries created by your metrics. High cardinality can cause performance issues, increased storage costs, and query slowdowns.

What Causes High Cardinality

Each unique combination of metric name and attribute values creates a separate timeseries:

go
// Low cardinality (good) - ~20 timeseries
// 4 methods × 5 status codes = 20 timeseries
requestCounter.Add(ctx, 1, metric.WithAttributes(
    attribute.String("http.method", "GET"),      // 4 values: GET, POST, PUT, DELETE
    attribute.Int("http.status_code", 200),      // 5 values: 200, 201, 400, 404, 500
))

// High cardinality (bad) - thousands of timeseries
// Each user × each endpoint = potentially millions
requestCounter.Add(ctx, 1, metric.WithAttributes(
    attribute.String("user.id", userID),         // ❌ Unbounded
    attribute.String("http.url", fullURL),       // ❌ Unbounded
    attribute.String("request.id", requestID),   // ❌ Unique every time
))

Calculating Cardinality

For a metric with multiple attributes:

text
Cardinality = (# unique values of attr1) × (# unique values of attr2) × ...

Examples:
- http.method (4) × http.route (20) = 80 timeseries ✅ Good
- http.method (4) × http.route (20) × user.id (10000) = 800,000 timeseries ❌ Too high

Cardinality Control Techniques

1. Bucket High-Cardinality Values

go
// ❌ Bad: HTTP status code as-is creates many timeseries
attribute.Int("http.status_code", statusCode)  // Could be 100, 101, 200, 201, etc.

// ✅ Good: Bucket into categories
func statusBucket(code int) string {
    switch {
    case code < 200: return "1xx"
    case code < 300: return "2xx"
    case code < 400: return "3xx"
    case code < 500: return "4xx"
    default: return "5xx"
    }
}

attribute.String("http.status_class", statusBucket(statusCode))  // Only 5 values

2. Use Route Patterns, Not Full URLs

go
// ❌ Bad: Full URL path
attribute.String("http.path", "/users/12345/orders/67890")  // Unbounded

// ✅ Good: Route pattern
attribute.String("http.route", "/users/:id/orders/:orderId")  // Limited values

3. Limit Attribute Values

go
// ❌ Bad: Error messages (unbounded unique values)
attribute.String("error.message", err.Error())

// ✅ Good: Error type (limited set)
attribute.String("error.type", "ValidationError")

4. Use Sampling for High-Cardinality Scenarios

go
// Sample detailed metrics, always record aggregates
if shouldSample() {  // Sample 1% of requests
    detailedCounter.Add(ctx, 1, metric.WithAttributes(
        attribute.String("endpoint", endpoint),
        attribute.String("customer.tier", tier),
    ))
}

// Always record totals
totalCounter.Add(ctx, 1)

5. Use Views to Drop Attributes

go
// Use Views (shown earlier) to filter out high-cardinality attributes
// before they reach the backend

Cardinality Budget Example

Plan your metric cardinality budget:

text
Metric: http.server.duration
Attributes:
- service.name: 10 services
- http.method: 4 methods (GET, POST, PUT, DELETE)
- http.route: 50 endpoints
- http.status_class: 5 classes (1xx, 2xx, 3xx, 4xx, 5xx)

Cardinality: 10 × 4 × 50 × 5 = 10,000 timeseries ✅ Acceptable

Adding user.id (100,000 users):
10 × 4 × 50 × 5 × 100,000 = 1 billion timeseries ❌ Too high!

Rule of thumb: Keep individual metric cardinality under 10,000 timeseries.

Cost Optimization

Metrics can become expensive at scale. Here are strategies to control costs while maintaining observability.

1. Reduce Collection Frequency

Not all metrics need sub-second precision:

go
// High-frequency business metrics (every 10s)
reader := metric.NewPeriodicReader(
    exporter,
    metric.WithInterval(10 * time.Second),
)

// Low-frequency infrastructure metrics (every 60s)
infraReader := metric.NewPeriodicReader(
    infraExporter,
    metric.WithInterval(60 * time.Second),
)

provider := metric.NewMeterProvider(
    metric.WithReader(reader),
    metric.WithReader(infraReader),
)

2. Use Delta Temporality

Delta temporality reduces data transfer and storage:

go
// Delta sends less data per collection.
// With OTLP, temporality is selected on the exporter:
// import "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
exporter, _ := otlpmetricgrpc.New(ctx,
    otlpmetricgrpc.WithTemporalitySelector(
        func(ik metric.InstrumentKind) metricdata.Temporality {
            return metricdata.DeltaTemporality
        },
    ),
)

reader := metric.NewPeriodicReader(exporter)

Data volume comparison:

text
Cumulative: [100, 250, 500, 850, 1200, 1600, 2100, ...]  (growing)
Delta:      [100,  150, 250, 350,  350,  400,  500, ...]  (bounded)

3. Strategic Sampling

Sample high-volume, low-value metrics:

go
var requestCount int64

func recordMetric() {
    count := atomic.AddInt64(&requestCount, 1)

    // Only export every 100th request
    if count % 100 == 0 {
        batchCounter.Add(ctx, 100)  // Record batch
    }
}

4. Optimize Histogram Buckets

Fewer buckets = less storage:

go
// ❌ Too many buckets (default: 16 buckets)
// Default: [0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000, ∞]

// ✅ Optimized for your use case (8 buckets)
customBuckets := []float64{0, 10, 50, 100, 250, 500, 1000, 5000}

5. Use Exemplars Sparingly

Exemplars link metrics to traces but increase data volume:

go
// Only record exemplars for errors or slow requests
if duration > threshold || isError {
    // Exemplar with trace context attached
    histogram.Record(ctx, duration)
} else {
    // No exemplar for normal requests
    histogram.Record(context.Background(), duration)
}

6. Drop Unnecessary Metrics

Use Views to exclude metrics you don't use:

go
provider := metric.NewMeterProvider(
    metric.WithView(
        metric.NewView(
            metric.Instrument{Name: "debug.*"},
            metric.Stream{Aggregation: metric.AggregationDrop{}},
        ),
    ),
)

7. Aggregate in Collector

Use OpenTelemetry Collector to pre-aggregate before sending to backend:

yaml
processors:
  batch:
    timeout: 60s
    send_batch_size: 10000

  # Aggregate metrics to reduce cardinality
  metricstransform:
    transforms:
      - include: http.server.duration
        action: update
        operations:
          - action: aggregate_labels
            label_set: [http.method, http.route]
            aggregation_type: sum

Cost Impact Example

Before optimization:

text
Metric: request.count
Cardinality: 100,000 timeseries
Collection: 10s intervals
Data points/day: 100,000 × (86,400s / 10s) = 864M data points/day
Monthly cost: $2,500 (example)

After optimization:

text
1. Reduce cardinality: 100,000 → 5,000 timeseries (-95%)
2. Increase interval: 10s → 30s (-67%)
3. Use delta temporality: -30% data size

New data points/day: 5,000 × (86,400s / 30s) × 0.7 = 10M data points/day
Monthly cost: $75 (example)
Savings: 97%

Exemplars

Exemplars are example data points that link metrics to traces, providing context for metric values by attaching trace information. When you see a latency spike in your metrics, exemplars let you jump directly to actual traces that contributed to that spike.

How Exemplars Work

When recording a metric measurement, OpenTelemetry can attach the current trace context as an exemplar. This creates a bidirectional link between your metrics and traces:

  • Metrics → Traces: Click on a metric spike to see actual request traces
  • Traces → Metrics: View aggregated metrics for specific trace patterns

Exemplars are particularly useful for:

  • Debugging latency spikes: Find which specific requests caused high p99 latency
  • Investigating error rates: Jump from error count metrics to actual error traces
  • Understanding outliers: Examine why certain requests behaved differently

When to Use Exemplars

Exemplars increase data volume, so use them strategically:

✅ Good candidates for exemplars:

  • Latency histograms (find slow requests)
  • Error counters (link to failing traces)
  • High-value business metrics (orders, payments)

❌ Skip exemplars for:

  • High-frequency metrics (> 1000/sec per service)
  • Low-priority infrastructure metrics
  • Metrics without corresponding traces

Exemplar Examples

Go:

import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
)

var (
    meter = otel.Meter("my-service")
    requestDuration, _ = meter.Int64Histogram(
        "http.server.request.duration",
        metric.WithDescription("HTTP request duration in milliseconds"),
        metric.WithUnit("ms"),
    )
)

func handleRequest(ctx context.Context) {
    start := time.Now()

    // Your request handling logic
    processRequest(ctx)

    duration := time.Since(start).Milliseconds()

    // Record metric with trace context
    // Exemplar is automatically attached from ctx if tracing is active
    requestDuration.Record(ctx, duration,
        metric.WithAttributes(
            attribute.String("http.method", "GET"),
            attribute.String("http.route", "/api/users"),
        ),
    )
}

// Conditional exemplars: only for slow requests or errors
func handleRequestSelective(ctx context.Context) {
    start := time.Now()
    err := processRequest(ctx)
    duration := time.Since(start).Milliseconds()

    // Only attach exemplar for slow requests (>1s) or errors
    if duration > 1000 || err != nil {
        // Use context with trace - exemplar will be attached
        requestDuration.Record(ctx, duration)
    } else {
        // No exemplar for fast successful requests
        requestDuration.Record(context.Background(), duration)
    }
}
Python:

from opentelemetry import metrics
import time

# Setup meter
meter = metrics.get_meter(__name__)

# Create histogram instrument
request_duration = meter.create_histogram(
    name="http.server.request.duration",
    description="HTTP request duration in milliseconds",
    unit="ms",
)

def handle_request():
    # The exemplar is taken automatically from the active span context.

    start = time.time()
    process_request()
    duration_ms = (time.time() - start) * 1000

    # Record metric - exemplar automatically attached from context
    request_duration.record(
        duration_ms,
        attributes={
            "http.method": "GET",
            "http.route": "/api/users",
        }
    )

# Selective exemplars: only for anomalies
def handle_request_selective():
    start = time.time()
    error = process_request()
    duration_ms = (time.time() - start) * 1000

    # Only attach exemplar for slow requests or errors
    if duration_ms > 1000 or error:
        # Record with trace context
        request_duration.record(
            duration_ms,
            attributes={
                "http.method": "GET",
                "http.route": "/api/users",
                "error": str(bool(error)),
            }
        )
    else:
        # Record without trace context (no exemplar)
        request_duration.record(duration_ms)
Java:

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongHistogram;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.context.Context;

public class MetricsExample {
    // "openTelemetry" is an already-initialized OpenTelemetry instance.
    private static final Meter meter = openTelemetry.getMeter("my-service");

    private static final LongHistogram requestDuration = meter
        .histogramBuilder("http.server.request.duration")
        .setDescription("HTTP request duration in milliseconds")
        .setUnit("ms")
        .ofLongs()
        .build();

    public void handleRequest(Context context) {
        long start = System.currentTimeMillis();

        processRequest(context);

        long duration = System.currentTimeMillis() - start;

        // Record metric with current context
        // Exemplar automatically attached from context if tracing is active
        requestDuration.record(
            duration,
            Attributes.of(
                AttributeKey.stringKey("http.method"), "GET",
                AttributeKey.stringKey("http.route"), "/api/users"
            ),
            context
        );
    }

    // Conditional exemplars: only for slow requests or errors
    public void handleRequestSelective(Context context) {
        long start = System.currentTimeMillis();
        boolean hasError = processRequest(context);
        long duration = System.currentTimeMillis() - start;

        // Only attach exemplar for slow requests (>1000ms) or errors
        Context recordContext = (duration > 1000 || hasError)
            ? context
            : Context.root(); // No trace context = no exemplar

        requestDuration.record(
            duration,
            Attributes.of(
                AttributeKey.stringKey("http.method"), "GET",
                AttributeKey.stringKey("http.route"), "/api/users"
            ),
            recordContext
        );
    }
}

Exemplar Configuration

Control exemplar behavior through your metric reader configuration:

Go:

go
import (
    "go.opentelemetry.io/otel/sdk/metric"
)

// Exemplars are enabled by default; they are sampled from measurements
// recorded while a sampled span is active in the context.
reader := metric.NewPeriodicReader(
    exporter,
    metric.WithInterval(30 * time.Second),
)

// To disable exemplars entirely, set OTEL_METRICS_EXEMPLAR_FILTER=always_off.
provider := metric.NewMeterProvider(
    metric.WithReader(reader),
)

Python:

python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Exemplars enabled by default
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=30000)
provider = MeterProvider(metric_readers=[reader])

Viewing Exemplars

Not all backends support exemplars yet. As of 2025, exemplar support includes:

  • Uptrace: Full exemplar support with click-through to traces ✅
  • Grafana: Supports exemplars in Prometheus and Tempo ✅
  • Prometheus: Exemplar support added in v2.26+ ✅
  • Jaeger: No direct exemplar support ❌

When viewing metrics with exemplars in supported tools, you'll see individual data points overlaid on your metric graphs. Clicking an exemplar jumps directly to the associated trace.

Metrics examples

Number of emails

To measure the number of sent emails, you can create a Counter instrument and increment it whenever an email is sent:

Go:
import "go.opentelemetry.io/otel/metric"

emailCounter, _ := meter.Int64Counter(
    "some.prefix.emails",
    metric.WithDescription("Number of sent emails"),
)

emailCounter.Add(ctx, 1)
Python:
from opentelemetry import metrics

meter = metrics.get_meter(__name__)

email_counter = meter.create_counter(
    "some.prefix.emails",
    description="Number of sent emails",
)

email_counter.add(1)
Node.js:
const { metrics } = require('@opentelemetry/api');

const meter = metrics.getMeter('my-service');

const emailCounter = meter.createCounter('some.prefix.emails', {
  description: 'Number of sent emails',
});

emailCounter.add(1);

Later, you can add more attributes to gather detailed statistics (see the Go sketch after this list), for example:

  • kind = welcome and kind = reset_password to measure different email types.
  • state = sent and state = bounced to measure bounced emails.
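
Continuing the Go example above (and assuming the attribute and metric packages are imported), the attributes are attached at the call site; the values shown are just the ones listed:

go
emailCounter.Add(ctx, 1, metric.WithAttributes(
    attribute.String("kind", "welcome"),
    attribute.String("state", "sent"),
))

Keep in mind that each unique kind/state combination becomes its own timeseries, as discussed in the cardinality section above.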

Operation latency

To measure the latency of operations, you can create a Histogram instrument and update it synchronously with the operation:

Go:
import "go.opentelemetry.io/otel/metric"

opHistogram, _ := meter.Int64Histogram(
    "some.prefix.duration",
    metric.WithDescription("Duration of some operation"),
)

t1 := time.Now()
op(ctx)
dur := time.Since(t1)

opHistogram.Record(ctx, dur.Microseconds())
Python:
from opentelemetry import metrics
import time

meter = metrics.get_meter(__name__)

op_histogram = meter.create_histogram(
    "some.prefix.duration",
    description="Duration of some operation",
)

start = time.time()
op()
duration = time.time() - start

op_histogram.record(duration * 1_000_000)  # microseconds
Node.js:
const { metrics } = require('@opentelemetry/api');

const meter = metrics.getMeter('my-service');

const opHistogram = meter.createHistogram('some.prefix.duration', {
  description: 'Duration of some operation',
});

const start = Date.now();
op();
const duration = Date.now() - start;

opHistogram.record(duration * 1000);  // microseconds

Cache hit rate

To measure cache statistics, you can create a CounterObserver and observe the cache statistics:

Go:
import "go.opentelemetry.io/otel/metric"

counter, _ := meter.Int64ObservableCounter("some.prefix.cache")

// Arbitrary key/value labels.
hits := []attribute.KeyValue{attribute.String("type", "hits")}
misses := []attribute.KeyValue{attribute.String("type", "misses")}
errors := []attribute.KeyValue{attribute.String("type", "errors")}

if _, err := meter.RegisterCallback(
    func(ctx context.Context, o metric.Observer) error {
        stats := cache.Stats()

        o.ObserveInt64(counter, stats.Hits, metric.WithAttributes(hits...))
        o.ObserveInt64(counter, stats.Misses, metric.WithAttributes(misses...))
        o.ObserveInt64(counter, stats.Errors, metric.WithAttributes(errors...))

        return nil
    },
    counter,
); err != nil {
    panic(err)
}
Python:

from opentelemetry import metrics

meter = metrics.get_meter(__name__)

def observe_cache_stats(options):
    stats = cache.stats()

    yield metrics.Observation(stats.hits, {"type": "hits"})
    yield metrics.Observation(stats.misses, {"type": "misses"})
    yield metrics.Observation(stats.errors, {"type": "errors"})

# Callbacks are registered when the instrument is created.
cache_counter = meter.create_observable_counter(
    "some.prefix.cache",
    callbacks=[observe_cache_stats],
)
Node.js:
const { metrics } = require('@opentelemetry/api');

const meter = metrics.getMeter('my-service');

const cacheCounter = meter.createObservableCounter('some.prefix.cache');

cacheCounter.addCallback((observableResult) => {
  const stats = cache.stats();

  observableResult.observe(stats.hits, { type: 'hits' });
  observableResult.observe(stats.misses, { type: 'misses' });
  observableResult.observe(stats.errors, { type: 'errors' });
});

See Monitoring cache stats using OpenTelemetry Go Metrics for details.

Error rate

To directly measure the error rate, you can create a GaugeObserver and observe the value without worrying about how it is calculated:

Go:
import "go.opentelemetry.io/otel/metric"

errorRate, _ := meter.Float64ObservableGauge("some.prefix.error_rate")

if _, err := meter.RegisterCallback(
    func(ctx context.Context, o metric.Observer) error {
        o.ObserveFloat64(errorRate, calculateErrorRate())
        return nil
    },
    errorRate,
); err != nil {
    panic(err)
}
Python:

from opentelemetry import metrics

meter = metrics.get_meter(__name__)

def observe_error_rate(options):
    yield metrics.Observation(calculate_error_rate())

# Callbacks are registered when the instrument is created.
error_rate_gauge = meter.create_observable_gauge(
    "some.prefix.error_rate",
    callbacks=[observe_error_rate],
)
Node.js:
const { metrics } = require('@opentelemetry/api');

const meter = metrics.getMeter('my-service');

const errorRateGauge = meter.createObservableGauge('some.prefix.error_rate');

errorRateGauge.addCallback((observableResult) => {
  observableResult.observe(calculateErrorRate());
});

Best Practices

Following these best practices will help you create effective, performant, and maintainable metrics instrumentation.

Naming Conventions

Use descriptive, hierarchical names: Metric names should clearly describe what is being measured and follow a hierarchical structure using dots as separators.

text
✅ Good: http.server.request.duration
✅ Good: database.connection.active
✅ Good: cache.operations.total

❌ Bad: requests
❌ Bad: db_conn
❌ Bad: cache_ops

Follow semantic conventions: When possible, use OpenTelemetry Semantic Conventions for consistency across applications and teams.

Use consistent units: Include units in metric names when they're not obvious, and be consistent across your application.

text
✅ Good: memory.usage.bytes
✅ Good: request.duration.milliseconds
✅ Good: network.throughput.bytes_per_second

❌ Bad: memory (unclear unit)
❌ Bad: latency (could be seconds, milliseconds, etc.)
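
The SDKs also let you declare the unit on the instrument itself so backends can display it next to the name. For example, in Go, given a meter (the instrument names are placeholders and "ms"/"By" are UCUM unit codes):

go
requestDuration, _ := meter.Int64Histogram("http.server.request.duration",
    metric.WithUnit("ms"), // milliseconds
)

memoryUsage, _ := meter.Int64ObservableUpDownCounter("memory.usage",
    metric.WithUnit("By"), // bytes
)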

Attribute Selection

Keep cardinality manageable: High-cardinality attributes (those with many unique values) can impact performance and storage costs. Avoid using unbounded values as attributes.

text
✅ Good attributes:
- http.method (limited values: GET, POST, etc.)
- http.status_code (limited range: 200, 404, 500, etc.)
- service.version (controlled releases)

❌ High-cardinality attributes to avoid:
- user.id (unbounded)
- request.id (unbounded)
- timestamp (unbounded)
- email.address (unbounded)

Use meaningful attribute names: Choose attribute names that are self-explanatory and follow consistent naming patterns.

text
✅ Good: {http.method: "GET", http.status_code: "200"}
❌ Bad: {method: "GET", code: "200"}

Prefer standardized attributes: Use well-known attribute names from semantic conventions when available.

Performance Considerations

Choose the right instrument type: Using the wrong instrument can impact both performance and the usefulness of your data.

go
// ✅ Good: Use Counter for monotonic values
requestCounter.Add(ctx, 1)

// ❌ Bad: Using Histogram when you only need counts
requestHistogram.Record(ctx, 1) // Wastes resources on bucketing

Minimize synchronous instrument calls: Reduce the performance impact on your application's critical path.

go
// ✅ Good: Batch measurements when possible
func processRequests(requests []Request) {
    start := time.Now()
    for _, req := range requests {
        processRequest(req)
    }
    // Single measurement for the batch
    batchDuration.Record(ctx, time.Since(start).Milliseconds())
    batchSize.Record(ctx, int64(len(requests)))
}

// ❌ Bad: Individual measurements for each item
func processRequests(requests []Request) {
    for _, req := range requests {
        start := time.Now()
        processRequest(req)
        requestDuration.Record(ctx, time.Since(start).Milliseconds())
    }
}

Use asynchronous instruments for expensive measurements: When collecting metrics requires expensive operations (like querying system resources), use observers.

go
// ✅ Good: Asynchronous measurement of expensive operations
memoryGauge, _ := meter.Int64ObservableGauge("system.memory.usage")
meter.RegisterCallback(func(ctx context.Context, o metric.Observer) error {
    // This expensive call happens periodically, not on every request
    memStats := getMemoryStats()
    o.ObserveInt64(memoryGauge, memStats.Used)
    return nil
}, memoryGauge)

Control measurement frequency: Be mindful of how often metrics are collected, especially for high-frequency operations.

go
// ✅ Good: Sample high-frequency events
if rand.Float64() < 0.01 { // Sample 1% of events
    detailedHistogram.Record(ctx, operationDuration)
}

// Always measure critical metrics
errorCounter.Add(ctx, 1)

Resource and Context Management

Reuse instruments: Create instruments once and reuse them throughout your application lifecycle.

go
// ✅ Good: Create instruments at startup
var (
    requestCounter    metric.Int64Counter
    requestDuration   metric.Int64Histogram
    activeConnections metric.Int64UpDownCounter
)

func init() {
    requestCounter, _ = meter.Int64Counter("http.requests.total")
    requestDuration, _ = meter.Int64Histogram("http.request.duration")
    activeConnections, _ = meter.Int64UpDownCounter("http.connections.active")
}

// ❌ Bad: Creating instruments repeatedly
func handleRequest() {
    counter, _ := meter.Int64Counter("http.requests.total") // Expensive!
    counter.Add(ctx, 1)
}

Use appropriate context: Pass relevant context to measurements to enable correlation with traces.

go
// ✅ Good: Use request context for correlation
func handleRequest(ctx context.Context) {
    requestCounter.Add(ctx, 1) // Can be correlated with trace
}

// ❌ Bad: Using background context loses correlation
func handleRequest(ctx context.Context) {
    requestCounter.Add(context.Background(), 1) // No trace correlation
}

Aggregation and Analysis Considerations

Design for your analysis needs: Consider how you'll use the metrics when choosing instruments and attributes.

go
// ✅ Good: Structure for useful aggregation
requestDuration.Record(ctx, duration,
    metric.WithAttributes(
        attribute.String("http.method", method),
        attribute.String("http.route", route),      // Not full path
        attribute.String("service.version", version),
    ))

// This allows queries like:
// - Average latency by HTTP method
// - 95th percentile by service version
// - Error rate by route pattern

Balance detail with utility: More attributes provide more insight but increase complexity and resource usage.

go
// ✅ Good: Essential attributes for analysis
attribute.String("environment", env),           // prod, staging, dev
attribute.String("service.version", version),   // v1.2.3
attribute.String("http.method", method),        // GET, POST
attribute.String("http.route", route),          // /users/{id}, not /users/123

// ❌ Too detailed: Creates explosion of timeseries
attribute.String("user.id", userID),           // High cardinality
attribute.String("request.id", requestID),     // Unique per request
attribute.String("http.url.full", fullURL),    // High cardinality

How to start using OpenTelemetry Metrics?

The easiest way to get started with metrics is to pick an OpenTelemetry backend and follow the documentation. Most vendors provide pre-configured OpenTelemetry distros that allow you to skip some steps and can significantly improve your experience.

Uptrace is an OpenTelemetry APM that supports distributed tracing, metrics, and logs. You can use it to monitor applications and troubleshoot issues.

Uptrace Overview

Uptrace comes with an intuitive query builder, rich dashboards, alerting rules with notifications, and integrations for most languages and frameworks.

Uptrace can process billions of spans and metrics on a single server and allows you to monitor your applications at 10x lower cost.

In just a few minutes, you can try Uptrace by visiting the cloud demo (no login required) or running it locally with Docker. The source code is available on GitHub.

What's next?

Next, learn about OpenTelemetry Metrics API for your programming language: