Metric monitors

A metric monitor evaluates a UQL expression over a rolling time window and fires an alert when the computed value exceeds a threshold. Any metric Uptrace collects is available: OpenTelemetry host metrics, custom application metrics, and the internal span and log metrics Uptrace generates automatically.

How evaluation works

Each metric monitor defines:

  • Metrics — one or more metric aliases used in the query expression.
  • Query — a UQL expression that produces a single numeric value per evaluation cycle.
  • Threshold (max_value) — the value above which an alert fires.
  • Evaluation points (num_eval_points) — how many consecutive data points must exceed the threshold before firing. Higher values reduce noise from spikes.

Uptrace evaluates the expression on a fixed schedule. Once the value has exceeded the threshold for num_eval_points consecutive data points, it creates an alert and sends a notification. When the value drops back below the threshold, the alert closes and a recovery notification is sent.
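To show how the four settings above map onto a monitor definition, here is a minimal annotated sketch. It reuses only fields that appear in the examples on this page. The monitor name, the span_rate alias, and the threshold of 1000 are illustrative assumptions, and the worked series in the comments assumes the threshold must be exceeded on num_eval_points consecutive points, as the list above describes.

```yaml
monitors:
  - name: Span rate (annotated sketch) # hypothetical monitor name
    type: metric
    # Metrics: each entry binds a metric to an alias used in the query.
    metrics:
      - uptrace_tracing_spans as $spans
    # Query: must produce a single numeric value per evaluation cycle.
    query:
      - perMin(count($spans)) as span_rate
    detector:
      type: manual
      # Threshold: values above this count as a breach (illustrative value).
      max_value: 1000
    # Evaluation points: consecutive breaching points required before firing.
    # With 3 points and the series 1200, 900, 1100, 1300, 1250, the streak
    # resets at 900, so the alert would fire on the fifth point.
    num_eval_points: 3
```

Raising num_eval_points delays the alert by that many evaluation cycles but filters out short spikes such as the single 1200 reading in the series above.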

To create a monitor, go to Alerting → Monitors → New monitor → From YAML and paste one of the examples below.

Infrastructure metrics

The following monitors work with the OpenTelemetry Host Metrics receiver.

CPU usage:

yaml
monitors:
  - name: CPU usage
    type: metric
    metrics:
      - system_cpu_load_average_15m as $load_avg_15m
      - system_cpu_time as $cpu_time
    query:
      - $load_avg_15m / uniq($cpu_time.cpu) as cpu_util
      - group by host_name
    column:
      name: cpu_util
      unit: utilization
    detector:
      type: manual
      max_value: 3
    num_eval_points: 10

Filesystem usage:

yaml
monitors:
  - name: Filesystem usage
    type: metric
    metrics:
      - system_filesystem_usage as $fs_usage
    query:
      - $fs_usage{state='used'} / $fs_usage as fs_util
      - group by host_name, mountpoint
      - where mountpoint !~ "/snap"
    column:
      name: fs_util
      unit: utilization
    detector:
      type: manual
      max_value: 0.9
    num_eval_points: 3

Disk pending operations:

yaml
monitors:
  - name: Disk pending operations
    type: metric
    metrics:
      - system_disk_pending_operations as $pending_ops
    query:
      - $pending_ops
      - group by host_name, device
    detector:
      type: manual
      max_value: 100
    num_eval_points: 10

Network errors:

yaml
monitors:
  - name: Network errors
    type: metric
    metrics:
      - system_network_errors as $net_errors
    query:
      - $net_errors
      - group by host_name
    detector:
      type: manual
      max_value: 0
    num_eval_points: 3

Span and log metrics

Uptrace generates three internal metrics from your tracing pipeline that you can query in metric monitors:

| Metric | Description |
| --- | --- |
| uptrace_tracing_spans | Span count and duration. Excludes events and logs. |
| uptrace_tracing_logs | Log record count. Excludes spans and events. |
| uptrace_tracing_events | Event count. Excludes spans and logs. |

All span attributes are available for filtering and grouping — for example where _status_code = 'error' or group by service_name. See Querying spans for the full attribute reference.
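As a sketch of how the filter and grouping clauses mentioned above combine in one monitor, the following counts error spans per service. The monitor name, the error_spans alias, and the threshold of 5 per minute are illustrative assumptions, not recommended values.

```yaml
monitors:
  - name: Error spans per service # hypothetical monitor name
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      # Per-minute count of spans, restricted to errors by the where clause.
      - perMin(count($spans)) as error_spans
      - where _status_code = 'error'
      # One timeseries per service; each is checked against the threshold.
      - group by service_name
    detector:
      type: manual
      max_value: 5 # illustrative threshold, errors per minute
    num_eval_points: 3
```

Grouping by service_name should produce a separate timeseries for each service, so a noisy service does not hide a breach in a quiet one.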

PostgreSQL SELECT duration:

yaml
monitors:
  - name: PostgreSQL SELECT duration
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      - avg($spans)
      - where _system = 'db:postgresql'
      - where db_operation = 'SELECT'
    detector:
      type: manual
      max_value: 10 # milliseconds
    num_eval_points: 5

Database operation latency (p50):

yaml
monitors:
  - name: Database operations duration
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      - p50($spans)
      - where _type = "db"
    detector:
      type: manual
      max_value: 10 # milliseconds
    num_eval_points: 5

Log error rate:

yaml
monitors:
  - name: Number of errors
    type: metric
    metrics:
      - uptrace_tracing_logs as $logs
    query:
      - perMin(sum($logs))
      - where _system in ("log:error", "log:fatal")
    detector:
      type: manual
      max_value: 10
    num_eval_points: 3

Failed HTTP requests:

yaml
monitors:
  - name: Failed requests
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      - perMin(count($spans{_status_code="error"})) as failed_requests
      - where _type = "httpserver"
    detector:
      type: manual
      max_value: 0

Alert names

For metric monitors, Uptrace generates alert names using the monitor name and timeseries name, for example, "Disk usage: myhost+mydisk".

For error monitors, Uptrace generates alert names using the error (log) message, for example, "ERROR *fmt.wrapError: writeError failed".

You can customize alert names by specifying a Go template string as the monitor name when creating a monitor. For example, {{ .Attrs.deployment_environment_name }}: {{ .DisplayName }} prefixes the alert name with the value of the deployment environment attribute.

You can use the following variables in templates:

| Variable | Type | Description |
| --- | --- | --- |
| {{ .DisplayName }} | string | Same as _display_name when querying spans and logs. |
| {{ .Attrs }} | map[string]any | All available attributes, for example, {{ .Attrs.service_name }}. |
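A templated name might look like the following in a YAML monitor definition. This is a sketch under two assumptions that the text above does not confirm: that the template is accepted in the name field of a YAML definition in the same way as in the UI, and that an attribute must be part of the grouping to be available under .Attrs. The name is quoted because a YAML value that starts with { would otherwise be parsed as a flow mapping.

```yaml
monitors:
  # Go template used as the monitor name; quoted so YAML reads it as a string.
  - name: '{{ .Attrs.deployment_environment_name }}: {{ .DisplayName }}'
    type: metric
    metrics:
      - system_filesystem_usage as $fs_usage
    query:
      - $fs_usage{state='used'} / $fs_usage as fs_util
      # Assumption: grouping by the attribute exposes it to the template.
      - group by deployment_environment_name, host_name, mountpoint
    column:
      name: fs_util
      unit: utilization
    detector:
      type: manual
      max_value: 0.9
    num_eval_points: 3
```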