Alerts and Notifications

Monitor types

Uptrace supports two types of monitors: metric monitors and error monitors.

Metric monitors allow you to create alerts when metric values meet certain conditions, for example:

  • Number of requests is greater than 100 per minute for the last 5 minutes.
  • Number of logs/errors is greater than 100 per minute for the last 3 minutes.

Error monitors allow you to create alerts for certain errors (exceptions) and logs, for example:

  • All logs/errors with a certain log_severity.
  • All logs/errors with a certain display_name.

mermaid
flowchart TD
  subgraph Monitors
    error(Error monitor)
    metric(Metric monitor)
  end
  subgraph channels [Notification Channels]
    email(Email)
    slack(Slack)
    pd(PagerDuty)
    am(AlertManager)
    webhook(WebHook)
  end
  alerts[(Alerts)]
  Monitors --> alerts
  alerts --> channels

Metric monitors

Uptrace allows you to create alerts when the monitored metric value meets certain conditions. For example, you can create an alert when the system_filesystem_usage metric exceeds 90%.

Examples

Here are some examples of metric monitors you can create to monitor OpenTelemetry host metrics. You can create them using the UI:

  1. Navigate to "Alerting" -> "Monitors".
  2. Click on "New monitor" -> "From YAML".

To monitor CPU usage:

yaml
monitors:
  - name: CPU usage
    type: metric
    metrics:
      - system_cpu_load_average_15m as $load_avg_15m
      - system_cpu_time as $cpu_time
    query:
      - $load_avg_15m / uniq($cpu_time.cpu) as cpu_util
      - group by host_name
    column: cpu_util
    column_unit: utilization
    max_allowed_value: 3
    check_num_point: 10

To monitor filesystem usage:

yaml
monitors:
  - name: Filesystem usage
    type: metric
    metrics:
      - system_filesystem_usage as $fs_usage
    query:
      - $fs_usage{state='used'} / $fs_usage as fs_util
      - group by host_name, mountpoint
      - where mountpoint !~ "/snap"
    column: fs_util
    column_unit: utilization
    max_allowed_value: 0.9
    check_num_point: 3

To monitor the number of pending disk operations:

yaml
monitors:
  - name: Disk pending operations
    type: metric
    metrics:
      - system_disk_pending_operations as $pending_ops
    query:
      - $pending_ops
      - group by host_name, device
    max_allowed_value: 100
    check_num_point: 10

To monitor network errors:

yaml
monitors:
  - name: Network errors
    type: metric
    metrics:
      - system_network_errors as $net_errors
    query:
      - $net_errors
      - group by host_name
    max_allowed_value: 0
    check_num_point: 3

Monitoring spans, logs, and events

You can also monitor tracing data using the following system metrics created by Uptrace:

  • uptrace_tracing_spans. Number of spans and their duration (excluding events and logs).
  • uptrace_tracing_logs. Number of logs (excluding spans and events).
  • uptrace_tracing_events. Number of events (excluding spans and logs).

You can use all available span attributes for filtering and grouping, for example, where _status_code = 'error' or group by host_name.
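
For illustration, here is a minimal sketch that combines such a filter and grouping in a single monitor; the 10-per-minute threshold is an arbitrary placeholder, so adjust it to your traffic:

yaml
monitors:
  - name: Error spans per host
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      - perMin(count($spans{_status_code='error'})) as error_spans
      - group by host_name
    max_allowed_value: 10
    check_num_point: 3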

Examples

You can create the following examples using the UI:

  1. Navigate to "Alerting" -> "Monitors".
  2. Click on "New monitor" -> "From YAML".

To monitor average PostgreSQL SELECT query duration:

yaml
monitors:
  - name: PostgreSQL SELECT duration
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      - avg($spans)
      - where _system = 'db:postgresql'
      - where db_operation = 'SELECT'
    max_allowed_value: 10000 # 10 milliseconds
    check_num_point: 5

To monitor median duration of all database operations:

yaml
monitors:
  - name: Database operations duration
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      - p50($spans)
      - where _type = "db"
    max_allowed_value: 10000 # 10 milliseconds
    check_num_point: 5

To monitor the number of errors:

yaml
monitors:
  - name: Number of errors
    type: metric
    metrics:
      - uptrace_tracing_logs as $logs
    query:
      - perMin(sum($logs))
      - where _system in ("log:error", "log:fatal")
    max_allowed_value: 10
    check_num_point: 3

To monitor failed requests:

yaml
monitors:
  - name: Failed requests
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      - perMin(count($spans{_status_code="error"})) as failed_requests
      - where _type = "httpserver"
    max_allowed_value: 0

Error monitors

Uptrace automatically creates an error monitor for logs with log_severity levels ERROR and FATAL.

You can use all available filters to include/exclude monitored logs, for example, where display_name contains "timeout". You can also customize default filters to monitor other log levels, for example, _system in ("log:warn", "log:error", "log:fatal").

You can add group by clauses to customize the default error grouping and create a separate alert/notification for each group, for example, group by _group_id, service_name, cloud_region.
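
Combining both customizations, a sketch of such a monitor might look like this; the query clauses are taken from the examples above, so adjust the log levels and attributes to your data:

yaml
monitors:
  - name: Warnings and errors per service
    type: error
    notify_everyone_by_email: true
    query:
      - group by _group_id, service_name, cloud_region
      - where _system in ("log:warn", "log:error", "log:fatal")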

Examples

You can create the following examples using the UI:

  1. Navigate to "Alerting" -> "Monitors".
  2. Click on "New monitor" -> "From YAML".

To monitor all errors:

yaml
monitors:
  - name: Notify on all errors
    type: error
    notify_everyone_by_email: true
    query:
      - group by _group_id
      - where _system in ("log:error", "log:fatal")

To monitor errors with a certain message:

yaml
monitors:
  - name: Notify on "timeout" errors
    type: error
    notify_everyone_by_email: true
    query:
      - group by _group_id
      - where _system in ("log:error", "log:fatal")
      - where display_name contains "timeout"

To monitor exceptions:

yaml
monitors:
  - name: Exceptions
    type: error
    notify_everyone_by_email: true
    query:
      - group by _group_id
      - where _system in ("log:error", "log:fatal")
      - where exception_type exists

To monitor all errors in each environment except dev:

yaml
monitors:
  - name: Notify on all errors except in "dev" environment
    type: error
    notify_everyone_by_email: true
    query:
      - group by _group_id
      - group by deployment_environment
      - where _system in ("log:error", "log:fatal")
      - where deployment_environment != "dev"

Notification channels

You can create notification channels to receive notifications via email, Slack/Mattermost, Telegram, Microsoft Teams, PagerDuty, Opsgenie, AlertManager, and webhooks. You can specify which notification channels to use when creating a monitor.

To create a notification channel:

  1. Go to the "Alerting" -> "Channels" tab.
  2. Click on the "New channel" -> "Slack" to open a form.

Channel conditions

When creating a channel, you can specify a condition to filter out notifications for certain alerts and monitors.

Uptrace uses the Expr language for writing conditions. In addition to the built-in functions provided by Expr, Uptrace also supports the following functions:

Function                      Comment
monitorName() string          Returns the monitor name.
alertName() string            Returns the alert name.
alertType() string            Returns the alert type: error, metric.
attr(key string) string      Returns the alert attribute value.
hasAttr(key string) bool      Returns true if the alert attribute exists.

To only send notifications for the prod environment:

text
attr("deployment_environment") == "prod"

To only send notifications for host names that start with prod-:

text
attr("host_name") startsWith "prod-"

To only send notifications for monitors that have URGENT in the name:

text
monitorName() contains "URGENT"
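
Conditions can also be combined with the standard Expr boolean operators. For example, assuming the attributes shown above, to notify only about metric alerts coming from the prod environment:

text
alertType() == "metric" && attr("deployment_environment") == "prod"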

Notification frequency

On the first occurrence, Uptrace creates an alert and sends a notification. If there are new occurrences, Uptrace will periodically remind you about the alert.

Uptrace uses an adaptive interval to wait before sending a notification again. The interval increases over time and is different for metric and error monitors.

  • Metric monitors. The interval starts from 15 minutes and doubles every 3 notifications, e.g. 15m, 15m, 15m, 30m, 30m, 30m, 1h... The max interval is 24 hours.
  • Error monitors. The interval starts from 1 hour and doubles every 2 notifications, e.g. 1h, 1h, 2h, 2h, 4h... The max interval is 1 week.

The total number of notifications is not limited. For example, if a metric monitor never recovers, you will receive notifications every 24 hours indefinitely.
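
As a rough model of this schedule (not Uptrace's actual implementation), the following sketch reproduces the doubling intervals described above:

python
from datetime import timedelta

def notification_interval(n, start, double_every, max_interval):
    """Interval to wait before the n-th reminder (n starts at 1)."""
    interval = start * 2 ** ((n - 1) // double_every)
    return min(interval, max_interval)

# Metric monitors: start at 15m, double every 3 notifications, cap at 24h.
metric = [notification_interval(n, timedelta(minutes=15), 3, timedelta(hours=24))
          for n in range(1, 8)]
# -> 15m, 15m, 15m, 30m, 30m, 30m, 1h

# Error monitors: start at 1h, double every 2 notifications, cap at 1 week.
errors = [notification_interval(n, timedelta(hours=1), 2, timedelta(weeks=1))
          for n in range(1, 6)]
# -> 1h, 1h, 2h, 2h, 4h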

When an alert is closed, Uptrace also sends a corresponding notification.

Email notifications

To receive email notifications in the Uptrace Community version, make sure users have correct email addresses and the mailer option is properly configured and enabled.
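
For reference, a minimal sketch of the relevant uptrace.yml section might look like the following; the smtp_mailer section and key names here are assumptions based on the Community edition config and may differ between versions, so check your own uptrace.yml for the exact names:

yaml
# Assumed key names; verify against the uptrace.yml shipped with your version.
smtp_mailer:
  enabled: true
  host: smtp.example.com
  port: 587
  username: uptrace
  password: secret
  from: uptrace@example.com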