Alerting and Notifications
Uptrace supports two types of monitors: metric monitors and error monitors.
Metric monitors let you create alerts and receive notifications when metric values meet certain conditions.
Error monitors let you receive notifications for certain errors (exceptions) and logs, for example, production logs with the ERROR severity level.
Notification channels
You can create notification channels to receive notifications via email, Slack, Telegram, PagerDuty, Opsgenie, AlertManager, and webhooks. You can specify which notification channels to use when creating monitors.
To create a notification channel:
- Go to the "Alerting" -> "Channels" tab.
- Click "New channel" and select a channel type, for example, "Slack", to open the form.
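For the webhook channel type, the receiving side is a service you run yourself. The following is a minimal sketch of such a receiver, assuming notifications arrive as HTTP POST requests with a JSON body. The payload fields are not described in this guide, so the handler only decodes the body and makes no assumptions about its structure. The names decode_payload, WebhookHandler, and serve, as well as port 8080, are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def decode_payload(body: bytes):
    """Decode a JSON request body; return an empty dict for an empty body."""
    if not body:
        return {}
    return json.loads(body.decode("utf-8"))


class WebhookHandler(BaseHTTPRequestHandler):
    """Hypothetical receiver: accepts POST requests and prints the payload."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = decode_payload(self.rfile.read(length))
        except (UnicodeDecodeError, json.JSONDecodeError):
            # Reject bodies that are not valid UTF-8 JSON.
            self.send_response(400)
            self.end_headers()
            return
        print("received notification:", payload)
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted (assumed port, adjust as needed)."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling serve() starts the receiver; the webhook channel would then be pointed at the URL where it is reachable.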
Monitoring metrics
Uptrace lets you create alerts when a monitored metric value meets certain conditions. For example, you can create an alert when the system_filesystem_usage metric exceeds 90%.
Examples
Here are some examples of metric monitors you can create to monitor OpenTelemetry host metrics. This guide uses YAML syntax to define monitors, but you will usually create monitors using the Uptrace UI.
To monitor CPU usage:
monitors:
  - name: CPU usage
    type: metric
    metrics:
      - system_cpu_load_average_15m as $load_avg_15m
      - system_cpu_time as $cpu_time
    query:
      - $load_avg_15m / uniq($cpu_time.cpu) as cpu_util
      - group by host_name
    column: cpu_util
    column_unit: utilization
    max_allowed_value: 3
    check_num_point: 10
To monitor filesystem usage:
monitors:
  - name: Filesystem usage
    type: metric
    metrics:
      - system_filesystem_usage as $fs_usage
    query:
      - $fs_usage{state='used'} / $fs_usage as fs_util
      - group by host_name, mountpoint
      - where mountpoint !~ "/snap"
    column: fs_util
    column_unit: utilization
    max_allowed_value: 0.9
    check_num_point: 3
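To make the arithmetic behind the filesystem monitor concrete, here is a rough Python sketch of the same calculation: used bytes divided by total bytes per host and mountpoint, skipping mountpoints that match "/snap". It is an illustration only, not how Uptrace evaluates queries. The sample points and the function name filesystem_alerts are hypothetical, and the sketch looks at a single point in time, so it ignores check_num_point.

```python
import re
from collections import defaultdict

# Hypothetical sample points: (host_name, mountpoint, state, bytes).
SAMPLE_POINTS = [
    ("host1", "/", "used", 95),
    ("host1", "/", "free", 5),
    ("host1", "/data", "used", 40),
    ("host1", "/data", "free", 60),
    ("host1", "/snap/core", "used", 100),
]


def filesystem_alerts(points, max_allowed_value=0.9):
    """Return {(host_name, mountpoint): fs_util} for groups above the threshold."""
    used = defaultdict(float)
    total = defaultdict(float)
    for host_name, mountpoint, state, value in points:
        if re.search("/snap", mountpoint):  # where mountpoint !~ "/snap"
            continue
        key = (host_name, mountpoint)  # group by host_name, mountpoint
        total[key] += value  # $fs_usage summed over all states
        if state == "used":
            used[key] += value  # $fs_usage{state='used'}
    return {
        key: used[key] / total[key]
        for key in total
        if total[key] > 0 and used[key] / total[key] > max_allowed_value
    }
```

With the sample data, only the root filesystem on host1 exceeds the 0.9 threshold; "/data" is at 40% and "/snap/core" is filtered out.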
To monitor the number of pending disk operations:
monitors:
  - name: Disk pending operations
    type: metric
    metrics:
      - system_disk_pending_operations as $pending_ops
    query:
      - $pending_ops
      - group by host_name, device
    max_allowed_value: 100
    check_num_point: 10
To monitor network errors:
monitors:
  - name: Network errors
    type: metric
    metrics:
      - system_network_errors as $net_errors
    query:
      - $net_errors
      - group by host_name
    max_allowed_value: 0
    check_num_point: 3
Monitoring spans and logs
You can also monitor tracing data using the following system metrics created by Uptrace:
- uptrace_tracing_spans. Number of spans and their duration (excluding events and logs).
- uptrace_tracing_logs. Number of logs (excluding spans and events).
- uptrace_tracing_events. Number of events (excluding spans and logs).
You can use all available span attributes for filtering and grouping, for example, where _status_code = 'error' or group by host_name.
Examples
To monitor the average PostgreSQL SELECT query duration:
monitors:
  - name: PostgreSQL SELECT duration
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      - avg($spans)
      - where _system = 'db:postgresql'
      - where db_operation = 'SELECT'
    max_allowed_value: 10000 # 10 milliseconds
    check_num_point: 5
To monitor the median duration of all database operations:
monitors:
  - name: Database operations duration
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      - p50($spans)
      - where _type = 'db'
    max_allowed_value: 10000 # 10 milliseconds
    check_num_point: 5
To monitor the number of errors:
monitors:
  - name: Number of errors
    type: metric
    metrics:
      - uptrace_tracing_logs as $logs
    query:
      - per_min(sum($logs))
      - where _system in ('log:error', 'log:fatal')
    max_allowed_value: 10
    check_num_point: 3
To monitor the number of exceptions:
monitors:
  - name: Number of exceptions
    type: metric
    metrics:
      - uptrace_tracing_logs as $logs
    query:
      - per_min(sum($logs))
      - where _system = 'log:error'
      - where exception_type exists
    max_allowed_value: 10
    check_num_point: 3
Monitoring errors
Uptrace automatically creates alerts for exceptions and logs with log_severity levels ERROR, FATAL, and PANIC.
By default, Uptrace has an error monitor that sends email notifications for all error alerts. You can create additional error monitors that send notifications only for errors that match certain conditions, for example, errors with deployment_environment=prod and db_system=postgresql.
Error notifications
Because there can be millions of errors, Uptrace groups errors by message (error_message or log_message attributes) and only sends notifications in the following cases:
- When the error message is first seen.
- When the number of occurrences reaches 100, 1000, 10000, etc.
- 24 hours after the last notification. After 3 days, the period increases to 1 week.
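The rules above can be sketched as a small decision function. This is an illustration of the listed cases, not Uptrace's implementation: the function names are hypothetical, and the sketch assumes that the 3-day mark is measured from the time the error was first seen, which the text does not specify.

```python
from datetime import datetime, timedelta


def is_count_milestone(count: int) -> bool:
    """True for 100, 1000, 10000, ... (powers of ten starting at 100)."""
    if count < 100:
        return False
    while count % 10 == 0:
        count //= 10
    return count == 1


def should_notify(count, first_seen, last_notified, now) -> bool:
    """Decide whether to send a notification for one error group."""
    if last_notified is None:
        return True  # the error message is seen for the first time
    if is_count_milestone(count):
        return True  # the number of occurrences reached 100, 1000, ...
    # Assumption: the repeat period switches from 24 hours to 1 week
    # once 3 days have passed since the error was first seen.
    if now - first_seen >= timedelta(days=3):
        period = timedelta(weeks=1)
    else:
        period = timedelta(hours=24)
    return now - last_notified >= period
```

For example, an error group first seen an hour ago with 5 occurrences stays silent, while the same group triggers a notification again once it reaches 100 occurrences or 24 hours pass.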
Email notifications
To receive email notifications in the Uptrace Community version, make sure users have correct email addresses and the smtp_mailer is properly configured and enabled:
# uptrace.yml
auth:
  users:
    - name: John Smith
      email: john.smith@gmail.com
      password: uptrace
      notify_by_email: true

smtp_mailer:
  enabled: true
  host: smtp.gmail.com
  port: 587
  username: '[SENDER]@gmail.com'
  password: '[APP_PASSWORD]'
  from: '[SENDER]@gmail.com'
Note that Gmail does not allow you to use your real password in smtp_mailer.password. Instead, you should generate an app password for Gmail:
- In Gmail, click on your avatar -> "Manage your Google Account".
- On the left, click on "Security".
- Scroll to "Signing in to Google" and click on "App password".
See Gmail documentation for details.
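Before troubleshooting Uptrace itself, it can help to check the SMTP credentials independently. The following is a minimal sketch using Python's standard smtplib, assuming the server expects STARTTLS on port 587, as Gmail does. The function names and the message text are hypothetical; pass the same host, port, username, and app password that you put in smtp_mailer.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender: str, recipient: str) -> EmailMessage:
    """Build a plain-text test email."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "SMTP settings test"
    msg.set_content("If you can read this, the SMTP settings work.")
    return msg


def send_test_message(host, port, username, password, msg) -> None:
    """Send the message over SMTP with STARTTLS (assumed for port 587)."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls()  # upgrade the connection to TLS before logging in
        smtp.login(username, password)  # for Gmail, use the app password
        smtp.send_message(msg)
```

If send_test_message raises smtplib.SMTPAuthenticationError, the username or app password is likely wrong; if it succeeds and the email arrives, the same values should work in uptrace.yml.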