Alerts and Notifications
Monitor types
Uptrace supports 2 types of monitors: metric and error monitors.
Metric monitors allow you to create alerts when metric values meet certain conditions, for example:
- Number of requests is greater than 100 per minute for the last 5 minutes.
- Number of logs/errors is greater than 100 per minute for the last 3 minutes.
Error monitors allow you to create alerts for certain errors (exceptions) and logs, for example:
- All logs/errors with a certain log_severity.
- All logs/errors with a certain display_name.
Metric monitors
Uptrace lets you create alerts when a monitored metric value meets certain conditions. For example, you can create an alert when the system_filesystem_usage metric exceeds 90%.
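The kind of condition described above can be sketched in a few lines of Python. This is only an illustration under an assumption: that an alert fires when each of the last check_num_point values is above max_allowed_value (the two field names come from the YAML examples on this page). The function name should_alert and the sample values are made up, and this is not Uptrace's actual implementation.

```python
from typing import Sequence


def should_alert(points: Sequence[float], max_allowed_value: float, check_num_point: int) -> bool:
    """Return True when the last `check_num_point` points all exceed the limit.

    Assumed semantics for illustration only; not taken from Uptrace's code.
    """
    if check_num_point <= 0 or len(points) < check_num_point:
        return False  # not enough data to decide
    recent = points[-check_num_point:]
    return all(value > max_allowed_value for value in recent)


# Filesystem utilization sampled once per minute (made-up values).
fs_util = [0.85, 0.91, 0.93, 0.95]
print(should_alert(fs_util, max_allowed_value=0.9, check_num_point=3))  # True
print(should_alert(fs_util, max_allowed_value=0.9, check_num_point=4))  # False
```

Requiring several consecutive points, rather than a single one, is a common way to avoid alerting on short spikes.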
Examples
Here are some examples of metric monitors you can create to monitor OpenTelemetry host metrics. You can create them using the UI:
- Navigate to "Alerting" -> "Monitors".
- Click on "New monitor" -> "From YAML".
To monitor CPU usage:
monitors:
  - name: CPU usage
    type: metric
    metrics:
      - system_cpu_load_average_15m as $load_avg_15m
      - system_cpu_time as $cpu_time
    query:
      - $load_avg_15m / uniq($cpu_time.cpu) as cpu_util
      - group by host_name
    column: cpu_util
    column_unit: utilization
    max_allowed_value: 3
    check_num_point: 10
To monitor filesystem usage:
monitors:
  - name: Filesystem usage
    type: metric
    metrics:
      - system_filesystem_usage as $fs_usage
    query:
      - $fs_usage{state='used'} / $fs_usage as fs_util
      - group by host_name, mountpoint
      - where mountpoint !~ "/snap"
    column: fs_util
    column_unit: utilization
    max_allowed_value: 0.9
    check_num_point: 3
To monitor the number of pending disk operations:
monitors:
  - name: Disk pending operations
    type: metric
    metrics:
      - system_disk_pending_operations as $pending_ops
    query:
      - $pending_ops
      - group by host_name, device
    max_allowed_value: 100
    check_num_point: 10
To monitor network errors:
monitors:
  - name: Network errors
    type: metric
    metrics:
      - system_network_errors as $net_errors
    query:
      - $net_errors
      - group by host_name
    max_allowed_value: 0
    check_num_point: 3
Monitoring spans, logs, and events
You can also monitor tracing data using the following system metrics created by Uptrace:
- uptrace_tracing_spans. Number of spans and their duration (excluding events and logs).
- uptrace_tracing_logs. Number of logs (excluding spans and events).
- uptrace_tracing_events. Number of events (excluding spans and logs).
You can use all available span attributes for filtering and grouping, for example, where _status_code = 'error' or group by host_name.
Examples
You can create the following examples using the UI:
- Navigate to "Alerting" -> "Monitors".
- Click on "New monitor" -> "From YAML".
To monitor the average PostgreSQL SELECT query duration:
monitors:
  - name: PostgreSQL SELECT duration
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      - avg($spans)
      - where _system = 'db:postgresql'
      - where db_operation = 'SELECT'
    max_allowed_value: 10000 # 10 milliseconds
    check_num_point: 5
To monitor the median duration of all database operations:
monitors:
  - name: Database operations duration
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      - p50($spans)
      - where _type = "db"
    max_allowed_value: 10000 # 10 milliseconds
    check_num_point: 5
To monitor the number of errors:
monitors:
  - name: Number of errors
    type: metric
    metrics:
      - uptrace_tracing_logs as $logs
    query:
      - perMin(sum($logs))
      - where _system in ("log:error", "log:fatal")
    max_allowed_value: 10
    check_num_point: 3
To monitor failed requests:
monitors:
  - name: Failed requests
    type: metric
    metrics:
      - uptrace_tracing_spans as $spans
    query:
      - perMin(count($spans{_status_code="error"})) as failed_requests
      - where _type = "httpserver"
    max_allowed_value: 0
Error monitors
Uptrace automatically creates an error monitor for logs with log_severity levels ERROR and FATAL.
You can use all available filters to include/exclude monitored logs, for example, where display_name contains "timeout". You can also customize default filters to monitor other log levels, for example, _system in ("log:warn", "log:error", "log:fatal").
You can add group by clauses to customize the default error grouping and create a separate alert/notification for each group, for example, group by _group_id, service_name, cloud_region.
Examples
You can create the following examples using the UI:
- Navigate to "Alerting" -> "Monitors".
- Click on "New monitor" -> "From YAML".
To monitor all errors:
monitors:
  - name: Notify on all errors
    type: error
    notify_everyone_by_email: true
    query:
      - group by _group_id
      - where _system in ("log:error", "log:fatal")
To monitor errors with a certain message:
monitors:
  - name: Notify on "timeout" errors
    type: error
    notify_everyone_by_email: true
    query:
      - group by _group_id
      - where _system in ("log:error", "log:fatal")
      - where display_name contains "timeout"
To monitor exceptions:
monitors:
  - name: Exceptions
    type: error
    notify_everyone_by_email: true
    query:
      - group by _group_id
      - where _system in ("log:error", "log:fatal")
      - where exception_type exists
To monitor all errors in each environment except dev:
monitors:
  - name: Notify on all errors except in "dev" environment
    type: error
    notify_everyone_by_email: true
    query:
      - group by _group_id
      - group by deployment_environment
      - where _system in ("log:error", "log:fatal")
      - where deployment_environment != "dev"
Notification channels
You can create notification channels to receive notifications via email, Slack/Mattermost, Telegram, Microsoft Teams, PagerDuty, Opsgenie, AlertManager, and webhooks. You can specify which notification channels to use when creating a monitor.
To create a notification channel:
- Go to the "Alerting" -> "Channels" tab.
- Click on "New channel" -> "Slack" to open a form.
Channel conditions
When creating a channel, you can specify a condition to filter out notifications for certain alerts and monitors.
Uptrace uses Expr language for writing conditions. In addition to built-in functions provided by Expr, Uptrace also supports the following functions:
Function | Comment |
---|---|
monitorName() string | Returns the monitor name. |
alertName() string | Returns the alert name. |
alertType() string | Returns the alert type: error or metric. |
attr(key string) string | Returns the alert attribute value. |
hasAttr(key string) bool | Returns true if the alert attribute exists. |
To only send notifications for the prod environment:
attr("deployment_environment") == "prod"
To only send notifications for host names that start with prod-:
attr("host_name") startsWith "prod-"
To only send notifications for monitors that have URGENT in the name:
monitorName() contains "URGENT"
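To make the meaning of these three conditions concrete, they can be mirrored in plain Python. The Alert class and the helper names below are hypothetical and exist only for this sketch; real conditions are written in Expr and evaluated by Uptrace. The sketch also assumes that a missing attribute behaves like an empty string, which the text above does not state.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Alert:
    """Hypothetical stand-in for the data a channel condition can inspect."""
    monitor_name: str
    attrs: Dict[str, str] = field(default_factory=dict)

    def attr(self, key: str) -> str:
        # Assumption: a missing attribute is treated as an empty string.
        return self.attrs.get(key, "")

    def has_attr(self, key: str) -> bool:
        return key in self.attrs


def is_prod(alert: Alert) -> bool:
    # Mirrors: attr("deployment_environment") == "prod"
    return alert.attr("deployment_environment") == "prod"


def is_prod_host(alert: Alert) -> bool:
    # Mirrors: attr("host_name") startsWith "prod-"
    return alert.attr("host_name").startswith("prod-")


def is_urgent(alert: Alert) -> bool:
    # Mirrors: monitorName() contains "URGENT"
    return "URGENT" in alert.monitor_name


alert = Alert("URGENT: CPU usage", {"host_name": "prod-db-1", "deployment_environment": "prod"})
print(is_prod(alert), is_prod_host(alert), is_urgent(alert))  # True True True
```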
Notification frequency
On the first occurrence, Uptrace creates an alert and sends a notification. If there are new occurrences, Uptrace will periodically remind you about the alert.
Uptrace uses an adaptive interval to wait before sending a notification again. The interval increases over time and is different for metric and error monitors.
- Metric monitors. The interval starts from 15 minutes and doubles every 3 notifications, e.g. 15m, 15m, 15m, 30m, 30m, 30m, 1h... The max interval is 24 hours.
- Error monitors. The interval starts from 1 hour and doubles every 2 notifications, e.g. 1h, 1h, 2h, 2h, 4h... The max interval is 1 week.
The total number of notifications is not limited. For example, if a metric monitor never recovers, you will receive notifications every 24 hours indefinitely.
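The two schedules can be reproduced with a short Python generator. This is a sketch of the arithmetic described above, not Uptrace's code: the function name reminder_intervals is made up, and it assumes the interval is simply clamped to the maximum once doubling would exceed it.

```python
from datetime import timedelta
from itertools import islice
from typing import Iterator


def reminder_intervals(start: timedelta, repeat: int, cap: timedelta) -> Iterator[timedelta]:
    """Yield the wait before each reminder.

    Each interval is used `repeat` times, then doubled, and never exceeds `cap`.
    """
    interval = start
    while True:
        for _ in range(repeat):
            yield interval
        # Assumption: the doubled interval is clamped to the maximum.
        interval = min(interval * 2, cap)


metric = reminder_intervals(timedelta(minutes=15), repeat=3, cap=timedelta(hours=24))
error = reminder_intervals(timedelta(hours=1), repeat=2, cap=timedelta(weeks=1))

print([str(d) for d in islice(metric, 7)])
# ['0:15:00', '0:15:00', '0:15:00', '0:30:00', '0:30:00', '0:30:00', '1:00:00']
print([str(d) for d in islice(error, 5)])
# ['1:00:00', '1:00:00', '2:00:00', '2:00:00', '4:00:00']
```

Because the generator never terminates, it also reflects the point that reminders continue at the maximum interval for as long as the alert stays open.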
When an alert is closed, Uptrace also sends a corresponding notification.
Email notifications
The information below is only relevant for the self-hosted version.
To receive email notifications in the Uptrace Community version, make sure users have correct email addresses and the mailer option is properly configured and enabled.