Uptrace: Querying Metrics
TIP
To learn about metrics, see the OpenTelemetry Metrics documentation.
Uptrace provides a powerful query language that supports joining, grouping, and aggregating multiple metrics in a single query.
Timeseries
A timeseries is a metric with a unique set of attributes; for example, each host has a separate timeseries for the same metric name:
# metric_name{ attr1, attr2... }
system.filesystem.usage{host.name='host1'} # timeseries 1
system.filesystem.usage{host.name='host2'} # timeseries 2
You can add more attributes to create more detailed and rich timeseries; for example, you can use the state attribute to report the number of free and used bytes in a filesystem:
system.filesystem.usage{host.name='host1', state='free'} # timeseries 1
system.filesystem.usage{host.name='host1', state='used'} # timeseries 2
system.filesystem.usage{host.name='host2', state='free'} # timeseries 3
system.filesystem.usage{host.name='host2', state='used'} # timeseries 4
With just 2 attributes, you can write a number of useful queries:
# the filesystem size (free+used bytes) on each host
query:
- $fs_usage group by host.name
# the number of free bytes on each host
query:
- $fs_usage{state='free'} as free group by host.name
# fs utilization on each host
query:
- $fs_usage{state='used'} / $fs_usage as fs_util group by host.name
# the size of your dataset on all hosts
query:
- $fs_usage{state='used'} as dataset_size
Writing queries
Overview
You start creating a query by selecting metric names and giving them a short alias, for example:
metrics:
# metric aliases always start with a dollar sign
- system.filesystem.usage as $fs_usage
- system.network.packets as $packets
Because Uptrace supports multiple metrics in the same query, you must use the alias to reference the metric ($fs_usage) and metric attributes ($fs_usage.host.name), for example:
query:
# unlike metric aliases, column aliases don't start with a dollar
- $fs_usage as disk_size
- $fs_usage{state="used"} as used_space
# disk size on the specified device
- $fs_usage{host.name='host1', device='/dev/sdd1'} as host1_sdd1
# number of packets on each host.name
- per_min($packets) as packets_per_min group by host.name
You can use multiple metrics in arithmetic expressions, for example, you can write a query to plot the number of hits, misses, and calculate the hit rate:
metrics:
- service.cache.redis as $redis
query:
- $redis{type="hits"} as hits
- $redis{type="misses"} as misses
- hits / (hits + misses) as hit_rate
Filtering
You can filter datapoints using the following operators:
- =, !=, <, <=, >, >=, for example, where host.name = "myhost".
- ~, !~, for example, where host.name ~ "^prod-[a-z]+-[0-9]+$".
- like, not like, for example, where host.name like "prod-%".
- in, not in, for example, where host.name in ("host1", "host2").
Without a metric alias, Uptrace applies filters to all metrics, but you can specify a metric alias to filter a specific metric, for example, where $metric1.host.name = "myhost".
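For example, a minimal sketch that combines an inline attribute filter with a where clause, reusing the filesystem metric from the earlier examples:
metrics:
- system.filesystem.usage as $fs_usage
query:
- $fs_usage{state='used'} as used_space group by host.name
- where host.name like "prod-%"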
Grouping
You can use grouping to get multiple timeseries, for example, one timeseries for each host name:
metrics:
- system.filesystem.usage as $fs_usage
- system.network.packets as $packets
- group by host.name
You can have multiple attributes in the group by clause, for example, group by host.name, service.name. To group by all attributes at once, use group by all.
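For example, a sketch that groups the per-minute packet rate by both attributes (assuming the packets metric also carries a service.name attribute):
metrics:
- system.network.packets as $packets
query:
- per_min($packets) as packets_per_min
- group by host.name, service.name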
Advanced
Count the number of unique combinations of the host.name and service.name attributes:
uniq($metric_alias, host.name, service.name) as num_timeseries
Calculate the difference between the current and previous values:
delta($kafka_part_offset) as messages_processed
Calculate CPU utilization using system.cpu.load_average.15m and system.cpu.time:
$load_avg_15m / uniq($cpu_time, cpu) as cpu_util
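Putting the metric aliases together, the full CPU utilization query might look roughly like this:
metrics:
- system.cpu.load_average.15m as $load_avg_15m
- system.cpu.time as $cpu_time
query:
- $load_avg_15m / uniq($cpu_time, cpu) as cpu_util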
Get the first/last time the metric received an update (note the double dot):
min($cache..time), max($cache..time)
Instruments
OpenTelemetry provides different instruments, and each instrument supports a different set of aggregations (functions):
Instrument Name | Timeseries kind |
---|---|
Counter, CounterObserver | Counter |
UpDownCounter, UpDownCounterObserver | Additive |
GaugeObserver | Gauge |
Histogram | Histogram |
AWS CloudWatch | Summary |
Additionally, some functions carry a special meaning when used in table-based dashboards; for example, note how the last function affects the table value when used with additive instruments:
Expression | Result timeseries | Table value |
---|---|---|
avg($metric) | [1, 4, 3] | 2.67 (avg of the values in the result) |
last(avg($metric)) | [1, 4, 3] | 3 (the last value in the result) |
Counter
Counter is a timeseries kind that measures additive non-decreasing values, for example, the total number of:
- processed requests
- received bytes
- disk reads
Uptrace supports the following functions to aggregate counter timeseries:
Expression | Result timeseries | Table value |
---|---|---|
$metric | Sum of timeseries | Sum of the result timeseries |
last($metric) | Sum of timeseries | Last value in the result timeseries |
per_min($metric) | $metric / _minutes | Avg of the result timeseries |
per_sec($metric) | $metric / _seconds | Avg of the result timeseries |
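For example, a sketch that uses the system.network.packets counter from earlier to compare the raw total with the per-second rate:
metrics:
- system.network.packets as $packets
query:
# sum of the matching timeseries
- $packets as total_packets
# the counter divided by _seconds
- per_sec($packets) as packets_per_sec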
Gauge
Gauge is a timeseries kind that measures non-additive values for which a sum does not produce a meaningful result, for example:
- error rate
- memory utilization
- cache hit rate
Uptrace supports the following functions to aggregate gauge timeseries:
Expression | Result timeseries | Table value |
---|---|---|
$metric | Avg of timeseries | Last value in the result |
avg($metric) | Avg of timeseries | Avg of the result |
min($metric) | Min of timeseries | Min of the result |
max($metric) | Max of timeseries | Max of the result |
sum($metric) * | Sum of timeseries | Sum of the result |
per_min($metric) * | $metric / _minutes | Avg of the result |
per_sec($metric) * | $metric / _seconds | Avg of the result |
delta($metric) * | Diff between curr and previous values | Sum of the result |
* Note that the sum, per_min, per_sec, and delta functions should not normally be used with this instrument and were added only for compatibility with Prometheus and AWS metrics. For the same reason, per_min(sum($metric)) and delta(sum($metric)) are also supported.
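For example, assuming a hypothetical gauge metric named system.memory.utilization (not used elsewhere on this page), a sketch charting the average and peak utilization per host:
metrics:
# system.memory.utilization is assumed here purely for illustration
- system.memory.utilization as $mem_util
query:
- avg($mem_util) as mem_util_avg group by host.name
- max($mem_util) as mem_util_max group by host.name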
Additive
Additive is a timeseries kind which measures additive values that increase or decrease with time, for example, the number of:
- active requests
- open connections
- memory in use (megabytes)
Uptrace supports the following functions to aggregate additive timeseries:
Expression | Result timeseries | Table value |
---|---|---|
$metric | Sum of timeseries | Last value in the result |
sum($metric) | Same as $metric | Sum of the result |
avg($metric) | Avg of timeseries | Avg of the result |
last(avg($metric)) | Avg of timeseries | Last value in the result |
min($metric) | Min of timeseries | Min of the result |
max($metric) | Max of timeseries | Max of the result |
per_min($metric) | $metric / _minutes | Avg of the result |
per_sec($metric) | $metric / _seconds | Avg of the result |
delta($metric) * | Diff between curr and previous values | Sum of the result |
* Note that the delta function should not normally be used with this instrument and was added only for compatibility with Prometheus and AWS metrics.
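For example, treating the system.filesystem.usage metric from earlier as an additive (UpDownCounter) timeseries, a sketch that charts the current value alongside the averaged value per host:
metrics:
- system.filesystem.usage as $fs_usage
query:
# sum of timeseries; the table shows the last value in the result
- $fs_usage as current_usage group by host.name
# avg of timeseries; the table shows the last value in the result
- last(avg($fs_usage)) as avg_usage group by host.name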
Histogram
Histogram is a timeseries kind that stores a histogram of recorded values, for example:
- request latency
- request size
Uptrace supports the following functions to aggregate histogram timeseries:
Expression | Result timeseries | Table value |
---|---|---|
count($metric) | Number of observed values in timeseries | Sum of the result |
p50($metric) | P50 of timeseries | Avg of the result |
p75($metric) | P75 of timeseries | Avg of the result |
p90($metric) | P90 of timeseries | Avg of the result |
p95($metric) | P95 of timeseries | Avg of the result |
p99($metric) | P99 of timeseries | Avg of the result |
avg($metric) | sum($metric) / count($metric) | Avg of the result |
last(avg($metric)) | sum($metric) / count($metric) | Last value in the result |
min($metric) | Min observed value in the histogram | Min of the result |
max($metric) | Max observed value in the histogram | Max of the result |
sum($metric) | Sum of timeseries | Sum of the result |
Note that you can also use per_min(count($metric)) and per_sec(count($metric)).
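For example, assuming a hypothetical histogram metric named http.server.duration (not defined on this page), a sketch charting the p90 latency and the request rate:
metrics:
# http.server.duration is assumed here purely for illustration
- http.server.duration as $srv_dur
query:
- p90($srv_dur) as dur_p90
- per_min(count($srv_dur)) as reqs_per_min
- group by host.name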
Summary
Summary is a timeseries kind that exists for compatibility with Prometheus and AWS. It stores the min, max, sum, and count aggregates of observed values.
Expression | Result timeseries | Table value |
---|---|---|
sum($metric) | Sum of timeseries | Sum of the result |
count($metric) | Number of observed values in timeseries | Sum of the result |
avg($metric) | sum($metric) / count($metric) | Avg of the result |
last(avg($metric)) | sum($metric) / count($metric) | Last value in the result |
min($metric) | Min observed value | Min of the result |
max($metric) | Max observed value | Max of the result |
Note that you can also use per_min(count($metric)) and per_sec(count($metric)).
Dashboards
Uptrace uses 2 different types of dashboards together to visualize metrics data:
- A grid-based dashboard looks like a classic grid of charts.
- A table-based dashboard is a table where each row leads to a separate grid-based dashboard, for example, a table of host names with a separate grid for each host name.
In other words, table dashboards allow you to parameterize grid dashboards with attributes from the table. You can use tables as a replacement for Grafana variables.
For example, Uptrace uses a table-based dashboard to monitor the number of sampled and dropped spans for each project:
metrics:
- uptrace.projects.spans as $spans
query:
- group by project_id
- $spans{type='spans'} as sampled_spans
- $spans{type='dropped'} as dropped_spans
project_id | sampled_spans | dropped_spans | Link to a grid-based dashboard |
---|---|---|---|
1 | 100 | 0 | Grid with where project_id = 1 |
2 | 110 | 0 | Grid with where project_id = 2 |
... | ... | ... | ... |
999 | 90 | 0 | Grid with where project_id = 999 |
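The grid-based dashboard behind each row can then reuse the same metric with the project_id filter supplied by the selected row; a rough sketch:
metrics:
- uptrace.projects.spans as $spans
query:
- per_min($spans{type='spans'}) as sampled_per_min
- per_min($spans{type='dropped'}) as dropped_per_min
# the filter value comes from the table row the user selects
- where project_id = 1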
Binary operator precedence
The following list shows the precedence of binary operators in Uptrace, from highest to lowest:
- ^
- *, /, %
- +, -
- ==, !=, <=, <, >=, >
- and, unless
- or
Operators on the same precedence level are left-associative. For example, 2 * 3 % 2 is equivalent to (2 * 3) % 2.
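Parentheses override precedence as usual; for example, the cache hit rate expression from earlier needs them because / binds tighter than +:
query:
# without parentheses this would parse as (hits / hits) + misses
- hits / (hits + misses) as hit_rate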