Scaling Uptrace
This guide covers how to scale Uptrace for higher ingestion volumes. Work through these steps in order:
- Review scaling strategy — vertical first, then horizontal.
- Diagnose the bottleneck — buffer-full errors, part counts, CPU.
- Tune Uptrace processing pipelines — more threads and larger buffers.
- Optimize ClickHouse inserts — async inserts and batching.
- Add ClickHouse sharding — when a single node isn't enough.
- Consider Kafka — only after exhausting the other options.
For the full list of configuration options, see the configuration reference. Most changes require an Uptrace restart — see upgrading Uptrace for the procedure.
Scaling strategy
Vertical first
According to ClickHouse documentation, prefer vertical scaling over horizontal scaling:
Successful deployments with ClickHouse use servers with hundreds of cores, terabytes of RAM, and petabytes of disk space. Scaling vertically first provides cost efficiency, lower operational overhead, and better query performance.
Horizontal scaling
Uptrace instances are stateless and can be scaled horizontally. Use a load balancer to distribute traffic across instances — each instance should connect to the same PostgreSQL and ClickHouse databases.
Resource requirements
| Component | CPU | RAM | Notes |
|---|---|---|---|
| Uptrace instance | 2–4 | 4–8 GB | Per instance behind load balancer |
| PostgreSQL | 2–4 | 4–8 GB | SSD storage recommended |
| ClickHouse | Scale | Scale | Prefer large instances over many small |
| Redis | 1–2 | 2–4 GB | For caching and session storage |
Diagnosing bottlenecks
Before changing configuration, identify the actual bottleneck:
in-memory buffer is fullin Uptrace logs — ingestion is outpacing ClickHouse inserts. Increasemax_buffered_bytesor add more processing threads.- High active part count (
system.partstable) — too many small inserts are creating parts faster than merges can consolidate them. Enable async inserts or increasech_max_insert_size. - High merge activity (
system.mergestable) — related to high part count. Larger, less frequent inserts reduce merge pressure. - Insert latency — slow inserts cause buffers to fill up. Check ClickHouse disk I/O and CPU. SSDs are strongly recommended. Also ensure
async_insertis enabled andwait_for_async_insertis set appropriately — synchronous inserts add a flush round-trip to every batch. - CPU saturation on the Uptrace host — if all cores are busy, adding more
max_threadswon't help. Consider moving ClickHouse to a separate server or adding shards.
Uptrace processing pipelines
Uptrace processes telemetry through parallel pipelines — one each for spans, span links, logs, events, and metrics. Each pipeline has three settings:
| Setting | What it controls | Default |
|---|---|---|
max_threads | Processing goroutines | GOMAXPROCS (CPU cores) |
max_buffered_bytes | In-memory buffer cap | max_threads × 64 MiB |
ch_max_insert_size | Records per INSERT batch | 10000 (varies by pipeline) |
max_threads
The number of goroutines that process records and insert them into ClickHouse. Each thread pulls a batch from the in-memory buffer, processes it, and executes a ClickHouse INSERT.
Default: GOMAXPROCS (CPU cores). Increase when CPU is not the bottleneck but insert latency is — for example, when ClickHouse is on a separate server and network round-trips dominate.
spans:
max_threads: 16
Setting max_threads higher than the number of CPU cores is safe — the threads spend most of their time waiting on ClickHouse I/O.
max_buffered_bytes
Caps the total estimated bytes held in memory for a pipeline. When the buffer is full, incoming records are dropped and an in-memory buffer is full error is logged.
Default: max_threads × 64 MiB. With 16 threads the default buffer is 1 GiB.
spans:
max_buffered_bytes: 536870912 # 512 MiB
If you see buffer-full errors, ingestion is temporarily outpacing ClickHouse insert speed. Increase the value to absorb bursts without dropping data. As a rough guide, size the buffer for the longest realistic ClickHouse stall (a merge, a brief network hiccup, etc.).
ch_max_insert_size
The maximum number of records batched into a single ClickHouse INSERT.
spans:
ch_max_insert_size: 10000
The right value depends on whether ClickHouse async inserts are enabled:
async_insert = 1(recommended). ClickHouse accumulates inserts into its own buffer and flushes them as larger batches. You can use a smallerch_max_insert_size(5000–10000) — smaller Uptrace batches flush faster and reduce buffer pressure, while ClickHouse still creates efficiently sized data parts.async_insert = 0. Each Uptrace insert creates a separate data part. Larger values (10000–20000) are important to avoid too many small parts, which increases merge pressure and degrades query performance.
In both cases, very large batches increase per-insert latency and memory usage. Reduce the value if inserts are timing out.
Example: high-volume configuration
The values below assume async_insert = 1 and use more threads and larger buffers than the defaults. The combined buffer totals roughly 2.6 GiB — make sure the host has enough RAM, especially if Uptrace and ClickHouse share the same server.
spans:
max_threads: 16
ch_max_insert_size: 10000
max_buffered_bytes: 1073741824 # 1 GiB
span_links:
max_threads: 2
ch_max_insert_size: 5000
max_buffered_bytes: 134217728 # 128 MiB
logs:
max_threads: 12
ch_max_insert_size: 10000
max_buffered_bytes: 805306368 # 768 MiB
events:
max_threads: 4
ch_max_insert_size: 5000
max_buffered_bytes: 268435456 # 256 MiB
metrics:
max_threads: 8
ch_max_insert_size: 10000
max_buffered_bytes: 536870912 # 512 MiB
If you see in-memory buffer is full errors, increase max_buffered_bytes for the affected pipeline and add more RAM to the host.
ClickHouse insert tuning
The settings below are ClickHouse server settings. You can set them globally in ClickHouse (users.xml or profiles.xml) or pass them per-query through the Uptrace config:
ch_cluster:
shards:
- replicas:
- addr: localhost:9000
query_settings:
async_insert: 1
wait_for_async_insert: 1
async_insert_max_data_size: 200000000
async_insert_busy_timeout_ms: 5000
The async insert buffer flushes when either async_insert_max_data_size or async_insert_busy_timeout_ms is reached — whichever comes first.
async_insert
When enabled (1, recommended), ClickHouse buffers incoming inserts and flushes them as a single batch instead of writing each insert immediately. This reduces the number of data parts created, lowering merge pressure and improving throughput.
wait_for_async_insert
Controls whether ClickHouse acknowledges an insert immediately or waits until the buffer is flushed to disk.
wait_for_async_insert: 1. Uptrace waits for the flush to complete. Acknowledged data is guaranteed to be persisted, and insert errors are reported back to Uptrace.wait_for_async_insert: 0. The insert returns as soon as data enters the buffer. Faster, but a ClickHouse crash can lose buffered data, and insert errors are silently ignored.
async_insert_max_data_size
Maximum bytes in the async insert buffer before a flush is triggered.
Default: 100 MiB. Since Uptrace already batches on the client side via ch_max_insert_size, the async buffer is merging a few concurrent batches rather than thousands of tiny inserts. For high-volume deployments, increase to 200–500 MB to create fewer, larger data parts at the cost of more memory on the ClickHouse side.
async_insert_busy_timeout_ms
Maximum time (ms) before the async buffer is flushed, even if it hasn't reached async_insert_max_data_size.
Default: 1000 ms. Increase to 3000–5000 ms to give the buffer more time to fill, resulting in fewer and larger data parts. Lower values reduce the data-loss window when wait_for_async_insert: 0, but create more parts. High-volume deployments generally benefit from longer timeouts.
ClickHouse sharding
For ingestion volumes that exceed a single ClickHouse node, distribute data across multiple shards. Sharding is a Premium feature.
Each shard is an independent ClickHouse node (or replicated set) that stores a portion of the data. Uptrace distributes writes using weighted round-robin.
ch_cluster:
cluster: 'uptrace1'
replicated: true
distributed: true # Required for sharding (Premium only)
shards:
- weight: 1
replicas:
- addr: clickhouse-1a:9000
database: uptrace
- addr: clickhouse-1b:9000
database: uptrace
- weight: 1
replicas:
- addr: clickhouse-2a:9000
database: uptrace
- addr: clickhouse-2b:9000
database: uptrace
The weight parameter controls write distribution — a shard with weight 2 receives twice as many writes as weight 1. Use this when shards have different hardware capacity.
This configuration tells Uptrace which shards exist. You also need to configure the ClickHouse cluster itself — define the topology in ClickHouse's remote_servers so that ClickHouse knows about all shards and replicas. See the ClickHouse cluster deployment docs for details.
When adding shards to an existing deployment, you don't need to redistribute old data. New writes go to the new shards, and queries automatically span all shards via distributed tables.
Kafka-based ingestion
Kafka integration is available in the Enterprise tier only.
Kafka solves two problems:
- Buffering during maintenance. When ClickHouse is down or can't keep up, Kafka retains data until ClickHouse is ready. Without Kafka, data arriving during these windows is dropped.
- Efficient batching. Kafka enables larger, more efficient batches before writing to ClickHouse, improving insert throughput.
That said, Kafka adds operational complexity — evaluate whether the buffering benefits justify the additional infrastructure. Consider it only after exhausting the tuning options above.
Architecture
- Uptrace API receives data from clients, validates it, applies rate limits, and writes to Kafka.
- Uptrace Worker consumes raw data, processes it (sampling, service graph, etc.), and publishes to a second Kafka topic.
- ClickHouse consumes directly from that topic using the Kafka engine.
Each component scales and fails independently — the API accepts data at line rate, the worker processes at its own pace, and ClickHouse consumes with its own batching strategy.