Scaling Uptrace

This guide covers how to scale Uptrace for higher ingestion volumes. Work through these steps in order:

Review scaling strategy — vertical first, then horizontal.
Diagnose the bottleneck — buffer-full errors, part counts, CPU.
Tune Uptrace processing pipelines — more threads and larger buffers.
Optimize ClickHouse inserts — async inserts and batching.
Add ClickHouse sharding — when a single node isn't enough.
Consider Kafka — only after exhausting the other options.

For the full list of configuration options, see the configuration reference. Most changes require an Uptrace restart — see upgrading Uptrace for the procedure.

Scaling strategy

Vertical first

The ClickHouse documentation recommends scaling vertically before scaling horizontally:

Successful deployments with ClickHouse use servers with hundreds of cores, terabytes of RAM, and petabytes of disk space. Scaling vertically first provides cost efficiency, lower operational overhead, and better query performance.

Uptrace instances, by contrast, are stateless and can be scaled horizontally. Use a load balancer to distribute traffic across instances; each instance should connect to the same PostgreSQL and ClickHouse databases.

Resource requirements

Component	CPU (cores)	RAM	Notes
Uptrace instance	2–4	4–8 GB	Per instance behind load balancer
PostgreSQL	2–4	4–8 GB	SSD storage recommended
ClickHouse	Scale with load	Scale with load	Prefer a few large instances over many small ones
Redis	1–2	2–4 GB	For caching and session storage

Diagnosing bottlenecks

Before changing configuration, identify the actual bottleneck:

`in-memory buffer is full` in Uptrace logs: ingestion is outpacing ClickHouse inserts. Increase max_buffered_bytes or add more processing threads.
High active part count (system.parts table): too many small inserts are creating parts faster than merges can consolidate them. Enable async inserts or increase ch_max_insert_size.
High merge activity (system.merges table): usually a consequence of a high part count. Larger, less frequent inserts reduce merge pressure.
Insert latency: slow inserts cause buffers to fill up. Check ClickHouse disk I/O and CPU; SSDs are strongly recommended. Also ensure async_insert is enabled and wait_for_async_insert is set appropriately, since synchronous inserts add a flush round-trip to every batch.
CPU saturation on the Uptrace host: if all cores are busy, raising max_threads won't help. Consider moving ClickHouse to a separate server or adding shards.
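The part-count and merge checks above can be run over the ClickHouse HTTP interface. The sketch below is one way to do that with Python's standard library. It assumes the HTTP interface is reachable at http://localhost:8123/ without authentication (an assumption, adjust for your deployment), and the helper names `run_query` and `parse_tsv` are hypothetical.

```python
import urllib.request

# Active parts per table. A count that keeps growing suggests inserts are too small.
ACTIVE_PARTS_SQL = """
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10
"""

# Merges that are running right now, longest-running first.
RUNNING_MERGES_SQL = """
SELECT database, table, elapsed, progress
FROM system.merges
ORDER BY elapsed DESC
"""


def parse_tsv(body: str) -> list[list[str]]:
    """Split a TabSeparated response body into rows of string fields."""
    return [line.split("\t") for line in body.splitlines() if line]


def run_query(sql: str, url: str = "http://localhost:8123/") -> list[list[str]]:
    """POST a query to the ClickHouse HTTP interface (assumed URL, no auth)."""
    with urllib.request.urlopen(url, data=sql.encode("utf-8"), timeout=10) as resp:
        return parse_tsv(resp.read().decode("utf-8"))
```

Calling `run_query(ACTIVE_PARTS_SQL)` returns one row per table with the database, table name, and active part count as strings. Run it a few minutes apart to see whether part counts are climbing.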

Uptrace processing pipelines

Uptrace processes telemetry through parallel pipelines — one each for spans, span links, logs, events, and metrics. Each pipeline has three settings:

Setting	What it controls	Default
`max_threads`	Processing goroutines	`GOMAXPROCS` (CPU cores)
`max_buffered_bytes`	In-memory buffer cap	`max_threads × 64 MiB`
`ch_max_insert_size`	Records per INSERT batch	`10000` (varies by pipeline)

max_threads

The number of goroutines that process records and insert them into ClickHouse. Each thread pulls a batch from the in-memory buffer, processes it, and executes a ClickHouse INSERT.

Default: GOMAXPROCS (CPU cores). Increase when CPU is not the bottleneck but insert latency is — for example, when ClickHouse is on a separate server and network round-trips dominate.

yml

spans:
  max_threads: 16

Setting max_threads higher than the number of CPU cores is safe — the threads spend most of their time waiting on ClickHouse I/O.

max_buffered_bytes

Caps the total estimated bytes held in memory for a pipeline. When the buffer is full, incoming records are dropped and an `in-memory buffer is full` error is logged.

Default: max_threads × 64 MiB. With 16 threads the default buffer is 1 GiB.

yml

spans:
  max_buffered_bytes: 536870912 # 512 MiB

If you see buffer-full errors, ingestion is temporarily outpacing ClickHouse insert speed. Increase the value to absorb bursts without dropping data. As a rough guide, size the buffer for the longest realistic ClickHouse stall (a merge, a brief network hiccup, etc.).
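As a rough illustration of that sizing rule, the sketch below compares the default buffer with the amount of data that would accumulate during a stall. The ingest rate and stall duration are hypothetical numbers, not measurements, and the calculation treats the rate as the estimated in-memory size of incoming records.

```python
MIB = 1024 * 1024


def default_buffer_bytes(max_threads: int) -> int:
    """Default cap described above: max_threads * 64 MiB."""
    return max_threads * 64 * MIB


def buffer_for_stall(ingest_bytes_per_sec: int, stall_seconds: float) -> int:
    """Bytes that pile up if ClickHouse accepts nothing for stall_seconds."""
    return int(ingest_bytes_per_sec * stall_seconds)


# Hypothetical workload: 50 MiB/s of span data and a 30-second stall.
needed = buffer_for_stall(50 * MIB, 30)
print(needed // MIB)                      # 1500 (MiB)
print(default_buffer_bytes(16) // MIB)    # 1024 (MiB), i.e. 1 GiB
print(needed > default_buffer_bytes(16))  # True: the default would overflow
```

In this made-up case, max_buffered_bytes would need to be at least 1500 MiB for the spans pipeline to ride out the stall without dropping records.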

ch_max_insert_size

The maximum number of records batched into a single ClickHouse INSERT.

yml

spans:
  ch_max_insert_size: 10000

The right value depends on whether ClickHouse async inserts are enabled:

async_insert = 1 (recommended). ClickHouse accumulates inserts into its own buffer and flushes them as larger batches. You can use a smaller ch_max_insert_size (5000–10000) — smaller Uptrace batches flush faster and reduce buffer pressure, while ClickHouse still creates efficiently sized data parts.
async_insert = 0. Each Uptrace insert creates a separate data part. Use larger values (10000–20000) to avoid creating too many small parts, which increase merge pressure and degrade query performance.

In both cases, very large batches increase per-insert latency and memory usage. Reduce the value if inserts are timing out.
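To make the trade-off concrete, the sketch below estimates how many INSERT statements per second a pipeline issues at a given batch size. With async_insert = 0, each of those inserts becomes its own data part. The ingest rate is a hypothetical figure, and the result is a lower bound because it assumes every batch is completely full.

```python
import math


def min_inserts_per_second(records_per_sec: int, ch_max_insert_size: int) -> int:
    """Fewest INSERTs per second needed if every batch is filled to the limit."""
    return math.ceil(records_per_sec / ch_max_insert_size)


# Hypothetical pipeline receiving 200,000 records per second.
print(min_inserts_per_second(200_000, 5_000))   # 40
print(min_inserts_per_second(200_000, 10_000))  # 20
print(min_inserts_per_second(200_000, 20_000))  # 10
```

Doubling the batch size halves the number of inserts, and with synchronous inserts it also halves the number of new parts that merges have to consolidate.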

Example: high-volume configuration

The values below assume async_insert = 1 and use more threads and larger buffers than the defaults. The combined buffer totals roughly 2.6 GiB — make sure the host has enough RAM, especially if Uptrace and ClickHouse share the same server.

yml

spans:
  max_threads: 16
  ch_max_insert_size: 10000
  max_buffered_bytes: 1073741824 # 1 GiB

span_links:
  max_threads: 2
  ch_max_insert_size: 5000
  max_buffered_bytes: 134217728 # 128 MiB

logs:
  max_threads: 12
  ch_max_insert_size: 10000
  max_buffered_bytes: 805306368 # 768 MiB

events:
  max_threads: 4
  ch_max_insert_size: 5000
  max_buffered_bytes: 268435456 # 256 MiB

metrics:
  max_threads: 8
  ch_max_insert_size: 10000
  max_buffered_bytes: 536870912 # 512 MiB

If you see `in-memory buffer is full` errors, increase max_buffered_bytes for the affected pipeline and add more RAM to the host.
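The combined buffer total quoted for the example configuration can be checked with a few lines of arithmetic. The values below are copied from the max_buffered_bytes settings in that example.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

# max_buffered_bytes per pipeline, copied from the example configuration.
buffers = {
    "spans": 1073741824,      # 1 GiB
    "span_links": 134217728,  # 128 MiB
    "logs": 805306368,        # 768 MiB
    "events": 268435456,      # 256 MiB
    "metrics": 536870912,     # 512 MiB
}

total = sum(buffers.values())
print(total // MIB)  # 2688 (MiB)
print(total / GIB)   # 2.625 (GiB)
```

This is the worst case with every pipeline buffer full at once. Leave headroom on top of it for the Uptrace process itself and for anything else running on the host.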

ClickHouse insert tuning

The settings below are ClickHouse query-level settings. You can set them globally in a ClickHouse settings profile (users.xml or a file under users.d/) or pass them per-query through the Uptrace config:

yml

ch_cluster:
  shards:
    - replicas:
        - addr: localhost:9000
          query_settings:
            async_insert: 1
            wait_for_async_insert: 1
            async_insert_max_data_size: 200000000
            async_insert_busy_timeout_ms: 5000

The async insert buffer flushes when either async_insert_max_data_size or async_insert_busy_timeout_ms is reached — whichever comes first.
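The "whichever comes first" rule can be sketched as a small calculation. This is a simplified model that assumes a single buffer filling at a constant rate. It is not how ClickHouse implements flushing internally, and the inflow rates are hypothetical.

```python
def first_flush_trigger(
    bytes_per_sec: float, max_data_size: int, busy_timeout_ms: int
) -> tuple[str, float]:
    """Return which limit is reached first and after how many milliseconds."""
    if bytes_per_sec <= 0:
        # Nothing is arriving, so only the timeout can fire.
        return ("timeout", float(busy_timeout_ms))
    ms_to_fill = max_data_size / bytes_per_sec * 1000
    if ms_to_fill < busy_timeout_ms:
        return ("size", ms_to_fill)
    return ("timeout", float(busy_timeout_ms))


# Limits from the config above: 200,000,000 bytes and 5000 ms.
print(first_flush_trigger(100_000_000, 200_000_000, 5000))  # ('size', 2000.0)
print(first_flush_trigger(10_000_000, 200_000_000, 5000))   # ('timeout', 5000.0)
```

At the higher rate the size limit fires every two seconds, so raising the timeout further would have no effect. At the lower rate the timeout decides how large each flushed part is.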

async_insert

When enabled (1, recommended), ClickHouse buffers incoming inserts and flushes them as a single batch instead of writing each insert immediately. This reduces the number of data parts created, lowering merge pressure and improving throughput.

wait_for_async_insert

Controls whether ClickHouse acknowledges an insert immediately or waits until the buffer is flushed to disk.

wait_for_async_insert: 1. Uptrace waits for the flush to complete. Acknowledged data is guaranteed to be persisted, and insert errors are reported back to Uptrace.
wait_for_async_insert: 0. The insert returns as soon as data enters the buffer. Faster, but a ClickHouse crash can lose buffered data, and insert errors are silently ignored.

async_insert_max_data_size

Maximum bytes in the async insert buffer before a flush is triggered.

Default: 100 MiB. Since Uptrace already batches on the client side via ch_max_insert_size, the async buffer is merging a few concurrent batches rather than thousands of tiny inserts. For high-volume deployments, increase to 200–500 MB to create fewer, larger data parts at the cost of more memory on the ClickHouse side.

async_insert_busy_timeout_ms

Maximum time (ms) before the async buffer is flushed, even if it hasn't reached async_insert_max_data_size.

Default: 1000 ms. Increase to 3000–5000 ms to give the buffer more time to fill, resulting in fewer and larger data parts. Lower values reduce the data-loss window when wait_for_async_insert: 0, but create more parts. High-volume deployments generally benefit from longer timeouts.

ClickHouse sharding

For ingestion volumes that exceed a single ClickHouse node, distribute data across multiple shards. Sharding is a Premium feature.

Each shard is an independent ClickHouse node (or replicated set) that stores a portion of the data. Uptrace distributes writes using weighted round-robin.

yml

ch_cluster:
  cluster: 'uptrace1'
  replicated: true
  distributed: true # Required for sharding (Premium only)

  shards:
    - weight: 1
      replicas:
        - addr: clickhouse-1a:9000
          database: uptrace
        - addr: clickhouse-1b:9000
          database: uptrace

    - weight: 1
      replicas:
        - addr: clickhouse-2a:9000
          database: uptrace
        - addr: clickhouse-2b:9000
          database: uptrace

The weight parameter controls write distribution — a shard with weight 2 receives twice as many writes as weight 1. Use this when shards have different hardware capacity.
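The effect of weights can be illustrated with a naive weighted round-robin. This is only a sketch of the general technique, not Uptrace's actual scheduler, and the shard names are hypothetical.

```python
from itertools import cycle, islice
from typing import Iterator


def expected_writes(weights: dict[str, int], total_writes: int) -> dict[str, float]:
    """Split total_writes across shards in proportion to their weights."""
    total_weight = sum(weights.values())
    return {shard: total_writes * w / total_weight for shard, w in weights.items()}


def naive_schedule(weights: dict[str, int]) -> Iterator[str]:
    """Endlessly yield shard names, repeating each one `weight` times per round."""
    one_round = [shard for shard, w in weights.items() for _ in range(w)]
    return cycle(one_round)


weights = {"shard-1": 2, "shard-2": 1}
print(expected_writes(weights, 3000))
# {'shard-1': 2000.0, 'shard-2': 1000.0}
print(list(islice(naive_schedule(weights), 6)))
# ['shard-1', 'shard-1', 'shard-2', 'shard-1', 'shard-1', 'shard-2']
```

Real schedulers usually interleave shards more evenly within a round, but the long-run proportions are the same: the weight-2 shard receives two thirds of the writes.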

This configuration tells Uptrace which shards exist. You also need to configure the ClickHouse cluster itself — define the topology in ClickHouse's remote_servers so that ClickHouse knows about all shards and replicas. See the ClickHouse cluster deployment docs for details.

When adding shards to an existing deployment, you don't need to redistribute old data. New writes are spread across all configured shards, including the new ones, and queries automatically span all shards via distributed tables.

Kafka-based ingestion

Kafka integration is available in the Enterprise tier only.

Kafka solves two problems:

Buffering during maintenance. When ClickHouse is down or can't keep up, Kafka retains data until ClickHouse is ready. Without Kafka, data arriving during these windows is dropped.
Efficient batching. Kafka enables larger, more efficient batches before writing to ClickHouse, improving insert throughput.

That said, Kafka adds operational complexity — evaluate whether the buffering benefits justify the additional infrastructure. Consider it only after exhausting the tuning options above.

Architecture

Uptrace API receives data from clients, validates it, applies rate limits, and writes to Kafka.
Uptrace Worker consumes raw data, processes it (sampling, service graph, etc.), and publishes to a second Kafka topic.
ClickHouse consumes directly from that topic using the Kafka engine.

Each component scales and fails independently — the API accepts data at line rate, the worker processes at its own pace, and ClickHouse consumes with its own batching strategy.

Enabling Kafka

Kafka is enabled simply by configuring brokers — there is no separate on/off switch. As soon as kafka.brokers is non-empty, Uptrace switches the telemetry pipeline from direct ClickHouse inserts to Kafka. There are two related blocks:

kafka — the top-level transport used by the Uptrace API (producer) and the worker (consumer).
ch_cluster.kafka — the brokers the ClickHouse Kafka engine tables read from. In a typical deployment these point at the same brokers.

yml

##
## Top-level Kafka transport. The API publishes telemetry here and the
## worker consumes it. Setting at least one broker switches the pipeline
## from direct ClickHouse inserts to Kafka.
##
kafka:
  brokers:
    - 'kafka-1:9092'
    - 'kafka-2:9092'

  # Optional SASL authentication.
  # Mechanism: PLAIN, SCRAM-SHA-256, or SCRAM-SHA-512.
  #sasl: SCRAM-SHA-512
  #user: uptrace
  #password: <secret>

  # Optional TLS for broker connections.
  #tls:
  #  ca_file: /etc/uptrace/kafka-ca.pem

  # Producer tuning.
  max_message_bytes: 10485760 # 10 MiB, must fit your largest batch
  linger: 10ms # how long the producer waits to fill a batch

  # Emit OpenTelemetry traces/metrics for the Kafka client itself.
  otel_tracing: false
  otel_metrics: false

##
## ClickHouse Kafka engine tables read from these brokers. Usually the
## same brokers as above. SASL/TLS for the engine is configured in
## ClickHouse, not here — only the broker list is used.
##
ch_cluster:
  kafka:
    brokers:
      - 'kafka-1:9092'
      - 'kafka-2:9092'

When kafka.brokers is empty (the default), Uptrace runs in-process and writes directly to ClickHouse — no worker and no Kafka required.

If you deploy with Ansible, adding a kafka group to the inventory renders both of these blocks and starts the worker automatically.

Running the worker

With Kafka enabled, the API only validates and publishes data — a separate worker process drains the Kafka topics and inserts into ClickHouse and PostgreSQL. Run it alongside the main uptrace process:

shell

uptrace --config=/etc/uptrace/config.yml worker

The worker requires both Kafka brokers and the licensed Kafka feature (Enterprise tier). It exits on startup if either is missing:

`worker requires Kafka brokers: set kafka.brokers in the config` means brokers are not configured.
`worker requires the Kafka feature, which is not enabled on this instance` means the license does not include Kafka.

Run as many worker instances as needed to keep up with ingestion; they share the consumer group and partition the load automatically.

Tuning the ClickHouse Kafka engine

Per-table consumer behavior is tuned under ch_schema. Each telemetry table (spans_data, logs_data, events_data, datapoints, …) accepts the same Kafka engine settings, so you can tune high-volume tables independently:

yml

ch_schema:
  spans_data:
    # Rows per block read from Kafka before flushing to the table.
    kafka_max_block_size: 1048576
    # Max messages fetched per poll.
    kafka_poll_max_batch_size: 65536
    # Timeout for a single Kafka poll.
    kafka_poll_timeout: 1s
    # How often buffered data is flushed, even if a block isn't full.
    kafka_flush_interval: 5s
    # One thread per consumer (default 1).
    kafka_thread_per_consumer: 1
    # Malformed-message handling: default, stream, or dead_letter_queue.
    kafka_handle_error_mode: stream
    # Schema-incompatible messages tolerated per block before erroring.
    kafka_skip_broken_messages: 0
    # Per-partition consumer lag (messages) above which the pipeline is
    # treated as stale, so alert monitors defer the current evaluation
    # until the consumer catches up (avoids alerting on incomplete data).
    kafka_lag_threshold: 100

Only kafka_thread_per_consumer (1), kafka_handle_error_mode (stream), and kafka_lag_threshold (100) have defaults set by Uptrace; the remaining settings fall back to the ClickHouse Kafka engine defaults when omitted. Raise kafka_max_block_size and kafka_flush_interval on high-volume tables to create fewer, larger data parts, at the cost of higher consumer memory and flush latency.
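The kafka_lag_threshold behavior described in the config comments can be read as a simple per-partition check. The function below is an interpretation of that description, not Uptrace's code, and the partition lag values are hypothetical.

```python
def pipeline_is_stale(partition_lags: dict[int, int], kafka_lag_threshold: int = 100) -> bool:
    """True if any partition's consumer lag (in messages) is above the threshold."""
    return any(lag > kafka_lag_threshold for lag in partition_lags.values())


# Hypothetical lag per partition, keyed by partition number.
print(pipeline_is_stale({0: 12, 1: 340, 2: 5}))  # True: partition 1 is behind
print(pipeline_is_stale({0: 12, 1: 100}))        # False: 100 is not above 100
```

Under this reading, a single lagging partition is enough for alert monitors to defer evaluation, so a higher threshold suits tables whose consumers routinely run a little behind.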
