How to Monitor RabbitMQ

Alexandr Bandurchin
November 18, 2025
9 min read

A queue quietly fills up overnight. Memory hits the configured watermark and RabbitMQ blocks all publishers. Your entire message pipeline freezes, and you discover the problem when users start complaining.

This scenario repeats across thousands of production systems because teams don't monitor RabbitMQ properly. The broker exposes comprehensive metrics, but most engineers don't know which ones predict failures or how to track them.

This guide shows you exactly how to monitor RabbitMQ: which metrics matter, which tools work best for different scenarios, and how to configure monitoring that catches issues before they impact production.

Effective RabbitMQ performance monitoring requires tracking the right metrics and setting up alerts that fire before problems cascade.

Why You Need RabbitMQ Monitoring

RabbitMQ routes messages between your services. When it fails, business transactions stop processing:

  • E-commerce orders queue up while customers wait for confirmation
  • Data pipelines stall and analytics jobs miss deadlines
  • API responses timeout when async operations don't complete
  • Cascading failures spread as downstream services back up

The problem: standard infrastructure monitoring doesn't catch RabbitMQ issues. CPU and memory look fine. The process is running. But messages are silently accumulating in queues.

You need monitoring that tracks message broker behavior: queue depths, consumer lag, acknowledgment rates, and RabbitMQ-specific resource alarms.

Here's how to set it up.

What to Monitor

RabbitMQ exposes hundreds of statistics through its management API and Prometheus plugin. Track these core metrics to catch problems before they cause outages.

Queue Depth

Track: rabbitmq.queue.messages (ready + unacknowledged)
Why: Growing queues mean consumers can't keep up with message flow

When queue depth increases steadily, messages are arriving faster than your consumers process them. Left unchecked, this fills memory and eventually crashes the broker.

Set alerts:

  • Warning when depth grows >20% in 15 minutes
  • Critical when depth exceeds 10,000 (adjust based on your workload)
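
As a rough sketch, the growth-based warning can be written as a Prometheus rule (metric names come from the rabbitmq_prometheus plugin; the 1,000-message floor is an arbitrary filter to ignore small queues):

yaml
- alert: RabbitMQQueueGrowing
  expr: |
    rabbitmq_queue_messages > 1.2 * (rabbitmq_queue_messages offset 15m)
      and rabbitmq_queue_messages > 1000
  for: 5m
  annotations:
    summary: "Queue {{ $labels.queue }} grew more than 20% in 15 minutes"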

What to check when alert fires:

  • Consumer count (zero consumers = immediate problem)
  • Consumer application logs (errors blocking processing?)
  • Message consume rate vs publish rate

Message Flow Rates

Track:

  • rabbitmq.queue.messages.publish_rate
  • rabbitmq.queue.messages.deliver_rate
  • rabbitmq.queue.messages.ack_rate

Why: Rate imbalance reveals problems before queues explode

Healthy state: publish rate ≈ ack rate (temporary bursts are normal)
Problem state: publish rate consistently higher than ack rate

Example:

text
Publish: 1,000 msg/sec
Ack: 750 msg/sec
Gap: 250 msg/sec accumulating

After 1 hour → 900k messages backed up
After 4 hours → 3.6M messages in queue

This metric gives you early warning to scale consumers before disaster.
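
To spot-check the imbalance without a dashboard, the management API exposes per-queue rates (a sketch using the monitoring credentials created later in this guide; jq required):

shell
# Publish vs ack rate for every queue, via the management API
curl -s -u monitoring_user:$PASSWORD http://localhost:15672/api/queues \
  | jq -r '.[] | [.name,
                  (.message_stats.publish_details.rate // 0),
                  (.message_stats.ack_details.rate // 0)] | @tsv'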

Unacknowledged Messages

Track: rabbitmq.queue.messages.unacknowledged
Why: Detects stuck consumers before timeouts occur

Unacked messages are delivered but not confirmed. Growing unacked count means:

  • Consumer received message but processing is stuck
  • Consumer crashed without closing connection
  • Network preventing ack delivery

Alert threshold: unacked > (prefetch × consumers × 1.5)

Impact: Unacked messages consume memory and get redelivered if consumer disconnects, creating duplicate processing.
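
The threshold above translates into a Prometheus rule along these lines (a sketch that assumes a prefetch of 50 and the plugin's rabbitmq_queue_messages_unacked gauge):

yaml
- alert: RabbitMQUnackedHigh
  # 1.5 x prefetch (assumed 50) x consumer count
  expr: rabbitmq_queue_messages_unacked > 1.5 * 50 * rabbitmq_queue_consumers
  for: 10m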

Consumer Count

Track: rabbitmq.queue.consumers
Why: Zero consumers = zero processing

This seems obvious but catches a surprising number of production issues. Consumer pods restart, fail to reconnect, and suddenly critical queues have no consumers. Messages pile up for hours before anyone notices.

Alert immediately: Consumer count = 0 on production queues
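
A quick way to spot affected queues from the CLI:

shell
# Print every queue that currently has zero consumers
rabbitmqctl list_queues -q name consumers | awk '$2 == 0 {print $1}'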

Memory Usage

Track: rabbitmq.node.memory.used
Why: Memory alarm blocks all publishers

When memory hits the configured watermark (default 40% of RAM, often increased to 60% in production), RabbitMQ blocks connections that publish messages. Your publishers timeout and the entire pipeline freezes.

What consumes memory:

  • Messages in queues (non-lazy queues keep everything in RAM)
  • Connections and channels (~100KB each)
  • Queue/exchange metadata

Set alerts:

  • Warning at 50% of watermark (20-30% of total RAM)
  • Critical at 75% of watermark (30-45% of total RAM)
  • Emergency if memory alarm triggered

Prevent issues:

  • Enable lazy queues for large message volumes
  • Use connection pooling
  • Add queue length limits
  • Increase watermark or node RAM if consistently high
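
For the last point, the watermark can be inspected and raised from the CLI (a runtime change; persist it with vm_memory_high_watermark.relative in rabbitmq.conf):

shell
# See what is using memory right now
rabbitmq-diagnostics memory_breakdown

# Raise the high-memory watermark to 60% of RAM (not persisted across restarts)
rabbitmqctl set_vm_memory_high_watermark 0.6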

Disk Space

Track: rabbitmq.node.disk.free
Why: Disk alarm blocks entire cluster

Default alarm triggers at 50MB free disk space. When any node hits this limit, ALL nodes stop accepting messages cluster-wide.

Alert thresholds:

  • Warning: <10GB
  • Critical: <1GB
  • Monitor every node
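
To keep a comfortable margin above the 50MB default, raise the alarm threshold (runtime change shown; persist it with disk_free_limit.absolute in rabbitmq.conf):

shell
# Trigger the disk alarm at 10GB free instead of the 50MB default
rabbitmqctl set_disk_free_limit "10GB"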

File Descriptors

Track: rabbitmq.node.fd.used
Why: Hitting FD limit blocks new connections

Linux limits file descriptors (default: 1024, increase via ulimit). Each connection and channel consumes one FD.

Alert thresholds:

  • Warning: >70% of limit
  • Critical: >85% of limit

Common cause: Applications creating connections per request instead of pooling.

Fix: Increase ulimit AND implement connection pooling.
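
To see current usage against the limit on every node, the management API reports fd_used and fd_total (jq required):

shell
# File descriptor usage vs limit per node
curl -s -u monitoring_user:$PASSWORD http://localhost:15672/api/nodes \
  | jq -r '.[] | "\(.name): \(.fd_used)/\(.fd_total) file descriptors"'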

Connection Patterns

Track: rabbitmq.connection.count (trend over time)
Why: Connection churn wastes resources

Stable systems maintain steady connection counts. Frequent spikes/drops indicate:

  • Missing connection pooling
  • Connection leaks
  • Network instability

Connections are expensive to create (TCP + AMQP negotiation). High churn wastes broker CPU.
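
Recent RabbitMQ versions report churn directly in the management API overview, which makes leaks easy to spot (a sketch; a stable system shows rates near zero):

shell
# Connection open/close churn rates
curl -s -u monitoring_user:$PASSWORD http://localhost:15672/api/overview \
  | jq '.churn_rates | {connection_created_details, connection_closed_details}'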

Message Redeliveries

Track: Redelivery count per message
Why: Catches systematic consumer failures

High redelivery rates reveal this pattern:

text
Message delivered → Consumer crashes → Redelivered
→ Consumer crashes again → Redelivered again
→ Infinite redelivery loop

Solution: Configure dead letter queues to catch repeatedly failing messages.
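
A minimal sketch of that setup with illustrative names (a dlx exchange, a dead-letters queue, queues prefixed prod-); note that dead-lettering only breaks the loop if consumers reject failing messages with requeue=false:

shell
# Declare a fanout dead letter exchange and a queue bound to it
rabbitmqadmin declare exchange name=dlx type=fanout
rabbitmqadmin declare queue name=dead-letters
rabbitmqadmin declare binding source=dlx destination=dead-letters destination_type=queue

# Apply a dead-letter-exchange policy to the production queues
rabbitmqctl set_policy dlq "^prod-.*" \
  '{"dead-letter-exchange":"dlx"}' --apply-to queues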

How to Choose a Monitoring Tool

Pick based on your existing infrastructure and team skills.

RabbitMQ Management Plugin

Built-in web UI at port 15672.

Use when:

  • Running development/test environments
  • Need quick queue inspection for debugging
  • Have <5 RabbitMQ nodes

Skip for production because:

  • Only stores recent metrics (hours)
  • High memory overhead
  • No long-term trending
  • Limited alerting

Setup:

shell
rabbitmq-plugins enable rabbitmq_management
# Access http://localhost:15672 (guest/guest)

Prometheus + Grafana

Most popular open-source option.

Use when:

  • Already running Prometheus infrastructure
  • Need full control over data retention
  • Have DevOps team to maintain it
  • Want community dashboard templates

Requires:

  • Prometheus (time-series DB)
  • Grafana (visualization)
  • Alertmanager (notifications)

Setup:

shell
# Enable RabbitMQ plugin
rabbitmq-plugins enable rabbitmq_prometheus

yaml
# Add to prometheus.yml
scrape_configs:
  - job_name: rabbitmq
    static_configs:
      - targets: ['rabbitmq:15692']

Cost: Free (self-host) or $50-200/month (Grafana Cloud)

Best for: Teams comfortable managing Prometheus who want proven, mature tooling.

OpenTelemetry

Vendor-neutral standard for traces, metrics, and logs.

Use when:

  • Already using OpenTelemetry for application tracing
  • Want RabbitMQ metrics, traces, and logs correlated in one backend
  • Standardizing on vendor-neutral instrumentation

Benefit: Single agent collects everything (app traces + RabbitMQ metrics + logs).

Example: Track a message through your application AND see RabbitMQ queue behavior in the same trace. When a message is slow to process, see if the delay is in your consumer code or in RabbitMQ delivery.

OpenTelemetry works with multiple backends. For a complete OpenTelemetry APM solution that includes RabbitMQ monitoring with traces, metrics, and logs in one UI, Uptrace provides free open-source deployment with automatic dashboard generation.

For full configuration, check the OpenTelemetry RabbitMQ setup guide.

Cost: Free with open-source backends like Uptrace

Best for: Microservices architectures with distributed tracing already in place.

Datadog

Commercial SaaS with easiest setup.

Use when:

  • Have monitoring budget ($500+/month)
  • Want zero maintenance
  • Need commercial support
  • Prefer paying for simplicity

Setup: Install agent, add RabbitMQ config, done.

Includes:

  • Pre-built dashboards
  • AI anomaly detection
  • Incident management
  • 700+ other integrations

Cost: $15-31/host/month + data ingestion fees

Best for: Enterprises prioritizing ease over cost.

AWS CloudWatch

AWS-native monitoring.

Use when:

  • Running exclusively on AWS
  • Using Amazon MQ (managed RabbitMQ)
  • Already paying for CloudWatch

Limitations:

  • Fewer RabbitMQ-specific metrics
  • Weak visualization vs Grafana
  • AWS-only

Best for: AWS-heavy infrastructure with Amazon MQ.

Quick Decision Tree

text
Have monitoring budget >$500/month?
└─ Yes → Datadog (easiest)

Already running Prometheus?
└─ Yes → Add rabbitmq_prometheus plugin

Using OpenTelemetry for tracing?
└─ Yes → Add RabbitMQ receiver

None of above?
└─ Start with Prometheus + Grafana (most common path)

How to Set Up Monitoring

Follow these steps regardless of which tool you chose.

Step 1: Create Monitoring User

Don't use default guest account in production.

shell
# Generate secure password
PASSWORD=$(openssl rand -base64 32)

# Create user
rabbitmqctl add_user monitoring_user $PASSWORD

# Grant read-only monitoring access
rabbitmqctl set_user_tags monitoring_user monitoring

# Restrict to read-only access (no configure or write permissions)
rabbitmqctl set_permissions -p / monitoring_user "^$" "^$" ".*"

Save the password in your secrets manager.

Step 2: Enable Metric Collection

For Prometheus:

shell
# Enable plugin
rabbitmq-plugins enable rabbitmq_prometheus

# Verify metrics endpoint
curl http://localhost:15692/metrics

Configure Prometheus to scrape:

yaml
# prometheus.yml
scrape_configs:
  - job_name: rabbitmq
    scrape_interval: 15s
    static_configs:
      - targets:
          - rabbitmq-node1:15692
          - rabbitmq-node2:15692
    basic_auth:
      username: monitoring_user
      password: <from-secrets>

For OpenTelemetry:

yaml
# otel-collector-config.yaml
receivers:
  rabbitmq:
    endpoint: http://rabbitmq:15672
    username: monitoring_user
    password: <from-secrets>
    collection_interval: 30s

exporters:
  otlp:
    endpoint: backend:4317

service:
  pipelines:
    metrics:
      receivers: [rabbitmq]
      exporters: [otlp]

Step 3: Build Dashboards

Essential dashboard panels:

  1. Node health - Memory, disk, CPU per node
  2. Queue overview - Top 10 queues by message count
  3. Message rates - Publish vs ack rates
  4. Consumer health - Consumer count per queue
  5. Connections - Active connections and channels
  6. Resource limits - File descriptors, Erlang processes

For Grafana:
Import dashboard ID 10991 from grafana.com or build custom panels.

For OpenTelemetry backends:
Dashboards auto-generate from metric metadata.

Step 4: Configure Alerts

Set up these critical alerts first:

yaml
# Memory alarm imminent (adjust based on your watermark setting)
- alert: RabbitMQMemoryHigh
  expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit > 0.75
  for: 5m
  annotations:
    summary: "RabbitMQ memory at {{ $value | humanizePercentage }}"
    description: "Node {{ $labels.node }} memory usage high - publishers will be blocked at watermark"

# Disk space critical
- alert: RabbitMQDiskLow
  expr: rabbitmq_node_disk_free < 1073741824  # 1GB
  for: 5m

# Queue has no consumers
- alert: RabbitMQNoConsumers
  expr: |
    rabbitmq_queue_consumers{queue=~"prod-.*"} == 0
  for: 2m

# Queue backlog building
- alert: RabbitMQQueueBacklog
  expr: rabbitmq_queue_messages > 10000
  for: 10m

# Node unreachable
- alert: RabbitMQNodeDown
  expr: up{job="rabbitmq"} == 0
  for: 1m

Route to PagerDuty, OpsGenie, or Slack.

Step 5: Test Everything

Verify alerts work before depending on them:

shell
# Test queue backlog alert
rabbitmqadmin declare queue name=test-alert-queue
for i in {1..15000}; do
  rabbitmqadmin publish routing_key=test-alert-queue payload="test"
done

# Verify alert fires
# Clean up
rabbitmqadmin delete queue name=test-alert-queue

Test each alert type monthly to ensure proper routing.

How to Fix Common Issues

Queue Backlog

Symptoms: Queue depth growing, publish rate > ack rate

Diagnose:

shell
rabbitmqctl list_queues name messages consumers

Look for high message count with low consumer count.

Fix:

  1. Scale consumer applications (add replicas/instances)
  2. Check consumer logs for errors blocking processing
  3. Verify consumers are actually running
  4. Consider lazy queues if messages are large

Memory Alarm

Symptoms: Memory exceeds the configured watermark (typically 40-60% of RAM), publishers blocked, "blocking connection" messages in logs

Diagnose:

shell
rabbitmqctl status | grep mem
rabbitmq-diagnostics memory_breakdown

Identify what's consuming memory (queues, connections, buffers).

Fix immediately:

  1. Scale consumers to drain queues
  2. Enable lazy queues (messages are stored on disk instead of RAM)
  3. Check for connection leaks
  4. Increase memory watermark or node RAM if consistently high

Prevent recurrence:

shell
# Enable lazy mode for heavy queues
rabbitmqctl set_policy lazy "^heavy-.*" \
  '{"queue-mode":"lazy"}' --apply-to queues

# Set queue length limits
rabbitmqctl set_policy limit "^limited-.*" \
  '{"max-length":50000}' --apply-to queues

Consumer Lag

Symptoms: Unacked messages high, processing slow

Diagnose:

shell
rabbitmqctl list_queues name messages_unacknowledged consumers

Fix:

  1. Check consumer application health and logs
  2. Reduce prefetch if consumers are grabbing too many messages at once
  3. Scale consumers horizontally
  4. Implement consumer timeouts
  5. Check database/API performance if consumer calls external services

Connection Exhausted

Symptoms: File descriptor limit hit, new connections refused

Diagnose:

shell
rabbitmqctl status | grep file_descriptors
lsof -p $(pidof beam.smp) | wc -l

Fix:

  1. Increase ulimit: ulimit -n 65536
  2. Audit applications for connection pooling
  3. Find and fix connection leaks
  4. Set per-user connection limits

In /etc/security/limits.conf:

text
rabbitmq soft nofile 65536
rabbitmq hard nofile 65536
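
If RabbitMQ runs as a systemd service, limits.conf is not applied to it; set the limit in a unit override instead:

shell
# systemd units ignore /etc/security/limits.conf - use an override
sudo systemctl edit rabbitmq-server
# add in the editor:
#   [Service]
#   LimitNOFILE=65536
sudo systemctl restart rabbitmq-server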

No Consumers

Symptoms: Queue filling up, consumer count = 0

Fix:

  1. Check consumer application status (crashed? scaled to zero?)
  2. Verify network connectivity between consumer and broker
  3. Check consumer application logs for connection errors
  4. Restart consumer applications if needed

Cluster Issues

Symptoms: Nodes report different cluster state, network partitions detected

Diagnose:

shell
rabbitmqctl cluster_status

Fix:

  1. Restart affected nodes
  2. Check network reliability between nodes
  3. Verify the cluster partition handling policy (see the config sketch below)
  4. Review firewall rules (ports 4369, 25672, 35672-35682)
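
One common partition handling setting pauses the minority side so only one side keeps accepting publishes (set in rabbitmq.conf):

text
# rabbitmq.conf
cluster_partition_handling = pause_minority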