How to Monitor RabbitMQ
A queue quietly fills up overnight. Memory hits the configured watermark and RabbitMQ blocks all publishers. Your entire message pipeline freezes, and you discover the problem when users start complaining.
This scenario repeats across thousands of production systems because teams don't monitor RabbitMQ properly. The broker exposes comprehensive metrics, but most engineers don't know which ones predict failures or how to track them.
This guide shows you exactly how to monitor RabbitMQ: which metrics matter, which tools work best for different scenarios, and how to configure monitoring that catches issues before they impact production.
Effective RabbitMQ performance monitoring requires tracking the right metrics and setting up alerts that fire before problems cascade.
Why You Need RabbitMQ Monitoring
RabbitMQ routes messages between your services. When it fails, business transactions stop processing:
- E-commerce orders queue up while customers wait for confirmation
- Data pipelines stall and analytics jobs miss deadlines
- API responses time out when async operations don't complete
- Cascading failures spread as downstream services back up
The problem: standard infrastructure monitoring doesn't catch RabbitMQ issues. CPU and memory look fine. The process is running. But messages are silently accumulating in queues.
You need monitoring that tracks message broker behavior: queue depths, consumer lag, acknowledgment rates, and RabbitMQ-specific resource alarms.
Here's how to set it up.
What to Monitor
RabbitMQ exposes hundreds of statistics through its management API and Prometheus plugin. Track these core metrics to catch problems before they cause outages.
Queue Depth
Track: rabbitmq.queue.messages (ready + unacknowledged)
Why: Growing queues mean consumers can't keep up with message flow
When queue depth increases steadily, messages are arriving faster than your consumers process them. Left unchecked, this fills memory and eventually crashes the broker.
Set alerts:
- Warning when depth grows >20% in 15 minutes
- Critical when depth exceeds 10,000 (adjust based on your workload)
What to check when alert fires:
- Consumer count (zero consumers = immediate problem)
- Consumer application logs (errors blocking processing?)
- Message consume rate vs publish rate
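A quick way to run these checks from a broker node (the ready/unacked split mirrors the metric above):
# Backlog split into ready vs unacked, plus consumer count, per queue
rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers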
Message Flow Rates
Track:
- rabbitmq.queue.messages.publish_rate
- rabbitmq.queue.messages.deliver_rate
- rabbitmq.queue.messages.ack_rate
Why: Rate imbalance reveals problems before queues explode
In a healthy state: publish rate ≈ ack rate (temporary bursts are normal)
Problem state: publish rate consistently higher than ack rate
Example:
Publish: 1,000 msg/sec
Ack: 750 msg/sec
Gap: 250 msg/sec accumulating
After 1 hour → 900k messages backed up
After 4 hours → 3.6M messages in queue
This metric gives you early warning to scale consumers before disaster.
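One way to compare the live rates for a single queue, assuming the management plugin is enabled; the queue name, vhost, and credentials are placeholders:
# %2F is the URL-encoded default vhost "/"; "orders" is an example queue name
curl -s -u monitoring_user:$PASSWORD \
  http://localhost:15672/api/queues/%2F/orders \
  | jq '{publish_rate: .message_stats.publish_details.rate,
         ack_rate: .message_stats.ack_details.rate}'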
Unacknowledged Messages
Track: rabbitmq.queue.messages.unacknowledged
Why: Detects stuck consumers before timeouts occur
Unacked messages are delivered but not confirmed. Growing unacked count means:
- Consumer received message but processing is stuck
- Consumer crashed without closing connection
- Network preventing ack delivery
Alert threshold: unacked > (prefetch × consumers × 1.5)
Impact: Unacked messages consume memory and get redelivered if consumer disconnects, creating duplicate processing.
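To compare unacked counts against prefetch on each channel, a quick sketch with rabbitmqctl:
# Per-channel consumer count, prefetch, and unacked messages
rabbitmqctl list_channels name consumer_count prefetch_count messages_unacknowledged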
Consumer Count
Track: rabbitmq.queue.consumers
Why: Zero consumers = zero processing
This seems obvious but catches a surprising number of production issues. Consumer pods restart, fail to reconnect, and suddenly critical queues have no consumers. Messages pile up for hours before anyone notices.
Alert immediately: Consumer count = 0 on production queues
Memory Usage
Track: rabbitmq.node.memory.used
Why: Memory alarm blocks all publishers
When memory hits the configured watermark (default 40% of RAM, often increased to 60% in production), RabbitMQ blocks connections that publish messages. Your publishers time out and the entire pipeline freezes.
What consumes memory:
- Messages in queues (non-lazy queues keep everything in RAM)
- Connections and channels (~100KB each)
- Queue/exchange metadata
Set alerts:
- Warning at 50% of watermark (20-30% of total RAM)
- Critical at 75% of watermark (30-45% of total RAM)
- Emergency if memory alarm triggered
Prevent issues:
- Enable lazy queues for large message volumes
- Use connection pooling
- Add queue length limits
- Increase watermark or node RAM if consistently high
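To inspect memory use and adjust the watermark (runtime changes revert on restart; the 0.6 value is an example):
# See what is holding memory right now
rabbitmq-diagnostics memory_breakdown

# Raise the watermark at runtime
rabbitmqctl set_vm_memory_high_watermark 0.6

# Make it permanent in rabbitmq.conf
# vm_memory_high_watermark.relative = 0.6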
Disk Space
Track: rabbitmq.node.disk.free
Why: Disk alarm blocks entire cluster
Default alarm triggers at 50MB free disk space. When any node hits this limit, ALL nodes stop accepting messages cluster-wide.
Alert thresholds:
- Warning: <10GB
- Critical: <1GB
- Monitor every node
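The 50MB default is far too low for production; raising the limit is a one-liner (the 10GB value is an example):
# Raise the disk alarm threshold at runtime (reverts on restart)
rabbitmqctl set_disk_free_limit "10GB"

# Make it permanent in rabbitmq.conf
# disk_free_limit.absolute = 10GB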
File Descriptors
Track: rabbitmq.node.fd.used
Why: Hitting FD limit blocks new connections
Linux limits file descriptors (default: 1024, increase via ulimit). Each connection and channel consumes one FD.
Alert thresholds:
- Warning: >70% of limit
- Critical: >85% of limit
Common cause: Applications creating connections per request instead of pooling.
Fix: Increase ulimit AND implement connection pooling.
Connection Patterns
Track: rabbitmq.connection.count (trend over time)
Why: Connection churn wastes resources
Stable systems maintain steady connection counts. Frequent spikes/drops indicate:
- Missing connection pooling
- Connection leaks
- Network instability
Connections are expensive to create (TCP + AMQP negotiation). High churn wastes broker CPU.
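To see who is opening connections right now (credentials are placeholders):
# Current connection count via the HTTP API
curl -s -u monitoring_user:$PASSWORD http://localhost:15672/api/connections | jq 'length'

# Group open connections by user and client host to spot the churning application
rabbitmqctl list_connections user peer_host | sort | uniq -c | sort -rn | head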
Message Redeliveries
Track: Redelivery count per message
Why: Catches systematic consumer failures
High redelivery rates reveal this pattern:
Message delivered → Consumer crashes → Redelivered
→ Consumer crashes again → Redelivered again
→ Infinite redelivery loop
Solution: Configure dead letter queues to catch repeatedly failing messages.
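A minimal dead-letter setup, sketched with placeholder names: messages a consumer rejects without requeue land in a holding queue instead of looping forever (quorum queues can additionally cap redeliveries with a delivery-limit):
# Exchange and queue that receive dead-lettered messages (names are examples)
rabbitmqadmin declare exchange name=dlx type=fanout
rabbitmqadmin declare queue name=dead-letters
rabbitmqadmin declare binding source=dlx destination=dead-letters

# Point production queues at the dead letter exchange via a policy
rabbitmqctl set_policy dlx-policy "^prod-.*" \
  '{"dead-letter-exchange":"dlx"}' --apply-to queues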
How to Choose a Monitoring Tool
Pick based on your existing infrastructure and team skills.
RabbitMQ Management Plugin
Built-in web UI at port 15672.
Use when:
- Running development/test environments
- Need quick queue inspection for debugging
- Have <5 RabbitMQ nodes
Skip for production because:
- Only stores recent metrics (hours)
- High memory overhead
- No long-term trending
- Limited alerting
Setup:
rabbitmq-plugins enable rabbitmq_management
# Access http://localhost:15672 (guest/guest)
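The same data the UI shows is available over the plugin's HTTP API, which is handy for quick scripted checks:
# Cluster-wide message rates and totals
curl -s -u guest:guest http://localhost:15672/api/overview | jq '.message_stats'

# Every queue with its depth and consumer count
curl -s -u guest:guest http://localhost:15672/api/queues | jq '.[] | {name, messages, consumers}'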
Prometheus + Grafana
Most popular open-source option.
Use when:
- Already running Prometheus infrastructure
- Need full control over data retention
- Have DevOps team to maintain it
- Want community dashboard templates
Requires:
- Prometheus (time-series DB)
- Grafana (visualization)
- Alertmanager (notifications)
Setup:
# Enable RabbitMQ plugin
rabbitmq-plugins enable rabbitmq_prometheus
# Add to prometheus.yml
scrape_configs:
  - job_name: rabbitmq
    static_configs:
      - targets: ['rabbitmq:15692']
Cost: Free (self-host) or $50-200/month (Grafana Cloud)
Best for: Teams comfortable managing Prometheus who want proven, mature tooling.
OpenTelemetry
Vendor-neutral standard for traces, metrics, and logs.
Use when:
- Already using OpenTelemetry for distributed tracing
- Want to correlate RabbitMQ metrics with application traces
- Need vendor flexibility (switch backends easily)
Benefit: Single agent collects everything (app traces + RabbitMQ metrics + logs).
Example: Track a message through your application AND see RabbitMQ queue behavior in the same trace. When a message is slow to process, see if the delay is in your consumer code or in RabbitMQ delivery.
OpenTelemetry works with multiple backends. For a complete OpenTelemetry APM solution that includes RabbitMQ monitoring with traces, metrics, and logs in one UI, Uptrace provides free open-source deployment with automatic dashboard generation.
For full configuration, check the OpenTelemetry RabbitMQ setup guide.
Cost: Free with open-source backends like Uptrace
Best for: Microservices architectures with distributed tracing already in place.
Datadog
Commercial SaaS with easiest setup.
Use when:
- Have monitoring budget ($500+/month)
- Want zero maintenance
- Need commercial support
- Prefer paying for simplicity
Setup: Install agent, add RabbitMQ config, done.
Includes:
- Pre-built dashboards
- AI anomaly detection
- Incident management
- 700+ other integrations
Cost: $15-31/host/month + data ingestion fees
Best for: Enterprises prioritizing ease over cost.
AWS CloudWatch
AWS-native monitoring.
Use when:
- Running exclusively on AWS
- Using Amazon MQ (managed RabbitMQ)
- Already paying for CloudWatch
Limitations:
- Fewer RabbitMQ-specific metrics
- Weak visualization vs Grafana
- AWS-only
Best for: AWS-heavy infrastructure with Amazon MQ.
Quick Decision Tree
Have monitoring budget >$500/month?
└─ Yes → Datadog (easiest)
Already running Prometheus?
└─ Yes → Add rabbitmq_prometheus plugin
Using OpenTelemetry for tracing?
└─ Yes → Add RabbitMQ receiver
None of above?
└─ Start with Prometheus + Grafana (most common path)
How to Set Up Monitoring
Follow these steps regardless of which tool you chose.
Step 1: Create Monitoring User
Don't use default guest account in production.
# Generate secure password
PASSWORD=$(openssl rand -base64 32)
# Create user
rabbitmqctl add_user monitoring_user $PASSWORD
# Grant read-only monitoring access
rabbitmqctl set_user_tags monitoring_user monitoring
# Set permissions
rabbitmqctl set_permissions -p / monitoring_user ".*" ".*" ".*"
Save the password in your secrets manager.
Step 2: Enable Metric Collection
For Prometheus:
# Enable plugin
rabbitmq-plugins enable rabbitmq_prometheus
# Verify metrics endpoint
curl http://localhost:15692/metrics
Configure Prometheus to scrape:
# prometheus.yml
scrape_configs:
  - job_name: rabbitmq
    scrape_interval: 15s
    static_configs:
      - targets:
          - rabbitmq-node1:15692
          - rabbitmq-node2:15692
    basic_auth:
      username: monitoring_user
      password: <from-secrets>
For OpenTelemetry:
# otel-collector-config.yaml
receivers:
  rabbitmq:
    endpoint: http://rabbitmq:15672
    username: monitoring_user
    password: <from-secrets>
    collection_interval: 30s

exporters:
  otlp:
    endpoint: backend:4317

service:
  pipelines:
    metrics:
      receivers: [rabbitmq]
      exporters: [otlp]
Step 3: Build Dashboards
Essential dashboard panels:
- Node health - Memory, disk, CPU per node
- Queue overview - Top 10 queues by message count
- Message rates - Publish vs ack rates
- Consumer health - Consumer count per queue
- Connections - Active connections and channels
- Resource limits - File descriptors, Erlang processes
For Grafana:
Import dashboard ID 10991 from grafana.com or build custom panels.
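If you build custom panels, a few PromQL sketches using the same metric names as the alert rules in the next step (exact names depend on your plugin/exporter version):
# Top 10 queues by message count
topk(10, rabbitmq_queue_messages)

# Queues that currently have no consumers
rabbitmq_queue_consumers == 0

# Memory used as a fraction of the configured limit, per node
rabbitmq_node_mem_used / rabbitmq_node_mem_limit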
For OpenTelemetry backends:
Dashboards auto-generate from metric metadata.
Step 4: Configure Alerts
Set up these critical alerts first:
# Memory alarm imminent (adjust based on your watermark setting)
- alert: RabbitMQMemoryHigh
  expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit > 0.75
  for: 5m
  annotations:
    summary: "RabbitMQ memory at {{ $value | humanizePercentage }}"
    description: "Node {{ $labels.node }} memory usage high - publishers will be blocked at watermark"

# Disk space critical
- alert: RabbitMQDiskLow
  expr: rabbitmq_node_disk_free < 1073741824  # 1GB
  for: 5m

# Queue has no consumers
- alert: RabbitMQNoConsumers
  expr: rabbitmq_queue_consumers{queue=~"prod-.*"} == 0
  for: 2m

# Queue backlog building
- alert: RabbitMQQueueBacklog
  expr: rabbitmq_queue_messages > 10000
  for: 10m

# Node unreachable
- alert: RabbitMQNodeDown
  expr: up{job="rabbitmq"} == 0
  for: 1m
Route to PagerDuty, OpsGenie, or Slack.
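The 20% growth warning from the Queue Depth section can be sketched like this (the 1,000-message floor is arbitrary; tune both numbers to your workload):
# Warning: queue depth grew more than 20% over the last 15 minutes
- alert: RabbitMQQueueGrowing
  expr: |
    rabbitmq_queue_messages > 1.2 * (rabbitmq_queue_messages offset 15m)
    and rabbitmq_queue_messages > 1000
  for: 15m
  labels:
    severity: warning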
Step 5: Test Everything
Verify alerts work before depending on them:
# Test queue backlog alert
rabbitmqadmin declare queue name=test-alert-queue
for i in {1..15000}; do
  rabbitmqadmin publish routing_key=test-alert-queue payload="test"
done
# Verify alert fires
# Clean up
rabbitmqadmin delete queue name=test-alert-queue
Test each alert type monthly to ensure proper routing.
How to Fix Common Issues
Queue Backlog
Symptoms: Queue depth growing, publish rate > ack rate
Diagnose:
rabbitmqctl list_queues name messages consumers
Look for high message count with low consumer count.
Fix:
- Scale consumer applications (add replicas/instances)
- Check consumer logs for errors blocking processing
- Verify consumers are actually running
- Consider lazy queues if messages are large
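If consumers run on Kubernetes, scaling them out is usually the fastest remediation (deployment and queue names here are hypothetical):
# Add consumer replicas
kubectl scale deployment order-consumer --replicas=6

# Confirm the new consumers attached to the queue
rabbitmqctl list_queues name consumers | grep orders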
Memory Alarm
Symptoms: Memory exceeds the watermark (default 40% of RAM, often raised to 60%), publishers blocked, "blocking connection" in logs
Diagnose:
rabbitmqctl status | grep mem
rabbitmq-diagnostics memory_breakdown
Identify what's consuming memory (queues, connections, buffers).
Fix immediately:
- Scale consumers to drain queues
- Enable lazy queues (messages stored on disk instead of RAM)
- Check for connection leaks
- Increase memory watermark or node RAM if consistently high
Prevent recurrence:
# Enable lazy mode for heavy queues
rabbitmqctl set_policy lazy "^heavy-.*" \
'{"queue-mode":"lazy"}' --apply-to queues
# Set queue length limits
rabbitmqctl set_policy limit "^limited-.*" \
'{"max-length":50000}' --apply-to queues
Consumer Lag
Symptoms: Unacked messages high, processing slow
Diagnose:
rabbitmqctl list_queues name messages_unacknowledged consumers
Fix:
- Check consumer application health and logs
- Reduce prefetch if consumers are grabbing too many messages at once
- Scale consumers horizontally
- Implement consumer timeouts
- Check database/API performance if consumer calls external services
Connection Exhausted
Symptoms: File descriptor limit hit, new connections refused
Diagnose:
rabbitmqctl status | grep file_descriptors
lsof -p $(pidof beam.smp) | wc -l
Fix:
- Increase ulimit: ulimit -n 65536
- Audit applications for connection pooling
- Find and fix connection leaks
- Set per-user connection limits
In /etc/security/limits.conf:
rabbitmq soft nofile 65536
rabbitmq hard nofile 65536
No Consumers
Symptoms: Queue filling up, consumer count = 0
Fix:
- Check consumer application status (crashed? scaled to zero?)
- Verify network connectivity between consumer and broker
- Check consumer application logs for connection errors
- Restart consumer applications if needed
Cluster Issues
Symptoms: Nodes report different state, partitions
Diagnose:
rabbitmqctl cluster_status
Fix:
- Restart affected nodes
- Check network reliability between nodes
- Verify cluster partition handling policy
- Review firewall rules (ports 4369, 25672, 35672-35682)
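Useful follow-up checks (the partition handling value shown is one common choice, not a universal default):
# Verify inter-node connectivity and that the node's core apps are running
rabbitmq-diagnostics check_port_connectivity
rabbitmq-diagnostics check_running

# Partition handling strategy in rabbitmq.conf
# cluster_partition_handling = pause_minority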