How to Monitor RabbitMQ
A queue quietly fills up overnight. Memory hits the configured watermark and RabbitMQ blocks all publishers. Your entire message pipeline freezes, and you discover the problem when users start complaining.
This scenario repeats across thousands of production systems because teams don't monitor RabbitMQ properly. The broker exposes comprehensive metrics, but most engineers don't know which ones predict failures or how to track them.
This guide shows you exactly how to monitor RabbitMQ: which metrics matter, which tools work best for different scenarios, and how to configure monitoring that catches issues before they impact production.
Effective RabbitMQ performance monitoring requires tracking the right metrics and setting up alerts that fire before problems cascade.
Why You Need RabbitMQ Monitoring
RabbitMQ routes messages between your services. When it fails, business transactions stop processing:
- E-commerce orders queue up while customers wait for confirmation
- Data pipelines stall and analytics jobs miss deadlines
- API responses time out when async operations don't complete
- Cascading failures spread as downstream services back up
The problem: standard infrastructure monitoring doesn't catch RabbitMQ issues. CPU and memory look fine. The process is running. But messages are silently accumulating in queues.
You need monitoring that tracks message broker behavior: queue depths, consumer lag, acknowledgment rates, and RabbitMQ-specific resource alarms.
Here's how to set it up.
What to Monitor
RabbitMQ exposes hundreds of statistics through its management API and Prometheus plugin. Track these core metrics to catch problems before they cause outages.
Queue Depth
Track: rabbitmq.queue.messages (ready + unacknowledged)
Why: Growing queues mean consumers can't keep up with message flow
When queue depth increases steadily, messages are arriving faster than your consumers process them. Left unchecked, this fills memory and eventually crashes the broker.
Set alerts:
- Warning when depth grows >20% in 15 minutes
- Critical when depth exceeds 10,000 (adjust based on your workload)
What to check when alert fires:
- Consumer count (zero consumers = immediate problem)
- Consumer application logs (errors blocking processing?)
- Message consume rate vs publish rate
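A quick way to run these checks from a broker node (the ready/unacked split mirrors the metric above):
# Backlog split into ready vs unacked, plus consumer count, per queue
rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers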
Message Flow Rates
Track:
- rabbitmq.queue.messages.publish_rate
- rabbitmq.queue.messages.deliver_rate
- rabbitmq.queue.messages.ack_rate
Why: Rate imbalance reveals problems before queues explode
In a healthy state: publish rate ≈ ack rate (temporary bursts are normal)
Problem state: publish rate consistently higher than ack rate
Example:
Publish: 1,000 msg/sec
Ack: 750 msg/sec
Gap: 250 msg/sec accumulating
After 1 hour → 900k messages backed up
After 4 hours → 3.6M messages in queue
This metric gives you early warning to scale consumers before disaster.
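One way to compare the live rates for a single queue, assuming the management plugin is enabled; the queue name, vhost, and credentials are placeholders:
# %2F is the URL-encoded default vhost "/"; "orders" is an example queue name
curl -s -u monitoring_user:$PASSWORD \
  http://localhost:15672/api/queues/%2F/orders \
  | jq '{publish_rate: .message_stats.publish_details.rate,
         ack_rate: .message_stats.ack_details.rate}'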
Unacknowledged Messages
Track: rabbitmq.queue.messages.unacknowledged
Why: Detects stuck consumers before timeouts occur
Unacked messages are delivered but not confirmed. Growing unacked count means:
- Consumer received message but processing is stuck
- Consumer crashed without closing connection
- Network preventing ack delivery
Alert threshold: unacked > (prefetch × consumers × 1.5)
Impact: Unacked messages consume memory and get redelivered if consumer disconnects, creating duplicate processing.
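To compare unacked counts against prefetch on each channel, a quick sketch with rabbitmqctl:
# Per-channel consumer count, prefetch, and unacked messages
rabbitmqctl list_channels name consumer_count prefetch_count messages_unacknowledged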
Consumer Count
Track: rabbitmq.queue.consumers
Why: Zero consumers = zero processing
This seems obvious but catches a surprising number of production issues. Consumer pods restart, fail to reconnect, and suddenly critical queues have no consumers. Messages pile up for hours before anyone notices.
Alert immediately: Consumer count = 0 on production queues
Memory Usage
Track: rabbitmq.node.memory.used
Why: Memory alarm blocks all publishers
When memory hits the configured watermark (default 40% of RAM, often increased to 60% in production), RabbitMQ blocks connections that publish messages. Your publishers time out and the entire pipeline freezes.
What consumes memory:
- Messages in queues (non-lazy queues keep everything in RAM)
- Connections and channels (~100KB each)
- Queue/exchange metadata
Set alerts:
- Warning at 50% of watermark (20-30% of total RAM)
- Critical at 75% of watermark (30-45% of total RAM)
- Emergency if memory alarm triggered
Prevent issues:
- Enable lazy queues for large message volumes
- Use connection pooling
- Add queue length limits
- Increase watermark or node RAM if consistently high
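To inspect memory use and adjust the watermark (runtime changes revert on restart; the 0.6 value is an example):
# See what is holding memory right now
rabbitmq-diagnostics memory_breakdown

# Raise the watermark at runtime
rabbitmqctl set_vm_memory_high_watermark 0.6

# Make it permanent in rabbitmq.conf
# vm_memory_high_watermark.relative = 0.6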
Disk Space
Track: rabbitmq.node.disk.free
Why: Disk alarm blocks entire cluster
Default alarm triggers at 50MB free disk space. When any node hits this limit, ALL nodes stop accepting messages cluster-wide.
Alert thresholds:
- Warning: <10GB
- Critical: <1GB
- Monitor every node
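The 50MB default is far too low for production; raising the limit is a one-liner (the 10GB value is an example):
# Raise the disk alarm threshold at runtime (reverts on restart)
rabbitmqctl set_disk_free_limit "10GB"

# Make it permanent in rabbitmq.conf
# disk_free_limit.absolute = 10GB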
File Descriptors
Track: rabbitmq.node.fd.used
Why: Hitting FD limit blocks new connections
Linux limits file descriptors (default: 1024, increase via ulimit). Each connection and channel consumes one FD.
Alert thresholds:
- Warning: >70% of limit
- Critical: >85% of limit
Common cause: Applications creating connections per request instead of pooling.
Fix: Increase ulimit AND implement connection pooling.
Connection Patterns
Track: rabbitmq.connection.count (trend over time)
Why: Connection churn wastes resources
Stable systems maintain steady connection counts. Frequent spikes/drops indicate:
- Missing connection pooling
- Connection leaks
- Network instability
Connections are expensive to create (TCP + AMQP negotiation). High churn wastes broker CPU.
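To see who is opening connections right now (credentials are placeholders):
# Current connection count via the HTTP API
curl -s -u monitoring_user:$PASSWORD http://localhost:15672/api/connections | jq 'length'

# Group open connections by user and client host to spot the churning application
rabbitmqctl list_connections user peer_host | sort | uniq -c | sort -rn | head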
Message Redeliveries
Track: Redelivery count per message
Why: Catches systematic consumer failures
High redelivery rates reveal this pattern:
Message delivered → Consumer crashes → Redelivered
→ Consumer crashes again → Redelivered again
→ Infinite redelivery loop
Solution: Configure dead letter queues to catch repeatedly failing messages.
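A minimal dead-letter setup, sketched with placeholder names: messages a consumer rejects without requeue land in a holding queue instead of looping forever (quorum queues can additionally cap redeliveries with a delivery-limit):
# Exchange and queue that receive dead-lettered messages (names are examples)
rabbitmqadmin declare exchange name=dlx type=fanout
rabbitmqadmin declare queue name=dead-letters
rabbitmqadmin declare binding source=dlx destination=dead-letters

# Point production queues at the dead letter exchange via a policy
rabbitmqctl set_policy dlx-policy "^prod-.*" \
  '{"dead-letter-exchange":"dlx"}' --apply-to queues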
How to Choose a Monitoring Tool
Pick based on your existing infrastructure and team skills.
RabbitMQ Management Plugin
Built-in web UI at port 15672.
Use when:
- Running development/test environments
- Need quick queue inspection for debugging
- Have <5 RabbitMQ nodes
Skip for production because:
- Only stores recent metrics (hours)
- High memory overhead
- No long-term trending
- Limited alerting
Setup:
rabbitmq-plugins enable rabbitmq_management
# Access http://localhost:15672 (guest/guest)
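The same data the UI shows is available over the plugin's HTTP API, which is handy for quick scripted checks:
# Cluster-wide message rates and totals
curl -s -u guest:guest http://localhost:15672/api/overview | jq '.message_stats'

# Every queue with its depth and consumer count
curl -s -u guest:guest http://localhost:15672/api/queues | jq '.[] | {name, messages, consumers}'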
Prometheus + Grafana
Most popular open-source option.
Use when:
- Already running Prometheus infrastructure
- Need full control over data retention
- Have DevOps team to maintain it
- Want community dashboard templates
Requires:
- Prometheus (time-series DB)
- Grafana (visualization)
- Alertmanager (notifications)
Setup:
# Enable RabbitMQ plugin
rabbitmq-plugins enable rabbitmq_prometheus
# Add to prometheus.yml
scrape_configs:
  - job_name: rabbitmq
    static_configs:
      - targets: ['rabbitmq:15692']
Cost: Free (self-host) or $50-200/month (Grafana Cloud)
Best for: Teams comfortable managing Prometheus who want proven, mature tooling.
OpenTelemetry
Vendor-neutral standard for traces, metrics, and logs.
Use when:
- Already using OpenTelemetry for distributed tracing
- Want to correlate RabbitMQ metrics with application traces
- Need vendor flexibility (switch backends easily)
Benefit: Single agent collects everything (app traces + RabbitMQ metrics + logs).
Example: Track a message through your application AND see RabbitMQ queue behavior in the same trace. When a message is slow to process, see if the delay is in your consumer code or in RabbitMQ delivery.
OpenTelemetry works with multiple backends. For a complete OpenTelemetry APM solution that includes RabbitMQ monitoring with traces, metrics, and logs in one UI, Uptrace provides free open-source deployment with automatic dashboard generation.
For full configuration, check the OpenTelemetry RabbitMQ setup guide.
Cost: Free with open-source backends like Uptrace
Best for: Microservices architectures with distributed tracing already in place.
Datadog
Commercial SaaS with easiest setup.
Use when:
- Have monitoring budget ($500+/month)
- Want zero maintenance
- Need commercial support
- Prefer paying for simplicity
Setup: Install agent, add RabbitMQ config, done.
Includes:
- Pre-built dashboards
- AI anomaly detection
- Incident management
- 700+ other integrations
Cost: $15-31/host/month + data ingestion fees
Best for: Enterprises prioritizing ease over cost.
AWS CloudWatch
AWS-native monitoring.
Use when:
- Running exclusively on AWS
- Using Amazon MQ (managed RabbitMQ)
- Already paying for CloudWatch
Limitations:
- Fewer RabbitMQ-specific metrics
- Weak visualization vs Grafana
- AWS-only
Best for: AWS-heavy infrastructure with Amazon MQ.
Quick Decision Tree
Have monitoring budget >$500/month?
└─ Yes → Datadog (easiest)
Already running Prometheus?
└─ Yes → Add rabbitmq_prometheus plugin
Using OpenTelemetry for tracing?
└─ Yes → Add RabbitMQ receiver
None of above?
└─ Start with Prometheus + Grafana (most common path)
How to Set Up Monitoring
Follow these steps regardless of which tool you chose.
Step 1: Create Monitoring User
Don't use default guest account in production.
# Generate secure password
PASSWORD=$(openssl rand -base64 32)
# Create user
rabbitmqctl add_user monitoring_user $PASSWORD
# Grant read-only monitoring access
rabbitmqctl set_user_tags monitoring_user monitoring
# Set permissions
rabbitmqctl set_permissions -p / monitoring_user ".*" ".*" ".*"
Save the password in your secrets manager.
Step 2: Enable Metric Collection
For Prometheus:
# Enable plugin
rabbitmq-plugins enable rabbitmq_prometheus
# Verify metrics endpoint
curl http://localhost:15692/metrics
Configure Prometheus to scrape:
# prometheus.yml
scrape_configs:
  - job_name: rabbitmq
    scrape_interval: 15s
    static_configs:
      - targets:
          - rabbitmq-node1:15692
          - rabbitmq-node2:15692
    basic_auth:
      username: monitoring_user
      password: <from-secrets>
For OpenTelemetry:
# otel-collector-config.yaml
receivers:
  rabbitmq:
    endpoint: http://rabbitmq:15672
    username: monitoring_user
    password: <from-secrets>
    collection_interval: 30s

exporters:
  otlp:
    endpoint: backend:4317

service:
  pipelines:
    metrics:
      receivers: [rabbitmq]
      exporters: [otlp]
Step 3: Build Dashboards
Essential dashboard panels:
- Node health - Memory, disk, CPU per node
- Queue overview - Top 10 queues by message count
- Message rates - Publish vs ack rates
- Consumer health - Consumer count per queue
- Connections - Active connections and channels
- Resource limits - File descriptors, Erlang processes
For Grafana:
Import dashboard ID 10991 from grafana.com or build custom panels.
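If you build custom panels, a few PromQL sketches using the same metric names as the alert rules in the next step (exact names depend on your plugin/exporter version):
# Top 10 queues by message count
topk(10, rabbitmq_queue_messages)

# Queues that currently have no consumers
rabbitmq_queue_consumers == 0

# Memory used as a fraction of the configured limit, per node
rabbitmq_node_mem_used / rabbitmq_node_mem_limit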
For OpenTelemetry backends:
Dashboards auto-generate from metric metadata.
Step 4: Configure Alerts
Set up these critical alerts first:
# Memory alarm imminent (adjust based on your watermark setting)
- alert: RabbitMQMemoryHigh
  expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit > 0.75
  for: 5m
  annotations:
    summary: "RabbitMQ memory at {{ $value | humanizePercentage }}"
    description: "Node {{ $labels.node }} memory usage high - publishers will be blocked at watermark"

# Disk space critical
- alert: RabbitMQDiskLow
  expr: rabbitmq_node_disk_free < 1073741824  # 1GB
  for: 5m

# Queue has no consumers
- alert: RabbitMQNoConsumers
  expr: rabbitmq_queue_consumers{queue=~"prod-.*"} == 0
  for: 2m

# Queue backlog building
- alert: RabbitMQQueueBacklog
  expr: rabbitmq_queue_messages > 10000
  for: 10m

# Node unreachable
- alert: RabbitMQNodeDown
  expr: up{job="rabbitmq"} == 0
  for: 1m
Route to PagerDuty, OpsGenie, or Slack.
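The 20% growth warning from the Queue Depth section can be sketched like this (the 1,000-message floor is arbitrary; tune both numbers to your workload):
# Warning: queue depth grew more than 20% over the last 15 minutes
- alert: RabbitMQQueueGrowing
  expr: |
    rabbitmq_queue_messages > 1.2 * (rabbitmq_queue_messages offset 15m)
    and rabbitmq_queue_messages > 1000
  for: 15m
  labels:
    severity: warning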
Step 5: Test Everything
Verify alerts work before depending on them:
# Test queue backlog alert
rabbitmqadmin declare queue name=test-alert-queue
for i in {1..15000}; do
  rabbitmqadmin publish routing_key=test-alert-queue payload="test"
done
# Verify alert fires
# Clean up
rabbitmqadmin delete queue name=test-alert-queue
Test each alert type monthly to ensure proper routing.
How to Fix Common Issues
Queue Backlog
Symptoms: Queue depth growing, publish rate > ack rate
Diagnose:
rabbitmqctl list_queues name messages consumers
Look for high message count with low consumer count.
Fix:
- Scale consumer applications (add replicas/instances)
- Check consumer logs for errors blocking processing
- Verify consumers are actually running
- Consider lazy queues if messages are large
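If consumers run on Kubernetes, scaling them out is usually the fastest remediation (deployment and queue names here are hypothetical):
# Add consumer replicas
kubectl scale deployment order-consumer --replicas=6

# Confirm the new consumers attached to the queue
rabbitmqctl list_queues name consumers | grep orders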
Memory Alarm
Symptoms: Memory exceeds the watermark (default 40% of RAM, often raised to 60%), publishers blocked, "blocking connection" in logs
Diagnose:
rabbitmqctl status | grep mem
rabbitmq-diagnostics memory_breakdown
Identify what's consuming memory (queues, connections, buffers).
Fix immediately:
- Scale consumers to drain queues
- Enable lazy queues (messages stored on disk instead of RAM)
- Check for connection leaks
- Increase memory watermark or node RAM if consistently high
Prevent recurrence:
# Enable lazy mode for heavy queues
rabbitmqctl set_policy lazy "^heavy-.*" \
'{"queue-mode":"lazy"}' --apply-to queues
# Set queue length limits
rabbitmqctl set_policy limit "^limited-.*" \
'{"max-length":50000}' --apply-to queues
Consumer Lag
Symptoms: Unacked messages high, processing slow
Diagnose:
rabbitmqctl list_queues name messages_unacknowledged consumers
Fix:
- Check consumer application health and logs
- Reduce prefetch if consumers are grabbing too many messages at once
- Scale consumers horizontally
- Implement consumer timeouts
- Check database/API performance if consumer calls external services
Connection Exhausted
Symptoms: File descriptor limit hit, new connections refused
Diagnose:
rabbitmqctl status | grep file_descriptors
lsof -p $(pidof beam.smp) | wc -l
Fix:
- Increase ulimit: ulimit -n 65536
- Audit applications for connection pooling
- Find and fix connection leaks
- Set per-user connection limits
In /etc/security/limits.conf:
rabbitmq soft nofile 65536
rabbitmq hard nofile 65536
No Consumers
Symptoms: Queue filling up, consumer count = 0
Fix:
- Check consumer application status (crashed? scaled to zero?)
- Verify network connectivity between consumer and broker
- Check consumer application logs for connection errors
- Restart consumer applications if needed
Cluster Issues
Symptoms: Nodes report different state, partitions
Diagnose:
rabbitmqctl cluster_status
Fix:
- Restart affected nodes
- Check network reliability between nodes
- Verify cluster partition handling policy
- Review firewall rules (ports 4369, 25672, 35672-35682)
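Useful follow-up checks (the partition handling value shown is one common choice, not a universal default):
# Verify inter-node connectivity and that the node's core apps are running
rabbitmq-diagnostics check_port_connectivity
rabbitmq-diagnostics check_running

# Partition handling strategy in rabbitmq.conf
# cluster_partition_handling = pause_minority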