OpenTelemetry Host Metrics receiver
The hostmetricsreceiver is an OpenTelemetry Collector receiver that collects system-level metrics from host machines. It provides insight into system performance, resource utilization, and potential bottlenecks that could impact your applications. This guide is current as of OpenTelemetry Collector v0.140.1.
Overview
By collecting and analyzing host metrics, you can:
- Monitor system health in real-time across your infrastructure
- Identify performance bottlenecks before they impact users
- Track resource utilization trends for capacity planning
- Set up alerts for anomalous behavior or resource exhaustion
- Correlate system metrics with application performance issues
Available Scrapers
The hostmetricsreceiver includes multiple scrapers for different system components:
| Scraper | Description | Key Metrics | Platform Support |
|---|---|---|---|
| cpu | CPU utilization and time | system.cpu.time, system.cpu.utilization | All |
| memory | RAM usage and availability | system.memory.usage, system.memory.utilization | All |
| disk | Disk I/O operations | system.disk.io, system.disk.operations | All |
| filesystem | Filesystem usage (including NFS) | system.filesystem.usage, system.filesystem.utilization | All |
| network | Network I/O and connections | system.network.io, system.network.connections | All |
| load | System load averages | system.cpu.load_average.1m/5m/15m | Linux/Unix |
| paging | Swap/paging activity | system.paging.usage, system.paging.operations | All |
| processes | Aggregate process counts | system.processes.count, system.processes.created | Linux/macOS |
| process | Per-process metrics | process.cpu.time, process.memory.usage | Linux/Windows/macOS (requires elevated permissions) |
Prerequisites
Before you begin, ensure you have:
- OpenTelemetry Collector installed (installation guide)
- Appropriate permissions to collect system metrics
- Access to modify the Collector configuration
- A telemetry backend configured (e.g., Uptrace, Prometheus, Jaeger)
What is OpenTelemetry Collector?
OpenTelemetry Collector is a vendor-agnostic agent that:
- Collects telemetry data from various sources using receivers
- Processes data through a pipeline (filtering, aggregating, transforming)
- Exports data to one or multiple observability backends
The Collector provides a unified way to receive, process, and export telemetry data (traces, metrics, and logs), eliminating the need for multiple agents.
Key Benefits
- Simplified deployment - Single agent for all telemetry data
- Vendor neutrality - Switch backends without changing instrumentation
- Powerful processing - Transform and enrich data before export
- Reduced overhead - Efficient batching and compression
Quick Start Guide
Basic Configuration
To start collecting host metrics immediately, add this minimal configuration to your otel-collector-config.yaml:
receivers:
hostmetrics:
collection_interval: 10s
scrapers:
cpu:
memory:
disk:
filesystem:
network:
load:
processors:
batch:
timeout: 10s
exporters:
# Configure your exporter here
otlp:
endpoint: "your-backend:4317"
service:
pipelines:
metrics:
receivers: [hostmetrics]
processors: [batch]
exporters: [otlp]
Starting the Collector
# Run with your configuration
otelcol-contrib --config=otel-collector-config.yaml
# Or as a service (systemd)
sudo systemctl start otelcol-contrib
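If you installed the Collector from the official DEB/RPM packages, the systemd service reads its configuration from /etc/otelcol-contrib/config.yaml by default. A minimal sketch, assuming those package defaults (paths may differ for your installation):
# Install the configuration where the packaged systemd unit expects it
sudo cp otel-collector-config.yaml /etc/otelcol-contrib/config.yaml
sudo systemctl restart otelcol-contrib
# Follow the logs to confirm the hostmetrics receiver started cleanly
sudo journalctl -u otelcol-contrib -f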
Complete Configuration Example
Here's a comprehensive configuration that includes all available scrapers and recommended processors:
processors:
# Detect and add resource attributes
resourcedetection:
detectors: [env, system, docker, ec2, gcp, azure]
timeout: 5s
override: false
# Convert cumulative metrics to delta
cumulativetodelta:
# Batch data for efficient export
batch:
timeout: 10s
send_batch_size: 1024
receivers:
hostmetrics:
# How often to collect metrics
collection_interval: 10s
# Configure individual scrapers
scrapers:
# CPU utilization metrics
cpu:
metrics:
system.cpu.utilization:
enabled: true
      # Disk I/O metrics
      disk:
        # Exclude loop devices (regular-expression match)
        exclude:
          devices: ["^/dev/loop.*"]
          match_type: regexp
      # File system utilization metrics
      filesystem:
        exclude_mount_points:
          match_type: regexp
          mount_points: ["^/tmp/.*", "^/dev/shm$"]
        include_fs_types:
          match_type: strict
          fs_types: [ext3, ext4, xfs, btrfs]
# CPU load average metrics (1, 5, 15 minutes)
load:
# Memory utilization metrics
memory:
# Network interface I/O metrics & TCP connection metrics
network:
include:
interfaces: ["eth0", "eth1"]
# Paging/Swap space utilization and I/O metrics
paging:
      # Process count by status
      processes:
      # Per-process metrics (requires elevated permissions)
      process:
        include:
          names: ["nginx", "postgres", "redis"]
          match_type: strict
        mute_process_name_error: true
service:
pipelines:
metrics:
receivers: [hostmetrics]
processors: [resourcedetection, cumulativetodelta, batch]
exporters: [otlp]
Production-Ready Configuration Template
This is a complete, copy-paste ready configuration optimized for production environments:
# Complete production-ready configuration for hostmetricsreceiver
# Tested with OpenTelemetry Collector v0.140.1+
receivers:
hostmetrics:
# Collect metrics every 30 seconds (balance between granularity and overhead)
collection_interval: 30s
# Root path for containerized deployments
# root_path: /hostfs
scrapers:
# CPU metrics - enabled by default
cpu:
metrics:
system.cpu.utilization:
enabled: true
system.cpu.logical.count:
enabled: true
# Memory metrics
memory:
metrics:
system.memory.utilization:
enabled: true
# Disk I/O metrics - exclude virtual/loop devices
disk:
exclude:
devices: ["^/dev/loop.*", "^/dev/dm-.*"]
match_type: regexp
      # Filesystem metrics - production filesystems only
      filesystem:
        include_fs_types:
          match_type: strict
          fs_types: [ext3, ext4, xfs, btrfs, zfs, nfs, nfs4]
        exclude_mount_points:
          match_type: regexp
          mount_points: ["^/boot$", "^/tmp/.*", "^/dev/.*", "^/run/.*", "^/sys/.*"]
        # Set to true if NFS or other network/virtual filesystems should be reported
        include_virtual_filesystems: false
# Network metrics - filter primary interfaces
network:
include:
interfaces: ["eth.*", "en.*", "ens.*"]
match_type: regexp
exclude:
interfaces: ["lo", "docker.*", "veth.*"]
match_type: regexp
# Load average (Linux/Unix only)
load:
# Paging/Swap metrics
paging:
      # Process count metrics
      processes:
        metrics:
          system.processes.created:
            enabled: true
      # Per-process metrics (optional - requires elevated permissions)
      # process:
      #   include:
      #     names: ["nginx", "mysql", "postgres", "redis"]
      #     match_type: regexp
      #   mute_process_name_error: true
processors:
# Detect cloud/container environment
resourcedetection:
    detectors: [env, system, docker, ec2, eks, gcp, azure, aks]
timeout: 5s
override: false
system:
hostname_sources: [dns, os]
  # Convert cumulative to delta for backends that expect delta
  cumulativetodelta:
    include:
      metrics:
        - system.network.io
        - system.disk.io
      match_type: strict
# Batch for efficient export
batch:
timeout: 10s
send_batch_size: 1024
send_batch_max_size: 2048
# Filter out noisy metrics (optional)
# filter/metrics:
# metrics:
# exclude:
# match_type: regexp
# metric_names:
# - "system\\.disk\\.merged"
exporters:
# Configure your backend
otlp:
endpoint: "your-backend:4317"
compression: gzip
# headers:
# api-key: "${API_KEY}"
# Debug exporter for troubleshooting
# debug:
# verbosity: detailed
service:
# Enable telemetry for the collector itself
telemetry:
logs:
level: info
    metrics:
      level: detailed
      # The metrics.address setting was removed in recent Collector versions;
      # expose the Collector's own metrics with a Prometheus pull reader instead
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888
pipelines:
metrics:
receivers: [hostmetrics]
processors: [resourcedetection, cumulativetodelta, batch]
exporters: [otlp]
Usage instructions:
- Save as `otel-collector-config.yaml`
- Update `exporters.otlp.endpoint` with your backend URL
- Uncomment `root_path` if running in Docker/Kubernetes
- Adjust `collection_interval` based on your needs (10s for high frequency, 60s for lower overhead)
- Enable the `process` scraper if you need per-process metrics (requires root/elevated permissions); a quick validation check is shown below
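Before starting the Collector, you can check the file for syntax errors and unknown fields with the built-in validate command (a quick sanity check, not a guarantee that every scraper will have the permissions it needs):
# Validate the configuration without starting any pipelines
otelcol-contrib validate --config=otel-collector-config.yaml
# If validation passes, start the Collector
otelcol-contrib --config=otel-collector-config.yaml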
Scraper Details
The following sections describe each scraper and its key configuration options in more detail.
CPU Scraper
Collects CPU utilization metrics per core and aggregated.
Metrics collected:
- `system.cpu.time` - CPU time in different states (user, system, idle, etc.)
- `system.cpu.utilization` - CPU utilization percentage
Example configuration:
cpu:
metrics:
system.cpu.utilization:
enabled: true
system.cpu.logical.count:
enabled: true
Memory Scraper
Monitors RAM usage and availability.
Metrics collected:
- `system.memory.usage` - Memory usage by state
- `system.memory.utilization` - Memory utilization percentage
Example configuration:
memory:
metrics:
system.memory.utilization:
enabled: true
Disk Scraper
Tracks disk I/O operations and performance.
Metrics collected:
- `system.disk.io` - Bytes read/written
- `system.disk.operations` - Read/write operation counts
- `system.disk.io_time` - Time spent on I/O operations
- `system.disk.merged` - Merged read/write operations
Example configuration:
disk:
  # Collect metrics for specific devices only
  include:
    devices: ["sda", "sdb"]
    match_type: strict
  exclude:
    devices: ["^/dev/loop.*"]
    match_type: regexp
Filesystem Scraper
Monitors filesystem usage and capacity, including network filesystems like NFS.
Metrics collected:
- `system.filesystem.usage` - Used/free space
- `system.filesystem.utilization` - Usage percentage
- `system.filesystem.inodes.usage` - Inode usage
Example configuration:
filesystem:
include_fs_types:
match_type: regexp
fs_types: ["^ext[234]$", "^xfs$"]
  exclude_mount_points:
    match_type: regexp
    mount_points: ["^/boot$", "^/dev/.*"]
include_virtual_filesystems: false
Collecting NFS metrics:
To monitor NFS mounts specifically, configure the filesystem scraper to include NFS filesystem types:
filesystem:
# Include NFS v3 and v4 mounts
include_fs_types:
match_type: strict
fs_types: [nfs, nfs4]
  # Optionally filter specific mount points
  include_mount_points:
    match_type: regexp
    mount_points: ["^/mnt/nfs/.*"]
# Collect metrics for network filesystems
include_virtual_filesystems: true
This configuration collects the standard filesystem metrics (usage, utilization, inodes) for the matching NFS mounts; drop the mount-point filter to cover all NFS mounts on the system.
Network Scraper
Collects network interface and connection statistics.
Metrics collected:
- `system.network.io` - Bytes sent/received
- `system.network.packets` - Packets sent/received
- `system.network.errors` - Network errors
- `system.network.connections` - TCP connection states
Example configuration:
network:
  include:
    interfaces: ["eth.*", "en.*"]
    match_type: regexp
  exclude:
    interfaces: ["lo", "docker.*"]
    match_type: regexp
Load Scraper
Reports system load averages (Linux/Unix only).
Metrics collected:
- `system.cpu.load_average.1m` - 1-minute load average
- `system.cpu.load_average.5m` - 5-minute load average
- `system.cpu.load_average.15m` - 15-minute load average
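The load scraper needs no options for basic use. If you want load averages normalized by core count (useful when comparing hosts with different CPU counts), the scraper also supports a cpu_average flag; a minimal sketch:
load:
  # Divide the reported load averages by the number of logical CPUs
  cpu_average: true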
Paging Scraper
Monitors swap/paging activity.
Metrics collected:
- `system.paging.usage` - Swap space usage
- `system.paging.operations` - Page in/out operations
- `system.paging.faults` - Page fault counts
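No options are required here either. As a sketch, assuming your Collector version exposes the optional utilization metric, you can additionally report swap usage as a percentage:
paging:
  metrics:
    # Optional metric, disabled by default
    system.paging.utilization:
      enabled: true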
Process Scrapers
Two scrapers provide process-level insights:
Processes Scraper (`processes`) - aggregate process counts:
- `system.processes.count` - Process count by status (running, blocked, etc.)
- `system.processes.created` - Total number of processes created
Process Scraper (`process`) - per-process metrics:
- `process.cpu.time` - CPU time per process
- `process.memory.usage` - Memory usage per process
- `process.disk.io` - Disk I/O per process
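A minimal sketch enabling both scrapers side by side (the per-process `process` scraper usually needs elevated permissions, as described in the Permission Requirements section below):
scrapers:
  # Aggregate counts only - lightweight, no special permissions needed
  processes:
  # Per-process CPU, memory, and disk I/O metrics
  process:
    # Avoid log noise from short-lived processes that exit mid-collection
    mute_process_name_error: true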
Advanced Configuration
Filesystem Metrics Configuration
For production environments with diverse filesystem types:
receivers:
hostmetrics:
collection_interval: 10s
scrapers:
filesystem:
# Only collect from specific filesystem types
include_fs_types:
match_type: strict
fs_types: [ext3, ext4, xfs, btrfs, zfs]
        # Exclude temporary mount points
        exclude_mount_points:
          match_type: regexp
          mount_points:
            - "^/tmp/.*"
            - "^/dev/shm$"
            - "^/run/.*"
            - "^/sys/.*"
        # Exclude in-memory and automount filesystem types
        exclude_fs_types:
          match_type: strict
          fs_types: [tmpfs, devtmpfs, autofs]
        # Do not report virtual filesystems
        include_virtual_filesystems: false
Process Metrics Configuration
To collect detailed per-process metrics:
receivers:
hostmetrics:
collection_interval: 10s
scrapers:
      # Aggregate process counts
      processes:
        metrics:
          system.processes.created:
            enabled: true
      # Detailed per-process metrics
      process:
        # Include specific processes by name
        include:
          names: ["nginx", "mysql", "redis", "postgres"]
          match_type: regexp
        # Or exclude specific processes
        exclude:
          names: ["^kernel.*"]
          match_type: regexp
        # Mute errors for processes that terminate during collection
        mute_process_name_error: true
Permission Requirements
Per-process metrics (the `process` scraper) require elevated permissions. Configure based on your system:
Linux (systemd):
# /lib/systemd/system/otelcol-contrib.service
[Service]
User=root
Group=root
# Or use capabilities instead of root
AmbientCapabilities=CAP_SYS_PTRACE CAP_DAC_READ_SEARCH
Docker:
docker run --pid=host --cap-add=SYS_PTRACE ...
Kubernetes:
apiVersion: v1
kind: Pod
spec:
containers:
- name: otel-collector
securityContext:
capabilities:
add: ["SYS_PTRACE", "DAC_READ_SEARCH"]
Container Deployments
Docker Configuration
When running the Collector in a container, mount host directories to collect host (not container) metrics:
# Full host filesystem access
docker run \
-v /:/hostfs:ro \
-v ./config.yaml:/etc/otel-collector-config.yaml \
otel/opentelemetry-collector-contrib:latest \
--config=/etc/otel-collector-config.yaml
# Minimal required mounts
docker run \
-v /proc:/hostfs/proc:ro \
-v /sys:/hostfs/sys:ro \
-v /etc/hostname:/hostfs/etc/hostname:ro \
-v ./config.yaml:/etc/otel-collector-config.yaml \
otel/opentelemetry-collector-contrib:latest
Configuration with root_path:
receivers:
hostmetrics:
root_path: /hostfs
collection_interval: 10s
scrapers:
cpu:
memory:
disk:
filesystem:
network:
Kubernetes DaemonSet
Deploy as a DaemonSet to collect metrics from all nodes:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: otel-collector
spec:
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
serviceAccountName: otel-collector
containers:
- name: otel-collector
image: otel/opentelemetry-collector-contrib:latest
volumeMounts:
- name: hostfs
mountPath: /hostfs
readOnly: true
- name: config
mountPath: /etc/otel-collector-config.yaml
subPath: config.yaml
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumes:
- name: hostfs
hostPath:
path: /
- name: config
configMap:
name: otel-collector-config
hostNetwork: true
hostPID: true
Resource Detection
The Resource Detection Processor automatically discovers and adds metadata about the environment:
processors:
resourcedetection:
# Detectors to run
    detectors: [env, system, docker, ec2, eks, gcp, azure, aks]
# Detection timeout
timeout: 5s
# Override existing attributes
override: false
# Configure specific detectors
system:
hostname_sources: [dns, os, cname]
ec2:
tags:
- "^env$"
- "^team$"
docker:
use_hostname_if_available: true
Available Detectors
- env - Environment variables
- system - System information (hostname, OS, architecture)
- docker - Docker container metadata
- ec2 - AWS EC2 instance metadata
- ecs - AWS ECS task metadata
- eks - AWS EKS cluster information
- gcp - Google Cloud Platform metadata (GCE, GKE, Cloud Run, and other GCP services)
- azure - Azure VM metadata
- aks - Azure Kubernetes Service metadata
- k8snode - Kubernetes node metadata
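With the system and cloud detectors enabled, host metrics are annotated with resource attributes similar to the following (illustrative values; the exact set depends on the detected environment):
# Typical resource attributes added by resourcedetection
# host.name:               ip-10-1-2-3.ec2.internal
# host.type:               t3.large
# os.type:                 linux
# cloud.provider:          aws
# cloud.region:            us-east-1
# cloud.availability_zone: us-east-1a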
Hostname Configuration
Control how hostnames are determined:
processors:
resourcedetection/custom:
detectors: [system]
system:
# Sources: dns, os, cname, lookup
hostname_sources: [dns, os]
# Use FQDN
use_fqdn: true
Performance Optimization
Collection Interval Tuning
Balance between data granularity and resource usage:
receivers:
hostmetrics:
# High-frequency for critical metrics
collection_interval: 5s
scrapers:
cpu:
memory:
hostmetrics/detailed:
# Lower frequency for detailed metrics
collection_interval: 30s
scrapers:
filesystem:
      process:
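Both receivers must then be listed in the metrics pipeline. A sketch, reusing the otlp exporter and batch processor defined earlier in this guide:
service:
  pipelines:
    metrics:
      receivers: [hostmetrics, hostmetrics/detailed]
      processors: [batch]
      exporters: [otlp]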
Metric Filtering
Reduce data volume by filtering unnecessary metrics:
processors:
filter/metrics:
metrics:
exclude:
match_type: regexp
metric_names:
- "system\\.network\\.io.*"
- "system\\.disk\\.merged"
Batching Configuration
Optimize export efficiency:
processors:
batch:
# Maximum time before sending
timeout: 10s
# Maximum batch size
send_batch_size: 1024
    # Upper limit on a single batch; larger batches are split
send_batch_max_size: 2048
Troubleshooting
Common Issues and Solutions
1. Missing Metrics
Problem: Some expected metrics aren't being collected.
Solutions:
- Check if the scraper is enabled in configuration
- Verify permissions (especially for process metrics)
- Ensure filesystem/device isn't excluded
- Check Collector logs for errors (see the debug pipeline sketch below)
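To see exactly which metrics a scraper emits, you can temporarily route the receiver to the debug exporter, which prints data points to the Collector's standard output. A minimal sketch:
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    metrics/debug:
      receivers: [hostmetrics]
      exporters: [debug]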
2. High CPU Usage
Problem: Collector consuming excessive CPU.
Solutions:
- Increase collection interval
- Reduce number of processes tracked
- Filter out unnecessary metrics
- Enable metric aggregation
3. Permission Denied Errors
Problem: Cannot collect process or certain system metrics.
Solutions:
# Check current permissions
ps aux | grep otelcol
# Run with elevated permissions
sudo systemctl edit otelcol-contrib
# Add: User=root
# Or add specific capabilities
setcap cap_sys_ptrace,cap_dac_read_search+ep /usr/bin/otelcol-contrib
4. Container Metrics Instead of Host
Problem: Seeing container metrics when expecting host metrics.
Solution:
# Ensure root_path is set
receivers:
hostmetrics:
root_path: /hostfs
# Verify mounts
docker inspect <container_id> | grep Mounts -A 20
Debug Logging
Enable debug logging to troubleshoot issues:
service:
telemetry:
logs:
level: debug
development: true
Monitoring Best Practices
1. Establish Baselines
- Collect metrics for at least 1-2 weeks
- Identify normal operating ranges
- Document peak usage patterns
2. Set Meaningful Alerts
# Example alert thresholds
- CPU utilization > 80% for 5 minutes
- Memory utilization > 90% for 10 minutes
- Disk usage > 85%
- Load average > number of CPU cores
3. Create Dashboards
Group related metrics for quick insights:
- System Overview - CPU, memory, load
- Storage - Disk usage, I/O, filesystem
- Network - Traffic, errors, connections
- Processes - Top consumers, process counts
4. Correlate with Application Metrics
- Link system metrics with application performance
- Identify resource constraints affecting applications
- Plan capacity based on usage trends
Integration with Uptrace
To send metrics to Uptrace:
exporters:
otlp/uptrace:
endpoint: api.uptrace.dev:4317
headers:
uptrace-dsn: "your-project-dsn"
compression: gzip
service:
pipelines:
metrics:
receivers: [hostmetrics]
processors: [resourcedetection, batch]
exporters: [otlp/uptrace]
See the complete Uptrace integration guide for detailed setup instructions.
Security Considerations
Principle of Least Privilege
- Run with minimal required permissions
- Use capabilities instead of root when possible
- Restrict file system access with read-only mounts
Sensitive Data
- Process names may reveal application architecture
- Network connections could expose service topology
- Consider filtering sensitive metrics before export (see the example below)
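For example, if per-process metrics are considered sensitive, a filter processor like the one shown earlier can drop them before export. A sketch (adjust the pattern to your environment):
processors:
  filter/sensitive:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - "process\\..*"
Remember to add filter/sensitive to the processors list of your metrics pipeline.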
Configuration Security
# Use environment variables for sensitive data
exporters:
otlp:
endpoint: ${OTLP_ENDPOINT}
headers:
api-key: ${API_KEY}
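The variables are resolved from the Collector's process environment at startup. A sketch for running the binary directly (with systemd you would set Environment= entries in the unit file instead); the secret file path is hypothetical:
# Provide secrets via the environment rather than hard-coding them in the config
export OTLP_ENDPOINT="your-backend:4317"
export API_KEY="$(cat /path/to/secret/api-key)"   # hypothetical secret location
otelcol-contrib --config=otel-collector-config.yaml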
What's Next?
Expand your observability stack with these complementary receivers:
- OpenTelemetry Kubernetes - Collect Kubernetes cluster metrics and metadata
- OpenTelemetry Docker - Monitor Docker containers and daemon metrics
- OpenTelemetry Redis - Track Redis performance and usage
- OpenTelemetry PostgreSQL - Monitor PostgreSQL database metrics
- Prometheus Integration - Scrape Prometheus metrics