Infrastructure alert coverage ensures that your servers, containers, and other infrastructure components have monitoring alerts in place to detect issues before they impact your applications and customers.
About this scorecard rule
This infrastructure alert coverage rule is part of Level 1 (Reactive) in the business uptime maturity model. It verifies that your critical infrastructure components have basic alerting configured to notify you when problems occur.
Why this matters: Infrastructure issues often cascade to application problems. Without proper infrastructure alerting, you might only discover problems when customers start complaining about slow or unavailable services.
How this rule works
This rule examines your infrastructure entities and checks whether they have alert conditions defined. Specifically, it looks for alerts on:
- INFRA-HOST entities: Physical servers, virtual machines, and cloud instances
- INFRA-KUBERNETES-POD entities: Kubernetes pods and containers
The rule fails if any monitored infrastructure entity lacks at least one alert condition.
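If you want to reproduce the check yourself, one way to approximate it is a NerdGraph entity search that treats an alertSeverity of NOT_CONFIGURED as "no alert conditions". The Python sketch below (using the requests library) is a minimal illustration, not the scorecard's internal implementation; the entity type names in the search query are assumptions to verify against your account, and result pagination is omitted for brevity.

```python
# A minimal sketch: list infrastructure entities without alert conditions.
# Assumes a New Relic user API key with NerdGraph access; pagination
# (results.nextCursor) is omitted for brevity.
import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"
API_KEY = "NRAK-..."  # placeholder: your New Relic user key

QUERY = """
{
  actor {
    entitySearch(query: "domain = 'INFRA' AND type IN ('HOST', 'KUBERNETES_POD')") {
      results {
        entities {
          guid
          name
          entityType
          ... on AlertableEntityOutline { alertSeverity }
        }
      }
    }
  }
}
"""

resp = requests.post(NERDGRAPH_URL, headers={"API-Key": API_KEY}, json={"query": QUERY})
resp.raise_for_status()
entities = resp.json()["data"]["actor"]["entitySearch"]["results"]["entities"]

# NOT_CONFIGURED means no alert conditions target the entity.
uncovered = [e for e in entities if e.get("alertSeverity") == "NOT_CONFIGURED"]
for e in uncovered:
    print(f"{e['entityType']}: {e['name']} ({e['guid']})")
```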
Understanding your score
- Pass (Green): All infrastructure entities have at least one alert condition defined
- Fail (Red): One or more infrastructure entities lack alert coverage
- Target: 100% alert coverage across all critical infrastructure components
What this means:
- Passing score: Your infrastructure monitoring foundation is in place
- Failing score: Some infrastructure components could fail without alerting your team
How to improve infrastructure alert coverage
If your score shows missing infrastructure alerts, follow these steps to establish comprehensive coverage:
1. Identify uncovered infrastructure
- Review the failing entities: Identify which specific hosts or pods lack alert coverage (the entity search sketch above produces this list)
- Prioritize by criticality: Focus first on production systems and business-critical infrastructure
- Assess monitoring gaps: Determine if missing alerts represent actual monitoring gaps or intentional exclusions
2. Set up essential infrastructure alerts
For each infrastructure entity, configure alerts for these critical metrics (an illustrative setup sketch follows each list):
Host monitoring alerts:
- CPU utilization: Alert when CPU usage exceeds 80% for 5 minutes
- Memory usage: Alert when memory utilization exceeds 85% for 5 minutes
- Disk space: Alert when disk usage exceeds 90% or available space drops below 1 GB
- Host availability: Alert when the host stops reporting data for 3 minutes
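As a starting point for the CPU condition, the sketch below uses NerdGraph's alertsNrqlConditionStaticCreate mutation. The account ID and policy ID are placeholders for an existing alert policy, the thresholds simply mirror the list above, and you should verify the input field names against the NerdGraph schema in your account before relying on this.

```python
# Illustrative only: create a static NRQL alert condition for host CPU via
# NerdGraph. ACCOUNT_ID and POLICY_ID are placeholders for an existing policy.
import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"
API_KEY = "NRAK-..."   # placeholder user key
ACCOUNT_ID = 1234567   # placeholder
POLICY_ID = "7654321"  # placeholder

MUTATION = """
mutation($accountId: Int!, $policyId: ID!) {
  alertsNrqlConditionStaticCreate(accountId: $accountId, policyId: $policyId, condition: {
    name: "Host CPU > 80% for 5 minutes"
    enabled: true
    nrql: { query: "SELECT average(cpuPercent) FROM SystemSample FACET entityGuid" }
    terms: [{
      operator: ABOVE
      priority: CRITICAL
      threshold: 80
      thresholdDuration: 300        # 5 minutes, in seconds
      thresholdOccurrences: ALL
    }]
    violationTimeLimitSeconds: 86400
  }) { id name }
}
"""

resp = requests.post(
    NERDGRAPH_URL,
    headers={"API-Key": API_KEY},
    json={"query": MUTATION, "variables": {"accountId": ACCOUNT_ID, "policyId": POLICY_ID}},
)
resp.raise_for_status()
print(resp.json())
```

Because the query facets on entityGuid, a single condition like this covers every host reporting SystemSample data rather than requiring one condition per host.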
Kubernetes pod alerts:
- Pod restart frequency: Alert when pods restart more than 3 times in 10 minutes
- Container resource limits: Alert when containers approach CPU or memory limits
- Pod availability: Alert when pods are not in a Running state for more than 2 minutes
- Node resource pressure: Alert when nodes experience memory or disk pressure
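For the pod conditions, the NRQL might look like the following, shown as Python strings that could be plugged into the mutation sketch above. The event and attribute names (K8sPodSample, K8sContainerSample, restartCount, and so on) come from New Relic's Kubernetes integration; confirm them against the data your cluster actually reports.

```python
# Candidate NRQL queries for the Kubernetes alerts above; each could be passed
# as nrql.query in an alertsNrqlConditionStaticCreate call.
POD_ALERT_QUERIES = {
    # Restart growth within the evaluation window (pair with a 600-second window)
    "Pod restart frequency":
        "SELECT max(restartCount) - min(restartCount) FROM K8sContainerSample FACET podName",
    # Memory working set as a share of the container limit
    # (containers without limits report no memoryLimitBytes and are skipped)
    "Container memory vs. limit":
        "SELECT average(memoryWorkingSetBytes / memoryLimitBytes) * 100 "
        "FROM K8sContainerSample FACET containerName",
    # Pods stuck outside the Running phase (Succeeded is excluded so
    # completed Jobs don't fire the alert)
    "Pod availability":
        "SELECT uniqueCount(podName) FROM K8sPodSample "
        "WHERE status NOT IN ('Running', 'Succeeded') FACET namespaceName",
}
```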
3. Configure alert conditions effectively
Use appropriate thresholds:
- Start with less sensitive thresholds and tighten them as you learn your environment's normal behavior
- Consider different thresholds for development, staging, and production environments (a per-environment sketch follows this list)
- Account for expected usage patterns (e.g., batch processing jobs, traffic spikes)
Set proper evaluation windows:
- Use longer windows (5-10 minutes) for metrics that naturally fluctuate
- Use shorter windows (1-3 minutes) for availability and critical failure conditions
- Avoid overly sensitive alerts that trigger on temporary spikes
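One way to keep thresholds consistent across environments is to define them as data and generate conditions from that table. The numbers below are illustrative examples, not universal recommendations:

```python
# Per-environment CPU threshold table: same metric, looser settings outside
# production. Values are (threshold %, evaluation window in seconds).
CPU_THRESHOLDS = {
    "production":  (80, 300),  # 5-minute sustained breach
    "staging":     (90, 600),  # tolerate more noise
    "development": (95, 900),  # alert only on extremes
}

for env, (threshold, window) in CPU_THRESHOLDS.items():
    print(f"{env}: alert when CPU > {threshold}% for {window // 60} minutes")
```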
4. Establish alert routing and escalation
- Define notification channels: Set up email, Slack, or PagerDuty integrations
- Assign responsible teams: Ensure alerts reach the teams who can respond
- Create escalation procedures: Define what happens if initial alerts aren't acknowledged
- Test notification delivery: Verify alerts actually reach the intended recipients (a minimal smoke test follows this list)
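A quick way to verify delivery end to end is to post a test message through the same channel your alert workflow uses. The sketch below assumes a Slack incoming webhook; the URL is a placeholder for your own webhook.

```python
# Smoke-test a Slack incoming webhook used by an alert workflow.
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

resp = requests.post(
    WEBHOOK_URL,
    json={"text": ":rotating_light: Test alert: verifying infrastructure alert routing"},
)
resp.raise_for_status()
print("Notification delivered")
```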
Measuring improvement
Track these metrics to verify your infrastructure alert coverage improvements:
- Coverage percentage: Aim for 100% alert coverage on production infrastructure (the helper below computes this from the entity list fetched earlier)
- Alert effectiveness: Monitor how often infrastructure alerts help prevent application issues
- Response times: Measure how quickly teams respond to infrastructure alerts
- False positive rate: Ensure alerts are tuned to avoid unnecessary noise
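Coverage percentage falls out directly from the entity list returned by the entity-search sketch earlier on this page, treating anything other than NOT_CONFIGURED as covered:

```python
def coverage_pct(entities: list[dict]) -> float:
    """Share of entities with at least one alert condition, i.e. alertSeverity
    present and not NOT_CONFIGURED. `entities` is the list returned by the
    entity-search sketch earlier on this page."""
    if not entities:
        return 100.0
    covered = [e for e in entities
               if e.get("alertSeverity") not in (None, "NOT_CONFIGURED")]
    return 100 * len(covered) / len(entities)
```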
Common scenarios and solutions
Legacy or decommissioned infrastructure:
- Problem: Old hosts or containers still appear in monitoring but don't need alerts
- Solution: Remove unused entities from monitoring or tag them as non-production to exclude from coverage requirements
Development and testing environments:
- Problem: Dev/test infrastructure clutters alert coverage metrics
- Solution: Use tags or naming conventions to separate environments and focus coverage rules on production systems (see the tag filter example below)
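Entity search queries can filter on tags, so if production entities carry a tag such as env:production (your tag key may differ), the coverage check can be scoped like this:

```python
# Scope the coverage query to production hosts via a tag filter.
SEARCH_QUERY = "domain = 'INFRA' AND type = 'HOST' AND tags.env = 'production'"
```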
Specialized infrastructure:
- Problem: Some infrastructure requires custom monitoring approaches
- Solution: Create alert templates tailored to each infrastructure type (databases, load balancers, etc.) rather than applying one generic condition set
Cloud auto-scaling resources:
- Problem: Dynamically created instances may not inherit alert configurations
- Solution: Use infrastructure templates or automation so new instances get proper alert coverage; for example, NRQL conditions that FACET on entityGuid automatically evaluate newly reporting hosts without manual setup
Advanced considerations
Customizing coverage rules
You may need to adjust the scorecard rule if:
- Different entity types: Your infrastructure includes other entity types such as databases or load balancers (example query expansions follow this list)
- Environment segregation: You want to focus only on production infrastructure
- Business criticality: Some infrastructure is more critical than others
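Extending the check to other entity types is usually just a matter of widening the entity search. The domain/type pairs below are examples only; look up the exact values for your integrations in your account's entity catalog.

```python
# Example query expansions for a broader coverage check; verify the type names
# against the entities your account actually reports.
SEARCH_QUERIES = [
    "domain = 'INFRA' AND type = 'HOST'",
    "domain = 'INFRA' AND type = 'KUBERNETES_POD'",
    "domain = 'INFRA' AND type = 'AWSRDSDBINSTANCE'",    # RDS databases (example)
    "domain = 'INFRA' AND type = 'AWSALBLOADBALANCER'",  # ALB load balancers (example)
]
```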
Integration with other monitoring tools
If you use multiple monitoring tools:
- Ensure alert coverage doesn't create duplicate notifications
- Coordinate with existing monitoring systems to avoid gaps
- Consider using New Relic as a central aggregation point for infrastructure alerts
Important considerations
- Start with critical systems: Focus first on production infrastructure that directly impacts customers
- Balance coverage with noise: Ensure comprehensive coverage doesn't create alert fatigue
- Regular maintenance: Review and update alert conditions as your infrastructure evolves
- Team readiness: Ensure teams can actually respond to the alerts you're creating
Next steps
- Immediate action: Set up basic alerts for any infrastructure currently lacking coverage
- Ongoing monitoring: Review this scorecard rule weekly to maintain coverage as infrastructure changes
- Advance to Level 2: Once infrastructure alerting is established, focus on proactive monitoring practices
For detailed guidance on infrastructure monitoring setup, see our Infrastructure monitoring documentation.