Use New Relic Infrastructure's Host not reporting alert condition to notify you when we have stopped receiving data from an Infrastructure agent. This Infrastructure feature allows you to dynamically alert on groups of hosts, configure the time window from five to 60 minutes, and take full advantage of New Relic Alerts.
Anyone can view alerts tied to your account. Only Owner, Admins, or Add-on Managers can create, modify, or delete conditions.
You can define conditions based on the sets of hosts most important to you, and configure thresholds appropriate for each filter set. The Host not reporting event triggers when data from the Infrastructure agent does not reach our collector within the time frame you specify.
This feature's flexibility allows you to easily customize what to monitor and when to notify selected individuals or teams. In addition, the email notification includes links to help you quickly troubleshoot the situation.
|Host not reporting condition|Features|
|---|---|
|What to monitor|You can use filter sets to select which hosts you want to be monitored with the alert condition. The alert condition will also automatically apply to any hosts you add in the future that match these filters.|
|How to notify|Alert conditions are contained in alert policies. You can select an existing policy or create a new policy with email notifications from the Infrastructure UI. If you want to create a new policy with other types of notification channels, use the Alerts UI.|
|When to notify|Email addresses (identified in the alert policy) will be notified automatically about threshold violations for any host matching the filters you have applied, depending on the policy's incident preferences.|
|Where to troubleshoot|The link at the top of the email notification will take you to the Infrastructure Events page centered on the time when the host disconnected. Additional links in the email will take you to additional details in Alerts.|
Create "host not reporting" condition
To define the Host not reporting alert criteria:
- Follow standard procedures to create an Infrastructure alert condition.
- Select Host not reporting as the Alert type.
- Define the Critical threshold for triggering the alert notification: minimum 5 minutes, maximum 60 minutes.
- Optional: Enable the Don't trigger alerts for hosts that perform a clean shutdown option if you want to prevent false alerts when hosts are shut down intentionally, for example from the command line.
Currently this feature is supported on all Windows systems and Linux systems using systemd.
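The settings chosen in the steps above can be sketched as a small data structure. The following Python example is only an illustration: the function name and the payload field names (`infra_host_not_reporting`, `critical_threshold`, `duration_minutes`, `no_trigger_on`, `where_clause`) are assumptions chosen for this sketch, so check them against the current API reference before sending anything to a real endpoint. The only rule taken directly from this page is the threshold range of 5 to 60 minutes.

```python
# Minimal sketch: assemble and validate the settings for a
# "Host not reporting" condition. Field names are assumptions
# for illustration, not a confirmed API contract.

MIN_MINUTES = 5   # minimum Critical threshold stated on this page
MAX_MINUTES = 60  # maximum Critical threshold stated on this page


def build_host_not_reporting_condition(name, policy_id, duration_minutes,
                                       where_clause=None,
                                       ignore_clean_shutdown=False):
    """Return a dict describing the condition, or raise ValueError."""
    if not MIN_MINUTES <= duration_minutes <= MAX_MINUTES:
        raise ValueError(
            f"Critical threshold must be between {MIN_MINUTES} and "
            f"{MAX_MINUTES} minutes, got {duration_minutes}"
        )

    critical = {"duration_minutes": duration_minutes}
    if ignore_clean_shutdown:
        # Hypothetical flag mirroring the clean shutdown option.
        critical["no_trigger_on"] = ["shutdown"]

    data = {
        "type": "infra_host_not_reporting",
        "name": name,
        "enabled": True,
        "policy_id": policy_id,
        "critical_threshold": critical,
    }
    if where_clause:
        # Hypothetical filter expression selecting the hosts to monitor.
        data["where_clause"] = where_clause
    return {"data": data}
```

Validating the range before building the payload means an out-of-range threshold fails early, in your own tooling, rather than when the condition is submitted.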
When a host exceeds the Critical threshold defined for the alert condition, the alert policy determines which notification channels we use, depending on its incident preferences. To avoid false positives, the host must stop reporting for the entire time period before a violation is opened.
Example: You create a condition to open a violation when any host in the filtered set stops reporting data for seven minutes.
- If any host stops reporting for five minutes, then resumes reporting, the condition does not open a violation.
- If any host stops reporting for seven minutes, even if the others are fine, the condition does open a violation.
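The seven-minute example can be expressed as a short check. This Python sketch is not the actual evaluation logic used by the collector; the function name and the `last_seen` mapping are hypothetical, and it assumes a violation opens once a host has been silent for at least the full threshold.

```python
from datetime import datetime, timedelta


def hosts_in_violation(last_seen, now, threshold_minutes):
    """Return the sorted names of hosts silent for the whole threshold.

    last_seen maps a host name to the time of its most recent report.
    Assumption: silence equal to the threshold already counts.
    """
    threshold = timedelta(minutes=threshold_minutes)
    return sorted(
        host for host, seen in last_seen.items()
        if now - seen >= threshold
    )


# Usage: three hosts evaluated against a seven-minute threshold.
now = datetime(2024, 1, 1, 12, 0)
last_seen = {
    "web-1": now - timedelta(minutes=5),  # silent for 5 minutes: no violation
    "web-2": now - timedelta(minutes=7),  # silent for 7 minutes: violation
    "web-3": now,                         # still reporting
}
violating = hosts_in_violation(last_seen, now, threshold_minutes=7)
```

Because each host is evaluated independently, one silent host is enough to open a violation even when every other host in the filter set is healthy.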
Investigate the problem
To further investigate why a host is not reporting data:
- Review the details in the alert email notification.
- Use the link from the email notification to monitor ongoing changes in your environment from Infrastructure's Events page. For example, use the Events page to help determine if a host disconnected right after a root user made a configuration change to the host.
- Optional: Use the email notification's Acknowledge link to verify you are aware of and taking ownership of the alerting incident.
- Use the email links to examine additional details in the Incident details page in Alerts.
We can distinguish between unexpected situations and planned situations with the option Don't trigger alerts for hosts that perform a clean shutdown. Use this option for situations such as:
- Host has been taken offline intentionally.
- Host has planned downtime for maintenance.
- Host has been shut down or decommissioned.
- Autoscaling hosts or shutting down instances in a cloud console.
We rely on Linux and Windows shutdown signals to flag a clean shutdown.
We confirmed that these scenarios are detected by the agent:
- AWS Auto Scaling event with EC2 instances that use systemd (Amazon Linux, CentOS/Red Hat 7 and newer, Ubuntu 16 and newer, SUSE 12 and newer, Debian 9 and newer)
- User-initiated shutdown of Windows systems
- User-initiated shutdown of Linux systems that use systemd (Amazon Linux, CentOS/Red Hat 7 and newer, Ubuntu 16 and newer, SUSE 12 and newer, Debian 9 and newer)
We know that these scenarios are not detected by the agent:
- User-initiated shutdown of Linux systems that don't use systemd (CentOS/Red Hat 6 and earlier, Ubuntu 14, Debian 8). This includes other, more modern Linux systems that still use Upstart or SysV init systems.
- AWS Auto Scaling event with EC2 instances that don't use systemd (CentOS/Red Hat 6 and earlier, Ubuntu 14, Debian 8). This includes other, more modern Linux systems that still use Upstart or SysV init systems.
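To check in advance whether a given host falls into the detected or the undetected group, you can test whether it was booted with systemd. The Python sketch below is an illustration rather than part of the Infrastructure agent: both function names are hypothetical, and it relies on the convention that the directory `/run/systemd/system` exists on systems running systemd. The `root` parameter exists only so the check can be pointed at a different directory tree for testing.

```python
import os
import platform


def uses_systemd(root="/"):
    """Return True if the system under `root` appears to run systemd."""
    return os.path.isdir(os.path.join(root, "run", "systemd", "system"))


def clean_shutdown_detectable(system=None, root="/"):
    """Apply the support rules listed above to a host.

    Windows: supported. Linux: supported only with systemd.
    Any other platform is treated as unsupported (an assumption).
    """
    system = system or platform.system()
    if system == "Windows":
        return True
    if system == "Linux":
        return uses_systemd(root)
    return False


if __name__ == "__main__":
    print("Clean shutdown detectable:", clean_shutdown_detectable())
```

Running this on each host image before enabling the clean shutdown option shows which hosts may still open violations when they shut down intentionally.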