Create infrastructure "host not reporting" conditions

Tip

NRQL condition guided mode offers a curated experience for creating infrastructure "host not reporting" (HNR) NRQL conditions. This is the preferred alternative to creating infrastructure "host not reporting" conditions.

Use infrastructure monitoring's Host not reporting condition to notify you when we've stopped receiving data from an infrastructure agent. This feature allows you to dynamically alert on groups of hosts, configure the time window from five to 60 minutes, and take full advantage of notifications.

Features

You can define conditions based on the sets of hosts most important to you, and configure thresholds appropriate for each filtered set of hosts. The Host not reporting event triggers when data from the infrastructure agent doesn't reach our collector within the time frame you specify.

Caution

If you filter your Host not reporting condition using tags or labels and then remove a critical tag or label from a targeted host, the system opens a Host not reporting incident, because it treats that host as having lost its connection.
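The caution above can be pictured with a small sketch. It assumes, purely for illustration, that the condition only "sees" data from hosts whose tags match its filter; the function names and the `environment` and `team` tags are hypothetical and are not part of any New Relic API.

```python
def matches_filter(host_tags, condition_filter):
    """True when the host carries every tag/value pair required by the filter."""
    return all(host_tags.get(key) == value for key, value in condition_filter.items())


def looks_not_reporting(host_tags, condition_filter, agent_is_sending):
    """Assumed model: a targeted host looks silent to the condition when
    its data stops arriving OR stops matching the condition's filter."""
    return not (agent_is_sending and matches_filter(host_tags, condition_filter))


# Hypothetical filter and tags, for illustration only.
condition_filter = {"environment": "production"}
tags = {"environment": "production", "team": "payments"}

print(looks_not_reporting(tags, condition_filter, agent_is_sending=True))  # False

# Removing the tag the filter depends on makes the host look disconnected,
# even though its agent is still sending data.
del tags["environment"]
print(looks_not_reporting(tags, condition_filter, agent_is_sending=True))  # True
```

In this model, the agent never stopped sending data; only the tag change caused the host to drop out of the filtered set.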

This feature's flexibility allows you to easily customize what to monitor and when to notify selected individuals or teams. In addition, the email notification includes links to help you quickly troubleshoot the situation.

Host not reporting condition	Features
What to monitor	You can use the entity filter bar to select which hosts you want to be monitored with the alert condition. The condition will also automatically apply to any hosts you add in the future that match these filters.
How to notify	Conditions are contained in policies. You can select an existing policy or create a new policy with email notifications from the infrastructure monitoring UI. If you want to create a new policy with other types of notification channels, use the UI.
When to notify	Email addresses (identified in the policy) will be notified automatically about threshold incidents for any host matching the filters you have applied, depending on the policy's incident preferences.
Where to troubleshoot	The link at the top of the email notification will take you to the infrastructure Events page centered on the time when the host disconnected. Additional links in the email will take you to additional detail.

Create a "host not reporting" condition

To define the Host not reporting condition criteria:

Create an infrastructure condition.
For the Alert type, select Host not reporting.
Define the Critical threshold for triggering a notification: between 5 and 60 minutes of host unresponsiveness.
(Optional) Enable the Don't trigger alerts for hosts that perform a clean shutdown option to prevent false alerts when hosts are intentionally shut down via command line. This option is currently supported on Windows and systemd-based Linux systems.
Tip
To avoid false "Host not reporting" incidents for intentionally shut down hosts, consider these strategies:
- Tag the host: Add the hostStatus: shutdown or termination: expected tag to the host entity. Learn more about tags.
- Tag the host and enable the Don't trigger alerts setting: Add the hostStatus: shutdown tag to your host and select the option described above. This stops all Host not reporting incidents from opening for that host for as long as the tag is present, regardless of the agent version or OS. If you remove the tag, New Relic resumes opening Host not reporting incidents for that host.

The policy's incident preferences determine which notification channels are used when the Critical threshold defined for the condition is passed. To avoid "false positives," the host must stop reporting for the entire time period before an incident is opened.

Example: You create a condition to open an incident when any of the filtered set of hosts stop reporting data for seven minutes.

If any host stops reporting for five minutes, then resumes reporting, the condition does not open an incident.
If any host stops reporting for seven minutes, even if the others are fine, the condition does open an incident.
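The seven-minute example above can be sketched as follows. This is a minimal illustration of the rule, not the actual evaluation logic used by the service; the function name `should_open_incident` and the constants are assumptions introduced here.

```python
from datetime import datetime, timedelta

# Allowed range for the Critical threshold, as stated in the steps above.
MIN_THRESHOLD = timedelta(minutes=5)
MAX_THRESHOLD = timedelta(minutes=60)


def should_open_incident(last_report, now, threshold):
    """Return True only when the host has been silent for the whole threshold window."""
    if not MIN_THRESHOLD <= threshold <= MAX_THRESHOLD:
        raise ValueError("threshold must be between 5 and 60 minutes")
    return now - last_report >= threshold


now = datetime(2024, 1, 1, 12, 0)
threshold = timedelta(minutes=7)

# Silent for five minutes, then resumes: no incident.
print(should_open_incident(now - timedelta(minutes=5), now, threshold))  # False

# Silent for the full seven minutes: an incident opens.
print(should_open_incident(now - timedelta(minutes=7), now, threshold))  # True
```

Each host is evaluated on its own, which is why a single silent host is enough to open an incident even when the rest of the filtered set is healthy.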

Investigate the problem

To further investigate why a host is not reporting data:

Review the details in the email notification.
Use the link from the email notification to monitor ongoing changes in your environment from the Events page in our infrastructure UI. For example, use the Events page to help determine if a host disconnected right after a root user made a configuration change to the host.
Optional: Use the email notification's Acknowledge link to confirm that you are aware of the alerting incident and are taking ownership of it.
Use the email links to examine additional details in the Incident details page.

Intentional outages

We can distinguish between unexpected situations and planned situations with the option Don't trigger alerts for hosts that perform a clean shutdown. Use this option for situations such as:

Host has been taken offline intentionally.
Host has planned downtime for maintenance.
Host has been shut down or decommissioned.
Autoscaling hosts or shutting down instances in a cloud console.

We rely on Linux and Windows shutdown signals to flag a clean shutdown.

We've confirmed that these scenarios are detected by the agent:

AWS Auto Scaling event with EC2 instances that use systemd (Amazon Linux, CentOS/Red Hat 7 and newer, Ubuntu 16 and newer, SUSE 12 and newer, Debian 9 and newer)
User-initiated shutdown of Windows systems
User-initiated shutdown of Linux systems that use systemd (Amazon Linux, CentOS/Red Hat 7 and newer, Ubuntu 16 and newer, SUSE 12 and newer, Debian 9 and newer)

We know that these scenarios are not detected by the agent:

User-initiated shutdown of Linux systems that don't use systemd (CentOS/Red Hat 6 and earlier, Ubuntu 14, Debian 8). This also applies to more modern Linux systems that still use Upstart or SysV init.
AWS Auto Scaling event with EC2 instances that don't use systemd (CentOS/Red Hat 6 and earlier, Ubuntu 14, Debian 8). This also applies to more modern Linux systems that still use Upstart or SysV init.
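To estimate which of the two lists above a given host falls into, you can check whether it is running Windows or a Linux system booted with systemd. The sketch below is a rough local check and not what the infrastructure agent itself does; the function name is hypothetical, and it relies on the convention that the directory /run/systemd/system exists when systemd is the running init system.

```python
import os
import platform


def clean_shutdown_likely_detected(system=None, systemd_dir="/run/systemd/system"):
    """Rough guess at whether a clean shutdown on this host would be detected.

    Assumption: Windows hosts and systemd-based Linux hosts are covered;
    Upstart, SysV init, and other platforms are not.
    """
    system = system or platform.system()
    if system == "Windows":
        return True
    if system == "Linux":
        # The directory is present only when systemd booted the system.
        return os.path.isdir(systemd_dir)
    return False


print(clean_shutdown_likely_detected())
```

If the check returns False for a host that you shut down intentionally, consider the tagging strategy described earlier instead of relying on shutdown detection.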