Improve your alerts coverage by implementing the following recommendations and get the most out of your alerts configuration.
Read on to learn the best practices for:
- Notification channels
- Incident preferences
- Thresholds and violations
- Muting rules
Use recommended alert conditions if you are new to alerts or if you want suggestions that optimize your alert coverage.
A policy is a container for similar conditions.
If you’re new to alerts, learn how to create, edit, or find policies.
Limit each policy's scope to a single entity when possible. Assign the policy to the essential team or teams that need to be notified when an incident occurs. This way, you keep policies centralized and focused.
If a team is monitoring several groups of the same entity type, combine those entity clusters (like servers) together into one policy. This way, your team can be notified from one policy rather than navigating several policies at once.
You can use alerts to monitor all of your entities. Consider your role and priorities when assigning yourself to a policy.
- Operations personnel may need notifications for poor back-end performance, such as server memory and load averages.
- The product owner may need notifications for positive front-end performance, such as improved end user Apdex scores or sales being monitored in dashboards.
Set meaningful threshold levels to optimize alerts for your business. Here are some suggested guidelines:
Set threshold levels
Avoid setting thresholds too low. For example, if you set a CPU condition threshold of 75% for 5 minutes on your production servers, and CPU usage routinely exceeds that level, you increase the likelihood of unactionable alerts or false positives.
Experimenting with settings
You do not need to edit files or restart software, so feel free to make quick changes to your threshold levels and adjust as necessary.
Adjust your conditions over time.
You can disable any condition in a policy. This is useful, for example, if you want to continue using other conditions in the policy while you experiment with other metrics or thresholds.
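One way to judge whether a candidate threshold is too low is to replay recent data against it and count how often it would have fired. The following is a minimal sketch, not part of the product: the function name and the sample CPU values are hypothetical, and it assumes one sample per minute.

```python
from typing import Sequence


def count_sustained_breaches(samples: Sequence[float], threshold: float, window: int) -> int:
    """Count runs where the value stays above `threshold` for `window` consecutive samples."""
    breaches = 0
    run = 0
    for value in samples:
        if value > threshold:
            run += 1
            if run == window:
                breaches += 1  # count each sustained run once
        else:
            run = 0
    return breaches


# Hypothetical per-minute CPU percentages for one production host.
cpu = [70, 78, 80, 82, 79, 77, 60, 76, 81, 83, 85, 80, 79, 65]

print(count_sustained_breaches(cpu, threshold=75, window=5))  # fires twice on routine load
print(count_sustained_breaches(cpu, threshold=90, window=5))  # never fires
```

If a threshold fires repeatedly on data that reflects normal operation, it is probably too low to be actionable.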
In most of our products (except Infrastructure), the color-coded health status indicator in the user interface changes as the alerting threshold escalates or returns to normal. This allows you to monitor a situation through our UI before a critical threshold passes, without needing to receive specific notifications about it.
There are two violation thresholds: critical (red) and warning (yellow). Define these thresholds with different criteria, keeping in mind the suggestions above.
Warning violations do not open incidents. A critical violation can open incidents, but you must define that decision through your incident preferences.
Loss of signal occurs when New Relic stops receiving data for a while; technically, we detect loss of signal after a significant amount of time has elapsed since data was last received in a time series. Loss of signal can be used to trigger or resolve a violation, which you can use to set up alerts.
You can configure loss of signal settings by condition in the UI or configure loss of signal via the NerdGraph API.
Assuming you're sending an event to New Relic as part of your batch job, you can set up an alert condition to notify you if your batch jobs fail to run.
- Set up a simple count query on the event (SELECT count(*) FROM MyBatchEvent)
- Set Loss of Signal to open a new violation after 24 hours and 30 minutes (you can adjust this, but it's a good idea to allow for a late-running batch job)
- Make sure to use the Event Timer streaming aggregation method. Since you will only ever get 1 data point every 24 hours, you can set the Timer to its lowest setting, 5 seconds.
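The steps above can be summarized in a short sketch. The settings dictionary below uses illustrative key names that are assumptions, not the actual UI or NerdGraph field names, and the `signal_lost` helper only models the idea of loss of signal: no data received for longer than the expiration period.

```python
from datetime import datetime, timedelta

# 24 hours between runs plus a 30-minute allowance for a late-running job.
EXPIRATION = timedelta(hours=24, minutes=30)

# Hypothetical settings record; the key names are illustrative only.
condition_settings = {
    "query": "SELECT count(*) FROM MyBatchEvent",
    "expiration_seconds": int(EXPIRATION.total_seconds()),
    "open_violation_on_expiration": True,
    "aggregation_method": "event_timer",
    "aggregation_timer_seconds": 5,
}


def signal_lost(last_event: datetime, now: datetime, expiration: timedelta = EXPIRATION) -> bool:
    """Return True when no event has arrived within the expiration period."""
    return now - last_event > expiration


last_run = datetime(2024, 1, 1, 2, 0)
print(signal_lost(last_run, datetime(2024, 1, 2, 2, 15)))  # 24h15m since last event
print(signal_lost(last_run, datetime(2024, 1, 2, 2, 45)))  # 24h45m since last event
```

With these example timestamps, a job that runs 15 minutes late does not trigger a violation, while one that is 45 minutes overdue does.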
By default, gaps in data signals are filled with null values. In cases where you need to be able to create conditions based on those data gaps, you can fill gaps with a custom value or the last known value. You can configure this setting by condition in the UI or configure gap filling values via NerdGraph.
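To illustrate the difference between the three gap-filling behaviors, here is a minimal sketch. The function and the strategy names ("none", "static", "last_value") are hypothetical labels chosen for this example; they model the behavior described above rather than any actual setting name.

```python
from typing import List, Optional


def fill_gaps(
    series: List[Optional[float]],
    strategy: str = "none",
    static_value: float = 0.0,
) -> List[Optional[float]]:
    """Fill missing points (None) using the chosen strategy."""
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for point in series:
        if point is not None:
            last = point
            filled.append(point)
        elif strategy == "static":
            filled.append(static_value)
        elif strategy == "last_value":
            filled.append(last)  # remains None until a first value has been seen
        else:
            filled.append(None)  # default: leave the gap as null
    return filled


signal = [3.0, None, None, 5.0]
print(fill_gaps(signal))
print(fill_gaps(signal, strategy="static", static_value=0.0))
print(fill_gaps(signal, strategy="last_value"))
```

A condition such as "value below 1" can only evaluate during the gap if the gap is filled with something; with null values there is nothing to compare against the threshold.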
Decide when you get incident notifications so you can respond to incidents when they happen.
If you’re new to alerts, learn more about your incident preferences options.
The default incident preference setting combines all conditions within a policy into one incident. Change your default incident preference setting to increase or decrease the number of incidents and incident notifications you receive.
Each team within your organization will have different needs. Ask your team two important questions when deciding your incident preferences:
- Do we want to be notified every time something goes wrong?
- Do we want to group all similar notifications together and be notified once?
When a policy and its conditions have a broader scope (like managing the performance of several entities), increase the number of incidents you receive. You will need more notifications because two incidents will not necessarily relate to each other.
When a policy and its conditions have a focused scope (like managing the performance of one entity), opt for the default incident preference. You will need fewer notifications when two incidents are related to each other or when the team is already notified and fixing an existing problem.
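The effect of the incident preference on how many incidents you receive can be sketched as a grouping problem. In the example below, "by_policy" models the default described above (all conditions in a policy roll up into one incident). The other two preference names and the sample violations are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Violation(NamedTuple):
    policy: str
    condition: str
    entity: str


# Each preference decides which fields form the grouping key for an incident.
KEY_FUNCS = {
    "by_policy": lambda v: (v.policy,),
    "by_condition": lambda v: (v.policy, v.condition),
    "by_condition_and_entity": lambda v: (v.policy, v.condition, v.entity),
}


def group_into_incidents(
    violations: List[Violation], preference: str
) -> Dict[Tuple[str, ...], List[Violation]]:
    """Group violations into incidents according to the chosen preference."""
    key = KEY_FUNCS[preference]
    incidents: Dict[Tuple[str, ...], List[Violation]] = defaultdict(list)
    for violation in violations:
        incidents[key(violation)].append(violation)
    return dict(incidents)


violations = [
    Violation("web", "cpu", "host-1"),
    Violation("web", "cpu", "host-2"),
    Violation("web", "memory", "host-1"),
]

for preference in KEY_FUNCS:
    print(preference, len(group_into_incidents(violations, preference)))
```

The same three violations produce one, two, or three incidents depending on the preference, which is the trade-off between fewer grouped notifications and more granular ones.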
Decide how you get incident notifications by using our notification channel best practices.
Tailor notifications to the most useful channel and policy so you can avoid alert fatigue and help the right personnel receive and respond to incidents they care about in a systematic way.
If you’re new to alerts, learn how to set up notification channels.
Notify teams and individuals who need to stay updated on or resolve a problem when an incident arises.
To stay updated, select a notification channel that is less intrusive, like email.
For vital notifications and incident responses, select a notification channel that is more intrusive, like PagerDuty or Slack.
Do not rely on email for time-sensitive notifications, because email delivery can be delayed.
Mute alerts during routine events, such as maintenance or planned downtime.
You can also silence a policy, a specific entity, and a condition when needed. Incidents can still be opened, but you won't be notified.
If you're new to alerts, learn how to create and manage muting rules.
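As a rough model of how muting behaves, the sketch below treats a muting rule as a time window plus optional filters on policy, condition, and entity. The `MutingRule` class and `should_notify` function are hypothetical and exist only for this example. The key point it illustrates is that muting suppresses the notification, not the incident itself.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class MutingRule:
    start: datetime
    end: datetime
    policy: Optional[str] = None     # None means "match any"
    condition: Optional[str] = None
    entity: Optional[str] = None

    def matches(self, policy: str, condition: str, entity: str, at: datetime) -> bool:
        """Return True when the rule is active and every set filter matches."""
        if not (self.start <= at < self.end):
            return False
        pairs = ((self.policy, policy), (self.condition, condition), (self.entity, entity))
        return all(want is None or want == got for want, got in pairs)


def should_notify(rules: List[MutingRule], policy: str, condition: str,
                  entity: str, at: datetime) -> bool:
    """The incident still opens; this only decides whether to send a notification."""
    return not any(rule.matches(policy, condition, entity, at) for rule in rules)


# Hypothetical two-hour maintenance window for everything in the "web" policy.
maintenance = MutingRule(datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 1, 3, 0), policy="web")

print(should_notify([maintenance], "web", "cpu", "host-1", datetime(2024, 1, 1, 2, 0)))
print(should_notify([maintenance], "web", "cpu", "host-1", datetime(2024, 1, 1, 4, 0)))
print(should_notify([maintenance], "db", "cpu", "host-9", datetime(2024, 1, 1, 2, 0)))
```

Only the first call falls inside the window and matches the policy filter, so it is the only one that is muted.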
To learn more about using alerts:
- Learn about the API.
- Read technical details about min/max limits and rules.
- Read more about when you might want to use loss-of-signal and gap-filling settings.