Set up proactive alerting: understand and respond to performance issues

The term "alerting" often carries some negative connotations; for too many developers, alerting correlates too closely with errors, mistakes, and ongoing issues. However, for developers who are proactive about alerting, they know they don’t have to stare at their dashboards all day, because effective alerts will tell them when to check in.

Well-defined alerts help you understand the health of your systems, so you can respond to performance problems before they affect your customers.

Prerequisites

This tutorial assumes you have:

1. Define required alerting policies based on Service Level Objectives

A service level objective (SLO) is an agreed-upon means of measuring the performance of your service. An SLO defines a target value for a specified quantitative measure, called a service level indicator (SLI). Example service level indicators include average response time, response time percentile, and application availability. SLOs then set a target value for those SLIs, such as:

  • Average response time should be less than 200 ms.
  • 95% of requests should be completed within 250 ms.
  • Availability of the service should be 99.99%.

These SLOs can also be logically grouped together to provide an overall boolean indicator of whether the service is meeting expectations or not (for example, “95% of requests completed within 250 ms AND availability is 99.99%”), which can be helpful for alerting purposes.
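
To see how a service is tracking against SLOs like these, you can query your APM transaction data with NRQL. The following query is a minimal sketch; it assumes an APM-instrumented application whose name you substitute for the MY-APP-NAME placeholder, and it relies on Transaction durations being reported in seconds:

SELECT average(duration) AS 'Average response time (s)', percentile(duration, 95) AS '95th percentile (s)', percentage(count(*), WHERE duration < 0.25) AS '% of requests under 250 ms' FROM Transaction WHERE appName = 'MY-APP-NAME' SINCE 1 day ago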

Use these SLIs as key performance indicators (KPIs) so your team and organization can measure your service and ensure it's meeting customer expectations. By breaking down the quantitative performance metrics that are required of your services, you can identify what kinds of alerts you should set for each. For instance, you could set an alert to notify you if web transaction times go above half a second, or if the error rate goes higher than 0.20%.
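
As a sketch of how such a KPI can be measured before you alert on it, an error-rate SLI for an APM application could look like the following NRQL query, with MY-APP-NAME as a placeholder application name:

SELECT percentage(count(*), WHERE error IS true) AS 'Error rate (%)' FROM Transaction WHERE appName = 'MY-APP-NAME' SINCE 30 minutes ago TIMESERIES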

However, not every SLO needs to become an alert. A strong alert strategy takes SLOs and creates a set of simple, actionable alerts. New Relic often finds that our most mature DevOps customers set fewer alerts in general and focus those alerts on a core set of metrics that indicate when their customer experience is truly degraded. As a result, New Relic DevOps customers often use Apdex as part of their alerting strategy to align alerts with signs of degraded user satisfaction.
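
If you want to inspect Apdex before building an alert around it, you can query it directly. This example is a sketch that assumes an Apdex T of 0.5 seconds and uses MY-APP-NAME as a placeholder:

SELECT apdex(duration, t: 0.5) FROM Transaction WHERE appName = 'MY-APP-NAME' SINCE 1 day ago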

As you design your alert strategy, keep this question in mind: “If the customer isn’t impacted, is it worth waking someone up?”

For a simple framework of areas to set alerts for, use the following questions and advised metrics and KPIs:

  • Are we open for business? Use New Relic Browser and New Relic APM to alert on site availability (see the example query after this list).
  • How's our underlying infrastructure? Set KPIs for key hardware, network, and storage components.
  • How's the health of our application? Track metrics for JVM performance, queuing, caching, and similar dependencies.
  • How's the overall quality of our application? Use an Apdex score to quickly assess an application's quality.
  • How are our customers doing? Consider real end-user metrics (Browser or APM), synthetic users (Synthetics), and Apdex scores.
  • How's our overall business doing? Focus on key transactions within an application, and tie them to expected business outcomes to show the correlation between application performance and business performance.

2. Set specific alerts for performance, correctness, throughput, availability, and dependencies

With New Relic you can set alerts on your instrumented applications, end-user experience, infrastructure, databases, and more. New Relic will alert you if your site’s availability dips or if your error rate spikes above acceptable levels, as defined by your SLOs. You can set warning thresholds to monitor issues that may be approaching a critical severity but don’t yet warrant a pager notification.

Setting thresholds for when alerts should notify teams can be challenging. Thresholds that are too tight will create alert fatigue while thresholds that are too loose will lead to unhappy customers.

Baseline alerts let you set dynamic alert thresholds based on historical performance. Use baselines to tune your alert to the right threshold. For example, an alert in APM can notify incident response teams if web transaction times deviate from historical performance for a specified amount of time.

Proactive baseline alerts for devops
alerts.newrelic.com > (selected alert policy) > (selected alert condition) > Define thresholds
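
A baseline threshold can also be applied to a NRQL alert condition. As a sketch, the signal below watches average web transaction time for an application (MY-APP-NAME is a placeholder); the baseline threshold you define then determines how far the signal may deviate, and for how long, before a violation opens:

SELECT average(duration) FROM Transaction WHERE appName = 'MY-APP-NAME'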

You can set this same kind of alert in New Relic Browser to catch sub-optimal performance. In the following example, we've set both a warning threshold and a critical (violation) threshold for throughput:

Proactive baseline alerts - Browser thresholds
alerts.newrelic.com > (selected alert policy) > (selected alert condition) > Define thresholds
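
The throughput signal in that example corresponds roughly to page views over time. As a sketch, you can explore it with a NRQL query like this before choosing warning and critical values (MY-APP-NAME is a placeholder):

SELECT count(*) AS 'Page views' FROM PageView WHERE appName = 'MY-APP-NAME' SINCE 1 hour ago TIMESERIES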

As you develop smaller, independent services running on increasingly ephemeral architectures, your environments become significantly more complex. Visibility into outliers is an important tool for understanding likely performance issues. Set alerts to fire automatically when an outlier appears; outliers can indicate misbehaving hosts, load balancers, or apps.

For example, suppose a load balancer divides web traffic approximately evenly across five servers. You can set an alert based on a NRQL query that sends a notification if any server starts receiving significantly more or less traffic than the others. Here's an example graph:

Proactive alerts - outlier detection
alerts.newrelic.com > (selected alert policy) > (selected alert condition) > Define thresholds

And here’s a sample NRQL query:

SELECT average(cpuPercent) FROM SystemSample WHERE apmApplicationNames = 'MY-APP-NAME' FACET hostname

With static, baseline, and outlier alerts in place, you have comprehensive awareness of your ecosystem.

Refer to the New Relic Alerts documentation for more details about optimizing your alerts.

3. Identify groups to alert, and set broadcasting methods

Alerting without the proper broadcasting methods leaves you vulnerable. Your alerts strategy should include a notification channel to ensure the appropriate teams are notified if your application or architecture encounters issues. New Relic has many notification integrations, but we recommend that you start simple and add more complexity later.

We recommend that you first send alerts to a group chat channel (for example, using Slack or HipChat). Evaluate these alerts in real time for several weeks to understand which alerts are indicative of important or problematic issues. These are the types of alerts that warrant waking someone up.

Proactive baseline alerts channels
alerts.newrelic.com > (selected alert policy) > Notification channels

4. Fine-tune alerts and thresholds

As you use New Relic to optimize your application and infrastructure performance, tighten your New Relic Alerts policy conditions to keep pace with your improved performance.

When you roll out new code or modifications that could negatively impact performance for a period of time, loosen your threshold conditions to allow for these short-term changes. For instance, we recommend using pre-established baselines and thresholds to stay efficient during high-impact times for your business, such as annual events and major releases. Fine-tuning gives you the flexibility to increase efficiency and extend your notification channels.

As noted earlier, we recommend you start with a group chat service when you first establish notifications. Once you’ve identified other tools you’d like to integrate with, set up a notification channel to maintain your momentum. Tools such as xMatters and PagerDuty provide popular integrations, but don’t overlook simple methods, such as webhooks. The goal is to continuously improve your alerting scheme.

Be sure to review your alerts regularly and confirm that they're still firing as expected and are still relevant to your customer satisfaction metrics. Use the New Relic platform to create dashboards centered around alerts and incidents for the most common policy conditions and violations.

Proactive baseline alerts - dashboards
insights.newrelic.com > All dashboards > (selected dashboard)

Use the New Relic query language and the Insights query API to create your dashboards. For detailed instructions, check out Sending alerts data to Insights.

The dashboard above was created using the following NRQL queries. We recommend recreating them as needed for your own alerting dashboards.

  • Incidents by condition:

    SELECT count(*) FROM ALERT_NAME WHERE current_state = 'open' FACET condition_name SINCE 1 week ago

  • Incidents by policy:

    SELECT count(*) FROM ALERT_NAME WHERE current_state = 'open' FACET policy_name SINCE 60 MINUTES AGO TIMESERIES

  • Alert trends over time:

    SELECT count(*) FROM ALERT_NAME WHERE current_state IS NOT NULL FACET policy_name SINCE 1 week ago TIMESERIES

  • Incident details:

    SELECT timestamp, incident_id, policy_name, condition_name, details, severity FROM ALERT_NAME SINCE 1 week ago LIMIT 40

Visualizing this data provides a resource you can share with others to refine the alerts and thresholds you’re using.

For a more extensive discussion on notification channels, refer to the incident orchestration tutorial.

Conclusion

Establishing a focused alerts policy helps you pinpoint any degradation that could impact performance in your application or infrastructure. With proactive alerting, you will decrease user-reported incidents and allow your teams to spend less time firefighting and more time deploying significant changes to your product.

For more help

For more tips and best practices for alerting, see the following documentation:

Recommendations for learning more: