
Level 1 - Service error rate scorecard rule

Service error rate measures the percentage of your APM services that experience server-side errors (5xx HTTP responses), which can prevent users from completing critical tasks like purchases, signups, or data access. This scorecard rule helps you identify and prioritize fixing backend issues that directly impact customer experience.

About this scorecard rule

This service error rate rule is part of Level 1 (Reactive) in the digital experience maturity model. It evaluates whether your backend services have unresolved server errors that could be affecting user experience and business operations.

Why this matters: Server errors (5xx responses) indicate that your backend cannot fulfill user requests, leading to failed transactions, broken user flows, and lost business opportunities. Users encountering server errors often abandon their tasks and may not return.

How this rule works

This rule evaluates the percentage of APM services that report 5xx HTTP status code errors in their responses. It identifies backend services where server-side failures may be preventing users from successfully completing their intended actions.
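The metric behind this rule can be sketched as follows. This is a minimal illustration, not the scorecard's actual implementation; the service names and error counts are hypothetical stand-ins for data you would pull from APM:

```python
# Hypothetical per-service 5xx counts, e.g. summarized from APM over a time window.
service_5xx_counts = {
    "checkout-api": 42,
    "auth-service": 0,
    "catalog-api": 0,
    "payment-gateway": 7,
}

def service_error_rate(counts: dict) -> float:
    """Percentage of services reporting at least one 5xx error."""
    if not counts:
        return 0.0
    failing = sum(1 for errors in counts.values() if errors > 0)
    return 100.0 * failing / len(counts)

print(service_error_rate(service_5xx_counts))  # 2 of 4 services -> 50.0
```

A lower result means fewer of your services are surfacing server-side failures to users.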

Understanding your score

  • Pass (Green): A low percentage of your APM services are experiencing server-side errors
  • Fail (Red): A high percentage of your APM services have unresolved 5xx errors
  • Target: Minimize server errors across all services, especially those supporting critical user journeys

What this means:

  • Passing score: Your backend services reliably fulfill user requests and support successful task completion
  • Failing score: Users may be encountering failed requests, broken workflows, or inability to complete important actions

Understanding 5xx server errors

Server errors indicate problems with your backend infrastructure or application code:

Common 5xx error types

  • 500 Internal Server Error: General server failure, often due to application bugs or unhandled exceptions
  • 502 Bad Gateway: Upstream server returned invalid response, common with load balancers or proxies
  • 503 Service Unavailable: Server temporarily overloaded or under maintenance
  • 504 Gateway Timeout: Request timeout when communicating with upstream servers

Impact on user experience

  • Failed transactions: Users can't complete purchases, signups, or data submissions
  • Broken workflows: Multi-step processes fail partway through, frustrating users
  • Lost data: Form submissions or user input may be lost during errors
  • Trust degradation: Repeated errors reduce user confidence in your application

How to reduce service error rates

If your score shows high service error rates, follow these steps to identify and resolve backend issues:

1. Identify and prioritize affected services

  1. Review APM service overview: Examine which services are reporting 5xx errors
  2. Assess business impact: Prioritize services that support critical user journeys (payments, authentication, core features)
  3. Analyze error patterns: Look for trends in error timing, frequency, or affected endpoints
  4. Check user impact: Determine how many users are affected by each service's errors
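One way to turn these steps into a concrete triage order is to score each service by user impact weighted by business criticality. This is a simplified sketch with hypothetical service data and an assumed weighting, not a prescribed formula:

```python
# Hypothetical per-service error data; criticality flags reflect business impact.
services = [
    {"name": "payments", "errors": 120, "users_affected": 300, "critical": True},
    {"name": "search",   "errors": 900, "users_affected": 150, "critical": False},
    {"name": "auth",     "errors": 40,  "users_affected": 500, "critical": True},
]

def impact_score(svc: dict) -> float:
    # Weight user impact more heavily for business-critical services
    # (the 3x multiplier is an illustrative assumption).
    weight = 3.0 if svc["critical"] else 1.0
    return weight * svc["users_affected"]

triage_order = sorted(services, key=impact_score, reverse=True)
print([s["name"] for s in triage_order])  # ['auth', 'payments', 'search']
```

Note that raw error counts alone would put `search` first; weighting by affected users and criticality surfaces `auth` instead.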

2. Investigate root causes

Application-level issues:

  • Unhandled exceptions: Code errors that aren't properly caught and handled
  • Database connection failures: Connection pool exhaustion or database unavailability
  • Resource exhaustion: Memory leaks, CPU overload, or disk space issues
  • Configuration errors: Incorrect settings causing application failures

Infrastructure-level issues:

  • Server capacity problems: Insufficient resources during peak traffic
  • Network connectivity issues: Communication failures between services
  • Load balancer configuration: Improper routing or health check failures
  • Dependency failures: Third-party service outages affecting your application

3. Implement targeted fixes

Immediate resolution:

  • Fix critical bugs: Address application code issues causing 500 errors
  • Scale resources: Add capacity for overloaded services experiencing 503 errors
  • Configure retries: Implement retry logic for transient failures
  • Update health checks: Ensure load balancers properly route traffic
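For the retry point above, a common pattern is exponential backoff with jitter so that many clients retrying a transient failure don't all hit the service again at the same instant. A minimal sketch, with `TransientServerError` standing in for whatever exception your HTTP client raises on a 502/503/504:

```python
import random
import time

class TransientServerError(Exception):
    """Stand-in for a 502/503/504 response from an upstream service."""

def with_retries(call, max_attempts=3, base_delay=0.5):
    """Retry a callable on transient server failures with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientServerError:
            if attempt == max_attempts:
                raise  # exhausted: surface the failure to the caller
            # Jittered exponential backoff avoids synchronized retry storms.
            time.sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))
```

Only retry operations that are safe to repeat (idempotent reads, or writes protected by an idempotency key); retrying a non-idempotent request can duplicate its effect.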

Systematic improvements:

  • Error handling: Add comprehensive try-catch blocks and graceful error responses
  • Circuit breakers: Implement patterns to handle dependency failures gracefully
  • Monitoring enhancements: Add detailed logging and metrics for faster diagnosis
  • Capacity planning: Right-size infrastructure to handle expected load
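The circuit-breaker pattern mentioned above can be sketched in a few lines: after a run of consecutive failures the breaker "opens" and rejects calls immediately, sparing the struggling dependency, then allows a trial call after a cooldown. This is an illustrative toy, not production code (real services typically use a hardened library for this):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    reject calls until a cooldown elapses, then allow a trial call."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: dependency call rejected")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure streak
        return result
```

When the breaker is open, the caller can return a cached or degraded response instead of a 502/504, which keeps the user flow alive while the dependency recovers.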

4. Establish error tracking and resolution

Use New Relic Error Inbox:

  • Centralized error tracking: View all 5xx errors across services in one place
  • Error grouping: Automatically group similar errors to identify patterns
  • Error attribution: Connect errors to specific deployments or changes
  • Resolution tracking: Mark errors as resolved and track remediation progress

Implement defect tracking:

  • Create tickets: Log errors in your issue tracking system (JIRA, GitHub Issues)
  • Assign ownership: Ensure each error has a responsible team or individual
  • Track resolution: Monitor progress on error fixes through to deployment
  • Measure effectiveness: Verify that fixes actually reduce error rates

Measuring improvement

Track these metrics to verify your service error reduction efforts:

  • Error rate reduction: Decreasing percentage of services experiencing 5xx errors
  • User impact metrics: Improved transaction completion rates, reduced user complaints
  • Error resolution time: Faster identification and fixing of server-side issues
  • Service reliability: Increased uptime and successful request rates

Common service error scenarios

Database connection issues:

  • Problem: Connection pool exhaustion or database timeouts causing 500 errors
  • Solution: Optimize connection pooling, implement connection retry logic, monitor database performance

Third-party dependency failures:

  • Problem: External APIs or services failing, causing your application to return 502/503 errors
  • Solution: Implement circuit breakers, fallback mechanisms, and proper timeout handling

Deployment-related errors:

  • Problem: New releases introducing bugs that cause 5xx errors
  • Solution: Improve testing procedures, implement canary deployments, add rollback capabilities

Capacity and scaling issues:

  • Problem: Traffic spikes overwhelming servers, leading to 503 errors
  • Solution: Implement auto-scaling, load testing, and capacity planning

Advanced error management strategies

Error prevention practices

  • Comprehensive testing: Unit tests, integration tests, and load testing to catch issues before production
  • Code reviews: Focus on error handling patterns and edge case coverage
  • Staging environments: Test thoroughly in production-like environments
  • Gradual rollouts: Use feature flags and canary deployments to minimize error impact

Automated error response

  • Auto-scaling: Automatically add capacity when error rates indicate overload
  • Circuit breakers: Automatically isolate failing dependencies to prevent cascading failures
  • Health checks: Automatic removal of unhealthy instances from load balancer rotation
  • Alert integration: Immediate notifications when error rates exceed thresholds
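As a sketch of the alert-threshold idea, the check below flags when the 5xx rate over a rolling window of recent requests exceeds a limit. The window size and 5% threshold are illustrative assumptions; in practice you would configure this in your monitoring platform rather than in application code:

```python
from collections import deque

class ErrorRateAlert:
    """Flag when the 5xx rate over the last N requests exceeds a threshold."""

    def __init__(self, window=100, threshold=0.05):
        self.window = deque(maxlen=window)  # rolling record of recent outcomes
        self.threshold = threshold

    def record(self, status_code: int) -> bool:
        """Record one response; return True if the alert condition is breached."""
        self.window.append(1 if status_code >= 500 else 0)
        rate = sum(self.window) / len(self.window)
        return rate > self.threshold
```

A rolling window like this reacts to sustained degradation rather than a single stray error, which keeps alert noise down.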

Change tracking integration

  • Deployment correlation: Connect error spikes to specific deployments or configuration changes
  • Rollback procedures: Quick reversion capabilities when changes introduce errors
  • Change impact analysis: Measure how code changes affect error rates over time
  • Release quality metrics: Track error rates as a key quality indicator for releases

Validating error conditions

Ensure your error tracking focuses on genuine user-impacting issues:

Filter out false positives

  • Health check endpoints: Exclude monitoring system requests from error calculations
  • Internal service calls: Focus on user-facing errors rather than internal system communications
  • Expected errors: Some 5xx responses might be intentional (maintenance mode, rate limiting)
  • Bot traffic: Filter out errors from automated systems that don't represent real user impact
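The filtering above can be sketched as a simple predicate over error records. The paths, user-agent markers, and record shape here are all hypothetical; adapt them to whatever your telemetry actually captures:

```python
# Hypothetical error records; field names and values are illustrative only.
errors = [
    {"path": "/healthz",      "user_agent": "kube-probe/1.29", "status": 503},
    {"path": "/api/checkout", "user_agent": "Mozilla/5.0",     "status": 500},
    {"path": "/api/search",   "user_agent": "Googlebot/2.1",   "status": 502},
]

HEALTH_PATHS = {"/healthz", "/health", "/ping"}
BOT_MARKERS = ("bot", "probe", "crawler", "spider")

def user_impacting(err: dict) -> bool:
    """Keep only errors likely to have affected a real user."""
    if err["path"] in HEALTH_PATHS:
        return False  # monitoring traffic, not users
    if any(m in err["user_agent"].lower() for m in BOT_MARKERS):
        return False  # automated clients
    return True

real_errors = [e for e in errors if user_impacting(e)]
print([e["path"] for e in real_errors])  # ['/api/checkout']
```

Filtering this way keeps the scorecard focused on errors that real users actually hit, so a noisy health check or crawler can't mask or inflate genuine impact.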

Focus on user-affecting errors

  • Customer-facing services: Prioritize errors in services that directly serve end users
  • Critical business flows: Focus on errors that impact revenue-generating activities
  • High-traffic endpoints: Address errors on heavily used API endpoints or pages
  • Conversion funnels: Prioritize errors that affect user registration, purchases, or key actions

Important considerations

  • Business impact prioritization: Focus first on errors affecting revenue-critical services
  • User journey context: Consider where in the user flow errors occur and their impact on task completion
  • Error frequency vs. severity: Balance fixing frequent minor errors with rare but critical failures
  • Resource allocation: Ensure error resolution efforts align with available development capacity

Next steps

  1. Immediate action: Identify and resolve the highest-impact 5xx errors affecting users
  2. Process improvement: Establish error triage workflows and defect tracking procedures
  3. Prevention focus: Implement better testing and deployment practices to reduce new errors
  4. Monitoring enhancement: Use Change Tracking to correlate errors with deployments
  5. Progress to Level 2: Once service errors are under control, focus on Core Web Vitals optimization

For detailed guidance on service error monitoring and resolution, see our APM error tracking documentation and Error Inbox guide.

Copyright © 2025 New Relic Inc.