Incident Management: Optimizing On-Call Processes

Zaid Akel
5 min read · Oct 4, 2024


Providing reliable software is essential for customer retention and growth. Customers expect to perform their tasks without downtime and with fast response times. While software development teams strive to build robust software, occasional system failures are inevitable.

To respond quickly to such failures and provide a great customer experience, development teams usually adopt an on-call process, where specific team members are available at designated times to respond in the event of a failure. Team members take turns on call according to a schedule, usually rotating weekly or bi-weekly, set by the team or its leader.


On-Call Process: What Could Go Wrong?

Many things could go wrong during an on-call rotation, such as:

  • Insufficient knowledge: The on-call engineer may lack the information needed to resolve an issue, leading to stress and delays
  • Multiple alerts: Alerts stemming from the same root cause can overwhelm engineers and slow down resolution
  • Missed critical issues: Important issues may not trigger alerts, or the on-call engineer may be unreachable, extending MTTR (Mean Time to Resolve)
  • False alarms: Invalid alerts can cause unnecessary disruptions to the on-call engineer

Poor incident handling and the lack of a well-defined on-call process can lead to frustrated team members, reduced customer satisfaction, and decreased productivity. To achieve operational excellence, the development team should understand their system’s reliability, anticipate what could go wrong, equip the on-call engineer with reference documentation, and continuously improve their operations.

Assess the reliability of your software


Begin by defining monitors for both functional and non-functional requirements to gain insight into how customers are interacting with the system and how the system is performing. For each functionality, track key system metrics such as average response latency, error rates, and resource utilization (memory and CPU usage). Additionally, consider business metrics like average transaction counts to gain a complete understanding of system performance and customer behavior.
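As a minimal sketch, assuming the metrics are published to AWS CloudWatch through boto3 (any monitoring backend works similarly), per-request latency and error counts could be recorded like this; the namespace and dimension names are illustrative:

```python
import boto3

# Assumes AWS credentials and a default region are configured.
cloudwatch = boto3.client("cloudwatch")


def record_request(page: str, latency_ms: float, is_error: bool) -> None:
    """Publish per-request latency and error metrics under a custom namespace."""
    cloudwatch.put_metric_data(
        Namespace="MyApp/Checkout",  # illustrative namespace
        MetricData=[
            {
                "MetricName": "Latency",
                "Dimensions": [{"Name": "Page", "Value": page}],
                "Value": latency_ms,
                "Unit": "Milliseconds",
            },
            {
                "MetricName": "Errors",
                "Dimensions": [{"Name": "Page", "Value": page}],
                "Value": 1.0 if is_error else 0.0,
                "Unit": "Count",
            },
        ],
    )
```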

Defining such metrics helps you assess whether the system is meeting expectations and what should be improved. At this point, however, detecting errors and anomalies is still a manual process; if a failure happens, the customer may be the first to know.

Define severity categories

Failures vary in severity: some block customers from performing their tasks and require immediate attention, while others add barely noticeable latency and can wait until the next business day. The development team should define severity categories, with clear impact definitions and SLAs for resolution. Different teams use different mechanisms to indicate severity; the following is one example (see the sketch after the list for one way to encode it):

  • Critical: major issues that block customers from performing main tasks, such as preventing a customer from completing checkout on an e-commerce website. SLA: 4 hours
  • High Priority: issues that block customers from performing main tasks while an alternative exists, such as being unable to export to Word while the PDF option is available. SLA: 24 hours
  • Medium Priority: issues that impact the customer’s experience but don’t cause major disruptions, such as slightly increased response latency. SLA: 3 business days
  • Low Priority: trivial issues that have little impact on the system, such as infrequent failed requests. SLA: 5 business days
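One lightweight way to keep these definitions close to the tooling is to encode them as data. The sketch below is illustrative only; the names and SLA values simply mirror the example above:

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class SeverityPolicy:
    impact: str
    sla: timedelta             # target time to resolve
    business_days_only: bool   # whether the SLA clock pauses outside business days


SEVERITY_POLICIES = {
    Severity.CRITICAL: SeverityPolicy("Blocks main tasks, no workaround", timedelta(hours=4), False),
    Severity.HIGH: SeverityPolicy("Blocks main tasks, workaround exists", timedelta(hours=24), False),
    Severity.MEDIUM: SeverityPolicy("Degrades experience, no major disruption", timedelta(days=3), True),
    Severity.LOW: SeverityPolicy("Trivial impact", timedelta(days=5), True),
}
```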

Define and implement alarms

Once the system’s metrics are established and approved by the product development team, set up alarms to trigger when thresholds are crossed. The development team should anticipate potential failures and define alerts with a clear metric, threshold, duration, and severity level. For example (see the sketch after the list):

  1. If more than 15% of total requests return 5xx or 4xx errors in the last 10 minutes, trigger a critical alert.
  2. Assuming the average latency for a specific page is 200ms: if average latency is between 250ms and 300ms in the last 20 minutes, trigger a high priority alert.
  3. If average latency exceeds 300ms in the last 20 minutes, trigger a critical alert.
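Before wiring these into a specific tool, it can help to capture the rules declaratively. The structure below is only a sketch that mirrors the three examples above; the field names are made up for illustration:

```python
# Illustrative, tool-agnostic alarm definitions mirroring the examples above.
ALARM_RULES = [
    {
        "name": "high-error-rate",
        "metric": "error_rate_percent",   # 5xx/4xx responses as % of total requests
        "comparison": ">",
        "threshold": 15,
        "window_minutes": 10,
        "severity": "critical",
    },
    {
        "name": "elevated-latency",
        "metric": "avg_latency_ms",
        "comparison": "between",
        "threshold": (250, 300),
        "window_minutes": 20,
        "severity": "high",
    },
    {
        "name": "excessive-latency",
        "metric": "avg_latency_ms",
        "comparison": ">",
        "threshold": 300,
        "window_minutes": 20,
        "severity": "critical",
    },
]
```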

Once the alarms are defined, even if only documented at this stage, it is time to implement them. I am not going to dive deep into implementation mechanisms in this article, but there are plenty of tools that help with monitoring and creating alarms, such as AWS CloudWatch, Datadog, and Azure Monitor.
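As one hedged example, the third rule above (average latency over 300ms for 20 minutes) could be implemented as a CloudWatch alarm with boto3 roughly like this; the namespace, dimension, and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder: SNS topic that fans out to the paging tool.
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:critical-alerts"

cloudwatch.put_metric_alarm(
    AlarmName="checkout-avg-latency-critical",
    AlarmDescription="Average checkout latency above 300 ms for 20 minutes",
    Namespace="MyApp/Checkout",                # must match the published metric
    MetricName="Latency",
    Dimensions=[{"Name": "Page", "Value": "checkout"}],
    Statistic="Average",
    Period=300,                                # 5-minute evaluation buckets
    EvaluationPeriods=4,                       # 4 x 5 minutes = 20 minutes
    Threshold=300,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[ALERT_TOPIC_ARN],
)
```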

Documentation: The On-Call Engineer’s Best Friend

An on-call engineer might get paged outside working hours while the rest of the team is off duty. Since on-call engineers might not be familiar with every system component, up-to-date documentation is the on-call engineer’s best friend. It should give them access to:

  • System design: architecture diagrams and descriptions of system components, including key services and databases
  • Runbooks and troubleshooting guides: step-by-step instructions for responding to all alerts, handling known error messages, and diagnosing root causes
  • Communication procedures that specify when and to whom issues should be escalated

Define on-call schedule and paging mechanism

Define a rotation schedule; typically each engineer is on call for one or two weeks, and the shift rotates among all engineers. Tools such as PagerDuty and Squadcast offer scheduling features along with paging mechanisms, including configuring notification channels based on the severity and time of the incident. For example, for critical incidents, phone the on-call engineer at any hour, but for low-priority ones, send them a Slack message.
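Under the hood, paging usually boils down to an API call from the monitoring pipeline. Below is a minimal sketch of triggering an incident through the PagerDuty Events API v2, assuming a routing key from an existing service integration; the alert details are illustrative:

```python
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "<integration-routing-key>"  # placeholder from a PagerDuty service integration


def page_on_call(summary: str, source: str, severity: str = "critical") -> dict:
    """Trigger a PagerDuty incident; severity is one of critical/error/warning/info."""
    response = requests.post(
        PAGERDUTY_EVENTS_URL,
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": source,
                "severity": severity,
            },
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()  # typically includes a dedup_key for the new incident


# Example: page for the checkout latency alarm defined earlier.
# page_on_call("Average checkout latency above 300 ms", "checkout-service")
```

In practice, most teams connect the monitoring tool to the paging tool through a native integration rather than calling this API directly.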

In addition to resolving incidents, the on-call engineer should document the root cause and track Mean Time to Acknowledge (MTTA) and Mean Time to Resolve (MTTR), which helps the team understand the health of its operations.
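MTTA and MTTR are straightforward to derive if each incident records when it was triggered, acknowledged, and resolved; a minimal sketch with an illustrative Incident record:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def time_to_acknowledge(incident: Incident) -> timedelta:
    """Time from the alert firing until the on-call engineer acknowledged it."""
    return incident.acknowledged_at - incident.triggered_at


def time_to_resolve(incident: Incident) -> timedelta:
    """Time from the alert firing until the incident was resolved."""
    return incident.resolved_at - incident.triggered_at
```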

Continuous Improvement: Operational Excellence Reviews

Teams can fall into the trap of repeatedly fixing the same issue without addressing its root cause, or of dealing with different issues stemming from the same underlying problem. A holistic view of operations and incidents helps teams focus on the underlying issues, which might reveal design deficiencies or fragile algorithms. In addition, the team should revise the defined alerts periodically to keep false positives to a minimum and to ensure the team is paged for all incidents. Holding a regular operational excellence review meeting, where the team discusses the metrics below and creates action items, such as adding a new alert, modifying the threshold or severity of an existing one, or combining multiple alerts to avoid duplicates, helps continuously improve the team’s operations (see the sketch after the list):

  • False positive alerts
  • Recurring incidents
  • Different incidents with the same root cause
  • Average MTTA
  • Average MTTR
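As a rough sketch of what feeds such a review, assuming incidents are exported from the paging tool with a few fields per incident (the field names and numbers are illustrative):

```python
from collections import Counter
from statistics import mean

# Illustrative incident export; in practice this would come from the paging tool's API.
incidents = [
    {"alert": "high-error-rate", "root_cause": "db-connection-pool", "false_positive": False,
     "mtta_minutes": 6, "mttr_minutes": 95},
    {"alert": "excessive-latency", "root_cause": "db-connection-pool", "false_positive": False,
     "mtta_minutes": 4, "mttr_minutes": 120},
    {"alert": "elevated-latency", "root_cause": None, "false_positive": True,
     "mtta_minutes": 9, "mttr_minutes": 10},
]

real = [i for i in incidents if not i["false_positive"]]
print("False positives:", sum(i["false_positive"] for i in incidents))
print("Incidents per root cause:", Counter(i["root_cause"] for i in real))
print("Average MTTA (min):", mean(i["mtta_minutes"] for i in incidents))
print("Average MTTR (min):", mean(i["mttr_minutes"] for i in real))
```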

Putting it all together

Implementing an efficient on-call process requires a thorough understanding of your system metrics, anticipating potential failures, and setting up alarms to trigger when something goes wrong. It also involves establishing a clear on-call schedule and may entail changes to your system’s design to enable log collection and metric tracking. Additionally, the product development team’s leader must secure the team’s commitment to be available during their assigned on-call shifts.

Below is an illustration of the required changes to implement an on-call process in an existing system, using CloudWatch and PagerDuty.
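As a rough sketch of that wiring, assuming the common pattern of routing CloudWatch alarm actions through an SNS topic subscribed to a PagerDuty CloudWatch integration endpoint (the endpoint URL, account details, and names are placeholders):

```python
import boto3

sns = boto3.client("sns")
cloudwatch = boto3.client("cloudwatch")

# Placeholder: integration URL generated when adding a CloudWatch integration
# to a PagerDuty service.
PAGERDUTY_ENDPOINT = "https://events.pagerduty.com/integration/<integration-key>/enqueue"

# 1. SNS topic that the alarms publish to.
topic_arn = sns.create_topic(Name="critical-alerts")["TopicArn"]

# 2. Subscribe the PagerDuty endpoint so alarm notifications open incidents.
sns.subscribe(TopicArn=topic_arn, Protocol="https", Endpoint=PAGERDUTY_ENDPOINT)

# 3. Point each alarm at the topic (shown here for the latency alarm defined earlier).
cloudwatch.put_metric_alarm(
    AlarmName="checkout-avg-latency-critical",
    Namespace="MyApp/Checkout",
    MetricName="Latency",
    Dimensions=[{"Name": "Page", "Value": "checkout"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=4,
    Threshold=300,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[topic_arn],
)
```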



Written by Zaid Akel

Technology leader & consultant | Working @ Amazon | Ex-Expedia | Passionate about growing engineering teams, building scalable solutions and cloud computing
