
The rising cost of digital incidents: Understanding and mitigating the impact of outages

Digital disruptions have reached alarming levels. In modern application environments, incident response is frequent, time-consuming, and labor-intensive. Our team has decades of first-hand experience in IT operations and has dealt with the far-reaching impact of these disruptions and outages. PagerDuty recently published a study [1] that sheds light on how broken our existing incident response systems and practices are. The recent CrowdStrike debacle is further proof of this. Despite all the investments in observability, AIOps, automation, and playbooks, things are not getting better. In some ways they are getting worse: we collect more and more data and are overloaded with tools, leaving users and teams confused as they struggle to understand the whole environment and all its dependencies. With an average resolution time of 175 minutes, every customer-impacting digital incident costs time and money. The industry needs to rethink its current processes so that we can change direction.

The impact of outages and application downtime

Failures undermine customer confidence: 90% of IT leaders report that digital disruptions have eroded customer trust. Protecting sensitive data, ensuring rapid service restoration, and providing real-time updates to customers are essential to maintaining trust during digital incidents. Thorough, actionable post-mortem analysis is essential after incidents to prevent recurrences. And, at the risk of stating the obvious, IT organizations must put operational procedures in place to minimize outages in the first place.

Although IT leaders are aware of the impact on customer trust, incident frequency continues to rise, with 59% of IT leaders reporting an increase in customer-impacting incidents. And the situation won't improve unless we change the way we monitor and contain problems in our applications.

Automation can help, but adoption is slow

Despite the growing threat, many organizations lag behind in automating incident response:

  • Over 70% of IT leaders say key incident response tasks are not yet fully automated
  • 38% of deployment time is spent on manual incident response processes
  • Organizations with manual processes take an average of three hours and 58 minutes to resolve customer-impacting incidents, compared to two hours and 40 minutes for organizations with automated processes
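
To put the last two figures side by side, here is a small sketch of the arithmetic. It uses only the resolution times quoted in the list above; the constant and function names are illustrative choices, not part of the study.

```python
# Resolution times quoted in the list above, converted to minutes.
MANUAL_MINUTES = 3 * 60 + 58      # three hours and 58 minutes
AUTOMATED_MINUTES = 2 * 60 + 40   # two hours and 40 minutes


def time_saved(manual: int, automated: int) -> tuple[int, float]:
    """Return (minutes saved, percentage reduction relative to manual)."""
    saved = manual - automated
    return saved, 100 * saved / manual


saved, pct = time_saved(MANUAL_MINUTES, AUTOMATED_MINUTES)
print(f"Automation saves {saved} minutes per incident (a {pct:.1f}% reduction)")
# prints: Automation saves 78 minutes per incident (a 32.8% reduction)
```

In other words, automation trims roughly a third off the resolution time, yet the automated figure is still 160 minutes.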

You don't have to be an IT expert to know that spending more than a third of your time on manual processes is a waste of resources. And companies with automated processes still take almost three hours to resolve incidents. Why is incident response still so slow?

It's not just about automating processes. We also need to accelerate decision automation, driven by a deep understanding of the health of applications and infrastructure.

Causal AI for DevOps: The missing link

Causal AI for DevOps promises to bridge the gap between observability and automated digital incident response. By “causal AI for DevOps,” I mean causal reasoning software that applies machine learning (ML) to automatically understand cause-and-effect relationships. Causal AI has the potential to help development and operations teams better plan changes to code, configurations, or load patterns, allowing them to focus on achieving service level and business goals instead of fighting fires.

Causal AI for DevOps can automate many of the currently manual incident response tasks:

  • When service entities become degraded or fail, impacting other entities that make up the business services, causal reasoning software brings to light the relationship between the problem and the symptoms it causes.
  • The team responsible for the down or impacted service is notified immediately so they can get to work on resolving the issue. Some issues can be resolved automatically.
  • Notifications can be sent to end users and other stakeholders to tell them that their services have been impacted, explain why it happened, and say when things will return to normal.
  • Post-mortem documentation is created automatically.
  • There are no longer complex triage processes that would otherwise require multiple teams and managers to orchestrate. Digital incidents and outages are reduced and root cause analysis is automated, allowing DevOps teams to spend less time troubleshooting and more time shipping code.
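
The first item in the list above, relating a problem to the symptoms it causes, can be illustrated with a deliberately simplified sketch. This is not Causely's implementation and it does not use machine learning: it is a plain dependency-graph heuristic in which a degraded entity counts as a root-cause candidate if none of its own dependencies are degraded. The service names, the `DEPENDS_ON` topology, and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical topology: each service maps to the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["postgres"],
    "cart": ["redis"],
    "postgres": [],
    "redis": [],
}


def find_root_causes(depends_on, degraded):
    """Return degraded entities none of whose dependencies are degraded."""
    degraded = set(degraded)
    return sorted(
        svc for svc in degraded
        if not any(dep in degraded for dep in depends_on.get(svc, []))
    )


def symptoms_of(depends_on, root, degraded):
    """Return degraded entities that transitively depend on `root`."""
    # Invert the graph so we can walk from a dependency to its dependents.
    dependents = defaultdict(set)
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(svc)

    seen, stack = set(), [root]
    while stack:
        for svc in dependents[stack.pop()]:
            if svc in degraded and svc not in seen:
                seen.add(svc)
                stack.append(svc)
    return sorted(seen)


degraded = {"checkout", "payments", "postgres"}
for root in find_root_causes(DEPENDS_ON, degraded):
    print(root, "->", symptoms_of(DEPENDS_ON, root, degraded))
# prints: postgres -> ['checkout', 'payments']
```

In this toy scenario three alerts collapse into a single candidate cause, so only the team that owns `postgres` would need to be paged. A real causal reasoning system would have to handle noisy signals, incomplete topology, and several simultaneous faults, none of which this sketch attempts.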

It's time for automated incident response

It's time to move from manual to automated incident response. Causal AI for DevOps can help teams prevent outages, reduce risk, lower costs, and build lasting customer trust. This is a topic we care about at Causely, where we're building a causal reasoning platform that helps organizations ensure continuous application reliability and eliminate human troubleshooting. For more information about us and our platform, visit Causely.io.