DevOps teams often spend far too much time treating recurring symptoms instead of making the extra effort to trace software and IT issues to their source and solve them there. But as every doctor knows, plenty of time and money can be saved by figuring out exactly why problematic symptoms appear in the first place. Approaching problems with an eye to unearthing such basic causal factors is called root cause analysis, and, as in the case of the smart doctor, it can greatly aid your efforts as a system administrator, developer, or QA professional to prevent a lot of unnecessary suffering.
Although employed as a deductive problem-solving methodology in almost every industry, from aeronautical engineering to book publishing, root cause analysis is especially useful in software development and IT, where complex systems of cause-and-effect relationships are the norm.
Whether you’re maintaining an MMORPG video game consisting of millions of lines of code or monitoring a cloud-hosting solution backed by multiple SANs, understanding how to trace undesirable effects back to their primary cause—or to at least pinpoint a series of interrelated causes—is essential to keeping your end-users happy. And without utilizing the principles of root cause analysis, sysadmins and operations managers may be kept too busy treating symptoms to ever bother digging down to find the roots of chronic conditions, making their role seem like that of a triage nurse, applying endless rolls of gauze to stanch the bleeding of a system that’s barely under control.
Fortunately, modern diagnostic software tools are making it easier than ever to perform a thorough root cause analysis on Web-based applications without breaking a sweat. These tools also give those who deal in application performance monitoring (APM) and website monitoring an advantage over most other industries when it comes to employing root cause analysis (RCA), with fewer flowcharts, Excel sheets, or interdepartmental brainstorming sessions required.
The Origins of Root Cause Analysis
Speaking in the broadest possible terms, the history of root cause analysis could be said to date back to the dawn of our ancient ancestors’ first problem-solving challenges, such as their first successful attempts to prevent themselves from constantly freezing in South African caves during wintertime by taming fire. Once the problem was identified (“winter cold”), the solution (“fire hot”) came within the grasp of their clever hominid brains and led to the undesirable effect (freezing) being intelligently mitigated at its root cause (winter), year after year.
But as a specific, consciously implemented, and systematic approach to problem-solving, root cause analysis was only recently developed in the modern era, spurred on by industrial and engineering accident investigations—such as the Tay Bridge collapse of 1879, the New London school explosion of 1937, or the Challenger space shuttle disaster of 1986—that demanded more rigorous analytical methodologies be invented to pinpoint the precise cause of problems that absolutely could not be permitted to happen again.
One important approach to root cause analysis, known as root-cause failure analysis (RCFA), emphasizes that problems in complex systems can rarely be attributed to a single specific cause. Rather, they are often the result of a series of interlinked “causal factors.” From poorly educated personnel to design issues to flawed engineering methods, the causal factors behind any problematic event can be ranked in terms of causal culpability while still acknowledging that all of the factors were at play as conditions that, together, spiraled out into an accident. For our purposes here, RCFA will just be treated as an aspect of root cause analysis itself, acknowledging (as the Buddhist philosophers do) that all of reality is a vast interlinked web of causes and effects, with any given factor impossible to completely separate and isolate from another, but that some causes do bear more responsibility for certain effects than others.
In the case of the infamous Tay Bridge collapse, which occurred in Dundee, Scotland, on December 28, 1879, and resulted in 75 deaths, the chief investigator declared the cause of the accident to be the negligence of the engineer who designed it, Sir Thomas Bouch, insisting that his bridge was “badly designed, badly constructed, and badly maintained.” But as Failure magazine reports, “Bouch alleged that the wind blew the train from the track into the bridge, and that the shock caused the lugs on one of the towers to break, leading to the collapse,” which is a far more useful (if not entirely sufficient) explanation of specific causal factors. (Based on the case argued in Failure magazine, a new root cause analysis of the incident has been performed here.)
And it is in this example that one finds one of the most important distinctions between root cause analysis and other forms of problem-solving: blaming your process rather than your people. In workflows such as Six Sigma, Lean, or Agile, it’s understood that blaming individual employees for inattention or negligence rarely accomplishes anything productive (as every software engineer or IT professional knows all too well from personal experience), whereas using instances of human negligence to develop more robust processes and systems that prevent such personal errors from happening in the first place can accomplish quite a lot.
Your company’s processes tend to be the root causal factors or conditions, and the ways that employees behave within those processes are the effects or symptoms. Merely calling your engineer and his methods “bad” is not particularly helpful, but maybe putting a system in place to routinely review the stress tolerances he’s using in his calculations would be. Root cause analysis, as an organizational strategy, helps to identify flawed processes and to thereby prevent negative symptoms before they strike again.
Understanding Root Cause Analysis
One of the simplest and most common approaches to root cause analysis, as it’s practiced in every field and industry, is the 5-Why approach developed by Sakichi Toyoda, the founder of Toyota Industries, and later adopted within Toyota Motor Corporation. This approach forms a simple foundation upon which more robust and detailed methods of inquiry can be based. And given that the best way to understand how root cause analysis is actually performed is to consider a specific example, let’s look at a 5-Why approach to getting at the root cause of, say, why Motorola’s site crashed during their Cyber Monday sale in December 2013. As the name implies, the 5-Why method consists of acting like an annoyingly inquisitive child, simply asking the question “Why?” five times in succession, or as many times as might be needed to get to a satisfactory conclusion. Based on Motorola’s explanation and apology to their customers, a somewhat cursory root cause analysis of their Cyber Monday failure using the 5-Why approach might look something like this:
Q: Why were sales of our amazing new Moto X so low?
A: Because demand was much higher than we anticipated.
Q: Why was high demand for our new product a problem?
A: Because potential customers couldn’t process orders through our website and left in frustration.
Q: Why couldn’t customers place orders on our website?
A: Because our frontend MotoMaker customization software couldn’t handle the demand, and a traffic spike crashed our site.
Q: Why couldn’t MotoMaker handle the demand?
A: Because our performance testing was insufficient.
Q: Why was our testing insufficient, and how do we fix this problem?
A: We failed to test for a high volume of concurrent orders, and we need to fix our MotoMaker software to be able to handle such demands.
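For teams that want to keep such a chain alongside their incident notes, the questions and answers above can be captured in a small data structure. The following is only a minimal Python sketch: the `Why` class and the `root_cause` and `render` helpers are hypothetical names invented for this illustration, and the answers are condensed from the walk-through above rather than taken from any Motorola postmortem.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Why:
    """One question-and-answer step in a 5-Why chain (hypothetical helper)."""
    question: str
    answer: str


def root_cause(chain: List[Why]) -> str:
    """Return the answer to the last 'why', the deepest cause found so far."""
    if not chain:
        raise ValueError("a 5-Why chain needs at least one step")
    return chain[-1].answer


def render(chain: List[Why]) -> str:
    """Format the chain as an outline, indented one level deeper per 'why'."""
    lines = []
    for depth, step in enumerate(chain):
        indent = "  " * depth
        lines.append(f"{indent}Why #{depth + 1}: {step.question}")
        lines.append(f"{indent}  -> {step.answer}")
    return "\n".join(lines)


# Condensed from the Motorola walk-through above.
moto_chain = [
    Why("Why were sales so low?", "Demand was higher than anticipated."),
    Why("Why was high demand a problem?", "Customers could not place orders."),
    Why("Why could customers not place orders?", "A traffic spike crashed the site."),
    Why("Why could MotoMaker not handle the demand?", "Performance testing was insufficient."),
    Why("Why was testing insufficient?", "High volumes of concurrent orders were never tested."),
]

print(render(moto_chain))
print("Root cause:", root_cause(moto_chain))
```

Treating the last answer as the root cause is a simplification. In practice the chain may branch into several causal factors, which is where the tree-shaped and fishbone-style diagrams come in.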
Further questions would be needed to drill into the actual specifics of the MotoMaker code and any server-side issues, but this general overview illustrates the approach. The 5-Why approach and other root cause analysis methods of “causal mapping” are typically illustrated in visual form as cause-effect graphs, with the fishbone diagram, or Ishikawa diagram, popularized by Kaoru Ishikawa in 1968, being among the most widely used. Taking into account a range of causal factors, from processes to people to materials and equipment, they begin with a problem and work back to its causes (or vice versa, as in the image below), generally looking something like this:
(Image provided by Wikimedia, reproduced here under a Creative Commons license.)
Simpler forms of cause-and-effect mapping, such as the Cause Mapping method employed by the company ThinkReliability, or more common Causal Factor Tree charts, such as those used by RealityCharting, are also prevalent in varied approaches to RCA and may serve some organizations’ needs better than others.
Not all problems are created equal, and some may require incredibly extensive causal-factor maps to arrive at one or multiple root causes. As always, the time it takes to sit down with a team for a brainstorming session and generate complex charts has to be weighed against the potential usefulness of such a chart. One should “avoid over-applying root-cause analysis,” advises James Shore at The Art of Agile. “Balance the risk of error against the cost of more process overhead.” Extensive RCA is not a necessary tactic for every problem, but simple versions of it can be helpful in most cases.
Returning to our focus on root cause analysis as it applies to the world of DevOps, root cause analysis is simply about determining, very specifically, the when, the where, and the why of a problem at its source, before it can ripple out to affect the end-user of an application or website a second time. And again, while developers and QA personnel may often need to engage in more traditional methods of root cause analysis, getting together around notepads or whiteboards for extended brainstorming sessions, there are now sophisticated Web-based tools that can do a lot of the job for you, automatically diagnosing the root causes of errors—particularly where Web performance monitoring and real user monitoring are concerned.
The Future: Inductive, Intuitive, and Automated Root Cause Analysis
The traditional practice of root cause analysis is a form of deductive analysis, Sherlock Holmes style, beginning with a known problem and working backward, sifting through the available evidence to identify the culprit. But taking the opposite approach is also possible (Minority Report, dear Watson?).
As automated software testing tools become increasingly sophisticated, understanding the systems they’re monitoring so well that they become predictive, those tools begin to extend into areas of inductive analysis—preventing problems before they ever start.
Inductive analysis is a form of forward-thinking reasoning that is useful for preventing problems from happening at all. One example is the methodology known generally as FMEA (failure mode and effects analysis), in which software testers think about the kinds of bugs that might be present in order to write test cases to find them. This is particularly useful in Agile environments, where developers need to bear potential bugs and testing in mind as they code. And while humans can get quite proficient at inductive analysis, certain website monitoring solutions, such as AlertSite UXM, do it better, automatically alerting sysadmins to potential problems before they happen by understanding desired performance indicators and intelligently anticipating future deviations from the norm.
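To make "deviations from the norm" a little more concrete, here is a minimal Python sketch of one very simple way a monitor might flag a measurement that strays from its recent baseline. It is an assumption-laden illustration, not a description of how AlertSite UXM or any other product works: the `deviations` function, the window size, the threshold, and the sample response times are all hypothetical. It also only detects a deviation once it has occurred, so it is a simplified stand-in for the genuinely predictive analysis described above.

```python
from statistics import mean, stdev
from typing import List, Sequence


def deviations(samples: Sequence[float], window: int = 5, threshold: float = 3.0) -> List[int]:
    """Return indexes of samples that sit more than `threshold` standard
    deviations above the mean of the preceding `window` samples."""
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # A perfectly flat baseline has sigma == 0; skip it rather than divide by zero.
        if sigma > 0 and (samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical page-load times in milliseconds, with one obvious spike.
response_ms = [210, 205, 198, 215, 207, 202, 211, 640, 209, 204]
print(deviations(response_ms))  # [7]
```

A rolling baseline is used here so that the definition of "normal" follows the recent behavior of the system rather than a fixed limit. A real monitoring tool would likely also account for trends, seasonality, and the performance targets the team has defined.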
Find and Fix Issues Before They Impact Your Users
AlertSite helps you find issues and performance bottlenecks before they impact your users and your business by proactively monitoring your applications 24x7.
- In case of failures, AlertSite's intelligent alerting helps you engage the right people at the right time. Various notification options, intelligent routing, alert customization, and blackouts help you engage subject matter experts without wasting time.
- The reporting data and monitor run history help you understand how much time was spent on various network activities when loading your application.
- Our integration with a leading APM platform helps you trace application errors at the code level. This helps with isolating the faulty component and with root cause analysis, reducing MTTR.
Written by Tom Huston