Root cause analysis (RCA) is a systematic process for finding why things happened, and although maintenance departments often use it to avoid repeating failures, it's just as effective at helping recreate successes.
Understanding root cause analysis starts with clear definitions.
What is the definition of root cause analysis (RCA)?
Just like the name implies, it's the process of finding the original cause for an incident, but this definition leads us to two important questions:
What's the difference between accidents and incidents?
What's the difference between a cause and a symptom?
Incidents and accidents
Strictly speaking, an incident is something that happened, usually as the result of an earlier event. So, someone gets arrested after committing a crime. Or someone gets promoted after boosting uptime and cutting costs.
In one case, it's something positive while in the other, it's negative. Both are incidents, and this is something important to remember when looking at RCA. It's for both finding out why something didn't work and for finding out why something did.
Accidents are different in that they're always bad. Separate from who's to blame or if it was avoidable or not, you don't want accidents.
Causes and symptoms
The classic example is when you get sick and suffer from:
- Sore throat
Everything on that list is a symptom, a result of your being sick. But they are not the reason you're sick. The reason is likely a bacterial or viral infection. Why is the distinction important?
If you only ever treat the symptoms, you always run the risk of getting sick again. Sure, you can use all kinds of medicines to lessen the effects of the symptoms, but because you aren't attacking the root of the problem, it can come back.
Back to the definition of RCA
A bit more on root cause analysis: there's more than one way to do it, but generally you find the real reason behind an incident either by digging progressively deeper into the causes or by separating the causes, which helps you find the ones with the largest effect.
What are the benefits of root cause analysis?
Which would you rather do, develop a new cough syrup or discover the cure for the common cold?
Once you had the cure, you'd corner the market, making everyone else redundant. No more colds means never having to worry about that long list of symptoms.
But enough of that analogy. Let's look at RCA specifically for maintenance.
The first benefit is that root cause analysis reveals the real reasons something happened, which helps you both now and in the future. First, you can more easily fix a problem when you know what's causing it. For example, if a piece of equipment has a leak, you can patch it, dealing only with the symptom. But if you dig deeper and discover the leak is the result of a busted seal, you can replace the seal, both stopping the leak now and avoiding future leaks.
The second benefit of root cause analysis is that you can take what you learned with the broken seal and apply it to other equipment. After a bit more digging for the root cause, you might decide that the team is not inspecting the seals often enough, leading you to add more preventive maintenance inspections and tasks to the schedule. Or it might be the case that your seals are generally low quality and, moving forward, you want to change suppliers. In either case, RCA helps you develop insights into your operations, and these insights can lead to better decision-making.
Common strategies and examples of root cause analysis
Now that we know what it is and why it's valuable, let's look at how to do it. There are different-but-connected methods.
The 5 Whys and "Could you be more specific, Mr. Stoll?"
The first method for root cause analysis is to ask yourself why something happened. Sounds easy at first, but it can become challenging as you drill down.
Here's a simple example of how repeatedly asking why can become hard. Dr. Clifford Stoll, famous for his book about tracking West German computer hackers, has a great story about defending his doctoral thesis. Everything was going fine until close to the end, when one of the professors asked him what seemed like a simple question: "Why is the sky blue?"
Every time he answered, he was met with, "Could you be more specific, Mr. Stoll?" And so he kept digging deeper, until he was at the level of explaining molecular energies, optics, and the inner workings of the human eye.
RCA works the same way. You ask yourself why something happened. Then why that happened. You keep going until you're about five or so levels down. Five is the general rule of thumb for root cause analysis; in some cases, you only need to dig down twice, while in others, it's deeper.
Simple example of 5 Whys for RCA
So, back to our earlier example of the leaking equipment. How could you use the 5 Whys?
Why is it leaking?
The connection between two pieces is not tight enough to keep the liquid from coming out.
Why is it not tight enough?
The rubber seal is damaged, preventing a tight fit.
Why is the seal damaged?
It was not installed properly. When the techs tightened the pieces together, the threads bit into the seal, damaging it.
Why did the techs install it improperly?
The seals are different from the ones they're used to working with, and they did not receive new training.
So, why is there a leak? The root cause of the leak is a lack of proper training on how to install the new seals.
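The chain above can also be written down as data, which makes it easy to attach to an incident report. This is only a minimal sketch: the names `WhyStep`, `leak_chain`, `root_cause`, and `report` are made up for illustration, and the answers are shortened versions of the ones in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WhyStep:
    """One level of the 5 Whys: the question asked and the answer found."""
    question: str
    answer: str


# The chain from the leaking-equipment example, in the order it was asked.
leak_chain = [
    WhyStep("Why is it leaking?",
            "The connection between two pieces is not tight enough."),
    WhyStep("Why is it not tight enough?",
            "The rubber seal is damaged."),
    WhyStep("Why is the seal damaged?",
            "It was not installed properly."),
    WhyStep("Why did the techs install it improperly?",
            "The seals are new and the techs did not receive training."),
]


def root_cause(chain):
    """Return the deepest answer reached, which the team treats as the root cause."""
    if not chain:
        raise ValueError("ask at least one why")
    return chain[-1].answer


def report(chain):
    """Format the chain as an indented trail, one level deeper per why."""
    lines = []
    for depth, step in enumerate(chain):
        indent = "  " * depth
        lines.append(f"{indent}{step.question}")
        lines.append(f"{indent}-> {step.answer}")
    return "\n".join(lines)


print(report(leak_chain))
print("Root cause:", root_cause(leak_chain))
```

Note that this chain stops at four whys, not five. The list simply ends wherever the team decides it has reached something it can act on.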
Change analysis and event analysis
In the leaking example, we're drilling down from one connected cause to the next, looking for the original.
For change and event analysis, it's a bit different. Here, we're sorting through different changes, looking for the one that was the root cause. We have to decide if each change leading up to the incident was unrelated, correlated, contributing, or a root cause.
Unrelated means there is no relationship, and the change did not cause the incident. Correlated means there is a relationship, but the change did not cause the incident. How is this possible? The classic example of "correlation is not causation" is murder and ice cream. Whenever ice cream sales increase, there is a corresponding increase in the murder rate. For example, a small increase in sales is followed by a small increase in murder.
But ice cream is not killing people and it's not driving people to kill, either. Instead, there is a lurking variable in the background pushing up both numbers: heat. When the temperature rises, people eat more ice cream. And they have much shorter fuses.
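To see how a lurking variable can create a correlation, here is a small sketch. The numbers are made up purely for illustration and are not real statistics; the helper names `pearson` and `residuals` are also just assumptions for this example. It measures the correlation between sales and murders directly, then again after removing what temperature explains in each.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def residuals(xs, ys):
    """What is left of ys after removing a straight-line fit on xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


# Made-up numbers for illustration only, not real statistics.
temperature = [10, 15, 20, 25, 30, 35]
ice_cream_sales = [21, 29, 40, 50, 59, 71]
murders = [11, 16, 18, 23, 31, 36]

# Correlation between the two series as observed.
raw = pearson(ice_cream_sales, murders)

# Correlation between what remains once temperature is accounted for.
controlled = pearson(residuals(temperature, ice_cream_sales),
                     residuals(temperature, murders))

print(f"raw correlation: {raw:.2f}")
print(f"after removing temperature: {controlled:.2f}")
```

In this toy data set, the raw correlation is strong, but it disappears once temperature is taken out of both series. That is the pattern to look for before calling a change a cause.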
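Contributing means the change helped but wasn't the only cause. A root cause is the change that explains the incident: without it, the incident would not have happened.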
Simple example of change analysis and event analysis for RCA
Here's a concrete example of how to use this method for finding a root cause.
The maintenance manager notices an increase in the monthly close-out rate for preventive maintenance tasks and inspections. Hoping to recreate the success, they look for recent changes that could explain it and come up with the following list:
- A new tech started two months ago
- The maintenance department switched suppliers for some parts and materials
- A different tech has been taking care of the PMs while the regular tech is on vacation
Which change is the root cause? The first one turns out to be unrelated. The new tech was hired specifically for a maintenance project that's separate from the PM program. Looking at the parts and materials, it's hard to say they're having an effect on close-out rates. They might last longer and cost less, but that wouldn't make them easier to use.
But are they easier to find? The manager notices that the packaging is better. The writing is clear, and many of the boxes are color coded. It's a small change, and likely only saving the techs a few minutes per PM, so the manager decides it's only contributing.
That leaves the fact that a different tech has been doing the PMs, which looks great for the tech. But is it the root cause of the better close-out numbers? When the maintenance manager compares the substitute tech to the regular one, they're very similar, with roughly the same amounts of experience and time with the company.
Digging deeper and asking between two and five whys, the manager finds the root cause. The regular tech tends to work a later shift, which means they're constantly being called away to deal with on-demand work orders. The substitute tech usually works the earlier shift, before any equipment has had a chance to break down. They're able to get more PMs closed out because no one is pulling them off their scheduled PMs to handle on-demand work orders.
Now that the manager has found the root cause, they can recreate that success by actively scheduling PMs for the earlier shift.
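The manager's sorting can be sketched as a small classification table. This is a hypothetical illustration, not a feature of any particular CMMS: the names `Relationship`, `Change`, and `worth_acting_on` are invented here, and the labels simply record the conclusions from the example above.

```python
from dataclasses import dataclass
from enum import Enum


class Relationship(Enum):
    """The four labels used in change and event analysis."""
    UNRELATED = "unrelated"
    CORRELATED = "correlated"
    CONTRIBUTING = "contributing"
    ROOT_CAUSE = "root cause"


@dataclass(frozen=True)
class Change:
    description: str
    relationship: Relationship
    note: str


# The three changes from the close-out example, labeled as the manager did.
changes = [
    Change("New tech started two months ago",
           Relationship.UNRELATED,
           "Hired for a project separate from the PM program"),
    Change("Switched suppliers for some parts and materials",
           Relationship.CONTRIBUTING,
           "Clearer packaging saves a few minutes per PM"),
    Change("Different tech covering PMs",
           Relationship.ROOT_CAUSE,
           "PMs moved to the earlier shift, with fewer interruptions"),
]


def worth_acting_on(all_changes):
    """Keep root causes and contributing changes, root causes first."""
    keep = (Relationship.ROOT_CAUSE, Relationship.CONTRIBUTING)
    kept = [c for c in all_changes if c.relationship in keep]
    return sorted(kept, key=lambda c: keep.index(c.relationship))


for change in worth_acting_on(changes):
    print(f"{change.relationship.value}: {change.description} ({change.note})")
```

Keeping the note next to each label records why the manager reached that conclusion, which helps the next time the same question comes up.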
In this example, the maintenance manager was able to find a root cause they could control. That's not always the case. There are times when you can easily identify the root cause but you can't easily do anything about it. What if the difference between the two techs was that one had a young baby at home and the other one didn't? If the baby's up all night, the tech's not getting enough sleep, and it's affecting their performance. RCA can tell us why something happened, but that doesn't mean it also tells us the best way to fix it.
Ishikawa or fishbone diagrams (aka Fishikawa)
When you're first brainstorming possible causes, there's no such thing as a bad idea. But once you have all the ideas out in front of you, it's time to start organizing them, deciding which are the best ones.
Ishikawa diagrams, named after Kaoru Ishikawa, a key figure in Japanese quality management innovations, show the causes leading up to an event. The name fishbone diagram comes from their resemblance to a fish skeleton with the effect at the head.
By building out the diagram, you can get a better understanding of the causes, their relationships to one another, and their relative contribution to the final effect. From there, you can work on finding ways to either reinforce or remove the causes.
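Before drawing anything, it can help to group the brainstormed causes in plain text. The sketch below assumes the common People, Methods, and Materials groupings often used with fishbone diagrams and reuses the causes from the leaking-equipment example; the grouping itself and the names `fishbone` and `render` are illustrative choices, not a prescribed layout.

```python
# Effect at the "head" of the fish, causes grouped along the "bones".
effect = "Leak at the connection"

# Assumed grouping of the causes from the leaking-equipment example.
fishbone = {
    "People": ["Techs not trained on the new seals"],
    "Methods": ["Threads bite into the seal during tightening"],
    "Materials": ["Seals differ from the ones used before"],
}


def render(effect, bones):
    """Return a plain-text outline: the effect first, then each category and its causes."""
    lines = [f"Effect: {effect}"]
    for category, causes in bones.items():
        lines.append(f"  {category}")
        for cause in causes:
            lines.append(f"    - {cause}")
    return "\n".join(lines)


print(render(effect, fishbone))
```

An outline like this is easy to reshuffle while the team is still deciding which category a cause belongs to, and it can be redrawn as a proper diagram once the grouping settles.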
How can maintenance departments implement root cause analysis?
Now that you know what it is, it's time to get it working for you. One of your most important goals needs to be getting accurate information. If you want to prevent a problem from popping up again, you need to know why it happened in the first place. But we know that there are times when getting to the bottom of things can be challenging.
Make reporting incidents easier
Ask yourself: When something goes wrong, is there a process in place for reporting it? How comfortable are the techs using this process? Does it encourage them to be open and honest?
There are different ways to approach the situation. For example, you can have a frank discussion with the team, assuring them that you're more interested in avoiding problems than punishing people. And when you have the chance, go out of your way to show that you really mean it.
You can also look at ways to allow for anonymous reporting. Techs might feel more comfortable reporting if their name is not attached.
Get standardized maintenance processes with a CMMS
Before you can reliably use RCA, the maintenance team needs to be performing inspections and tasks the same way every time. If your processes are not standardized, there's no way for you to look back for changes. Remember, with change and event analysis, you're looking for what was different. If the team does things differently every time, you can't find it.
Modern CMMS software helps you standardize processes with work orders packed with step-by-step instructions and customizable checklists. Now, instead of techs winging it when they have to complete an unfamiliar task, they can easily access the department's best practices.
And if they have any questions, they can quickly reach out from anywhere by using the CMMS to add comments directly to tasks and work orders.
Get reliable data with a CMMS
A good CMMS makes it easy to capture and share data.
With paper- and spreadsheet-based methods, there are too many chances for bad data to creep in. Old-fashioned paperwork makes it hard to create copies, and the copies you do make are easy to lose. And with spreadsheets, you can make many copies quickly, but you don't have any way to keep them all connected and up to date.
Modern maintenance management solutions keep everything in a central database your team can access from any connected device, from desktops to smartphones. And because everyone is working from the same data, it's always accurate and up to date.
And that means when you go back and start looking for causes, you know you can trust your data.
If you're hoping to start using root cause analysis, you need a CMMS.
Quick, concise summary
Root cause analysis is a process maintenance departments can use to find out why something happened. From there, they can find ways to avoid or recreate it. One way to discover causes is using the 5 Whys. By carefully asking why something happened, maintenance departments can dig down to find the original, root cause. Another common method is looking for recent changes that could explain the new result. Fishbone diagrams can also help departments organize and understand the categories of causes and their relationships. To use RCA effectively, departments need an open process for incident reporting. They also need reliable data, and modern CMMS solutions are a cost-effective way to ensure data is both reliable and available.