In asset-heavy industries, failures are inevitable. Things fall out of alignment, seize, break. But by converting failures into hard, actionable numbers, you can make future failures smaller, less frequent, and more manageable. So, how can you calculate and use three key failure metrics: mean time to repair (MTTR), mean time between failures (MTBF), and mean time to failure (MTTF)?
But before looking at how to measure and convert failure into metrics, we need to make sure we understand what failure is. Not all failure is the same. Failure has nuance.
Partial vs complete failures
Painting in broad strokes, we can divide failure into two types: partial and complete. With a partial failure, the asset might still work, but it won't be working properly.
Complete failures are like being asleep: you are or you aren't. But partial failures exist along a spectrum. They're the same as being tired. You can be everything from a bit sleepy to dead on your feet.
Let's look at an example from manufacturing. Imagine you have a giant pot press with a partial failure. The pots might not be exactly the right shape. Or, they might be coming out the right size and shape, but at the wrong speed, throwing everything off down the line. If it had a complete failure, though, the press would stop pressing altogether.
Are partial failures better than complete failures? It depends on the situation, but it's likely that pressing a bunch of slightly-wrong pots is worse than pressing none at all.
At least with a complete failure, with the line coming to a screeching halt, you know there's a problem and can fix it. But many partial failures are "silent."
There are famous examples of Excel spreadsheets silently failing and causing errors in scientific papers without anyone noticing until much later. The software was interpreting the names of genes as dates, which then corrupted the data. According to The Washington Post, "But when you type these shortened gene names into Excel, the program automatically assumes they refer to dates — Sept. 2 and March 1, respectively. If you type SEPT2 into a default Excel cell, it magically becomes "2-Sep." It's stored by the program as the date 9/2/2016."
Because the software appeared to be working fine, no one knew to fix it.
Partial vs complete, a simple example
Consider a bicycle. We can say a complete failure is when the bike's chain slips the gears and comes off all the way. No matter how hard and fast you pedal, you're not going anywhere.
But what if just the chain guard comes off? In that case, the bike still works, and you might not even realize there's a problem. Moving along the spectrum, we can see failures that are more obvious but still only partial.
Imagine someone's gone and stolen the bicycle's seat. With a bit of determination and balance, you can still ride the bike by standing up on the pedals. It's not a complete failure; it's still only partial.
Before we look at MTBF and MTTF, though, let's remember that there's a third metric, MTTR (mean time to repair), which is equally important.
We already looked at it in great detail in our blog discussing MTTR. I've included some of the highlights below, but it's worth your time to go and read the earlier post and then come back.
What is MTTR?
Mean time to repair (MTTR) measures how efficiently the maintenance department gets assets back up and running.
How to calculate mean time to repair (MTTR)
To calculate MTTR, the first thing you need to know is how much time you spent repairing an asset over a set period. Say you have a press with a tricky motor. Over a week, you spend a total of four hours working on it.
The first time you work on it for an hour and a half. Then the second time you need another two and a half hours.
Something to remember: In this specific case, the lengths of time to repair the asset are fairly similar. But this is not always the case.
You can still use MTTR with very different repair times. So, on another asset, the first time you fixed it, you needed thirty minutes. The second time, three hours. Third time, two days.
It's okay if the lengths of time are very different from one another. But the people doing the repairs need to be roughly the same in terms of ability and preparation. If the first time the maintenance team worked on the asset, it was three senior techs, but the second time it was one junior tech, the metric is less accurate. It's generally the case that less experienced techs take longer to repair an asset.
To make sure the person doing the work isn't throwing off your final numbers, you need to know how long a properly trained professional using a clear set of instructions takes to complete the repairs. If some of the data you're collecting is from a new hire working on an asset without an O&M manual, you're not going to end up with a useful result. In some cases, you might want to adjust the numbers a bit to allow for differences in experience and training.
Use this formula to calculate your MTTR:
MTTR = total repair time ÷ number of repairs
Take the total amount of time (which we already said was four hours) and divide it by the number of times you worked on the asset (which we said was two). Your MTTR is two hours.
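If you'd rather let a script do the division, here's a minimal Python sketch of the same calculation. The function name and the list of repair times are just illustrative choices; the numbers come from the press example above.

```python
def mean_time_to_repair(repair_hours):
    """Return MTTR: total repair time divided by the number of repairs."""
    if not repair_hours:
        raise ValueError("need at least one repair to compute MTTR")
    return sum(repair_hours) / len(repair_hours)


# The press with the tricky motor: two repairs in one week.
press_repairs = [1.5, 2.5]  # hours spent on each repair

mttr = mean_time_to_repair(press_repairs)
print(f"MTTR: {mttr} hours")  # 4 hours / 2 repairs = 2.0 hours
```

The same function works when the repair times are very different from one another. You just pass in a longer or more uneven list.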
How to leverage MTTR
Generally, you want this number to be as small as possible. So once you have it, you should look for ways to shrink it.
For example, you might start to think about staffing. Maybe you need more people overall or just more people with specific skill sets.
Additional training for current staff might be another option. Or you can also look at different ways to capture and share "tribal knowledge," which is often all the clever tricks and shortcuts the senior staff knows thanks to hard-won experience with the assets and equipment. For example, a less experienced junior tech might have to run through a long list of possibilities when troubleshooting a stuck conveyor belt. The senior techs, though, know which rollers to check first. That information isn't in the manual, though. The techs know where to check because that's where the belt tends to stick.
By following a five-step process, maintenance departments can ensure everyone on the team has access to the same information. Now junior techs don't have to chase down senior techs to ask them questions. And when a member of the team retires, all their insights don't walk out the door with them.
MTTR is also often used to evaluate which spare parts to keep onsite and to set par levels. If things are taking too long to repair, it could be because tracking down the required parts is taking too much time and effort.
Tracking parts and materials with inventory control software helps ensure you have the parts you need, when you need them, at the right price. Once you associate inventory to work orders, the software automatically adjusts levels in real time when techs close out. Instead of finding you don't have the part you need, you're alerted as soon as you hit your safety par level, which is when you need to submit your next purchase order. A quick example: you tend to use one fan belt per month, and the lead time, the amount of time it takes for one to arrive after you place an order, is a month. When you hit your last two, the software warns you that it's time to place the next order. Now you never have to worry about running out. And you don't have to worry about carrying too much in inventory trying to cover yourself. You always have just enough.
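To see where "your last two" comes from, here's a rough Python sketch of a common reorder-point calculation: expected usage during the lead time, plus a safety buffer. The function names are hypothetical, and the safety stock of one belt is an assumption chosen to match the fan belt example. Real inventory software may calculate its alert levels differently.

```python
def reorder_point(usage_per_month, lead_time_months, safety_stock):
    """Stock level at which the next purchase order should go out."""
    # Parts you expect to use while waiting for delivery, plus a buffer.
    return usage_per_month * lead_time_months + safety_stock


def needs_reorder(on_hand, point):
    """True once stock on hand has dropped to the reorder point."""
    return on_hand <= point


# Fan belt example: one belt per month, one-month lead time.
# The one-belt safety stock is an assumed value, not from a real system.
point = reorder_point(usage_per_month=1, lead_time_months=1, safety_stock=1)

print(point)                    # 2 belts
print(needs_reorder(3, point))  # False: still above the alert level
print(needs_reorder(2, point))  # True: time to place the order
```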
Asset replacement and selection
MTTR can even be helpful when deciding to repair or replace an asset. Over the useful life of an asset, the MTTR tends to move up because older assets take more time to repair. Their failures tend to be more serious. But this is the opposite of what you want, which is to find ways to reduce equipment downtime.
By looking at the changes to its MTTR over time, the front office can better decide when an asset needs to be replaced or if it makes more sense to keep asking the maintenance department to repair it.
The front office can also use MTTR to make better decisions about which new assets to buy. One growing trend for assets is modular design. Imagine you have to fix one tiny spring in an old wristwatch. Just think of how carefully you would need to take the watch apart, replace that one broken piece, and then put everything back together. It's a nightmare. But if that same watch had a more modular design, when you opened it up, there would be only three "pieces." Inside each piece would be all the same little screws, springs, and whatnots you'd find in a regular watch, but here they'd be housed in compartments you can easily remove and replace.
What is MTBF: mean time between failures?
This metric reveals reliability. It shows you how long on average an asset can run before you need to repair it.
The word "repair" is key here: you only calculate MTBF for assets that you can fix. For things that you can only ever replace, for example light bulbs, you use a different metric.
How to calculate mean time between failures (MTBF)
You need three things: the total length of the period you're measuring, the number of times the asset failed during it, and the amount of time it took to repair after each failure.
Subtract the total repair time from the length of the period to get the hours the asset was actually in operation. Then divide those operating hours by the total number of failures.
MTBF = total operating hours ÷ number of failures
One thing you don't need: the amount of time the asset was offline because of preventive maintenance. Calculating MTBF does not include the time you spent trying to avoid problems. It's important not to include the time you worked on the asset for PMs. If you do, you get a much worse result.
Let's look at a simple example. Say you have a press that was scheduled to run for 24 hours. During that time, it failed twice, and each time it took an hour to get it back up and running.
So, it was in operation for a total of 22 hours (24 hours minus the two hours it took for repairs). Twenty-two divided by two, the total number of failures, gives an MTBF of 11 hours.
Not a great asset. On average it's going to fail every 11 hours. That's not good.
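Here's the same arithmetic as a short Python sketch. The function and variable names are illustrative, and it assumes every entry in the repair list is an unplanned failure, with no PM time mixed in.

```python
def mean_time_between_failures(period_hours, repair_hours):
    """Return MTBF: operating hours divided by the number of failures.

    period_hours: length of the period being measured
    repair_hours: time spent on each unplanned repair (PMs excluded)
    """
    failures = len(repair_hours)
    if failures == 0:
        raise ValueError("no failures recorded, so MTBF is undefined")
    operating_hours = period_hours - sum(repair_hours)
    return operating_hours / failures


# The press: a 24-hour period, two failures, one hour to repair each.
mtbf = mean_time_between_failures(24, [1, 1])
print(f"MTBF: {mtbf} hours")  # (24 - 2) / 2 = 11.0 hours
```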
What is the value of MTBF?
But don't throw that press out just yet. Generally, when you have a low MTBF, you can trace it back to either operator error or issues with how the asset is being maintained and repaired.
You can likely improve MTBF with additional training and closer oversight for both operators and maintenance technicians.
Not only does MTBF expose issues with past use and repairs, but it also helps set up your preventive maintenance schedule for the future. If you know an asset, on average, fails every 100 hours, you can set PMs at every 90 hours. That way, you're getting the most bang for your PM buck.
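As a rough illustration, you could turn that rule of thumb into a tiny Python helper. The 90 percent default is only inferred from the 100-hour and 90-hour figures above. It's an assumption, not an industry standard, so pick whatever margin suits the asset.

```python
def pm_interval_hours(mtbf_hours, fraction_of_mtbf=0.9):
    """Suggest a PM interval a little shorter than the MTBF.

    fraction_of_mtbf is an assumed safety margin (90% by default).
    """
    if not 0 < fraction_of_mtbf <= 1:
        raise ValueError("fraction_of_mtbf must be between 0 and 1")
    return mtbf_hours * fraction_of_mtbf


# An asset that fails, on average, every 100 hours.
interval = pm_interval_hours(100)
print(f"Schedule PMs every {interval:.0f} hours")
```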
What is MTTF: mean time to failure?
Here again, we're looking at reliability, but now it's for things you can't repair. You can only replace them. The easiest example is light bulbs.
How to calculate mean time to failure (MTTF)
When we looked at MTBF, all the numbers were from one asset. But for MTTF, we need a group of identical failed items.
Going back to our basic example, light bulbs, we might have four burnt-out bulbs, and they ran for 20, 22, 26, and 18 hours respectively. We add up those numbers and get 86.
When we divide that by the number of bulbs, which was four, we get an MTTF of 21.5 hours.
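In Python, a minimal sketch of that calculation might look like this. The function name is just an illustrative choice, and the hours are the four bulbs from the example.

```python
def mean_time_to_failure(lifetimes_hours):
    """Return MTTF: total running time divided by the number of failed items."""
    if not lifetimes_hours:
        raise ValueError("need at least one failed item to compute MTTF")
    return sum(lifetimes_hours) / len(lifetimes_hours)


# Four burnt-out bulbs and how long each one ran before failing.
bulb_hours = [20, 22, 26, 18]

mttf = mean_time_to_failure(bulb_hours)
print(f"MTTF: {mttf} hours")  # 86 hours / 4 bulbs = 21.5 hours
```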
What is the value of MTTF?
Looking at our MTTF for the light bulbs, we can see right away you're going to need to switch brands, which is really all you can ever do when you have a low MTTF. You can only improve your results by buying better quality products. MTTF is the "you get what you pay for" metric.
MTTF also helps you better manage inventory. If you decide to stay with these awful light bulbs, at least you'll know to keep a lot of them in onsite inventory. Later, if you decide to switch to a better bulb, you know you can reduce carrying costs by keeping fewer of them around.
But sometimes the real power of MTTF is what it can tell you about the reliability of bigger, more complex assets.
In fact, the MTTF for a small part inside a large asset can have a huge effect on that asset's reliability. Think about your car. What happens when one of the interior lights burns out? Aside from some minor inconvenience, nothing.
But what about the fan belt? Like the light, it falls under the MTTF metric because it can't be fixed, only replaced.
But because the car can't run without the fan belt, the fan belt's MTTF can be more important than the car's MTBF when determining the car's overall reliability.
You can only really start to use failure metrics once you have a rock-solid data-collection system in place. Luckily, the easiest way to do that is with equipment maintenance software or work order software.
If you don't have a computerized maintenance management system (CMMS) yet, now's the perfect time to look into getting one. Older versions required huge upfront investments in IT infrastructure and licensing contracts. Not only that, the software tended to be hard to learn and temperamental.
But a good CMMS today is easy to learn and easy to use, offering a clean, intuitive interface and go-anywhere accessibility.
Providers use cloud-based computing to make sure your data stays secure. And it's always your data. Good providers are just babysitting it for you, and you can have it back whenever you ask for it.