In asset-heavy industries, failures are inevitable. Things fall out of alignment, seize, break. But by converting failures into hard, actionable numbers, you can make future failures smaller, less frequent, and more manageable.
But before looking at how to measure and convert failure into metrics, we need to make sure we understand what failure is. Not all failure is the same. Failure has nuance.
Partial vs complete failures
Painting in broad strokes, we can divide failure into two types, partial and complete. With partial, the asset might still work, but it’s not going to be working properly.
Complete failures are like being asleep: you are or you aren’t. But partial failures exist along a spectrum. They’re the same as being tired. You can be everything from a bit sleepy to dead on your feet.
Let’s look at an example from manufacturing. Imagine you have a giant pot press with a partial failure. The pots might not be exactly the right shape. Or, they might be coming out the right size and shape, but at the wrong speed, throwing everything off down the line. If it had a complete failure, though, the press would stop pressing altogether.
Are partial failures better than complete failures? It depends on the situation, but it’s likely that pressing a bunch of slightly-wrong pots is worse than pressing none at all.
At least with a complete failure, with the line coming to a screeching halt, you know there’s a problem and can fix it. But many partial failures are “silent.”
There are famous examples of Excel spreadsheets silently failing and causing errors in scientific papers without anyone noticing until much later. The software was interpreting the names of genes as dates, which then corrupted the data. According to The Washington Post, “But when you type these shortened gene names into Excel, the program automatically assumes they refer to dates — Sept. 2 and March 1, respectively. If you type SEPT2 into a default Excel cell, it magically becomes “2-Sep.” It’s stored by the program as the date 9/2/2016.”
Because the software appeared to be working fine, no one knew to fix it.
Partial vs complete, a simple example
Consider a bicycle. We can say a complete failure is when the bike’s chain slips the gears and comes off all the way. No matter how hard and fast you pedal, you’re not going anywhere.
But what if just the chain guard comes off? In that case, the bike still works, and you might not even realize there’s a problem. Moving along the spectrum, we can see failures that are more obvious but still only partial.
Imagine someone’s gone and stolen the bicycle’s seat. With a bit of determination and balance, you can still ride the bike by standing up on the pedals. It’s not a complete failure; it’s still only partial.
Not to get too philosophical here, but part of dealing with failures is admitting that some of them are inevitable. There’s just no way to eliminate them.
In some cases, you want to bake failure right into your maintenance strategy. For example, everyone runs light bulbs until they fail. There’s no quick way to inspect the filament and no easy way to pre-emptively replace it. So you’re always going to go right up to and past that failure.
In other cases, no matter how hard you try, some failures are going to slip by you, especially when you’re working with a newer asset or setting up and fine-tuning a preventive maintenance program. How often should you be inspecting and swapping out fan belts? Your answer develops over time based on both the manufacturer’s recommendations and your direct experience with failures.
But that doesn’t mean you shouldn’t be doing everything you can to avoid failures. And part of that is using each failure as an opportunity to learn more about your asset, equipment, and the parts and materials you need to keep them up and running. By tracking and applying the right metrics, every failure helps you avoid future problems.
Basically, it comes down to “Fool me once, shame on you. Fool me twice, shame on me.”
So, we have the theory that failures are opportunities to learn how to avoid them in the future, but how do we make the jump from a way of thinking to a way of doing, a set of concrete steps we can take?
Let’s look at maintenance failure metrics, what they are, how they work, and how we can work with them.
Mean time to repair (MTTR) measures how efficiently the maintenance department gets assets back up and running.
For a full explanation of MTTR, along with the formula and some concrete examples, check out our post What is Mean Time To Repair (MTTR)?
How to calculate mean time to repair (MTTR)
To calculate MTTR, the first thing you need to know is how much time you spent repairing an asset over a set period. Say you have a press with a tricky motor. Over a week, you spend a total of four hours working on it.
The first time you work on it for an hour and a half. Then the second time you need another two and a half hours.
Something to remember: In this specific case, the lengths of time to repair the asset are fairly similar. But this is not always the case.
You can still use MTTR with very different repair times. So, on another asset, the first time you fixed it, you needed thirty minutes. The second time, three hours. Third time, two days.
Use this formula to calculate your MTTR:
MTTR = total repair time ÷ number of repairs
Take the total amount of time (which we already said was four hours) and divide it by the number of times you worked on the asset (which we said was two). Your MTTR is two hours.
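If you’d rather let a script do the arithmetic, here’s a minimal Python sketch of the same calculation. The function name mean_time_to_repair and the idea of passing in a plain list of repair durations are assumptions for illustration, not something a particular CMMS provides.

```python
def mean_time_to_repair(repair_hours):
    """Return MTTR: total repair time divided by the number of repairs.

    repair_hours is a list with one entry (in hours) per repair job.
    """
    if not repair_hours:
        raise ValueError("need at least one repair to compute MTTR")
    return sum(repair_hours) / len(repair_hours)


# The press from the example: two repairs, 1.5 hours and 2.5 hours.
press_mttr = mean_time_to_repair([1.5, 2.5])
print(press_mttr)  # 2.0
```

The same function works when repair times vary widely, as long as every duration is in the same unit.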
How to leverage MTTR
Generally, you want this number to be as small as possible. So once you have it, you should look for ways to shrink it.
For example, you might start to think about staffing. Maybe you need more people overall or just more people with specific skill sets.
Additional training for current staff might be another option. Or you can also look at different ways to capture and share “tribal knowledge,” which is often all the clever tricks and shortcuts the senior staff knows thanks to hard-won experience with the assets and equipment. For example, a less experienced junior tech might have to run through a long list of possibilities when troubleshooting a stuck conveyor belt. The senior techs, though, know which rollers to check first. That information isn’t in the manual, though. The techs know where to check because that’s where the belt tends to stick.
By following a five-step process, maintenance departments can ensure everyone on the team has access to the same information. Now junior techs don’t have to chase down senior techs to ask them questions. And when a member of the team retires, all their insights don’t walk out the door with them.
MTTR is also often used to evaluate which spare parts to keep onsite and to set par levels. If things are taking too long to repair, it could be because tracking down the required parts is taking too much time and effort.
Tracking parts and materials with inventory control software helps ensure you have the parts you need, when you need them, at the right price. Once you associate inventory to work orders, the software automatically adjusts levels in real time when techs close out. Instead of finding you don’t have the part you need, you’re alerted as soon as you hit your safety par level, which is when you need to submit your next purchase order.
A quick example: you tend to use one fan belt per month, and the lead time, the amount of time it takes for one to arrive after you place an order, is a month. When you hit your last two, the software warns you that it’s time to place the next order. Now you never have to worry about running out. And you don’t have to worry about carrying too much in inventory trying to cover yourself. You always have just enough.
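The fan belt example can be sketched as a simple reorder check. This is only an illustration: the function names are made up, and the one-belt safety buffer is an assumption chosen so the numbers match the “last two” trigger in the example, not a rule from any specific inventory software.

```python
def reorder_point(usage_per_month, lead_time_months, safety_stock):
    """Stock level at which the next purchase order should go out.

    Covers what you expect to use while waiting for delivery,
    plus a safety buffer (assumed here, not given in the article).
    """
    return usage_per_month * lead_time_months + safety_stock


def needs_reorder(on_hand, usage_per_month, lead_time_months, safety_stock):
    """True once stock on hand has dropped to the reorder point."""
    return on_hand <= reorder_point(usage_per_month, lead_time_months, safety_stock)


# One fan belt per month, one month of lead time, one belt held as a buffer.
print(reorder_point(1, 1, 1))     # 2
print(needs_reorder(3, 1, 1, 1))  # False
print(needs_reorder(2, 1, 1, 1))  # True
```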
Asset replacement and selection
MTTR can even be helpful when deciding to repair or replace an asset. Over the useful life of an asset, the MTTR tends to move up because older assets take more time to repair. Their failures tend to be more serious. But this is the opposite of what you want, which is to find ways to reduce equipment downtime.
By looking at the changes to its MTTR over time, the front office can better decide when an asset needs to be replaced or if it makes more sense to keep asking the maintenance department to repair it.
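One way to watch that trend is to group repair records by year and compute MTTR for each group. The sketch below assumes you can export repairs as (year, hours) pairs. The record layout, the function name, and every number in it are hypothetical and only there to show the shape of the calculation.

```python
from collections import defaultdict


def mttr_by_year(repairs):
    """Return {year: MTTR in hours} from (year, repair_hours) pairs."""
    hours_by_year = defaultdict(list)
    for year, repair_hours in repairs:
        hours_by_year[year].append(repair_hours)
    return {
        year: sum(hours) / len(hours)
        for year, hours in sorted(hours_by_year.items())
    }


# Made-up repair history for a single asset.
history = [
    (2021, 1.0), (2021, 2.0),
    (2022, 2.0), (2022, 3.0),
    (2023, 4.0), (2023, 5.0),
]
trend = mttr_by_year(history)
print(trend)  # {2021: 1.5, 2022: 2.5, 2023: 4.5}
```

A steadily rising figure like this is the kind of signal that can feed a repair-or-replace conversation.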
The front office can also use MTTR to make better decisions about which new assets to buy. One growing trend for assets is modular design. Imagine you have to fix one tiny spring in an old wristwatch. Just think of how carefully you would need to take the watch apart, replace that one broken piece, and then put everything back together. It’s a nightmare. But if that same watch had a more modular design, when you opened it up, there would be only three “pieces.” Inside each piece would be all the same little screws, springs, and whatnots you’d find in a regular watch, but here they’d be housed in compartments you can easily remove and replace.
Mean time between failures (MTBF) is a measure of reliability. It shows you how long, on average, an asset can run before you need to repair it.
For a full explanation of MTBF, along with the formula and some concrete examples, check out our post What is Mean Time Between Failures.
How to calculate mean time between failures (MTBF)
You need three things: the total length of the period you’re measuring, the number of times the asset failed, and the amount of time it took to repair after each failure.
Subtract the repair time from the total period to get the hours of operation, then divide that by the total number of failures.
Let’s look at a simple example. Say you have a press that ran for 24 hours. During that time, it failed twice, and each time it took an hour to get it back up and running.
So, it was in operation for a total of 22 hours (24 hours minus the two hours it took for repairs). Twenty-two divided by two, the total number of failures, equals 11.
Not a great asset. On average, it’s going to fail every 11 hours.
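Here’s the press example as a small Python sketch. The function name and its inputs (the length of the period plus a list of repair durations, one per failure) are assumptions made for illustration.

```python
def mean_time_between_failures(period_hours, repair_hours):
    """Return MTBF: operating hours divided by the number of failures.

    period_hours is the full window being measured.
    repair_hours holds one repair duration (in hours) per failure.
    """
    failures = len(repair_hours)
    if failures == 0:
        raise ValueError("MTBF is undefined when there were no failures")
    operating_hours = period_hours - sum(repair_hours)
    return operating_hours / failures


# 24-hour window, two failures, one hour of repair each.
press_mtbf = mean_time_between_failures(24, [1, 1])
print(press_mtbf)  # 11.0
```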
What is the value of MTBF?
But don’t throw that press out just yet. Generally, when you have a low MTBF, you can trace it back to either operator error or issues with how the asset is being maintained and repaired.
In some cases, what you need is more and better standardization across inspections, maintenance, and repairs.
When everyone on the team has their own way of checking, maintaining, and fixing your assets and equipment, there’s no way to ensure anyone is doing the work correctly. Instead, what you need is a way to get everyone following best practices.
The benefits are twofold. First, the maintenance department can better look after assets and equipment when everyone is doing the best possible work. Second, it’s so much easier to find and correct problems when you’re starting from a consistent baseline.
For example, if you have an asset that keeps overheating, it’s tough to know why when everyone is checking levels and adding different lubricants in different ways. Is it the product? Is it the procedure? But if everyone on the team checks the same way and adds the same product using the same process, you have very few variables to check.
Not only does MTBF expose issues with past use and repairs, but it also helps set up your preventive maintenance schedule for the future. If you know an asset, on average, fails every 100 hours, you can set PMs at every 90 hours. That way, you’re getting the most bang for your PM buck.
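As a rough sketch, the scheduling idea comes down to subtracting a margin from the MTBF and repeating that interval. The 10-hour margin below simply mirrors the 100-hour and 90-hour figures in the example. The function names are hypothetical, and the right margin for a real asset is a judgment call.

```python
def pm_interval(mtbf_hours, margin_hours):
    """Run hours between PMs: the MTBF minus a safety margin."""
    if margin_hours < 0 or margin_hours >= mtbf_hours:
        raise ValueError("margin must be at least 0 and less than the MTBF")
    return mtbf_hours - margin_hours


def pm_due_points(mtbf_hours, margin_hours, count):
    """Run-hour marks at which the next `count` PMs come due."""
    interval = pm_interval(mtbf_hours, margin_hours)
    return [interval * n for n in range(1, count + 1)]


print(pm_interval(100, 10))       # 90
print(pm_due_points(100, 10, 3))  # [90, 180, 270]
```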
With mean time to failure (MTTF), we’re again looking at reliability, but now it’s for things you can’t repair. You can only replace them. The easiest example is light bulbs.
For a full explanation of MTTF, along with the formula and some concrete examples, check out our post What is Mean Time To Failure?
How to calculate mean time to failure (MTTF)
When we looked at MTBF, all the numbers were from one asset. But for MTTF, we need a group of identical failed items.
Going back to our basic example, light bulbs, we might have four burnt-out bulbs, and they ran for 20, 22, 26, and 18 hours respectively. We add up those numbers and get 86.
When we divide that by the number of bulbs, which was four, we get an MTTF of 21.5 hours.
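The light bulb numbers can be checked with a few lines of Python. The function name is made up for this sketch, and it assumes you have a recorded lifetime for every failed item in the group.

```python
def mean_time_to_failure(lifetimes_hours):
    """Return MTTF: total hours run divided by the number of failed items."""
    if not lifetimes_hours:
        raise ValueError("need at least one failed item to compute MTTF")
    return sum(lifetimes_hours) / len(lifetimes_hours)


# Four burnt-out bulbs from the example.
bulb_hours = [20, 22, 26, 18]
print(sum(bulb_hours))                   # 86
print(mean_time_to_failure(bulb_hours))  # 21.5
```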
What is the value of MTTF?
Looking at our MTTF for the light bulbs, we can see right away you’re going to need to switch brands, which is really all you can ever do when you have a low MTTF. You can only improve your results by buying better quality products. MTTF is the “you get what you pay for” metric.
MTTF also helps you better manage inventory. If you decide to stay with these awful light bulbs, at least you’ll know to keep a lot of them in onsite inventory. Later, if you decide to switch to a better bulb, you know you can reduce carrying costs by keeping fewer of them around.
But sometimes the real power of MTTF is what it can tell you about the reliability of bigger, more complex assets.
In fact, the MTTF for a small part inside a large asset can have a huge effect on that asset’s reliability. Think about your car. What happens when one of the interior lights burns out? Aside from some minor inconvenience, nothing.
But what about the fan belt? Like the light, it falls under the MTTF metric because it can’t be fixed, only replaced.
You can only really start to use failure metrics once you have a rock-solid data-collection system in place. Luckily, the easiest way to do that is with equipment maintenance software or work order software.
If you don’t have a computerized maintenance management system (CMMS) yet, now’s the perfect time to look into getting one. Older versions required huge upfront investments in IT infrastructure and licensing contracts. Not only that, the software tended to be hard to learn and temperamental.
But a good CMMS today is easy to learn and easy to use, offering a clean, intuitive interface and go-anywhere accessibility.
Providers use cloud-based computing to make sure your data stays secure. And it’s always your data. Good providers are just babysitting it for you, and you can have it back whenever you ask for it.