Dreaded NFF

NFF (No Fault Found) is an often-used KPI which may get a whole new meaning with the introduction of predictive maintenance. Let’s go back to the origin of this KPI. Some two decades ago it gained attention: as companies increasingly focused on customer satisfaction, people found out that many so-called ‘bad parts’ that were injected into the reverse supply chain tested perfectly and were therefore flagged NFF. There has been an ongoing struggle between the field organisations’ and the reverse supply chain’s goals. Field service did all it could to increase the number of interventions per engineer per day, even at the expense of removing too many parts, under the motto that time is more expensive than parts. Removed parts then get injected into the reverse supply chain, where they typically get sent to a hub to be checked and subsequently repaired. To the reverse logistics/repair organisation, NFF creates ‘unnecessary’ activity – parts needing to be checked without any demonstrable problem being uncovered. These parts then need to be re-qualified: documented, repackaged,… Therefore, NFF is really bad for the observed performance of that organisation.

Back to predictive analytics: whereas CBM (Condition Based Maintenance) will order a maintenance activity based on the condition of the equipment/part – and therefore do so when there’s demonstrable cause for concern – predictive maintenance will ideally generate a warning with longer lead time. Often before any signs of wear/tear become apparent! Provided certain conditions are met (see previous posts on criticality/accuracy/coverage/effort), parts will therefore be removed which technically will be NFF! Because the removal of these parts will prevent a much costlier event this is not a problem per se, but it will require rethinking not only internal and external processes but also KPIs! If we want adoption of these predictive approaches to maintenance and operations, KPIs need to be rethought to reflect the optimised nature of these actions. We can’t allow anybody to be penalised for applying the optimal approach!

Predictive (or by extension, prescriptive) maintenance has huge potential for cost savings and as we’ve seen before (see previous blog entries), these savings should be looked at from a holistic point of view. Some costs may actually go up in order to bring down overall costs. Introducing such methodologies therefore also demands a lot of attention to process changes and to how people’s performance is measured. The good news is that the introduction of predictive maintenance can be gradual; i.e. start with those areas that offer high confidence in the predictions and high return. Nothing helps adoption better than proven use cases!

Today’s focus area for operations: increase uptime!

As other domains such as procurement, supply chain, production planning, etc. get increasingly lean, attention focuses on the few remaining areas where large gains are expected from increasing efficiency. Fleet uptime or machine park uptime is the focus area today. Indeed, investors increasingly look at asset utilisation to determine whether an operation is run efficiently or not. As we know, in the past many mistakes have been made by focusing on acquisition cost at the expense of quality. This has led to a lot of disruption with regard to equipment uptime which, in turn, renders inefficient any of the lean initiatives mentioned above. So, what are the important factors determining uptime? We’ll look at the two most important ones:

– reducing the number of failures

– reducing time to repair (TTR)

Reducing the number of failures sounds pretty obvious: purchase better equipment and you’re set. Sure, but how do you know the equipment is better? Sometimes, it’s easily measurable; e.g. I’ve known a case where steel screws were replaced by titanium ones. Although the latter were maybe five times more expensive, their total cost on the machine may have been less than $1,000 whereas one failure caused by a steel screw cost $25,000. Taking an integrated business approach to purchasing saved a lot of money over the lifetime of the equipment. In other cases, the extra quality is hard to measure and one has to trust the supplier. This ‘trust’ can be captured by SLAs, warranty contracts or even a fully servitised approach (where the supplier gets paid if and when the equipment functions according to a preset standard).

The number of failures can also be reduced by improving maintenance; pretty straightforward for run-of-the-mill things such as clogging oil filters, etc. One just sets a measurement by which a trigger is set off and performs the cleaning or replacement. This is what happens with your car; every 15,000 miles or so certain things get replaced, whatever their status. The low price of both the parts involved and the intervention allows for such an approach. Things become more complex when different schedules need to be executed on complex equipment: allowing all of the triggers to work independently (engine, landing gear, hydraulics, etc. on a plane for instance) may cause maintenance requirements almost every day. At least some of these need to be synchronised and ideally, the whole maintenance schedule should be optimised. Mind you, optimisation doesn’t necessarily mean a minimisation of the number of interventions! It should rather focus on minimising impact on operational requirements.
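To make the synchronisation point a bit more tangible, here is a minimal sketch in Python. The four task intervals, the 360-day horizon and the 30-day base are made-up illustration values, and `harmonise` is a hypothetical helper that simply rounds every interval down to a multiple of the base so that visits coincide. It is not an optimiser; it only shows the trade-off between the number of days with a maintenance stop and the number of tasks carried out.

```python
def maintenance_days(intervals, horizon):
    """Return the set of days (1..horizon) on which at least one task is due."""
    days = set()
    for interval in intervals:
        days.update(range(interval, horizon + 1, interval))
    return days


def task_executions(intervals, horizon):
    """Count how many individual tasks are carried out within the horizon."""
    return sum(horizon // interval for interval in intervals)


def harmonise(intervals, base):
    """Round each interval down to a multiple of `base` (hypothetical rule).

    Rounding down means no task is ever done later than originally planned.
    """
    return [max(base, (interval // base) * base) for interval in intervals]


# Illustrative intervals in days for four independent triggers (assumed values).
independent = [30, 45, 50, 70]
horizon = 360

synchronised = harmonise(independent, base=30)

print("independent :", len(maintenance_days(independent, horizon)), "stop days,",
      task_executions(independent, horizon), "tasks")
print("synchronised:", len(maintenance_days(synchronised, horizon)), "stop days,",
      task_executions(synchronised, horizon), "tasks")
```

With these assumed figures the equipment is stopped on half as many days after synchronising, but more individual tasks get done. Fewer interruptions at the price of more interventions is exactly why the number of interventions alone is a poor optimisation target.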

In order to further reduce the number of failures, wouldn’t it be great if we could prevent those events that occur less often? This involves predicting the event and prescribing an action in order to minimise its impact on production. This is exactly the focus of prescriptive maintenance; combining predictions (resulting from predictive analytics) with cause/effect/cost analysis to come up with the most appropriate course of action. Ideally, if maintenance is prescribed, it enters the same optimisation logic as described above. Remember, the goal is to optimise asset utilisation.

Reducing TTR is too often overlooked or just approached by process standardisation. However, many studies have shown that TTR is highly impacted by the time it takes to diagnose the problem and the time to get the technician/parts on site – especially in the case of moving equipment. Predictive analytics may help reduce both: the first, by providing the technician with a list of the systems/parts most at risk at any moment in time and the second by making sure the ‘risky’ parts are available. There’s nothing worse than having to set in motion an unprepared chain of actors (technical department, supplier, tier 1,…) for tracking down a hard to find part. This is even worse when the failing machine slows down or halts an entire production chain…

Poor ROA (Return On Assets) is often a trigger for takeovers because the buyer is confident they can easily improve the situation. It’s one of the telltale signs of a poorly run or suboptimal operation and has to be avoided at all cost. If your sights are not yet set on this domain, chances are other people’s are!

What did the Coffee Pot say to the Toaster?

The Internet of Things (IoT) is at the precipice of the Gartner Hype cycle and there is no shortage of the “answers to everything” being promised. Many executives are just now beginning to find their feet after the storm wave that was the transition from on-premise to cloud solutions and are now being faced with an even faster paced paradigm shift. The transformative tidal wave that is IoT is crashing through CEO, CTO, and CIO’s offices and they are frantically searching for something to float their IoT strategies on but often are just finding drift wood.


Dr. Timothy Chou and his latest book Precision: Principles, Practices, and Solutions for the Internet of Things is your shipwright. The framework presented by Dr. Chou cuts through the fog that surrounds IoT and provides a straightforward, jargon-free explanation of IoT and the power that can be harnessed. Dr. Chou then goes on to present a showcase of case studies that are real-life profitable IoT solutions by a variety of traditional and hi-tech businesses.

One of the case studies Dr. Chou features is based on my work at New York Air Brake where we utilized instrumented and connected locomotives to create the world’s most advanced train control system that has saved the rail industry over a billion dollars in fuel, emissions, and other costs. It was this work that gave me a taste of the power IoT has and gave me the passion to want to make a bigger impact in the rail and transportation industries utilizing IoT data and thus join the Predikto family.

A predictive maintenance example

A prediction doesn’t mean that something will happen! A prediction merely says something may happen. Obviously, the more accurate that prediction gets, the closer it comes to determining something will happen. Yet, we often misinterpret accuracy or confidence in a prediction; when something has a 20% chance of failing or a 90% chance of failing, we often mistake the result of the failure for the chance of failing. In both cases, when the failure occurs, the result is the same; it is only the frequency of this failure happening that changes.

What I describe above is one of the reasons why managers often fail to come up with a solid business case for predictive analytics. Numbers – and especially risk-based numbers – all too often scare off people when they’re really not that hard to understand. Obviously, the underlying predictive math is hard but the interpretation from a business point of view is much simpler than most people dare to appreciate. We’ll illustrate this with an example: Company A is in the business of sand. Could hardly be simpler than that. Its business consists of unloading barges, transporting and sifting the sand internally and then loading it onto trucks for delivery. To do this, they need cranes (to unload the ships), conveyor belts and more cranes (to load the trucks). Some of these items are more expensive (the ship-loading cranes) or static (the conveyor belts) than others (the truck loading cranes). In this case, this has led to a purchasing policy which has focused on getting the best cranes available for offloading the ships (tying down a ship because the crane is broken is very expensive), is slightly less stringent on the conveyor belts (if one is broken, at least the sand is on our yards and we can work around the failure with our mobile cranes) and is downright hedged by buying overcapacity on the cheaper mobile cranes. This happens quite often: the insurance strategy changes with both the value of the assets and their criticality to the operations. Please also note that criticality goes up with diminishing alternatives… A single asset is typically less critical (from an operational point of view) when part of a fleet of 100 than if it were alone to perform a specific task.

All these assets are subject to downtime; both planned and unplanned. We’ll focus on the unplanned downtime. When a fixed ship-loading crane fails, the ship either can’t be off-loaded any more or it has to be moved within reach of another such crane (if that one’s available). Either way, the offloading is interrupted and the failure not only yields repair costs (time: diagnose, get the part, fix the problem – parts – people) but also delays the ship’s departure, which may result in additional direct charges or costs due to later bay availability for incoming ships. When a conveyor belt breaks down, there’s the choice of waiting for it to be repaired or finding an alternative such as loading the sand on trucks and hauling it to the processing plant. Both situations come at a high cost. Moreover, both the cranes and the conveyor may cause delays for the sifting plant, which is probably the most expensive asset on site and whose utilisation must be maximised! For the truck loading cranes, the solution was to add one extra crane for every 10 in the fleet. That overcapacity should ensure ‘always on’ but comes at the cost of buying spare assets.

Let’s now mix in some numbers. Let’s say a ship-loading crane costs €5,000,000; a conveyor costs €500,000 and a mobile crane costs €250,000. The company has three ship docks with one crane each, 6 conveyors and a fleet of 20 mobile cranes, putting their total asset value at €22,000,000. If we take a conservative estimate that 6% of the ARV (Asset Replacement Value) is spent on maintenance, this installed base costs €1,320,000 to maintain every year. Let’s further assume that 50% of the interventions are planned and 50% are unplanned. We know that unplanned maintenance is 3-9 times more expensive than planned so for this example we’ll take the middle figure of 6x. We can now easily calculate the cost of planned and unplanned events by: €1,320,000 = 0.5x + 0.5*6x, where 0.5x is the total planned maintenance cost and 0.5*6x is the total unplanned maintenance cost. Result: of the total maintenance cost, roughly €190,000 is spent on planned maintenance whereas a whopping €1,130,000 is due to unplanned downtime! If the number of maintenance events is 200 (100 planned and 100 unplanned), that means that one planned maintenance event costs €1,900 and one unplanned event costs €11,300. Company A has done all it can to optimise the maintenance processes but can’t seem to drive down the costs further and therefore just decided this is part of doing business.
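For readers who prefer to check the arithmetic themselves, the split can be sketched in a few lines of Python. The function name `split_maintenance_budget` and its parameters are made up for this illustration; the inputs are the figures stated above (a €22,000,000 asset base, 6% maintenance rate, 200 events, a 50/50 split and a 6x multiplier), and the only modelling assumption is that every unplanned event costs six times a planned one.

```python
def split_maintenance_budget(asset_value, maintenance_rate, total_events,
                             planned_share=0.5, unplanned_multiplier=6):
    """Split a yearly maintenance budget into planned and unplanned spend.

    Assumes every unplanned event costs `unplanned_multiplier` times as much
    as a planned one, which is the simplification used in the example.
    """
    budget = asset_value * maintenance_rate
    planned_events = total_events * planned_share
    unplanned_events = total_events - planned_events
    # budget = planned_events * c + unplanned_events * multiplier * c
    cost_per_planned = budget / (planned_events
                                 + unplanned_events * unplanned_multiplier)
    cost_per_unplanned = cost_per_planned * unplanned_multiplier
    return {
        "budget": budget,
        "planned_total": planned_events * cost_per_planned,
        "unplanned_total": unplanned_events * cost_per_unplanned,
        "cost_per_planned": cost_per_planned,
        "cost_per_unplanned": cost_per_unplanned,
    }


# Figures as stated in the example (EUR).
company_a = split_maintenance_budget(22_000_000, 0.06, 200)
for name, value in company_a.items():
    print(f"{name}: EUR {value:,.0f}")
```

The unrounded results are about €188,600 planned and €1,131,400 unplanned, or roughly €1,886 and €11,314 per event, which the text rounds to €190,000, €1,130,000, €1,900 and €11,300.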

Meanwhile on the other side of town… Company B is a direct competitor of Company A. And for the sake of this example, we’ll even make it an exact copy of Company A but for one difference: it has embarked on a project to diminish the number of unplanned downtime events. They came to the same conclusion that for the 200 maintenance events, the best way to lower the costs was if they could magically transform unplanned maintenance into planned maintenance. They did some research and found that, well, they could – at least for some. Here’s the deal: if we can forecast a failure with enough lead time, we can prevent it from happening by planning maintenance on the asset (or component that is forecasted to fail) either when other maintenance is planned to happen or during times when the asset is not required for production. A maintenance event still takes place, but the planned prevent-fix costs €1,900 as compared to a break-fix costing €11,300 – that’s a €9,400 difference per event!

The realisation that the difference between a break-fix and a prevent-fix was €9,400 per event allowed them to avoid the greatest pitfall of predictive maintenance. Any such project requiring a major shift in mindset is bound to face headwinds. In predictive analytics, most of the pushback comes from people not understanding risk-based decision making or people not seeing the value associated with introducing the new approach. The first relates to the fact that many people still believe that predictions should be spot-on. Once they realise this is impossible, they often (choose to) ignore the fact that sensitivity can be tuned to increase precision, albeit at a cost: higher precision means less coverage (if we want higher prediction confidence, we can get it, but out of all failures we’ll catch a smaller portion). “If you can’t predict all failures, then what’s the point?” is an often heard objection.

Company B did its homework though and concluded that the prediction accuracy was high enough to live with at a 20% catch rate. The accuracy at this (rather low) catch rate meant that for every 11 predictions, 10 actually prevented a failure and 1 was a false positive (these figures are made up for this example). Let’s look at the economics: a 20% catch rate means that of 100 unplanned downtimes, 20 could be prevented, which resulted in a saving of 20 x €9,400 = €188,000. However, the prediction accuracy also means that for catching these 20, they actually had to perform 22 planned activities; the 2 extra events cost 2 x €1,900 = €3,800. The resulting savings were therefore €188,000 – €3,800 = €184,200; savings of roughly 14% on the total maintenance budget, or more than 16% of the cost of unplanned maintenance!
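The same economics can be written down as a short Python sketch. The function `predictive_savings` is a hypothetical helper, not part of any tool; it uses the made-up figures from the example (100 unplanned events, a 20% catch rate, 10 correct predictions out of 11) and the per-event costs of €1,900 and €11,300. Exact fractions are used so that the 22 interventions come out as a whole number.

```python
from fractions import Fraction


def predictive_savings(unplanned_events, catch_rate, accuracy,
                       prevent_cost, break_cost):
    """Net yearly saving from turning some break-fixes into prevent-fixes.

    Assumes a correct prediction fully prevents the failure and that a
    false positive costs one planned intervention.
    """
    caught = unplanned_events * catch_rate       # failures prevented
    interventions = caught / accuracy            # planned activities needed
    false_positives = interventions - caught     # unnecessary activities
    gross_saving = caught * (break_cost - prevent_cost)
    false_alarm_cost = false_positives * prevent_cost
    return {
        "caught": caught,
        "interventions": interventions,
        "false_positives": false_positives,
        "gross_saving": gross_saving,
        "false_alarm_cost": false_alarm_cost,
        "net_saving": gross_saving - false_alarm_cost,
    }


company_b = predictive_savings(
    unplanned_events=100,
    catch_rate=Fraction(20, 100),
    accuracy=Fraction(10, 11),
    prevent_cost=1_900,
    break_cost=11_300,
)
net = float(company_b["net_saving"])
print(f"net saving: EUR {net:,.0f}")
print(f"share of total maintenance budget: {net / 1_320_000:.1%}")
print(f"share of unplanned maintenance cost: {net / 1_130_000:.1%}")
```

Dividing the €184,200 net saving by the €1,320,000 budget gives about 14%; dividing it by the €1,130,000 spent on unplanned maintenance gives a little over 16%.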


What’s more, there are fringe benefits: avoiding the unplanned downtime results in better planning, which ultimately results in higher availability with the same asset base. Stock-listed companies know how important ROCE (Return On Capital Employed) is when investors compare opportunities but even private companies should beware: financial institutions use this kind of KPI to evaluate whether or not to allow for credit and at what rate (it plays a major role in determining a company’s risk profile). Another fringe benefit – and not a small one – concerns the fleet sizing for the mobile cranes (remember they took 10% extra machines just as a buffer for unplanned events): fleet size can be adjusted downward for the same production capacity because downtime during planned utilisation will be down by 20%. Even if they play it very safe and just downsize by one crane, that’s a €250,000 one-time saving plus an annual benefit of 6% on that: €15,000!

Company B is gradually improving flow by avoiding surprises; a 20% impact can’t go unnoticed and has a major effect on employee morale. They also did their homework very well and passed (part of) the reduced operational costs on to their clients. Meanwhile, at Company A, employees constantly feel like they’re running after the facts and can’t understand how Company B manages to undercut them on price and still throw a heck of a party to celebrate their great year!

The next efficiency frontier?

Mountains of consulting dollars have been invested in business process optimisation, manufacturing process optimisation, supply chain optimisation, etc. Now’s the time to bring everything together and with all these processes optimised, the utilisation rate of our whole production apparatus becomes ever higher. When all goes well, this means more gets done per invested dollar, making the CFO and investors happy through better ROA (Return On Assets). However efficient, this increasing load on the machine park comes at a price: less wriggle room in case something unexpected happens. In the past, when companies had excess capacity, this not only served to absorb demand variability; it also came in very handy when machines broke down by allowing the demand to be re-routed to other equipment.
There’s no more place to hide now, so there are a number of options one can consider in order to avoid major disruptions:

  • increase preventive maintenance: this may or may not help. The law of diminishing returns applies, especially as preventive maintenance tends to focus on normal wear and tear and parts with a foreseeable degradation behaviour. A better approach is to improve predictive maintenance; don’t overdo it where there’s no benefit but try to identify additional quick wins. Your best suppliers will be a good source of information. Suppliers that can’t help; well, you can guess what I think of those.
  • improve the production plan: too many companies still approach production planning purely reactively and lack optimisation capabilities. Machine changes, lots of stop and go, etc. all add to the fragility of the whole production apparatus (not to mention they typically – negatively – influence the output quality as well).
  • improve flow: I’m still perplexed when I see the number of hiccups in production lines because ‘things bumped into each other’. Crossing flows of unfinished parts is still a major cause of disruption (and a major focus point for top performers such as Toyota). Ask most plant managers why machines are in a certain place and they either “don’t remember” or will say “that’s the place where they needed the machine first” or even “that was the only place we had left”. Way too rarely do plant layouts get re-considered. Again, the best-in-class do this as often as once a year!
  • shift responsibilities: if you can’t (or won’t) be good at maintenance, then outsource it! Get a provider that can improve your maintenance and ideally can work towards your goal, which is usually not to have shinier machines but to get more and better output. If you really decide you don’t care about machine ownership at all, consider performance- or output-based contracts.
  • get better machines: sounds trivial but current purchasing approaches often fail to capture the ‘equipment quality’ axis and forget to look at lifetime cost in light of output. Just two months ago I heard of a company buying excavators from a supplier because for every three machines, they got one for free. This was presented as an assurance that the operator would never run out of machine capacity. In this case, it had the adverse effect, as the buyer wondered why the supplier needed to throw in an extra machine if it claimed its machines were as reliable as the best.
  • connect your machines: this is a very interesting step. Recognising that machines will eventually fail but at least making sure you get maximum visibility on what/where. Most of the time resolving equipment failures is spent… waiting! Waiting for the mechanic to arrive, waiting for the right part, etc.
  • add predictive analytics: predictive analytics not only allows you to prevent failures from happening but, building on the what/where axis of the previous point, also adds the why. Determining why something failed or will fail is crucial in optimising production output. Well-implemented predictive analytics improves production continuity by avoiding unplanned incidents (through predictive maintenance) and also allows for more efficient (faster) and effective (resulting in better machine uptime) maintenance.

So which of these steps should we take? Frankly, all of them. Maybe not all at once and (hopefully) some of them may already have been implemented. Key is to have a plan. Where are we now, what are our current problems, what are we facing,…? Formulating the problem is half the solution. Then – and this may surprise some – work top down. Start with the end goal, your “ultimate production apparatus”, and work your way back to determine how to get there. All too often people start with the simplest steps without having looked at the end goal and after having taken two or three steps they find out they need to backtrack because they took the wrong turn earlier in the process.

At any step, whether it’s purchasing equipment, installing sensors or whatever, look at whether your supplier understands your goals and is capable of integrating in “the bigger plan”. The next efficiency frontier is APM: Asset Performance Management. Not individually, but from a holistic point of view. While individual APM metrics are interesting for determining rogue equipment, only the overall APM metrics matter for the bottom line; did we deliver on time, was the quality on par, at what cost,…

Predictive Maintenance – a framework for executives

We typically expect statements like “there’s a 20% chance of part A failing over the coming two weeks” from a predictive analytics solution. More important than the prediction though, is the interpretation of that statement and what it means to operations, maintenance, etc.
Predictions are at the core of predictive maintenance applications. Understanding and by extension, applying predictions is not a given. The four-axis framework laid out in this blog should allow any executive to not only fully grasp the impact of prediction-driven decisions but also to make sure the whole organisation grasps predictive concepts. This framework not only looks at the risk-based decision process but it does so in plain language, no math degree required (should be a relief to most of us).
In plain language, the accuracy of a prediction measures how often the prediction turned out to be correct after the fact. So, if a prediction says a part will fail and accuracy is 20%, it means that you’re over-predicting by a factor of five. For the maintenance teams this means they’ll have to inspect/replace five parts in order to prevent one failure. While this doesn’t sound impressive, there are cases where you still want to go ahead and act. Which is why we introduce a second axis – Criticality.
Let’s take the prediction from above and look at two different situations. In case A, that 20% accuracy prediction is about in-flight entertainment. While a hassle (and for some airlines a no-go part), we can safely assume that most airlines wouldn’t act on a prediction for this non-critical part at such a low accuracy. However, if in case B the same prediction is made about the landing gear, it may warrant somewhat more attention. An example from every-day life would be: there’s a 20% chance you caught the flu versus there’s a 20% chance you caught SARS – no doubt, both statements would elicit different reactions, both from the patient and from the doctor! Therefore, please take note that not only operational criticality is important, but also safety (in some sectors more so than in others).
Which is better? You made 1 failure prediction and it was correct but there were 10 in total (high accuracy but low coverage). Or, you made 30 failure predictions and caught all 10 that actually occurred but at the cost of 20 unnecessary (a posteriori) interventions (high coverage but lower accuracy). Answer: it depends. For instance, criticality plays a big role in determining which is better. Even more than that, it’ll depend on where you are with the project. During phase-in, many clients choose to focus on high accuracy and not so much on coverage. The reason is straightforward: any failure caught pre-emptively is a win and sticking with highly accurate predictions builds trust throughout the organisation about introducing risk-based concepts for maintenance. In order to make a really educated decision, a fourth axis needs to be introduced. I call it effort.
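The two situations can be put side by side with a small Python sketch. The function name `accuracy_and_coverage` is made up for this post; what is called accuracy here corresponds to what data scientists usually call precision, and coverage to what they call recall.

```python
def accuracy_and_coverage(predictions_made, correct_predictions, actual_failures):
    """Return (accuracy, coverage, unnecessary interventions).

    accuracy = share of predictions that turned out to be right
    coverage = share of actual failures that were predicted
    """
    accuracy = correct_predictions / predictions_made
    coverage = correct_predictions / actual_failures
    unnecessary = predictions_made - correct_predictions
    return accuracy, coverage, unnecessary


# Scenario 1: one prediction, correct, out of 10 failures that occurred.
cautious = accuracy_and_coverage(1, 1, 10)
# Scenario 2: 30 predictions that caught all 10 failures.
aggressive = accuracy_and_coverage(30, 10, 10)

for label, (acc, cov, wasted) in (("cautious", cautious),
                                  ("aggressive", aggressive)):
    print(f"{label}: accuracy {acc:.0%}, coverage {cov:.0%}, "
          f"{wasted} unnecessary interventions")
```

The cautious setting scores 100% accuracy at 10% coverage, the aggressive one 33% accuracy at 100% coverage with 20 unnecessary interventions. Neither number on its own tells you which is better.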
It’s actually very tempting to call this fourth axis Cost but that would be an over-simplification. We’ll just consider Cost to be Financial Effort. Please note that, should you choose to really plot effort on an axis, net effort should be taken into account; i.e. (in a very simple form) the cost of predictive maintenance versus the cost of a failure*. If acting on the prediction costs more than the failure it prevents, I shouldn’t act. Really? What about criticality? E.g. if a part is predicted to fail and such failure could lead to bodily harm, surely I should act. Well, if such is the case, this should be represented in the “cost of failure”. Which takes us back to why I prefer referring to Effort instead of Cost. Equally, very low accuracy (e.g. 5%) could lead to a lot of dissatisfaction within the maintenance teams because most inspections they do lead to NFF (No Fault Found). If such inspection is as simple as reading out a log file, the Effort is different than if it requires dismantling an engine. Net Effort is therefore both crucial AND very hard to get right.
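A very rough way to put a number on the financial part of net effort is sketched below in Python. It is a simplification under stated assumptions: acting on a correct prediction replaces a break-fix by a planned fix, acting on a wrong one costs a planned fix for nothing, and the two costs (€1,900 and €11,300) are borrowed from the sand-company example in an earlier post purely for illustration. The function names are hypothetical, and non-financial effort (team frustration, safety) is deliberately left out.

```python
def net_benefit_per_prediction(accuracy, prevent_cost, break_cost):
    """Expected financial gain of acting on one prediction.

    Correct prediction: a break-fix is replaced by a planned fix.
    Wrong prediction: a planned fix is spent without preventing anything.
    """
    avoided = accuracy * (break_cost - prevent_cost)
    wasted = (1 - accuracy) * prevent_cost
    return avoided - wasted


def break_even_accuracy(prevent_cost, break_cost):
    """Accuracy at which acting on predictions neither gains nor loses money."""
    return prevent_cost / break_cost


PREVENT_COST = 1_900   # illustrative planned-fix cost (EUR)
BREAK_COST = 11_300    # illustrative break-fix cost (EUR)

for accuracy in (0.05, 0.20, 0.90):
    gain = net_benefit_per_prediction(accuracy, PREVENT_COST, BREAK_COST)
    print(f"accuracy {accuracy:.0%}: EUR {gain:,.0f} per prediction acted on")

print(f"break-even accuracy: {break_even_accuracy(PREVENT_COST, BREAK_COST):.1%}")
```

With these assumed costs, a 20% accurate prediction is still worth acting on while a 5% accurate one is not; the break-even sits just below 17%. Raise the cost of failure (for instance because safety is at stake) and the break-even accuracy drops accordingly.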
Understanding and applying the four axes mentioned above is crucial for operational deployment of predictive analytics for maintenance. Executives should educate and train themselves to become comfortable with these concepts. And make sure the whole organisation understands them. You’re using a CMMS (Computerised Maintenance Management System)? Great! Keep using it. Predictive analytics only provides an extra, smart layer on top of your operational systems in order to drive actions. In the field, processes don’t really change (frequencies typically change) but at the decision-taking level, it’s like putting on coloured glasses. We need to start looking at aftermarket processes from a risk-based point of view. And while that scares many people – “because they don’t understand statistics” – the four-axis approach described above should demystify things quite a bit.
* just do a quick internal check-up by asking what the cost of a failure is in your company; most of the time these figures are hard to come by and if someone has them, they’re typically greatly underestimated. I have yet to come across a case where predictive maintenance has no positive ROI… other than a business not being capable of deploying PdM (e.g. lack of data, wrong processes,…)

One prediction, many users

“Houston, we have a problem” must have carried a different meaning depending on whether you were an astronaut on board Apollo XIII, the astronaut’s family, in mission control or a rocket engineer on the Saturn project. While the example seems obvious, many people have only a vague idea of where to apply predictive maintenance in their business. When we ask about whose jobs will be impacted we very often don’t get beyond “the maintenance engineer”. And how is said maintenance engineer going to use the predictions?

Let’s have a look at some roles/business fields impacted most by predictive maintenance analytics. And let’s start with the above-mentioned maintenance engineer.

Maintenance engineer
The goals of predictive maintenance analytics are multiple but some of the main subjects are: moving unplanned to planned, better scheduling of maintenance, improved maintenance activity prioritisation. The maintenance engineer really wants to get an improved worksheet which tells him what to focus on during the upcoming activities. Predictive analytics should therefore drive an ‘intelligent’ worksheet, which combines preventive maintenance with prioritised predictive maintenance activities. The Maintenance engineer’s contact with predictive maintenance should involve little more than a revised worksheet.

Maintenance Scheduler
The maintenance scheduler may get impacted a bit more; instead of spreading out maintenance activities based on counters (time, cycles, mileage,…), predictive maintenance schedules are more dynamic and combine the former (i.e. due to legal requirements) with the predictive information. As a first step, the traditional schedules should be left untouched but activities augmented with predicted visits (improved worksheets). As a second step, the maintenance schedule should be optimised; in fact, predictive analytics isn’t even required for this step but I’m puzzled to never see truly optimised maintenance schedules…! A third step would involve spreading out maintenance visits; if need be, even negotiating with the legislator to allow this within certain limits. This third step will almost certainly require collaboration between OEMs, Operators and MROs.

Reliability Engineer
The reliability engineer is crucial for two main benefits from predictive maintenance: understanding the past and improving the future. The reliability engineer is not just interested in the predictions as such, but really in why the predictions were made. The improved insight should allow the reliability engineer to find root causes, define behavioural patterns (need to find them in order to avoid them), propose solutions, etc. Better insights into what causes certain failures and how well they can be predicted will also allow the reliability engineer to come up with new maintenance scheduling information.

Fleet Planner
Because predictive maintenance analytics focus on maintenance, people tend to forget the main goal lies outside maintenance: optimal uptime at the lowest possible cost. In my book, uptime means more than just guaranteeing equipment can be operated; it should really mean it’s fit for the task. When a large industrial robot has one of its grippers broken but the one required for a specific task works fine, that machine is 100% available for that task, even if from a technical point of view, it’s broken…! Giving fleet planners (machine park planners – let’s use fleet as a generic description) deep insight into fleet health allows them to assign the right machine to the right task.

While the CFO doesn’t generally care about the mechanics of maintenance, they typically do care about the cost of maintenance and operational risk. Predictive analytics can give an insight in both. Maintenance management or the COO may actually be interested in simulating the impact of budget constraints on fleet availability, maintenance effectiveness, etc.

These are just a few examples of the impact of predictive maintenance analytics on different corporate roles; one can easily come up with more. The main lesson of this thought exercise is that predictive analytics impacts the whole organisation and that predictions (or derivative information) should be presented appropriately so that every role can correctly interpret the results and draw the best conclusions. Whoever sat through an hour-long discussion between statisticians on the interpretation of a prediction knows what I’m talking about: keep it simple, contextualised and usable!