Why things break: Just the facts
A root cause failure analysis takes the guesswork out of downtime
Root cause failure analysis interprets the features of a system or a component to determine why it doesn’t perform as intended. The subject could be a part, machine, operating area or entire plant. The key is to use detective skills and a team of involved people to understand the physical, human and latent root causes of that shortfall.
Two influential authorities in root cause failure analysis are Charles Latino, one of the pioneers of the industrial reliability engineering movement, and Dr. James Reason, a recognized expert in the area of human reliability. Different practitioners of root cause failure analysis in industry might use different terms, but the approaches are generally similar.
The most important aspect of root cause failure analysis is that there can be no guesswork or opinions as to causes. Concrete facts have to support every decision that leads to the roots.
There are three main types of root causes that build upon each other:
- Physical roots. The actual physical mechanism of the failure.
- Human roots. The human practices that allowed the physical roots to exist.
- Latent roots. The way the site is managed and/or designed that creates the human roots.
Physical causes are the easiest to determine, while the human and latent roots are not recognized as easily. Finding and eliminating those human and latent causes has a far greater financial return.
In one example, a machine caught fire, caused a lengthy production shutdown and damaged the surrounding building. A simple failure analysis that solves the physical cause of the machine fire might prevent future fires and improve the reliability of that one machine. But a true root cause failure analysis that goes into the human and latent roots can eliminate whole categories of failures. It also hits the bottom line: putting the findings from a root cause failure analysis into practice often results in more than a thousandfold return on the cost of the analysis, according to Neville Sachs, presenter of the Root Cause Failure Analysis Webinar.
Latent root causes arise from the way a site is managed or designed and permit the human root causes. For example, a story about a crane whose cable broke shows how latent management practices contributed to the failure.
Sachs and a manager were investigating the failure of a crane cable that broke and dropped a 30-ton vat of molten metal. While they were on the floor, in plain view of another crane operator and his spotter, a second event happened. The crane operator picked up another 30-ton vat of molten metal while the crane was moving, violating one of the cardinal rules of crane operation: a crane should never be moving when initially picking up the load, because the inertial forces make the total load on the cable substantially greater than the weight of the load alone. That this happened in front of Sachs and management showed that unsafe procedures were used routinely and that management tolerated them.
Common latent failure roots include:
- Having designers work outside their expertise, such as mechanical engineers doing electrical design or chemical engineers doing mechanical design, or not having a knowledgeable lubrication engineer in a largely mechanical plant.
- Not performing inspections when receiving maintenance items.
- Not having regular workforce training.
- Failing to recognize the importance of skilled employees.
- Making the new (or lowest-rated) maintenance employee responsible for equipment lubrication.
Untrained people rarely recognize there is always a chain of events with multiple contributors that leads to a failure or accident. There is a common tendency to solve the physical root of the problem and then wonder why the failure recurs.
Errors can be thought of as the result of a chain of events or a series of linked holes. Latino depicted chains of errors that lead to a failure or accident (see Figure 1). Reason uses the image of rotating slices of Swiss cheese: the holes represent errors, and when the holes line up, there is an incident.
Errors eventually will link and cause failures, but we don’t know exactly when or how often. Sometimes there is no incident if either luck or an observant person breaks the chain.
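The "holes lining up" idea can be sketched with a little arithmetic. The numbers below are hypothetical and are not taken from any of the sites described here, and the calculation assumes the safeguards fail independently of one another, which real plants often violate. The point is only that a chain that is unlikely on any single occasion becomes likely over many repetitions.

```python
import math

# Hypothetical chance that each of four safeguards fails on any one lift.
# These are illustrative values, not measured data.
hole_probabilities = [0.1, 0.1, 0.1, 0.1]

# Assuming independent safeguards, an incident needs every one to fail at once.
p_incident_per_lift = math.prod(hole_probabilities)

# Chance of at least one incident over many lifts (hypothetical count).
lifts = 10_000
p_at_least_one_incident = 1 - (1 - p_incident_per_lift) ** lifts

print(f"Chance per lift: {p_incident_per_lift:.4%}")
print(f"Chance of at least one incident in {lifts:,} lifts: {p_at_least_one_incident:.1%}")
```

Removing any one safeguard failure from the list (setting its probability to zero) drops the incident probability to zero, which is the arithmetic version of breaking the chain.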
Root cause failure analysis of the crane that dropped the 30-ton vat revealed two very obvious latent causes and two human causes that led to the physical cause of a deteriorating cable (see Figure 2).
Sometimes those errors don’t link up, and there is no failure or accident. At this site, a second bay was using the same process and had not yet had a failure, probably due to luck.
The most important part of root cause failure analysis is accurately pinpointing the physical root causes. Because the physical roots lead to the human and latent roots, people must truly understand the physical roots of a failure if they ever hope to find the larger causes.
There are always multiple root causes: physical, human and latent. Detailed studies of major disasters, such as those listed below, reveal all three types in each case:
- RMS Titanic
- The Halifax explosion
- Bhopal disaster
- Exxon Valdez oil spill
- Chernobyl disaster
- Deepwater Horizon (Gulf of Mexico) oil spill.
If the people doing the analyses don’t recognize there are always multiple causes, they will never reach an accurate conclusion. One example of this occurred at an international conference where two organizations analyzed the same data looking for the causes of some component failures. Their bar charts of the failure causes were informative but very different, because each organization treated every failure as having only a single cause.
Multiple root causes
Experience shows that in all failures and accidents, there may be only one physical root, but there are always multiple human and latent roots. The number of roots typically depends on the complexity of the problem. For example, a simple industrial plant failure of something such as a pump or a fan usually has between four and seven roots. On the other hand, a complex system such as an aircraft crash generally has 12-14 roots, according to the U.S. National Transportation Safety Board (NTSB).
The number of root causes found also might depend on the depth of the analysis. Because of the widespread importance of its work to human safety, the NTSB has much more funding and time to analyze accidents. In industrial plants, by contrast, the goal of many root cause failure analyses is simply to get the plant up and running again. The problem with such cursory industrial analyses is that the remaining unrecognized and uncorrected roots will come back to cause another breakdown.
Within the cost limits of any particular analysis, the goal of your root cause failure analysis program should be to break that error chain and eliminate as many roots as possible (see Figure 3). Often, with some close attention, some of the root causes can be eliminated at no to very low cost.
Physical and human roots
A study of more than 120 failures analyzed in detail, looking at just the major physical and human failure causes, found the following averages. Design was the most common source of human error:
- Physical roots: 1.4 errors per failure, with many of them involving fatigue and corrosion
- Human roots: multiple roots per failure, of these types:
  - design errors (59%)
  - maintenance (38%)
  - operating (24%)
  - installation (16%)
  - manufacturing (12%).
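The human-root percentages add up to more than 100%, which is consistent with the point that a failure usually has more than one human root. A quick check of the arithmetic follows, as a sketch that assumes each percentage is the share of failures in which that category appeared at least once (the study summary does not state this explicitly):

```python
# Share of studied failures in which each human root category appeared.
# Figures come from the list above; the interpretation is an assumption.
human_root_share = {
    "design": 0.59,
    "maintenance": 0.38,
    "operating": 0.24,
    "installation": 0.16,
    "manufacturing": 0.12,
}

# Under that assumption, the sum of the shares is the average number of
# distinct human root categories per failure.
avg_categories_per_failure = sum(human_root_share.values())

# The category that shows up in the largest share of failures.
most_common = max(human_root_share, key=human_root_share.get)

print(f"Average human root categories per failure: {avg_categories_per_failure:.2f}")
print(f"Most common category: {most_common}")
```

Under that reading, the typical failure in the study involved roughly one and a half categories of human error on top of its physical roots.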
In this example (see Figure 4), a variable-speed turbine is driving a reciprocating compressor through a reducer. This setup resulted in multiple failures due to two types of vibration: torsional and translational. A major design error was that the specifying engineer did not consider which resonant frequencies the bearings in the gearbox were sensitive to. After repeated failures, the plant decided to operate the turbine only at specific speeds.
We all make mistakes. But how often do you think the average person makes a significant error? A significant error is one that could lead to the need for medical treatment or results in a substantial financial loss. The probability of an accident goes up when a person talks on a cellphone while driving a car or truck. The chance of falling goes up when you don’t use a handrail when walking down stairs.
Latino often quoted Reason as saying the average person makes six significant errors per week, each one capable of leading to medical treatment or a financial loss. But plant surveys that Sachs has conducted since 1993 show the typical person thinks they make a significant error only once every five months.
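To see how wide that gap is, the two figures can be put on the same weekly basis. This is a rough sketch that assumes 52 weeks and 12 months per year; the variable names are illustrative.

```python
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12

# Six significant errors per week, the figure attributed to Reason.
actual_errors_per_week = 6

# One significant error every five months, from Sachs's plant surveys.
perceived_interval_months = 5

# Convert the perceived interval to weeks, then to a weekly rate.
weeks_per_month = WEEKS_PER_YEAR / MONTHS_PER_YEAR
perceived_errors_per_week = 1 / (perceived_interval_months * weeks_per_month)

# How many times lower the self-assessed rate is than the quoted rate.
underestimate_factor = actual_errors_per_week / perceived_errors_per_week

print(f"Perceived rate: {perceived_errors_per_week:.3f} errors per week")
print(f"Self-assessment is low by a factor of about {underestimate_factor:.0f}")
```

On those assumptions, the typical respondent underestimates their own significant error rate by roughly two orders of magnitude.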
Sachs’s surveys of over 3,000 personnel, typically from maintenance, engineering and plant supervision, found the average person rated themselves as above average and their co-workers below, even though the average person is by definition at a 50% skill level:
- Own skill level: 72%.
- Co-workers made twice as many errors.
- Less than 1.7% of the 3,000—50 people—said their skill level was below average.
- Only one ever, since 1993, said they were below average in safety awareness.
A good root cause failure analysis program can reduce errors and improve savings. Sachs cites two typical examples. In the first, a plant cut the number of motors it had to replace per year from 500 to 250 over an eight-year period. The reduction in replacement costs amounted to 50 times the investment, and that figure didn’t include the benefits of less lost production, a smaller maintenance staff or better product quality from more consistent runs.
The second example was a pharmaceutical plant that cut motor maintenance costs from $115/hp to $60/hp over a four-year period. It also cut pharmaceutical product loss due to equipment failures by 60%.
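Expressed as percentage reductions, the two examples are easier to compare. The helper below is a hypothetical convenience function, not part of any analysis tool, and it uses only the before-and-after figures quoted above.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the reduction from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100


# Motors replaced per year at the first plant.
motor_replacement_cut = percent_reduction(500, 250)

# Motor maintenance cost in dollars per horsepower at the pharmaceutical plant.
maintenance_cost_cut = percent_reduction(115, 60)

print(f"Motor replacements cut by {motor_replacement_cut:.1f}%")
print(f"Maintenance cost per hp cut by {maintenance_cost_cut:.1f}%")
```

Both plants roughly halved the measure they were tracking, one over eight years and the other over four.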
How to do analysis
The first thing that is absolutely required is management support that is committed and lasting. Results will need to be carried through, and without management behind that activity, the analysis is a waste of time and possibly money.
Second, select the topic to be analyzed, which can be very easy if something’s broken but more complicated for a division or department within a company. In the division or department cases, start at the highest level you can. But for a broken machine, you can start by just listing all the visible physical damage—in other words, the observed items or conditions.
Third, create a diverse, unbiased team with a leader or coordinator who can guide the team toward perceptive questions.
Fourth, use a logic system to understand the interaction of causes. Finally, guard against opinions or prejudices governing the outcome, especially one dominant person taking over the analysis.
Choose analytical method
Logic trees are simple and good for getting in-depth analysis. They are easy for participants to follow and make it easy to present results to others, such as management.
Sachs has found that other methods, such as 5 Whys, fishbone diagrams and Kepner-Tregoe, are not as effective or efficient in finding the true root causes and generally are not as easy for others to understand.
Figure 5 shows a logic tree for the analysis of a catastrophic failure in a paper mill where a 3,000-hp wood chipper that eats logs at a foot per second failed dramatically. This will serve as an example to go through the process. Initially the wood chipper displayed a dozen broken or failed parts or conditions, noted in the yellow level in the logic tree.
The coordinator and team
The coordinator’s job is critical. The coordinator chairs the meetings and acts as the moderator. Although the coordinator doesn’t have to understand the problem area, a general idea of how things work will make the job easier.
The coordinator’s key job is that of a detective driving an investigation. He or she helps the team ask questions about how an event or a feature could happen and ensures the answers are based on fact or a true expert opinion.
The coordinator is responsible for team communications, including developing, organizing and presenting the latest charts, data and assignments.
The five- to 10-member analytical team should be drawn from a wide range of the disciplines and departments involved. Although the coordinator leads the team, all team members should be viewed as equals.
Commercial software is available to develop and present logic trees, such as the Reliability Center’s PROACT.
The coordinator inserts the failure or selected starting point into the top box on the logic tree chart. The next level shows all the detectable features.
With system problems, it is more difficult to pick out a starting point. For example, with a plant that isn’t producing well, you might start with an analysis of the optimum production capacity. Then look at each item or event that restricts that capacity. Next, draw a Pareto chart outlining the effect of those causes, which provides a good visual to start with. Begin root cause analysis on the most expensive restriction, measured in either lost time or cost. When that is solved, the team moves on to the cause of the next most expensive restriction, and so on.
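The Pareto ranking described above can be sketched in a few lines. The restriction names and annual costs below are hypothetical, invented only to show the ranking and the cumulative percentage that a Pareto chart plots; a real analysis would use the plant’s own loss data.

```python
# Hypothetical capacity restrictions and their annual cost in dollars.
restrictions = {
    "conveyor jams": 120_000,
    "pump seal failures": 310_000,
    "unplanned motor replacements": 540_000,
    "quality holds": 80_000,
}


def pareto_table(costs):
    """Rank items by cost, highest first, with the cumulative share of total cost."""
    total = sum(costs.values())
    ranked = sorted(costs.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for name, cost in ranked:
        running += cost
        rows.append((name, cost, running / total * 100))
    return rows


table = pareto_table(restrictions)
for name, cost, cumulative in table:
    print(f"{name:<30} ${cost:>9,} {cumulative:6.1f}%")
```

The first row of the table is where the team would start its root cause analysis, and the cumulative column shows how much of the total loss is covered as each subsequent restriction is solved.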
Figures 6 and 7 show the chipper being analyzed for this logic tree example before the failure and after. Logs are fed in at the spout at bottom right. Two motors drive the gearbox, which turns a shaft, which spins the chipper disk. The chipper disk is a 130-inch flywheel with knife blades on it.
In the massive failure, the chipper disk crushed the chute, which was made from two-inch thick steel.
The chipper analysis team
- Two maintenance craftsmen
- Maintenance supervisor
- Operations supervisor
- Operations superintendent
- Plant reliability engineer
- Plant reliability engineering supervisor
- Outside consultant.
Conducting the analysis
First, protect the parts from contamination. Then spread them out over an area—be it a table or warehouse—so the team can inspect them. The coordinator starts by asking the group, “How can this event (or symptom) happen?” Add every suggestion to the tree, even if it looks ridiculous. Then look for answers—answers supported by facts, not opinions. Keep everyone involved. Assign duties for needed information that comes up during the discussion.
Creating the logic tree
Sticky notes are an easy way to begin to build the logic tree. Different colored sticky notes or pen colors can help build levels in the tree. Institute a numbering system on the sticky notes or boxes to keep track of the connections between levels. The coordinator then organizes the tree and presents it at the next meeting. Continue to meet until the logic tree is complete, makes sense and provides satisfactory root causes all the way back to human and latent roots.
In the case of the chipper (see Figure 5), the team determined the physical cause was fatigue on bolts, a very common root cause. One of the latent root causes was in design, where the bolts couldn’t be tightened adequately. Another involved trying to use a stamped part that couldn’t meet the tolerances needed. Other human and latent roots included the fact that management had eliminated engineering reviews of machinery revisions because they thought it was an unnecessary step in the process; also, some sensors were disconnected because of “too frequent” alarms.
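The numbering system and levels described above map naturally onto a small tree structure. The sketch below is only an illustration: it does not reproduce Figure 5, the class and field names are hypothetical, and the way the chipper findings are linked and marked as verified is assumed for the sake of the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str            # position in the tree, e.g. "1.1.2"
    text: str               # what the sticky note says
    level: str              # "event", "observation", "physical", "human" or "latent"
    verified: bool = False  # True only once facts support the entry
    children: list[Node] = field(default_factory=list)

    def add(self, text: str, level: str, verified: bool = False) -> Node:
        """Attach a child and number it from its parent's number."""
        child = Node(f"{self.node_id}.{len(self.children) + 1}", text, level, verified)
        self.children.append(child)
        return child


def walk(node: Node, depth: int = 0):
    """Yield (depth, node) pairs from the top of the tree downward."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


def unverified(node: Node) -> list[str]:
    """List the numbers of entries that still rest on opinion rather than fact."""
    return [n.node_id for _, n in walk(node) if not n.verified]


# Illustrative entries based on the chipper findings; the links are assumed.
root = Node("1", "Chipper failed catastrophically", "event", verified=True)
obs = root.add("Failed bolts (illustrative observation)", "observation", verified=True)
phys = obs.add("Fatigue on bolts", "physical", verified=True)
latent1 = phys.add("Design did not allow bolts to be tightened adequately", "latent", verified=True)
latent2 = phys.add("Engineering reviews of machinery revisions eliminated", "latent")

for depth, n in walk(root):
    print("  " * depth + f"[{n.node_id}] ({n.level}) {n.text}")
print("Still needs supporting facts:", unverified(root))
```

Tracking a verified flag on each entry gives the coordinator a running list of assignments for the next meeting, in keeping with the rule that answers must be supported by facts rather than opinions.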
Things to avoid
A dominant person’s attempt to take over the process can affect the quality of the logic tree and the resulting root causes. Often there will be a person with an official or unofficial leadership position who will try to control the process, whether inadvertently or on purpose. The coordinator must then ask that person to let the others participate more.
Secondly, to eliminate latent roots, a champion is needed to follow up with management and effect change in the way the site is operated.
Lastly, don’t be afraid to talk to a professional. Outside experts can offer insight that team members might be unaware of.
The article first appeared in the STLE magazine and is based on a webcast presentation from Neville W. Sachs, P.E., a professional engineer and a graduate of Stevens Institute of Technology in Hoboken, New Jersey. STLE is a CFE Media content partner.