Track down the facts to analyze systematic failures

Looking at the clues of any mechanical breakdown can lead to improvement, not just repair.


Figure 1: Logic tree example. Courtesy: Reliability Center Inc.

As T.S. Eliot so correctly observed, “Failure is relative: it is what we can make of the mess we have made of things.” It is easy to see that the key to success when analyzing failure is not to react to a problem, but to be proactive by treating the failure as an opportunity to learn.

Understanding the causes of failure can seem a daunting task for a failure scene investigator called on to look at the causes of system or equipment failure in a manufacturing plant. This is completely understandable given the apparently chaotic circumstances that usually surround an incident under investigation.

Too often, the reaction to a failure is to put everything back into a known "acceptable condition" as fast as possible, without any real consideration given to solving the incident through a well-thought-out investigative process. Symptoms go unnoted or are ignored, and the evidence is either cleaned up or destroyed.

When this occurs, the failure will manifest itself again, typically recurring at an unexpected time. The good news is that when this happens often enough, we become more efficient at reacting to the problem and therefore seemingly better at correcting the situation, thus decreasing downtime or mitigating the consequences of the incident. This mind-set, or restraining paradigm, is a failure unto itself.

Analysts must realize that the life of any component is not infinite but is determined by the stresses to which the component is subjected. Engineering designs therefore not only transform a need into a description of a product, but also take into account the design’s compatibility with the physical stresses the component is expected to see under its functional requirements. This includes the life of the product (as measured by its performance over time), reliability, and maintainability.

Equally important to the analyst is the realization that failure seldom occurs for a single reason or comes from a single force or input. This quickly becomes evident when chasing all of the possibilities that the evidence leads the investigator to explore. Most failures result from multiple inputs and errors, which the logic tree depicts as multiple legs illustrating all the cause-and-effect relationships of the failure.

Reading the clues

Every incident analyzed occurs within a specific timeline: the time between when the anomalous conditions of the failure first manifested themselves and when the failure was safely isolated. The failure data found within this timeline provide the clues or evidence needed to uncover the cause of any incident or failure, be it sporadic or chronic.

Every piece of data will beg a question as to “how can” this data be in the condition found or the position found. When the investigator can answer the “how can” questions, and tie the anomalies to a specific point within the timeline, then he or she has successfully followed the path to failure for the incident under investigation. In short, the investigator has found the root cause(s) of the failure.
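Tying each clue to a point on the timeline can be sketched in a few lines of Python. This is a minimal illustration only: the `Clue` record, the `build_timeline` helper, and all of the sample evidence are hypothetical names and data invented for the example, not part of any method described by the author.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Clue:
    """One piece of evidence tied to a point on the incident timeline."""
    observed_at: datetime
    description: str


def build_timeline(clues, first_anomaly, isolated):
    """Return the clues that fall inside the incident window, oldest first."""
    in_window = [c for c in clues if first_anomaly <= c.observed_at <= isolated]
    return sorted(in_window, key=lambda c: c.observed_at)


# Hypothetical evidence, listed in the order it happened to be collected.
clues = [
    Clue(datetime(2018, 10, 1, 14, 5), "bearing housing discolored"),
    Clue(datetime(2018, 10, 1, 13, 40), "vibration alarm logged"),
    Clue(datetime(2018, 9, 30, 9, 0), "routine lubrication round"),
    Clue(datetime(2018, 10, 1, 13, 55), "lubricant darkened"),
]

timeline = build_timeline(
    clues,
    first_anomaly=datetime(2018, 10, 1, 13, 30),
    isolated=datetime(2018, 10, 1, 14, 30),
)
for clue in timeline:
    print(clue.observed_at.strftime("%H:%M"), clue.description)
```

Sorting the evidence this way only arranges the clues; the analyst still has to answer the “how can” question for each one.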

Equally as important as understanding that the clues exist within a specific timeline is the fact that any failure can be analyzed by also understanding the principles of how failure occurs within that timeline.

The following three principles of failure analysis can be used during an investigation to follow the path that led to an incident:

  • Order and pattern
  • Determinism
  • Discoverability

Order and pattern

Figure 2: Color tells the story. Courtesy: Reliability Center Inc.

There is order and pattern to everything in the universe; the sun comes up and the sun goes down, the tides go in and the tides go out, there are four seasons in a year, and so on. There is also order and pattern to failure, and by understanding this simple principle it is only logical to conclude that an order and pattern of failure exist within the timeline of the failure under investigation. The key is to read the clues to uncover the order and pattern that led to failure.


Determinism

Just as there is an order and pattern that exist within the timeline of any failure, there are also determinable effects that exist within the order and pattern. To state it in simple terms, every input will produce a set of known outputs, and every produced output came from a known input, that is, the determinable effects. The key to determinism is to make sure that the inputs and outputs are in the correct order and pattern. (The cause is below the effect in the logic tree.)

For example, consider the following:

“Does misalignment cause high vibration or does high vibration cause misalignment?”

Both are possible but which one occurred? This means that the analyst must determine which cause-and-effect relationship is correct. In this scenario either the equipment was initially misaligned or it was aligned correctly and became misaligned.

Once this is determined, the cause-and-effect relationship is known: if the equipment was misaligned initially, misalignment caused high vibration; if it was aligned correctly and became misaligned, high vibration caused misalignment.
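The decision just described reduces to a single fact that the evidence must establish. A minimal Python sketch follows, assuming the investigator has already determined whether the equipment was aligned correctly at installation; the function name `cause_and_effect` and its parameter are hypothetical, chosen only for this illustration.

```python
def cause_and_effect(aligned_at_installation):
    """Return (cause, effect) for the misalignment and vibration question.

    Assumes the evidence (for example, installation records) has already
    established whether the equipment was aligned correctly at the start.
    """
    if aligned_at_installation:
        # Aligned correctly, then became misaligned: vibration is the cause.
        return ("high vibration", "misalignment")
    # Misaligned from the start: misalignment is the cause.
    return ("misalignment", "high vibration")


cause, effect = cause_and_effect(aligned_at_installation=False)
print(f"{cause} caused {effect}")
```

On the logic tree, the returned cause would sit below the returned effect.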


Discoverability

Seldom is there a single cause or a single path to failure. Discoverability, when applied in the investigative process, helps the analyst to ensure that all the possible causes have been explored and accounted for by the analysis. The key is to start as broad and all-inclusive as possible while working through the specifics.

By asking the question “how can” over and over again, and systematically working through the cause-and-effect relationship of the failure, all the possible root cause scenarios are explored and accounted for in the analysis.
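Repeatedly asking “how can” amounts to walking a tree from the failure event down to its possible causes. The Python sketch below is one hedged way to represent that, not the author's tool: the `Node` class, the `ruled_out` flag, the `root_cause_paths` function, and the pump example are all hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One block in a logic tree; its children answer "how can this occur?"."""
    label: str
    ruled_out: bool = False  # set True when the evidence eliminates this leg
    children: list = field(default_factory=list)


def root_cause_paths(node, path=()):
    """Yield every path from the top event down to a leaf not ruled out."""
    if node.ruled_out:
        return
    path = path + (node.label,)
    if not node.children:
        yield path
        return
    for child in node.children:
        yield from root_cause_paths(child, path)


# Hypothetical tree: start broad, then work down through the specifics.
tree = Node("pump bearing failure", children=[
    Node("lubrication breakdown", children=[
        Node("wrong lubricant used"),
        Node("lubricant contaminated", ruled_out=True),
    ]),
    Node("overload", children=[
        Node("shaft misalignment"),
    ]),
])

for path in root_cause_paths(tree):
    print(" <- ".join(path))
```

Every leg is listed first and eliminated only by evidence, which keeps the analysis broad before it narrows.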

Following the concepts of order and pattern, determinism, and discoverability makes it easy for the analyst to graphically illustrate the investigation on the logic tree and document the analysis, as in the logic tree example in Figure 1.

The science of the clues

Figure 3: Case crushing due to lubrication breakdown. Courtesy: Reliability Center Inc.

The clues uncovered during the investigation can always be accounted for by a scientific explanation of the anomaly. For example, you can't have electricity outside the realm of Ohm's law, and you can't have a fire without a heat or ignition source, a fuel source, and an oxygen source.

Even something as simple as color provides scientific evidence for the analyst. Color changes in materials indicate different exposures to temperature or corrosive products.   

The color of smoke changes with different fuel sources. The color of lubrication products changes with the loss of additives, contamination, temperature, or pressures that overcome the film barrier.

Another key principle for the analyst is the mechanics of fractures. Fracture mechanics is the study of the propagation of cracks in materials; the companion practice of reading fracture surfaces is known as fractography. Fracture mechanics is based on the use of analytical solid mechanics to calculate the driving force on a crack and experimental solid mechanics to characterize the material's resistance to fracture. It is therefore an important tool for determining the expected mechanical performance of materials and components.
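The balance between the driving force on a crack and the material's resistance is commonly written, in linear-elastic fracture mechanics, as a stress intensity factor compared against the fracture toughness. The relation below is the standard textbook form rather than one given in this article, and the numerical values are hypothetical, chosen only to show the arithmetic. Here \sigma is the applied stress, a is the crack length, Y is a dimensionless geometry factor, and K_{Ic} is the material's fracture toughness.

```latex
K_I = Y \, \sigma \sqrt{\pi a}, \qquad \text{fracture is expected when } K_I \ge K_{Ic}

% Hypothetical values: Y = 1, \sigma = 100\ \text{MPa}, a = 0.01\ \text{m}
K_I = 1 \times 100\ \text{MPa} \times \sqrt{\pi \times 0.01\ \text{m}} \approx 17.7\ \text{MPa}\sqrt{\text{m}}
```

With these assumed numbers, a material whose fracture toughness is above about 17.7 MPa√m would be expected to tolerate the crack, and one below it would not.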

By applying the physics of stress and strain (in particular the theories of elasticity and plasticity) to the microscopic crystallographic defects (morphology of the fractured surface) found in materials, it is possible to predict and understand the macroscopic signatures of mechanical failure seen on a component's fracture surface.

This technique is used to understand the theoretical causes of failures and also to verify the forces that must be present based on the pattern of the fractured surface. In essence, the fractured face tells what kind of force caused the failure. So if the type of force necessary to cause the failure can be found, it can be eliminated or mitigated, thus reducing the likelihood of recurrence at some future date.

Often when an analyst fails to achieve success, the tendency is to simply change the definition of success to a level that can be more easily attained. Although this will allow the investigator to quickly move on, it obviously limits the payback from the root cause analysis effort. By changing this restraining paradigm to one that seeks the maximum return by only accepting true success and proactively conducting a fact-based analysis driven by the evidence, the paybacks then become exponential instead of incremental.

The quality of the clues of the failure, and the correct interpretation of what the clues tell the analyst, are what determine the degree of success for the incident under investigation.

Ronald L. Hughes, ME, is a member of the American Society of Mechanical Engineers and the American Society of Training and Development. He is currently a senior consultant for the Reliability Center Inc., an engineering and consulting firm.
