Track down the facts to analyze systematic failures

Looking at the clues of any mechanical breakdown can lead to improvement, not just repair.


Figure 1: Logic tree example. Courtesy: Reliability Center Inc.

As T.S. Eliot so correctly observed, “Failure is relative: it is what we can make of the mess we have made of things.” It is easy to see that the key to success when analyzing failure is not to react to a problem, but to be proactive by treating the failure as an opportunity to learn.

Understanding the causes of failure can seem a daunting task for a failure scene investigator called on to examine a system or equipment failure in a manufacturing plant. This is understandable given the apparently chaotic circumstances that usually surround an incident under investigation.

Too often the reaction to failure is to put everything back into a known "acceptable condition" as fast as possible, without any real consideration given to solving the incident through a well-thought-out investigative process. Symptoms go unnoted or are ignored, and the evidence is either cleaned up or destroyed.

When this occurs, the failure will manifest itself again, typically at an unexpected time. The good news is that when this happens often enough, we become more efficient at reacting to the problem and therefore seemingly better at correcting the situation rapidly, thus decreasing downtime or mitigating the consequences of the incident. This mind-set, or restraining paradigm, is a failure unto itself.

Analysts must realize that the life of any component is not infinite but is predetermined by the stresses to which the component is subjected. Therefore, engineering designs not only transform a need into a description of a product, but also take into account the design’s compatibility with the physical stresses expected to be induced in the component by its functional requirements. This includes the life of the product (as measured by its performance over time), reliability, and maintainability.

Equally important to the analyst is the realization that failure seldom occurs for a single reason or comes from a single force or input. This quickly becomes evident when chasing all of the possibilities that the evidence leads the investigator to explore. Most failures are therefore the result of multiple inputs and errors, which the logic tree depicts as multiple legs illustrating all the cause-and-effect relationships of the failure.

Reading the clues

Every incident analyzed occurs within a specific timeline, running from when the anomalous conditions of the failure first manifested themselves to when the failure was safely isolated. The failure data found within this timeline provides the clues, or evidence, needed to uncover the cause of any incident or failure, be it sporadic or chronic.

Every piece of data raises a question as to “how can” this data be in the condition or position in which it was found. When the investigator can answer the “how can” questions, and tie the anomalies to a specific point within the timeline, then he or she has successfully followed the path to failure for the incident under investigation. In short, the investigator has found the root cause(s) of the failure.
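As a minimal sketch of tying evidence to the timeline, the Python below keeps only the observations that fall between the first anomaly and safe isolation and puts them in time order. The `Observation` class, its field names, and the sample entries are hypothetical, chosen only for illustration; they do not come from a real investigation.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    """One piece of evidence and when it was recorded (hypothetical structure)."""
    timestamp: datetime
    description: str


def within_timeline(observations, first_anomaly, isolated):
    """Return the observations between first anomaly and isolation, oldest first."""
    in_window = [o for o in observations if first_anomaly <= o.timestamp <= isolated]
    return sorted(in_window, key=lambda o: o.timestamp)


# Illustrative data only.
observations = [
    Observation(datetime(2015, 3, 2, 14, 5), "bearing temperature alarm"),
    Observation(datetime(2015, 3, 2, 13, 40), "vibration level rising"),
    Observation(datetime(2015, 3, 1, 9, 0), "routine lubrication performed"),
    Observation(datetime(2015, 3, 2, 14, 20), "pump tripped and isolated"),
]

timeline = within_timeline(
    observations,
    first_anomaly=datetime(2015, 3, 2, 13, 40),
    isolated=datetime(2015, 3, 2, 14, 20),
)
for obs in timeline:
    print(obs.timestamp.isoformat(), obs.description)
```

Evidence recorded outside the window, such as the lubrication entry here, is not discarded from the investigation; it simply sits outside the failure timeline and may point to a contributing condition.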

Just as important as understanding that the clues exist within a specific timeline is understanding the principles of how failure occurs within that timeline.

The following three principles of failure analysis can be used during an investigation to follow the path that led to an incident:

  • Order and pattern
  • Determinism
  • Discoverability


Order and pattern

Figure 2: Color tells the story. Courtesy: Reliability Center Inc.

There is order and pattern to everything in the universe; the sun comes up and the sun goes down, the tides go in and the tides go out, there are four seasons in a year, and so on. There is also order and pattern to failure, and by understanding this simple principle it is only logical to conclude that an order and pattern of failure exist within the timeline of the failure under investigation. The key is to read the clues to uncover the order and pattern that led to failure.


Determinism

Just as there is an order and pattern within the timeline of any failure, there are also determinable effects within that order and pattern. Stated simply, every input produces a set of known outputs, and every output came from a known input; these are the determinable effects. The key to determinism is to make sure that the inputs and outputs are in the correct order and pattern. (The cause is below the effect in the logic tree.)

For example, consider the following:

“Does misalignment cause high vibration or does high vibration cause misalignment?”

Both are possible, but which one occurred? The analyst must determine which cause-and-effect relationship is correct. In this scenario, either the equipment was misaligned initially, or it was aligned correctly and became misaligned.

Once this is determined, the cause-and-effect relationship is known: if the equipment was misaligned initially, misalignment caused high vibration; if it was aligned correctly and became misaligned, high vibration caused misalignment.


Discoverability

Seldom is there a single cause or a single path to failure. Discoverability, when applied in the investigative process, helps the analyst ensure that all the possible causes have been explored and accounted for by the analysis. The key is to start as broad and all-inclusive as possible and then work down to the specifics.

By asking the question “how can” over and over again, and systematically working through the cause-and-effect relationship of the failure, all the possible root cause scenarios are explored and accounted for in the analysis.
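The repeated “how can” questioning can be sketched as a walk down a tree in which each effect sits above its possible causes. The Python below is only an illustration under that assumption: the `Node` class, the `root_cause_paths` function, and the example labels are hypothetical names, not a tool or tree from the source.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in a logic tree: an effect, with its possible causes beneath it."""
    label: str
    causes: list[Node] = field(default_factory=list)


def root_cause_paths(node, path=()):
    """Yield every path from the top event down to a leaf (a candidate root cause)."""
    path = path + (node.label,)
    if not node.causes:
        yield path
        return
    for cause in node.causes:
        # Each child answers "how can" the parent effect occur.
        yield from root_cause_paths(cause, path)


# Hypothetical tree for illustration.
tree = Node("pump failure", [
    Node("high vibration", [
        Node("misalignment", [
            Node("misaligned at installation"),
            Node("became misaligned in service"),
        ]),
    ]),
    Node("lubrication breakdown", [
        Node("contamination"),
        Node("loss of additives"),
    ]),
])

for p in root_cause_paths(tree):
    print(" <- ".join(p))
```

Enumerating every leg this way makes it visible when a branch has not yet been questioned down to a cause that the evidence can confirm or rule out.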

Following the concepts of order and pattern, determinism, and discoverability makes it easy for the analyst to graphically illustrate the investigation on the logic tree and document the analysis, as in the example shown in Figure 1.

The science of the clues

Figure 3: Case crushing due to lubrication breakdown. Courtesy: Reliability Center Inc.

The clues uncovered during the investigation can always be accounted for by a scientific explanation of the anomaly. For example, you can't have electricity outside the realm of Ohm's law, and you can't have a fire without a heat or ignition source, a fuel source, and an oxygen source.

Even something as simple as color provides scientific evidence for the analyst. Color changes in materials indicate different exposures to temperature or corrosive products.   

The color of smoke changes with different fuel sources. The color of lubrication products changes with the loss of additives, contamination, temperature, or pressures that overcome the film barrier.

Another key principle for the analyst is the mechanics of fractures. Fracture mechanics, or "fractology," is the study of the propagation of cracks in materials. It is based on the use of analytical solid mechanics to calculate the driving force on a crack and experimental solid mechanics to characterize the material's resistance to fracture. Fracture mechanics is therefore an important tool for determining the expected mechanical performance of materials and components.
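One standard textbook way to express the driving force on a crack in linear elastic fracture mechanics is the mode I stress intensity factor, K_I = Y·σ·√(πa), which is compared with the material's fracture toughness K_Ic. The article does not give this relation; the sketch below adds it only as an illustration, and the function names and numerical values are hypothetical.

```python
import math


def stress_intensity(stress_mpa: float, crack_size_m: float, geometry_factor: float = 1.0) -> float:
    """Mode I stress intensity factor K_I = Y * sigma * sqrt(pi * a), in MPa*sqrt(m)."""
    if stress_mpa < 0 or crack_size_m < 0:
        raise ValueError("stress and crack size must be non-negative")
    return geometry_factor * stress_mpa * math.sqrt(math.pi * crack_size_m)


def fracture_predicted(k_applied: float, fracture_toughness: float) -> bool:
    """Fracture is predicted when the driving force reaches the material's resistance."""
    return k_applied >= fracture_toughness


# Hypothetical values chosen only for illustration.
k = stress_intensity(stress_mpa=200.0, crack_size_m=0.005, geometry_factor=1.12)
print(f"K_I = {k:.1f} MPa*sqrt(m)")
print("fracture predicted:", fracture_predicted(k, fracture_toughness=50.0))
```

The geometry factor Y depends on the crack and component shape, so a real assessment would take it, and the toughness value, from the applicable handbook or test data rather than from assumed numbers like these.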

By applying the physics of stress and strain (in particular the theories of elasticity and plasticity) to the microscopic crystallographic defects found in materials (the morphology of the fractured surface), the analyst can predict and understand the macroscopic signatures of mechanical failure seen on a component’s fractured surface.

This technique is used to understand the theoretical causes of failures and also to verify the forces that must be present based on the pattern of the fractured surface. In essence, the fractured face tells what kind of force caused the failure. So if the type of force necessary to cause the failure can be found, it can be eliminated or mitigated, thus reducing the likelihood of recurrence at some future date.

Often when an analyst fails to achieve success, the tendency is simply to redefine success at a level that is easier to reach. Although this allows the investigator to move on quickly, it limits the payback from the root cause analysis effort. By changing this restraining paradigm to one that seeks the maximum return, accepting only true success and proactively conducting a fact-based analysis driven by the evidence, the paybacks become exponential instead of incremental.

The quality of the clues of the failure, and the correct interpretation of what the clues tell the analyst, determine the degree of success for the incident under investigation.

Ronald L. Hughes, ME, is a member of the American Society of Mechanical Engineers and the American Society of Training and Development. He is currently a senior consultant for the Reliability Center Inc., an engineering and consulting firm.
