Track down the facts to analyze systematic failures

Looking at the clues of any mechanical breakdown can lead to improvement, not just repair.

By Ronald L. Hughes February 14, 2013

As T.S. Eliot so correctly observed, “Success is relative: it is what we can make of the mess we have made of things.” The key to success when analyzing failure is not to react to a problem but to be proactive, treating the failure as an opportunity to learn.

Understanding the causes of failure can seem a daunting task for a failure scene investigator called on to look at the causes of system or equipment failure in a manufacturing plant. This is completely understandable given the apparently chaotic circumstances that usually surround an incident under investigation.

Too often, the reaction to a failure is to put everything back into a known "acceptable condition" as fast as possible, with no real consideration given to solving the incident through a well-thought-out investigative process. Symptoms are not noted, or are ignored, and the evidence is either cleaned up or destroyed.

When this occurs, the failure will manifest itself again, typically recurring at an unexpected time. The good news is that when this happens often enough, we become more efficient at reacting to the problem and therefore seemingly better at correcting the situation rapidly, thus decreasing downtime or mitigating the consequences of the incident. This mind-set, or restraining paradigm, is a failure in itself.

Analysts must realize that the life of any component is not infinite but is predetermined by the stresses to which the component is subjected. Therefore, an engineering design not only transforms a need into a description of a product, but also takes into account the design’s compatibility with the physical stresses the component is expected to see in meeting its functional requirements. This includes the life of the product (as measured by its performance over time), its reliability, and its maintainability.

Equally important to the analyst is the realization that failure seldom occurs for a single reason or comes from a single force or input. This quickly becomes evident when chasing all of the possibilities that the evidence leads the investigator to explore. Most failures result from a multiplicity of inputs and errors, which the logic tree depicts as multiple legs illustrating all the cause-and-effect relationships of the failure.

Reading the clues

Every incident analyzed occurs within a specific timeline: the time between when the anomalous conditions of the failure first manifested themselves and when the failure was safely isolated. The failure data found within this timeline provide the clues, or evidence, needed to uncover the cause of any incident or failure, be it sporadic or chronic.

Every piece of data raises the question of “how can” this data be in the condition or position in which it was found. When the investigator can answer the “how can” questions, and tie the anomalies to a specific point within the timeline, then he or she has successfully followed the path to failure for the incident under investigation. In short, the investigator has found the root cause(s) of the failure.
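One way to keep the clues tied to the timeline is to record each observation with a timestamp and then sort the observations that fall between the first anomaly and the isolation of the failure. The sketch below is only an illustration of that bookkeeping. The `Clue` record, its field names, and the pump observations are hypothetical and are not taken from any real investigation.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Clue:
    """One piece of failure data tied to a moment in time (hypothetical record)."""
    observed_at: datetime
    description: str


def clues_in_timeline(clues, first_anomaly, isolated):
    """Return the clues that fall inside the failure timeline, oldest first."""
    if isolated < first_anomaly:
        raise ValueError("isolation cannot precede the first anomaly")
    in_window = [c for c in clues if first_anomaly <= c.observed_at <= isolated]
    return sorted(in_window, key=lambda c: c.observed_at)


# Illustrative data only: invented observations for a pump failure.
clues = [
    Clue(datetime(2013, 1, 7, 14, 5), "bearing housing temperature alarm"),
    Clue(datetime(2013, 1, 7, 13, 40), "operator reports unusual noise"),
    Clue(datetime(2013, 1, 6, 9, 0), "routine lubrication round"),
    Clue(datetime(2013, 1, 7, 14, 20), "pump tripped and isolated"),
]

timeline = clues_in_timeline(
    clues,
    first_anomaly=datetime(2013, 1, 7, 13, 40),
    isolated=datetime(2013, 1, 7, 14, 20),
)
for clue in timeline:
    print(clue.observed_at.isoformat(), clue.description)
```

Observations outside the window, such as the lubrication round on the previous day, are set aside here for simplicity. In practice an analyst may still want them as background.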

Just as important as knowing that the clues exist within a specific timeline is understanding the principles of how failure occurs within that timeline.

The following three principles of failure analysis can be used during an investigation to follow the path that led to an incident:

  • Order and pattern
  • Determinism
  • Discoverability

Order and pattern

There is order and pattern to everything in the universe; the sun comes up and the sun goes down, the tides go in and the tides go out, there are four seasons in a year, and so on. There is also order and pattern to failure, and by understanding this simple principle it is only logical to conclude that an order and pattern of failure exist within the timeline of the failure under investigation. The key is to read the clues to uncover the order and pattern that led to failure.

Determinism

Just as there is an order and pattern within the timeline of any failure, there are also determinable effects within that order and pattern. To state it in simple terms, every input produces a set of known outputs, and every output came from a known input; these are the determinable effects. The key to determinism is to make sure that the inputs and outputs are in the correct order and pattern. (In the logic tree, the cause sits below the effect.)

For example, consider the following:

“Does misalignment cause high vibration or does high vibration cause misalignment?”

Both are possible, but which one occurred? The analyst must determine which cause-and-effect relationship is correct. In this scenario, either the equipment was misaligned from the start or it was aligned correctly and later became misaligned.

Once this is determined, the cause-and-effect relationship is known. If the equipment was misaligned initially, misalignment caused the high vibration; if it was aligned correctly and became misaligned, high vibration caused the misalignment.
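The decision in this example reduces to one question about the starting condition. The minimal sketch below assumes the analyst can establish, from installation or alignment records, whether the machine was correctly aligned at the outset. The function name and its return strings are illustrative only.

```python
def cause_and_effect(aligned_at_installation: bool) -> tuple[str, str]:
    """Return (cause, effect) for the misalignment/vibration example.

    Assumes the only evidence needed is whether the equipment was
    correctly aligned when it was installed (a deliberate simplification).
    """
    if aligned_at_installation:
        # Aligned correctly, then became misaligned: vibration came first.
        return ("high vibration", "misalignment")
    # Misaligned from the start: misalignment came first.
    return ("misalignment", "high vibration")


cause, effect = cause_and_effect(aligned_at_installation=False)
print(f"{cause} caused {effect}")
```

Whichever item is returned as the cause is the one that belongs below the effect in the logic tree.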

Discoverability

Seldom is there a single cause or a single path to failure. Discoverability, when applied in the investigative process, helps the analyst to ensure that all the possible causes have been explored and accounted for by the analysis. The key is to start as broad and all-inclusive as possible while working through the specifics.

By asking the question “how can” over and over again, and systematically working through the cause-and-effect relationship of the failure, all the possible root cause scenarios are explored and accounted for in the analysis.
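The repeated “how can” questioning maps naturally onto a tree in which each block's children are the hypotheses that could explain it. The sketch below is one possible way to hold such a tree in code, not a description of any particular root cause analysis tool. The `Node` class, its `verified` flag, and the bearing failure used as sample data are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A block in the logic tree; children answer "how can" for this block."""
    statement: str
    verified: Optional[bool] = None  # None = not yet checked against evidence
    children: list[Node] = field(default_factory=list)

    def how_can(self, statement: str) -> Node:
        """Add one hypothesis explaining how this block could have occurred."""
        child = Node(statement)
        self.children.append(child)
        return child


def open_hypotheses(node: Node) -> list[str]:
    """List every block that has not yet been proven or disproven."""
    found = [] if node.verified is not None else [node.statement]
    for child in node.children:
        found.extend(open_hypotheses(child))
    return found


def render(node: Node, depth: int = 0) -> str:
    """Indent causes beneath their effects, as in a logic tree."""
    lines = ["  " * depth + node.statement]
    lines.extend(render(child, depth + 1) for child in node.children)
    return "\n".join(lines)


# Invented example: start broad, then work through the specifics.
root = Node("pump bearing failed", verified=True)
overload = root.how_can("bearing overloaded")
lube = root.how_can("lubrication film lost")
overload.how_can("shaft misaligned")
overload.how_can("excessive vibration")
lube.verified = False  # ruled out by a (hypothetical) oil analysis

print(render(root))
print(open_hypotheses(root))
```

Keeping disproven legs in the tree, rather than deleting them, documents that the possibility was explored and accounted for.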

Following the concepts of order and pattern, determinism, and discoverability makes it easy for the analyst to graphically illustrate the investigation on the logic tree and document the analysis as shown in the example below.

The science of the clues

The clues uncovered during the investigation can always be accounted for by a scientific explanation of the anomaly. For example, you can’t have electricity outside the realm of Ohm’s law, and you can’t have a fire without a heat or ignition source, a fuel source, and an oxygen source.
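Physical laws of this kind can serve as consistency checks on recorded evidence. The sketch below assumes a simple resistive DC circuit and a 5 percent measurement tolerance. Both assumptions, the function names, and the sample readings are illustrative only.

```python
import math


def consistent_with_ohms_law(volts, amps, ohms, rel_tol=0.05):
    """True if V = I * R holds within a relative tolerance (assumed 5 percent)."""
    return math.isclose(volts, amps * ohms, rel_tol=rel_tol)


def fire_possible(heat_source, fuel_source, oxygen_source):
    """A fire needs a heat or ignition source, a fuel source, and oxygen."""
    return all((heat_source, fuel_source, oxygen_source))


# Illustrative readings: 24 V across a 12 ohm load should draw about 2 A.
print(consistent_with_ohms_law(24.0, 2.0, 12.0))   # True
print(consistent_with_ohms_law(24.0, 3.5, 12.0))   # False
print(fire_possible(True, True, False))            # False
```

A reading that fails such a check does not break the law; it suggests a measurement error or an unrecorded condition, such as a changed resistance, that the analyst still has to explain.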

Even something as simple as color provides scientific evidence for the analyst. Color changes in materials indicate different exposures to temperature or corrosive products.   

The color of smoke changes with different fuel sources. The color of lubrication products changes with the loss of additives, contamination, temperature, or pressures that overcome the film barrier.

Another key principle for the analyst is the mechanics of fractures. Fracture mechanics is the study of the propagation of cracks in materials (the companion study of the fracture surfaces themselves is usually called fractography rather than "fractology"). Fracture mechanics uses analytical solid mechanics to calculate the driving force on a crack and experimental solid mechanics to characterize the material’s resistance to fracture. It is therefore an important tool for determining the expected mechanical performance of materials and components.
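As an illustration of comparing driving force with resistance, linear elastic fracture mechanics expresses the driving force as a stress intensity factor, K_I = Y * sigma * sqrt(pi * a), and the resistance as the material's fracture toughness, K_IC. The sketch below uses that standard textbook relation. The geometry factor Y = 1.12 (a common approximation for a shallow edge crack) and the stress, crack length, and toughness values are assumed for illustration and do not describe any particular component.

```python
import math


def stress_intensity(stress_mpa, crack_m, geometry_factor=1.12):
    """K_I = Y * sigma * sqrt(pi * a), in MPa*sqrt(m)."""
    if crack_m < 0:
        raise ValueError("crack length must be non-negative")
    return geometry_factor * stress_mpa * math.sqrt(math.pi * crack_m)


def critical_crack_length(stress_mpa, toughness, geometry_factor=1.12):
    """Crack length a_c (m) at which K_I reaches the fracture toughness K_IC."""
    return (toughness / (geometry_factor * stress_mpa)) ** 2 / math.pi


# Assumed values for illustration only.
stress = 200.0   # applied tensile stress, MPa
crack = 0.005    # crack length a, m (5 mm)
k_ic = 50.0      # fracture toughness, MPa*sqrt(m)

k_i = stress_intensity(stress, crack)
a_c = critical_crack_length(stress, k_ic)
print(f"K_I = {k_i:.1f} MPa*sqrt(m); fracture predicted: {k_i >= k_ic}")
print(f"critical crack length = {a_c * 1000:.1f} mm")
```

With these assumed numbers the driving force stays below the toughness, so the crack would not be expected to run until it grows or the stress rises.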

By applying the physics of stress and strain (in particular the theories of elasticity and plasticity) to the microscopic crystallographic defects (the morphology of the fractured surface) found in materials, it is possible to predict and understand the macroscopic signatures of mechanical failure seen on a component’s fracture surface.

This technique is used to understand the theoretical causes of failures and also to verify the forces that must be present based on the pattern of the fractured surface. In essence, the fractured face tells what kind of force caused the failure. So if the type of force necessary to cause the failure can be found, it can be eliminated or mitigated, thus reducing the likelihood of recurrence at some future date.

Often when an analyst fails to achieve success, the tendency is simply to redefine success at a level that can be reached more easily. Although this allows the investigator to move on quickly, it limits the payback from the root cause analysis effort. By changing this restraining paradigm to one that seeks the maximum return, accepting only true success and proactively conducting a fact-based analysis driven by the evidence, the paybacks become exponential instead of incremental.

The quality of the clues of the failure, and the correct interpretation of what those clues tell the analyst, determine the degree of success for the incident under investigation.

Ronald L. Hughes, ME, is a member of the American Society of Mechanical Engineers and the American Society for Training & Development. He is currently a senior consultant for Reliability Center Inc., an engineering and consulting firm.