Systematic approach for eliminating failures

Eliminating equipment failures may be every plant engineer's dream, but relatively few have a clear roadmap for accomplishing it. This system, or a close derivative of it, has enabled numerous manufacturing plants to achieve significant continuous improvements in plant availability, run rates, safety, and production quality.


Key Concepts
  • A formalized failure elimination program will lead to continuous improvement.

  • Documentation of root causes and their correction is essential.

  • Assignment of responsibilities to specific teams and individuals ensures accountability.

Process elements shown in Fig. 1:
Publication of performance expectations and goals by the leadership team (1)
Establishment of operating plan to achieve expectations and goals (2)
Variance analysis of actual vs. plan (3)
Reports of major incidents and root cause analyses (4)
Variance and incidents preventable by known actions? (5)
Prioritization of loss reasons (6)
Loss contributions and action plans developed for root causes (7)
Action items added to action database (8)
Action plans approved? (9)
Action item delay noted in database (10)
Action item assigned to responsible person with completion target, budget (11)
Action item completion (12)
Action item status reports (13)
Effectiveness audits (14)
Database updates (15)
Routine performance and action item review by leadership team (16)

Benchmarking efforts focused on plant availability, run rates, safety, and production quality have shown use of this or a similar system to be a global best practice.

The system provides predictable positive results for the following reasons:

  • Clear accountabilities are established throughout the organization

  • Decisions are data driven without bureaucracy

  • Continuous improvement is systematic because raising actual performance and expectations becomes ingrained in the plant's standard operating procedures and, eventually, the culture

  • ISO, QS, and OSHA corrective action requirements are met as a natural course of doing business.

The flow chart in Fig. 1 shows each of the process elements in the system. Reference numbers in the chart correspond to the points in the following text.

      Fig. 1. A formalized failure elimination process like this one can ensure continuous improvement in many areas of plant operations and maintenance.

      Publication of performance expectations and goals by the leadership team (1)

      Typically, the first step of each improvement loop is for the plant or corporate leadership team to challenge those who operate a department or facility to reach specific potentials for availability, run rates, quality, etc. The potential is based on performance benchmarks achieved elsewhere or on projected results of specific, planned improvement action items. The longer the failure elimination system has been used, the more accurately the expectations and goals can be projected.

      Establishment of operating plan to achieve expectations and goals (2)

      Assembling a clear scheme for how a department or plant can actually achieve the performance potential in the coming year is part of the operating plan. When this planning process shows that the leadership team should consider a change in expectations, goals and feasibility evaluations pass back and forth between the two groups; this iterative logic is called "catchball." The end result is a set of challenging goals confirmed to be realistic through definitive action plans.

      Variance analysis of actual vs. plan (3)

      The variance analysis compares actual performance with what was promised in the operating plan. The analysis is typically completed or finalized by departmental leadership.

      Poorer-than-planned performance triggers an analysis of the cause of the variance. Cumulative annual and period (e.g., daily, weekly, monthly, quarterly) statistics are normally tracked. As experience with the failure elimination process grows, so does the level of detail in this analysis.
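
The comparison of actual against plan can be sketched in a few lines of Python. This is only an illustration, not a prescribed tool: the `Metric` record, the metric names, and the monthly figures are all hypothetical. The one design point worth noting is the `higher_is_better` flag, since a shortfall in run rate and an overrun in downtime are both unfavorable variances.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    plan: float
    actual: float
    higher_is_better: bool  # True for run rate, False for downtime or defects


def unfavorable_variances(metrics):
    """Return (name, shortfall) pairs for metrics that missed the plan."""
    flagged = []
    for m in metrics:
        # Shortfall is positive when actual performance is worse than plan.
        shortfall = (m.plan - m.actual) if m.higher_is_better else (m.actual - m.plan)
        if shortfall > 0:
            flagged.append((m.name, shortfall))
    return flagged


# Hypothetical monthly figures, for illustration only.
monthly = [
    Metric("downtime_hours", plan=40, actual=52, higher_is_better=False),
    Metric("run_rate_units_per_hour", plan=120, actual=125, higher_is_better=True),
    Metric("defects_per_1000_units", plan=5, actual=7, higher_is_better=False),
]
flagged = unfavorable_variances(monthly)
```

Only the flagged metrics would then be passed on for a cause analysis.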

      Initially, the variance review may focus on total downtime, run rates, defect levels, mean times between failures, mean times to restore, and accident rates. Later, the focus is often down to the level of specific downtime causes, defects, or rate losses.
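
Two of the statistics named above, mean time between failures and mean time to restore, can be computed from a simple downtime log. The sketch below uses one common convention (uptime divided by number of failures for MTBF); plants may define the terms differently, and the function name and sample figures are assumptions made for illustration.

```python
def mtbf_and_mttr(period_hours, restore_hours):
    """Return (MTBF, MTTR) in hours for one period.

    period_hours: scheduled operating hours in the period.
    restore_hours: downtime of each failure in the period, in hours.
    """
    failures = len(restore_hours)
    if failures == 0:
        return None, None  # no failures: neither statistic is defined
    downtime = sum(restore_hours)
    mtbf = (period_hours - downtime) / failures  # uptime per failure
    mttr = downtime / failures                   # average time to restore
    return mtbf, mttr


# Hypothetical month: 720 scheduled hours and four failures.
mtbf, mttr = mtbf_and_mttr(period_hours=720, restore_hours=[4, 6, 2, 8])
```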

      Over time, the analysis becomes an audit of actual failure rate statistics versus the rates assumed in failure modes and effects analysis (FMEA), reliability-centered maintenance (RCM), HAZOP, or other risk analyses.

      Reports of major incidents and root cause analyses (4)

      An extremely important part of rapid failure elimination is removal of significant incidents. An initial step is to define what a significant failure is. Typically, that definition is in terms of downtime minutes, accident/near miss severity, production rate deterioration, and/or units of defective product.
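
A definition of this kind is easy to encode so that it is applied consistently. The following Python sketch is one possible form, under stated assumptions: the threshold values and field names are hypothetical, and it simplifies the severity criterion by treating every accident or near miss as significant.

```python
# Hypothetical thresholds; each plant sets its own definition.
THRESHOLDS = {
    "downtime_minutes": 60,
    "rate_loss_percent": 10,
    "defective_units": 500,
}


def is_significant(incident, thresholds=THRESHOLDS):
    """True if the incident meets any threshold or involves an accident or near miss."""
    # Simplifying assumption: any accident or near miss counts, regardless of severity.
    if incident.get("accident_or_near_miss", False):
        return True
    return any(incident.get(key, 0) >= limit for key, limit in thresholds.items())
```

For example, `is_significant({"downtime_minutes": 90})` is true under these thresholds, while an incident with 15 minutes of downtime and 20 defective units is not.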

      Next, the responsibility for reporting when such an incident has occurred and for analyzing the root cause of that incident becomes standard operating procedure.

      A form for reporting significant failures and capturing information related to a root cause failure analysis is provided in Fig. 2.

      Fig. 2. Failure analysis report should capture all data and activities related to a failure, including root cause and correction verification.

      Decisions on whether to proceed with the suggested preventive or mitigation action plans should be made by the analyst or departmental leadership team in order to maximize the speed of implementation, solution ownership, and accountability.

      Variance and incidents preventable by known actions? (5)

      Many times, the appropriate response to performance variances or major incident reports is not obvious. A standing team or ad hoc team to deal with these situations is necessary. When the failure elimination system has been in place for less than a year or two, external support typically helps departmental teams to address these issues. Later, when uptime, run rates, and quality are under control, departmental leadership teams decide how to proceed if the required solution is not obvious.

      Prioritization of loss reasons (6)

      When it is not clear what to do to address losses of availability, run speed, or quality throughput, there is a need for more detailed causal data and analysis. This need exists because failure reasons are usually codified at a summary level in order to simplify reporting and performance statistics.

      Pareto charts reveal which of the failure reasons codified at a summary level deserve the extra work necessary to eliminate the underlying root causes that drive the losses. Figure 3 shows a typical downtime Pareto chart. This activity is often performed by an ad hoc team when the elimination process is first implemented. Later, it becomes the responsibility of departmental leadership or reliability and/or quality teams.

      Fig. 3. A Pareto chart summarizing production downtime attributed to various causes quickly reveals where priorities should be set for improving performance.
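
The ranking behind a Pareto chart is simple to reproduce from summary-level loss data. The Python sketch below sorts loss reasons by size and adds the cumulative percentage of the total; the reason codes and hours are hypothetical and are not taken from Fig. 3.

```python
def pareto(losses):
    """Rank loss reasons by size and attach cumulative percent of total."""
    total = sum(losses.values())
    ranked = sorted(losses.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for reason, amount in ranked:
        running += amount
        rows.append((reason, amount, round(100 * running / total, 1)))
    return rows


# Hypothetical downtime hours by summary-level reason code.
downtime_hours = {"drive": 25, "welder": 50, "other": 10, "conductor roll": 15}
for reason, hours, cumulative in pareto(downtime_hours):
    print(f"{reason:15s} {hours:5d} {cumulative:6.1f}%")
```

In this invented data set, the first two reasons account for 75% of the downtime, which is where the detailed root cause work would start.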

      Loss contributions and action plans developed for root causes (7)

      The reasons for downtime, product defects, and run speed losses are typically codified by the equipment component or type of operator error involved rather than the root cause of the problem. For example, downtime caused by a motor failure will be logged against the drive rather than against "improper lubrication," "bad rewinding practice," "excessive start frequency by operator," or another potential root cause. This situation arises because there is insufficient time (and sometimes insufficient resources) to perform root cause analyses of all incidents and because there is a lag between when the incident is recorded and when the root cause is determined.

      Activities to remove failures must be directed at specific root causes to be effective. Consequently, when developing action plans to reduce losses, one must apportion the total downtime, defects, etc., associated with a loss reason among the likely root causes. Figure 4 is a form for accomplishing this apportionment. Although the process is not completely objective or data driven, it beats the alternatives.
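
The apportionment itself is straightforward arithmetic once the team has agreed on relative weights. The sketch below assumes a proportional split, which is one reasonable approach rather than the method behind Fig. 4; the 40-hour total and the weights are hypothetical, and the root cause labels are borrowed from the motor failure example above.

```python
def apportion(total_loss, weights):
    """Split a loss total among root causes in proportion to estimated weights."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one root cause needs a positive weight")
    return {cause: total_loss * w / weight_sum for cause, w in weights.items()}


# Hypothetical: 40 hours of downtime logged against the drive,
# with team-estimated relative weights for each likely root cause.
shares = apportion(40, {
    "improper lubrication": 5,
    "bad rewinding practice": 3,
    "excessive start frequency by operator": 2,
})
```

With these weights the 40 hours divide into 20, 12, and 8 hours. The weights remain a judgment call, which is why the result should be revisited as root cause analyses accumulate.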

      Once root causes are identified and their impact quantified, potential action plans must be postulated with estimates of time and cost to complete the plans. This step compiles the information necessary to select the action plans that make the best business sense.
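
One simple way to compare candidate plans is by estimated loss reduction per unit of cost. The Python sketch below assumes that criterion; a real selection would also weigh time to complete, safety, and risk. The plan names and figures are invented for illustration.

```python
def rank_plans(plans):
    """Sort candidate plans by estimated loss reduction per unit cost, best first."""
    def payoff(plan):
        # Treat no-cost practice changes as the best possible payoff.
        if plan["cost"] == 0:
            return float("inf")
        return plan["loss_reduction"] / plan["cost"]
    return sorted(plans, key=payoff, reverse=True)


# Hypothetical candidates: loss_reduction in downtime hours per year, cost in dollars.
candidates = [
    {"name": "revise lubrication schedule", "loss_reduction": 20, "cost": 2800},
    {"name": "operator start-frequency training", "loss_reduction": 10, "cost": 0},
    {"name": "replace drive", "loss_reduction": 30, "cost": 15000},
]
ranked = [p["name"] for p in rank_plans(candidates)]
```

Here the no-cost training ranks first and the drive replacement last, even though the replacement removes the most downtime in absolute terms.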

      Action items added to action database (8)

      In plants that suffer from numerous reliability and quality problems, it is common for plant personnel to dance from one crisis to another without focusing on a problem or a plan until resolved. Keeping action plans current in a database with priorities, target dates, and accountabilities viewable by the general plant population promotes the discipline needed to systematically eliminate the failures. The responsibility for completing this activity usually falls to operating and maintenance leads, engineers, or standing reliability teams.
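
The database need not be elaborate. As a minimal sketch, assuming a flat record per action item, the Python below holds the priority, target date, and accountable person described above and lists open items that have slipped past their targets. The field names, status values, and sample entries are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str
    priority: int          # 1 = highest
    owner: str             # the single accountable person
    target_date: date
    status: str = "open"   # "open", "deferred", or "complete"


def overdue(items, today):
    """Open items past their target date, highest priority first."""
    late = [i for i in items if i.status == "open" and i.target_date < today]
    return sorted(late, key=lambda i: (i.priority, i.target_date))


# Hypothetical entries, for illustration only.
database = [
    ActionItem("Develop visual inspection PM", 2, "A. Lead", date(2024, 3, 1)),
    ActionItem("Cover exposed line section", 1, "B. Engineer", date(2024, 2, 15)),
    ActionItem("Evaluate new roll design", 3, "C. Planner", date(2024, 6, 1)),
]
late_items = overdue(database, today=date(2024, 4, 1))
```

Posting a list like `late_items` where the plant population can see it is one way to provide the visibility the text calls for.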

      Action plans approved? (9)

      Many of the action plans developed will involve minor changes in operating or maintenance practices. These changes call for minimal red tape in approvals so that the rate of improvement is maximized. Some actions, however, require capital or collaboration across teams. These plans require a simple, well-understood structure for getting the appropriate people involved in deciding how to proceed.

      Action item delay noted in database (10)

      When an action item cannot or should not be implemented immediately, the reason and logic for not approving it must be communicated to those involved in the previous analysis and planning. Similarly, the status of that action item should be maintained in the database so it can be reconsidered easily when appropriate.
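
In database terms, this means a deferred item is kept rather than deleted, and its record carries the reason. The sketch below shows one way to enforce that rule in Python; the dictionary layout, the status names, and the sample reason are assumptions for illustration only.

```python
def defer(item, reason):
    """Return a copy of the action item marked as deferred, with the reason recorded."""
    if not reason.strip():
        raise ValueError("a deferred item must carry a reason")
    return {**item, "status": "deferred", "delay_reason": reason}


def deferred_items(database):
    """Items to reconsider when circumstances change."""
    return [i for i in database if i["status"] == "deferred"]


# Hypothetical record and reason.
item = {"id": 4, "description": "Install redesigned component", "status": "proposed"}
database = [defer(item, "Capital not available until next budget cycle")]
```

Refusing an empty reason is the point of the design: the people who did the analysis can always see why their proposal is waiting.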

      Action item assigned to responsible person with completion target, budget (11)

      Accountability for completing the action plan in a timely, cost-effective fashion is crucial. Although teams of resources may provide support, accountability with sufficient resources must be assigned to a single individual. Frequently, this assignment is automatic, because the responsibility is already part of someone's job description.

      Action item completion (12)

      The failure elimination process does not end with the completion of action items. Effectiveness of the solution must be validated, including assurance that the fix did not create new problems.

      Action item status reports (13)

      Simple and routine status reports by those accountable for action items maintain focus and promote schedule and budget discipline.

      Effectiveness audits (14)

      If the ultimate actual impact of each action item isn't quantified, work teams lose many valuable lessons learned from their successes and misfires. In addition, a trial-and-error mentality evolves and errors are repeated by successors.

      Once successes are validated, they are easily extrapolated to other items that are susceptible to the same or similar failure modes.
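
An audit of this kind comes down to comparing the improvement that was projected with the improvement that was measured. The following Python sketch assumes a single loss measure (for example, hours of delay per month) and a hypothetical 10% tolerance; the verdict labels are illustrative, not standard terms.

```python
def audit(projected_delta, before, after, tolerance=0.1):
    """Compare the measured improvement with the projection.

    Returns (actual_delta, verdict). `tolerance` is the fraction of the
    projected improvement allowed as shortfall before flagging the item.
    """
    actual_delta = before - after
    if actual_delta >= projected_delta * (1 - tolerance):
        verdict = "validated"
    elif actual_delta > 0:
        verdict = "partial"
    else:
        verdict = "ineffective"
    return actual_delta, verdict


# Hypothetical: a plan projected to remove 20 hours of delay per month.
result = audit(projected_delta=20, before=22, after=3)
```

In this invented case the measured improvement is 19 hours, so the item is validated. A "partial" or "ineffective" verdict would send the item back for another look at the root cause.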

      Database updates (15)

      Using the action item database to track and communicate the status of each action item provides a useful tool when a failure recurs or is not completely removed as planned.

      Routine performance and action item review by leadership team (16)

      By religiously reviewing performance statistics and the status of key items in the action database, plant and departmental leaders can properly allocate resources, adjust priorities, focus improvement, and raise the bar for expected performance.

      Edited by Richard L. Dunn, Editor 630-288-8779,

      More info

      The author is available for further information on these procedures at .

      Reliability Improvement Plan

      Root Cause | Delay Cont. | Item | Action Item Description | Est'd Cost | New Delay | Delta | Responsible Person | Target Date
      Debris falls on the line during operation | 2.2 | 1 | Cover the line. (Cost estimate is for 12 foot section) | 2.8K | 0.2 | 2 | G. Billings | 10/28/97
      Lack of visual inspection | 2.2 | 2 | Develop operations visual inspection PM | | 0.2 | 2 | J. Smith/B. Snyder | 10/28/97
      Welder deflector roll contributes to line dimples/pickup | 2.2 | 3 | Recover welder deflector roll (Chrom Roll) | 15K | 0.2 | 2 | P. Sense J. Wont | 12/2/97
      Wetting Bar plugging | 2.2 | 4 | Install new design modes which are written up to allow more flow (Need is being evaluated) | 52K | 0.2 | 2 | Jim Smith | 2/15/98
      Conductor Roll Failures | 2.2 | 5 | Develop roll life base line data for roll change outs prior to life cycle end. Develop PM program for conductor rolls. | &2K | 0.2 | 2 | Wont/Snyder Sense | 10/28/97

      Fig. 4. Preparing written reliability improvement plans to address the root causes of failures helps quantify costs and improvements.
