Selecting safety system designs
Process facilities operate at many different levels of risk, depending on what they process and how. There are also many different ways to design safety instrumented systems to address that risk. Users and engineering firms alike are asking which technology should be used (hard-wired relay, pneumatic or programmable), what level of redundancy is appropriate (single, dual or triple) and how often the system should be tested (monthly, quarterly, yearly or once per shutdown). Debate continues as to how one even makes these choices: past experience, qualitative judgment, quantitative analysis and so on.
Current national and international standards (as well as existing guidelines) on the design of safety instrumented systems are performance oriented, rather than prescriptive. The standards do not mandate the type of technology, level of redundancy or test intervals. They outline what must be done, but not how to do it.
For example, U.S. and international documents identify three safety integrity levels (SIL) for the process industries, and describe the performance requirements for each level. Simply put, different levels of risk require different levels of safety system performance. This is in line with the national legislation in the U.S. (29 CFR 1910.119) on Process Safety Management (PSM), which states that users must “determine and document that systems are designed and operating in a safe manner.”
Safety system history
The design of safety systems 30 years ago was fairly straightforward and involved few choices such as emergency shutdown systems and interlocks. In most cases, the process control system consisted of single-loop controllers, and the safety system was a separate, independent system, usually consisting of hard-wired relays (Fig. 1).
These relay systems were relatively simple and their failure characteristics were well known. A properly designed relay system was — relatively speaking — “safe.” The drawback of relay systems, however, quickly becomes apparent when the number of inputs and outputs (I/O) increases much beyond 20. The wiring is cumbersome, the logic is difficult to change, documentation must be done manually, there is no form of automatic diagnostics, no digital communications and the list goes on.
Relay systems have a low initial price, but the overall cost of ownership can be relatively high. Because relay systems are typically simplex and are designed to fail safe, they suffer from nuisance trips, shutting the process down when nothing is actually wrong. This results in lost production and lost income, and is obviously not desirable.
Introduction of the PLC — The introduction of the programmable logic controller (PLC) in 1969 brought about many changes. PLCs were specifically designed to replace hardwired relay control systems. They offered many potential advantages such as software flexibility, self-documentation, smaller size and lower life cycle costs.
While PLCs offered advantages for many different applications, most were not suited for safety because of their failure mode characteristics: they are much more likely to fail dangerously than a relay (Fig. 2). Unfortunately, many users were (and some still are) unaware of this simple fact. The main limitation, as far as safety is concerned, is the lack of effective diagnostics, especially in the I/O modules.
Alternative technologies — In order to overcome the drawbacks of relay and general purpose PLC systems, other systems were developed — namely solid state systems. These systems were designed specifically for safety and offered numerous features and benefits such as complete testing and diagnostics, secure implementation of bypasses and digital communications.
These systems did not use software, so there were no concerns over possible software bugs. For some applications (e.g., batch processes), this lack of software flexibility could be viewed as detrimental. The main drawback of these systems, however, was the cost: typically more than $1,500 per I/O for a fully engineered system.
First-generation safety PLCs — The aerospace industry in the U.S. funded research in the area of fault-tolerant computers. This research ultimately led to the development of triple modular redundant (TMR) programmable systems. Triplicated systems were designed to run properly even in the presence of faults (the channel that disagrees is simply outvoted), hence the term fault tolerance. For many small, low-risk process applications, TMR logic systems are often viewed as overkill, especially considering that a majority of TMR logic systems are connected to simplex field devices.
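The outvoting idea can be illustrated with a minimal sketch of two-out-of-three (2oo3) majority voting on a single discrete signal. The function names are hypothetical and the sketch ignores everything a real TMR system adds (synchronization, diagnostics, output voting); it only shows why one disagreeing channel does not change the result.

```python
def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Return the majority of three channel readings (True = trip demanded)."""
    return (a and b) or (a and c) or (b and c)


def disagreeing_channels(readings):
    """Return the indices of channels that differ from the voted result."""
    voted = vote_2oo3(*readings)
    return [i for i, reading in enumerate(readings) if reading != voted]


# Hypothetical case: channel 1 has failed and falsely demands a trip.
readings = (False, True, False)
print("Voted result:", vote_2oo3(*readings))
print("Outvoted channels:", disagreeing_channels(readings))
```

Note that voting only masks a fault inside the logic solver; it does nothing for a fault in a single (simplex) field device feeding all three channels.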
Advanced diagnostics safety systems — In the early 1990s, a new era of safety PLCs began. These systems used advances in microprocessor performance to implement system-level diagnostics that could improve both safety and availability. The concept was that by using the system’s intelligence, one could lessen the need for extra components, arguably providing a more cost-effective system with the same level of safety and availability as a TMR system.
System architectures that employ these advanced self-diagnostics are typically designated 1oo1D, 1oo2D, etc., where the “D” stands for diagnostics capable of bringing down the system. While many systems claim some level of diagnostics, the important aspect here is that these diagnostics can automatically drive the system to its known safe state when a dangerous failure is detected (Fig. 3).
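As a rough illustration of what the “D” adds, the sketch below models a single channel (1oo1D) whose output goes to the safe state either on a genuine trip demand or when self-diagnostics report a dangerous fault. It assumes a de-energize-to-trip design, and the class and function names are hypothetical; it is not a description of any particular vendor's implementation.

```python
from dataclasses import dataclass


@dataclass
class ChannelStatus:
    trip_demanded: bool             # the process condition calls for a shutdown
    dangerous_fault_detected: bool  # diagnostics found a fault that could block a trip


def output_energized_1oo1d(status: ChannelStatus) -> bool:
    """Return True to keep the process running, False to go to the safe state.

    Assumption: de-energize-to-trip, so False (de-energized) is the safe state.
    """
    return not (status.trip_demanded or status.dangerous_fault_detected)


# A detected dangerous fault shuts the process down even with no trip demand.
print(output_energized_1oo1d(ChannelStatus(False, False)))
print(output_energized_1oo1d(ChannelStatus(False, True)))
```

The design choice is visible in the second case: a detected dangerous failure is converted into a safe shutdown instead of leaving the system unable to respond to a later demand.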
Process Safety Management legislation in the U.S. (29 CFR 1910.119) mandates that employers “determine and document” that safety system equipment be “suitable” for the particular application, and that safe operation be “assured.” It is easy to be confused by such statements. Just how does one determine and then document the performance of safety control equipment? How does one justify the decision to use a particular system?
Many use some form of qualitative, intuitive decision process. Unfortunately, it is extremely difficult to come up with any scalable, repeatable, subjective process for considering all of the relevant factors in the design and selection of a safety system. Certain industry groups were aware of this problem and formed committees to develop standards in this critical area.
In order to make comparisons of different systems, one must first have a meaningful way to measure system performance. After all, if you can’t measure it, you can’t manage it.
Today, it is well known that safety systems can fail in either of two modes. They may fail safe, initiating a nuisance trip and shutting the plant down when nothing is actually wrong. Or they may fail dangerous, suffering an inhibiting (fail-to-function) failure and failing to respond to an actual shutdown demand.
Spurious trip performance
Many are familiar with the term “availability,” which is determined by dividing uptime by total time. However, availability is often misunderstood. In terms of spurious trip performance, what does an availability of 99.9% tell us? The number sounds good, but what does it really mean?
A system that initiates a nuisance trip once a month and is down for 40 minutes has an availability of 99.9%. The same could be said of a system that initiates a nuisance trip once per year but is down for nine hours, or of one that trips once every 10 years but is down for 90 hours. The important point is that one must assume the actual downtime of the process in order to perform the calculation. How meaningful would your equipment manufacturer’s claim be if it assumed a one-hour repair time, when it is well understood that your facility would be down for 12 hours once the time required to get the process back online is taken into account? Obviously, an invalid assumption leads to an invalid answer.
The term used in ANSI/ISA-84.00.01-2004, Functional Safety: Safety Instrumented Systems for the Process Industry Sector, for performance in this mode is “MTTFsp,” or mean time to fail spurious (nuisance trip). The difference between a system that causes a nuisance trip once every six months, versus once every six years, 60 years or 600 years is readily apparent. The user already knows how long the process will be down if the event happens; what the user wants to know is how often it might happen.
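The availability arithmetic above can be checked with a short sketch. It assumes an 8,760-hour year and treats MTTFsp simply as the reciprocal of the nuisance trip rate; the function names are hypothetical. The last comparison uses an assumed rate of one trip per year to show how the downtime assumption alone moves the availability figure.

```python
HOURS_PER_YEAR = 8760.0  # assumption: 365-day year, leap years ignored


def availability(trips_per_year: float, downtime_hours_per_trip: float) -> float:
    """Uptime divided by total time for a given nuisance trip rate and downtime."""
    downtime_hours = trips_per_year * downtime_hours_per_trip
    return 1.0 - downtime_hours / HOURS_PER_YEAR


def mttf_spurious_years(trips_per_year: float) -> float:
    """Mean time to fail spurious, in years, taken as 1 / trip rate."""
    return 1.0 / trips_per_year


scenarios = [
    ("monthly trip, 40 minutes down", 12.0, 40.0 / 60.0),
    ("yearly trip, 9 hours down", 1.0, 9.0),
    ("trip every 10 years, 90 hours down", 0.1, 90.0),
]
for label, rate, hours in scenarios:
    print(f"{label}: availability {availability(rate, hours):.1%}, "
          f"MTTFsp {mttf_spurious_years(rate):.2f} years")

# Same assumed trip rate (once per year), different downtime assumptions.
print(f"1 h assumed downtime:  {availability(1.0, 1.0):.3%}")
print(f"12 h actual downtime: {availability(1.0, 12.0):.3%}")
```

All three scenarios round to the same availability, while their MTTFsp values differ by a factor of more than 100, which is why the trip frequency is the more informative number.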
Different people use different terms for the safety performance of these systems. Figure 4 shows the concept of how safety shutdown systems are used to reduce the risk inherent in a process down to acceptable levels. The performance described here is often referred to as the level of risk reduction. Simply put, a “safer” safety system has a “better” (larger) risk-reduction factor (RRF) than a “poorer” safety system. Phrasing it another way, the RRF can be defined as “the amount by which the system lowers the overall risk of the facility, versus not using the system at all.”
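One way to read this definition is as a ratio of event frequencies. The sketch below assumes the risk of a hazardous event is expressed as a frequency per year, with and without the safety system; the function names and the numbers are hypothetical and chosen only to make the arithmetic visible.

```python
def required_rrf(unmitigated_per_year: float, tolerable_per_year: float) -> float:
    """Risk-reduction factor needed to bring an event frequency down to a target."""
    if unmitigated_per_year <= 0 or tolerable_per_year <= 0:
        raise ValueError("frequencies must be positive")
    return unmitigated_per_year / tolerable_per_year


def mitigated_frequency(unmitigated_per_year: float, rrf: float) -> float:
    """Event frequency remaining once a system with the given RRF is in place."""
    return unmitigated_per_year / rrf


# Hypothetical numbers: one event per 10 years without the safety system,
# and a target of one event per 10,000 years.
unmitigated = 0.1
tolerable = 0.0001
rrf = required_rrf(unmitigated, tolerable)
print(f"Required RRF: {rrf:,.0f}")
print(f"Frequency with that RRF: {mitigated_frequency(unmitigated, rrf):.4f} per year")
```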
ISA S84 and IEC 1508 (International Electrotechnical Commission standard 1508, Functional Safety: Safety Related Systems) are design standards. Both use the concept of safety integrity levels to relate the required safety system performance to the level of risk inherent in the process. The table “IEC and ISA performance requirements” summarizes the requirements using several terms, all of which can be directly related to one another.
Integrity levels are an indirect way to qualify risk, instead of more traditional measures such as death rates, fatality rates or hazard rates. The reason is simple: an international standard cannot state that it is “acceptable” or “tolerable” for up to four people per 100 million man-hours to die in a refinery.
The table also shows the simplicity of the number range used for RRF. The difference between an “availability” of 99% and 99.99% would not appear to be significant (it’s less than 1%). The difference between the corresponding RRFs of 100 and 10,000, however, is obvious, even to a small child. In both cases the difference is two orders of magnitude.
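The relationships between the terms (PFD = 1 - safety availability, RRF = 1 / PFD) can be worked through in a few lines. The sketch below uses exact fractions so that boundary values such as 99% do not suffer floating-point rounding. The function names are hypothetical, and the way boundary values are assigned to a SIL band (the upper RRF limit belongs to the lower band) is an assumption of this sketch.

```python
from fractions import Fraction


def pfd(safety_availability: str) -> Fraction:
    """Probability of failure on demand = 1 - safety availability."""
    return 1 - Fraction(safety_availability)


def rrf(safety_availability: str) -> Fraction:
    """Risk-reduction factor = 1 / PFD."""
    return 1 / pfd(safety_availability)


def sil_band(risk_reduction: Fraction):
    """Map an RRF onto the three SIL bands; None if outside them.

    Assumption: a boundary value such as RRF = 100 falls in the lower band.
    """
    if 10 < risk_reduction <= 100:
        return 1
    if 100 < risk_reduction <= 1000:
        return 2
    if 1000 < risk_reduction <= 10000:
        return 3
    return None


for value in ("0.99", "0.9999"):
    print(f"availability {value}: PFD {float(pfd(value))}, "
          f"RRF {rrf(value)}, SIL band {sil_band(rrf(value))}")

# Two availabilities that look nearly identical differ by a factor of 100 in RRF.
print("RRF ratio:", rrf("0.9999") / rrf("0.99"))
```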
Engineering is a quantitative discipline. Decisions involving the sizing of valves, piping, fans and turbines are based on quantitative calculations. Today, a similar quantitative approach can be taken when choosing a safety system.
Meeting your safety and availability requirements should always take precedence. However, once you have achieved those targets, you can focus on other areas that help reduce your total cost of ownership, such as programming languages, connectivity, cyber security protection, physical size, remote I/O capability and many other issues that affect reliability and the bottom line.
IEC and ISA performance requirements
| Safety integrity level (SIL) | Safety availability | Probability of failure on demand (PFD) = 1 - safety availability | Risk reduction factor (RRF) = 1 / PFD |
|---|---|---|---|
| 3 | 99.9% – 99.99% | 0.001 – 0.0001 | 1,000 – 10,000 |
| 2 | 99% – 99.9% | 0.01 – 0.001 | 100 – 1,000 |
| 1 | 90% – 99% | 0.1 – 0.01 | 10 – 100 |
| Process control | Not applicable | Not applicable | Not applicable |

Charles M. Fialkowski, Certified Functional Safety Expert (CFSE), has been a Safety Systems Specialist for more than 10 years, with a focus on process safety. He is the Chairman for ISA’s Safety Division on Fire and Gas systems and a member of the ISA’s technical committee SP84 on Safety Systems. He has instructed ISA’s BMS and LOPA courses, published numerous papers on Safety Instrumented Systems, and is a developer of a BMS course for Exida.com. Mr. Fialkowski is a National Process Safety promoter with Siemens Energy & Automation.