Achieve fault tolerance with a real-time software design

04/29/2013


Process replication, writer-side peer awareness

To avoid disadvantages, another option is to make Manager processes aware of each other at the application level. If all Manager processes know which other Manager processes are currently participating in the infrastructure, then they all could decide for themselves, according to a system-wide consistent selection algorithm, which of the replicated Manager processes is the currently active one. All Manager processes would be participating in the pattern and updating the relevant states accordingly—but only the active Manager would actually be communicating those updates to the data-space. If the active Manager leaves the system for some reason, then the remaining Manager processes need to decide for themselves whether or not they are the new active Manager and adjust their behavior accordingly.

DDS supports the necessary features to implement this approach. By subscribing to so-called built-in Topics, applications can be made aware of other DDS Entities in the cloud. This kind of subscription also can notify the application of lost liveliness or destruction of each of these Entities. Those two aspects combined are sufficient for this example.

This approach has no application-level impact on subscribing components in the system and, consequently, they do not require code changes. This kind of replication also does not consume any extra bandwidth or resources, as plain replication does.

There is some impact at the application level for the Manager processes. The mechanism of peer-awareness and leader selection has to be created. This is not trivial, and the impact of bugs in that piece of code could be high. (It could lead to none of the processes stepping up as the leader!)

Process replication, exclusive ownership

The last option here takes advantage of a native DDS feature called exclusive ownership. Courtesy: Real-Time InnovationsThe last option here takes advantage of a native DDS feature called exclusive ownership. Normally, DDS allows multiple publishers to update the same data-item simultaneously. This behavior is called shared ownership, and it is the default setting for the ownership quality of service (QoS) setting. If the nondefault setting, called exclusive ownership, is selected, the infrastructure will make one of the publishers the owner of the data-item—at any time, only the owner of the data-item can update its state. For this leader selection, another policy called ownership strength can be adjusted. This integer value will be inspected by the middleware to identify the current leader at any time, that is, the publisher with the highest ownership strength. This selection is done consistently through the system.

Other than selecting the right QoS, this solution is transparent to all system components. The different Manager processes are unaware of each other and execute their tasks independently. Similarly, the subscribing components do not (have to) know that in fact, multiple publishers are present in the system. Everything needed to achieve the required functionality is handled within the middleware. As a consequence, no extra code has to be added, nor tested, nor maintained.

Exclusive ownership functionality is described in the DDS specification and was designed with the mechanism of process replication in mind. In that sense, it is not a surprise that this solution is usually the best fit. The specification does not indicate how this functionality should be achieved by the middleware; that is left up to the product vendors. However, it is good to know that typically, for optimal robustness, implementations do send data over the wire for each of the active publishers, and the leader selection takes place at the receiver side. This means that the feature does result in extra networking traffic.

Recovering from faults

Process replication is an important aspect of increasing fault tolerance in a system. However, there is more to it. If a system, by virtue of the process replication, can continue after a fault, then it will have entered a stage in which it is less fault-tolerant simply because one of the replicating processes is gone. It is still essential that the system recovers as quickly as possible from the fault and returns to the fault-tolerant state. In our example, this means that the failing Manager process will have to restart, obtain the current state of all task data-items, and start participating in the pattern again. DDS can help make the task of recovery easier as well.

- About the author: Reinier Torenbeek is a systems architect at Real-Time Innovations (RTI). He has experience with all aspects of architecting, designing, and building distributed, real-time systems. He has been a developer of real-time publish/subscribe middleware for several years. In the field, he has worked with customers in many kinds of domains. He has been involved in a wide range of middleware software activities for 15 years. Currently, he serves as a consultant for customers creating their systems using the RTI Connext product suite, focusing on commercial and industrial automation applications. Edited by Mark T. Hoske, content manager, CFE Media, Control Engineering, mhoske(at)cfemedia.com

See this article at www.controleng.com/archive for more about the author.

www.rti.com/products/dds/ 


<< First < Previous 1 2 Next > Last >>

No comments
The Top Plant program honors outstanding manufacturing facilities in North America. View the 2013 Top Plant.
The Product of the Year program recognizes products newly released in the manufacturing industries.
The Leaders Under 40 program features outstanding young people who are making a difference in manufacturing. View the 2013 Leaders here.
The new control room: It's got all the bells and whistles - and alarms, too; Remote maintenance; Specifying VFDs
2014 forecast issue: To serve and to manufacture - Veterans will bring skill and discipline to the plant floor if we can find a way to get them there.
2013 Top Plant: Lincoln Electric Company, Cleveland, Ohio
Case Study Database

Case Study Database

Get more exposure for your case study by uploading it to the Plant Engineering case study database, where end-users can identify relevant solutions and explore what the experts are doing to effectively implement a variety of technology and productivity related projects.

These case studies provide examples of how knowledgeable solution providers have used technology, processes and people to create effective and successful implementations in real-world situations. Case studies can be completed by filling out a simple online form where you can outline the project title, abstract, and full story in 1500 words or less; upload photos, videos and a logo.

Click here to visit the Case Study Database and upload your case study.

Bring focus to PLC programming: 5 things to avoid in putting your system together; Managing the DCS upgrade; PLM upgrade: a step-by-step approach
Balancing the bagging triangle; PID tuning improves process efficiency; Standardizing control room HMIs
Commissioning electrical systems in mission critical facilities; Anticipating the Smart Grid; Mitigating arc flash hazards in medium-voltage switchgear; Comparing generator sizing software

Annual Salary Survey

Participate in the 2013 Salary Survey

In a year when manufacturing continued to lead the economic rebound, it makes sense that plant manager bonuses rebounded. Plant Engineering’s annual Salary Survey shows both wages and bonuses rose in 2012 after a retreat the year before.

Average salary across all job titles for plant floor management rose 3.5% to $95,446, and bonus compensation jumped to $15,162, a 4.2% increase from the 2010 level and double the 2011 total, which showed a sharp drop in bonus.

2012 Salary Survey Analysis

2012 Salary Survey Results

Maintenance and reliability tips and best practices from the maintenance and reliability coaches at Allied Reliability Group.
The One Voice for Manufacturing blog reports on federal public policy issues impacting the manufacturing sector. One Voice is a joint effort by the National Tooling and Machining...
The Society for Maintenance and Reliability Professionals an organization devoted...
Join this ongoing discussion of machine guarding topics, including solutions assessments, regulatory compliance, gap analysis...
IMS Research, recently acquired by IHS Inc., is a leading independent supplier of market research and consultancy to the global electronics industry.
Maintenance is not optional in manufacturing. It’s a profit center, driving productivity and uptime while reducing overall repair costs.
The Lachance on CMMS blog is about current maintenance topics. Blogger Paul Lachance is president and chief technology officer for Smartware Group.