System repairs: 6 steps to find the root cause

Automation systems will eventually need to be repaired. Following a regular procedure while troubleshooting to find the root cause can be a big help.

CFE Media

Every automation system eventually develops a situation requiring advanced engineering support. This type of “break-fix support” may be due to any number of causes—power outages, server maintenance, operator error, etc. But no matter what the root issue turns out to be, sooner or later, every system will need it. And that’s why it’s equally sure that system integrators will at some point find themselves helping to support customers and keep their manufacturing processes running.

Troubleshooting is a distinct set of skills, and each person may have a slightly different approach to resolving an issue. When I find myself in a break-fix situation, I tend to follow a regular procedure, not only to fix the problem but also to determine the root cause of the issue:

Step 1: Ask questions

I always begin by discussing the symptoms of the issue with the person reporting it. If you think about it, how can you solve a problem if you don’t know what the problem is? Asking the right questions in this first phase of the support process is vital to enabling a successful resolution.

Step 2: Replicate the issue yourself

Sometimes the information you’ve gathered in step one may not quite paint the full picture of the situation. When I try to replicate the issue, I often gain insight into what the user is actually reporting.

Step 3: Check the log files

A well-built system will provide evidence of what is happening when something is not working properly. If I’m lucky, error messages will provide the context for understanding the actual problem. Even if the system hasn’t generated any error messages, the system logs can often provide details regarding behind-the-scenes issues in a script or database transaction. Analyzing these messages can often reveal the issue at hand.
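As a rough illustration of this step, the Python sketch below pulls warnings and errors logged around the time the issue was reported. The log line format, the function name entries_near, and the sample entries are all assumptions made for the example; a real system's logs will look different.

```python
import re
from datetime import datetime, timedelta

# Assumed (hypothetical) line format: "2024-05-01 13:45:02 ERROR message text"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")

def entries_near(lines, reported_at, window_minutes=5, levels=("ERROR", "WARN")):
    """Return (timestamp, level, message) tuples logged close to reported_at."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed format
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        level, message = match.group(2), match.group(3)
        if level in levels and abs(stamp - reported_at) <= window:
            hits.append((stamp, level, message))
    return hits

# Made-up sample data, for illustration only
sample_log = [
    "2024-05-01 13:40:10 INFO Batch 17 started",
    "2024-05-01 13:44:58 WARN Query took 9.2 s",
    "2024-05-01 13:45:02 ERROR Database connection timed out",
    "2024-05-01 15:00:00 ERROR Unrelated failure",
]
hits = entries_near(sample_log, datetime(2024, 5, 1, 13, 45))
for stamp, level, message in hits:
    print(stamp, level, message)
```

Narrowing the search to a time window around the reported symptom keeps unrelated errors, like the last sample line, out of the way.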

Step 4: Trace backwards

I start at the point in the system where the issue has been reported and trace backwards. For example, let’s assume the user is experiencing an issue on a specific application screen. My approach to solving the problem begins with drilling down into the specific elements of the screen that are not working, such as a button.

I dig into the code/function behind the button to see how it’s supposed to work. Perhaps the button triggers a script that queries a database for data, but that data isn’t displaying on the screen. Tracing through these individual elements/functions can often help me understand where in the process the malfunction occurs.
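The backward trace can be pictured as a chain of checks that starts at the reported symptom and moves toward the data source. The Python sketch below is one possible way to express that idea. The stage names and the pass/fail results are hypothetical stand-ins for real checks such as running the query by hand or calling the script directly.

```python
def trace_backwards(stages):
    """Walk from the symptom toward the source and return the suspect stage.

    stages is a list of (name, check) pairs ordered from the reported symptom
    back toward the source. Each check returns True when that stage works.
    The suspect is the last failing stage before the first healthy one.
    """
    suspect = None
    for name, check in stages:
        if check():
            break  # this stage works, so the fault lies just downstream of it
        suspect = name
    return suspect

# Hypothetical checks for the button example; real checks would inspect the system
stages = [
    ("screen shows data", lambda: False),         # the reported symptom
    ("button script returns rows", lambda: False),
    ("database query returns rows", lambda: True),
]
print(trace_backwards(stages))
```

In this made-up case the query returns rows but the script does not pass them on, so the script behind the button is where the closer look would begin.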

Step 5: Restart/redeploy the system

Usually, it’s not going to be possible to restart servers in a manufacturing system without taking down other, still functional parts. However, I find it amazing how often “turning it off and on again” will fix a system when some underlying aspect gets out of sync.

Step 6: Document the findings

It’s always good practice to document the issue, both for the customer’s benefit and to provide insight to the support team. One of the main benefits of documentation in a support situation is to provide some guidance should the same situation reoccur. You don’t want to spend valuable time trying to re-analyze an issue if you don’t have to.
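One way to make such documentation easy to reuse is to keep each finding as a small structured record that can be searched later. The Python sketch below assumes a hypothetical IncidentRecord layout and a find_similar helper, and the stored entry is invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class IncidentRecord:
    """Hypothetical layout for one documented break-fix finding."""
    symptom: str
    root_cause: str
    fix: str
    tags: list = field(default_factory=list)

def find_similar(records, keyword):
    """Return records whose symptom text or tags mention the keyword."""
    keyword = keyword.lower()
    return [
        record for record in records
        if keyword in record.symptom.lower()
        or keyword in (tag.lower() for tag in record.tags)
    ]

# Invented example entry
records = [
    IncidentRecord(
        symptom="Trend screen shows no data after pressing Refresh",
        root_cause="Script lost its database connection after server maintenance",
        fix="Restarted the script service and added a reconnect check",
        tags=["database", "hmi"],
    ),
]
for record in find_similar(records, "database"):
    print(record.root_cause, "->", record.fix)
```

If the same symptom comes up again, a quick search of past records can point to the earlier root cause before any new analysis begins.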

Ed Miller is a project engineer at Avanceon.