Andy’s Almanac on Accidents, Part Seven

By Sphera’s Editorial Team | August 17, 2021

Andy Bartlett is back to discuss Swiss cheese in the Digital Age, and ways current software and technology might have helped prevent one of the most well-known refinery incidents in history. Listen to other episodes of Andy’s Almanac on Accidents.

 

James Tehrani:

Welcome to the SpheraNOW podcast, a program focused on safety, sustainability, and productivity issues. I’m James Tehrani, Spark’s editor-in-chief. Today, Andy Bartlett returns for part seven of the always popular Andy’s Almanac on Accidents. Andy is Sphera’s solution consultant for operational risk management, and we’ll be talking about Swiss cheese in the digital age. Thanks for joining me today, Andy. How are you doing?

Andy Bartlett:

I’m doing fine, James. Nice bit of weather here in the UK. We have some rain; we have some sun. It’s summer.

James Tehrani:

A little mixture. We like a little mixture. It keeps the plants happy, right?

Andy Bartlett:

Yeah.

James Tehrani:

All right. Let’s cut to the chase, or let’s cut to the cheese. OK, that’s a horrible joke. But can you talk a little bit about the Swiss cheese model and the evolution of risk visualization?

Andy Bartlett:

Yeah. It’s a fascinating subject to me. When I first saw a poster saying, follow the Swiss cheese model, there was this Swiss cheese with four slices and some holes. And I thought, well, what’s that got to do with safety? I started to do a bit of research. The question I had then was, what’s the name of the barriers and what’s the name of the holes?

Andy Bartlett:

Well, as I got more into my journey in safety, I found out that the barriers do have names. The holes do have names. So, the barriers are fundamental barriers. There are eight of them in the model that we use. The holes are the safety critical elements, or safety and environmental critical elements, as they call them today. The holes are actually the deviations in those safety critical elements. When you get an instrument that’s outside of its operating window, it sends an alarm to the DCS, but also into the barrier model, showing that there is a deviation. You can also enter the deviations manually, such as unexpected corrosion rates on a piping loop. If you have several deviations in one area, you start to see a picture of risk.
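
To make the idea concrete, here is a minimal sketch in Python of an instrument reading outside its operating window raising a deviation against the barrier it is mapped to. The tag names, areas, limits and the rule that a missing reading counts as a deviation are all assumptions for illustration, not details of any system discussed in the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingWindow:
    tag: str      # hypothetical instrument tag
    barrier: str  # barrier the instrument is mapped to
    area: str     # plant area where the instrument sits
    low: float    # lower limit of the operating window
    high: float   # upper limit of the operating window


def find_deviations(windows, readings):
    """Return (tag, barrier, area) for every reading outside its window."""
    deviations = []
    for w in windows:
        value = readings.get(w.tag)
        # Assumption: an instrument with no reading is also a deviation.
        if value is None or not (w.low <= value <= w.high):
            deviations.append((w.tag, w.barrier, w.area))
    return deviations


windows = [
    OperatingWindow("PT-101", "process containment", "Unit A", 2.0, 8.0),
    OperatingWindow("GD-201", "detection systems", "Unit A", 0.0, 10.0),
    OperatingWindow("TT-301", "process containment", "Unit B", 50.0, 120.0),
]
readings = {"PT-101": 9.4, "GD-201": 3.0, "TT-301": 130.0}

deviations = find_deviations(windows, readings)
for tag, barrier, area in deviations:
    print(f"{area}: {tag} is outside its window ({barrier})")
```

A manually entered deviation, such as the corrosion example, could be appended to the same list by hand.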

James Tehrani:

It sounds like something that would have been handy throughout the decades, but I’m guessing that kind of technology didn’t exist. Can you talk a little bit about the evolution of this risk visualization throughout your career?

Andy Bartlett:

Yeah. In the 1970s, when I started out, we had analog instruments on a panel. We had circular paper charts, with a little pen that recorded what the pressures and temperatures were. If something went out of the operating window, you would see it, you would get an alarm, and it would be followed up by the engineer the next day. But we didn’t have a concept of risk. Nobody put together, at that time, that you could have several excursions outside of operating windows at the same time, in the same area. They would actually be raising the risk profile, but we couldn’t see that. We used to check these instruments every 90 days. If we needed to bypass an instrument to check it, so that it didn’t trip the equipment, having that bypass in place also raised the risk, because the instrument wouldn’t be working. All of this was controlled by a paper permit to work system.

James Tehrani:

What do you mean by paper permit to work system?

Andy Bartlett:

You would write a piece of paper out saying, you have permission to go and do the calibrations on this instrument in the field. Check its zero, check its span, everything, so that it was working correctly. But with a paper permit, again, what was the risk of having a hot work permit in the same area? We didn’t know, because we didn’t have a system that told us that. We couldn’t-

James Tehrani:

So, there is no communication. I mean, you just had a piece of paper. There’s nothing to tell you that something could be… A risk pathway could be developing.

 


Andy Bartlett:

That’s exactly right. Yeah. We couldn’t see that.

James Tehrani:

After the 1970s, what happened in the ’80s?

Andy Bartlett:

OK. In the ’80s, we started to bring in digital control systems, which are called distributed control systems, where we had instruments in the field feeding into a computer system, and you could see all of it together. We still did the critical checks, et cetera. But with this system, you had a historian that told you what had happened. If you had a trip in the plant, you could see what had happened. What we had then was the design contractor, who worked with engineering to mark which instruments were safety critical. Now we knew which ones were safety critical, because those were the ones that were routinely tested. Again, we couldn’t connect the risk pathway, because nothing was linked. It was just an instrument system telling you that you had alarms. Alarms had come in and they’d been actioned.

James Tehrani:

Is that what safety experts refer to as lagging indicators? Is that part of this?

Andy Bartlett:

Well, you could say that, if you were going back to look at a system after it tripped. But the indicators, as such, were just process variables. They weren’t a collection of data that showed you what was going on. It was just instrumentation.

James Tehrani:

I see. OK. Cool. All right. So, then we’ve gone through the ’80s very quickly, I will say. We want to move on to some of the more recent topics. So, what about the 1990s?

Andy Bartlett:

OK. In the 1990s, the safety critical elements concept was introduced into the UK, in the offshore oil and gas sector, by the legislation enacted following the Piper Alpha disaster in 1988. There was a lot of criticism about process safety not being monitored. Monitoring process safety is a complicated business, and it was all done on spreadsheets. It still is in some places. But now there’s technology. We can change this from a time-consuming exercise to nearly real time. So, what was done then was that the barriers were named. The safety critical elements for a facility were part of the safety case. So, there was legislation to make sure everybody was doing this in the North Sea.

James Tehrani:

This is where James Reason comes in, with the Swiss cheese model, correct?

Andy Bartlett:

Well, the Swiss cheese model was developed in the ’90s. However, putting it into practice was done later on, after Piper Alpha. The actual model of the barriers and the safety critical elements can be found in several places. There are different versions of it, but it’s all the same principle. It’s as if the Swiss cheese had its slices named and had the holes identified. The one I always use is the protection one, which is the gas detectors out in the field. So, you have gas detectors that alarm when you have a hydrocarbon gas or a toxic gas. Those particular detectors have to be serviced every so often. When you service them, you have to take them offline to do it. So, at that time, you have a deviation while you do the test. If the alarms fail, then the instrumentation is smart enough to be able to tell you it’s got a problem. That’s if you’ve got the latest version of everything. Some plants don’t, some plants do.

James Tehrani:

About how often do you have to take those offline? How often does that maintenance, or that check, take place?

Andy Bartlett:

In some places it’s monthly. In some places, two-monthly. I’ve seen three-monthly as well. It depends on-

James Tehrani:

Fairly regularly.

Andy Bartlett:

Yeah, fairly regularly. It’s not something where you wait until it breaks; you want to have it in a routine. These are digital instruments. As I said previously, there are digital instruments you can buy today that will tell you when they’re not working. Everything moves on with technology. Everything’s getting smarter.

James Tehrani:

Perfect. Moving on. What about the 2000s? Was that the biggest change you’ve seen, in terms of the Swiss cheese digital age evolution?

Andy Bartlett:

Well, automated workplace monitoring is something I saw recently. I think it’s a good term. In the jobs I’ve been associated with in the last few years, we’ve got the SCEs, the barriers, and the critical elements mapped to the barriers. We’ve got the critical instrumentation mapped to the equipment, such as vibration on a process pump or high flow on a firewater pump, things like that, that would alarm and light up the barrier that they link to. So, an excursion from an operating window shows up as a deviation on the relevant barrier. Then, you can start to look at the deviations and the barriers. Are they in the same place? Are they in the same area? Is your risk pathway starting to build? Are you exposing yourself to major accident hazards? With the dashboards we’ve got today, you can see that the risk can be communicated all the way up through management.
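
The questions about deviations landing in the same area can be sketched as a simple grouping step. This is an illustrative Python example only: the area names, the barrier labels attached to each deviation, and the threshold of two degraded barriers are assumptions, not a rule from the episode or from any product.

```python
from collections import defaultdict


def barriers_degraded_by_area(deviations):
    """Group open deviations by area, keeping the distinct barriers affected."""
    by_area = defaultdict(set)
    for area, barrier in deviations:
        by_area[area].add(barrier)
    return dict(by_area)


def areas_of_concern(deviations, threshold=2):
    """Areas where at least `threshold` different barriers have a deviation.

    The threshold is a hypothetical choice for this sketch.
    """
    grouped = barriers_degraded_by_area(deviations)
    return sorted(a for a, barriers in grouped.items() if len(barriers) >= threshold)


# Hypothetical open deviations as (area, barrier) pairs.
open_deviations = [
    ("Unit A", "detection systems"),
    ("Unit A", "ignition control"),
    ("Unit B", "structural integrity"),
    ("Unit A", "detection systems"),
]

flagged = areas_of_concern(open_deviations)
print(flagged)
```

Counting distinct barriers rather than raw deviations reflects the Swiss cheese picture: a risk pathway builds when holes line up across several slices, not when one slice has many holes.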

Andy Bartlett:

Those who can view the system can say, OK, we know we’ve got high risk in this particular part of the plant, what’s causing it? They can actually have a look themselves, or they can get on the phone to the person in charge and say, what are you doing to bring down this risk? What mitigation measures are you putting in place? This knowledge gets captured, and you build up a history of what’s done, as this happens. You can build up a way of responding to these things, virtually instantaneously as they happen.

James Tehrani:

That’s a huge change there. I was wondering if you could take a moment to take a step back. I know you mentioned Piper Alpha before, but I was wondering if we can take this through a US related incident, that took place at Texas City. What is possible now, and I know you can’t go back in time, but what would be possible now that simply wasn’t possible back then? Were there things that could have been caught, because of technology that just wasn’t available back then?

Andy Bartlett:

Well, I’ve had a look at this a few times over the years. And I would think that with the instrumentation that you’ve got, several of the key operational indicators and alarms were inoperative, they weren’t working. They had faults with them, off and on, for months. With the most modern systems available today, it would be highlighted that the instrumentation wasn’t working. The high level on the splitter tower was one of the main problems. There was no signal at all coming from that instrument. The person inside the control room in Texas City didn’t know that the signal he was getting was false. However, with smart technology, I think they have a triple out system now, where three signals have to agree on emergency instrumentation and critical instrumentation.
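
The idea of three signals having to agree can be illustrated with a small Python sketch. It is not a description of any real safety system: picking the median of the live readings and flagging any reading that strays from it by more than a tolerance is one simple way to do it, chosen here for illustration, and the readings and tolerance are made up.

```python
from statistics import median


def vote(readings, tolerance):
    """Compare redundant readings and return (voted_value, suspect_indexes).

    A reading of None stands for a transmitter sending no signal at all.
    """
    live = [(i, r) for i, r in enumerate(readings) if r is not None]
    suspects = [i for i, r in enumerate(readings) if r is None]
    if not live:
        return None, suspects
    mid = median(r for _, r in live)
    # Flag any live reading that disagrees with the median by too much.
    suspects += [i for i, r in live if abs(r - mid) > tolerance]
    return mid, sorted(suspects)


# Hypothetical level readings from three redundant transmitters, in percent.
value, suspects = vote([62.0, 61.5, 12.0], tolerance=2.0)
print(value, suspects)
```

In this sketch a flagged transmitter would be raised as a deviation, so the operator is told that one of the signals cannot be trusted instead of acting on it.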

Andy Bartlett:

So, values of instrumentation outside safe operating limits will be indicated on the deviations in the barrier model. One of the other things, when I was reading this again, was that the shift handover and the handover process were not very good. They were criticized. But you have automated shift logs now, where the shift log takes readings from the system itself, and at the handover you can add annotations, and the person coming in has to signify that they’ve read all of this, and they’ve agreed to it. So, that’s another point that can help.

Andy Bartlett:

You can see the risk pathway building. The communication at the shift handover is a bit more stringent than it was in those days. I mean, in my days we had a logbook. You wrote in it at the end of the shift, what you remembered. It wasn’t in a chronological fashion; you tried to write down what was important. But having to go through it with the next guy coming in, we didn’t have time for that, we were waiting for the bus to get outside of the gate.

James Tehrani:

Definitely. I mean, it’s a fascinating conversation. Obviously, like I said, you can’t go back in time and change things, but it’s interesting to see what’s possible now, that wasn’t back then. We just mentioned two of the biggest incidents, in our lifetimes at least. But you mentioned something in your answer that I thought was interesting. Can you explain a little bit more about process safety deviations? Because, I think that’s a critical point to really understanding this technology.

Andy Bartlett:

Yeah. All right. First of all, we have our barriers, which are the structural integrity, process containment, ignition control, detection systems. Then we have the shutdown systems, emergency response capabilities, and last of all, but not least, the life-saving capabilities. So, as something goes wrong in the plant, and if it’s mapped to the barrier, then that would raise a deviation. Let’s see. We have corrosion, as I mentioned, where the inspection department has measured the piping corrosion. It’s going faster than they expected, or they’ve found a weakness in that. They could raise a deviation on the structural integrity, piping systems or vessels.

Andy Bartlett:

There are several incidents that happened, where pipe work failed due to corrosion. There was one quite recently, I should look that one up. Then you can look at the ignition system. This is where the permit comes in. So, you’ve issued a permit for hot work and you’re welding. So, you have introduced an ignition source into the system. If at the same time your gas alarm was to go off, then you have two parts of the fire triangle. You have the fuel, you have the heat, and you already have the oxygen, so you could end up with a fire or explosion. So, you would not want to be issuing a hot work permit in an area where the gas detection system is offline and has a deviation against it. You would reschedule the work, until you’d fixed the gas detection system.
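
The hot work example can be written down as a simple check made before a permit is issued. The Python sketch below is illustrative only: the record layout, the area names and the rule that any open deviation on the detection barrier in the same area blocks the permit are assumptions, not a description of a real permit-to-work product.

```python
def hot_work_permit_check(area, open_deviations):
    """Return (allowed, blocking) for a hot work permit request in `area`.

    Blocking deviations are open deviations on the detection barrier
    in the same area (a hypothetical rule for this sketch).
    """
    blocking = [
        d for d in open_deviations
        if d["area"] == area and d["barrier"] == "detection systems"
    ]
    return (not blocking, blocking)


# Hypothetical open deviations in the barrier model.
open_deviations = [
    {"area": "Unit A", "barrier": "detection systems",
     "note": "gas detector offline for service"},
    {"area": "Unit B", "barrier": "structural integrity",
     "note": "corrosion rate above expected"},
]

allowed_a, reasons_a = hot_work_permit_check("Unit A", open_deviations)
allowed_b, reasons_b = hot_work_permit_check("Unit B", open_deviations)
print(allowed_a, [r["note"] for r in reasons_a])
print(allowed_b)
```

Returning the blocking deviations along with the decision gives the permit issuer the reason to reschedule, which is the conversation a paper permit could not prompt.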

Andy Bartlett:

Basically, that’s in its simplified form. There are a lot of other things, of course, such as the firefighting systems. On its last test, did the pump reach its capacity? When you do a flow test on a firewater system, all the engineers are involved, and you actually map the flow through the system. You map the pressures at different places. You look at the pump and say, is the flow coming out of the pump where it’s supposed to be, at that particular power usage? Whether it’s a diesel or whether it’s electric. And, if that particular pump didn’t reach its required output, then you would have a deviation until that was fixed. So, you would know that if you did have a fire and you had to do some firefighting, then you might have to get some additional resources from the fire department to help out, such as temporary pumps, until you’d got that fixed.

Andy Bartlett:

So, that’s how the system works. It alerts you to a fault, a deviation in the system, and you need to mitigate that to bring your risk level down to ALARP, As Low As Reasonably Practicable. You have to look at all of the work that’s going on, the simultaneous operations. Are you taking samples? Are you releasing gas? Do you have people working in trenches? Do you have people working inside vessels? All of those jobs have their own risks. As the risk pathway starts to develop, you need to mitigate and maybe stop some of those jobs, and reschedule them, until you bring your risk levels down.

James Tehrani:

I want to go a step further here. It’s just more of a philosophical question. When you think about all of this capability that companies have at their disposal, with the software, and you still see incidents happening. Every year there are multiple incidents in plants across the world. Why are we still seeing those types of incidents? Is it more that companies aren’t heeding the information? Aren’t reading the information properly? Or, they’re still just not taking the extra step to digitalize, as they should?

Andy Bartlett:

I think the extra step of taking up the technology has not been made everywhere. It’s in few places. I mean, I worked on a big contract, where it took us weeks to get the critical instrumentation from engineering, to get it into the system, to map it all. It’s difficult, and of course there are financial considerations, everything costs money. With the situation the world’s been in recently, people aren’t so keen on spending money. But, as you know, the cost of an accident can never be put in dollar figures really, because it’s not just what you lost and the production you lost, it’s the people involved. There’s a local community around the place. So, nobody has yet brought in laws that require you to have this type of technology. It’s all voluntary. I was talking with a process safety engineer a few months ago, and he’s still using spreadsheets. And he reports his deviations monthly. Some of them would be 30 days behind when they happened, because it’s a monthly report.

James Tehrani:

Does he realize the potential problems with that?

Andy Bartlett:

Oh, yeah. He would love his management to buy the technology, but convincing management that they need to spend, whatever it is to buy it, is a different case. Some other places, they see it, the software, they see the benefits of it, and immediately they want to buy it. One of the jobs we worked on was for a mining company, the minute they saw it they said, oh, we’ve got to have this. We need to know what our risks are.

James Tehrani:

Well, thanks a lot, Andy. It was really interesting. Before I let you go, is there anything else that you wanted to add?

Andy Bartlett:

Yeah. One of the clients that I’m dealing with at the moment has a very thorough safety management system on performance standards. They have detailed, for every safety critical element and its instrumentation, what the performance standard is for that particular piece of equipment: how it has to be checked, what the expected response is, and what to do if it doesn’t reach that. In the UK, we have one of the advisory boards called Step Change in Safety. They have a performance standard system called FARSI, which also lays out how that should be done, to maintain safety critical elements.

Andy Bartlett:

That’s the last part of the circle. You define your safety critical elements. You define which barriers they go to. You define what instrumentation maintains those safety critical elements. When a deviation happens, what risk it’s going to bring. And then, of course, the last part, as I said is, how will you maintain those, the performance standards. Do you expect the valve to close in one minute, and it closes in a minute and a half, and it’s an emergency shut down valve? What’s wrong with it? What’s gone wrong in the past? All of that’s in the history. That’s the end of the circle. That was about all I had to say, James.
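
The valve example works as a small check of a test result against its performance standard. In this Python sketch the element name, the field names and the 60-second limit are hypothetical stand-ins for whatever a real performance standard would specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceStandard:
    element: str              # safety critical element being tested
    max_close_seconds: float  # required closure time from the standard


def check_closure(standard, measured_seconds):
    """Compare a measured closure time with the performance standard."""
    passed = measured_seconds <= standard.max_close_seconds
    return {
        "element": standard.element,
        "passed": passed,
        # How far the result overran the standard (0.0 when it passed).
        "overrun_seconds": max(0.0, measured_seconds - standard.max_close_seconds),
    }


# Hypothetical emergency shutdown valve expected to close within one minute.
esd_valve = PerformanceStandard("ESDV-001", max_close_seconds=60.0)
result = check_closure(esd_valve, measured_seconds=90.0)
print(result)
```

A failed result like this one would be recorded as a deviation on the shutdown systems barrier, and keeping each result builds the test history mentioned above.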

James Tehrani:

Great stuff, Andy, as always. Thank you so much again for joining me today. I’m sure we’ll reconvene this Fall, for another exciting episode of Andy’s Almanac on Accidents. Thank you.

Andy Bartlett:

OK. Bye for now, James. Thanks so much.

James Tehrani:

All right, thanks.

 

 
