9Jan2015, updated 5Oct2015
This page is in the group Technology and discusses how to answer the legitimate question: given one dangerous and undetected fault found in the software, are there any similar problems lurking?
It’s dangerous for the users of a safety-critical system if nobody reports that the system is unable to do its primary job. However, the IEC 61508 Safety Integrity Levels (SIL) accept some risk of dangerous and undetected faults, ranging from the highest tolerable risk (worst, SIL 1) to the lowest (best, SIL 4), with a tenfold risk decrease per SIL level up. [Ladkin, 2011] writes about the ideas put forth when the standard was developed in the late 1990s:
It is based on the approach, novel at that time, of quantifying and reducing risk until it is acceptable, rather than the then-prevailing paradigm of finding out everything that could go dangerously wrong with a system or subsystems and fixing it so that it doesn’t, an approach deemed Sisyphean.
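The tenfold risk decrease per SIL level can be sketched in a few lines. The average probability of failure on demand (PFDavg) bands below are the commonly cited IEC 61508 targets for low-demand mode; treat them as an illustration, not a quote from the standard:

```python
# Commonly cited IEC 61508 low-demand PFDavg target bands
# (an assumption for illustration, not quoted from the standard).
PFD_AVG_BANDS = {
    1: (1e-2, 1e-1),  # SIL 1: highest tolerable risk
    2: (1e-3, 1e-2),
    3: (1e-4, 1e-3),
    4: (1e-5, 1e-4),  # SIL 4: lowest tolerable risk
}

def risk_reduction_factor(sil: int) -> float:
    """Minimum risk reduction factor: 1 / (upper PFDavg bound)."""
    _low, high = PFD_AVG_BANDS[sil]
    return 1.0 / high

for sil in sorted(PFD_AVG_BANDS):
    print(f"SIL {sil}: risk reduction factor >= {round(risk_reduction_factor(sil))}")
```

Each step up a SIL level demands ten times more risk reduction, which is exactly the quantified-risk paradigm the quote describes.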
This blog note takes the form of an anachronistic, naïve fairy-tale. Hopefully it triggers some discussion.
The hole in the ship’s hull was not easy to discover. It was small and square, but an observant serviceman nevertheless found it. He had felt that there was something close to a bulkhead frame, inside a cabinet in a room – also on some of his previous inspection visits – and just after some freezing, minus-ten-degree, chilly winds from aport. The hole was above the waterline, but potentially fatal. He sent us a letter explaining the problem.
I found the source of the problem by comparing the drawings with the problem description, and then digging into the production papers. At some point during building, a part of a plank in the hull had been cut one cm too short, leaving a one by one cm hole.
Our procedure required us to do Relentless Root Cause Analysis (RRCA) of the situation. How could this happen, what can we learn from it, and (how?) do we know there are no other (round?) holes in the hull of the ships?
I went back to the engineering material again and explained the design, showing that the short plank was a side effect of two other planks that were cut from equal-length stock. The cutting frame template had an error: that plank would come out with a groove. This sort of operation could not, for example, leave a round hole. Our final test procedure hadn’t revealed the groove.
I could not guarantee that there were no other undetected holes, square or round, but I could argue that it was very unlikely. Stronger: I could even set up a falsifiability (refutability) criterion: if we ever find such a hole, then we’ll have to rethink.
And we hadn’t had any reports from the thousand ships, had we? No wet cabinets in those rooms? And they were still sailing?
But show me, my boss said. We took one ship to the quay and had divers inspect it thoroughly. All fine. We took one back to the dry-dock, and we saw no place where water poured out of the ship either. (Ok, I stretched the tale fairly far now, didn’t I?)
I continued, saying that since we hadn’t made a model suitable for formal verification of the whole ship’s design and production, this was the best guarantee we could give. Yes, there is a risk. We might improve some of our work processes and make the risk ten times smaller in the future, but we can’t completely remove it.
The total (rather relative) risk is the probability of the incident (occurrence: 1–5, based on data such as the number of incidents seen per year for that criticality group) multiplied by the consequence (severity: 1–10). You could also multiply into the formula a relative measure of how likely the problem is to escape detection (1–10, with 10 being not detected at all). The higher the product, the higher the risk. Observe that the incident may be the result of a series of other “incidents”, both in how the problem “all of a sudden” appeared in the product and in how its cause arose.
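The product described above can be sketched as a small function, in the style of an FMEA risk priority number. The scales (occurrence 1–5, severity 1–10, detection 1–10) follow this note, not any particular standard, and the example values for the square hole are invented:

```python
def relative_risk(occurrence: int, severity: int, detection: int) -> int:
    """Relative risk as described in the text:
    occurrence (1-5) x severity (1-10) x detection (1-10,
    where 10 means not detected at all).
    The higher the product, the higher the risk."""
    assert 1 <= occurrence <= 5
    assert 1 <= severity <= 10
    assert 1 <= detection <= 10
    return occurrence * severity * detection

# Hypothetical scoring of the square hole: rare, serious,
# and very hard to detect.
print(relative_risk(occurrence=1, severity=8, detection=9))  # -> 72
```

Note that the number is only meaningful relative to other scored failure modes in the same analysis, which is why the text calls it a “rather relative” risk.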
To get an overview of the matter, we used a Failure Mode, Effects (and Criticality) Analysis (FMEA/FMECA) diagram and some Fault Tree Analysis. This was not contrary to common sense, but we certainly kept common sense in mind, too.
The matters listed above, compared with the requirements set by the authorities, determined whether we should fix no ship, fix at the next inspection, or fix all ships immediately.
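The three-way decision can be sketched as a threshold function over the relative risk score. The thresholds here are invented placeholders; in practice they would come from the authority requirements the text mentions:

```python
def decide_action(risk_score: int,
                  fix_all_threshold: int = 200,
                  next_inspection_threshold: int = 50) -> str:
    """Map a relative risk score to one of the three actions
    in the text. Threshold values are hypothetical; real ones
    would be set by authority requirements."""
    if risk_score >= fix_all_threshold:
        return "fix all ships immediately"
    if risk_score >= next_inspection_threshold:
        return "fix at next inspection"
    return "fix no ship"

print(decide_action(72))   # -> fix at next inspection
print(decide_action(300))  # -> fix all ships immediately
```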
Since this is a naïve tale I propose an Open End.
Anyhow, what would you have felt if you were me, my boss, or my company? Or, most importantly: if you were a sailor or a passenger – assuming you knew about this matter?
After the Open End, and after the conclusion based on authority regulations, there is another point. What we decide to do, and how our customer understands it, will in some way make it to the bottom line.
However, be sure to analyse the right problem. Maybe the wet spot in the ships came from a rusted internal iron water pipe. The pipes tended to rust at this location on many of the ships, and the leaks spilt down onto the other spot. This was regarded as no big problem and was fixed by the people on board. But this never reached the investigators of the RRCA. They thought it was the single event of the hole in the hull above the waterline. So, finding out whether the correct problem is being investigated is a challenge. If the wrong problem is fixed, has the problem been fixed? Was there a problem at all, or was it a coincidence?
Epilogue. Read about Tony Hoare’s billion-dollar mistake. See how the first aeroplanes failed in The Wind Rises. Study the 1879 Tay Bridge disaster. Finally, read some safety-related poems by Don Merrell (about neglecting to see the risk, I’d say), especially I Could Have Saved A Life That Day. Failures certainly have human context. To err is deeply human. It’s not only about square holes or broken material. But to fall after having failed has even more human context. Try not to do that. And do care.
From Den nye altmuligboka (The new book of everything) by Per Hagen (J. W. Cappelens Forlag, Oslo 1958), from http://urn.nb.no/URN:NBN:no-nb_digibok_2012110706228
The pictures show the pirate ship I built in the woodwork class of Rollsløkken skole at Hamar, Norway when I was 13 (in 1963). I downscaled every measure in the book by 75%. Some of the pictures show a dusty model, as it has been hanging in my woodshop for quite some years now. The sea is some glossy wrapping. Since 1963 I have done some more woodwork!
Norwegian search word: årsaksanalyse