The science of fault finding (28 Jun 2007)
When bad things happen, tracking them down is a science. I had a big one today and I've been thinking about how I go about it (in the hope that I can go about it faster in the future).
A fault/failure has a chain of events from the thing that changed to the thing that failed to the signal that let you know that something was wrong. Sometimes the failure and the signal are the same thing (what failed? It crashed. What's the signal? It crashed). And sometimes the thing that changed is the same as the thing that failed (what changed? The O-ring seal burst. What failed? The O-ring). The difference between the thing that changed and the thing that failed is that the latter is the first thing in the chain of events which you can make a value judgment about. Change happens, but failure is bad.
In the system I'm dealing with we have many, many (many) signals about what's going on. Lots of those signals changed today. Some of them don't carry value judgments; they aren't saying that anything is wrong, just that something is different. The chain of events has many branches and not all of them cause anything bad. However, several important indicators (error rate, latency) do carry value judgments and they were creeping up.
Now begins the science: have ideas, test them out. You can start from either end of the chain: trying to figure out what changed, or trying to work back from the signals. Since this was affecting the whole world there was one very obvious thing that changed at about the right time, but there were several other possibilities. Someone went off to investigate the other possibilities but mostly we concentrated on the big change, although we had no idea how it could have caused a problem.
Now, at this point I think it would have been helpful to scribble on a whiteboard or on paper to record our facts about the problem. Otherwise you spill working memory and you forget why you discounted ideas. I'm very much thinking of something like the differential meetings in House (the TV show).
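For what it's worth, the whiteboard doesn't have to be a whiteboard. Here is a minimal sketch in Python of the kind of record I mean: facts, ideas, and the reason each idea was discounted. All the names (`Differential`, `Hypothesis`, `propose`, `discount`) are made up for illustration, and the entries are placeholders rather than what we actually wrote down.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    idea: str
    status: str = "open"   # "open" or "discounted"
    reason: str = ""       # why it was discounted, so nobody has to remember


@dataclass
class Differential:
    """A hypothetical stand-in for the whiteboard: facts plus ideas."""
    facts: list = field(default_factory=list)
    hypotheses: dict = field(default_factory=dict)

    def fact(self, text):
        # Record something we have actually observed.
        self.facts.append(text)

    def propose(self, idea):
        # Add an idea about what changed or what failed.
        self.hypotheses[idea] = Hypothesis(idea)

    def discount(self, idea, reason):
        # Rule an idea out, keeping the reason alongside it.
        h = self.hypotheses[idea]
        h.status = "discounted"
        h.reason = reason

    def open_ideas(self):
        # Ideas nobody has ruled out yet, in the order they were proposed.
        return [h.idea for h in self.hypotheses.values() if h.status == "open"]


board = Differential()
board.fact("error rate and latency are creeping up")
board.propose("the big change")
board.propose("some other change")
board.discount("some other change", "placeholder: whatever evidence ruled it out")
```

The only design point is that `discount` refuses to work without a reason, which is exactly the thing that falls out of working memory first.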
However, I'm mostly thinking about how it took so many people so long to figure out where the failure was. In hindsight, we had all the clues we needed fairly quickly, and I even knew that they were important because I kept looking at the two signals which turned out to be critical. Neither was out of range, but they described contradictory states of the world. If you had tracked me down in the corridor and asked “How can both A and B be true?” I could have told you pretty quickly. But for some reason I was looking for other factors which could influence the more indirect of the signals. It didn't help that I didn't know the system all that well, but I still should have worked through the logic assuming that both signals were correct first, and not taken 20 minutes to do so.
Of course, everything is obvious in hindsight, but I still feel that I've missed the lesson somewhere here. Maybe I just need to start writing things down when I get into that state. It's similar to explaining something to someone; it helps you organise your thoughts too. I guess I'll see how that goes next time.