The '2/3' bug in Dilbert

Designer's notes #6
Øyvind Teig, Trondheim, Norway (http://www.teigfam.net/oyvind/)

Don't close your eyes to the screen's blind spot!

We had long noticed a '2/3' character in the Dilbert banner page, printf'ed to our terminal window. Dilbert was built from slashes, spaces and the like, but not really from a '2/3' character. It hardly drew attention - and in any case it was set aside for future contemplation. Getting the Finite State Machine (FSM) runtime system that we had built - and the application proper, of course - up and running was more important than caring for the '2/3' bug look-alike sitting in the Dilbert page.

We were designing a platform to build embedded software on, and we were building the FSM runtime system ourselves. It gave us a tool for making embedded systems based on the SDL (Specification and Description Language) methodology. Every time a small program changes state as seen from the outside, it engages in an event with other programs (or tasks or processes). Sending a message, receiving a message and waiting for a timeout are the places where a process can safely leave off for a while.

Then, whenever a message in its message queue kicks the process into running again, it will switch-case its way to the next state, where it decodes the message. The runtime system restores the process' context. In SDL such messages are asynchronous, meaning that the queue grows and shrinks in size, and a producer may only get so far ahead of a consumer process before the buffer overflows.
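The scheduling pattern described above can be sketched in a few lines of C. This is only an illustration, not the runtime system from the project: every name in it (process_run, queue_put, the states and message ids, the queue length) is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; they only illustrate the pattern. */
enum state  { ST_IDLE, ST_WAIT_ACK, ST_DONE };
enum msg_id { MSG_START, MSG_ACK };

struct message { enum msg_id id; int payload; };

#define QUEUE_LEN 4

struct process {
    enum state state;                 /* context kept between schedulings */
    int last_payload;
    struct message queue[QUEUE_LEN];  /* asynchronous input queue */
    size_t head, count;
};

/* Asynchronous send: 0 on success, -1 when the producer got too far ahead. */
int queue_put(struct process *p, struct message m)
{
    assert(p != NULL);
    if (p->count == QUEUE_LEN)
        return -1;                    /* buffer overflow: message refused */
    p->queue[(p->head + p->count) % QUEUE_LEN] = m;
    p->count++;
    return 0;
}

/* One scheduling: take one message, switch on the state, then return. */
void process_run(struct process *p)
{
    struct message m;

    assert(p != NULL);
    if (p->count == 0)
        return;                       /* nothing to do, stay in this state */
    m = p->queue[p->head];
    p->head = (p->head + 1) % QUEUE_LEN;
    p->count--;

    switch (p->state) {
    case ST_IDLE:
        if (m.id == MSG_START) {
            p->last_payload = m.payload;
            p->state = ST_WAIT_ACK;
        }
        break;
    case ST_WAIT_ACK:
        if (m.id == MSG_ACK)
            p->state = ST_DONE;
        break;
    case ST_DONE:
        break;
    }
}
```

The process never blocks inside process_run; it always returns to its caller, and the state variable is what lets it pick up where it left off.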

Then we decided to extend the methodology tool suite with synchronous message passing over channels, according to the CSP (Communicating Sequential Processes) methodology and process algebra.

In that paradigm a producer will be in phase with a consumer if this is what one wants - which it quite often is.

At any time a channel between two processes may be ripped up and a buffer process inserted, restoring asynchronous behaviour. A consumer may decide not to serve a particular channel, and no data will be lost, because the producer will block - meaning that the producer process is returned from and is not re-entered or rescheduled until the receiver has connected and is ready to let data pass. The data then passes by a memcpy directly from the producer's context to the consumer's. The whole channel layer cost two weeks of coding, with virtually no changes to the FSM runtime system's code.
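A minimal sketch of such a synchronous channel follows. It is not the channel layer from the project: the names (chan_send, chan_receive, struct channel) and the return convention are assumptions made for the example. The first party to arrive parks a pointer into its own context and "blocks" by returning 0; the second party does the memcpy and returns 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical channel type; names and layout are illustrative only. */
enum chan_state { CHAN_IDLE, CHAN_SENDER_WAITING, CHAN_RECEIVER_WAITING };

struct channel {
    enum chan_state state;
    const void *src;   /* parked sender's data, in the sender's context */
    void *dst;         /* parked receiver's buffer, in its own context  */
    size_t len;
};

/* Returns 1 if the data passed, 0 if the sender must block (be returned from). */
int chan_send(struct channel *c, const void *src, size_t len)
{
    if (c->state == CHAN_RECEIVER_WAITING) {
        assert(len == c->len);
        memcpy(c->dst, src, len);     /* straight into the consumer's context */
        c->state = CHAN_IDLE;         /* a real runtime would now wake the receiver */
        return 1;
    }
    assert(c->state == CHAN_IDLE);    /* one sender per channel */
    c->src = src;
    c->len = len;
    c->state = CHAN_SENDER_WAITING;
    return 0;
}

/* Returns 1 if data arrived, 0 if the receiver must block. */
int chan_receive(struct channel *c, void *dst, size_t len)
{
    if (c->state == CHAN_SENDER_WAITING) {
        assert(len == c->len);
        memcpy(dst, c->src, len);     /* straight from the producer's context */
        c->state = CHAN_IDLE;         /* a real runtime would now wake the sender */
        return 1;
    }
    assert(c->state == CHAN_IDLE);    /* one receiver per channel */
    c->dst = dst;
    c->len = len;
    c->state = CHAN_RECEIVER_WAITING;
    return 0;
}
```

There is no buffer inside the channel itself, which is why a consumer that ignores the channel loses nothing: the data simply stays in the blocked producer's context.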

It worked as expected when it worked, but the problem was, it only worked occasionally!

We saw that inserting one character in a printf string could make it fall over. And adding a one-byte variable somewhere, such as in main, without even using it, could make it work again!

In the CSP test system we had replaced the Dilbert banner with a quote by the 14th-century philosopher William of Occam.

"Entia non sunt multiplicanda praeter necessitatem" - "entities should not be multiplied beyond necessity" or, loosely, "things should be made as simple as possible, but not simpler". It is also called Occam's Razor.

We had not noticed or even looked for a '2/3' character in this quotation banner. 

A month of pondering why the Dilbert system had "no obvious error" while the quotation system "obviously had an error" (to borrow from C.A.R. Hoare's Turing Award lecture) brought one of the gurus to the surface. He found that, yes - there was something wrong, and it was within the tool chain.

Programming the flash on the embedded unit and reading it back was fine; it verified OK. But then we discovered that with the debugger it actually worked! The debugger used the cof file format, not hex. Many hours later we learned that neither our own flash programmer tool nor the one supplied by the processor dealer had warned of an unsupported hex file record type!
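The missing warning is cheap to add. The sketch below assumes the hex file is Intel HEX (the text above only says "hex file"), where each line has the form ":LLAAAATT..." and TT is the record type. Which types a given programmer supports is also an assumption here; this hypothetical one accepts data (00), end-of-file (01) and extended linear address (04) records and complains about everything else. The checksum is not verified.

```c
#include <assert.h>
#include <stdio.h>

enum load_result { LOAD_OK, LOAD_MALFORMED, LOAD_UNSUPPORTED };

/* Extract the record type from an Intel HEX line, or -1 if malformed. */
int hex_record_type(const char *line)
{
    unsigned len, addr, type;

    assert(line != NULL);
    if (sscanf(line, ":%2x%4x%2x", &len, &addr, &type) != 3)
        return -1;
    return (int)type;
}

/* Hypothetical policy: refuse loudly instead of silently skipping a record. */
enum load_result check_record(const char *line)
{
    int type = hex_record_type(line);

    if (type < 0)
        return LOAD_MALFORMED;
    switch (type) {
    case 0x00:  /* data */
    case 0x01:  /* end of file */
    case 0x04:  /* extended linear address */
        return LOAD_OK;
    default:
        fprintf(stderr, "unsupported hex record type %02X: %s\n",
                (unsigned)type, line);
        return LOAD_UNSUPPORTED;
    }
}
```

A loader built this way would stop at the first LOAD_UNSUPPORTED result rather than program a flash image with bytes in the wrong place.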

The '2/3' character value was one byte of executable code that had been mapped to low addresses. In the Dilbert system it sat inside the banner, where it revealed no error. In the system with the Latin quotation it was seen as code proper, and there the error readily revealed itself. When the quotation text was loaded over the correct code, the single, newly important instruction byte seen as '2/3' would not always cause the code to fail; one never knows.

The guru did find a fix: rewrite some mapping statements in the makefile, add the missing error report to our own flash programmer, and warn the OEM about the missing error report in theirs.

So, don't ever let a bug sit there - at least not if you can see it! A second thought: maybe the philosophers are best at distilling problems - to get to the correct solutions! Sorry Dilbert, you are good, but you didn't boil out this one!

01.2004

Other publications at http://www.teigfam.net/oyvind/pub/pub.html