Perturbing concurrent processes

New 28July2015, updated 4Aug2015

This page is in group Technology and is a note about how you might add a “perturbation” (is there a better term? “Random testing” isn’t it either) to one of your concurrent software processes to see if there is an(other) error lurking.

Intro

You (and the other test guys) have tested and shown your system to function as specified. The box is ready for shipping with new software on board. Or the software is ready for older units to get this update. You know you haven’t formally proven it to work, for example by modelling it in Promela (tool Spin) or CSPm (tool FDR3) [1], where absolutely all combinations are traced. To be honest, you know that a lot of combinations are left untested. Even if you have used known, deadlock-free patterns (like the “knock-come” pattern [2]), you are only 99.9% sure that they are coded correctly, even with very high test coverage of the code (and of test types). So, is there more you can do?

Smoking out two matters

  1. The first goal is to catch internal work state errors, where some condition has been assumed and then some action has been done. It has always worked. However, when all conditions are “empty” (not valid, void), the common handling code block at the bottom might fail. Oops: all coding was done assuming that there was a reason for the process to run, like data on a channel or a timeout. Then something was done. But what if nothing was done?
  2. The second goal is to catch races between concurrent processes. When you watched the log you saw that these processes were running like they should, meaning when they should. It looked like they were running rather randomly. Dependencies between the processes should be loose enough that changing the timing somewhat is invisible to the specification. If we use CSP-based paradigms with synchronous channels, such scalability is probably easier to control than with purely asynchronous schemes (admittedly, some would argue the opposite; I have blogged about this for years, pick from the Technology group). But that’s only 99% of the story; the 1% left is the other processes’ dependency on something else, like communication with a second microprocessor.

Perturbing a process

I am not suggesting a while, for or repeat loop (livelock) in a process, just to burn cycles. That could potentially address point 2 above, and it is nice for testing how dependent external clients are on response times. It would stress reactivity and response times, also with preemptive scheduling, which is not what I assume my system has. It would have cooperative scheduling: run to completion, or run to a synchronizing point. I have discussed some of this in earlier blog notes, like Not so blocking after all or even the strange case in Spin-and-burn loop leveling. I have also described the system I mostly work with [3].

I am thinking about suddenly changing a timeout or some triggering event to give the scheduler a reason to schedule a process more often than usual; basically “always”. This is provided that it lets the other processes run in between, so that it yields to the others each time.

I ran 99 zero timeouts for each original timeout that did something reasonable. That is 99 schedulings for no reason whatsoever. The one reasonable timeout did start communication with another process, which then communicated with another processor on the same board. While this happened (since it’s based on CSP) that process refused to engage with any other process. It shut the door for a while. But the required responsiveness is handled by driver processes that do that kind of work.
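The 99-to-1 pattern can be sketched with Python generators standing in for cooperatively scheduled processes. This is only an illustration under my own assumptions: the names `perturbed_process`, `other_process` and `run` are hypothetical, the round-robin loop is a stand-in for the real scheduler, and nothing here is taken from the actual system.

```python
from collections import deque

def perturbed_process(log, real_period=100):
    """Woken on every scheduling round; only every real_period-th wake-up is 'real'."""
    wakeups = 0
    while True:
        wakeups += 1
        if wakeups % real_period == 0:
            log.append("real")   # the original timeout: would start communication
        else:
            log.append("zero")   # zero timeout: scheduled for no reason
        yield                    # synchronizing point: let the others run

def other_process(log):
    """Any other process that must still get its turn."""
    while True:
        log.append("other")
        yield

def run(processes, rounds):
    """Hypothetical round-robin cooperative scheduler."""
    ready = deque(processes)
    for _ in range(rounds * len(processes)):
        process = ready.popleft()
        next(process)            # run to the next synchronizing point
        ready.append(process)

log = []
run([perturbed_process(log), other_process(log)], rounds=100)
```

After 100 rounds the log holds 99 "zero" entries and one "real" entry from the perturbed process, interleaved one-for-one with "other" entries, so the perturbation never starves its neighbour.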

My perturbation was anything but a livelock. Any such pattern would do; there is no single correct way. Think up your own. But what you actually test will to some degree depend on the perturbation. For example, how much does it stretch the “required responsiveness” above?

The results

To my surprise I had several crashes after this, and discovered pathological behaviour. We use a lot of assertions in our code (ASSERT that some condition is true), so all crashes came from those.

I saw that I had common code, run for every cause of scheduling (channel data or signal communication, or timeouts, which are also signalled on channels), that wasn’t ready for a no-cause scheduling in all cases. As the code stood this was basically ok and even correct. But I introduced this test to see what would happen if I did allow for background processing, mostly without “side effects” (= communication), and in that case it would not have been ok. And it’s nice to code in such a way that common code behaves in all imagined cases.

Why have such common code at all? Simply to find out if there is something more to do when the common state says that’s ok; to pick out something from a “work pool” or, for example, an array of data from smoke detectors.
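A minimal sketch of such a handler, assuming a hypothetical `Cause` enumeration, a `state` dictionary and a `work_pool` list (none of these names come from the real code), could look like this. The point is only that the common tail must not assume that one of the cause-specific branches did something.

```python
from enum import Enum, auto

class Cause(Enum):
    NONE = auto()      # scheduled for no reason (the perturbation)
    CHANNEL = auto()   # data or signal arrived on a channel
    TIMEOUT = auto()   # a timeout fired

def handle_wakeup(cause, state, work_pool, payload=None):
    """Handle one scheduling; return the work item picked by the common code, or None."""
    if cause is Cause.CHANNEL:
        work_pool.append(payload)
    elif cause is Cause.TIMEOUT:
        state["ticks"] += 1
    # Cause.NONE: nothing cause-specific to do.

    # Common code, reached for every cause, also for no cause at all.
    # It checks the pool itself instead of assuming a branch above filled it.
    if state["idle"] and work_pool:
        return work_pool.pop(0)
    return None

state = {"ticks": 0, "idle": True}
pool = []
no_cause_result = handle_wakeup(Cause.NONE, state, pool)
channel_result = handle_wakeup(Cause.CHANNEL, state, pool, payload="alarm")
timeout_result = handle_wakeup(Cause.TIMEOUT, state, pool)
```

Here the no-cause scheduling returns `None` without touching anything, which is the behaviour the perturbation is meant to provoke and check.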

The behaviourally pathological problem I found showed up in combination with a short between two oscilloscope probe tips, and concerned what happened after that. That’s when I decided that a protocol driver should always read all incoming bytes and report them, instead of sometimes throwing away a byte on its own initiative. Take the case where you have received a full reply frame and no byte should arrive before a new directive is sent. If a byte still arrives after the full frame, don’t treat it specially and throw it away there and then. Just process the incoming message and send out a new directive. When you have received a full frame again, that extra byte will have been seen as garble on that frame, or as a separate byte that you can pick out and throw away at that time. This way it’s easier not to loop in an out-of-sync state.
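As a rough sketch of that rule, assume a made-up frame format with a start byte and a fixed length (`START`, `FRAME_LEN` and the `ReplyDriver` class are all hypothetical, and there is no escaping or checksum). The driver takes in every byte and only sorts frame from stray bytes when it reports.

```python
START = 0x7E     # hypothetical start-of-frame marker
FRAME_LEN = 4    # hypothetical fixed frame length, start byte included

class ReplyDriver:
    def __init__(self):
        self.received = bytearray()

    def on_byte(self, byte):
        """Always take the byte in; never discard it here."""
        self.received.append(byte)

    def frame_complete(self):
        """True when a start byte followed by a full frame has been received."""
        start = self.received.find(bytes([START]))
        return start >= 0 and len(self.received) - start >= FRAME_LEN

    def take_report(self):
        """Return (frame, stray) and empty the buffer.

        frame is the first complete frame, or None if there is none;
        stray holds every other received byte, reported and not hidden.
        """
        data = bytes(self.received)
        self.received.clear()
        start = data.find(bytes([START]))
        if start < 0 or len(data) - start < FRAME_LEN:
            return None, data
        frame = data[start:start + FRAME_LEN]
        stray = data[:start] + data[start + FRAME_LEN:]
        return frame, stray

driver = ReplyDriver()
for b in b"\x7e\x01\x02\x03":      # a full reply frame
    driver.on_byte(b)
first = driver.take_report()
driver.on_byte(0xFF)               # unexpected byte before the next directive
for b in b"\x7e\x0a\x0b\x0c":      # reply to the next directive
    driver.on_byte(b)
second = driver.take_report()
```

The unexpected byte is not dropped when it arrives; it turns up as a stray byte in the second report, where the process can pick it out and throw it away.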

It all started with trying to see what might happen when I needed to insert background processing in a process. So I made this perturbation to find out. Now I can remove the perturbation for real code.

References

Wiki-refs: Code coverage, Cooperative scheduling, CSP, Preemption, Promela, Test vector

  1. FDR3 — The CSP Refinement Checker. FDR3 analyses programs written in CSPM, which combines the operators of Hoare’s CSP with a functional programming language. It also has support for analysing timed systems, via tock CSP. From https://www.cs.ox.ac.uk/projects/fdr/
  2. The “knock-come” deadlock free pattern, Øyvind Teig (me), see http://oyvteig.blogspot.co.uk/2009/03/009-knock-come-deadlock-free-pattern.html
  3. New ALT for Application Timers and Synchronisation Point Scheduling by Øyvind Teig (me) and Per Johan Vannebo, see http://www.teigfam.net/oyvind/pub/pub_details.html#NewALT