Timing out design by contract with a stopwatch

New: 26June2016, last update 23Dec2016 (more in figure and a Disclaimer chapter). Spelling fixes 09Mar2023.
This page is in group Technology. It’s rather in media res and may therefore be hard to follow. My background is in safety-critical, concurrent embedded systems – and since I am most familiar with channel-based systems, that’s the ground colour of this text. But I try to raise it from this, to hopefully become applicable to a wider audience.

Intro

Can time be part of a contract? Is it a contract when time is included?

“Shall we meet at half past eight at the Pub for a beer?” Let’s steer past that one.

Timing requirements are important, they are extensively used in hard real-time systems. In some protocols a client does not need to send any “end of request” signal as the server times out after a period of inactivity to close the connection (like HTTP persistent connections). This blog note tries to discuss when timeouts are fine to use, and when they do turn the components’ interactions into a quiz.

Your design may sooner than you think “time out” if you base your designed assumed contract on a “stopwatch” (somewhat paraphrasing the title).

Related blog notes could be Timeouts and antiresonance(?) stop. I also have an important disclaimer at the end of this note: Disclaimer_”Non-interference_between_software_elements”.

OK with delay or sleep iff

If I define delay or sleep to be some part of code like delay(duration) or sleep(duration), inserted into some code to give the user time to go and fetch a cup of coffee, then that’s not what I am talking about. Another OK would be blinking of LEDs (later).

Assumed contractual delay or sleep vs. Real contract

However, that’s provided the delay or sleep is not used to wait for the next move of an assumed contract. Here’s the short summary:

Assumed “contract” Real contract
start(some_event) start(some_event)
delay(duration) wait(for_dataReady_signal)
get(&assumedData_orStatus) get(&theData_orStatus)

It’s the left column that I’ll try to explain that in most cases isn’t a very good contract (even if it may be ok for explicit polling, like polling for data collected in an interrupt). Observe the absence of duration in the wait-call to the right. The wait is forever, meaning that the waiter waits (many say block) for as long as it takes. So it’s important that at this stage the caller has nothing else to do. However, the other threads that are runnable will continue. I have discussed this in Not so blocking after all.

The left column I often see with an a busy poll like (fill in missing code yourself):

while(no_data_yet) {delay(duration); no_data_yet = !poll_for(&data)}

While this ensures that data is in fact ready when the while loop is finished and it may (as mentioned) be ok some times, it’s basically a red light for you. Often the left column collapses to a busy poll to ensure data in case they weren’t ok anyhow. Like if a buffer in asynchronous communication is full. I’ll come back to this.

The result is that the real contract in the right column above on the average is also faster.

Also, it’s safer since it is a contract. Safer meaning that being a contract between components (in this case concurrent processes or threads) then cooperation of all these components will run forever – with no calls to your company’s service department. Provided good software engineering is used to ensure that all liveness criteria are ok (never a free lunch). Like no deadlock and no livelock. It’s probably not as complicated as you might think to avoid this: use deadlock-free patterns (like this) and sound programming practices to avoid livelock (like ensuring that a loop isn’t forever). Or model it and verify it formally (try Promela/SPIN).

What process model will let something that delays for a known duration and then does its job be less safe than one that potentially waits forever? Not all process models make it possible to rule out the whole problem domain of what to do after a timeout. Try again? How to handle a message that might arrive just after a timeout?

(I have written a lot about process model in these notes. There is a search field on the top here. This also goes for many of the other terms that I freely throw in without much explanation. You will also find some at Wikipedia (refs below), and at [2])

I’ll come back to some language mechanisms that need to be present to be able to wait for a signal. Or even better: wait for the whole data set.

Drivers at the edges and internal components

This is about layered design. As we have seen a driver component that talks with external devices need to time out to determine that the connection is lost. When it has been lost, the potential waiting forever mentioned above will get a proper status response back from the driver: “connection down”. That solves the problem of waiting forever: it won’t. Since the driver would have needed a timeout in any system we now also need only one timeout. That’s where safety really comes in: several concurrent timeouts for the same action won’t step on each other’s feet. The concurrent processes inside that communicate with other internal processes or the drivers at the edge will not block forever. They will only wait as long as it takes. But there must be enough parallel slackness or enough concurrent processes that they will not starve any other processes waiting for them. This is where high (or optimal) cohesion matched with low (or optimal) coupling comes in – for free.

Using this methodology you will be more or less forced not to talk with two USB memory sticks in the same process.

More or less forced not to blink with a LED and control the keyboard in one process. (Observe that the LED blinking does need a time scheme. However, this is just some event ordered from the run-time system that triggers the timeout. In a proper conditional choice (below) it may also wait for a message to stop the blinking and terminate the timer. As mentioned earlier this is another not problematic timeout, along the line of the timing for drinking coffee.)

Here we see that the drivers at the edges of several units, where cables may become broken, will talk with each other, not by such a Real contract that I advice here for internal components. Since it needs timeout, maybe we could call this a Protocol for the scope of this note?

With these definitions, I mentioned above that a decision to “try again” after a timeout lurks around the corner for a Protocol. It’s now easy to see that once you think you need a “try again” in an inner component you are mixing OSI layers. This makes matters much more complex.

The difficult application level

Of course, at the application level you simply shouldn’t “try again”. Leave that to a lower level. (Actually the OSI model uses layer instead og level.) But we have seen the LED blinking situation, which is application level. It’s perfectly fine.

However, beware of this situation:

  1. At the application level, you send a message to a remote external application level (through drivers at the edges) and ask it to do some action for you. You expect that job to be finished within ten seconds. (Figure 1 (below) was inspired by this scenario)
  2. You start an application level timer because you don’t trust the code below. You don’t trust that you will get a “connection down” from the lower layers
  3. Well, you might trust that, but you might not trust the other application’s ability to send you the right message: “command xx failed: heating element broken”. It should have, since it has detected that the temperature didn’t rise. Coding is difficult, and to make it “safer” you started your own application timer to watch over this. (It’s a tiny heating element!-)
  4. The synchronisation between you and the remote application is difficult to know when to establish for the fault scenario here. See more below

Here’s the ideal(?) situation for this:

  1. As above, but the expectation for ten seconds is programmed in the remote application
  2. You just sit waiting for “OK” or “FAIL” from the remote application
  3. If you get a “connection down” from the communication driver then treat that at the application level. Enter some new state
  4. When you get “connection up” again you need to synchronise state with the remote application. Figure out some message to send it, but make sure that you handle the case of crossing messages. The “connection down” and “up” again could have happened between seconds five to six. (Synchronization is more a separate protocol than just a message. Better stick to a known protocol that has this)
  5. You need to do this synchronisation even if the remote application also got “connection down” from its driver and then it might have unrolled the message

Now, which one is best of these? I’d go for the latter. I think it scales better. Meaning that you could do this between lots of components, and I believe it’s a good pattern.

A process/data-flow diagram example

The inspiration for this diagram is the previous chapter. One of the tasks here is to discuss the timer t2.

If it is defined in the spec, what is the real need and cost? What’s the cost of not having it? What happens if the connection is broken, and what when it comes up again?

The catch in this note is to show that matters get much simpler without timer t2.

Figure 1

Process/data-flow diagram of possibly unnecessary timer

Fig. 1 – Process/data-flow diagram of a possibly unnecessary timer

(Fullpixel press picture, scalable PDF is here)

P2’s job is to read the button PB and send a message to P7 that it shall heat and regulate the H heater. After t7 it reads the Temperature of H and sends OK or FAIL back to P2. Then P2 lights LED2 with green for OK or red fault. It never blinks, we haven’t put a blink timer in P2 for it. (Thanks, Petter at work, for pointing this out! And Preben for detecting a wrong name on a channel!)

The example uses CSP type of processes and channels. Like in the Go language, Stackless Python etc. But the general idea is valid also with asynchronous messaging and other process models.

A and B are different processors. X1, X2, X3 are some communication lines, like ethernet. C13 to C53 are synchronous, non-buffered channels. S35 is an asynchronous signal, no data. S35, C53 and C35 form a “knock-come” deadlock-free pattern, where both parts can initiate a message to each other at any time.The other channels rely on a pure client/server relationship, and are thus also deadlock-free. Timer t1 polls X1. Timer t4 blinks with LED4 when it’s told to. Timers t5 and t6 do heartbeat and detect connection up and down. X67 and X76 are both deadlock-free “XCHAN”. P3 is a kind of application-level router. (Again: search for “knock-come” and “XCHAN” in the top of this page.)

The figure is only meant as a carrier of the ideas. It’s not describing a functional system. Especially are the usefulness of the roles of each process with respect to each other only thrown on paper.

More comments

Fig.1 takes me to all of these comments:

  1. If the spec doesn’t explicitly say that LED2 must confirm success or not within t2, then try to avoid timer t2. In that case, provided X2-X3 is up then P2 will inform by LED2 within about t7
  2. If there is a requirement to the user IO (t2 for LED2) instead of t7 (how long time it should take to heat H) then I think there is a specification clash. If the specifier insists tell him that then P2 could send a message to switch H on to a temperature, remote read T and then decide within t2 if it were a success. Regulation of T is in this case not possible if X2-X3 goes down, because the spec is wrong. Hopefully he’ll see it
  3. Timers t4 and t7 are internal, they don’t survey anything external. Ok
  4. There is no timer in P3. All communication up, right, down and left will per design and implementation succeed. This means that this system won’t allow P1-P5 or P6-P7 to crash without a code ASSERT and probable restart of A or B respectively if the error was fatal. However, A and B shall survive a restart of the other part, so it’s not legal to code any ASSERTs caused by external signals. It might in debug mode, to catch errors in the protocol stack. However, it’s not enough just to skip the ASSERTs for the shipped version, you must also clean up so that the code can continue. The goal all the time should perhaps be defensive programming for shipped code and a kind of design by contract during debug. Or a mix of the two also for shipped versions
  5. If you were to have timeouts in P3 (t3_up, t3_right, t3_down and t3_left) then you might find it a little ok. But then there should be corresponding timers in P2, P5, P4 and P1 (t2_down, t5_left, t4_up and t1_right). The naïve system I have described would need (2 * connections) timers for this. Would you send your daughter with an airplane with such a scheme? Remember, there would be many internal processes
  6. With the client/server relations between P1-P3, P2-P3 and P3-P4 some of them have to be master and the other sides slaves. Here the first are masters. This is easy if data or messages basically flow in one direction. If not, use knock-come pattern as in P3-P5, where both parts are masters and can initiate transfers
  7. Things won’t necessarily get easier if you add buffering in the channels. Golang has this ability. You then let a message go (send and forget) and a more important message arrives for the queue, it’s hard to remove the less important. Per def you won’t know which are there because your code has forgotten. Therefore I advice no buffering (as I learnt with occam for many years) and then to put buffering inside the processes. I think that a buffer process is more flexible than a buffered pipe, channel or queue. Then you can flush the internal buffer or overtake less important messages etc. and be in full control
  8. All P1-P5 and P6-P7 are lightweight processes. They probably aren’t full Linux processes with memory protection between them. A Golang goroutine, or (once upon a time) they might be occam PROCs. I have written some about this world. “ChanSched” is our latest scheduler, and its basic mechanisms are described in a published paper (here). However, the ideas outlined here should still be worth discussing for any application assembled with concurrent components. Perhaps even for non-concurrent objects in the OO world. Yes, I think so
  9. If we really need both t2 and t7, whether it’s a good idea for P2 and P7 to know about each other’s timeouts I don’t know. We can’t just go to a shop and ask for a chair and say we’ll cancel after two days if transport will take at least three days. But it might be ok to say that we’ll cancel after ten days. But then, if we do a deal that says that we’ll be informed anyhow about arrival or failure then we can forget about it all. The shop will have mechanisms in place to ensure that you will get the chair or the message
  10. Could a separate P (like P3) have a central function as to deal with all timeouts? The run-time system will have a generic timeout handler, not like that – but like an application level central timeout dispatcher? A colleague mentioned that for me. Personally I am sceptical

Design by Contract ®

Design by Contract ® (DbC) is trademarked by Eiffel Software, ref. Design by contract in Wikipedia (stating that “it has its roots in work on formal verification, formal specification and Hoare logic”). It was done originally by Bertrand Meyer basically for Object-Oriented (OO) systems.

I have been using “design by contract” more relaxed here; more as a general carrier of what is needed to actually have contract. A contract in this sense is some agreed protocol between concurrent components that has no holes, and is always water-proof.

But time in it?

In the Wikipedia article about Design by Contract time is only mentioned in one case:

The notion of a contract extends down to the method/procedure level; the contract for each method will normally contain the following pieces of information: (then a list, and at the bottom:) (more rarely) Performance guarantees, e.g. for time or space used

Also, time in formal languages is starting to come, but it’s notoriously difficult to guarantee anything here. Promela only has one global timeout which is something that’s guaranteed to happen sooner or later, or rather taken if there’s nothing else to run.

Some timeout abstractions

Response time is often described as a non-functional or extra-functional requirement. In specification languages such as QoSCL timeout contracts are defined (integrated from QoS Modeling Language, QML) [1]. I believe this about the same as the conditional choice (mentioned above) ALT with timeouts in occam. As Alt with timeout in UML (seen in [1] page 224). And select with timeout in Go. Plus, I assume, in many other languages, be they specification or implementation languages. Also Linux has flavours of it.

Some times the embedded run-time supplies only a timer that is reset on every process scheduling, even if timeout wasn’t the cause of the scheduling. It might have been some communication. This is the standard ALT / Alt or select timeout. In this case, where longer time or clock time is needed, some system is needed to keep it running for longer. In that case polling of this timer system may be needed on each scheduling. A small timeout then drives the longer system. It might have passed 24h since a timer was set. We have done this in some embedded systems, even if such a system burns quite a lot of cycles to look at the watch.

Simple ALT/select timer with manual long time handling Both simple and long timer in ALT/select
ALT_CHAN_IN_F (C43)
ALT_TIMER_IN_F (t4)
ALT_CHAN_IN_F (C43)
ALT_TIMER_IN_F (EGGTIMER_t4)
ALT_TIMER_IN_F (REPTIMER_t4_long)
Test_TimerL_Decr (t4_long) // regularly run this

However, later we introduced what we called egg and repeat timers (EGGTIMER and REPTIMER), the latter being “application timers” in [5]. The table above gives a clue only. With this we got rid of the need for timer ticks to be supplied to each process. We could order timers around the communication (eggtimers: like if not received anything new then poll that pin) or timers that survived any number of schedulings (reptimers: do this every 24 hours). We did not, in that particular product also add a clock timer, like instead every day at 10 o’clock. We could have done it, but the problem is to set it!

“Messing Around with Timeouts. In Contracts?” at CPA 2016 fringe

I’m going to present this (= hopefully participate in a discussion over a beer) at the CPA 2016 fringe. See bottom of http://wotug.org/cpa2016/programme.shtml. A carrot is that a one page abstract  will appear in the ISBN-classified printed proceedings later on.

The abstract is also here, and the presentation also here. Observe that the figure in this context shows a LED2 that’s able to blink even if I haven’t drawn a blink timer for it. I don’t plan to update. But I did update Fig. 1 (above).

Here’s from the CPA/WoTUG page: “Communicating Process Architectures 2016, the 38th. WoTUG conference on concurrent and parallel systems, takes place from Sunday August 21st. to Tuesday August 23rd. 2016 and is hosted by the Niels Bohr Institute, University of Copenhagen. Conference sessions will take place at the Hans Christian Ørsted Institute, which is located here. The evening Fringe sessions will be at the Caféen Bar, which is just a few minutes walk from the Ørsted buildings.”

Some comments after the 15 minutes presentation:

  1. A university professor said that he had been teaching Design by Contract for some years. The concept deals with state machines only, about their preconditions and postconditions. There is no concept of timeout, except for the fact that one could state that on transition should happen before another. This should match rather well with my investigations here.
  2. Professor Peter Welch said that he had presented some of the same points in his fringe at CPA 2015. I discussed this in a separate blog note about Hard real-time then, see here. He said that he had concluded that high priority processes should have timeouts but low priority processes rather not. This is the same as I have said here with respect to drivers “at the edge” (high priority) and “internal processes” (low priority) – but I really didn’t saw that connection myself. But he’s right, these are two sides of the same matter
  3. Also see my comments about XCHAN (IEC 61508, below)

Does all computation have a timing requirement?

Martin Korsgaard has studied whether in fact all computation has (or could consider having) a timing requirement [5]. He studied a new language that he called TOC (for Timed occam). While this is interesting per se, I am not sure how this relates to this blog note. I think he came to the conclusion that there would be exceptions. Stay tuned on this!

Disclaimer: “non-interference between software elements”

For Safety Critical systems, in IEC 61508 [4] 2010 part 3, Annex F 61508-3 there is an informative chapter called Techniques for achieving non-interference between software elements on a single computer. (61508 also defines Safety Integrity Levels or SIL levels 1 to 4). The Annex is a guidance on how to achieve independence of execution, and “non-interference (between elements of differing systematic capability, between elements which are designed to achieve or contribute to the same safety function, or between software contributing to a safety function and non-safety related software on the same computer).” The Annex goes on to say that “Independence of execution should be achieved and demonstrated both in the spatial and temporal domains”. Spatial is about not wrongly modifying data in one “element” by another. Temporal means that one element can’t delay another past specifications, so to say.

This chapter was difficult to write once I discovered it was necessary.

“Independence of execution” raises some questions about the architecture described in Figure 1. Think one part is safety critical, one is not – and the one that isn’t is not allowed to cause problems for the safety critical part. Will Annex F hold if A and B were these two parts? Will it hold internally in A or B? Think that the extra connection P1 isn’t safety critical. How could internal implementation be if it should hold, in case it doesn’t hold now? If we have synchronous channels betwen P1-P5 then occam didn’t have error returns from channel handling. Google Go has. Is this needed? If the connections were Linux (untyped) pipes or some (typed) messaging library? Now, if the router P3 were a composite of ten sub-processes, when will the requirement hold and when not? Could those P3.1-10 be allowed to be more connected, provided and error in P3 is detected?

“Independence of execution” then, between a component with a SIL level and one without, also is about layered implementation. We must think also of application level that stubbornly won’t proceed if some reply isn’t seen or a communication layer that won’t detect a broken line.

It’s also about what in the formal world is called progress, i.e. not deadlock or livelock. It certainly helps the independence of exeution the more we know about how execution runs or fails. And we have also seen the ASSERT scheme. All this together, as far as I can see, will not void the whole idea behind this blog note: that implementing with timeouts is not a general panacea. It must be used with caution.

Even a safety system profits on being understandable and having fewer possibilities for errors. If that leaves you “without t2” then so be it!

After the CPA 2016 fringe presentation (above) I saw that I had failed to mention the XCHAN concept as a means to make a channel based system less coupled at link level. This would also help the Annex F situation.

(Update 23Dec2016) I have queried some about whether XMOS logical cores might be used for differing systematic capability , see https://www.teigfam.net/oyvind/home/technology/098-my-xmos-notes/#xcorelogical_cores_as_iec61508_8220differing_systematic_capability8221. I was pointed to the fact that it’s possible to have one logical core stopped by an exception handler while the others were running. How independent may they be?

By the way, I am active through my employer Autronica Fire and Security and SINTEF to have submitted a suggestion for a slightly modified text in Part 3 Annex F for the next revision of IEC 61508. It’s in the committee now (8.2016)

Golang Go concurrency pattern “Context”

At CPA 2016 (above) I learnt, by listening to “High Performance Tape Streaming in Tapr” (Klaus Birkelund Jensen and Brian Vinter) about the Go concurrency pattern Context. It’s described here. (Search word in my blog notes: “Go-Context”.)

It takes care of timeouts and “request-scoped values, cancelation signals, and deadlines across API boundaries to all the goroutines involved in handling a request.” It’s type Context interface and Done is signalled over a channel to listeners. And Contexts are safe for simultaneous use by multiple goroutines.

This is great layering for pushing timeouts away from user code.

Conclusion

  • Whenever you have some timeout in your code, switch on the yellow light
  • It may go green by second thoughts (discussion, architectural and code reviews, 61508 etc.) or it might go red
  • It even may go green by moving that timeout to where it belongs
  • If you have two timeouts that interfere then start with a red light

Feedback

Send me a mail and and try to convince me that time can be part of a contract. I’ll be glad to add it here. I’ll open for comments below, too (until I get tired of spam)

References

Wiki-refs: Defensive programmingDesign by contract, Golang, if all you have is a hammer, everything looks like a nailoccam, OSI model (defining OSI levels or layers), Promela and SPIN, Response time (technology), Timeout (computing)

  1. Google search for “design by contract and timeout”. Hit at Component-Based Software Engineering: 7th International Symposium, CBSE 2004, Edinburg, UK, May 2004, Proceedings. Ivica Crnkovic, Stafford, Schmidt, Wallnau (Eds.)..at https://books.google.no
  2. High Performance Computing and Communications Glossary, see http://wotug.org/parallel/acronyms/hpccgloss/all.html. It’s a great, old fashioned glossary that goes for much more than HPC
  3. New ALT for Application Timers and Synchronisation Point Scheduling, Øyvind Teig (myself) and Per Johan Vannebo, Communicating Process Architectures 2009 (WoTUG-32), IOS Press, 2009 (135-144), ISBN 978-1-60750-065-0, see http://www.teigfam.net/oyvind/pub/pub_details.html#NewALT
  4. IEC 61508 or Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES), see https://en.wikipedia.org/wiki/IEC_61508
  5. Process-Oriented Real-time Programming, PhD assertation by Martin Korsgaard, NTNU. ISSN 1503-8181 ; 2013:65. See http://www.diva-portal.org/ (search for his name, because he has sevaral papers on the theme)

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.