22May2015, updated 4June2017
This page is in group Technology and is a note about how an embedded controller might favourably spin to burn cycles. It describes an exception.
I have used the word spin rather than block here even if it is an example of rather red blocking in the earlier note Not so blocking after all. I have been trying to make it green enough.
I will describe how resource-hungry printing might level off its hunger as much as possible.
Sometimes we rely on a debug log, also in the running system that we put in a small box and ship. The log goes to nowhere, except to a service man who might connect to pick up what’s going on. But it’s there all the time. This unit doesn’t have a switch for it. The UART that does this is implemented by bit-banging – because the available UARTs are used for something else. This means that at 9600 Baud there is a new bit every 104 µs (1/9600). A timer interrupt needs to be there exactly so often during a character transmission. There is only sending on this log channel (luckily; oversampling by 4 or 8 would have been beyond reach). It’s a small controller that only does 374 cycles in 104 µs. If the interrupt takes for example 34 µs then only 74 µs are available to do real work needed also during logging. Better perhaps to say that it’s got 67% available for this.
As you understand, when the send-buffer is full, the UARTs PutChar function spins in a loop waiting for the next final vacant position, in front of the FIFO ring buffer. This needs to be done without any race between the busy poll and the bit-banged UART’s 104 µs interrupt. One way is to have interrupts disabled in the PutChar function and then enable-and-disable interrupts in a loop as many times as it takes, to let the 104 µs interrupt (and other interrupts like the system timer) through.
Hard time limit
Now, there is a function of the system that does have a hard real-time limit. This is controlled by one of the other interrupts, a level shifting interrupt. The max response time is required, also when there is logging going on. The solution goes like this: When this interrupt arrives the bit-banged UART’s state machine will stop after the stop bit if it runs. PutChar will also terminate its spin loop if it spins. All new PutChars will fill up the buffer, but if it’s full a log character lost status will be set. (ok to log when things are ok again). This effectively causes the hard real-time requirement to be served. The hard real-time’s event then involves communication over another UART. When it’s done the bit-banged UART is enabled again.
Buffer and index sizes
The log’s send-buffer was full almost all the time. The ring buffer’s size was (1<<14) or 16 K byte. The timing in ExtraPutty was more than 16 seconds off! So, since I solved the hard timeout, there was no reason not to decrease the log buffer to (1<<7) or 128 bytes. Since the SW spin-looped anyhow, then to have zero space available and 16 K bytes behind is worse than having 1 byte available and 127 behind. So 128 byte may be better than 16 K byte. I saw no performance penalty with 128 bytes buffer.
The bit-banged UART would in effect perform better in count of not needing so many cycles. The 8-bit machine we use uses less to compare and increment 8 bits index values than 16 bits values. When going from 16 K byte to 128 byte I also changed the size of the index type from 16 to 8 bits. So now I would have more cycles available for other work.
Then I reduced the debug log amount. Sending on a CSP channel does not have to be printed in full with data. Shows only the traces as from->to process is much better than nothing. But I didn’t need to do #07->10 as decimal 7 chars when #7A as hex would do. And \n\l (complete new lines) may be replaced with a space. ExtraPutty wraps the lines anyhow.
So now the send-buffer had available space more often. I logged statistics on it, and yes, if a CSP process needed to print 30 characters then that would most often be done without the spin loop. There would be space. The bit-banged UART would do its job invisibly in the background.
Spreading out burning cycles
A final touch was to make a function like PutChar with the same semantics except it didn’t send anything. Instead it spin looped for one character if the send-buffer was full just to let the bit-banged UART do it’s job with the character it did send. If not it just returned. I decided to try to have a call to this on every return back to the Scheduler from a process. (It’s cooperative scheduling, run to completion semantics of the processes, where this means run to synchronization point.) The call is done unconditionally on every return.
The idea was that it should cause less blocking in each process, but instead burn the cycles more flattened or spread out. I also did statistics on this, and saw that most of the time it didn’t burn cycles, even better – most of the time the send-buffer was empty.
So, wear leveling here is “spin-and-burn” leveling
The title of this note hints towards wear leveling, where bytes or sectors in FLASH only have a limited number of writes, and where the algorithm limits wear (number of writes) on each cell. Here we tried to level out burning of cycles: if the buffer is full some burning will be done by the printf, but some is also done character per character per character (max 1 ms per time) on descheduling.
The purpose is to do spin-and-burn loop leveling. As long as there are characters left in the send-buffer the cycle burnings are trickling. So far in the analysis and tests it seems to work. There now seems to be more small breaks spread over than longer breaks now and then. A process doesn’t have to take all the penalty of a long log. Some of it is taken in between.
I now see about the same number of times that the send-buffer is becoming full as when it’s becoming empty. It’s not almost always full any more. So, I probably have increased the total reactiveness of the system.
Update June2017: I guess I am not touching upon spin locks here, even if the idea of doing nothing for the good of all is the same. I have, since this note was written, learned about spinlocks by reading the work by Edvard Severin Pettersen (see here – but most in his later Master’s thesis. I was co-adviser). Simple spinlocks evolve into test-and-set (TAS) and test-and-test-and-set (TTAS) spinlocks. Then, on the net, I found them even having yield in the loop. I guess, spinlocking is the same as “busy poll” or “busy polling”. Of course it depends on whether scheduling is cooperative or preemptive. I need to learn more about this.
- From message queue to ready queue. Case study of a small, dependable synchronous blocking channels API “Ship & forget rather than send & forget” First ERCIM Workshop on Software-Intensive Dependable Embedded systems, Øyvind Teig. see http://www.teigfam.net/oyvind/pub/pub_details.html#Ercim05
- No Blocking on Yesterday’s Embedded CSP Implementation. (The Rubber Band of Getting it Right and Simple). By Øyvind Teig. Follow-up after “From message queue to ready queue” (below) Communicating Process Architectures 2006 (CPA 2006). See http://www.teigfam.net/oyvind/pub/pub_details.html#NoBlocking
- New ALT for Application Timers and Synchronisation Point Scheduling, by Øyvind Teig and Per Johan Vannebo, Autronica Fire and Security (AFS) at Communicating Process Architectures 2009 (CPA 2009), see http://www.teigfam.net/oyvind/pub/pub_details.html#NewALT