Atomic for all?

Started at Caffè Nero at Blackwell’s in Oxford on 24Aug2014, updated 17Sept2014. 22Oct2023: typos fixed. This page is in group Technology. (Update Oct2023: Observe that this note was written before I got acquainted with the XMOS xCore architecture and the xC language. TODO: I might one day make an addition for this.)

Atomic is a computer science term for not stepping on each other’s toes. Nowadays I may declare atomic-type entities or use libraries. But up to now I have done it differently. I’ll discuss both HW and SW facets of atomic, and perhaps show that in some cases it’s not as necessary as one might think. I’ll start with an example from my own life. It concerns how to pick up data from a HW interrupt, and how to send data to it (like reading input characters or sending println characters):

Microcontroller in C

On a uniprocessor, the only thing I have done to facilitate mutual exclusion is to disable all (or some) interrupts to start a critical region of code [then write / read data] and then afterwards enable all (or some) interrupts again. If there is a multi-priority interrupt system (with nested interrupts) I would also have needed to think about disabling higher priority interrupts while inside an interrupt – if there were a possible data race lurking.
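Here is a minimal sketch of that scheme in C. The DISABLE_INTERRUPTS / ENABLE_INTERRUPTS macros are placeholders that must be mapped to the platform’s intrinsics (like __disable_irq() / __enable_irq() with CMSIS on an ARM Cortex-M); and if interrupts may already be off on entry, save and restore the interrupt state instead of enabling unconditionally:

/* Placeholders - map these to your platform's intrinsics,
   e.g. __disable_irq() / __enable_irq() on ARM Cortex-M (CMSIS): */
#define DISABLE_INTERRUPTS()  /* platform specific */
#define ENABLE_INTERRUPTS()   /* platform specific */

static volatile int shared_value;  /* also written by an interrupt */

void write_shared (int new_value)
{
    DISABLE_INTERRUPTS();      /* start of critical region */
    shared_value = new_value;
    ENABLE_INTERRUPTS();       /* end of critical region   */
}

int read_shared (void)
{
    int value;
    DISABLE_INTERRUPTS();
    value = shared_value;
    ENABLE_INTERRUPTS();
    return value;
}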

A small embedded microcontroller often doesn’t have preemptive scheduling by a SW scheduler, but we still need the critical region to make it possible to communicate without races with interrupts – the HW preemption that was invented in the 1960s and which I believe every processor in the world is equipped with. (But stay tuned!)

So, even on a small microcontroller with a SW cooperative non-preemptive scheduler and HW interrupt, we have the same type of problem as between threads when we have a preemptive scheduler: the need for some atomic operation.

Aside: Observe that with this architecture it’s enough that one part writes to the shared data for this protection to be needed, no matter how many SW engineers argue that it’s ok as long as there is only one writer! It’s called CREW (Concurrent Read Exclusive Write) – even when we have a single processor running threads.

Here is what I do when a process [02] (or thread) uses the help of an interrupt to output and input data:

  • OUT (process->interrupt): in a process I write output data into the interrupt (if data-flow state does not allow it then we have an output overflow) – in a critical atomic region set up with the above scheme. This would either start a new interrupt session or fill up a queue to pick from.
  • IN (interrupt->process): in an interrupt I would send a signal on an asynchronous non-blocking channel (i.e. set a bit) into a process signifying that data is ready. If previous data has not been read I mark this in the data-flow state as overflow.
    • Back in the process (after the channel bit has been picked up in a critical section): when the process has received this channel (in a CHAN_IN or an ALT conditional choice) then I pick up the data in a critical section from the interrupt (and read and update the data-flow state).
  • Observe that OUT and IN do not use symmetrical schemes.
  • If I explained this ok, then you have the basics (there’s a sketch just below),…

..but this was on one machine (=one core) and one memory! How about multi-core – and perhaps even distributed (=not shared) memory?
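Before leaving the single machine: here is a minimal sketch of the IN scheme in C, under the same placeholder assumptions as above. All the names (READ_RX_REGISTER, handle_data and so on) are hypothetical:

#define DISABLE_INTERRUPTS()  /* platform specific, as above */
#define ENABLE_INTERRUPTS()   /* platform specific, as above */
#define READ_RX_REGISTER()  (*(volatile unsigned char *)0x40001000)  /* hypothetical HW register */

static volatile unsigned char rx_data;      /* written by the interrupt */
static volatile int           rx_ready;     /* the «channel bit»        */
static volatile int           rx_overflow;  /* data-flow state          */

void handle_data (unsigned char data, int overflow);  /* the process's consumer, supplied elsewhere */

void rx_interrupt_handler (void)  /* runs in interrupt context */
{
    if (rx_ready) {
        rx_overflow = 1;          /* previous data never picked up */
    }
    rx_data  = READ_RX_REGISTER();
    rx_ready = 1;                 /* the asynchronous, non-blocking signal */
}

void process_poll (void)  /* called from the process side */
{
    DISABLE_INTERRUPTS();         /* critical region */
    if (rx_ready) {
        unsigned char data     = rx_data;
        int           overflow = rx_overflow;
        rx_ready    = 0;
        rx_overflow = 0;
        ENABLE_INTERRUPTS();
        handle_data (data, overflow);
    } else {
        ENABLE_INTERRUPTS();
    }
}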

Atomic in general

I must go back a little now. This blog started when I read some of Jeff Preshing’s blog notes, especially his page about synchronizes-with [03]. I looked up the Wikipedia article about atomic, but for once it was of little help (see Wiki-refs below).

When a language supports atomic, the reason is that the programmer should be shielded from the nifty details of how it is implemented. Different processors do it differently, and the types of instructions available for it differ. Most larger operating systems have preemptive scheduling between threads / processes and interrupts. So, as we have seen, atomic also plays a role in that context: it makes it possible to read atomically, do something with the data, make new data and then write the new data back atomically. This way one may implement mutexes and semaphores etc. on top of atomic. And on top of them message queues and in some cases channels.
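As a minimal sketch of that layering, here is a spinlock – the simplest possible mutex – built on top of C11’s atomic_flag (more on C11 below). A real mutex would make the loser sleep instead of spin; this is an illustration, not production code:

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void lock_acquire (void)
{
    /* test-and-set is the atomic read-modify-write:
       spin while some other thread holds the lock */
    while (atomic_flag_test_and_set_explicit (&lock, memory_order_acquire))
        ;  /* spin */
}

void lock_release (void)
{
    atomic_flag_clear_explicit (&lock, memory_order_release);
}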

I won’t hide that the latter is my favourite, but beware: the concept needs a good process / thread model. With such a model, blocking on a resource is perfectly ok, since it only means that the blocking part has nothing more to do for the moment – and it won’t hinder other processes that do other things from doing those other things. With this we have left single threaded coding to those who like a single drawer in their bureau, or even house. I have lots of blog notes about this (and my next will be one too), so I won’t repeat it here. Only say that this blocking is not in the least evil; it is the very opposite.

But I think (yes, think) that atomic is also valid for multi-cores with shared memory. However, I see no trace of that in the C++14 draft standard [01]; more below. And I think it is not valid for distributed memory. The occam language had all of this in about 1980.

This note was started while attending the CPA-2014 conference, and I happened to be seated next to a specialist during the conference dinner, so I will relay what I think I heard him say (in the noise of eager computer scientists talking with each other):

For some atomic implementations the hardware needs four copies of the atomic variable, and there are four booleans – three written by the writer and one by the reader. There was no time to go into more detail, but it was certainly interesting. I think he also added that in some cases this was not enough.

But then, there is no direction associated with this, even if the unbalanced reader / writer booleans above may seem to indicate otherwise. An atomic is a kind of common point of touch between threads, isolating cores, caches etc. from the picture as seen by the programmer. No direction. No wakeup of a reader when a writer has modified. It’s «just» atomic.

The transputer

The transputer had native support for occam processes, channels and timers. There were two process priorities. But there was no HW interrupt as described above available to the compiler writer, or to occam for that sake. The word interrupt is still used, though: for high priority processes interrupting a low priority process. There is a lot of concern about this. But there was an event pin, which was presented as the sending side of a channel and then usually received by a listening high priority process. This is actually called the event channel.

But I have found no atomic. There is no mention of it in the Transputer Instruction Set book [04]. Atomicity seems to have been achieved by carefully designing which instructions were interruptible. And probably other factors.

I don’t know how a 2014-type transputer (cores with the possibility of shared memory? and distributed memory), with many more process priorities, would have been designed – provided it should again be based on channel communication only. Could they have made it without any atomic type? And the discussion in the community about priority: prioritisation of channels (=communication) and not processes (=work)? It probably would not matter, it’s still priority?

After this speculation I have been able to get the real story from the very person who wrote the scheduler in microcode on the transputer. Roger Shepherd (who worked for Inmos Ltd. in Bristol 1979-1989, now with Chipless Ltd) writes in a mail (I had asked whether he would allow me to use a potential answer):

Your question about the transputer and atomic operations was very stimulating.

I’ve tried to write something about how the transputer implementation worked regarding atomicity. I had to work through the implementation of the scheduler to work out exactly what atomicity properties it expected and why the whole thing worked. Anyhow, I hope the attached note clarifies things. I’d appreciate any questions etc. I’d be happy for it to be published under my copyright and with my name.

The other issue is how one could use the transputer without atomic operations. For many purposes the transputer provides all you need – the occam model just works. For other things – such as implementing locks around shared memory, you can implement a process as a lock and communicate with it to claim and free the lock. This is efficient.

Shepherd attached a document that I have pasted below (also attached, [09]):


Atomic operations and the transputer

© Roger Shepherd, 2014

Atomic operations are relevant in systems where there is state that can be observed and/or modified by more than one agent. Commonly atomic operations concern the state of a memory system, although the atomicity of operations on other state could also be an issue.

Transputer memory system

The transputer memory system was much simpler than many modern memory systems. The only caching was the 2-word instruction prefetch buffer. There was no mechanism in the transputer to ensure coherency between the prefetch buffer and the memory and it is possible to construct programs which show the breakdown in coherency and consistency. However, apart from this, the memory system was coherent and had strict consistency for all observers.

The memory accesses were atomic at the word and sub-word levels. All the bits in a word could be written to at the same time, and all the bytes could be individually written to. This atomicity was not true for larger accesses; for example, the action of the processor writing multiple words to memory (e.g. writing four words to save the stack when performing a call instruction) was performed as four separate writes to memory which could be interleaved with memory accesses from the links. If regions of memory were shared between agents (processor and links) it would be possible to observe this non-atomicity. For the most part, by convention, the agents in the transputer did not share memory regions which might be read and written concurrently. The occam rules ensured that this was true for data manipulated by the processor and links; a link channel would be operating as a result of an input or output operation performed by a process and hence the memory accessed by that process would be accessed in accordance with the occam rules. One perhaps subtle point is that occam permits memory to be disjoint at the byte level; that is, a single word could contain bytes “belonging” to different processes. It was therefore necessary that the memory system supported the writing of individual bytes in a word; there would be an atomicity issue to be addressed if part-word writes had to be implemented as a read-modify-write sequence.

So, in summary, as far as program data access were concerned, the occam sharing rules and the sub-word atomicity of the memory system ensured there were no atomicity issues. The transputer scheduler also ensured that all memory accesses associated with processes running in parallel had completed before a subsequent process started to execute. For example, in

SEQ
  PAR
    P
    Q
  R

all memory accesses for P and Q will have completed before any memory accesses of R start.

There is genuine sharing of the memory system between the processor and the data-transfer engines of the links; however, the occam disjoint rules and the memory’s atomicity properties ensure that these operate without any interference.

Atomicity and the scheduler

The remaining atomicity concerns in the transputer relate to the operation of the scheduler and, in particular, to the data structures used by the scheduler. The data structures comprise:- the process control blocks (the first few locations below a process workspace), the scheduler queues, the timer queues, the channels, and the interrupted process save area.

The atomicity of access to these structures is ensured because there is only one entity which performs scheduling operations – the processor. The processor executes instructions (or part instructions in the case of interruptable instructions) and scheduling operations on behalf of the links.


C11 and C++14

The C++ standard now has support for atomic as a first-class citizen of the language, but there is also a library defined in <atomic>. Pasting from a draft of the C++14 standard [01], this is what there seems to be on the theme (except for a myriad of comments all over the document). I guess this is the map one needs to become acquainted with:

1.10 Multi-threaded executions and data races [intro.multithread]
Comment 5. The library defines a number of atomic operations (Clause 29) and operations on mutexes (Clause 30) that are specially identified as synchronization operations.

29 Atomic operations library [atomics]
29.1 General
29.2 Header <atomic> synopsis [atomics.syn]
29.3 Order and Consistency 
29.4 Lock-free Property 
29.5 Atomic Types <atomic> 
29.6 Operations on Atomic Types
29.7 Flag Type and Operations
29.8 Fences

30 Thread support library [thread] 
30.1 General
30.2 Requirements 
30.3 Threads <thread> 
30.4 Mutual exclusion <mutex> <shared_mutex> 
30.5 Condition variables <condition_variable> 
30.6 Futures <future>

As I understand from a comment I read, a standard does not give the rationale for things. Don’t ask me why. I could understand that for well established matters, but not for new ideas. I miss it.

But the new C standard [08] also has atomics, though it looks a little different. It’s _Atomic plus the libraries <stdatomic.h> and <threads.h>.
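A minimal sketch of what that looks like: a counter that several threads may increment without any critical region (ATOMIC_VAR_INIT was the C11 way of initialising):

#include <stdatomic.h>

static _Atomic int counter = ATOMIC_VAR_INIT(0);

void count_one (void)
{
    atomic_fetch_add (&counter, 1);  /* one atomic read-modify-write */
}

int read_counter (void)
{
    return atomic_load (&counter);   /* one atomic read */
}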

The common goal of the C11 and C++14 elements is that they are meant to bridge different HW. But how? As mentioned above, «synchronizes with» triggers my interest. I think I may be able to infer some of it from the comments about it. Like this, from ISO/IEC 9899:201x Committee Draft – April 12, 2011, N1570, 5.1.2.4 Multi-threaded executions and data races, comment 11 (page 18):

Certain library calls synchronize with other library calls performed by another thread. In particular, an atomic operation A that performs a release operation on an object M synchronizes with an atomic operation B that performs an acquire operation on M and reads a value written by any side effect in the release sequence headed by A.
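Here is a minimal C11 sketch of exactly that rule: the release store (operation A) synchronizes with the acquire load (operation B), so the plain write to payload is guaranteed to be visible once ready reads true. The names are of course made up:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static int payload;  /* ordinary, non-atomic data */
static atomic_bool ready = ATOMIC_VAR_INIT(false);

void producer (void)  /* runs in one thread */
{
    payload = 42;     /* the side effect */
    atomic_store_explicit (&ready, true, memory_order_release);   /* A */
}

void consumer (void)  /* runs in another thread */
{
    while (!atomic_load_explicit (&ready, memory_order_acquire))  /* B */
        ;  /* wait */
    printf ("payload = %d\n", payload);  /* guaranteed to print 42 */
}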

I have a feeling this would not have been needed if we had HW without data memory caching? There is some place in a processor where this synchronization help is needed. Is it in the cache? I assume that keeping atomic variables out of the cache is not wanted, for speed – wherever that is possible. Sometimes I assume that shared variables would be found in shared memory. There is no need for atomic if the variable is not a shared variable.

So «synchronizes with» is a means to tell the compiler back-end implementor that some sequence needs to be respected. Therefore «happens before / after» is equally important. When an atomic section, be it code or an atomic instruction, has been entered, another access of the same is not allowed to interfere. Or, if it does interfere, this needs to be seen by the losing part so that a new attempt (or whatever) may be tried (there is a sketch of this below).
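The canonical form of «the losing part tries again» is a compare-and-swap retry loop. A minimal C11 sketch (the account and the interest rate are just for illustration):

#include <stdatomic.h>

static _Atomic int account = ATOMIC_VAR_INIT(100);

void add_ten_percent (void)
{
    int expected = atomic_load (&account);
    int desired;
    do {
        desired = expected + expected / 10;  /* compute from a snapshot */
        /* the exchange fails, and refreshes expected, if another
           thread changed account in between - then we just retry: */
    } while (!atomic_compare_exchange_weak (&account, &expected, desired));
}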

A piece of code has scope, context and name space. This is handled by the processors, with or without cache. This is about visibility.

So what is atomic, more than visibility? I have a feeling this has to do with microcode and processor architecture. I haven’t dug deep enough. Let me have a look at Go, to see if it can help.

Package atomic in Google’s concurrent Go language

Go has a package called atomic [10]. It seems to be from 2011, two years after Go was first released. Two years «after channels». There doesn’t seem to be any reserved keyword atomic, and the atomic word is not in The Go Programming Language Specification. So, everything outside of the atomic package is basically non-atomic (but packages are meant for encapsulation; put goroutines in them and use channels between them and you are thread-safe 101%). In the package’s description they say that:

// Package atomic provides low-level atomic memory primitives
// useful for implementing synchronization algorithms.
//
// These functions require great care to be used correctly.
// Except for special, low-level applications, synchronization is better
// done with channels or the facilities of the sync package.
// Share memory by communicating;
// don't communicate by sharing memory.

Great! This walks, swims and quacks like a (2014’ish) transputer & occam. But if they also enable building synchronization on top of atomic (not just safeguarding accesses to shared variables), other than channels, then this must be for some other synchronization mechanism. Is this getting close to «synchronizes with» of C/C++? This shows so well the differences between the Go and C++ mindsets, as discussed in a previous blog note: Pike & Sutter: Concurrency vs. Concurrency.

A tabular view?

Maybe a tabular attempt might be of some help, trying to sum up the above. This note is about atomicity, how it’s done and when it’s needed; I struggle. This is my third attempt at a table, and I have fallen back to just listing some examples. TYPE is an attempt at a main category; SCHEDULING I think is the main reason for needing atomicity, since it disturbs any solitary action; SHARING tries to show what one could fight about; LANGUAGE is what we might do this with; PARADIGM is a trait that has the potential to remove shared access altogether; and COMMENT is just that. I will try to discuss it below the table.

#  | TYPE                     | SCHEDULING                                   | SHARING                                 | LANGUAGE | PARADIGM                                 | COMMENT
#1 | Interrupt                | Interrupt pre-emption only                   | Interrupt by critical region            | C        | main loop                                | Embedded
#2 | Interrupt                | Interrupt pre-emption, SW cooperative        | Interrupt by critical region, else none | C        | main loop, scheduler and message passing | Embedded, SDL, CSP
#3 | Concurrency              | Interrupt and SW pre-emption                 | Primitives                              | C++ / C  | Threading                                | Most used
#4 | Concurrency              | Interrupt and SW pre-emption, SW cooperative | Channels                                | Go       | Goroutines                               | Multi-core, shared memory
#5 | Parallelism, Concurrency | Interrupt pre-emption, SW cooperative        | Channels                                | occam    | Processes                                | Multi-core, distributed memory

#1 – This would not need atomic instruction support. A single main loop often gets too high cohesion: too much to do; too many conditions; too difficult to read the code. The last time I did this was with a very simple system (in 2 KB of assembler) in 1979. Don’t do it.

#2 – This would not need atomic instruction support. I assume then that the scheduler would not be preemptive, but basically support SDL asynchronous run-to-completion processes. In that paradigm a process runs until it has nothing more to produce and needs new input. Alternatively there may be CSP synchronous channel support (asynchronous if no data), where a process runs until it reaches a synchronizing point. In both cases, «run until» means to return to the scheduler (see the sketch below).
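A minimal sketch of such a non-preemptive, run-to-completion scheduler in C. Everything here is hypothetical; the point is that processes never interrupt each other, so only the interrupt boundary needs a critical region:

#define DISABLE_INTERRUPTS()  /* platform specific, as earlier */
#define ENABLE_INTERRUPTS()   /* platform specific, as earlier */

#define NUM_PROCESSES 2

typedef void (*process_fn) (void);

static void process_a (void) { /* runs to completion, then returns */ }
static void process_b (void) { /* ditto */ }

static process_fn const processes [NUM_PROCESSES] = { process_a, process_b };
static volatile int runnable [NUM_PROCESSES];  /* set by interrupts */

void scheduler (void)
{
    for (;;) {
        for (int i = 0; i < NUM_PROCESSES; i++) {
            int run;
            DISABLE_INTERRUPTS();    /* only the interrupt boundary */
            run = runnable [i];      /* needs a critical region     */
            runnable [i] = 0;
            ENABLE_INTERRUPTS();
            if (run) {
                processes [i] ();    /* run to completion: no process */
            }                        /* can step on another's toes    */
        }
    }
}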

#3 – This needs atomic instruction support. It’s Linux, preemption, and C++14 code with support for threading. The C and C++ documentation is full of atomic concerns – so overwhelming that I started this note. Over the next months I will look into the processor manuals of Intel and ARM and read what could be hidden in that documentation.

#4 – Go probably would need atomic instruction support since the scheduler is preemptive. It also has support for low level locks. The Go documentation warns against use of the atomic library, simply because channels are a better choice for most concerns.

#5 – As we have seen, the transputer didn’t even have atomic instruction support because it was not needed, simply because the designers did a darn good job: the paradigm, as it was at the time, didn’t need it. There may be problems that were not acceptably solvable by this architecture (sorry, Turing) – but not as many as some seem to think. I wonder how the HeliOS (Wiki-refs below) operating system of the nineties solved atomicity – except by disabling all interrupts and process scheduling, which might not be that elegant? And a future transputer just might need atomicity at a more cultural level, no matter what the paradigm will need?

Hardware level

I have already covered some of this above, when mentioning how many data copies and booleans may be needed. Stack Overflow has a good overview in the replies to «How are atomic operations implemented at a hardware level?» [11]. It is short, and I don’t need to repeat it here. Except, perhaps, SeanMcSomething’s comment that:

The memory controller is only in charge of making sure that memory & cache on different processors stays consistent – if you write to memory on CPU1, CPU2 won’t be able to read something else from its cache. It’s not its responsibility to make sure that they’re both trying to manipulate the same data. There are a few low level instructions used for locking and atomic operations. These are used at the OS level to manipulate small chunks of memory to create things like mutexes and semaphores, these are literally one or two bytes of memory that need to have atomic, synchronized operations performed on them. Applications then build on top of this to perform operations on larger data structures and resources.

From many of the other references this is also shown: only an atomic wrapper is needed for larger structures. I didn’t know that manufacturers don’t say much, and that there are rather many «nasty corners». Maybe it is also rather difficult to model this and have an implementation formally proven? The lack of documentation in general may explain why I have had a hard time finding it – if there isn’t any! And that bus locks are not as much needed as cache locks was also new to me. And with cache coherency there is the cache coherence protocol and the MESI protocol (but it doesn’t seem to work in all cases, see Wiki-refs below).

What I have tried

is to find out what atomic is. With caches and multi-core and shared memory it’s certainly more than what I relate to with the small microcontrollers I program daily. But it’s basically the same need: to update some result alone, so that another user who at any time, even during my update, might want to grab the data, doesn’t get fooled. Atomic is about getting all that right – by carefully knowing ownership of the data, or by some shared token that’s passed around.

But if you feel that you are too concerned about all of this, maybe you should find another level or paradigm to work with? Or another language? Where atomicity is not that atomic.

Postscript

In the Wikipedia article (Wiki-refs below) about Non-blocking algorithm I read that:

Literature up to the turn of the 21st century used «non-blocking» synonymously with lock-free. However, since 2003, the term has been weakened to only prevent progress-blocking interactions with a preemptive scheduler. In modern usage, therefore, an algorithm is non-blocking if the suspension of one or more threads will not stop the potential progress of the remaining threads. They are designed to avoid requiring a critical section. Often, these algorithms allow multiple processes to make progress on a problem without ever blocking each other. For some operations, these algorithms provide an alternative to locking mechanisms.

With few exceptions, non-blocking algorithms use atomic read-modify-write primitives that the hardware must provide…
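The classic example of such a primitive at work is a lock-free stack push, here as a C11 sketch with a made-up node type (just a taste of that next note; a real implementation would also have to handle the ABA problem):

#include <stdatomic.h>
#include <stddef.h>

struct node {
    int          value;
    struct node *next;
};

static struct node *_Atomic top = ATOMIC_VAR_INIT(NULL);  /* atomic pointer to stack top */

void push (struct node *n)
{
    n->next = atomic_load (&top);
    /* retry until no other thread changed top in between;
       on failure n->next is refreshed with the current top: */
    while (!atomic_compare_exchange_weak (&top, &n->next, n))
        ;  /* the losing part just tries again - nobody blocks */
}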

I now know what will be my next blog note: Not so blocking after all.

References

Wiki-refs: Atomic / linearizability, Cache coherence protocol, HeliOS, Interrupt, MESI protocol, Non-blocking algorithm
I have used [05], [06] and [07] for background reading only.

  1. A draft of C++14: «Programming Languages — C++» (2013-05-15), see http://open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3690.pdf
  2. New ALT for Application Timers and Synchronisation Point Scheduling, by Teig (me) and Vannebo, see http://www.teigfam.net/oyvind/pub/pub_details.html#NewALT
  3. The Synchronizes-With Relation, blog note by Jeff Preshing, see http://preshing.com/20130823/the-synchronizes-with-relation/
  4. Transputer instruction set, a compiler writer’s guide. INMOS Limited. Prentice Hall, 1988. See http://www.transputer.net/iset/pdf/tis-acwg.pdf. I have converted it to searchable pdf (121 MB!) and placed it here: http://www.teigfam.net/oyvind/blog_notes/090/Transputer_instruction_set_searchable_tis-acwg.pdf
  5. Critical sections with multicore processors, by JustJeff, see http://stackoverflow.com/questions/980521/critical-sections-with-multicore-processors
  6. Implementing Scalable Atomic Locks for Multi-Core Intel® EM64T and IA32 Architectures by Michael Chynoweth, see, https://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures
  7. Atomic Operations in Hardware, see http://courses.cs.washington.edu/courses/cse378/07au/lectures/L25-Atomic-Operations.pdf
  8. C – Project status and milestones at http://www.open-std.org/jtc1/sc22/wg14/www/projects, I read this: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
  9. Atomic operations and the transputer by Roger Shepherd, (in blog above) and here: http://www.teigfam.net/oyvind/blog_notes/090/roger_shepherd_atomic_operations_and_the_transputer.pdf
  10. Package atomic of Go, see http://golang.org/src/pkg/sync/atomic/doc.go?h=atomic
  11. How are atomic operations implemented at a hardware level?, see http://stackoverflow.com/questions/14758088/how-are-atomic-operations-implemented-at-a-hardware-level
