IEEE-COPA 2021 fringe

This note is in group Technology (plus My XMOS pages) but may be considered a kind of follow-up of the CPA 2018 fringe. Updated 27Jul2021

For the TIOBE index 016:[10]: “xC programming”. Also permanently stored at web.archive.org.

Breaking news

13May2021. No comment here. Press the drawing.

Update 27Jul2021. According to 221:[1], [[combinable]] and interface are not available in lib_xcore. This makes following up this paper very much out in the blue.


“Torus heat equations to get dizzy from. Hooping with xC”

A joint fringe presentation with Dr. Lawrence John Dickson (USA) (“Larry”).

At the 2021 IEEE Concurrent Processes Architectures and Embedded Systems Virtual Conference (COPA 2021), sponsored by the IEEE Computer Society. April 25-28, 2021
(COPA2021, COPA-2021)

See http://ieee-copa.org or http://www.cvent.com.

On EasyChair the title has “FRINGE” in front: “FRINGE: Torus heat equations to get dizzy from. Hooping with xC” (https://easychair.org). You may be able to see the slides, starting from https://easychair.org/my/slide_version.cgi?version=9596#. However, the presentation at EasyChair is compressed to the point of being (well..) almost indistinguishable. Plus, the versions I sent to EasyChair were required to have a maximum width x height of 1190 x 842 pixels, whereas the downloads in the next chapter are 2048 x 1536 pixels.

The presentation

Abstract:
The story of how points representing temperature on a torus, in the form of occam PROCs running on one or several transputers, exchanging temperature with their neighbours, raised the question of whether, or how, this problem may be run as xC tasks on the XMOS xCORE multicore architecture.

Keywords:
3D temperature flow on a torus
occam programming language
xC programming language
xCORE XMOS architecture

The presentation was uploaded here on 24Apr2021 (rev.3). A side effect is that if you have been curious enough to land here, you’re welcome to a sneak preview. But please register for the full conference to hear us, and to see all the real papers presented.

  • Keynote – Source: Apple Keynote (9.1) (11.7 MB)
  • PDF – Export with one build per page (9.3 MB)
  • PowerPoint – Export (9.4 MB)

The code (download)

Below are some “have a fast look” files. I have selected the 3 most important files (with code that runs) and appended a .txt extension. There is one “main” .xc file and several include .h files, named as header files but having concrete xC contents, because they are included from that “main” .xc file. I don’t like it, but it’s meant to help. It’s difficult enough as it is. And yes, there are several main code parts here; you’ll find them in items 2 and 3 (and others in the zip file above). There is a wilderness of preprocessor #ifs here, which xTIMEcomposer highlights and greys out quite nicely (a minimal sketch of this kind of #if selection follows right after the file list below). But it should be possible to follow even without xTIMEcomposer. Observe that you probably could run all the code in the freely downloadable xTIMEcomposer’s simulator, without any hardware. But this has not been tested.

  1. The “main” file: _xc_test_Toroid_2.xc.txt – no main in here
  2. The occam-like 2*2 topology: top_20_occam_like_xc.h.txt – several main
  3. Also contains the “final” 8*8 topology with servers being distributable: top_26_xc.h.txt – several main
  4. Just the version log file: _version.h.txt
  5. Code project, download at #218 at My xC code downloads page
  6. More code in Extra: testing global synch (below)
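
To give a feel for the #if selection mentioned above, here is a minimal, hypothetical sketch. The MY_TOPOLOGY macro and its values are invented for this note; they are not the names used in the downloadable code:

    // Hypothetical sketch only: the MY_TOPOLOGY macro and its values are
    // invented for this example, not taken from the downloadable project
    #define TOP_20_OCCAM_LIKE 20 // the occam-like 2*2 topology (file 2 above)
    #define TOP_26_FINAL      26 // the “final” 8*8 topology    (file 3 above)

    #define MY_TOPOLOGY TOP_26_FINAL

    #if (MY_TOPOLOGY == TOP_20_OCCAM_LIKE)
        #include "top_20_occam_like_xc.h" // contains its own main code parts
    #elif (MY_TOPOLOGY == TOP_26_FINAL)
        #include "top_26_xc.h"            // servers being [[distributable]]
    #else
        #error "MY_TOPOLOGY not set to a known topology"
    #endif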

The program

07.00 PDT → 16.00 CEST
“07.00” in the program is 07.00 AM of Pacific Daylight Time (PDT) = (UTC-7). Central European Summer Time (CEST) is (UTC+2), so the difference is 9 hours.

See https://easychair.org/smart-program/COPA2021/

Addenda and Q&A

  1. We did not discuss throughput or percentage usage of the xCORE processor as related to the number of logical cores used. This is related to the number of cores per tile and the instruction pipeline depth p, which is 4 on the first series of processors and 5 on the newer. This limits how fast a core may get a new instruction, and decides the minimum guaranteed usage. With a 5-stage pipeline, one core gets at most 1/5 of the cycles, two get 2/5, three 3/5, four 4/5, and with five cores all cycles are used. So “parallel slackness” for this processor means that coding single-threaded is 1/5 as good an idea as coding 5-threaded! 100% is used at 5 cores, and using 8 cores is still smart since there is waiting on pins, timeouts etc., but the guaranteed minimum MIPS per core then goes down. See this discussed in 049:[xC by XMOS] and 141:[6]
  2. Observe that cycle_cnt is not needed as a parameter, as suggested on page 33
  3. Combinable / combined tasks, as well as distributable / distributed tasks, are considered “low priority”, even if xC has no such concept. See 141:[Priority]
  4. We had a comment from one of the participants expressing that the roles “client” and “server” were not correct with respect to how he saw the roles in the mathematics. This may well be so (even if I didn’t understand that). However, the terms client and server were the only option in xC when we needed to use tasks as combinable (as i*j increased), and thus had to leave usage of channels (so as not to use up all the cores). Instead we must communicate over interfaces, not chans. And interface usage implies use of client and server and sometimes slave
    1. I had a reply from the participant by mail:
      “I think your use of client-server was fine in the context of how that terminology is used in xC programming. However in CSP and occam we have an established design rule for building deadlock-free systems called the client-server protocol, originally discovered by Peter Welch. I thought you were talking about that! The communication pattern you are using (with the red-green colouring scheme) matches another design rule called “cyclic communication”.  You can find the case study I mentioned, which is similar to yours, in [1], page 108. It covers both cyclic and client-server design patterns.”
    2. My comment:
      That being said, I do think we are talking about at least almost the same thing. I think we have a cyclic communication pattern using client-server roles. Our first example is an “I/O-par” [1], shown with chan in occam and xC. But when we introduced client/server roles in it we halved the number of tasks! But this was too expensive, so we went over to an “I/O-seq” and used xC interface instead of chan. An xC client-server pair using an xC interface with a SET_GET remote procedure call will not deadlock (a minimal sketch of such an interface follows right after this list). If we use the [[notification]] and [[clears_notification]] pattern we have this asynchronous deadlock-free protocol, and we have to strictly adhere to xC client, server and slave tagging and roles. With xC chan there are no such role constraints, but we can certainly implement all of the mentioned protocols
  5. It was also commented that the client & server scheme implies deadlock freedom, implying “why did you talk about deadlock?” I was talking about placement of the code when some of the tasks were combined. The code itself is, as the commenter pointed out, deadlock free. However, it’s the “proceeding force” of the tasks that would in some cases stop (not proceed = deadlock) if all the tasks were placed as combinable. I said that I’d like the xC compiler to warn about this, and that the rest of “the million” error messages concerning usage of channels and interfaces (some here) would probably go a long way towards ensuring correct usage. But as of now, I’d have to run the code and see if it does the job or stops at some point. (Also see 217:[Missing error messages])
  6. We were asked what the ultimate goal of this was. Larry answered:
    climate modeling, even if it requires hardware development.
  7. Another comment by a participant (mail). (Laplace was mentioned in a comment, and I couldn’t twist my mind to create any useful connection):
    “The heat equation is due to Laplace, the iterative solution is due to Gauss and Seidel, the method of over-relaxation is a variant of the latter which converges faster to determine a potential across a membrane from the boundary values. I appreciate that this may not be applicable to your specific problem, since you do not have any boundary, but it is interesting from a CSP perspective as it introduces additional ordering dependencies in the communications of the system, which is why I mentioned it.”
  8. And then, Larry’s and my turn:
    “That’s right, the converging property was not interesting to us, even if we did use the converging value (19.0*25.0)/89.0 = 5.337079 degC for our data set during the 4*4 runs, to check the mathematics. This was also how we discovered that the nodes sometimes were not in synch, and we needed to test some alternatives. In other words, our cycles were not per se supposed to converge to a final state, but to emulate physical time. The final state is actually trivial in this problem, since energy is conserved. This is why all nodes converge to the same temperature.”
  9. One point I have thought about after this. In 219:[3] (XUF216-512-TQ128 Datasheet) I read on p13 that:
    “The interconnect relies on a collection of switches and XMOS links. Each xCORE device has an on-chip switch that can set up circuits or route data. The switches are connected by xConnect Links. An XMOS link provides a physical connection between two switches. The switch has a routing algorithm that supports many different topologies, including lines, meshes, trees, and hypercubes.
    The links operate in either 2 wires per direction or 5 wires per direction mode, depending on the amount of bandwidth required. Circuit switched, streaming and packet switched data can both be supported efficiently. Streams provide the fastest possible data rates between xCORE Tiles (up to 250 MBit/s), but each stream requires a single link to be reserved between switches on two tiles. All packet communications can be multiplexed onto a single link.”
    I wonder if all this might be relevant to us, or if this is based on normal type tasks communicating, even extended across multiple xCORE silicon chips? I have no idea how xC may be twisted to do this, or if this is hidden in some of the XMOS tools.
    I found some papers describing some of this, see 141:[36] and 141:[37]
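
As a footnote to point 4 above, here is a minimal sketch of the kind of synchronous SET_GET remote procedure call over an xC interface that will not deadlock. It is hypothetical code written for this note (interface and task names are invented), not code from the presentation or the download; the client blocks inside SET_GET until the server’s select case has produced the return value:

    // Hypothetical sketch only: a synchronous SET_GET remote procedure call
    // over an xC interface. The names here are invented for this note.

    typedef interface temp_if_t {
        float SET_GET (float temp_here); // send my temperature, get the other’s back
    } temp_if_t;

    [[combinable]]
    void my_server_node (server interface temp_if_t i_temp) {
        float temp_mine = 5.337079f; // just a filler start value
        while (1) {
            select {
                case i_temp.SET_GET (float temp_other) -> float temp_return:
                    temp_return = temp_mine;  // answer with my value..
                    temp_mine   = temp_other; // ..and keep what the client sent
                    break;
            }
        }
    }

    void my_client_node (client interface temp_if_t i_temp) {
        float temp_mine = 0.0f;
        while (1) {
            temp_mine = i_temp.SET_GET (temp_mine); // blocks until served
            // .. use temp_mine in this cycle’s calculation ..
        }
    }

    int main (void) {
        interface temp_if_t i_temp;
        par {
            my_server_node (i_temp);
            my_client_node (i_temp);
        }
        return 0;
    }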

Extra: testing global synch

Fig.1 Tasks of _xc_test_combined_embedded_par_comms.xc

I have tested global synch, to see if it might replace the polled synch scheme in the presentation. This is suggested on page 33 of the presentation, headed “EXPANDING BEYOND 8*8 WITH A MASTER «NEXT PHASE» TICKS?”

The conclusion seems to be that the cost is too high, mostly in use of chanends, since the synch interface communications come in addition. See the screen paste of the task diagram from the xTIMEcomposer Task Viewer in Fig.1 (an export from the tool has too low resolution).

Another pre-condition (not a conclusion) is that a full BSP (bulk synchronous parallel) solution is not necessary, but it won’t hurt if it’s free. Some parts of a trampoline bed may perfectly well be far below the highest part if the kids are out of phase. We require that data calculated or produced in cycle_cnt N is sent to neighbours having cycle_cnt N+1, i.e. no fabric broken, no sliding. We do allow non-neighbours to be stretched. We show that in the 8*8=64 node case with the polled synchronisation phase, some nodes may be stretched by plus or minus 3 (or -6 with respect to the fastest). However, there is no sliding between neighbours, so no data is lost.

I tested with some different code, where I made a separate synch (barrier) task and a different task configuration than in the presentation. The synch connections need to go to the clients, not the servers. I think this is a problem, since all the clients have been coded with a select case timerafter, which is enough to exclude them from being “the cheapest” [[distributable]]. Observe that in the code below I did code a version of my_client_task without select case timerafter.

I made two versions of the code with slightly different interface definitions. Observe that cycle_cnt is not needed as a parameter, as suggested on page 33.

// (MY_CONFIG==1)
    // FROM CLIENTS TO SYNCH:
    typedef interface wait_if_t {
       void WAIT_EVENT (void);
    } wait_if_t;
    
    // FROM SYNCH TO CLIENTS:
    typedef interface next_if_t {
       void NEXT_EVENT (void);
    } next_if_t;

// (MY_CONFIG==2)
    // FROM CLIENTS TO SYNCH AND CALLS BACK:
    typedef interface synch_if_t {
        [[notification]] slave  void WAIT_EVENT (void);
        [[clears_notification]] void NEXT_CALL (void);
    } synch_if_t;

The first (MY_CONFIG==1) needs to establish an interface connection for each direction, and it’s coded as such. WAIT_EVENT into SYNCH and NEXT_EVENT back.

The other (MY_CONFIG==2) uses the xC built-in [[notification]] and [[clears_notification]] scheme and then only needs to set up one interface connection, which in that case is bidirectional: WAIT_EVENT into SYNCH and a NEXT_CALL back from SYNCH. A minimal usage sketch follows below.
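
To show roughly how the MY_CONFIG==2 interface may be used, here is a minimal, hypothetical two-worker barrier. The task names (my_worker_task, my_synch_task) are invented for this note, and this is not the task configuration of the downloadable code, which has more tasks and a different topology. Note that client and server here refer only to the ends of the synch_if_t connection:

    // Hypothetical sketch only: a two-worker barrier on synch_if_t.
    // The worker holds the server end, so it may raise the WAIT_EVENT
    // notification; the synch task holds the client ends and clears the
    // notifications with NEXT_CALL when all workers have arrived.

    typedef interface synch_if_t {
        [[notification]] slave  void WAIT_EVENT (void);
        [[clears_notification]] void NEXT_CALL  (void);
    } synch_if_t;

    void my_worker_task (server interface synch_if_t i_synch) {
        unsigned cycle_cnt = 0;
        while (1) {
            // .. do the work of cycle cycle_cnt here ..
            i_synch.WAIT_EVENT(); // notify: “I have finished this cycle”
            select {
                case i_synch.NEXT_CALL(): // released by the synch task
                    cycle_cnt++;
                    break;
            }
        }
    }

    void my_synch_task (client interface synch_if_t i_a,
                        client interface synch_if_t i_b) {
        unsigned arrived = 0;
        while (1) {
            select { // notifications show up as select cases on the client ends
                case i_a.WAIT_EVENT(): arrived++; break;
                case i_b.WAIT_EVENT(): arrived++; break;
            }
            if (arrived == 2) {
                i_a.NEXT_CALL(); // clears the notification and releases worker a
                i_b.NEXT_CALL(); // ditto for worker b
                arrived = 0;
            }
        }
    }

    int main (void) {
        interface synch_if_t i_a, i_b;
        par {
            my_synch_task  (i_a, i_b);
            my_worker_task (i_a);
            my_worker_task (i_b);
        }
        return 0;
    }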

It seemed like the latter solution was the most flexible; it was the one that ran without stopping for the largest number of tasks. But the compiler and xmap know what they do, so the same number of chanends is used in both cases, at least if I managed to compare apples with apples.

About this “running and then stopping” (no proceed) mentioned in the code (and in the Fringe): this is a side effect of [[combinable]], discussed at length in another blog note: My xC combined combinable notes. The code compiles and all looks fine until you start to run it; it then runs, but only up to some point, fairly soon. I’m waiting for the xTIMEcomposer version (> 14.4.1) that gives an error message on this. I have a feeling that this error message would be in this “group”: error: distributed task with client and server connection to same logical core is not supported (see xC is C plus x).

The code can be viewed and downloaded here:

  1. _xc_test_combined_embedded_par_comms.xc.txt – the source code and logs at the end
  2. Code project, download at #218 “Extra: testing global synch” at My xC code downloads page

References

  1. The Design and Construction of Deadlock-Free Concurrent Systems by Jeremy Malcolm Randolph Martin. Thesis submitted for the degree of D. Phil to the School of Sciences in the University of Buckingham, 1996, see https://www.wotug.org/docs/jeremy-martin/thesis.pdf

Just comment below or mail me if you have any questions.
