Registered: ‎10-27-2017

Complications with Multi-CPU Threading


First off, a quick disclaimer: I am not very familiar with multithreading support under PetaLinux or with what affects it.

I am running an application on an RFSoC (so 4 core ARM APU) which has five threads: one for running I/O type activities (primarily over Ethernet), three "worker" threads and one "coordinator" thread. This is a very soft-RT type design - more detail below.

The workers are all set up the same: they sit on a pthread_cond_wait and, once released, call a function (each worker is set up to run a different function), and then wait again. They do have a deadline (~500us) but the actual code they run is pretty small, so only takes ~100us and hence I don't really care about the non-RT latency of 30us+ (although I have changed the kernel pre-emption mode to be low latency desktop - more on that later).
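For reference, the worker pattern described above might look something like the sketch below. The struct, the function names and `example_job` are illustrative only and are not taken from the actual application. The detail worth noting is the `pending` flag checked in a loop around `pthread_cond_wait`: a signal sent while nobody is waiting on the condition is simply lost, so the flag is what carries the request.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-worker state; every name here is illustrative. */
struct worker {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    bool pending;          /* predicate: a run has been requested */
    bool stop;             /* ask the worker to exit once idle */
    unsigned runs;         /* number of completed runs */
    void (*job)(void);     /* the function this worker runs */
};

/* Stand-in for the real (different) function each worker would run. */
void example_job(void)
{
}

void worker_init(struct worker *w, void (*job)(void))
{
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->wake, NULL);
    w->pending = false;
    w->stop = false;
    w->runs = 0;
    w->job = job;
}

/* Thread entry point: sleep until kicked, run the job, sleep again. */
void *worker_main(void *arg)
{
    struct worker *w = arg;

    pthread_mutex_lock(&w->lock);
    for (;;) {
        /* Loop on the predicate: this covers spurious wakeups and a
           kick that arrived before this thread started waiting. */
        while (!w->pending && !w->stop)
            pthread_cond_wait(&w->wake, &w->lock);
        if (!w->pending)
            break;                      /* stop requested, nothing left to do */
        w->pending = false;
        pthread_mutex_unlock(&w->lock);
        w->job();                       /* run the job outside the lock */
        pthread_mutex_lock(&w->lock);
        w->runs++;
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Called by the coordinator to release one worker. */
void worker_kick(struct worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->pending = true;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
}

/* Ask the worker to exit after finishing any pending run. */
void worker_stop(struct worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->stop = true;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
}
```

Build with `-pthread`. Two kicks that arrive before the worker gets to run are coalesced into a single run here; a counter instead of a flag would be needed if every kick must be serviced.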

I have an interrupt that the coordinator thread waits for (read on generic-uio IRQ) and it then reads some PL registers to determine which of the three workers need to be run and then calls the appropriate pthread_cond_signal (there are three separate conditions and three separate mutexes). Sometimes more than one thing needs to be run at the same time and so several of the workers will be kicked into action at the same time.
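A rough sketch of the coordinator's interrupt wait is below. With generic-uio, a blocking `read` of 4 bytes on the `/dev/uioN` node returns the interrupt count, and writing a 32-bit `1` re-enables the interrupt after it has fired. The function names and the status-register layout (bit n set means worker n must run) are assumptions made for illustration; the real PL register map is not given in this post.

```c
#include <stdint.h>
#include <unistd.h>

#define NUM_WORKERS 3

/* Re-enable (unmask) the interrupt by writing a 32-bit 1 to the UIO node. */
int uio_irq_enable(int fd)
{
    uint32_t one = 1;
    return write(fd, &one, sizeof one) == (ssize_t)sizeof one ? 0 : -1;
}

/* Block until the next interrupt. On success *count holds the total
   number of interrupts seen so far and 0 is returned. */
int uio_irq_wait(int fd, uint32_t *count)
{
    return read(fd, count, sizeof *count) == (ssize_t)sizeof *count ? 0 : -1;
}

/* ASSUMED register layout: bit n of status set means worker n must run.
   Fills out[] with the worker indices and returns how many there are. */
unsigned pending_workers(uint32_t status, int out[NUM_WORKERS])
{
    unsigned n = 0;
    for (int i = 0; i < NUM_WORKERS; i++)
        if (status & (1u << i))
            out[n++] = i;
    return n;
}
```

The coordinator loop would then call `uio_irq_enable`, `uio_irq_wait`, read the status from the mapped PL registers, and signal the condition of each worker index returned by `pending_workers`.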

As far as I am concerned, this should work well as I have one CPU for the OS, I/O etc and three more cores, one for each worker thread (which are the majority load of the system). When I actually ran it the first time, however, I saw that the three worker threads had very variable run times (they should be pretty much the same every go-round - it is a fairly deterministic algorithm) - for example, one had run times mostly around 80us but sometimes they spiked to 130us, 180us, 230us and sometimes very high numbers (~950us). I did note that the jumps appeared to be multiples of approximately 50us. I concluded that they were being interrupted by other processes and therefore changed the controller and worker threads to use SCHED_FIFO rather than SCHED_OTHER and set the priority to 99 (no errors in function calls and read-back of the values confirmed that they were set correctly). This resulted in near identical behaviour.
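The scheduling change described above could be sketched roughly as follows. `set_fifo_priority` is a made-up helper name; it wraps `pthread_setschedparam` and does the read-back check with `pthread_getschedparam`.

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Switch a thread to SCHED_FIFO at the given priority and read the
   setting back. Returns 0 on success or an errno value, e.g. EPERM
   when the process lacks the privilege to use real-time policies. */
int set_fifo_priority(pthread_t thread, int priority)
{
    struct sched_param sp = { .sched_priority = priority };
    int policy;
    int rc;

    rc = pthread_setschedparam(thread, SCHED_FIFO, &sp);
    if (rc != 0)
        return rc;

    /* Read back to confirm that the values were actually applied. */
    rc = pthread_getschedparam(thread, &policy, &sp);
    if (rc != 0)
        return rc;
    return (policy == SCHED_FIFO && sp.sched_priority == priority) ? 0 : EINVAL;
}
```

For example, `set_fifo_priority(pthread_self(), 99)` from inside each thread. Note that SCHED_FIFO threads of equal priority do not preempt each other: if several of them land on the same CPU, each runs until it blocks or yields, so raising the priority alone does not keep them out of each other's way.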

I then wasted a bit of time trying to improve matters (I patched the kernel with PREEMPT_RT and various other things) but they were all pointless asides. I eventually realised that the issue was that the three workers were interrupting each other because they were all running on the same CPU (i.e. I had one very busy CPU and three nearly idle ones). As a result, I constrained the OS and everything else to CPU 0 using isolcpus=1,2,3 as a bootarg (and confirmed it worked using ps -e -m -o psr,command) and then used pthread_attr_setaffinity_np prior to thread creation to tie the three worker threads one each to CPUs 1, 2 and 3 (again, confirmed that it worked using ps). So now I have them all sitting on their own CPU but the worker threads never run.
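A minimal sketch of pinning a thread at creation time with `pthread_attr_setaffinity_np` is below. The helper names (`start_pinned`, `record_cpu`, `first_allowed_cpu`) are invented for the example; `first_allowed_cpu` exists only so the snippet can be tried on any machine, whereas in the setup described here the `cpu` argument would be 1, 2 or 3.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Create a thread that is pinned to a single CPU from the moment it
   starts. Returns 0 or an errno value from the pthread calls. */
int start_pinned(pthread_t *thread, int cpu, void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    cpu_set_t set;
    int rc;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setaffinity_np(&attr, sizeof set, &set);
    if (rc == 0)
        rc = pthread_create(thread, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
}

/* Example thread body: store the CPU it is running on in *arg. */
void *record_cpu(void *arg)
{
    *(int *)arg = sched_getcpu();
    return NULL;
}

/* Lowest-numbered CPU in the calling thread's affinity mask, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Build with `-pthread`. Calling `sched_getcpu()` from inside the thread, as `record_cpu` does, is a quick in-program cross-check of what `ps -o psr` reports.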

Looking into this issue, it seems that once they are on different CPUs, the pthread_cond_signal does not get picked up by the waiting thread (i.e. I can see that the worker thread never starts but that the controller thread does call signal). I presume that this is because each CPU is looking at its (local) L1 cache and never flushing the results to/from the (common) L2 cache.

So my questions:

  1. How do I tell the OS which variables I want flushed to L2 or excluded from L1 caching? This doesn't seem possible from user space though, so...
  2. I (theoretically) could create a non-cached region (I actually have no idea how to do this in PetaLinux, so help required there, if this is the right approach). This feels like it would be slow though as presumably it would write to DDR... I preferably would just cut it from the L1 cache and keep it in the L2 cache... any idea how to do that?
  3. As an extension to the above, is it possible to make a heap that I can allocate from in the non-cached area, or do I have to do memory management myself? (FYI most of the memory used in the system would need to be in this region as the threads share quite a bit, although for everything except the pthread condition, I know when it would need to be flushed, so having a cache flushing mechanism would also work.)
  4. Or, am I doing this totally wrong? If I run a programme under windows (say) that is multithreaded, I expect it to run the threads concurrently on different cores and be able to share data without me (the programmer) needing to think about it at all. Why is Linux not just doing this by default? Do I have some kernel (or other) setting wrong in petalinux-config (or with the compile for the programme)?

Many thanks in advance for your help.



2 Replies
Registered: ‎05-21-2015

@rjcassels ,

I would think that a good MUTEX would fix this, since the ARM supports exclusive access instructions to handle this very issue.


Registered: ‎10-27-2017

Ignore this, just me doing something very stupid in the code (nothing to do with mutexes though - I already had those implemented properly). Now works just fine.
