Registered: ‎02-27-2015

High bare metal timing variability in AMP configuration due to Linux

I have a bare metal program running on CPU1 in the AMP configuration (APP-1078). I started noticing problems with high timing variability in seemingly simple code, causing me to miss deadlines. It appears that specific userspace-induced actions performed on Linux running on CPU0 are either causing or exaggerating the problem. I have reduced the problem to a demonstration with a small set of changes over the APP-1078 2014.04 example code (though I am running Ubuntu instead of PetaLinux); the bare metal C program is attached. As you can see in the bare metal program, I have tried a few different experiments (disabling interrupts, using only SRAM, using only DRAM, etc.). None of them appear to have any effect on the high timing variability.

The FPGA PL has a 10 MHz counter which is accessible from IO-mapped memory space. The counter reads fine and, as far as I can tell from the ARM processor, is ticking away exactly as desired. The bare metal program gets the current counter value and then subtracts the previous value. The resulting difference is used as an index into an array, which is incremented. This forms a histogram of 100 ns buckets representing how long this handful of operations took. I would have naively expected exactly one bucket (or perhaps two, if I was unlucky) to be used. Instead, I see three buckets with significant use and 10+ other buckets with occasional use. In other words, sometimes these instructions take over a microsecond longer than they normally do. This causes me to miss deadlines and rather seems to defeat the entire purpose of having a bare metal option.

Worse, specific operations on the Linux side induce a faster rate of these long-delay operations. Starting gdb (no options) on Linux, for instance, often causes (relatively) massive delays in the bare metal program; but starting any program will cause higher than normal latencies. I believe I have proven that Linux-side CPU usage, disk I/O, network I/O, memory usage (writing a 20 MB memory array over and over), system calls, context switches, and UART usage are all not the primary cause; but what *is* the cause is still a mystery.

Does anyone have an idea of what the problem might be, or how I can work around it so that the bare metal program runs without significant timing variance?

```
Linux:  3.17.0-xilinx-00001-gc064659 #1 SMP PREEMPT
Device: 7z020
Vivado: 2014-04

             CPU0
  27:           0  GIC  27  gt
  29:      217740  GIC  29  twd
  35:           0  GIC  35  f800c000.ocmc
  39:          43  GIC  39  f8007100.adc
  40:         372  GIC  40  f8007000.devcfg
  41:           0  GIC  41  f8005000.watchdog
  43:   321316382  GIC  43  ttc_clockevent
  51:           0  GIC  51  e000d000.spi
  53:           0  GIC  53  ehci_hcd:usb1
  54:     1262642  GIC  54  eth0
  56:        7665  GIC  56  mmc0
  57:        2027  GIC  57  cdns-i2c
  82:        3286  GIC  82  xuartps
IPI1:           0  0  Timer broadcast interrupts
IPI2:           0  0  Rescheduling interrupts
IPI3:           0  0  Function call interrupts
IPI4:           0  0  Single function call interrupts
IPI5:           0  0  CPU stop interrupts
IPI6:           0  0  IRQ work interrupts
IPI7:           0  0  completion interrupts
Err:            0
```
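For reference, the measurement loop is essentially the following (a minimal sketch, not the attached program; PL_COUNTER_ADDR and the bucket count are placeholders -- the real address comes from the block design):

```c
#include <stdint.h>

#define PL_COUNTER_ADDR  0x43C00000u   /* placeholder AXI address of the 10 MHz PL counter */
#define BUCKETS          64            /* one bucket = one counter tick = 100 ns */

static volatile uint32_t * const pl_counter = (volatile uint32_t *)PL_COUNTER_ADDR;
static uint32_t hist[BUCKETS];

void histogram_loop(void)
{
    uint32_t prev = *pl_counter;
    for (;;) {
        uint32_t now  = *pl_counter;
        uint32_t diff = now - prev;    /* elapsed 100 ns ticks for this iteration */
        prev = now;
        if (diff >= BUCKETS)
            diff = BUCKETS - 1;        /* clamp the rare long outliers */
        hist[diff]++;                  /* expected: only one or two buckets ever used */
    }
}
```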
14 Replies
Scholar austin
Registered: ‎02-27-2008

Re: High bare metal timing variability in AMP configuration due to Linux

s,

 

Are you sure this isn't a metastability problem?  Reading a counter that is clocked in one clock domain from a processor in another clock domain typically requires extreme care in crossing the domains.

 

I have seen all sorts of seemingly insurmountable problems like you describe.

 

Note that the use of Gray codes and metastability-hardened flip-flops on the clock domain crossing will only reduce how often these events occur; they will NEVER go away completely.

 

A commonly used hack is to read and throw away any results that are obviously in error, and read again.

 

I have also seen designs where the counter is read repeatedly until two successive reads agree.
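In C, that read-until-two-reads-agree idea looks roughly like this (illustrative sketch only; 'counter' is a hypothetical pointer to the memory-mapped PL counter):

```c
#include <stdint.h>

/* Re-read the counter until two back-to-back reads return the same value. */
static uint32_t read_counter_stable(volatile uint32_t *counter)
{
    uint32_t a = *counter;
    uint32_t b = *counter;
    while (a != b) {        /* keep reading until two consecutive values match */
        a = b;
        b = *counter;
    }
    return a;
}
```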

 

Again, it is a problem without a good solution.  Best is to synchronize the processor with the counter so it is not possible for a read to occur before the value has settled.  Commonly, the processor's own counter/timer is used in a mode synchronous to the CPU clock.

Austin Lesea
Principal Engineer
Xilinx San Jose
Registered: ‎02-27-2015

Re: High bare metal timing variability in AMP configuration due to Linux

Thanks for replying.

 

I'm not reading a system clock; I am reading a counter from the FPGA which is being incremented at 10 MHz.  I'm pretty sure that the counter is working approximately correctly: if I read the counter, check a remote clock (ntpdate -d), wait, and read the counter + ntpdate again, the counter has incremented an approximately correct number of times.  I've also never seen obviously insane differences (negative, 100K, etc.).  I'm pretty sure that the negotiation to get data from the FPGA could not possibly have a metastability problem, or no one could ever transfer data.

 

This is also a simplification of the real code, which exhibits the same problem.  In the real code, the FPGA is counting bits coming in at 1 MHz and relaying those values to me, but the bare metal program misses bits when the Linux processor starts programs.  Under normal circumstances, the bare metal system is able to poll at least 4 times per µs, so missing a bit is a significant event.  The external signal is definitely at the correct frequency, and the FPGA definitely caught the signal since it incremented the counter, so the only option appears to be that the CPU occasionally takes much longer than normal to execute some instructions (probably the fetch from the FPGA, but I cannot prove that).

 

Can the AXI bus stall or get busy?  Could CPU0 do something to the MMU and cause CPU1 to be delayed?  Anything?

 

Thanks,

Xilinx Employee
Registered: ‎09-10-2008

Re: High bare metal timing variability in AMP configuration due to Linux

Hi,

 

I've never tried to track it down as closely as you are trying to do; I've only been concerned with the lower tens of microseconds, not below that.  Here's some thinking about the problem -- I don't know the answer either, but I am interested in what you find.

 

I personally might try to think about the h/w system in a different manner (if possible) such that the s/w timing is not as critical.  Yes, it is a baremetal application, but it's still running in a complex system on a complex superscalar processor (an A9, not an R-class core).

 

A ChipScope core in the PL may be able to show you the timing of the transactions.  You might also be able to use a PS timer rather than a PL timer to do the same binning operations and see what happens.

 

I don't know about stalls, but there is certainly interconnect from the CPU to the PL for the transactions.  CPU0 cannot alter the CPU1 MMU, so I don't see how that would happen.  I'm also assuming CPU0 is not accessing logic in the PL.

 

Thanks,

John

 

 

Scholar austin
Registered: ‎02-27-2008

Re: High bare metal timing variability in AMP configuration due to Linux

All,

 

The CPUs share the L2 cache, OCM, and any accesses to external memory.

 

So, if the Linux CPU is actively executing from DDR (for example), stalls are possible: that traffic occupies the DDR and L2, and unless the other CPU is executing out of its own L1 cache, it will have to wait for those resources to be released.

 

The AXI bus to the PL also takes bandwidth from the overall pipe to off-chip memory.  So any transfers of blocks of data slow down other uses, but should not stall them (as the MMU by default is programmed to provide all resources fairly to all requests).

 

Your Xilinx FAE has resources to evaluate what is actually happening, so you should contact them.

 

The same tools they use for profiling your code are also available from ARM.

 

So, yes, I agree, this is unlikely to be a metastability issue (but metastability will still occur -- and you should be able to recognize it and recover from it).

 

 

Austin Lesea
Principal Engineer
Xilinx San Jose
Registered: ‎02-27-2015

Re: High bare metal timing variability in AMP configuration due to Linux

Thank you for your reply and ideas.

 

I converted the program to use the XScuTimer running at 333 MHz instead of the PL timer.  This, I assume, eliminates the AXI bus and the PL design in general as a likely source of interference.  However, I encountered the same problem.  Most of the time the code was running in the 480 ns range for a single loop, but when CPU0 started programs, e.g. gdb/emacs, the latency would burst up to 1.9 ms.
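Roughly, the reworked measurement looks like this (a minimal sketch assuming the standalone BSP's XScuTimer driver and the usual XPAR_XSCUTIMER_0_DEVICE_ID; the bucket math is illustrative, not the attached program):

```c
#include "xscutimer.h"
#include "xparameters.h"

#define BUCKETS 1024

static XScuTimer Timer;
static u32 hist[BUCKETS];

void scu_histogram_loop(void)
{
    XScuTimer_Config *cfg = XScuTimer_LookupConfig(XPAR_XSCUTIMER_0_DEVICE_ID);
    XScuTimer_CfgInitialize(&Timer, cfg, cfg->BaseAddr);
    XScuTimer_EnableAutoReload(&Timer);          /* free-running wraparound */
    XScuTimer_LoadTimer(&Timer, 0xFFFFFFFFu);
    XScuTimer_Start(&Timer);

    u32 prev = XScuTimer_GetCounterValue(&Timer);
    for (;;) {
        u32 now  = XScuTimer_GetCounterValue(&Timer);
        u32 diff = prev - now;                   /* timer counts down at ~333 MHz;
                                                    unsigned math handles the wrap */
        prev = now;
        u32 idx = diff / 33;                     /* ~33 ticks per 100 ns bucket */
        if (idx >= BUCKETS)
            idx = BUCKETS - 1;
        hist[idx]++;
    }
}
```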

 

So this would seem to leave the L2 cache and DRAM bandwidth as the source of variable timing, since gdb/emacs are not going to be causing any OCM accesses.  I would assume that the CPU1 code would be running out of cache -- it isn't very big.  Indeed, when running a program on CPU0 which chews memory bandwidth (looping through a 20 MB array touching memory), there is no ill effect on CPU1.  Thus the L2 cache would seem to be the most likely candidate.
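(For reference, the CPU0-side bandwidth chewer was essentially the following sketch; the real test program is not reproduced here.)

```c
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t len = 20u * 1024 * 1024;    /* 20 MB working set, far larger than L2 */
    unsigned char *buf = malloc(len);
    if (buf == NULL)
        return 1;
    for (;;)
        memset(buf, 0xA5, len);        /* keep touching every byte, forever */
}
```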

 

I suppose for my next test I could try moving the code into the (uncached, as per APP1078) SRAM, or see if I can disable the L2 cache for DRAM.  Or perhaps someone can point me at some code which will (safely) explicitly flush the cache from CPU0.  I know how to implicitly flush the execution cache on CPU0, so I guess I could try that as well.

 

 

Registered: ‎02-27-2015

Re: High bare metal timing variability in AMP configuration due to Linux

This is still very mysterious. The bare processor still slows down when the Linux processor does work, except now I have reproduced the problem while the bare processor is just spinning, counting down. I set the SCU timer, count down from 100M, and then get the SCU timer value. If the Linux side isn't doing anything, the result is very stable. If the Linux side starts programs, it takes more time to count down from 100M. The effect is pretty small in this test -- on the order of 200 ns instead of the µs+ effects I saw in other tests -- so the effect does vary depending on what the bare metal CPU is doing.

This is all running in a handful of instructions, so it has to be in the L1 icache. It runs entirely in registers, so the L2 cache and OCM aren't involved. Neither Linux nor the bare metal program is doing anything with the PL. How can one processor be interfered with by the other? Any ideas?
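The spin test itself is tiny; roughly the following (a sketch, reusing a started XScuTimer instance as in the earlier sketch; the inline asm keeps the countdown in a register so nothing inside the loop touches memory):

```c
#include "xscutimer.h"

/* Returns the number of ~333 MHz private-timer ticks consumed by a
 * 100M-iteration register-only countdown. */
u32 spin_test(XScuTimer *timer)
{
    u32 start = XScuTimer_GetCounterValue(timer);

    u32 n = 100000000u;                       /* count down from 100M */
    __asm__ volatile (
        "1: subs %0, %0, #1 \n\t"
        "   bne  1b         \n\t"
        : "+r" (n) : : "cc");

    return start - XScuTimer_GetCounterValue(timer);   /* timer counts down */
}
```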
Scholar austin
Registered: ‎02-27-2008

Re: High bare metal timing variability in AMP configuration due to Linux

s,

 

There are still common resources here.  The program using the SCU timer still has to go through the single (internal) AXI bus shared by both cores.  If the count-down program wants the SCU, it still has to wait for that bus if the other CPU is using it.

 

Look at the block diagram of the PS; the SCU is a common block below the processor cores.

Austin Lesea
Principal Engineer
Xilinx San Jose
Registered: ‎02-27-2015

Re: High bare metal timing variability in AMP configuration due to Linux

True, but I'm only accessing the SCU timer once every five seconds or so, and the interfering behaviour was initiated and completed during the spinning phase, so there was no chance for direct interference.

Actually, when performing this analysis I noticed that the added delay actually decays, which really confuses me. In the data below, the values are differences between the starting and ending timers. The "e" value is the steady state, which is what happens when Linux does nothing. The "65"-ish value is what happened when I ran a program (e.g. sleep) in Linux -- I have seen values in the 90 range as well. The weird thing is that the program started and terminated during the third interval, so why are subsequent intervals slower than the steady state? I find it hard to believe a cache is having to slowly fill; the icache could not possibly need to fill, and I'm only writing the diff out to an OCM memory location once every five seconds, so the dcache should not be involved.

6000000e 6000000e 60000065 60000034 6000000f 6000000e 6000000e 6000000e 60000066 6000002a 60000018 6000000e

I've also been able to reproduce the problem during a Linux program execution -- if the process uses mlockall() it will cause another (decaying) spike of delay. However, repeating mlockall()/munlockall() over and over does not persist the delay. Doing open/write/close does not cause a spike. Note that even having the Linux side poll the PL memory location (causing 150K AXI transactions a second) does not cause the bare metal processor to slow down (other than, of course, the spike when I start the polling program).

I will continue to try to find a way to persist the delay, which might help isolate where the interaction is occurring.
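For completeness, the mlockall() probe is just this (a rough sketch, not the full test harness):

```c
#include <sys/mman.h>

int main(void)
{
    /* One lock/unlock pair produces one decaying latency spike on CPU1... */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return 1;
    munlockall();
    /* ...but repeating the pair in a loop does not keep the delay going. */
    return 0;
}
```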
Scholar austin
Registered: ‎02-27-2008

Re: High bare metal timing variability in AMP configuration due to Linux

"linux does nothing"

 

I don't agree; use 'top' to see just how many processes Linux runs when doing 'nothing.'

 

Further, if you use ANY function calls to the libraries in the baremetal application, you are going to leave cache, as these functions typically have many layers and go all over the place.

 

 

Austin Lesea
Principal Engineer
Xilinx San Jose
Registered: ‎02-27-2015

Re: High bare metal timing variability in AMP configuration due to Linux

perf

Bare/Linux Performance Interactions

Bare metal processor L1 cache being zapped by Linux processor.

I found the ARM performance counters and dumped them for this test. What this appears to show, to me, is that the instruction cache (and probably the TLB and Data cache) on the bare metal processor is getting zapped by something that is happening on the Linux side. Does anyone know how to stop this from happening?
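Roughly, the sampling uses the standalone BSP's PMU helpers in xpm_counter.h (a sketch; the wrapper function name is mine, and the eleven XPM_CNTRCFG sets are cycled one at a time since the A9 only exposes six event counters):

```c
#include "xpm_counter.h"       /* standalone BSP Cortex-A9 PMU event helpers */

/* Sample one of the eleven predefined counter configurations. */
void sample_pmu_set(s32 cfg, u32 ctrs[6])
{
    Xpm_SetEvents(cfg);            /* program the six events for this set,
                                      e.g. XPM_CNTRCFG1                     */
    /* ... run the timed loop for one sample interval here ...              */
    Xpm_GetEventCounters(ctrs);    /* read back the six counts for this set */
}
```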


To clarify a confusing comment from a previous post: when I said "linux does nothing" I was referring to the activity causing the spike in behavior (program execution, potentially a new executable page being mapped, etc.) -- the CPU was at 100% utilization during my tests. (And to further clarify, when the Linux side was at 0% utilization I had the same delay times -- Linux-side CPU utilization is not a proximate cause.)


Also, as the performance counters below indicate, my bare metal program is running entirely in L1 cache. The entire bare metal text segment is only 8116 bytes long, compared to the 32KB icache.


The "tst" column gives the XPM_CNTRCFG set and the counter index within it; "Event" is the XPM counter description. "Steady" is a sampled value taken while the Linux processor was not doing the evil operation and the test latency was at its minimum. "Burst" is the first value after the Linux side performed the bad operation (e.g. executing /bin/true). Empty cells are zero.

tst   Event                   Steady      Burst
1,0   SOFTINCR
1,1   INSRFETCH_CACHEREFILL               19
1,2   INSTRFECT_TLBREFILL                 1
1,3   DATA_CACHEREFILL
1,4   DATA_CACHEACCESS        12          12
1,5   DATA_TLBREFILL                      3
2,0   DATA_READS              a           b
2,1   DATA_WRITE              8           9
2,2   EXCEPTION
2,3   EXCEPRETURN
2,4   CHANGECONTEXT
2,5   SW_CHANGEPC             7ffffff7    7ffffff7
3,0   IMMEDBRANCH             7ffffff4    7ffffff4
3,1   UNALIGNEDACCESS
3,2   BRANCHMISS              1           5
3,3   CLOCKCYCLES             c0000067    c00001b8
3,4   BRANCHPREDICT           7ffffff9    7ffffff9
3,5   JAVABYTECODE
4,0   SWJAVABYTECODE
4,1   JAVABACKBRANCH
4,2   COHERLINEMISS
4,3   COHERLINEHIT
4,4   INSTRSTALL                          13e
4,5   DATASTALL
5,0   MAINTLBSTALL
5,1   STREXPASS                           2
5,2   STREXFAIL
5,3   DATAEVICT
5,4   NODISPATCH              53          1c2
5,5   ISSUEEMPTY              18          19e
6,0   INSTRRENAME             b           12
6,1   PREDICTFUNCRET          5           5
6,2   MAINEXEC                80000009    8000000e
6,3   SECEXEC                 80000003    80000003
6,4   LDRSTR                  12          15
6,5   FLOATRENAME
7,0   NEONRENAME
7,1   PLDSTALL
7,2   WRITESTALL
7,3   INSTRTLBSTALL                       2
7,4   DATATLBSTALL                        2
7,5   INSTR_uTLBSTALL                     5
8,0   DATA_uTLBSTALL
8,1   DMB_STALL
8,2   INT_CLKEN               c0000069    c00001d4
8,3   DE_CLKEN                c0000069    c00001d4
8,4   INSTRISB
8,5   INSTRDSB
9,0   INSTRDMB
9,1   EXTINT
9,2   PLE_LRC
9,3   PLE_LRS
9,4   PLE_FLUSH                           3e3
9,5   PLE_CMPL
10,0  PLE_OVFL
10,1  PLE_PROG
10,2  PLE_LRC
10,3  PLE_LRS
10,4  PLE_FLUSH               a           a
10,5  PLE_CMPL
11,0  DATASTALL
11,1  INSRFETCH_CACHEREFILL
11,2  INSTRFECT_TLBREFILL
11,3  DATA_CACHEREFILL
11,4  DATA_CACHEACCESS
11,5  DATA_TLBREFILL
Registered: ‎02-27-2015

Re: High bare metal timing variability in AMP configuration due to Linux

Given hints from the performance information in the previous post, I've also now been able to isolate the problem a bit on the Linux side. I can cause it to happen indefinitely by performing an mmap with PROT_READ|PROT_EXEC, then unmapping, and then remapping (the next page up). If I mmap without PROT_EXEC, the performance is perhaps 30 ns worse than "normal", but it doesn't spike up to the hundreds or thousands of ns that it does when PROT_EXEC is in play.
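Roughly, the reproducer looks like this (a sketch with assumptions: anonymous mappings and an advancing hint address stand in for the "next page up" remap; the real test differs in detail):

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t pg   = (size_t)sysconf(_SC_PAGESIZE);
    char  *hint = (char *)0x40000000;        /* arbitrary starting address */

    for (;;) {
        /* PROT_EXEC is the trigger; without it the CPU1 impact is ~30 ns. */
        void *p = mmap(hint, pg, PROT_READ | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        munmap(p, pg);
        hint += pg;                          /* remap the next page up */
    }
}
```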

 

This seems to confirm that it is some kind of TLB/L1 cache issue. Any ideas on how to tell Linux and/or ARM to not try to synchronize the two processors since they are not looking at the same memory?

Xilinx Employee
Registered: ‎09-10-2008

Re: High bare metal timing variability in AMP configuration due to Linux

I'm just thinking out loud a bit to brainstorm and my apologies if you've already said something about this as the thread is pretty long.

 

What about the MMU settings in the standalone application? The S bit is the shareable bit, which turns on hardware coherence. Based on my understanding, this tells the system to update the L1 based on any changes to the L2.

 

In <Xilinx SDK>\data\embeddedsw\lib\bsp\standalone_v4_2\src\cortexa9\gcc\translation_tables.s....

 

.rept 0x0400 /* 0x00000000 - 0x3fffffff (DDR Cacheable) */
.word SECT + 0x15de6 /* S=b1 TEX=b101 AP=b11, Domain=b1111, C=b0, B=b1 */
.set SECT, SECT+0x100000
.endr

 

Should the shared bit be turned off for AMP to prevent this?

 

Thanks

John

Scholar austin
Registered: ‎02-27-2008

Re: High bare metal timing variability in AMP configuration due to Linux

j,

 

Try it and tell us what happens.  That sounds like a definite source of sharing (contention for resources).

Austin Lesea
Principal Engineer
Xilinx San Jose
Registered: ‎02-27-2015

Re: High bare metal timing variability in AMP configuration due to Linux

icache flushes and sharing, possible solution?

I made the requested change to the sharing bit and rebooted with no change in behavior (I dumped the bare CPU table--see below--before I made the change and it was already marked as non-shared).


However, while I was investigating everything, I tracked the Linux executable page mapping code down to v7_flush_icache_all in arch/arm/mm/cache-v7.S, which has enticing-looking ALT_SMP() and ALT_UP() icache invalidation code. The SMP option (mcr p15, 0, r0, c7, c1, 0) didn't appear to be documented in the ARMv7-A ARM ARM (it was in the v5 ARM ARM), while the UP option (mcr p15, 0, r0, c7, c5, 0) was referenced. I manually forced the UP version and it appears to have worked so far. In my extremely limited test, the system was slowed by 54 ns instead of the normal many hundreds of ns (up to > 1 µs), but that will hopefully be small enough not to matter.
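For reference, the two encodings expressed as inline assembly -- my reading of the ARMv7-A manual is that the SMP form is ICIALLUIS (icache invalidate all, Inner Shareable, broadcast to the other core) and the UP form is ICIALLU (issuing core only); the helper names here are made up:

```c
/* ALT_SMP form: ICIALLUIS -- invalidate all icaches to PoU, Inner Shareable.
 * The invalidate is broadcast to every core in the inner-shareable domain,
 * i.e. it also hits the bare metal CPU1. */
static inline void icache_invalidate_all_is(void)
{
    unsigned int zero = 0;
    __asm__ volatile ("mcr p15, 0, %0, c7, c1, 0" : : "r" (zero) : "memory");
}

/* ALT_UP form: ICIALLU -- invalidate all icaches to PoU, local core only.
 * CPU1's icache is left alone. */
static inline void icache_invalidate_all_local(void)
{
    unsigned int zero = 0;
    __asm__ volatile ("mcr p15, 0, %0, c7, c5, 0" : : "r" (zero) : "memory");
}
```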


So this is exciting news for me. Does anyone know what the ill effects might be? The system hasn't crashed yet, which is always a good sign. None of the text around the description of the cache flushing instruction really explained to me what it means in an MP environment. I also wonder if I need to make similar changes to v7_flush_kern_cache_all(), v7_flush_kern_cache_louis(), and perhaps even v7_flush_dcache_all() or v7_flush_dcache_louis().


So thanks for everyone's help so far, and hopefully someone can let me know more about the effects of these different ways of invalidating the icache.


Bare metal mmutable (for the record):
--------------------------------------------------
0x00000000-0x2fffffff: invalid
0x30000000-0x3fffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=4 0=0 domain=f XN=0 C=0 B=1
0x40000000-0xbfffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=0 0=0 domain=0 XN=0 C=0 B=0
0xc0000000-0xdfffffff: invalid
0xe0000000-0xe02fffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=0 0=0 domain=0 XN=0 C=0 B=1
0xe0300000-0xe0ffffff: invalid
0xe1000000-0xe3ffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=0 0=0 domain=0 XN=0 C=0 B=1
0xe4000000-0xe5ffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=0 0=0 domain=0 XN=0 C=1 B=1
0xe6000000-0xf7ffffff: invalid
0xf8000000-0xf8ffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=0 0=0 domain=0 XN=0 C=0 B=1
0xf9000000-0xfbffffff: invalid
0xfc000000-0xfdffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=0 0=0 domain=0 XN=0 C=1 B=0
0xfe000000-0xffefffff: invalid
0xfff00000-0xffffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=4 0=0 domain=f XN=0 C=0 B=0
--------------------------------------------------