02-27-2015 12:47 PM
02-27-2015 02:53 PM
Are you sure this isn't a metastability problem? Reading a counter that is clocked in one clock domain from a processor in another clock domain typically requires extreme care in crossing the domains.
I have seen all sorts of seemingly insurmountable problems like you describe.
Note that Gray codes and metastability-hardened synchronizer flip-flops will only reduce the statistical likelihood of these events; they will NEVER go away completely.
A commonly used hack is to read and throw away any results that are obviously in error, and read again.
I have also seen designs where the clock is read repeatedly until two reads agree.
Again, it is a problem without a perfect solution. Best is to synchronize the processor with the counter so that a read cannot occur before a value has settled. Commonly, the processor's counter/timer is used in a mode synchronous to the CPU clock.
02-27-2015 03:27 PM
Thanks for replying.
I'm not reading a system clock; I am reading a counter from the FPGA which is being incremented at 10MHz. I'm pretty sure that the counter is working approximately correctly: if I read the counter, check a remote clock (ntpdate -d), wait, and read the counter + ntpdate again, the counter has incremented an approximately correct number of times. I've also never seen obviously insane differences (negative, 100K, etc). I'm pretty sure that the negotiation to get data from the FPGA could not possibly have a metastability problem, or no-one could ever transfer data.
This is also a simplification of the real code, which exhibits the same problem. In the real code, the FPGA is counting bits coming in at 1MHz and relaying those values to me, but the bare metal program misses bits when the Linux processor starts programs. Under normal circumstances, the bare metal system is able to poll at least 4 times per µs, so missing a bit is a significant event. The external signal is definitely at the correct frequency, and the FPGA definitely caught the signal since it incremented the counter, so the only remaining option appears to be that the CPU occasionally takes much longer than normal to execute some instructions (probably the fetch from the FPGA, but I cannot prove that).
Can the AXI bus stall or get busy? Could CPU0 do something to the mmu and cause CPU1 to be delayed? Anything?
02-27-2015 04:51 PM
I've never tried to track it down as closely as you are trying to do. I've only been concerned with the lower tens of µs, not below that. Here's some thinking about the problem; I don't know the answer either, but I'm interested in what you find.
I personally might try to think about the h/w system in a different manner (if possible) such that the s/w timing is not as critical. Yes, it is a bare-metal application, but it's still running in a complex system on a complex superscalar processor (an A9, not an R-class core).
A ChipScope core in the PL may be able to show you the timing of the transactions. You might also be able to use a PS timer rather than a PL timer for your binning operations and see what happens.
I don't know about the stalls, but certainly there is interconnect between the CPU and the PL for those transactions. CPU0 cannot alter CPU1's MMU, so I don't see how that would happen. I'm assuming CPU0 is not accessing logic in the PL either.
02-28-2015 09:02 AM
The CPUs share the L2 cache, OCM, and any accesses to external memory.
So, if the Linux CPU is actively executing from DDR (for example), stalls are possible, since that occupies the DDR and L2; unless the other CPU is executing out of its own L1 cache, it will have to wait for those resources to be released.
The AXI bus to the PL is also taking bandwidth from the overall pipe to off-chip memory. So any transfers of blocks of data slow down other uses, but should not stall them (as the MMU by default is programmed to provide all resources fairly to all requests).
Your Xilinx FAE has resources to evaluate what is actually happening, so you should contact them.
The same tools they use are also available from ARM for profiling your code, as well.
So, yes, I agree, this is unlikely to be a metastability issue (but metastability will still occur -- and you should be able to recognize it and recover from it).
02-28-2015 11:10 AM
Thank you for your reply and ideas.
I converted the program to use the XScuTimer running at 333MHz instead of the PL timer. This, I assume, eliminates the AXI bus and the PL design in general as a likely source of interference. However, I encountered the same problem. Most of the time the code was running in the 480ns range for a single loop, but when CPU0 started programs, e.g. gdb or emacs, the latency would burst up to 1.9ms.
So this would seem to leave the L2 cache and DRAM bandwidth as sources of variable timing, since gdb/emacs are not going to be causing any OCM accesses. I would assume that the CPU1 code is running out of cache--it isn't very big. Indeed, when running a program on CPU0 which chews memory bandwidth (looping through a 20MB array touching memory), there is no ill effect on CPU1. Thus the L2 cache would seem to be the most likely candidate.
I suppose for my next test I could try moving the code into the (uncached, as per XAPP1078) SRAM, or see if I can disable L2 caching of DRAM. Or perhaps someone has some code they can point me at which will (safely) explicitly flush the cache from CPU0. I know how to implicitly flush the execution cache on CPU0, so I guess I could try that as well.
03-06-2015 09:38 AM
03-06-2015 10:01 AM
There are still common resources here. The program using the SCU timer still has to go through the (internal) single AXI bus shared by both cores. If the countdown program wants the SCU, it still has to wait for that bus if the other CPU is using it.
Look at the block diagram of the PS, the SCU is a common block below the processor cores.
03-06-2015 11:29 AM
03-06-2015 11:31 AM - edited 03-06-2015 11:33 AM
"linux does nothing"
I don't agree; use 'top' to see just how many processes Linux runs when doing 'nothing.'
Further, if you use ANY function calls to the libraries in the bare-metal application, you are going to fall out of cache, as these functions typically have many layers and go all over the place.
03-10-2015 03:06 PM
I found the ARM performance counters and dumped them for this test. What this appears to show, to me, is that the instruction cache (and probably the TLB and Data cache) on the bare metal processor is getting zapped by something that is happening on the Linux side. Does anyone know how to stop this from happening?
To clarify a comment from a previous post that was confusing, when I said "linux does nothing" I was referring to the activity causing the spike in behavior (program execution, potentially new executable page being mapped, etc)–the CPU was at 100% utilization during my tests (and to further clarify, when the Linux side was at 0% utilization I had the same delay times–Linux side CPU utilization is not a proximate cause).
Also, as the performance counters below indicate, my bare metal program is running entirely in L1 cache. The entire bare metal text segment is only 8116 bytes long, compared to the 32KB icache.
The tst column represents the XPM_CNTRCFG and the output word. The "event" is the XPM counter description, "Steady" is a sampled value when the Linux processor was not doing the evil operation and the test latency was at a minimum value. "Burst" was the first value after the Linux side performed the bad operation (eg execute /bin/true). Empty cells are zero.
03-10-2015 04:47 PM
Given hints from the performance information in the previous post, I've also now been able to isolate the problem a bit on the Linux side. I can cause it to happen indefinitely by performing an mmap with PROT_READ|PROT_EXEC, unmapping, and then remapping (the next page up). If I mmap without PROT_EXEC, the performance is perhaps 30ns worse than "normal" but doesn't spike up to the hundreds or thousands of ns that it does when PROT_EXEC is in play.
This seems to confirm that it is some kind of TLB/L1 cache issue. Any ideas on how to tell Linux and/or ARM to not try to synchronize the two processors since they are not looking at the same memory?
03-11-2015 05:53 AM
I'm just thinking out loud a bit to brainstorm and my apologies if you've already said something about this as the thread is pretty long.
What about the MMU settings in the standalone application? The S bit is the shareable bit, which turns on hardware coherence. Based on my understanding, this tells the system to update the L1 based on any changes to the L2.
In <Xilinx SDK>\data\embeddedsw\lib\bsp\standalone_v4_2\src\cortexa9\gcc\translation_tables.s....
.rept 0x0400 /* 0x00000000 - 0x3fffffff (DDR Cacheable) */
.word SECT + 0x15de6 /* S=b1 TEX=b101 AP=b11, Domain=b1111, C=b0, B=b1 */
.set SECT, SECT+0x100000
Should the shared bit be turned off for AMP to prevent this?
03-11-2015 06:59 AM
Try it and tell us what happens. That sounds like a definite source of sharing (contention for resources).
03-11-2015 05:03 PM
I made the requested change to the sharing bit and rebooted with no change in behavior (I dumped the bare CPU table--see below--before I made the change and it was already marked as non-shared).
However, while I was investigating everything, I tracked the Linux executable page mapping code down to v7_flush_icache_all in arch/arm/mm/cache-v7.S, which has enticing-looking ALT_SMP() and ALT_UP() icache invalidation code. The SMP option (mcr p15, 0, r0, c7, c1, 0) didn't appear to be documented in the v7-A ARM ARM (it was in the v5 ARM ARM), while the UP option (mcr p15, 0, r0, c7, c5, 0) was referenced. I manually forced the UP version and it appears to have worked so far. In my extremely limited test, the system was slowed by 54ns instead of the normal many hundreds of ns (up to > 1 µs), but that will hopefully be minimal enough for it not to matter.
So this is exciting news for me. Does anyone know what any ill-effects might be? The system hasn't crashed yet, which is always a good sign. None of the text around the description of the cache flushing instruction really explained to me what it meant in an MP environment. I also wonder if I need to make similar changes to v7_flush_kern_cache_all(), v7_flush_kern_cache_louis(), and perhaps even v7_flush_dcache_all() or v7_flush_dcache_louis()
So thanks for everyone's help so far, and hopefully someone can let me know more about the effects of iflush in these different ways.
Bare metal mmutable (for the record):
--------------------------------------------------
0x00000000-0x2fffffff: invalid
0x30000000-0x3fffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=4 0=0 domain=f XN=0 C=0 B=1
0x40000000-0xbfffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=0 0=0 domain=0 XN=0 C=0 B=0
0xc0000000-0xdfffffff: invalid
0xe0000000-0xe02fffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=0 0=0 domain=0 XN=0 C=0 B=1
0xe0300000-0xe0ffffff: invalid
0xe1000000-0xe3ffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=0 0=0 domain=0 XN=0 C=0 B=1
0xe4000000-0xe5ffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=0 0=0 domain=0 XN=0 C=1 B=1
0xe6000000-0xf7ffffff: invalid
0xf8000000-0xf8ffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=0 0=0 domain=0 XN=0 C=0 B=1
0xf9000000-0xfbffffff: invalid
0xfc000000-0xfdffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=0 0=0 domain=0 XN=0 C=1 B=0
0xfe000000-0xffefffff: invalid
0xfff00000-0xffffffff: s: NS=0 0=0 nG=0 S=0 AP=3 TEX=4 0=0 domain=f XN=0 C=0 B=0
--------------------------------------------------