Zynq Cache Coherency Issue

galati (Adventurer)

Dear colleagues,

it seems that I have a cache coherency issue in my Zynq setup which I cannot explain, but maybe someone here can.

 

What my setup does is the following:
an HLS component reads two blocks of data, does some bytewise calculations and writes back the result block. At the moment, each of the two input parameters as well as the result uses an axi_master connected to an HP port of the Zynq processor to read/write data from/to the DDR memory.

The processor uses this HLS component five times in a row; afterwards, the final result should be available. A snapshot of the results captured with the System ILA can be seen here:

[ILA snapshot: galati_0-1625820505851.png]

With the ILA, I am able to check whether the calculated values that are written back to memory are correct.

Now, on the software side, I wrote a few lines that compare the calculated data against some expected data known to be correct:

xil_printf( "Compare\r\n" );
Xil_DCacheInvalidateRange( (UINTPTR)pstctEmptyPkt->pu32DataBuffer, ETH_BUFSIZE_WORDS * sizeof( u32 ) );
xil_printf( "Addr1: 0x%08x\r\n", pu32ETH1LostPaketData );
xil_printf( "Addr2: 0x%08x\r\n", pstctEmptyPkt->pu32DataBuffer );
for( u16DbgCounter = 0; u16DbgCounter < 512; u16DbgCounter++ )
{
	if( pu32ETH1LostPaketData[u16DbgCounter] != pstctEmptyPkt->pu32DataBuffer[u16DbgCounter] )
	{
		// Expected Data
		xil_printf( "0x%08x\r\n", pu32ETH1LostPaketData[u16DbgCounter] );
		// Calculated Data
		xil_printf( "0x%08x\r\n", pstctEmptyPkt->pu32DataBuffer[u16DbgCounter] );
		xil_printf( "%d\r\n", u16DbgCounter );
	}
}

 

Sometimes the first word (or the first few words) is not as expected (it is always zero). In such a case the compare prints the differences as debug output: the expected value, the calculated value and the index:

Compare
Addr1: 0x00203F1C
Addr2: 0x01284CC8
0x13001047
0x00000000
0

In the example above (see the ILA snapshot), a value of 0x13001047 was expected, and the snapshot shows that this value was indeed calculated and written to memory (see WDATA at the cursor). So why doesn't the software see the correct value when comparing the data, reading 0x00000000 instead?

Since the HLS component modifies data in memory behind the processor's back, a Xil_DCacheInvalidateRange should be issued before using the data; this is done here. I had a similar issue some months ago, where the expected data was also zero. In that case an axi_dma component wrote data to a memory location that had been prepared by the processor beforehand, and the problem was that the destination address was not aligned; it was solved with __attribute__ ((aligned(32)));

But since the destination addresses in our example are printed out and seem to be okay, this shouldn't be the same issue here, correct?

Could it be that the Xil_DCacheInvalidateRange is not entirely completed before the values are read? Does this function wait for completion in some way?

Or am I overlooking something else?
Any input would be appreciated!

Kind regards,
DG

21 Replies
dbemmann (Observer)

I did something similar recently and after invalidating the cache range, the new data was visible immediately.

Could it be that your address map is incomplete?

galati (Adventurer)

Hi dbemmann,

I have some excluded segments. Does this influence the invalidation of the cache?
What do you have in mind?

 

dbemmann (Observer)

To be honest, I have no clue. I recently happened to see all zeros after invalidating a cache range and reading from it, until I found out that the target address range was partly unassigned in Vivado's address editor. It took me a few hours to track that down, and your case sounded similar, so I thought it was worth a try. Another interesting question would be whether we can invalidate a PS cache range from within the PL, which could be very useful.

maps-mpls (Mentor)

Your buffers do not appear to be cache-line aligned.

dbemmann (Observer)

I have another theory: out-of-order execution. Maybe the APU issues the read a few cycles earlier (to avoid wait states) and thus reads the cache before it’s invalidated.

ericv (Scholar)

@galati , @dbemmann 's description is most likely the reason. You can verify that this is the case by disabling branch prediction when setting up the cache. Having hit that issue too many times, in my code I've stopped invalidating the destination memory area before the transfer; I do it after the transfer is over.

maps-mpls (Mentor)

@ericv's post doesn't make much sense. 

1.  Likely a cache line alignment issue.  Read up on cache controllers and do some experimentation.  See UG585, and read the invalidate/flush doxygen carefully.

2.  Not likely an out-of-order execution issue.  While out-of-order execution is a real issue in some situations, it is not likely in this one.  And even if out-of-order execution were your issue, you would fix it with a dsb, dmb, or isb instruction, not by no longer invalidating destination buffers, which had no rational basis in the first place.  See UG585.

3.  There is no rational reason to invalidate a destination buffer before a DMA transfer, as @ericv describes.  For test purposes, there is often a reason to flush a destination buffer before DMAing.  And certainly invalidating a destination buffer after DMAing is required if you want the processor to see the data (a sketch of this pattern follows at the end of this post).  But neither of these rational practices would explain the symptoms you describe.

4.  But a cache line alignment issue would.

I'd be very careful about who you get advice from on the internet.  Instead, hone your debug skills and read up on computer architecture.
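For illustration, the usual cache-maintenance pattern around a PL-to-memory transfer looks roughly like the sketch below. The buffer size and the start_hls_core()/wait_hls_done() helpers are placeholders for however the HLS core is actually driven; they are not Xilinx APIs.

#include "xil_cache.h"   /* Xil_DCacheFlushRange / Xil_DCacheInvalidateRange */
#include "xil_types.h"

#define BUF_BYTES ( 512u * sizeof( u32 ) )

/* Hypothetical helpers standing in for the real HLS-core driver calls. */
extern void start_hls_core( u32 *pu32SrcA, u32 *pu32SrcB, u32 *pu32Dst );
extern void wait_hls_done( void );

void run_transfer( u32 *pu32SrcA, u32 *pu32SrcB, u32 *pu32Dst )
{
	/* Before the transfer: flush the source buffers so the PL reads current data. */
	Xil_DCacheFlushRange( (UINTPTR)pu32SrcA, BUF_BYTES );
	Xil_DCacheFlushRange( (UINTPTR)pu32SrcB, BUF_BYTES );

	start_hls_core( pu32SrcA, pu32SrcB, pu32Dst );
	wait_hls_done();

	/* After the transfer: invalidate the destination so the PS re-reads it from
	 * DDR. The buffers should be cache-line aligned so this range does not
	 * clip into neighbouring data. */
	Xil_DCacheInvalidateRange( (UINTPTR)pu32Dst, BUF_BYTES );
}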

ericv (Scholar)

@galati , @maps-mpls is correct; I didn't realize you were invalidating the cache after the transfer. But he/she is providing 10,000-foot concepts, not an explanation or a solution. I will explain your issue. Your data is likely not aligned, and when you use Xil_DCacheInvalidateRange(), the parts of the buffer (at the start and end) that are not cache-line aligned are flushed and invalidated, not just invalidated; everything that is aligned is only invalidated. The flushing is the root cause: the data written to memory by the HLS component gets overwritten by the flushing operation.

Either you modify Xil_DCacheInvalidateRange() so that it does not flush the boundaries, or you align the data. In case you aren't aware, you can use the __attribute__ ((aligned(32))) attribute on a field in a data structure.
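For example, roughly (the structure and field names here are just placeholders):

typedef struct
{
	/* Forces the buffer to start on a 32-byte cache-line boundary. */
	u32 au32Data[ETH_BUFSIZE_WORDS] __attribute__ ((aligned(32)));
	u32 u32Length;
} EthPkt_t;

Note that this only guarantees the alignment for instances the compiler places itself (static or on the stack); an instance obtained from calloc()/malloc() is still not guaranteed to land on a 32-byte boundary.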

e

galati (Adventurer)

Hi guys!
Thank you very much for your input and for the discussion!!!

How would you apply the aligned attribute? For a variable, it would simply be:
static u32 pu32RxBuf[RX_BUFLEN] __attribute__ ((aligned(32)));
for example.

How can it be applied while using dynamic allocation with calloc?
Do I have to embed the variable in a struct to be able to apply the attribute?

Kind regards!

ericv (Scholar)

@galati, sadly you can't rely on calloc() or malloc() for the alignment: buffers returned by calloc()/malloc() are only aligned for the built-in types (I think it's 16 bytes, for the Neon 128-bit Q registers). That said, you can always allocate a larger size and use a copy of the pointer rounded up to the next 32-byte boundary. It isn't necessary to embed the buffer in a structure; and even if you do embed it in a structure, with calloc()/malloc() you'll still need to use an aligned buffer address.

FYI, you can do the ceiling with: (void *) ((((uint32_t) Ptr) + 31) & ~31)
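As a rough sketch of that idea (the function name alloc_cache_aligned() and the fixed 32-byte line size are just placeholders):

#include <stdint.h>
#include <stdlib.h>

/* Over-allocate by one cache line and round the pointer up to the next
 * 32-byte boundary.  The raw pointer is returned through ppRaw and is the
 * one that must later be passed to free(). */
static void *alloc_cache_aligned( size_t nBytes, void **ppRaw )
{
	void *pRaw = calloc( 1u, nBytes + 32u );

	if( pRaw == NULL )
		return NULL;

	*ppRaw = pRaw;
	return (void *)( ( (uintptr_t)pRaw + 31u ) & ~(uintptr_t)31u );
}

The aligned pointer is the one to hand to the HLS component and to Xil_DCacheInvalidateRange(); the raw pointer is the one to free().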

maps-mpls (Mentor)

>How can it be applied while using dynamic allocation with calloc?

It's not pretty, but you can either find third-party support or, better for your long-term development and skills in my opinion, write your own malloc/calloc/free wrappers that return cache-line aligned pointers.  In your wrappers you will have to call the standard malloc/calloc/free, requesting more memory than was requested via your wrapper functions.  Then you'll need a mechanism to keep track of the actual pointer returned by malloc/calloc while returning a cache-line aligned pointer, so that when your free_wrapper() is later called with the cache-line aligned pointer, you can free() the actual pointer after looking it up in your data structure.  You can do this with a linked list, or any other data structure you want.

maps-mpls (Mentor)

>But he/she is providing 10,000-foot concepts, not an explanation or a solution.

I'd call these fundamentals. But whatever.

vanmierlo (Mentor)

How about using C11 aligned_alloc? Or C++17 std::aligned_alloc?
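If the toolchain supports it, usage would look roughly like this (a sketch; note that C11 requires the size to be a multiple of the alignment, which ETH_BUFSIZE_WORDS * sizeof( u32 ) = 2048 bytes already is):

#include <stdlib.h>

/* Returns a 32-byte aligned buffer; release it with free() as usual. */
u32 *pu32DataBuffer = (u32*)aligned_alloc( 32, ETH_BUFSIZE_WORDS * sizeof( u32 ) );

Unlike calloc(), aligned_alloc() does not zero the memory, so add a memset() if that matters.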

maps-mpls (Mentor)

Great idea, if Xilinx supports C11.  I haven't checked recently.

galati (Adventurer)

Hey guys!

Thank you very much again!

 

I'm not quite sure if I got the point:
when allocating memory with, for example,
pu32DataBuffer = (u32*)calloc( ETH_BUFSIZE_WORDS, sizeof( u32 ) );
with ETH_BUFSIZE_WORDS being 512, shouldn't this return a cache-aligned address, since the size is a multiple of 32 bytes, which is the size of a cache line? Or does cache alignment also depend on the start address, which is 0x01284CC8 in our case? How can I evaluate whether the returned address was cache aligned?

By the way:
would the use of the ACP port be an alternative?

Kind regards,
DG

dbemmann (Observer)

The cache works on 32-byte segments, i.e. aligned addresses start at multiples of 0x20. Your start address 0x01284CC8 is not 32-byte aligned, which means that you cannot invalidate from 0x01284CC8, only from 0x01284CC0 or 0x01284CE0.
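A quick run-time check (just a sketch) is to test the low five address bits:

/* 32-byte alignment means the lowest 5 bits of the address are zero. */
if( ( (UINTPTR)pstctEmptyPkt->pu32DataBuffer & 0x1F ) == 0 )
	xil_printf( "buffer is cache-line aligned\r\n" );
else
	xil_printf( "buffer is NOT cache-line aligned\r\n" );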

Yes, the ACP port would be an alternative, but not necessarily a better one. When you write data from the PL to memory via the ACP port, it will automatically evict these segments from the cache, so you don't have to invalidate manually. In terms of PS read performance, everything stays the same. If you are writing bigger chunks like your 2KB and the PS knows when writes happen, then there is hardly any benefit in using ACP. I assume that in the above case, ACP would simply invalidate from 0x01284CC0, which you could easily do yourself... but why would you want to invalidate 8 bytes more at the start and 24 bytes more at the end, when you can simply have your buffer start at an address that is a multiple of 32 and then invalidate exactly the buffer?

vanmierlo (Mentor)

Yes, using the ACP will evict the lines from the L1 caches. But if you allow it, I believe it can allocate and store the data in the L2 cache.

dbemmann (Observer)

@vanmierlo is right: you can configure the cache controller for write-through into L2, which is normally done based on the address. I think you can also do that on a per-master basis, but I have never tried. My point with regard to performance was that for a 2KB block it hardly makes a difference whether it is initially cached in L2 or not, because the first read miss will cause the rest to get cached.

galati (Adventurer)

@dbemmann:
>but why would you want to invalidate 8 bytes more at the start and 24 bytes more at the end...

But isn't this exactly what was proposed earlier? To allocate 32 bytes more than needed, determine where the aligned segment lies and then use only the aligned part? It seemed to me that there is no "direct" solution, since calloc does not provide this feature and aligned_alloc does not seem to be available (at least, I don't know how to use it; it seems to be known, but I always get the linker error message: aligned_alloc.c:(.text+0xa): undefined reference to `posix_memalign').

This is why I hoped to be able to leave all the code as it is, either by using the ACP port or by using aligned_alloc. Did anyone use aligned_alloc successfully so far?

maps-mpls (Mentor) [Accepted Solution]

>Did anyone use aligned_alloc successfully so far?

I just wrote my own.  It's not hard, and unless you're mallocing and freeing kBuffers/second, you may not need to spend a lot of time optimizing your code.  

Be aware that, e.g., on the MPSoC, the cache line size is different for the R5 than for the A53.

If you write your own wrapper for malloc/calloc/free, the rest of your code can stay the same. 

1.  Define a structure with: 
    A.  Actual pointer
    B.  Cache line aligned pointer
    C.  Pointer to next structure.
2.  Create a static pointer to such a structure (your "head").
3. In your malloc, malloc memory for the above structure, and malloc memory for the requested size + 1 CL
4. Compute and assign S.A, S.B, S.C (standard linked list stuff)
5. In your free_wrapper, search your linked list for S.B, then call actual free with S.A.
6. Then free the memory for the structure, adjusting the head and S.C pointers as necessary.

If you need to, you can optimize the linked list (e.g., insertion sort during malloc, faster search during free); a minimal sketch of such wrappers follows below.
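Roughly, such a pair of wrappers could look like the sketch below (the names aligned_malloc()/aligned_free() and the fixed 32-byte line size are placeholders; pick the line size of your core and add whatever error handling you need):

#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 32u                     /* Zynq-7000 L1/L2 line size; adjust per core */

/* One bookkeeping node per allocation, as described in the steps above. */
typedef struct align_node
{
	void              *pActual;            /* A: pointer returned by malloc()   */
	void              *pAligned;           /* B: cache-line aligned pointer     */
	struct align_node *pNext;              /* C: next node in the list          */
} align_node_t;

static align_node_t *pHead = NULL;         /* the static "head" of the list     */

void *aligned_malloc( size_t nBytes )
{
	align_node_t *pNode = malloc( sizeof( *pNode ) );

	if( pNode == NULL )
		return NULL;

	pNode->pActual = malloc( nBytes + CACHE_LINE );      /* requested size + 1 CL */
	if( pNode->pActual == NULL )
	{
		free( pNode );
		return NULL;
	}

	pNode->pAligned = (void *)( ( (uintptr_t)pNode->pActual + ( CACHE_LINE - 1u ) )
	                            & ~(uintptr_t)( CACHE_LINE - 1u ) );
	pNode->pNext = pHead;                                /* insert at the head    */
	pHead = pNode;

	return pNode->pAligned;
}

void aligned_free( void *pBuf )
{
	align_node_t **ppLink = &pHead;

	while( *ppLink != NULL )
	{
		if( (*ppLink)->pAligned == pBuf )
		{
			align_node_t *pNode = *ppLink;

			*ppLink = pNode->pNext;                      /* unlink the node       */
			free( pNode->pActual );                      /* free the real buffer  */
			free( pNode );                               /* free the bookkeeping  */
			return;
		}
		ppLink = &(*ppLink)->pNext;
	}
	/* pBuf was not returned by aligned_malloc(): ignore or assert here. */
}

An aligned_calloc() can be built on top of this by memset()ing the returned block to zero.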

As I said previously, such an exercise will help you hone your skills if you're inexperienced, and should be trivial if you're experienced.

Anyway, this issue should be marked resolved. 

P.S.  I agree that Xilinx *should* provide cache-line aligned memory allocation and/or access to a C11 compiler...but until they do, you have a workaround, and really, this whole thread should have a solution by now.  Surely 10+ days is enough to write your own wrappers for cache-line aligned memory allocation.

galati (Adventurer)

Dear colleagues,
thank you very much for your help and the ideas!

I appreciate it!
DG
