schindlerto
Observer
657 Views
Registered: ‎02-19-2020

UltraScale R5: Read multiple data to AXI with memcpy wrapper

I have some strange problems implementing a wrapper for writing multiple values to AXI using memcpy:

Writing the data in “A” to “addr” with a length of 4:

float A[4]={1,2,3,4};
float C[4]={5,4,3,2};

void testebench(){
    uz_writeMultipleFloatToAxi(&A[0], addr, 4);     // with wrapper
    memcpy( (void *)addr, A, 4*sizeof(float) );     // without wrapper
    uz_readMultipleFloatFromAxi(&C[0], addr2, 4);   // read 4 floats
}

static void uz_readMultipleFloatFromAxi(float *PointerToDataOnPS, uintptr_t ReadFromAXIAddr, size_t numberOfElements){
    memcpy( PointerToDataOnPS, (void *)ReadFromAXIAddr,numberOfElements*sizeof(float) );
}

static void uz_writeMultipleFloatToAxi(const float *SourcePointer, uintptr_t writeToAXIAddr, size_t numberOfElements){
    memcpy( (void *)writeToAXIAddr, SourcePointer,numberOfElements*sizeof(float) );
}

The write and read wrappers are basically the same function; only source and destination are swapped.

The problem is now:

  1. uz_writeMultipleFloatToAxi does not work with compiler optimization set to -O0, i.e., no values are written to the IP core
  2. a plain memcpy without the wrapper works with -O0 and -O2
  3. uz_writeMultipleFloatToAxi does work with -O2
  4. uz_readMultipleFloatFromAxi works with -O0 and -O2
0 Kudos
16 Replies
schindlerto
Observer
573 Views
Registered: ‎02-19-2020

*bump*

0 Kudos
schindlerto
Observer
526 Views
Registered: ‎02-19-2020

Any input at all, @xilinx? Is this a known bug? I added volatile to the function arguments, but with no success. I also have no idea why this works with optimization levels above -O0 but not with -O0.

0 Kudos
dgisselq
Scholar
497 Views
Registered: ‎05-21-2015

@schindlerto ,

When you say it "doesn't work", what is happening?  Is any transaction taking place at all?  Can you tell?  Is it just not landing on your IP?

I guess I'm struggling to know where to start here.  Perhaps a more detailed description of what happens for one of those might help.

Dan

schindlerto
Observer
473 Views
Registered: ‎02-19-2020

"does not work" means that the data does not land in the IP, while "work" means it does. More specifically, the "working" variant triggers an AXI burst transaction, whereas the "not working" variant generates seemingly random transactions on the AXI. See the attached pictures, which show an ILA single-shot capture triggered on AXI_AWADDR==120 (the starting address for the write).

The only thing I change between the two pictures is the optimization flag (-O0 vs. -O1), i.e., no C code is changed at all.

works_focus_o1.png
optimization_o0_wrongAXI.png
0 Kudos
schindlerto
Observer
451 Views
Registered: ‎02-19-2020

I dug around the forum and found quite a number of posts on this topic, but no clear "solution" and contradictory information (including both the claim that "memcpy" does not generate a burst on AXI and the claim that "memcpy is the correct way"). Additionally, half of the threads just link to each other.

https://forums.xilinx.com/t5/Processor-System-Design-and-AXI/How-do-I-use-burst-transfer-using-AXI-Full-interface/m-p/958398

https://forums.xilinx.com/t5/AXI-Infrastructure-Archive/How-Do-I-Perform-an-AXI-Burst-in-Software/td-p/592502

https://forums.xilinx.com/t5/Processor-System-Design-and-AXI/How-to-create-a-burst-transaction-by-a-Zynq-AXI-GP-Master/m-p/796170#M24425

https://forums.xilinx.com/t5/Processor-System-Design-and-AXI/R5-lt-gt-PL-Fast-exchange-for-small-packet-of-data/m-p/1092716

https://forums.xilinx.com/t5/Processor-System-Design-and-AXI/How-to-implement-burst-transactions-in-the-PS/m-p/837908


This one looks quite similar to my ILA capture, and the user claims that it is "a compiler issue" (they solved it by using the SDK):

https://forums.xilinx.com/t5/Embedded-Development-Tools/AXI-Transfer-from-PS-to-PL/m-p/1173074


I am on Vitis 2020.1. I tried a "custom" C implementation of memcpy (https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c), and this results in behavior similar to my "non-working" screenshot.

Furthermore, the following code works independently of compiler optimization (write function uses the global variable "A" instead of the pointer that is passed):

float A[4]={1,2,3,4};
float C[4]={5,4,3,2};

void testebench(){
    uz_writeMultipleFloatToAxi(&A[0], addr, 4);     // with wrapper
    memcpy( (void *)addr, A, 4*sizeof(float) );     // without wrapper
    uz_readMultipleFloatFromAxi(&C[0], addr2, 4);   // read 4 floats
}

static void uz_readMultipleFloatFromAxi(float *PointerToDataOnPS, uintptr_t ReadFromAXIAddr, size_t numberOfElements){
    memcpy( PointerToDataOnPS, (void *)ReadFromAXIAddr,numberOfElements*sizeof(float) );
}

static void uz_writeMultipleFloatToAxi(const float *SourcePointer, uintptr_t writeToAXIAddr, size_t numberOfElements){
    memcpy( (void *)writeToAXIAddr, A,numberOfElements*sizeof(float) );
}


I suspect that with -O0 the compiler does use "some" memcpy implementation while in all the other cases a platform-specific Xilinx/ARM implementation is used. However, this is pure speculation since I am quite out of my field of expertise here.
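One way to take the library memcpy out of the picture entirely would be to copy word by word through a volatile pointer. The following is only a minimal sketch of that idea, not a confirmed fix for the behavior above: the helper names write_floats_wordwise and read_floats_wordwise are hypothetical, and it assumes a 32-bit float and a 4-byte-aligned target address. Each element becomes exactly one 32-bit access, so it would most likely show up as individual writes rather than a burst.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is a 32-bit type on this target. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper (not from the thread): write n floats to a
 * memory-mapped region using one aligned 32-bit store per element.
 * The volatile qualifier requires every store to be performed as written. */
static void write_floats_wordwise(const float *src, uintptr_t dst_addr, size_t n)
{
    volatile uint32_t *dst = (volatile uint32_t *)dst_addr;
    for (size_t i = 0; i < n; ++i) {
        uint32_t bits;
        memcpy(&bits, &src[i], sizeof bits); /* copy the bit pattern in normal RAM */
        dst[i] = bits;                       /* single 32-bit store to the target */
    }
}

/* Hypothetical counterpart: read n floats using one 32-bit load per element. */
static void read_floats_wordwise(float *dst, uintptr_t src_addr, size_t n)
{
    const volatile uint32_t *src = (const volatile uint32_t *)src_addr;
    for (size_t i = 0; i < n; ++i) {
        uint32_t bits = src[i];              /* single 32-bit load from the target */
        memcpy(&dst[i], &bits, sizeof bits); /* copy the bit pattern back to a float */
    }
}
```

In the test case above, the call would look like write_floats_wordwise(A, addr, 4). The inner memcpy calls only touch local variables in ordinary memory, so they never access the AXI address range themselves.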

0 Kudos
Rmccarty
Adventurer
446 Views
Registered: ‎09-05-2020

Try adding 'const' attributes to A and B?

All I know is the optimization can do seemingly strange things if you are not super careful.

0 Kudos
schindlerto
Observer
426 Views
Registered: ‎02-19-2020

Do you mean to "A"? "C" is not constant.

Tried it with "const A[4]" for good measure and for lack of ideas, but this did not change anything. Please note that this is a simplified version that reliably reproduces the problem, in the spirit of a "minimal not working example". I also tried various combinations of volatile, inline, and the always_inline GCC attribute, without success. It really boils down to the optimization setting, or rather the lack of optimization, since it works for settings other than -O0.

I'd also be quite happy for someone (e.g., a Xilinx employee) to state that the problem is with my code, or to provide an MWE (maybe one that changes some cache settings, which is suggested in some of the threads and which I have not touched yet). But as it stands, this seems to be a bug?

0 Kudos
Rmccarty
Adventurer
414 Views
Registered: ‎09-05-2020

I dunno, but you are also not initializing the arrays properly.

1 is not necessarily the same thing as 1.0f. This is not trivial or nitpicking; leaving things vague can result in varying interpretations.

0 Kudos
schindlerto
Observer
409 Views
Registered: ‎02-19-2020

You are certainly right that initializing with 1.0 would be cleaner, but do you think this relates to the problem at hand? I do not, since the issue is not even related to the fact that floats are used here. The problem is also present if A & C are of type int32_t; the data type is just an artifact of the real application from which I derived this not-working example. The same is true for the variable names "A" and "C" without "B", but again this does not seem to relate to the actual issue, or does it?

0 Kudos
joancab
Advisor
367 Views
Registered: ‎05-11-2015

float A = 1;

is equivalent to

float A = (float) 1;

and is easy to prove with 2 lines in a helloworld.

If we are going to doubt such basic things about a language, we had better rewrite (or re-read) its standard (ISO/IEC 9899), with which compilers should comply; if one does not, simply flag it as a non-compliance bug.
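For anyone who wants to check this themselves, here is a small sketch of that "2 lines in a helloworld" test. The function name int_init_matches_cast is made up for illustration; the only thing it relies on is that C converts an integer initializer to the type of the object being initialized, as in simple assignment.

```c
#include <stdbool.h>

/* Hypothetical check (name invented for this example): an integer
 * initializer is converted to float, so all three variables hold 1.0f. */
static bool int_init_matches_cast(void)
{
    float a = 1;          /* integer constant, converted on initialization */
    float b = (float)1;   /* same conversion, written explicitly */
    float c = 1.0f;       /* float constant */
    float arr[4] = {1, 2, 3, 4}; /* same rule applies to each array element */

    return a == b && b == c && arr[0] == 1.0f && arr[3] == 4.0f;
}
```

The function returns true on a conforming compiler, which is why the integer initializers in the original snippet should not matter for the problem discussed here.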

0 Kudos
schindlerto
Observer
365 Views
Registered: ‎02-19-2020

what?

0 Kudos
dgisselq
Scholar
327 Views
Registered: ‎05-21-2015

@schindlerto ,

The only way I know of to get the ARM to use AXI bursts is to turn on caching for the memory region in question.  This is very risky to do with an FPGA (i.e. PL) type of memory space, since the whole point of the PL memory space is that your design can adjust it at will independent of the processor.  Not all PL memory types can be marked as cacheable.  You should be very careful when doing this--and know what you are doing, lest you run into this exact same sort of problem.  It's for all these reasons that the PL address space is marked by default as not cacheable.

If caching is enabled, then the CPU won't necessarily go to the memory in question.  It will read when it needs to fill a cache line, and write when it needs to flush a cache line, but you might not see any other reads or writes otherwise.  In this case, the issue of whether it reads or writes under one optimization setting or another is a red herring, and therefore has you looking in the wrong place.

The obvious ways to fix this behavior are 1) to turn caching back off for the memory region in question, or 2) to read/write a sufficiently sized chunk of memory that it doesn't fit in the cache.

See if either of those helps you,

Dan

0 Kudos
schindlerto
Observer
250 Views
Registered: ‎02-19-2020

Thanks for the input @dgisselq !

I actually already followed your advice unknowingly, since I did not change anything regarding caching of the memory region. I am aware that this is a possibility, but I do not know what I am doing in this regard, so I did not change it.

Furthermore, it feels wrong to me to "mess" with the caching, since all I want to do is write four values from the PS -> PL.

From what I have measured with the ILA so far, using memcpy actually triggers a burst transaction (see first ILA screenshot) without any other explicit steps (at least none that I took consciously). It's just that if the bug triggers (i.e., -O0), no valid AXI transaction happens at all, not even a "non burst" one. For comparison, I just created an ILA screenshot that shows single AXI write operations when I write to the four registers one by one with the following pseudocode:


float A[4]={1.0, 2.0, 3.0, 4.0}; // ;)

void testebench(){
    test(&A[0]);
}

// pseudocode: xil_out stands for a single register write
void test(float *arr){
    xil_out(addr1, arr[0]);
    xil_out(addr2, arr[1]);
    xil_out(addr3, arr[2]);
    xil_out(addr4, arr[3]);
}


So the problem with -O0 is not that the write fails to use an AXI burst, but that no valid write operation happens at all. Basically, I tell the R5 to copy data from one place to another, and the R5 just does not do it.

axi_singleWrite.png
0 Kudos
dgisselq
Scholar
221 Views
Registered: ‎05-21-2015

@schindlerto ,

I remember a similar problem from not that long ago, where the PS configuration either didn't get properly propagated from Vivado to Vitis or was somehow corrupted along the way.  The PS was then configured for the wrong device.  The solution was to rebuild the project to see if that fixed things.  (Yes, I thoroughly detest phantom files created by vendor software that can neither be inspected nor "fixed", but ... they can be problems.)

Dan

0 Kudos
schindlerto
Observer
165 Views
Registered: ‎02-19-2020

That's a good point. I'll "ask" a coworker to reproduce this problem independently of my build, and additionally create a completely clean version of my project, deleting all the cache files and so on.

Fun fact: I tried to use "Xil_MemCpy" from "xil_mem" and that did not work at all, i.e., I was not able to write valid data to my IP core in any of the configurations.

Xil_MemCpy_neverWorks.png
0 Kudos
schindlerto
Observer
98 Views
Registered: ‎02-19-2020

I tested it with a clean rebuild of everything (Vivado + Vitis), but the behavior is unchanged.

Any more ideas? Can anyone (read: Xilinx employee) reproduce this bug?

0 Kudos