
R5 <-> PL Fast exchange for small packet of data


Hello everyone,

My goal is to find the fastest way to exchange a small amount of data from the PL to R5-0 and from R5-0 to the PL.

Small amount of data: less than 48,000 bits.

I have a Zynq UltraScale+ 4EG. I see two ways to look at the problem.

 

The first way is to define how many bits I want to transfer and then search for the fastest read/write tool.

The second way is to decide how much time I am willing to spend and see how many bits I can read/write in that time.

 

I started with 3 BRAM + 1 TIMER. 

[Screenshot: block design with 3 BRAMs + 1 timer]

[Screenshot: AXI BRAM controller configuration]

// Vitis / R5 / FreeRTOS
// Pointer method: write 300 x 32-bit words to 3 BRAMs

	u32* p_add_bram0 = (u32*) XPAR_AXI_BRAM_CTRL_0_S_AXI_BASEADDR;
	u32* p_add_bram1 = (u32*) XPAR_AXI_BRAM_CTRL_1_S_AXI_BASEADDR;
	u32* p_add_bram2 = (u32*) XPAR_AXI_BRAM_CTRL_2_S_AXI_BASEADDR;
	u32 i0 = 0, i1 = 0, i2 = 0;

.....
	XTmrCtr_SetResetValue(&TmrCtrInstancePtr, TmrCtrNumber, RESET_VALUE);
	XTmrCtr_Start(&TmrCtrInstancePtr, TmrCtrNumber);

	// One 32-bit store per iteration, 100 words per BRAM (300 words total)
	while (i0 < 100) { *p_add_bram0++ = i0++; }
	while (i1 < 100) { *p_add_bram1++ = i1++; }
	while (i2 < 100) { *p_add_bram2++ = i2++; }

	XTmrCtr_Stop(&TmrCtrInstancePtr, TmrCtrNumber);
	result = XTmrCtr_GetValue(&TmrCtrInstancePtr, TmrCtrNumber);

>> I got 20,350 clock cycles @ 300 MHz (≈68 µs) for writing 300 × 32 bits.
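
(Sanity check: 20,350 cycles / 300 MHz ≈ 67.8 µs for 300 × 32 bits = 1,200 bytes, i.e. roughly 17.7 MB/s.)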

 

Then I heard about BURST so I tried it.

// Vitis / R5 / FreeRTOS
// memcpy method: write 300 x 32-bit words to 3 BRAMs
....
	XTmrCtr_SetResetValue(&TmrCtrInstancePtr, TmrCtrNumber, RESET_VALUE);
	XTmrCtr_Start(&TmrCtrInstancePtr, TmrCtrNumber);

	memcpy((void *)XPAR_AXI_BRAM_CTRL_0_S_AXI_BASEADDR, (void *)Data_in, 400); // 100 x 32 bits = 400 bytes
	memcpy((void *)XPAR_AXI_BRAM_CTRL_1_S_AXI_BASEADDR, (void *)Data_in, 400); // 400 bytes
	memcpy((void *)XPAR_AXI_BRAM_CTRL_2_S_AXI_BASEADDR, (void *)Data_in, 400); // 400 bytes


	XTmrCtr_Stop(&TmrCtrInstancePtr, TmrCtrNumber);
	result = XTmrCtr_GetValue(&TmrCtrInstancePtr, TmrCtrNumber);

>> I got 16,871 clock cycles @ 300 MHz (≈57 µs) for writing 300 × 32 bits.
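
(Sanity check: 16,871 cycles / 300 MHz ≈ 56 µs for the same 1,200 bytes, i.e. roughly 21 MB/s, only marginally better than the pointer loop.)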

I wonder whether I am really using bursts with memcpy, though.

This is where I am so far with small data transfers between R5-0 and the PL.

 

Did I make a mistake? I was hoping for a much faster transfer.

Could you tell me a faster way to transfer less than 48,000 bits?

If I have a time budget of 10 µs to transfer as many bits as possible, which IP should I use?

 

I started looking at the ACP, but it only serves the A53s (APU), not the R5s (RPU), doesn't it?

 

Thank you in advance for your support.

Regards.

MT

Re: R5 <-> PL Fast exchange for small packet of data


Hello,

Is anyone around who can tell me whether this is the fastest way to exchange a small amount of data between the RPU and the PL?

Regards,

 

MT

Re: R5 <-> PL Fast exchange for small packet of data


Hi,

 

I used Xil_SetTlbAttributes(XPAR_AXI_BRAM_CTRL_0_S_AXI_BASEADDR, 0x30b);

 

But now I am getting results that seem too good... I suspect I am doing something wrong.

 

Is that possible?

 

Write 300 * 32 bits --> ~2 µs        

Read 300 * 32 bits --> ~2 µs

 

Write 300 * 64 bits -->  3.4µs        

Read 300 * 64 bits --> 3.4µs

 

 

 

MT

Re: R5 <-> PL Fast exchange for small packet of data


I was worried that it was too fast... maybe the data is only sitting in the cache and not in the actual BRAM.

I did some tests and I think that is the case; I probably have to flush the cache (I do not really know yet what that does, I need to keep learning).

I tried Xil_DCacheFlushRange and that seems to fix my writes:

 

	//Write
	memcpy((void *)XPAR_AXI_BRAM_CTRL_0_S_AXI_BASEADDR,(void*)Data_in32_b1,400); // 100 x 32 bits = 400 bytes
	Xil_DCacheFlushRange(XPAR_AXI_BRAM_CTRL_0_S_AXI_BASEADDR,400);

I still need to work out what to do for the reads...

 

If you read this and have some tips, feel free to share them.

Regards

MT

Re: R5 <-> PL Fast exchange for small packet of data


@mt-user-2019,

Do be aware that the AXI block RAM controller has a 3-clock loss per burst, and can only handle a single burst at a time.  Hence, for singleton (one-beat) transfers through the AXI block RAM controller, the best throughput you can achieve is 25% of the maximum: 3 stall clocks plus 1 data beat means 4 clocks per word.

Here's a full AXI controller that achieves 100% throughput.  Might help.

Dan


Re: R5 <-> PL Fast exchange for small packet of data


Hello @dgisselq,

Thank you for your link. I will have a look.

I will first try to finish measuring the maximum throughput of R5 <-AXI-> BRAM.

Regards

MT

Re: R5 <-> PL Fast exchange for small packet of data (Accepted Solution)


Quick review

 

Using only Xilinx IP: R5 <--> AXI SmartConnect <--> AXI BRAM controller <--> BRAM (port A)

AXI BRAM controller: full AXI (not AXI-Lite) | bursts enabled | clock: 300 MHz

BRAM: true dual port

 

Key points on the bare-metal R5 side:

 

Make the memory region associated with the BRAM controller cacheable:

INTPTR Localaddr = XPAR_AXI_BRAM_CTRL_0_S_AXI_BASEADDR; // 32-bit base address
Localaddr &= (~(0xFFFFFU));                              // align down to a 1 MB boundary
Xil_SetMPURegion(Localaddr, 0xFFF, 0x30b); // 0x30b: I do not know where this number comes from :'( please reply if you know
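
It looks like 0x30b may simply be the OR of two MPU attribute macros from the BSP's xreg_cortexr5.h (I am not 100% sure, please correct me if this is wrong), so the call could also be written as:

	// Assuming the standard Cortex-R5 BSP definitions in xreg_cortexr5.h:
	//   PRIV_RW_USER_RW    = 0x3 << 8  -> full read/write access
	//   NORM_NSHARED_WB_WA = 0xB       -> normal memory, non-shareable, write-back, write-allocate
	Xil_SetMPURegion(Localaddr, 0xFFF, PRIV_RW_USER_RW | NORM_NSHARED_WB_WA); // == 0x30b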

Some others used Xil_SetTlbAttributes(.....); instead.

 

Write your data with memcpy (it goes through the cache), then flush it (I am new to this, so I do not know if I am making a mistake):

	//Write
	memcpy((void *)XPAR_AXI_BRAM_CTRL_0_S_AXI_BASEADDR,(void*)Data_in32_b1,400); // 100 x 32 bits = 400 bytes
	Xil_DCacheFlushRange(XPAR_AXI_BRAM_CTRL_0_S_AXI_BASEADDR,400);

 

For reads, you probably need to invalidate (or flush) the cached range before copying.
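
Something like this is what I have in mind for the read side (an untested sketch; Data_out32_b1 is just a local destination buffer, not shown here):

	//Read (sketch)
	Xil_DCacheInvalidateRange(XPAR_AXI_BRAM_CTRL_0_S_AXI_BASEADDR, 400); // drop any stale cached copy of the BRAM range first
	memcpy((void *)Data_out32_b1, (void *)XPAR_AXI_BRAM_CTRL_0_S_AXI_BASEADDR, 400); // 100 x 32 bits = 400 bytes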

 

 

It takes less than 9 µs to:

Write 100 × 32 bits into bram_0

Write 100 × 32 bits into bram_1

Write 100 × 64 bits into bram_2

Write 100 × 64 bits into bram_3

--> about 266 MB/s

 

It takes less than 18 µs to:

Read 100 × 32 bits from bram_0

Read 100 × 32 bits from bram_1

Read 100 × 64 bits from bram_2

Read 100 × 64 bits from bram_3

--> about 133 MB/s
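
(Cross-check: each pass moves 2 x 400 bytes + 2 x 800 bytes = 2,400 bytes, so 2,400 bytes / 9 µs ≈ 266 MB/s for writes and 2,400 bytes / 18 µs ≈ 133 MB/s for reads.)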

If you think I made a mistake PLEASE tell me

Regards

 

 

MT


Re: R5 <-> PL Fast exchange for small packet of data


@mt-user-2019 wrote:

...

--> 266 666 666 M octets / s

...

-->  133 333 333 M octets / s

 

If you think I made a mistake PLEASE tell me


You might want to fix your units.  I'm pretty sure that you aren't getting transfer rates in the millions of Megabytes per second.

Dan


Re: R5 <-> PL Fast exchange for small packet of data


@dgisselq 

Thank you, it took me a while to see the mistake.

Do those numbers look 'normal' to you?

 

Mathieu

Now I have so many other things to do... I need to go over to the Linux side and try out GUI things, maybe Qt... then to the FPGA side, where I am a total newbie; it will be fun.

MT

Re: R5 <-> PL Fast exchange for small packet of data


@mt-user-2019,

I'm one of those (crazy/deranged/pick your word) people who actually studies CPU designs and bus implementations for "fun".  I've written about them extensively on the ZipCPU blog.  I'm hoping to run some of my own measurements soon, so ... that's why I'm here paying attention to whatever you might write.  As such, I only have part of the picture, but the parts and pieces I do have are quite interesting.

Let's start at the top.

  • CPUs are instruction based.  A CPU pipeline often stalls on a read instruction, since the CPU (often-enough) cannot complete its next instruction without the read result coming back from the bus first.  Writes might also be similarly limited, if there's ever any chance that a write might produce a bus error and the computer needs to halt at that point.  These limitations apply to a library memcpy routine.  Since it's instruction based, the CPU will read a value, write a value, read a value, write a value.  The better algorithms are word based, so they might read/write 32-bits at a time or perhaps even 64-bits at a time (see for example newlib)--but whether or not this is possible is very much dependent upon the alignment of the two memory areas and the length of the transfer.  If they are misaligned with respect to each other, it might take you 4x-8x as long to do your memory copies (see the word-copy sketch after this list).  Indeed, accessing memory is one of the slowest parts of the CPU and often the bottleneck in its performance.  (Several CPU bottlenecks exist in general, memory bandwidth is a significant one.)
  • The typical way a CPU handles this is through a cache.  Reads from a cache can accomplish many reads at once--but only if the area they are reading from is *memory*.  In other words, this only works if the memory region's values don't change without the CPU knowing about it.  If there's a possibility that the value might change and the CPU doesn't know about it, then the CPU can't use the cache at all.  Writes are similarly tricky.  Ideally, you want to keep everything in cache until the last possible moment when the CPU writes everything out.  This can lead to other problems where the CPU believes one piece of memory is valid when it isn't.  Indeed, cache coherency (as this problem is called) is a serious problem, and it's not trivially solved.  The PL->PS interface only makes this even more challenging to get right.
  • AXI is built around burst transactions.  Depending on the bus configuration, you might be able to burst up to 4kB of information at a time (AXI4 w/ a bus width of 256).  Here's the trick, though: you need to know ahead of time that you are going to move that much data.  This makes it difficult for a CPU to optimize its accesses in a memcpy, since the CPU is only ever moving one byte (or word--depending on the algorithm and alignment) at a time.
  • For this reason, if you want *FAST* data transfer speeds, the best way to do it is with a DMA type of operation, where you tell the DMA ahead of time how much data you want to move and from where to where.  memcpy can't compare to that.  I have cores that I've worked with (all published on github) that can completely fill the AXI bus with transfers so that there are no idle cycles.  Chances are this isn't what's happening in your experiments--why?  Because you are using memcpy.
  • A hardware memory copy is commonly called a DMA.  DMAs, however, have other problems.  Most modern processors use Memory Management Units (MMUs).  When you use an MMU, the address your program sees may not be the actual physical address on the bus.  The DMA needs to know the physical address on the bus.  In other words, if you wish to use a DMA, you need to coordinate your operation with the MMU.
  • So, let's say you are coming from either a CPU or a DMA.  That would be the bus master that controls the operation.  What's the next step?  The interconnect.  I like to think of an interconnect as simply a crossbar, something that accepts up to N inputs and routes them to N of M outputs based upon their address.  That much I've built on my own, and I've been able to achieve 100% throughput performance there as well.
  • The sad part of such a view of an interconnect is that it's a touch naive.  Xilinx's interconnect is highly configurable.  It will automatically handle clock crossings (mine doesn't--you'd have to do those manually although they aren't that hard to do), reset crossing (I haven't tried that before), data width conversions (still just thinking about how I'd design some of those), conversions to/from AXI-lite (I've done that) or other protocols.  Every conversion has a cost.  In AXI, the minimum cost is two clocks per conversion, but the reality is they can cost much more.  In my own crossbar interconnect alone, it costs me a minimum of four clocks per transaction.  I also have a series of performance charts I've kept from all of the cores I've worked on--but my point is simply that the logic adds up.
  • Further, depending upon how the interconnect is set up, you might only be able to transfer one burst at a time--either read or write.  To me, this is horrible performance.  However, if you increase the configuration to achieve better performance, your logic usage will go up tremendously, and you'll be likely to break Xilinx's IP packager auto-generated slave cores.  (The demo code is quite broken, and only triggered if the AXI interconnect is configured for high performance.  In these cases, the slave core might cause the entire bus to seize.  Again, these bugs are all discussed on the ZipCPU blog, together with examples of better performing slaves.)
  • One issue I'm currently studying is how AxID's impact interconnect performance.  My own interconnect acts as though all bursts have the same AxID, and it can achieve 100% throughput across bursts from any single master to any single slave on both read and write channels simultaneously.  The ID's are handled in a protocol appropriate fashion--they are "reflected" back to the master, but that's another story.  The problem with this whole setup is that the interconnect cannot change allocations until it has assured that no other bursts will be returned under the current allocation.  In my case, that means that once a master starts talking to a particular slave, *all* of that slave's responses must be returned before the master will be allowed to issue a request of a second slave.  AXI, by design, allows better performance than that by using multiple ID's--allowing the interconnect to route packets from a slave to the right master.  Theoretically, this should improve performance.  Practically, it takes several extra clock cycles of work to get this logic correct, and I'm still looking into how to do it properly.
  • Finally, when you get through the interconnect, you finally get to the slave.  Here again, there's good and bad performance.  For example, Xilinx's IP packager generated custom slaves aren't known for their throughput (or being bus compliant--another discussion).  Their AXI-lite slave at best gets 1 write transaction every three clocks (33% throughput), and one read transaction every other clock (50% throughput)--this is in addition to whatever delay the interconnect has already added to the data path.  (While my AXI-> AXI-lite bridge can achieve 100% throughput, Xilinx's tends to only allow one request outstanding at a time, so the second transaction will not be issued until the first completes.) Their IP packager/demonstration AXI4 (full) slave achieves (at best) 100% write throughput (minus a couple clocks to set up the first burst), and 50% read throughput.  However, if you happen to issue a write while a read transaction is ongoing, then the core tries to wait for the read to complete.  (It doesn't do this very successfully.)
  • Xilinx's block RAM slave has slightly better performance.  It can achieve 100% throughput on both read and write channels with the exception of a 3-cycle stall at the beginning of every burst.  This means that the memcpy approach, which will only write one word at a time, will take 4 clocks to read or write each word (unless the cache is configured to cache this data, then you might get better depending on cache performance).  This is in addition to whatever performance was lost in the interconnect, or in the CPU waiting for a response to a read, or ... whatever.
  • Achieving full 100% throughput, across bursts, and with no down-time between bursts, has been one of the goals of my own personal research.  All of this work is posted on GitHub, so you are welcome to try it out.  As I mentioned above, I do have an AXI slave processor posted that can handle full 100% throughput, so it should work faster than what you have.
  • Finally, this is one of those places where open source can help.  An open source AXI bus design, from master to slave, can make it possible to track where every clock goes and why.  For this purpose, I recently posted an AXI DMA checker on-line, so you can measure that.  That checker achieves 100% memory to stream throughput across many bursts, but as built doesn't quite get there when going from stream to memory.  (It gets close, though.)
  • I've also posted an article on the blog going through a lot of the pieces of how the bus can slow a processor down.  It's called, Why does blinky make a CPU appear to be so slow?.  There's a lot to be learned by going through it, as I eventually get a soft-core CPU running at 100MHz to blink a peripheral LED at (nearly) 50MHz.  It's not a trivial task, but the article goes through a lot of what's going on along the way.
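
To make the first bullet concrete, here's a minimal sketch of a word-based copy (not any particular library's implementation, just an illustration assuming the source and destination are both word aligned):

	#include <stdint.h>
	#include <stddef.h>

	// Copy len bytes using 32-bit words where possible, byte by byte for the tail.
	// A real memcpy adds alignment fix-up on top of this; if src and dst are
	// misaligned with respect to each other it may have to fall back to byte
	// accesses, which is where the 4x-8x penalty comes from.
	static void copy_words32(void *dst, const void *src, size_t len)
	{
		uint32_t       *d = (uint32_t *)dst;
		const uint32_t *s = (const uint32_t *)src;
		size_t          nwords = len / 4;

		while (nwords--)        // one 32-bit load + one 32-bit store per word
			*d++ = *s++;

		uint8_t       *db = (uint8_t *)d;       // remaining 0-3 tail bytes
		const uint8_t *sb = (const uint8_t *)s;
		for (size_t i = 0; i < (len % 4); i++)
			db[i] = sb[i];
	}

Even in the best (aligned) case, each word is still an individual CPU load and store, which is why a DMA that knows the full transfer length up front can do so much better.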

Hope this helps you out in your understanding a bit.  If not, I might be able to offer open source references to my own research--in case that clears things up.  I could have linked about 10 articles and examples to various topics above, only I've been told that I hit the spam filter if I ever do that, otherwise my references might've been more valuable.  In other words, if you want a link to something I've referenced above, please ask.  The information is all available online on either my blog or (for examples) in my Github repo.

Dan


Re: R5 <-> PL Fast exchange for small packet of data


Hello deranged,

Thank you for your reply. I will have to read it again.

Are you saying that if I use a DMA I can achieve a better transfer time between the BRAM and the R5 (or A53) cache? I probably have to read more about it and try.

I will always transfer the same amount of data at once, from the same place (the BRAM) to the same place (the R5 or A53 cache).

Something like: when an IRQ occurs, read the BRAM, write the BRAM, do something with the data, then wait for the next IRQ.

Thank you again for your post.

Regards

 

MT