Observer seb.mitchison
464 Views
Registered: 03-15-2018

Single PS Master AXI Reads/Writes Bandwidth Maximization Through Dead Time Reduction

Hi,

I am working on a Zynq US+ system. The PS is generating AXI reads/writes through a 64-bit HPM0 port, through an AXI Interconnect, and into a custom AXI slave with an SPRAM. I have a simple software routine in SDK that loops through reading and writing my custom AXI slave's contents.

#include <stdint.h>
#include "xil_io.h"    /* Xil_In64(), Xil_Out64() */

uint64_t reg_value;
int i;

/* Read 256 bytes back from the slave, one 64-bit word at a time */
for (i = 0; i <= 255; i += 8)
{
    reg_value = Xil_In64(0xA0030000 + i);
}

/* Write 256 bytes to the slave, one 64-bit word at a time */
for (i = 0; i <= 255; i += 8)
{
    Xil_Out64(0xA0020000 + i, i);
}

When I check the ILA, I can see that the writes from the PS occur 'back to back', or at least as fast as my infrastructure can accept them, but the reads have 14 clocks of dead time between one read completing and the next being generated. See the ILA waveforms below.

Is there any way to reduce this dead time and force the PS to generate the next read transaction either a) immediately after issuing the first, or b) immediately after completing the first? It seems I could triple my bandwidth if I could do this. Is there a technical reason for the processor to handle reads and writes so differently?
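(For what it's worth, the arithmetic behind "triple": from the waveforms each read appears to occupy roughly 7 active clocks, so a full read cycle is about 7 + 14 = 21 clocks, and removing the 14 dead clocks would give 21 / 7 = 3x the read throughput. The 7-clock figure is an eyeball estimate from the capture, not a measurement.)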

Note that I am not trying to do a DMA at this point; I just want to maximise the bandwidth of single AXI reads/writes.

[ILA waveforms: axi_read_latency_64bit.png, axi_write_latency_64bit.png]
Any help appreciated!

Cheers,
Seb

1 Solution

Accepted Solutions
Voyager
391 Views
Registered: 02-01-2013

Re: Single PS Master AXI Reads/Writes Bandwidth Maximization Through Dead Time Reduction

It’s unclear whether your SW is running on the APU or an RPU. Either way, though, I think the problem is the same: you’re not seeing the whole picture—literally.

You’re showing an ILA capture. The ILA sees what happens within the PL. You’re not seeing the associated steps that occur inside the PSU. An AXI_HPM0_* port is not the end of a transaction’s pathway. The read transactions still need to propagate through switches in the PSU until they finally arrive at the APU/RPU running your SW. Assuming you’re running code on the APU, this is the extra path you still need to consider:

[Block diagram (2019-01-13_12-25-41.jpg): PSU interconnect path from the AXI_HPM0 port back to the APU]

So the ILA is not capturing the launch of the read transaction before it arrives in the PL, nor the return trip of the read's completion from the PL boundary back to the requester. Those additional paths could account for much of the 14 clocks of apparent ‘dead time’.

There also may be a contribution to the delay from your actual code. Execution of the code might be waiting for a read to complete before evaluating the loop, thereby stalling the launch of the next read. Instead of only one read per loop iteration, try:

uint64_t reg_value1, reg_value2, reg_value3, reg_value4;
uint64_t reg_value5, reg_value6, reg_value7;    /* reg_value and i as in your snippet */

/* Eight independent reads per iteration, so the loop-control overhead
   is amortized across reads instead of paid once per read */
for (i = 0; i <= 255; i += 64)
{
    reg_value  = Xil_In64(0xA0030000 + i);
    reg_value1 = Xil_In64(0xA0030000 + i + 8);
    reg_value2 = Xil_In64(0xA0030000 + i + 16);
    reg_value3 = Xil_In64(0xA0030000 + i + 24);
    reg_value4 = Xil_In64(0xA0030000 + i + 32);
    reg_value5 = Xil_In64(0xA0030000 + i + 40);
    reg_value6 = Xil_In64(0xA0030000 + i + 48);
    reg_value7 = Xil_In64(0xA0030000 + i + 56);
}

(Apologies if I botched the C code; it’s not my bag, Man.)

You're seeing the writes occur 'faster' because you're watching them get flushed out of the write data path, where they've been queued up from the much faster processor end. Writes complete faster than reads across similar, wait-capable interfaces.
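If you want to see the asymmetry from the SW side too, here's a rough timing sketch (assuming you're running on the APU, where the BSP's xtime_l.h global timer and xil_printf are available; the addresses are the ones from your post):

#include <stdint.h>
#include "xil_io.h"
#include "xtime_l.h"      /* XTime, XTime_GetTime() */
#include "xil_printf.h"

void time_pl_access(void)
{
    XTime t0, t1, t2;
    volatile uint64_t sink;   /* volatile so the reads aren't optimized away */
    int i;

    XTime_GetTime(&t0);
    for (i = 0; i <= 255; i += 8)        /* 32 x 64-bit reads */
        sink = Xil_In64(0xA0030000 + i);
    XTime_GetTime(&t1);
    for (i = 0; i <= 255; i += 8)        /* 32 x 64-bit writes */
        Xil_Out64(0xA0020000 + i, i);
    XTime_GetTime(&t2);

    (void)sink;
    xil_printf("reads: %d ticks, writes: %d ticks\r\n",
               (int)(t1 - t0), (int)(t2 - t1));
}

Keep in mind the write figure mostly reflects the CPU-side cost of posting the stores; the last writes may still be in flight on the bus when the final timestamp is taken.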

-Joe G.

 

3 Replies

Observer seb.mitchison
360 Views
Registered: 03-15-2018

Re: Single PS Master AXI Reads/Writes Bandwidth Maximization Through Dead Time Reduction

Hi Joe,

Thanks for your response.

I can see how the 'dead' cycles I see in the ILA may be consumed by the processor along the routes in the block diagram you highlight. Presumably the processor cannot issue the next read until the previous one completes, unlike writes, where it clearly does not wait for the write response. I felt your point about the for loop could be relevant, so I 'unrolled' it in the SDK code I am running on the APU, but unfortunately it made no difference to the waveforms captured in the ILA.

Interestingly (or perhaps worryingly), I re-customized my AXI Interconnect to optimize for 'maximize performance' instead of 'custom' to see if that would reduce the PL latency, but it actually increased the PL latency by 4 clock cycles, so that was a no-go for speeding things up... reverted back to custom!

Looks like I will have to live with the slow ad-hoc read accesses... If I were to sort out a DMA engine to move data from the PL BRAMs into DDR4, do you know if the processor can complete reads from the DDR4 faster than from a PL BRAM, or is it subject to the same processing overhead?

Voyager
350 Views
Registered: 02-01-2013

Re: Single PS Master AXI Reads/Writes Bandwidth Maximization Through Dead Time Reduction

We need to highlight the distinction here between latency and throughput. Optimizing your AXI Interconnect block for performance likely inserted buffering stages that would have helped its overall throughput, but proved counterproductive in your actual quest to reduce latency.

AXI is an interconnect protocol and methodology. It's meant to be quite flexible and extremely general, and to support jaw-dropping throughput. But what you want is minimal latency, and unfortunately the best way to get low latency is through simplicity.

There may be ways to set up the system to wring out a few clocks of latency during PS reads of the PL, but I've never investigated that. An alternative approach would be to change your system to rely upon a high-throughput transfer to move data from the PL to the PS (e.g., using OCM rather than DDR), then have your SW deal with the data in the PS.
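On the first point, one knob that might be worth a try (untested on my end, and assuming the standalone BSP's xil_mmu.h API with its NORM_NONCACHE attribute macro): if I remember right, the default translation tables map the PL window as Device memory, which as I understand it keeps the A53 from pipelining reads there. Remapping the 2MB block holding your slave as Normal non-cacheable memory lets the core have multiple reads outstanding, which could overlap some of the round-trip latency:

#include "xil_mmu.h"    /* Xil_SetTlbAttributes(), NORM_NONCACHE */

/* Remap the 2MB translation block containing the PL slave from Device
 * to Normal non-cacheable. Only safe because the slave is plain RAM:
 * Normal memory permits speculative/out-of-order reads, which must
 * never target registers with read side effects. */
Xil_SetTlbAttributes(0xA0030000, NORM_NONCACHE);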

As long as your design success doesn't hang upon the latency of the very first read, DMA will certainly help. Overall, you can shorten your average read time considerably by bursting data ahead of time from the PL to the PS, as sketched below.
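If you do go the DMA route, the flow could look something like this (a sketch only, not a worked design: it assumes an AXI CDMA instance in the PL driven by the standalone XAxiCdma driver in polled mode, and the device ID and OCM staging address are hypothetical placeholders):

#include "xaxicdma.h"
#include "xparameters.h"
#include "xil_cache.h"

#define PL_BRAM_BASE  0xA0030000UL   /* your PL SPRAM window */
#define OCM_STAGE     0xFFFC0000UL   /* hypothetical staging buffer in OCM */
#define XFER_BYTES    256

int burst_pl_to_ocm(void)
{
    XAxiCdma Cdma;
    XAxiCdma_Config *Cfg = XAxiCdma_LookupConfig(XPAR_AXI_CDMA_0_DEVICE_ID);

    if (Cfg == NULL ||
        XAxiCdma_CfgInitialize(&Cdma, Cfg, Cfg->BaseAddress) != XST_SUCCESS)
        return XST_FAILURE;

    /* One simple (non-scatter-gather) transfer, polled: PL BRAM -> OCM */
    if (XAxiCdma_SimpleTransfer(&Cdma, PL_BRAM_BASE, OCM_STAGE,
                                XFER_BYTES, NULL, NULL) != XST_SUCCESS)
        return XST_FAILURE;

    while (XAxiCdma_IsBusy(&Cdma))
        ;   /* spin until the burst completes */

    /* If the staging buffer is cached, invalidate before the CPU reads it */
    Xil_DCacheInvalidateRange(OCM_STAGE, XFER_BYTES);
    return XST_SUCCESS;
}

The CPU then works on the staged copy in OCM at local-memory speeds instead of paying the PL round trip on every word.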

If you think about it, that's the way faster and faster DDR memory has benefitted processor design. The underlying memory inside DDR still takes tens of nanoseconds to perform a read look-up--just like it has for decades--but it fetches a lot more data than just the first item requested, and then it shoves all of that data into a super-fast interface back to the processor. After the initial latency period, the CPU has all of the data it wants, all queued-up in a cache, waiting to be used.

-Joe G.