07-23-2021 04:26 PM
Is there a way to reduce Ultrascale AXI read latency to the Cortex-A53? The Ultrascale AXI read latency seems much higher than the Zynq-7000 read latency.
Here are some posts I found:
1) This poster sees Ultrascale AXI read latency of about 232ns with a 250MHz clock: https://forums.xilinx.com/t5/Versal-and-UltraScale/Strategy-for-lowest-latency-to-access-AXI-register/td-p/1093568 .
2) Here they see 240ns latency between reads with a 100MHz clock: https://forums.xilinx.com/t5/Processor-System-Design-and-AXI/ZYNQ-ultrascale-MPSoC-How-to-improve-AXI-read-latency-and-use/td-p/829004 (latency is lower with AXI-lite, but I need burst)
My Ultrascale AXI4 read shows roughly 230ns of latency. Since the latency is similar across different bus speeds, maybe the delay is on the CPU side?
For the Zynq-7000 it appears people have found ways to reduce latency by changing CPU settings:
1) Changing CPU TLB settings dramatically reduced AXI latency: https://forums.xilinx.com/t5/Processor-System-Design-and-AXI/Latency-on-AXI-Interconnect/td-p/397977 (look for MyXil_SetTlbAttributes)
2) Here's a support article demonstrating low-latency: https://www.xilinx.com/support/answers/47266.html . (Does a similar example exist for Ultrascale?)
07-28-2021 02:10 PM
The post https://forums.xilinx.com/t5/Processor-System-Design-and-AXI/ZYNQ-ultrascale-MPSoC-How-to-improve-AXI-read-latency-and-use/td-p/829004 saw roughly 160ns latency between reads when using AXI-Lite, but I need a full AXI4 interface. By removing the interconnect, my full AXI4 interface read latency drops from ~230ns to ~150ns.
Here's a link talking about Ultrascale PL BRAM latency http://cs-people.bu.edu/rmancuso/files/papers/mpsoc_mem_OSPERT18.pdf . They see roughly 150ns read latency.
Is it possible to get below 150ns of read latency when accessing anything from the PL?
07-28-2021 06:01 PM
You can do better than both Xilinx's AXI (full) slave demo and their AXI block RAM controller. Both are poorly designed when it comes to throughput and latency. You'll save about 3-4 clocks by getting rid of the interconnect, and another 4 by getting rid of the AXI (full) to lite converter (and that assumes a good converter to begin with). The block RAM controller adds roughly 3 more clocks of latency per read than necessary. But if you are already down at 150ns, you might not do much better than 3 fewer clocks, or 135ns, assuming you are already running at a 200MHz clock rate.
If not, a faster PL clock rate can help.
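To make the clock arithmetic above concrete, here is a rough sketch of converting saved PL clock cycles into nanoseconds. The function names are made up for illustration, and the model assumes the clocks in question are all PL clocks and that nothing else in the path changes, which is a simplification.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Convert a count of PL clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


def latency_after_savings(baseline_ns, saved_cycles, clock_mhz):
    """Estimate read latency after removing saved_cycles clocks of PL latency.

    Assumption: the saved cycles all run at the PL clock rate and the
    remaining (PS-side) latency is unchanged.
    """
    return baseline_ns - cycles_to_ns(saved_cycles, clock_mhz)


# 150ns baseline, 3 clocks saved, at a few PL clock rates from this thread
for clock_mhz in (125, 200, 250):
    estimate = latency_after_savings(150.0, 3, clock_mhz)
    print(f"{clock_mhz} MHz: 3 clocks = {cycles_to_ns(3, clock_mhz):.0f} ns, "
          f"estimated latency = {estimate:.0f} ns")
```

At 200MHz this reproduces the 135ns figure. At a slower PL clock each clock is worth more nanoseconds, so the same 3 clocks would buy a bit more.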
There are some flags you can set to turn on caching of RTL memory--if that helps. (It usually doesn't, 'cause the whole purpose of a PL memory device is to change when the CPU isn't looking.)
Your biggest challenge will be that the ARM never issues more than one (poss. two) request of the PL at a time. This is what really kills any throughput you might otherwise have.
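As a back-of-the-envelope illustration of why the number of outstanding requests matters: if each read takes a fixed round-trip latency and only N reads are in flight at once, throughput is bounded by N times the bytes per read, divided by that latency. The sketch below assumes 8 bytes per read (one 64-bit load) and a constant latency; both are assumptions for illustration, not measured values, and bursts would change the bytes-per-read figure.

```python
def max_read_throughput_mb_s(bytes_per_read, latency_ns, outstanding=1):
    """Upper bound on read throughput in MB/s (1 MB = 10**6 bytes).

    Assumption: every read takes latency_ns round trip, and at most
    `outstanding` reads are in flight at any time.
    """
    # bytes per nanosecond times 1000 gives MB/s
    return outstanding * bytes_per_read * 1000.0 / latency_ns


# Assumed 8-byte reads at 150ns round-trip latency
one = max_read_throughput_mb_s(8, 150.0, outstanding=1)
two = max_read_throughput_mb_s(8, 150.0, outstanding=2)
print(f"1 outstanding read:  {one:.1f} MB/s")
print(f"2 outstanding reads: {two:.1f} MB/s")
```

Under these assumptions, shaving a few clocks off the latency moves the bound only slightly, while each additional request in flight scales it linearly.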
07-29-2021 11:43 AM
Thanks @dgisselq .
Did you write the following? https://zipcpu.com/blog/2019/05/29/demoaxi.html If so, it was amazingly helpful in getting a working/fast full AXI interface going!
Originally my design was:
[ Ultrascale ] M_AXI_HPM0_FPD @ 125MHz -> AXI Interconnect -> Custom AXI/RTL Module
I managed to remove about 8 clocks of read latency by using your excellent blog article and removing the interconnect.
Since I couldn't get below 150ns of read latency, I started looking for anything (including BRAM) that could reduce latency. However, I couldn't find anything, and BRAM is not my final goal.
As you said I didn't turn on caching because my device is always changing behind the main CPU's back. That's really interesting when you say: "Your biggest challenge will be that the ARM never issues more than one (poss. two) request of the PL at a time. This is what really kills any throughput you might otherwise have". I wish there was a way to do several back-to-back reads from the PL without a delay in between them.
07-29-2021 11:49 AM
> Did you write the following? https://zipcpu.com/blog/2019/05/29/demoaxi.html If so, it was amazingly helpful in getting a working/fast full AXI interface going!
Yes, that was me. I'm glad to hear you enjoyed it.
> That's really interesting when you say: "Your biggest challenge will be that the ARM never issues more than one (poss. two) request of the PL at a time. This is what really kills any throughput you might otherwise have". I wish there was a way to do several back-to-back reads from the PL without a delay in between them.
If I recall correctly, there is an ARM instruction to load or store multiple registers at a time. I'm not an ARM developer, however, so I have little more than a vague memory of the existence of such an instruction.
Typically, the better answer is to use a DMA of some type when data throughput is an issue. The ARM just isn't known for either high throughput or low latency. Better yet, put all of your high-bandwidth and low-latency stuff into the part of the logic that handles it best: the PL.
07-30-2021 01:53 PM
Thanks @dgisselq . That all makes sense, and we're slowly working towards moving more stuff to DMA reads. It was just a surprise to see that the Ultrascale has much more read latency than older ARM CPUs.