rgebauer
Contributor
1,204 Views
Registered: 07-17-2017

Strategy for lowest latency to access AXI register


Background:
I have an application running on the R5 processor and I would like to access AXI4Lite registers of custom PL modules with the lowest possible latency.

My current design looks like this (with 5 masters at the interconnect):
[ ZYNQ ] M_AXI_HPM0_LPD @ pl_clk0 -> S00_AXI [ AXI Interconnect ] M0x_AXI @ design_clk -> S00_AXI [ Custom RTL Module ]

design_clk = 250MHz external clock
pl_clk0 = IOPLL @ 249.997498 MHz

If you need more information, please let me know.

Current Latency: 304ns Read, 314ns Write
This seems a bit high for a "low-latency" interface.

Question:
What knobs can I turn and what changes can I implement to achieve the lowest possible latency for register accesses from the R5?

Any hints are highly appreciated. Thank you!


Accepted Solutions
dgisselq
Scholar
1,186 Views
Registered: 05-21-2015

@rgebauer,

The first knob you can tune is to get rid of the clock domain crossing.  An AXI interconnect will handle a CDC using an asynchronous FIFO.  This will incur a delay of two clocks in each direction, so roughly 16ns in your case.  You can choose to handle it instead within the logic of your own core.
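
To make that concrete, a toggle-handshake is usually all you need for slow register-style writes, since accesses from the R5 are spaced far enough apart for the synchronizer to settle between them.  A minimal sketch (my own illustrative names, nothing from any Xilinx IP) might look like:

// Pass one register write from the AXI clock (aclk) into design_clk
// with a toggle handshake, so the interconnect can stay in one domain.
module reg_write_cdc (
    input  wire        aclk,
    input  wire        design_clk,
    input  wire        wr_en_aclk,       // one-cycle write strobe, AXI domain
    input  wire [31:0] wr_data_aclk,
    output reg  [31:0] reg_design_clk    // register as seen by design_clk logic
);
    // Source domain: hold the data and flip a toggle on every write
    reg        wr_toggle    = 1'b0;
    reg [31:0] wr_data_held = 32'd0;
    always @(posedge aclk)
        if (wr_en_aclk) begin
            wr_data_held <= wr_data_aclk;
            wr_toggle    <= !wr_toggle;
        end

    // Destination domain: synchronize the toggle, then capture the
    // (by now long since stable) data word on each detected edge.
    (* ASYNC_REG = "TRUE" *) reg [2:0] toggle_sync = 3'b000;
    always @(posedge design_clk) begin
        toggle_sync <= { toggle_sync[1:0], wr_toggle };
        if (toggle_sync[2] != toggle_sync[1])
            reg_design_clk <= wr_data_held;
    end
endmodule

The built-in assumption is that two writes to the same register never arrive closer together than the synchronizer delay; for registers poked by a CPU that is a safe assumption.  Status bits coming back the other way can use the same trick in reverse.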

The second knob you can tune is to make your slave full AXI capable, rather than AXI-lite only.  Xilinx has publicly said that they will not be supporting high-performance AXI-lite slaves, and that if you want high-performance you need to build a full AXI4 slave.  My guess is that Xilinx's AXI to AXI-lite bridge costs a minimum of another 4 clocks (16ns), but I've been embarrassed for them before.

When you do this, don't use Xilinx's demonstration AXI slave.  It has a maximum of 50% read throughput, an extra clock (or two) of latency, and it attempts to throttle itself so that it can only handle reads or writes but never both at the same time.

You might also wish to be aware of which Xilinx cores you use--many of Xilinx's AXI-lite cores use an internal IPIF module that is horribly inefficient performance-wise.  An example of this would be their GPIO core.  Using such a core would drop your throughput by at least 25% and add another 4 clocks of latency (16ns).  A good custom core should have a 1-2 clock latency and 100% throughput.  (Xilinx's ... aren't that good.)
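
For a sense of what that 1-2 clock figure looks like in practice, here is a bare-bones AXI4-Lite register file--only a sketch with my own names, not any particular core.  (A full AXI4 slave additionally needs burst, ID, and protection handling, which I'm leaving out here.)  It accepts a new read or write every clock and answers reads with a single clock of latency:

// Four 32-bit registers, one-clock read latency, a new transaction
// accepted every clock.  All responses are OKAY; no decode-error handling.
module axil_regs (
    input  wire        s_axi_aclk,
    input  wire        s_axi_aresetn,
    // write address / data / response
    input  wire [3:0]  s_axi_awaddr,
    input  wire        s_axi_awvalid,
    output wire        s_axi_awready,
    input  wire [31:0] s_axi_wdata,
    input  wire [3:0]  s_axi_wstrb,
    input  wire        s_axi_wvalid,
    output wire        s_axi_wready,
    output reg         s_axi_bvalid,
    output wire [1:0]  s_axi_bresp,
    input  wire        s_axi_bready,
    // read address / data
    input  wire [3:0]  s_axi_araddr,
    input  wire        s_axi_arvalid,
    output wire        s_axi_arready,
    output reg  [31:0] s_axi_rdata,
    output reg         s_axi_rvalid,
    output wire [1:0]  s_axi_rresp,
    input  wire        s_axi_rready
);
    reg [31:0] regs [0:3];
    integer    i;

    // Accept a write when address and data are both present and the
    // previous response has been (or is being) taken.
    wire wr_accept = s_axi_awvalid && s_axi_wvalid
                        && (!s_axi_bvalid || s_axi_bready);
    assign s_axi_awready = wr_accept;
    assign s_axi_wready  = wr_accept;
    assign s_axi_bresp   = 2'b00;

    always @(posedge s_axi_aclk) begin
        if (!s_axi_aresetn)
            s_axi_bvalid <= 1'b0;
        else if (wr_accept)
            s_axi_bvalid <= 1'b1;
        else if (s_axi_bready)
            s_axi_bvalid <= 1'b0;

        if (wr_accept)
            for (i = 0; i < 4; i = i + 1)
                if (s_axi_wstrb[i])
                    regs[s_axi_awaddr[3:2]][8*i +: 8] <= s_axi_wdata[8*i +: 8];
    end

    // Reads: one clock of latency, result held until RREADY.
    wire rd_accept = s_axi_arvalid && (!s_axi_rvalid || s_axi_rready);
    assign s_axi_arready = rd_accept;
    assign s_axi_rresp   = 2'b00;

    always @(posedge s_axi_aclk) begin
        if (!s_axi_aresetn)
            s_axi_rvalid <= 1'b0;
        else if (rd_accept)
            s_axi_rvalid <= 1'b1;
        else if (s_axi_rready)
            s_axi_rvalid <= 1'b0;

        if (rd_accept)
            s_axi_rdata <= regs[s_axi_araddr[3:2]];
    end
endmodule

The key points are that the ready signals never stall unless a previous response is still waiting to be drained, and that there is exactly one register between the address channel and the data coming back.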

The third knob you could tune would be the width of the interface.  Any AXI resizer, either up or down, that you have to go through will cost you between 4 and 6 clocks (my estimate).  Depending on how the resizer is tuned, for throughput versus latency, you might suffer even more latency there.

The fourth knob you can tune is the AXI crossbar.  Get rid of it.  A minimum AXI crossbar will cost at least 3 clocks of latency; my guess is that Xilinx's costs about 5-6.  (Not because of a poor implementation, but rather because of the cost of high-speed logic.  The cost might be much higher; I just know my own AXI crossbar can achieve a 4-5 clock latency without too much hassle, and the protocol requires 3 clocks.)  Of course, this again depends upon how you have it set up.  You could instead create a single (full) AXI slave at the width of the ARM's low-latency port, and either put your other slaves on other ports or handle the routing of connections to them yourself.  With some judicious work here, you might manage to drop your latency by another 16ns.
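
To illustrate the "handle the routing yourself" part: behind a single slave port, selecting between a handful of register banks is nothing more than a mux on a couple of upper address bits.  Another sketch with made-up names:

module bank_decode (
    input  wire [9:0]  araddr,       // low address bits of the shared port
    input  wire [31:0] bank0_rdata,  // read data presented by each block
    input  wire [31:0] bank1_rdata,
    input  wire [31:0] bank2_rdata,
    input  wire [31:0] bank3_rdata,
    output reg  [31:0] rdata_mux     // feeds the single RDATA register
);
    always @(*)
        case (araddr[9:8])           // e.g. four banks of 256 bytes each
            2'b00:   rdata_mux = bank0_rdata;
            2'b01:   rdata_mux = bank1_rdata;
            2'b10:   rdata_mux = bank2_rdata;
            default: rdata_mux = bank3_rdata;
        endcase
endmodule

The write side decodes the same bits into per-bank write strobes, and none of this adds the extra handshaking clocks a full crossbar would.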

That would be about 72ns shaved off of your 304ns measurement.  Not bad, no?  I seem to recall another recent forum post about latency which recommended adjusting a setting within the ARM that could also have a significant impact on how the bus performs.

Dan


4 Replies

rgebauer
Contributor
1,064 Views
Registered: 07-17-2017

Dear @dgisselq 

Thank you very much for this detailed list of possible knobs!

So far, I only changed two things:

  • Removed the clock domain crossing at the interconnect and moved it to after the register interface inside the modules
  • In the course of that, I connected only 2 of the 5 components to the LPD interconnect (the others are on the FPD)

That already quite significantly reduced the access times:

232ns Read / 242ns Write

Not sure why the improvement was so drastic and not just the 16ns (2 cycles for each clock domain crossing)...

At least the data width was already 32 bit, so no resizer is needed; this should already be fine.

Unfortunately, I currently do not have the time to test all of your points, but I will definitely investigate them at some point; getting rid of the crossbar and using a full AXI slave in particular seems promising. Can you recommend a full AXI4 implementation that one could use as guidance for an implementation?

dgisselq
Scholar
1,044 Views
Registered: 05-21-2015

@rgebauer,

Sure!  Check out this full AXI4 implementation.  If your design can handle single cycle reads, it should work nicely for you.  I have other full implementations as well, in case you are interested.

Dan

alexander
Newbie
21 Views
Registered: 07-23-2021

Hi @rgebauer,

Did you ever find a way to get the read latency below 232ns?

Thanks!
