gdg (Explorer)

Reducing the latency of single operations


I am wondering whether Vivado HLS can be more aggressive in the resources it generates for single operations, so that they meet tighter timing constraints. Please note that I usually use pipelining or parallelization to increase the design's throughput / decrease its latency.

 

In addition to these techniques, can I ask the HLS tool to generate a faster implementation of individual operations by paying more in area?

 

For example, let's consider the following code (which, for simplicity, does not use pipelining) and its synthesis results (a hedged sketch of the loop is given after the list):

  • [lines 13-14] each of the read operations (on an AXI master channel) takes 8 clock cycles (7 for the request, 1 for the actual read)
  • [line 15] the floating-point sum takes 14 clock cycles
  • [line 16] the write operation (on an AXI master channel) takes 1 clock cycle
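
Since the code itself is only visible in the attached screenshots, here is a minimal sketch of the kind of loop described above, assuming an element-wise addition of two float arrays read over AXI master ports; the function name, array names, and pragmas are illustrative, not the original source:

// Hedged reconstruction of the loop described above, not the original code.
void vadd(const float *a, const float *b, float *out, int n) {
#pragma HLS INTERFACE m_axi     port=a   offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=b   offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=out offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=n      bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    for (int i = 0; i < n; ++i) {
        float va = a[i];     // AXI-M read: ~8 cycles each in the schedule
        float vb = b[i];     // AXI-M read
        float s  = va + vb;  // floating-point add: 14 cycles at a 2 ns clock
        out[i]   = s;        // AXI-M write: 1 cycle
    }
}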

I am synthesizing the design with a clock period of 2 ns (for a Zynq UltraScale+), and no DSPs are inferred given the strict clock period.

 

Is there any way to constrain the read operations on the AXI master and the math operation to take fewer clock cycles? Are 8 clock cycles really necessary for the AXI-M read operations?

 

[Attachment: core.png]

[Attachment: gantt.png]

 

Thank you

Accepted Solution
jprice (Scholar)

The resource directive has a latency option you can specify. The catch with floating-point operations is that reducing the latency will likely produce a result that cannot make timing at your target clock rate. It won't hurt to give it a shot, but floating-point arithmetic is just slow. You may want to try half precision or fixed point if your particular design doesn't truly need floating point (many operations really don't).
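
A hedged sketch of that directive in source form; the core name, the latency value, and whether they close timing at 2 ns are placeholders to check against your part and HLS version (newer Vitis HLS uses the BIND_OP pragma instead):

// Hedged sketch: bind the accumulation to a floating-point adder with a
// requested latency. Core name and latency value are illustrative only.
float accumulate(const float buf[1024]) {
    float sum = 0.0f;
#pragma HLS RESOURCE variable=sum core=FAddSub latency=8
    for (int i = 0; i < 1024; ++i) {
        sum += buf[i];  // consider half or ap_fixed<> if full float isn't needed
    }
    return sum;
}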

 

With regard to your AXI master interface, I'd recommend a different approach. In your example you'll likely find it much faster to use a streaming AXI interface. If that is not an option, perhaps try copying your 1024 values into a local memory first, then run your accumulation loop. The problem with that approach is that I've been unable to get HLS to start running your accumulation loop while a memcpy call is still in progress. I'd recommend one of the AXI master to AXI-Stream cores to get your performance up. For performance with streaming math I pretty much always use AXI-Stream in and out (to very good effect).
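
A hedged sketch of the copy-then-compute variant described above; the 1024-element size comes from this thread, and the port names and pragmas are assumptions:

#include <cstring>  // memcpy

// Hedged sketch: burst-copy into a local buffer over AXI-M, then accumulate.
// Note the compute loop only starts once the memcpy has completed, which is
// exactly the limitation described above.
float accumulate_local(const float *in) {
#pragma HLS INTERFACE m_axi     port=in offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=return bundle=control

    float buf[1024];                         // local BRAM copy
    memcpy(buf, in, 1024 * sizeof(float));   // inferred burst read over AXI-M

    float sum = 0.0f;
    for (int i = 0; i < 1024; ++i) {
#pragma HLS PIPELINE
        // the carried dependence on 'sum' limits the achievable II with a
        // multi-cycle floating-point adder
        sum += buf[i];
    }
    return sum;
}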

 

 


4 Replies

gdg (Explorer)

@jprice, do you think that just using an AXI-Stream interface + an AXI DMA IP simply moves the problem from my module to the DMA IP?

jprice (Scholar)

In general I haven't been able to get a memcpy call or any memory reads to pipeline such that operations consuming data from a large transfer can begin before the memory transfer completes. Using an external DMA engine that keeps an AXI-Stream interface primed with data should solve that problem nicely.
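
A hedged sketch of what the stream-fed accumulation could look like, with an external DMA keeping the AXIS port supplied; the 32-bit packet type, the side-channel widths, and the TLAST-terminated framing are assumptions:

#include <hls_stream.h>
#include <ap_axi_sdata.h>

typedef ap_axiu<32, 1, 1, 1> pkt_t;  // 32-bit data beat with AXIS side channels

// Hedged sketch: accumulate a float stream fed by an external DMA engine.
void accumulate_stream(hls::stream<pkt_t> &in, float &result) {
#pragma HLS INTERFACE axis      port=in
#pragma HLS INTERFACE s_axilite port=result bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    float sum = 0.0f;
    pkt_t p;
    do {
#pragma HLS PIPELINE
        p = in.read();
        union { unsigned int u; float f; } cvt;  // reinterpret raw bits as float
        cvt.u = p.data.to_uint();
        sum += cvt.f;
    } while (!p.last);   // the DMA marks the end of the transfer with TLAST
    result = sum;
}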

gdg (Explorer)

@jprice, regarding the use of AXI DMA, I have some issues with the maximum size of the data that I can transfer:

https://forums.xilinx.com/t5/UltraScale-Architecture/XAxiDma-SimpleTransfer-and-Maximum-Transfer-Length/td-p/780556

 

Please note that the module using AXI-M can transfer data of any size.

 

Thank you
