witxilinx

Observer

03-04-2019 09:07 AM - edited 03-05-2019 01:59 AM

Registered:
12-26-2018

How can iteration latency be less than the number of inner loops?

So basically, each iteration of the loop in the code below multiplies a 1x16 vector (i_matrix) by a 16x16 matrix (w_matrix). After "csynth_design" finishes, the report says that the "READ_GEMM_LOOP" loop achieves an **iteration latency** of 10.

- I'm very surprised that the "READ_GEMM_LOOP" loop can achieve an iteration latency of 10, given that there are at least 16 numbers to multiply and add together (multiply and accumulate). Shouldn't it be at least ~16?
- Why is the "Latency" (both min/max) undefined? My guess is that it is because the outermost loop "for (int x = begin; x < end; x++)" has variable bounds. Do I understand correctly?
- If so, how do I find out the total latency of executing this loop?

    READ_GEMM_LOOP: for (int x = begin; x < end; x++) {
    #pragma HLS PIPELINE II=1 rewind
        ... some code (depends on x) ...

        //    +------------------+
        //    |                  |
        //    |                  |
        // 16 |     w_matrix     |
        //    |                  |
        //    |                  |
        //    +------------------+
        //             16
        //
        //    +------------------+
        //  1 |     i_matrix     |
        //    +------------------+
        //             16

        for (int i = 0; i < 1; i++) {
            for (int j = 0; j < 16; j++) {
                accum = ...
                temp = 0;
                // Inner product of "i_matrix" and column of "w_matrix"
                for (int k = 0; k < 16; k++) {
                    weight = w_matrix[j][k];
                    input = i_matrix[i][k];
                    product = input * weight;
                    temp += product;
                }
                accum += temp;
            }
        }
    }

6 Replies

xilinxacct

Professor

03-04-2019 10:03 AM - edited 03-04-2019 10:03 AM

Registered:
10-23-2018

On your outer loop, which has variable bounds (e.g. begin, end), use the following pragma to tell the tool what loop bounds to expect at runtime. This should take away the "?" in the latency calculations.

#pragma HLS loop_tripcount min=<int> max=<int> avg=<int>
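
As a rough illustration of where this pragma goes, here is a minimal sketch on a variable-bound loop. The function `sum_range`, the array size of 64, and the min/max/avg values are all hypothetical and not taken from the original code. The pragma sits inside the loop body and, as far as I know, only feeds the latency numbers in the report; it does not change the generated hardware.

```cpp
#include <cassert>

// Hypothetical kernel (not from the thread): sums in[begin..end).
// The loop bounds are only known at run time, so without a trip-count
// hint the synthesis report would show "?" for the loop latency.
int sum_range(const int in[64], int begin, int end) {
    assert(0 <= begin && begin <= end && end <= 64);  // assumed valid range
    int total = 0;
SUM_LOOP:
    for (int x = begin; x < end; x++) {
// Reporting hint only: assumed 1 to 64 iterations, 32 on average.
#pragma HLS loop_tripcount min=1 max=64 avg=32
#pragma HLS PIPELINE II=1
        total += in[x];
    }
    return total;
}
```

A regular C++ compiler ignores the `#pragma HLS` lines, so the same function can be exercised in a plain software testbench.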

Hope that helps.

If so, please mark as solution accepted. Kudos also welcomed. :-)

witxilinx

Observer

03-04-2019 10:11 AM - edited 03-04-2019 10:16 AM

Registered:
12-26-2018

Hmm.. that's interesting but questionable to me.

1. So without the #pragma loop_tripcount, the HLS tool cannot report the latency? Why does this pragma help the HLS? I just cannot see it from a user's point of view.

xilinxacct

Professor

03-04-2019 10:31 AM

Registered:
10-23-2018

At synthesis time, the tool has no idea what those values will be. The pragma tells it what the range can be, so the latency calculation has known values to work with.

Hope that clears up the question.
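
Once a trip count is known, the total latency of a pipelined loop can be estimated by hand. The sketch below uses the usual first-order model: the first iteration takes the full iteration latency, and each later iteration starts one initiation interval (II) after the previous one. The function name and the trip count of 100 in the usage note are assumptions for illustration; the report may differ from this estimate by a cycle or two for loop entry and exit.

```cpp
#include <cassert>

// First-order estimate of the cycles needed by a pipelined loop.
// iteration_latency: cycles for one iteration from start to finish
// ii:                initiation interval (cycles between iteration starts)
// trip_count:        number of iterations executed
long pipelined_loop_latency(long iteration_latency, long ii, long trip_count) {
    assert(iteration_latency >= 1 && ii >= 1 && trip_count >= 0);
    if (trip_count == 0) {
        return 0;  // the loop body never runs
    }
    return iteration_latency + ii * (trip_count - 1);
}
```

With the reported iteration latency of 10, II = 1, and an assumed 100 iterations, this gives 10 + 1 * 99 = 109 cycles.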

witxilinx

Observer

03-04-2019 07:15 PM - edited 03-04-2019 08:13 PM

Registered:
12-26-2018

I did what you suggested and those "?" have gone away. But I still wonder how the **iteration latency** can be only **10**. Did I miss something?

The fastest way I can think of to multiply a 1x16 vector by a 16x16 matrix would take at least 16 cycles (with sequential sum operations). In the first cycle, we compute all the products *in parallel* (assuming we have 16x16=256 multiply units). Then we sum the values in each column of the 16x16 multiplier array sequentially, which gives a 1x16 output result. This would take ~15 cycles.

Is there any way I can check the architecture that gets generated from the Vivado HLS tool?
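
The sequential-sum estimate above is not the only way to add 16 products. One general possibility, sketched here, is a balanced adder tree: the 16 values are summed pairwise in 4 levels instead of 15 dependent additions. This is an illustration only, with hypothetical function names and `int` data assumed. Whether the tool actually restructures the additions this way depends on the data type and settings, and nothing in the report quoted in this thread confirms it.

```cpp
#include <cassert>

const int N = 16;  // number of products per output element

// Sequential accumulation: a chain of N - 1 = 15 dependent additions.
int sum_sequential(const int p[N]) {
    int s = p[0];
    for (int k = 1; k < N; k++) {
        s += p[k];
    }
    return s;
}

// Balanced tree: each level adds neighbouring pairs, halving the number
// of partial sums. Returns the total and writes the number of levels.
int sum_tree(const int p[N], int *levels) {
    int buf[N];
    for (int k = 0; k < N; k++) {
        buf[k] = p[k];
    }
    int depth = 0;
    for (int width = N; width > 1; width /= 2) {
        // All additions within one level are independent of each other.
        for (int k = 0; k < width / 2; k++) {
            buf[k] = buf[2 * k] + buf[2 * k + 1];
        }
        depth++;
    }
    assert(depth == 4);  // log2(16) levels
    *levels = depth;
    return buf[0];
}
```

Both functions return the same sum for integer inputs; the difference is the length of the dependency chain, 15 additions versus 4 levels.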

u4223374

Advisor

03-05-2019 01:16 AM

Registered:
04-26-2015

Is it just me, or are you using "i" as both a loop index and to read a value from the matrix? That may be causing all sorts of interesting chaos...

HLS can do some surprisingly good optimizations (and some surprisingly bad ones...). From its point of view, you're just asking for 256 multiply operations and somewhat fewer adds. Each DSP48 can do a multiply-add operation per clock cycle, but at sufficiently low clock speeds you can chain them together and get multiple multiply-add operations in one cycle. My guess would be that it's got a set of 16x2 DSPs set up; on each cycle it does two iterations of the "k" loop (these iterations depend on each other so more than two at once might be pushing your luck) and all 16 iterations of the "j" loop (don't depend on each other, so it can do them all at once). That would be eight cycles of work, plus a bit before/after the loop.
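
The arithmetic behind that guess can be written out as a small sketch. It assumes, as in the guess above and not from any report, that two dependent multiply-add steps of the "k" loop fit in one clock cycle and that all 16 "j" iterations run side by side. The function names are hypothetical.

```cpp
#include <cassert>

// Cycles for one inner product when `per_cycle` dependent multiply-add
// steps are chained within a single clock cycle (ceiling division).
int chained_mac_cycles(int k_iterations, int per_cycle) {
    assert(k_iterations >= 0 && per_cycle >= 1);
    return (k_iterations + per_cycle - 1) / per_cycle;
}

// DSP blocks needed when `j_parallel` independent inner products each
// use `per_cycle` chained multiply-add units.
int dsp_count(int j_parallel, int per_cycle) {
    assert(j_parallel >= 0 && per_cycle >= 0);
    return j_parallel * per_cycle;
}
```

For this thread's sizes, `chained_mac_cycles(16, 2)` is 8 and `dsp_count(16, 2)` is 32, which is consistent with a reported iteration latency of 10 if about two more cycles are spent before and after the loop.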

witxilinx

Observer

03-05-2019 01:49 AM

Registered:
12-26-2018

Edit1: I just looked up the DSP and I think I had misunderstood it for a long time. It can indeed do a MAC operation in 1 cycle. Cool! I'm really new to this.