witxilinx
Observer
1,418 Views
Registered: ‎12-26-2018

How can iteration latency be less than the number of inner loops?

 

Each iteration of the loop in the code below multiplies a 1x16 vector (i_matrix) by a 16x16 matrix (w_matrix). After "csynth_design" has finished, the report says the "READ_GEMM_LOOP" loop achieves an iteration latency of 10.

  1. I'm just very surprised that the "READ_GEMM_LOOP" loop in the code below can achieve an iteration latency of 10, given that there are at least 16 numbers to multiply and add together (multiply and accumulate). Shouldn't it be at least ~16?
  2. Why is the "Latency" (both min/max) undefined? My guess is that it is because the outermost loop "for (int x = begin; x < end; x++)" has variable bounds. Do I understand correctly?
  3. If so, how do I know the total latency of executing this loop?
READ_GEMM_LOOP: for (int x = begin; x < end; x++) {
#pragma HLS PIPELINE II = 1 rewind

    ... some code (depends on x) ...

    //          16
    //    +------------+
    //    |            |
    // 16 |  w_matrix  |
    //    |            |
    //    +------------+
    //
    //          16
    //    +------------+
    //  1 |  i_matrix  |
    //    +------------+

    for (int i = 0; i < 1; i++) {
        for (int j = 0; j < 16; j++) {
            accum = ...
            temp = 0;
            // Inner product of "i_matrix" and column of "w_matrix"
            for (int k = 0; k < 16; k++) {
                weight = w_matrix[j][k];
                input = i_matrix[i][k];
                product = input * weight;
                temp += product;
            }
            accum += temp;
        }
    }
}

    Screenshot from 2019-03-05 02-24-24.png

6 Replies
xilinxacct
Professor
1,390 Views
Registered: ‎10-23-2018

@witxilinx 

On your outer loop, which has variable bounds (e.g. begin, end), use the following to tell the tool what loop bounds to expect at runtime. This should take away the "?" in the latency calculations.

#pragma HLS loop_tripcount min=<int> max=<int> avg=<int>

Hope that helps.

If so, please mark as solution accepted. Kudos also welcomed. :-)
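
For reference, a minimal sketch of where the pragma goes on a variable-bound loop. The function name sum_range, its body, and the trip counts (1 to 64, 32 on average) are made up for illustration and are not from the original code. Note that LOOP_TRIPCOUNT only feeds the latency report; it does not change the hardware that gets synthesized.

```cpp
// Hypothetical stand-in for the real kernel: sums data[begin..end).
int sum_range(const int data[64], int begin, int end) {
    int acc = 0;
READ_LOOP:
    for (int x = begin; x < end; x++) {
// Report-only hint (assumed numbers): 1 to 64 iterations, 32 on average.
#pragma HLS loop_tripcount min=1 max=64 avg=32
#pragma HLS PIPELINE II=1
        acc += data[x];
    }
    return acc;
}
```

A normal C++ compiler ignores the HLS pragmas, so the same function can be checked in a plain software testbench before synthesis.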

witxilinx
Observer
1,383 Views
Registered: ‎12-26-2018

Hmm.. that's interesting but questionable to me.

1. So without the #pragma loop_tripcount, the HLS tool cannot report the latency? Why does this pragma help the HLS? I just cannot see it from a user's point of view.

xilinxacct
Professor
1,374 Views
Registered: ‎10-23-2018

@witxilinx 

At synthesis time, the tool has no idea what those values will be. The pragma tells it what the range can be, so the calculation has known values.

Hope that clears up the question.
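
On the "total latency" question, the usual rule of thumb for a pipelined loop is: total latency ≈ iteration latency + II x (trip count - 1). Below is a small sketch of that arithmetic, using the iteration latency of 10 and II of 1 from this thread. The function name and the example trip counts are assumptions for illustration, and the number in the tool's report can differ by a cycle or two for loop entry and exit.

```cpp
// Approximate cycle count of a pipelined loop:
//   the first iteration takes iteration_latency cycles,
//   every further iteration adds ii (initiation interval) cycles.
long pipelined_loop_latency(long iteration_latency, long ii, long trip_count) {
    if (trip_count <= 0) return 0;  // loop body never runs
    return iteration_latency + ii * (trip_count - 1);
}
```

With an iteration latency of 10 and II = 1, a single iteration gives 10 cycles and 64 iterations give 10 + 63 = 73 cycles, which is why the min and max in the report follow directly from the min and max trip counts.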

witxilinx
Observer
1,342 Views
Registered: ‎12-26-2018

 

I did what you suggested and those "?" have gone away. But I still wonder how the iteration latency can be only 10. Did I miss something?

The fastest way to multiply a 1x16 vector with a 16x16 matrix that I can think of would take at least 16 cycles (with sequential sum operations). In the first cycle, we do all the multiplications in parallel (assuming we have 16x16=256 multiply units). Then we sum up the values in each column of the 16x16 array of products sequentially, which gives a 1x16 output result. This would take ~15 cycles.
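
The ~15 cycle estimate assumes the 16 products are added one after another. A sketch of the alternative, assuming integer data (the element type is not shown in the post): adding in pairs forms a tree with log2(16) = 4 levels, so the dependency chain is 4 adds deep instead of 15. The function name tree_sum16 and the levels counter are illustrative only.

```cpp
// Sums 16 values pairwise; *levels receives the number of tree levels
// (i.e. the depth of the adder chain).
int tree_sum16(const int in[16], int *levels) {
    int buf[16];
    for (int n = 0; n < 16; n++) buf[n] = in[n];
    int depth = 0;
    // Each pass halves the number of partial sums: 16 -> 8 -> 4 -> 2 -> 1.
    for (int width = 16; width > 1; width /= 2) {
        for (int n = 0; n < width / 2; n++) {
            // The adds within one level do not depend on each other.
            buf[n] = buf[2 * n] + buf[2 * n + 1];
        }
        depth++;
    }
    *levels = depth;
    return buf[0];
}
```

Whether the tool actually builds such a tree depends on the data type and settings. As far as I know it can rebalance integer additions on its own, but it normally keeps the source order for floating point.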

Is there any way I can check the architecture that gets generated from the Vivado HLS tool?

Capture.JPG

u4223374
Advisor
1,326 Views
Registered: ‎04-26-2015

Is it just me, or are you using "i" as both a loop index and to read a value from the matrix? That may be causing all sorts of interesting chaos...

 

HLS can do some surprisingly good optimizations (and some surprisingly bad ones...). From its point of view, you're just asking for 256 multiply operations and somewhat fewer adds. Each DSP48 can do a multiply-add operation per clock cycle, but at sufficiently low clock speeds you can chain them together and get multiple multiply-add operations in one cycle. My guess would be that it's got a set of 16x2 DSPs set up; on each cycle it does two iterations of the "k" loop (these iterations depend on each other so more than two at once might be pushing your luck) and all 16 iterations of the "j" loop (don't depend on each other, so it can do them all at once). That would be eight cycles of work, plus a bit before/after the loop.
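
To make that guess concrete, here is a plain C++ software model of the schedule. It is not HLS output and has not been checked against the generated RTL. Each modelled cycle handles two chained "k" steps for all 16 "j" columns, so the 16-element inner product finishes in 8 modelled cycles. The function name, the int element type, and the cycles counter are assumptions for illustration.

```cpp
// Models one 1x16 by 16x16 product, two "k" steps per modelled cycle.
// Results go to out[j]; the return value is the number of modelled cycles.
int gemv_two_k_per_cycle(const int w_matrix[16][16], const int i_matrix[16],
                         int out[16]) {
    for (int j = 0; j < 16; j++) out[j] = 0;
    int cycles = 0;
    for (int k = 0; k < 16; k += 2) {      // 16 / 2 = 8 modelled cycles
        for (int j = 0; j < 16; j++) {     // 16 independent columns, done "at once"
            // Two chained multiply-adds, as in two cascaded DSP slices.
            out[j] += i_matrix[k] * w_matrix[j][k];
            out[j] += i_matrix[k + 1] * w_matrix[j][k + 1];
        }
        cycles++;
    }
    return cycles;
}
```

The remaining couple of cycles in the reported iteration latency of 10 would then be the work before and after the loop, as suggested above.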

 

witxilinx
Observer
1,308 Views
Registered: ‎12-26-2018

Oops, I simplified the code a little; I'll fix it. It should have been "input" instead of "i". I'll get back soon.

Edit1: I just looked up the DSP and I think I had misunderstood it for a long time. It can indeed do a MAC operation in 1 cycle. Cool! I'm really new to this.