08-15-2016 04:44 PM
Hi, I have a question. According to SDSoC and the HLS tutorial guide, pipelining a loop causes the loops nested inside it to be unrolled.
So I unrolled the loops manually. This is my code.
The pipeline pragma is placed under the feature loop and above the depth loop,
and I manually unrolled the row_f, col_f, and depth loops. The function got faster, but the number of DSPs went down.
For example, the number of DSPs without manual unrolling is 31, but with full unrolling it is only 6, and the speed is 2x. LUT and BRAM utilization also went down. I don't understand this result. I expected DSP utilization to increase, but the result is the reverse. Is there a reason for this, such as scheduling optimization or resource reuse in HLS or SDSoC?
08-15-2016 05:06 PM
Unrolling a loop simply means removing the loop. For example:
for(int i=0; i<3; i++) sum += a[i];
After unrolling, it becomes:
sum += a[0]; sum += a[1]; sum += a[2];
In VHLS, if you add a #pragma HLS pipeline, all loops at the same level as the pragma and below it will be unrolled.
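As a rough sketch of that pragma placement (the function name, array names, and sizes below are made up for illustration and are not taken from the poster's code), the pragma sits in the body of the outer loop, so the inner loop is the one that gets unrolled:

```cpp
#include <cstddef>

// Hypothetical sizes and names, chosen only for illustration.
constexpr std::size_t ROWS = 4;
constexpr std::size_t TAPS = 3;

// The PIPELINE pragma is in the body of row_loop, so Vivado HLS is
// expected to fully unroll tap_loop, which is nested inside it.
// A regular C++ compiler ignores the pragma (possibly with a warning),
// so the function can still be checked in software.
void mac_rows(const int in[ROWS][TAPS], const int coef[TAPS], int out[ROWS]) {
row_loop:
    for (std::size_t r = 0; r < ROWS; ++r) {
#pragma HLS PIPELINE II=1
        int acc = 0;
    tap_loop:
        for (std::size_t t = 0; t < TAPS; ++t) {
            acc += in[r][t] * coef[t];  // becomes TAPS parallel multiplies when unrolled
        }
        out[r] = acc;
    }
}
```

Whether the requested II=1 is actually achieved depends on how many elements of `in` can be read per cycle, which is where the dependency and memory-port limits come in.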
I don't see the unrolling in your code. Please provide the code before and after the unroll. Also, loop unrolling does not necessarily improve performance (and any improvement usually comes with increased resource usage); it is ultimately limited by the data/memory dependencies in your code and by the resources available in the PL.
08-16-2016 04:28 AM - edited 08-16-2016 04:29 AM
As @wsun has explained, loop unrolling tends to be limited by dependencies (if each loop iteration depends on the previous ones) or RAM ports.
In your case, you've got fewer DSP slices because they're no longer being used for addressing. Ordinarily, to index into an N-dimensional array, HLS has to use N-1 DSP slices (unless some dimensions are powers of two, in which case it can use shifts). Unrolled, all the indexes are hard-coded into a state machine. The downside is that the resulting state machine can be extremely large.
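To make the addressing cost concrete, here is a minimal sketch of the row-major index calculation for a 3-dimensional array. The dimensions and the function name `flat_index` are assumptions for illustration only; the point is that three dimensions need two multiplications (N-1 in general) when i, j, and k are loop variables.

```cpp
#include <cstddef>

// Hypothetical dimensions; 5 and 7 are deliberately not powers of two,
// so the multiplications cannot be replaced by shifts.
constexpr std::size_t D0 = 3;
constexpr std::size_t D1 = 5;
constexpr std::size_t D2 = 7;

// Flat address of element [i][j][k] in a row-major D0 x D1 x D2 array.
// Two multiplications for three dimensions.
constexpr std::size_t flat_index(std::size_t i, std::size_t j, std::size_t k) {
    return (i * D1 + j) * D2 + k;
}

// Once a loop nest is fully unrolled, every (i, j, k) is a constant,
// so each address can be folded at compile time, as here:
constexpr std::size_t last_address = flat_index(D0 - 1, D1 - 1, D2 - 1);
```

In a rolled loop the multiplications have to be built in hardware; in the unrolled version each address is just a constant stored in the state machine.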
In the "ideal" unrolling case, where the input and output are both register arrays (so you can read/write every element at once), the state machine is trivial and everything happens simultaneously. The downside is that if you're processing 1000 elements then you need 1000 DSP slices (or whatever other resources are used).
The opposite end is when input or output is a single-port RAM, so you can only read/write a single element at a time. In this case you tend to end up with a giant state machine (eg. 1000 states to process 1000 elements) which steps through the hard-coded addresses one at a time. In this case the resources needed for processing are small, but the state machine can be quite substantial.
In this case, you probably need to partition the input and output arrays to allow HLS to handle lots of elements at once.
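A minimal sketch of what that partitioning might look like, assuming small one-dimensional arrays (the function name, array names, and N are hypothetical, not from the original code):

```cpp
#include <cstddef>

constexpr std::size_t N = 8;  // hypothetical element count

// ARRAY_PARTITION with "complete" splits each array into individual
// registers, so every element can be read or written in the same cycle.
// UNROLL then lets all N multiplications happen in parallel, at the cost
// of N multipliers. A regular C++ compiler ignores these pragmas.
void scale_all(const int in[N], int out[N], int gain) {
#pragma HLS ARRAY_PARTITION variable=in complete dim=1
#pragma HLS ARRAY_PARTITION variable=out complete dim=1
scale_loop:
    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS UNROLL
        out[i] = in[i] * gain;
    }
}
```

For large arrays a complete partition is usually too expensive; a cyclic or block partition with a small factor is a middle ground that gives a few parallel ports without turning the whole array into registers.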