naz_rb
Adventurer
661 Views
Registered: ‎11-10-2019

Unrolling the loop manually and function latency constraints


I have a TOP-level function of the following structure:

struct TYPE1 { uint8 ch[16]; };
struct TYPE2 { uint8 ch[100]; };

void FUNCT(hls::stream<TYPE1> &inStream, hls::stream<TYPE2> &outStream) {
#pragma HLS INTERFACE axis port=inStream
#pragma HLS INTERFACE axis port=outStream
#pragma HLS DATA_PACK variable=outStream struct_level
#pragma HLS DATA_PACK variable=inStream struct_level

    TYPE1 inpx;
#pragma HLS ARRAY_PARTITION variable=inpx.ch complete dim=1
    TYPE2 outpx;
#pragma HLS ARRAY_PARTITION variable=outpx.ch complete dim=1

    inpx = inStream.read();

    L0: for (int i = 0; i < 100; i++) {
        L1: for (int cha = 0; cha < 16; cha++) {
            acc[i] += inpx.ch[cha] * y;
        }
        // do more stuff
        outpx.ch[i] = x; // write temp variable
    }

    outStream.write(outpx);
}

This top-level function receives a stream of pixels and should process one pixel per call. A new pixel arrives every 528 clock cycles, so the function has 528 clock cycles to process each one. I would therefore like to constrain the function's latency to no more than 528 clock cycles, while using as few resources as possible. Since loop L0 has 100 iterations, each iteration needs to finish within ~5 clock cycles (528 / 100 ≈ 5.3) if executed sequentially, so I do not need to unroll L0. With these requirements, I added the following directives:

#pragma HLS LATENCY min=500 max=528      // directive for FUNCT
#pragma HLS UNROLL factor=1              // directive for L0 loop

 

However, the synthesised design results in function latency over 3000 cycles and the log shows the following message:

WARNING: [SCHED 204-71] Latency directive discarded for region FUNCT since it contains subloops.

Q1: What is the workaround for placing a latency constraint on the function while preserving the loops?

Additional information:

  1. I can explicitly and completely unroll the L1 loop; in fact, this is necessary for the L0 loop to meet 5 clocks per iteration.
  2. I can manually unroll both loops and write out operation after operation sequentially:
{ // manually unrolled L0 and L1
acc[0] = 0; acc[0] += inpx.ch[0] * x; acc[0] += inpx.ch[1] * y; acc[0] += inpx.ch[2] * z; ........ acc[0] += inpx.ch[15] * zz; // do more operations on acc[0]
acc[1] = 0; acc[1] += inpx.ch[0] * x; acc[1] += inpx.ch[1] * y; acc[1] += inpx.ch[2] * z; ........ acc[1] += inpx.ch[15] * zz; // do more operations on acc[1]
.........
acc[99] = 0; acc[99] += inpx.ch[0] * x; acc[99] += inpx.ch[1] * y; acc[99] += inpx.ch[2] * z; ........ acc[99] += inpx.ch[15] * zz; // do more operations on acc[99]
}

Q2: Does HLS have a limit on how long a single line can be, or on how many operations it can contain? Will it have problems parsing or compiling if my loops have thousands of iterations?

Q3: Would it make any difference to the synthesis if I omitted the intermediate variable outpx and wrote the result directly to outStream (as below)?

Instead of:
         // do more stuff
         outpx.ch[i] = x; // write to local variable
       }

outStream.write(outpx); // write local to stream
}

Do this:

         // do more stuff
         outStream.ch[i] = x; // write directly to the stream
       }

}

 Thank you in advance.


Accepted Solutions
u4223374
Advisor
576 Views
Registered: ‎04-26-2015

Putting a top-level latency constraint is probably not a good option; HLS won't be able to plan such a large number of operations properly. It's more for when you want to tell HLS "this multiply must take exactly one cycle" or "this floating-point addition can take three cycles".

It'll work better if you figure out what HLS needs to do, and tell it that - humans are much better at this than computers are. In this case, I think that if you unroll the inner loop with factor = 4 and pipeline the rest (so it completes four iterations of the inner loop per cycle) then the total time will be about 400 cycles (plus a few for pipeline startup) - which will do the job nicely.
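A minimal sketch of that suggestion follows. The names `process_pixel`, `coeff`, `N_OUT`, `N_CH` and `UNROLL_FACTOR`, the coefficient table, and the 32-bit accumulator are assumptions for illustration and do not come from the original post. `PIPELINE` and `UNROLL` are the standard Vivado HLS directives; an ordinary C++ compiler ignores them, so the arithmetic can be checked in software first. The estimate of about 400 cycles is simply 100 × 16 / 4; whether the tool actually reaches it depends on how it schedules the loop nest and on the memories providing four reads per cycle, so it should be confirmed in the synthesis report.

```cpp
#include <cstdint>

constexpr int N_OUT = 100;         // L0 trip count, from the post
constexpr int N_CH = 16;           // L1 trip count, from the post
constexpr int UNROLL_FACTOR = 4;   // must match the literal in the UNROLL pragma

static_assert(N_CH % UNROLL_FACTOR == 0, "factor should divide the trip count");

// Number of pipelined inner iterations after partial unrolling:
// 100 * (16 / 4) = 400, the figure quoted in the answer.
constexpr int pipelined_iterations() {
    return N_OUT * (N_CH / UNROLL_FACTOR);
}

// Hypothetical kernel: one multiply-accumulate per (output, channel) pair.
void process_pixel(const uint8_t in[N_CH],
                   const uint8_t coeff[N_OUT][N_CH],
                   uint32_t acc_out[N_OUT]) {
    // L0: left rolled, 100 iterations
    for (int i = 0; i < N_OUT; i++) {
        uint32_t acc = 0;
        // L1: 16 iterations, 4 of them executed per pipelined step
        for (int c = 0; c < N_CH; c++) {
#pragma HLS UNROLL factor=4
#pragma HLS PIPELINE II=1
            acc += static_cast<uint32_t>(in[c]) * coeff[i][c];
        }
        acc_out[i] = acc;
    }
}
```

The directives only change the schedule, not the result, so the same function can serve as its own software reference when comparing against C/RTL co-simulation.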

 

I think that answers Q1. For Q2, I have no idea, but I doubt there is a limit. For Q3, your proposed approach won't work: as soon as you write to any part of an output stream element, HLS sends that element. Your current approach (filling a whole element locally and then writing it in one operation) is correct.
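The Q3 pattern can be sketched as below. `SimpleStream` is a hypothetical software stand-in, used here only so the example compiles without the Vivado HLS headers; in a real design it would be `hls::stream<TYPE2>` from `hls_stream.h`. The helper `emit_pixel` and its `results` argument are likewise assumptions, not part of the original code.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>

struct TYPE2 { uint8_t ch[100]; };

// Hypothetical stand-in for hls::stream<T>: a FIFO of whole elements.
template <typename T>
class SimpleStream {
public:
    void write(const T &v) { q_.push(v); }
    T read() { T v = q_.front(); q_.pop(); return v; }
    bool empty() const { return q_.empty(); }
    std::size_t size() const { return q_.size(); }
private:
    std::queue<T> q_;
};

// Fill a local element completely, then push it with a single write.
void emit_pixel(const uint8_t results[100], SimpleStream<TYPE2> &outStream) {
    TYPE2 outpx;
    for (int i = 0; i < 100; i++) {
        outpx.ch[i] = results[i];  // field-by-field update of the local copy
    }
    outStream.write(outpx);        // one write produces exactly one stream element
}
```

A stream only exposes whole-element reads and writes, which is why the local variable `outpx` is needed to assemble the 100 channels before sending.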
