neeruajith
Visitor
1,203 Views
Registered: ‎06-12-2018

Code replication to increase resource utilisation and reduce latency

I have a nested for-loop, and I want to reduce the execution latency of the program by replicating the outermost loop (first) over separate iteration intervals. For example, if K_NUMBER is 96, I want copies of the loop covering 1-48 and 49-96, or 1-32, 33-64 and 65-96. My final aim is to increase resource utilization in order to reduce execution latency.

But when I replicated the loop, the latency increased and resource utilization did not increase significantly. Can anyone please help me achieve this?

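first: for (int OD = 0; OD < K_NUMBER; OD++) {
  row: for (int OR = 0; OR < OUT_DIM; OR++) {
    col: for (int OC = 0; OC < OUT_DIM; OC++) {
      // Reset the per-output counters.
      countValid = 0; depth = 0; countDepth = 0;
//#pragma HLS UNROLL
      inst: for (int k = 0; k < no_inst; k++) {
#pragma HLS UNROLL
        pe: for (int n = 0; n < N; n++) {
#pragma HLS UNROLL
          // Kernel row and column of the (k*N+n)-th multiply-accumulate.
          kr = ((k*N + n) / K_DIM) % K_DIM;
          kc = (k*N + n) % K_DIM;
          // Use zero once all K_DIM*K_DIM*IMG_DEPTH valid products are done.
          tmpK = countValid < K_DIM*K_DIM*IMG_DEPTH ? kernel[OD][kr][kc + K_DIM*depth] : zero;
          tmpI = countValid < K_DIM*K_DIM*IMG_DEPTH ? in_fm[STR*OR + kr][STR*OC + depth*IMG_DIM + kc] : zero;
          tmp_out[OD][OR][OC] += tmpK * tmpI;
          countValid++; countDepth++;
          // Move to the next input channel after K_DIM*K_DIM products.
          if (countDepth >= K_DIM*K_DIM) { depth++; countDepth = 0; }
        }
      }
    }
  }
}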
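
A minimal sketch of the replication described above, assuming small illustrative sizes and a simplified loop body (a per-channel sum instead of the full convolution). The names `process_range`, `replicated`, `HALF` and `LEN` are hypothetical and not from my actual design. Each copy handles a disjoint range of OD and writes to its own output buffer, which is generally what an HLS tool needs before it can schedule the copies side by side; whether it actually does so also depends on the memory ports available on the shared inputs.

```cpp
// Illustrative sizes only; the real design uses a much larger K_NUMBER.
const int K_NUMBER = 8;
const int HALF = K_NUMBER / 2;
const int LEN = 4;

// Hypothetical helper: processes output channels [begin, begin + HALF).
// 'out' is a private buffer for this copy, so the two copies never
// write to the same memory.
void process_range(int kernel[K_NUMBER][LEN], int begin, int out[HALF]) {
    for (int od = 0; od < HALF; od++) {
        int acc = 0;  // local accumulator, written back once per channel
        for (int i = 0; i < LEN; i++) {
            acc += kernel[begin + od][i];
        }
        out[od] = acc;
    }
}

// Two copies of the loop over the disjoint intervals
// 0..HALF-1 and HALF..K_NUMBER-1, merged at the end.
void replicated(int kernel[K_NUMBER][LEN], int out[K_NUMBER]) {
    int lo[HALF], hi[HALF];
    process_range(kernel, 0, lo);
    process_range(kernel, HALF, hi);
    for (int od = 0; od < HALF; od++) {
        out[od] = lo[od];
        out[HALF + od] = hi[od];
    }
}
```

In this sketch both copies still read the same `kernel` array, so if that array sits in a single memory the two calls may end up serialized on its ports.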
4 Replies
u4223374
Advisor
1,166 Views
Registered: ‎04-26-2015

Are your arrays partitioned? Unpartitioned arrays are the usual cause of this problem: HLS ends up building a huge state machine to control access to the RAM.

neeruajith
Visitor
1,155 Views
Registered: ‎06-12-2018

Yes, my kernel and in_fm arrays are partitioned. But I don't understand how partitioning could prevent the latency from dropping.

u4223374
Advisor
1,150 Views
Registered: ‎04-26-2015

It can go both ways.

If the arrays aren't partitioned, then when HLS unrolls a loop it has to build a huge state machine to sequence the RAM accesses, because a block RAM has at most two ports.

If the arrays are partitioned too much (or with a factor that doesn't match the loop unrolling), then HLS has to build an enormous multiplexer to access the arrays/registers, and then add extra clock cycles to absorb the propagation delay.
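
Both effects can be estimated with a simple model of cyclic partitioning, in which element `i` is placed in bank `i % factor`. The sketch below is only illustrative: the function names are hypothetical, and it assumes each bank is a dual-port block RAM that serves at most two accesses per clock cycle.

```cpp
#include <algorithm>
#include <vector>

// Bank that element 'index' lands in under cyclic partitioning.
int cyclic_bank(int index, int factor) { return index % factor; }

// Largest number of accesses that hit a single bank when 'unroll'
// consecutive elements starting at 'start' are accessed in one iteration.
int max_accesses_per_bank(int start, int unroll, int factor) {
    std::vector<int> hits(factor, 0);
    for (int i = 0; i < unroll; i++) {
        hits[cyclic_bank(start + i, factor)]++;
    }
    return *std::max_element(hits.begin(), hits.end());
}

// Minimum cycles needed to serve those accesses, assuming every bank
// offers 'ports_per_bank' accesses per cycle (2 for a dual-port RAM).
int min_cycles(int start, int unroll, int factor, int ports_per_bank) {
    int worst = max_accesses_per_bank(start, unroll, factor);
    return (worst + ports_per_bank - 1) / ports_per_bank;  // ceiling division
}
```

Under this model, 100 unrolled accesses to an array split into 3 cyclic banks need at least 17 cycles (34 accesses on the busiest bank, two per cycle), whereas an unpartitioned array needs 50. A factor of 100 would bring this down to one cycle, at the cost of a 100-way selection for any access whose index is not known at compile time.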

 

What are the partitioning factors for the arrays, and what is the value of OUT_DIM?

neeruajith
Visitor
1,142 Views
Registered: ‎06-12-2018

My array definitions are:

data_t in_fm[IMG_DEPTH][IMG_DIM][IMG_DIM],
data_t kernel[K_NUMBER][IMG_DEPTH][K_DIM][K_DIM],
data_t out_fm[K_NUMBER][OUT_DIM][OUT_DIM]

 

with values:

#define STR 1
#define K_NUMBER 192
#define K_DIM 3
#define IMG_DEPTH 192
#define IMG_DIM 13
#define OUT_DIM ((IMG_DIM-K_DIM)/STR+1)
#define N 100

 

Array partitioning pragmas:

#pragma HLS ARRAY_PARTITION variable=kernel cyclic factor=3 dim=3
#pragma HLS ARRAY_PARTITION variable=in_fm cyclic factor=2 dim=2

 

 

As you can see, I am not partitioning the arrays too much. I want to parallelize the outermost loop by splitting its iteration range, but all I get is increased latency.
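
A rough back-of-the-envelope check with the values above, written as a small program. It assumes that each partitioned bank is a dual-port RAM (two reads per cycle) and that every multiply-accumulate needs one read from `kernel` and one from `in_fm`; both are assumptions, not figures reported by the tool. The names `read_bound`, `kernel_bound` and `in_fm_bound` are hypothetical.

```cpp
// Values from the definitions above.
const long STR = 1, K_NUMBER = 192, K_DIM = 3, IMG_DEPTH = 192, IMG_DIM = 13;
const long OUT_DIM = (IMG_DIM - K_DIM) / STR + 1;  // 11

// Multiply-accumulates for one output element and for the whole layer.
const long MACS_PER_OUTPUT = K_DIM * K_DIM * IMG_DEPTH;  // 1728
const long TOTAL_MACS = MACS_PER_OUTPUT * K_NUMBER * OUT_DIM * OUT_DIM;

// Idealized lower bound on cycles when an array is split into 'banks'
// banks and each bank serves 'ports' reads per cycle (assumed: 2).
long read_bound(long reads, long banks, long ports) {
    long per_cycle = banks * ports;
    return (reads + per_cycle - 1) / per_cycle;  // ceiling division
}

// Bounds for one output element with the pragmas above:
// kernel uses cyclic factor 3, in_fm uses cyclic factor 2.
long kernel_bound() { return read_bound(MACS_PER_OUTPUT, 3, 2); }  // 288
long in_fm_bound()  { return read_bound(MACS_PER_OUTPUT, 2, 2); }  // 432
```

Under these assumptions, each output element needs at least 432 cycles for the `in_fm` reads alone, however far the `inst` and `pe` loops are unrolled. Copies of the `first` loop that read the same `in_fm` banks compete for the same ports, so this bound does not improve, which would be consistent with the latency not dropping after replication.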
