Explorer
502 Views
Registered: ‎05-23-2017

Why the pragma hls unroll always unroll completely

void dist_calc_or1(hls::stream<D_point_or> &feature_or_in, D_point_or *query_or, Dtype_uint data_size, hls::stream<D_dist1> &dist_temp_buffer) {
    Dtype_l dist_temp_buffer1[D_OR];
    #pragma HLS ARRAY_PARTITION variable=dist_temp_buffer1 cyclic factor=16 dim=1
    D_point_or feature_or;
    #pragma HLS ARRAY_PARTITION variable=feature_or.x cyclic factor=480 dim=1
    D_dist1 dist_temp;
    #pragma HLS ARRAY_PARTITION variable=dist_temp.x cyclic factor=16 dim=1

    dist_calc_or0: for (Dtype_uint i = 0; i < data_size; i++) {
        #pragma HLS LOOP_TRIPCOUNT min=10 max=10
        #pragma HLS PIPELINE
        feature_or = feature_or_in.read();

        dist_calc_or1: for (int j = 0; j < D_OR; j++) {
            #pragma HLS PIPELINE
            #pragma HLS unroll factor = 16
            dist_temp_buffer1[j] = feature_or.x[j] - query_or->x[j];
            dist_temp.x[j] = dist_temp_buffer1[j] * dist_temp_buffer1[j];
        }

        dist_temp_buffer << dist_temp;
    }
}

I want to split the inner for loop into 16 parallel blocks, i.e. unroll it by a factor of 16.

In the vivado_hls.log file I can see that the loop "dist_calc_or1" is unrolled completely:

INFO: [XFORM 203-501] Unrolling loop 'dist_calc_or1' (f_fpga.cpp:149) in function 'dist_calc_or1' completely.

How can I unroll the for loop by a factor of 16 only?

Thanks.

0 Kudos
8 Replies
Mentor xilinxacct
482 Views
Registered: ‎10-23-2018

Re: Why the pragma hls unroll always unroll completely

@mathmaxsean Because the parent loop is pipelined, the nested loop's unroll pragma is ignored.

Hope that helps. If so, please mark as solution accepted. Kudos also welcomed. :-)
Scholar u4223374
462 Views
Registered: ‎04-26-2015

Re: Why the pragma hls unroll always unroll completely

As @xilinxacct has stated, pipelining a loop unrolls all sub-loops (and inlines all sub-functions). In this case that directive is going to have a higher priority than your partial unroll directive.
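
One way to work with this rule rather than against it is to strip-mine the loop by hand: pipeline a loop that steps through the data in chunks of 16, and let the pipeline fully unroll a nested 16-iteration loop. The following is only a minimal sketch under stated assumptions, not the original kernel: `squared_diff`, `UNROLL`, the plain `int` arrays and the value `D_OR = 64` are hypothetical stand-ins for the types and sizes in the question, and `D_OR` is assumed to be a multiple of 16. An ordinary C++ compiler ignores the HLS pragma, so the function can also be checked in software.

```cpp
#include <cassert>

// Hypothetical sizes for illustration; the thread does not state D_OR.
const int D_OR = 64;    // assumed to be a multiple of UNROLL
const int UNROLL = 16;  // the partial-unroll factor asked for in the question

// Squared difference per element, processed in chunks of UNROLL elements.
void squared_diff(const int feature[D_OR], const int query[D_OR], int out[D_OR]) {
    // The chunking below only covers every element if this holds.
    assert(D_OR % UNROLL == 0);

    chunk_loop: for (int j = 0; j < D_OR; j += UNROLL) {
        #pragma HLS PIPELINE
        // Pipelining chunk_loop unrolls lane_loop completely, which gives
        // UNROLL parallel copies of the body per pipeline iteration.
        lane_loop: for (int k = 0; k < UNROLL; k++) {
            int d = feature[j + k] - query[j + k];
            out[j + k] = d * d;
        }
    }
}
```

The design choice here is that the complete unroll forced by the pipeline is limited to the 16-iteration inner loop, so the amount of parallel hardware is set by `UNROLL` rather than by `D_OR`.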

Explorer
441 Views
Registered: ‎05-23-2017

Re: Why the pragma hls unroll always unroll completely

@xilinxacct 

Thanks very much for the quick reply.

Yes, you are right!

I removed the PIPELINE pragma from the parent loop, and the unroll factor pragma now works.

But another issue arises: the compilation for hardware emulation takes a very long time.

When I checked the vivado_hls.log file, I found that the compiler is processing the loop (dist_calc_or1) that has the unroll factor pragma.

Is there a way to accelerate that?

 

Thanks.

0 Kudos
Explorer
439 Views
Registered: ‎05-23-2017

Re: Why the pragma hls unroll always unroll completely

@u4223374

Thanks.

Yes that's the answer.

But I found that after I removed the parent loop's pipeline pragma, the unroll factor did work, but the hardware emulation compilation takes very long.

Is there a way to make it faster?

0 Kudos
Explorer
403 Views
Registered: ‎05-23-2017

Re: Why the pragma hls unroll always unroll completely

void dist_calc_or1(hls::stream<D_point_or> &feature_or_in, D_point_or *query_or, Dtype_uint data_size, hls::stream<D_dist1> &dist_temp_buffer) {
    Dtype_l dist_temp_buffer1[D_OR];
    #pragma HLS ARRAY_PARTITION variable=dist_temp_buffer1 cyclic factor=480 dim=1
    D_point_or feature_or;
    #pragma HLS ARRAY_PARTITION variable=feature_or.x cyclic factor=480 dim=1
    D_dist1 dist_temp;
    #pragma HLS ARRAY_PARTITION variable=dist_temp.x cyclic factor=480 dim=1

    dist_calc_or0: for (Dtype_uint i = 0; i < data_size; i++) {
        #pragma HLS LOOP_TRIPCOUNT min=100 max=100
        // #pragma HLS PIPELINE
        feature_or = feature_or_in.read();

        dist_calc_or1: for (int j = 0; j < D_OR; j++) {
            // #pragma HLS PIPELINE
            #pragma HLS unroll factor = 16
            dist_temp_buffer1[j] = feature_or.x[j] - query_or->x[j];
            dist_temp.x[j] = dist_temp_buffer1[j] * dist_temp_buffer1[j];
        }

        dist_temp_buffer << dist_temp;
    }
}

Here is the code with the pipeline removed.

0 Kudos
Explorer
365 Views
Registered: ‎05-23-2017

Re: Why the pragma hls unroll always unroll completely

The build stalled for a whole night at "Implementing module 'dist_calc_or1'":

INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'dist_calc_or1' 
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'dist_calc_or0.1'.
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 1.
INFO: [SCHED 204-61] Pipelining loop 'dist_calc_or1'.
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 5.
INFO: [SCHED 204-61] Pipelining loop 'dist_calc_or0.3'.
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 2.

Any hint is very welcome!

 

0 Kudos
Scholar u4223374
327 Views
Registered: ‎04-26-2015

Re: Why the pragma hls unroll always unroll completely

@mathmaxsean Looking at your code, you've got three arrays that are each being split into 480 separate arrays (i.e. a total of 1440 arrays).

 

This in itself is not great: to find where (for example) element 2875 lives, HLS has to compute 2875 % 480 to select the array and 2875 / 480 for the index within that array (the divide result can be reused for both operations). This division is going to be slow, both to run and to build.
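
The index arithmetic described here can be written out in a few lines. This is just a software model of the cyclic mapping for illustration, not code that HLS generates; `BankAddress` and `cyclic_address` are hypothetical names.

```cpp
#include <cassert>

// Where one element of the original array lands after a cyclic partition.
struct BankAddress {
    int bank;    // which of the partitioned arrays holds the element
    int offset;  // position of the element inside that array
};

// Cyclic partitioning interleaves elements across the banks:
// element 0 -> bank 0, element 1 -> bank 1, ..., wrapping at `factor`.
BankAddress cyclic_address(int index, int factor) {
    assert(index >= 0 && factor > 0);
    return BankAddress{index % factor, index / factor};
}
```

With a factor of 480, element 2875 maps to bank 475 at offset 5. With a factor of 16, the same element maps to bank 11 at offset 179, and because 16 is a power of two the modulo and divide reduce to a bit mask and a shift.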

 

However, you're then accessing 48 of those 1440 arrays at any time. This is going to create a massive array of multiplexers, and I wouldn't be too surprised if it does actually take days to build the design.
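
To get a feel for where the multiplexers come from, it can help to count how many different banks a single unrolled copy of the loop body has to reach. The sketch below is a rough software model under stated assumptions, not how HLS analyses the design: `banks_touched_by_lane` is a hypothetical helper, and the array length used in the note after it (960) is an assumed value because the thread does not state `D_OR`.

```cpp
#include <cassert>
#include <set>

// Number of distinct banks that unrolled copy `lane` touches over the whole
// loop, for an array of `length` elements split cyclically into `partition`
// banks and a loop unrolled by `unroll`.
int banks_touched_by_lane(int lane, int length, int partition, int unroll) {
    assert(lane >= 0 && lane < unroll && partition > 0);
    std::set<int> banks;
    // Copy `lane` of the unrolled body handles j = lane, lane + unroll, ...
    for (int j = lane; j < length; j += unroll) {
        banks.insert(j % partition);
    }
    return static_cast<int>(banks.size());
}
```

Under these assumptions, with a length of 960 and an unroll factor of 16, a cyclic factor of 480 makes each unrolled copy reach 30 different banks, while a cyclic factor of 16 makes each copy reach exactly one. That suggests matching the partition factor to the unroll factor removes most of the selection logic, though the actual hardware cost depends on the tool and the design.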

 

 

Explorer
310 Views
Registered: ‎05-23-2017

Re: Why the pragma hls unroll always unroll completely

@u4223374

Thanks very much for your explanation.

If so, I think the issue can be resolved by partitioning the three arrays with a cyclic factor of 16 instead of 480.

Am I right?

0 Kudos