estimated clock period (3.82435ns) exceeds the target

typedef struct point_or // for the original feature vector
{
float x[D_OR];
} D_point_or;

void dist_calc_or0(D_point_or *feature_or,D_point_or *query_or,D_dist1 *dist_temp){
#pragma HLS INLINE off
	float dist_temp_buffer1[960];
#pragma HLS ARRAY_PARTITION variable=dist_temp_buffer1 cyclic factor=480 dim=1
	D_point_or query_or_local = *query_or; // Line 189
#pragma HLS ARRAY_PARTITION variable=query_or_local.x cyclic factor=480 dim=1
#pragma HLS ALLOCATION instances=fmul limit=32 operation
#pragma HLS ALLOCATION instances=fsub limit=32 operation
	loopForReduceDSP:for(int j=0; j<960;j++){
#pragma HLS unroll
		dist_temp_buffer1[j]=feature_or->x[j]-query_or_local.x[j]; // Line 199
		dist_temp->x[j]= dist_temp_buffer1[j]*dist_temp_buffer1[j];
	}
}
INFO: [BIND 205-100] After resource sharing, estimated clock period (3.82435ns) exceeds the target (target clock period: 3.33333ns, clock uncertainty: 0.266667ns, effective delay budget: 3.06667ns).
INFO: [BIND 205-100] The critical path consists of the following:
    'load' operation ('query_or_local.x[0][0]', src/pcaf_fpga.cpp:189) on array 'query_or_x_0' (0.594 ns)
    'fsub' operation ('tmp_s', /src/pcaf_fpga.cpp:199) (3.23 ns)
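
The numbers in the message above can be checked by hand: the effective delay budget is the target period minus the clock uncertainty, and the estimated period is compared against that budget. The sketch below only redoes this arithmetic with the values copied from the log; the struct and function names are hypothetical and are not part of any Xilinx API.

```cpp
#include <cassert>
#include <cmath>

// Values copied from the BIND 205-100 messages (all in ns).
// TimingReport, delay_budget and slack are illustrative names only.
struct TimingReport {
    double target_period;  // 3.33333
    double uncertainty;    // 0.266667
    double estimated;      // 3.82435
};

// Effective delay budget = target clock period - clock uncertainty.
double delay_budget(const TimingReport &r) {
    return r.target_period - r.uncertainty;
}

// Slack = budget - estimated period; a negative value means the
// estimated period does not fit in the budget.
double slack(const TimingReport &r) {
    return delay_budget(r) - r.estimated;
}

// Sum of the two critical-path operations listed in the log:
// the 'load' (0.594 ns) and the 'fsub' (3.23 ns).
double critical_path() {
    return 0.594 + 3.23;
}
```

With the logged values the budget comes out at about 3.067 ns and the slack at about -0.758 ns, and the load plus the fsub add up to roughly the 3.824 ns estimate, so the fsub alone already exceeds the budget.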

During the hardware emulation I noticed this information in the vivado_hls.log file.

How can I modify the code to remove this timing issue?

 

Thanks.

 

Re: estimated clock period (3.82435ns) exceeds the target


Here is a copy of the comparison report.

Solution 1 is the original from your code.

Latency (clock cycles)

                 solution1  solution2  solution3  solution4
  Latency   min       4339        980        500         50
            max       4339        980        500         50
  Interval  min       4339        980        500         50
            max       4339        980        500         50

Utilization Estimates

             solution1  solution2  solution3  solution4
  BRAM_18K           2          0          0          0
  DSP48E           160          5         10        160
  FF             50946        832       1572      23146
  LUT            59436        545        996      13849
  URAM               0          0          0          0
==================================
Olivier Trémois
XILINX EMEA DSP Specialist

Re: estimated clock period (3.82435ns) exceeds the target


When you PARTITION your array and fully UNROLL your loop, you leave HLS the task of scheduling all the operations, which leads to a big FSM to control everything. This reduces Fmax.

Line 189 is performing a copy of the input data (2 clock cycles per element). Is that really what you want, or is it just to be able to partition the array?

Here is much simpler code that uses the PIPELINE directive instead of UNROLL:

void dist_calc_or0(D_point_or *feature_or,D_point_or *query_or,D_dist1 *dist_temp){
#pragma HLS INLINE off

	loopForReduceDSP:for(int j=0; j<960;j++){
#pragma HLS PIPELINE
		float tmp=feature_or->x[j]-query_or->x[j];
		dist_temp->x[j]= tmp*tmp;
	}
}
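
Before comparing synthesis reports, it can help to confirm in plain C++ that the rewritten loop still computes the squared differences. The following is a minimal host-side sketch, assuming D_OR is 960 (taken from the loop bound) and that D_dist1 holds one float per element; D_dist1 is not shown in the thread, so its layout here is a guess, and count_mismatches is a hypothetical helper. The HLS pragmas are simply ignored by a host compiler.

```cpp
#include <cassert>
#include <cmath>

#define D_OR 960  // assumption: matches the loop bound used in the thread

typedef struct point_or { float x[D_OR]; } D_point_or;
// Assumption: D_dist1 is not shown in the thread; modelled as one float per element.
typedef struct dist1 { float x[D_OR]; } D_dist1;

// Pipelined version from the reply (pragmas have no effect on the host).
void dist_calc_or0(D_point_or *feature_or, D_point_or *query_or, D_dist1 *dist_temp) {
#pragma HLS INLINE off
    for (int j = 0; j < D_OR; j++) {
#pragma HLS PIPELINE
        float tmp = feature_or->x[j] - query_or->x[j];
        dist_temp->x[j] = tmp * tmp;
    }
}

// Hypothetical helper: fills the inputs with known values, runs the
// function, and counts elements that differ from a direct computation.
int count_mismatches() {
    static D_point_or feature, query;
    static D_dist1 out;
    for (int j = 0; j < D_OR; j++) {
        feature.x[j] = 0.5f * j;
        query.x[j] = 0.25f * j - 3.0f;
    }
    dist_calc_or0(&feature, &query, &out);

    int bad = 0;
    for (int j = 0; j < D_OR; j++) {
        float d = feature.x[j] - query.x[j];
        float expected = d * d;
        // Relative tolerance, with a floor of 1 for small values.
        float scale = expected > 1.0f ? expected : 1.0f;
        if (std::fabs(out.x[j] - expected) > 1e-5f * scale) ++bad;
    }
    return bad;
}
```

A return value of 0 from count_mismatches() means every element matched; the same check can be reused unchanged for the other variants in this reply, since they keep the same function signature.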

Latency is divided by more than 4 (980 cycles instead of 4339), DSP usage by 32, and LUT/FF usage by more than 60, and Fmax is much higher. Furthermore, Vivado HLS synthesizes this much faster.

If you want to use the 2 ports of the BRAMs:

void dist_calc_or0(D_point_or *feature_or,D_point_or *query_or,D_dist1 *dist_temp){
#pragma HLS INLINE off

	loopForReduceDSP:for(int j=0,ind=0; j<(960/2);j++){
#pragma HLS PIPELINE
		InnerLoop:for(int k=0; k<2;k++,ind++){
			float tmp=feature_or->x[ind]-query_or->x[ind];
			dist_temp->x[ind]= tmp*tmp;
		}
	}
}

Latency is halved (500 cycles), at the expense of roughly doubling the other resources.

Now, if your objective is to reduce the latency even further, you can do so only if your original array is already partitioned. For example, for 32 parallel computations:

#define PRAGMA_SUB(x) _Pragma (#x)
#define PRAGMA_HLS(x) PRAGMA_SUB(x)

void dist_calc_or0(D_point_or *feature_or,D_point_or *query_or,D_dist1 *dist_temp){
#pragma HLS INLINE off

#define UNROLL_FACTOR 32
#define PARTITION_FACTOR 16

PRAGMA_HLS( HLS ARRAY_PARTITION variable=feature_or->x cyclic factor=PARTITION_FACTOR dim=1)
PRAGMA_HLS( HLS ARRAY_PARTITION variable=query_or->x cyclic factor=PARTITION_FACTOR dim=1)
PRAGMA_HLS( HLS ARRAY_PARTITION variable=dist_temp->x cyclic factor=PARTITION_FACTOR dim=1)

		loopForReduceDSP:for(int j=0,ind=0; j<(960/UNROLL_FACTOR);j++){
#pragma HLS PIPELINE
			InnerLoop:for(int k=0; k<UNROLL_FACTOR;k++,ind++){
				float tmp=feature_or->x[ind]-query_or->x[ind];
				dist_temp->x[ind]= tmp*tmp;
			}
		}
}

The PRAGMA_HLS macro is defined so that macro constants can be used within directives.
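
The reason for the two macro levels is standard preprocessor behaviour: `#x` stringifies its argument as written, so a constant such as UNROLL_FACTOR is only replaced by its value if the argument first passes through an outer macro. The sketch below shows the same two-level shape as PRAGMA_HLS and PRAGMA_SUB, but returns the string instead of feeding it to `_Pragma`, so the difference can be inspected. STR_DIRECT, STR_SUB, STR_EXPANDED and the two functions are names made up for this illustration.

```cpp
#include <cassert>
#include <string>

#define UNROLL_FACTOR 32

// One level: the argument is stringified exactly as written.
#define STR_DIRECT(x) #x

// Two levels, same shape as PRAGMA_HLS -> PRAGMA_SUB: the argument is
// macro-expanded when it is passed on, and only then stringified.
#define STR_SUB(x) #x
#define STR_EXPANDED(x) STR_SUB(x)

// Text a single-level macro would hand to _Pragma: the constant's name survives.
std::string direct_text() {
    return STR_DIRECT(HLS UNROLL factor=UNROLL_FACTOR);
}

// Text the two-level macro would hand to _Pragma: the constant's value appears.
std::string expanded_text() {
    return STR_EXPANDED(HLS UNROLL factor=UNROLL_FACTOR);
}
```

With a single level the directive text would still contain the name UNROLL_FACTOR, while the two-level form contains the value 32 in its place.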

Latency is divided by 10 (50 cycles), but resource usage is multiplied by about 14 (LUT/FF) and 16 (DSP).
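
The ratios quoted in this reply can be re-derived from the comparison report posted in this thread. The sketch below just repeats that arithmetic; the figures are copied from the report, while the Solution struct, kReport, ratio, outer_trips and extra_cycles are hypothetical names introduced for the illustration. The last helper reflects an observation about these four numbers only and should not be read as a statement about how the tool computes latency.

```cpp
#include <cassert>

// Figures copied from the comparison report (solution1..solution4).
struct Solution { int latency, dsp, ff, lut; };

const Solution kReport[4] = {
    {4339, 160, 50946, 59436},  // solution1: original code
    { 980,   5,   832,   545},  // solution2: PIPELINE, 1 element per iteration
    { 500,  10,  1572,   996},  // solution3: PIPELINE, 2 elements per iteration
    {  50, 160, 23146, 13849},  // solution4: PIPELINE, 32 elements per iteration
};

// Ratio of one figure between two solutions.
double ratio(int before, int after) {
    return static_cast<double>(before) / after;
}

// Iterations of the pipelined outer loop when each iteration
// handles per_iteration of the 960 elements.
int outer_trips(int per_iteration) {
    return 960 / per_iteration;
}

// Reported latency minus one cycle per outer iteration.
int extra_cycles(const Solution &s, int per_iteration) {
    return s.latency - outer_trips(per_iteration);
}
```

For example, ratio(kReport[0].latency, kReport[1].latency) is about 4.4 and ratio(kReport[0].dsp, kReport[1].dsp) is exactly 32. For solutions 2 to 4 the reported latency is the outer trip count (960, 480, 30) plus the same 20 cycles.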

==================================
Olivier Trémois
XILINX EMEA DSP Specialist

Re: estimated clock period (3.82435ns) exceeds the target


@oliviert

Thank you very much for your code.

It's very helpful.

I partitioned the query_or outside of this function.

I used your solution 3 and it gives a better result than mine.

 

Thanks.

 


Re: estimated clock period (3.82435ns) exceeds the target


@oliviert

For solution 4 I get the same result as yours.

But when I changed the data type from float to ap_fixed<32,8>, I got the following report:

 

  Module Name         Start Interval  Best Case  Avg Case  Worst Case
  ------------------  --------------  ---------  --------  ----------
    dist_calc_or0       30              32         32        32

Area Information
  Module Name         FF      LUT     DSP   BRAM
  ------------------  ------  ------  ----  ----
    dist_calc_or0       250592  89163   3840  0

 


D_point_or feature_or;
#pragma HLS ARRAY_PARTITION variable=feature_or.x cyclic factor=480 dim=1
D_dist1 dist_temp;
#pragma HLS ARRAY_PARTITION variable=dist_temp.x cyclic factor=480 dim=1
D_point_or query_or;
#pragma HLS ARRAY_PARTITION variable=query_or_oc.x cyclic factor=16 dim=1

void dist_calc_or0(D_point_or *feature_or,D_point_or *query_or,D_dist1 *dist_temp){
#pragma HLS INLINE off
	loopForReduceDSP:for(int j=0,ind=0; j<(960/32);j++){
#pragma HLS PIPELINE
		InnerLoop:for(int k=0; k<32;k++,ind++){
			Dtype_l tmp=feature_or->x[ind]-query_or->x[ind];
			dist_temp->x[ind]= tmp*tmp;
		}
	}
}

I did the Array partition outside of the function.

 

 
