Explorer

Same code and constraints, but the float design uses only half the DSPs of the ap_fixed 12-bit design

Hi, dear HLS experts,

I am having a hard time interpreting the following HLS results. I wrote a simple vector-matrix multiplication kernel with an HLS PIPELINE pragma on the outer loop, as shown below, and tried two data types: float and ap_fixed<12,2>. I expected the 12-bit ap_fixed version to use less resource area than the 32-bit float version.

However, the result is the opposite: the float version uses only 10 DSPs under the same pipeline constraint, and its latency is comparable to the ap_fixed<12,2> design.

With the pipeline constraint on the outer loop, HLS should fully unroll the inner loop, whose trip count is 20, so I don't understand why the float version consumes only 10 DSPs.

Also, a floating-point multiplier should take many more cycles than a fixed-point multiplier, yet the latency is almost the same, around 8000 cycles.

I have many questions about interpreting these results and would appreciate your guidance and discussion. The Vivado HLS version is 2017.2.

#include "ap_fixed.h"

//typedef ap_fixed<12,2> data_in_T;
//typedef ap_fixed<12,2> weight_T;
//typedef ap_fixed<12,2> data_out_T;


typedef float data_in_T;
typedef float weight_T;
typedef float data_out_T;


void mvm(
  data_in_T  in_data[800],
  data_out_T intBuf[20],
  weight_T   weights_in[801][20]
)
{
  data_in_T  in_data_cache;
  data_out_T acc[20];
  data_out_T acc_tmp;
  weight_T   weights;

  out_layer_in: for (int ii = 0; ii < 800; ++ii) {
#pragma HLS PIPELINE
    inn_layer_in: for (int jj = 0; jj < 20; ++jj) {
      in_data_cache = in_data[ii];         // same input element for all 20 lanes
      weights       = weights_in[ii][jj];  // one weight per output lane
      acc_tmp       = acc[jj];
      acc[jj]       = acc_tmp + in_data_cache * weights;  // multiply-accumulate
      if (ii == 19) {
        intBuf[jj] = acc[jj];              // copy accumulators to the output buffer
      }
    }
  }
}

 

Selection_204.png: ap_fixed<12,2> version
Selection_205.png: float version

Xilinx Employee

The performance is limited by the weights input array and the output array. Since these are synthesized as memory interfaces with a limited number of read/write ports, the loop cannot reach the target throughput even though you pipeline it.

 

You can partition the weights array and the output array and then compare.

 

You can apply these pragmas:

 

#pragma HLS ARRAY_PARTITION variable=intBuf complete dim=1

#pragma HLS ARRAY_PARTITION variable=weights_in complete dim=2
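
For reference, this is roughly where those pragmas sit relative to your loop nest (just a placement sketch based on the code you posted, not the exact project I ran):

void mvm(
  data_in_T  in_data[800],
  data_out_T intBuf[20],
  weight_T   weights_in[801][20]
)
{
#pragma HLS ARRAY_PARTITION variable=intBuf complete dim=1
#pragma HLS ARRAY_PARTITION variable=weights_in complete dim=2
  data_in_T  in_data_cache;
  data_out_T acc[20];
  weight_T   weights;

  out_layer_in: for (int ii = 0; ii < 800; ++ii) {
#pragma HLS PIPELINE
    inn_layer_in: for (int jj = 0; jj < 20; ++jj) {
      in_data_cache = in_data[ii];
      weights       = weights_in[ii][jj];
      acc[jj]       = acc[jj] + in_data_cache * weights;
      if (ii == 19) {
        intBuf[jj] = acc[jj];
      }
    }
  }
}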

 

Below is what I tried with 2019.2.

solution5 is the floating-point version and solution6 is the fixed-point version.

 

You can clearly see the difference: fixed point is better in both performance and area.

 

latency_resource.JPG

 

Explorer

@nithink  Thank you for your experiment and comment.

I'm curious whether you got the same result as me for my original code and constraints in 2019.2, i.e. the float version consuming only 10 DSPs while the fixed-point version uses 20 DSPs. I don't know how to explain it.

 

Regarding your results with the memory partitioning: how do we explain the float version consuming 35 DSPs? For the fixed-point version I can explain it: the pipeline fully unrolls the inner loop, whose trip count is 20, so HLS uses 20 DSPs for the pipelined fixed-point version. But how would we explain the 35 DSPs in your float version?

Explorer

@nithink 

Following your example, I used 2019.2, which I just downloaded and installed on Ubuntu 16.04.

To compare with your results, I ran the following cases:

solution 1: float version with the pipeline pragma only, without the two array-partition pragmas
solution 2: fixed-point version with the pipeline pragma only, without the two array-partition pragmas
solution 3: float version with both the pipeline pragma and the two array-partition pragmas
solution 4: fixed-point version with both the pipeline pragma and the two array-partition pragmas

However, the results I got are somehow different from yours, although the trend is the same.

Do you know why we get different results even though we use the same code, constraints, and Vivado HLS version?

Also, how do we explain the DSP usage? Before array partitioning, the float version uses fewer DSPs than the fixed-point version. After adding the array partitioning, the fixed-point version keeps the same number of DSPs, but the float version increases from 10 to 25 DSPs. Why does array partitioning affect the inference of DSPs? And why does solution 1, with the pipeline only, use just 10 DSPs when I would expect it to use 20?

 

Selection_207.png

 

Xilinx Employee

@nanson 

The device I used is different. With the device you mentioned, I see the same result.

In the floating-point version without the array-partition pragmas, HLS infers only 2 floating-point multipliers. Each floating-point multiplier takes 4 cycles to complete and the floating-point adder takes 5 cycles.

HLS pipelines the 2 multipliers across the iterations. The 2 multipliers take 10 DSPs.

pic1.png

 

Let's look at the schedule view of how HLS implements the floating-point version when the arrays are partitioned.

pic2.png

Each multiplication takes around 4 cycles to complete and the addition takes 5 cycles. HLS has inferred 5 multipliers, and these multipliers are pipelined and reused across iterations.

Each floating-point multiplier takes around 5 DSPs, which adds up to 25 DSPs.
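
To put rough numbers on it (my own back-of-the-envelope reading of the schedule, not figures the tool reports directly): the inner loop needs 20 multiplications per outer iteration, and a pipelined floating-point multiplier can start a new multiplication every cycle, so 5 multipliers issue all 20 products in about 20 / 5 = 4 cycles, which lines up with the ~4-cycle multiplier latency above. At roughly 5 DSP48s per floating-point multiplier that is 5 x 5 = 25 DSPs, compared with 2 x 5 = 10 DSPs in the unpartitioned case and 20 x 1 = 20 DSPs for the fully unrolled fixed-point version.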

Coming to the fixed-point version, each multiplier completes its operation in a single cycle.

pic3.png

 

HLS inferred 20 multipliers in this case and performs the computation every cycle.

Hope this clears it up. Basically, what you need to look at is the trade-off between throughput and area.
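
If you ever want to push that trade-off explicitly toward area instead of throughput, one option (not something I used in the runs above) is the ALLOCATION pragma, which caps how many instances of an operator HLS may create; the tool then raises the initiation interval to fit within the cap. A minimal sketch, assuming I remember the floating-point multiply operator name (fmul) correctly for the 2019.2 pragma syntax; please check the pragma reference for your exact version:

  out_layer_in: for (int ii = 0; ii < 800; ++ii) {
#pragma HLS PIPELINE
#pragma HLS ALLOCATION instances=fmul limit=2 operation
    inn_layer_in: for (int jj = 0; jj < 20; ++jj) {
      // ... same loop body as in the original code ...
    }
  }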

 

Explorer

@nithink 

Thank you for your patience in explaining this to me. I get it now.

However, there is still one thing that puzzles me.

As far as I know, the HLS PIPELINE pragma should unroll the inner loop, whose trip count is 20 in this case. However, only 2 float multipliers and adders are inferred in solution 1, i.e. a parallelism of only 2, and 5 float multipliers and adders are inferred in solution 3, i.e. a parallelism of 5. In solutions 2 and 4, the parallelism is 20, which matches the expected behavior of fully unrolling the inner loop.

My question is why the float design with the HLS PIPELINE pragma does not reach a parallelism of 20, the trip count of the inner loop. This does not match the pipeline behavior seen in the fixed-point case, and because of this unexpected behavior these pragmas feel somewhat unpredictable to use in float designs. I think this is the main open question I take away from these cases.

Thank you
