12-08-2019 12:53 PM - edited 12-08-2019 12:54 PM
Hi, dear HLS experts,
I'm having a hard time interpreting the following HLS results. I wrote a simple vector-matrix multiplication in HLS with a PIPELINE pragma on the outer loop, as shown below. I tried two data types: float and ap_fixed<12,2>. I expected the 12-bit ap_fixed version to use fewer resources than the 32-bit float version.
However, the results show the opposite. Under the same pipeline constraint, the float version uses only 10 DSPs. Its latency is also comparable to that of the ap_fixed<12,2> design.
Pipelining the outer loop should fully unroll the inner loop, whose trip count is 20, so I don't understand why the float version consumes only 10 DSPs.
Also, a floating-point multiplier should take many more cycles than a fixed-point multiplier, yet the latency is almost the same, around 8000 cycles.
I have many questions about these results and would appreciate your guidance and discussion. The VHLS version is 2017.02.
12-09-2019 03:51 AM
The performance is limited by the weights input array and the output array. Because these are synthesized as memory interfaces with a limited number of ports, pipelining alone can't deliver the expected performance.
You can partition the weights array and the output array and then compare the results.
You can apply these pragmas:
#pragma HLS ARRAY_PARTITION variable=intBuf complete dim=1
#pragma HLS ARRAY_PARTITION variable=weights_in complete dim=2
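For reference, here is a minimal sketch of where these two pragmas would sit in a kernel of this kind. The original source is not reproduced in this thread, so the function name, the array sizes (other than the inner trip count of 20) and the roles given to intBuf and weights_in are all assumptions. The data type is written as float so the sketch compiles with a plain C++ compiler; under Vivado HLS the typedef could be switched to ap_fixed<12,2>.

```cpp
// Hypothetical sizes: only the inner trip count (20) is quoted in the thread.
const int N_OUT = 8;   // assumed number of output elements
const int N_IN  = 20;  // inner-loop trip count

// float here; ap_fixed<12,2> (from <ap_fixed.h>) would be the fixed-point variant.
typedef float data_t;

// Hypothetical top function. Treating intBuf as the input vector is a guess.
void vec_mat_mul(const data_t intBuf[N_IN],
                 const data_t weights_in[N_OUT][N_IN],
                 data_t out[N_OUT]) {
// Split both arrays into registers along the dimension indexed by the inner loop,
// so all 20 operands of one outer iteration can be read in the same cycle.
#pragma HLS ARRAY_PARTITION variable=intBuf complete dim=1
#pragma HLS ARRAY_PARTITION variable=weights_in complete dim=2
    for (int i = 0; i < N_OUT; ++i) {
// Pipelining the outer loop asks HLS to unroll the inner loop.
#pragma HLS PIPELINE
        data_t acc = 0;
        for (int j = 0; j < N_IN; ++j) {
            acc += intBuf[j] * weights_in[i][j];
        }
        out[i] = acc;
    }
}
```

A standard C++ compiler ignores the HLS pragmas, so the same function can be checked in a software testbench before synthesis.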
Below is what I tried with 2019.2.
solution5 is the floating-point design and solution6 is the fixed-point design.
You can clearly see that fixed point is better in both performance and area.
12-09-2019 04:04 AM
@nithink Thank you for your experiment and comment.
I'm curious whether you got the same result in 2019.2 for my original code and constraints, where the float version consumes only 10 DSPs but the fixed-point version uses 20. I don't know how to explain that.
In your experiment with array partitioning, the fixed-point result makes sense to me: the pipeline unrolls the inner loop, whose trip count is 20, so HLS uses 20 DSPs. But how do we explain the 35 DSPs consumed by the float version?
12-09-2019 06:50 AM - edited 12-09-2019 06:53 AM
Following your lead, I downloaded 2019.2 and installed it on Ubuntu 16.04.
To compare with your results, I ran the following cases:
solution 1 : float version with only pipeline w/o the two array partition pragmas
solution 2 : fixed-point version with only pipeline w/o the two array partition pragmas
solution 3 : float version with both pipeline and the two array partition pragmas
solution 4 : fixed-point version with both pipeline and the two array partition pragmas
However, my results differ from yours, although the trend is the same.
Do you know why we get different results with the same code, constraints and VHLS version?
Also, how can we explain the DSP usage? Before array partitioning, the float version uses fewer DSPs than the fixed-point version. After adding the array partition pragmas, the fixed-point version uses the same number of DSPs, but the float version increases from 10 to 25 DSPs. Why does array partitioning affect DSP inference? And why does solution 1, with the pipeline pragma only, use just 10 DSPs when I would expect 20?
12-10-2019 03:26 AM
The device that I used is different. With the device that you mentioned, I see the same results as you.
In the floating-point version without the array partition pragmas, HLS infers only 2 floating-point multipliers. Each floating-point multiplier takes 4 cycles to complete and the floating-point adder takes 5 cycles.
HLS pipelines the 2 multipliers across the iterations. The 2 multipliers take 10 DSPs.
Let’s look at the schedule view of how HLS implements the floating-point version when the arrays are partitioned.
Each multiplication takes around 4 cycles to complete and the addition takes 5 cycles. Here HLS has inferred 5 multipliers, which are pipelined and reused.
Each floating-point multiplier takes around 5 DSPs, which adds up to 25 DSPs.
In the fixed-point version, each multiplier completes its operation in a single cycle.
HLS inferred 20 multipliers in this case and performs the computation every cycle.
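The DSP counts discussed above follow from multiplying the number of inferred multipliers by the DSPs each one costs. The small sketch below only restates that arithmetic. The struct and function names are hypothetical, and the figure of 1 DSP per ap_fixed<12,2> multiplier is inferred from the reported 20 DSPs for 20 multipliers rather than stated in the thread.

```cpp
// DSP usage = (multipliers inferred) x (DSPs per multiplier).
struct Design {
    const char* name;
    int multipliers;          // multiplier instances reported in the thread
    int dsps_per_multiplier;  // 5 for float (from the thread); 1 for fixed point is inferred
};

int total_dsps(const Design& d) {
    return d.multipliers * d.dsps_per_multiplier;
}

// The three cases discussed in the thread.
const Design kFloatNoPartition = {"float, pipeline only", 2, 5};
const Design kFloatPartitioned = {"float, pipeline + array partition", 5, 5};
const Design kFixedPoint       = {"ap_fixed<12,2>", 20, 1};
```

With these inputs the totals come out to 10, 25 and 20 DSPs, matching the reports for the float design without partitioning, the float design with partitioning, and the fixed-point design.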
Hope this clears it up. Basically, what you need to look at is the trade-off between throughput and area.
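One way to read that trade-off: when the 20 products of one outer-loop iteration share fewer than 20 multipliers, a new iteration cannot start every cycle. The sketch below computes the usual resource-sharing lower bound on the initiation interval, assuming each multiplier is fully pipelined and accepts one new operation per cycle. The function name is hypothetical, and the interval HLS actually reports is not given in this thread; it may be higher because of other limits such as memory ports or the accumulation chain.

```cpp
// Lower bound on the initiation interval when `ops` operations per iteration
// share `units` fully pipelined functional units.
int min_initiation_interval(int ops, int units) {
    return (ops + units - 1) / units;  // integer ceil(ops / units)
}
```

For 20 multiplications per iteration, the bound is 10 cycles with 2 multipliers, 4 cycles with 5 multipliers, and 1 cycle with 20 multipliers.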
12-10-2019 06:51 AM
Thank you for your patience in explaining this to me. I understand now.
However, there is still one thing that puzzles me.
As far as I know, the HLS PIPELINE pragma should unroll the inner loop, whose trip count is 20 in this case. However, only 2 float multipliers and adders are inferred in solution 1, so its parallelism is only 2. In solution 3, 5 float multipliers and adders are inferred, so the parallelism rises to 5. In solutions 2 and 4, the parallelism is 20 in both cases, which matches the expected behavior of fully unrolling the inner loop.
My question is why the float design with the HLS PIPELINE pragma doesn't reach a parallelism of 20, the trip count of the inner loop. This differs from what the same pragma does in the fixed-point case. Because of this unexpected behavior, the effect of these pragmas on a float design becomes hard to predict. I think this is the outstanding problem raised by these cases.