Observer
733 Views
Registered: ‎12-03-2019

Inferring Floating-point Fused Multiply and Add in HLS

I spent most of the past few days scouring Xilinx's documentation and forums and trying several different coding patterns, but I could not get the HLS compiler to infer something as simple as an FP32 fused multiply-add (FMA). Is there a specific coding style that needs to be followed to get this? For example, how can the following basic example be modified to infer an FMA that uses 2 DSPs on Zynq UltraScale+ (2 DSPs based on the numbers at https://www.xilinx.com/support/documentation/ip_documentation/ru/floating-point.html, which I also confirmed by instantiating the floating-point IP from the IP catalogue), rather than the 5 DSPs (3 for the multiply and 2 for the add) that I get because the add and multiply are not being fused?

float test(float a, float b, float c)
{
	float d = 0;
	d = a * b + c;
	return d;
}

I also tried the built-in fma function as follows, but the DSP usage still stayed at 5.

float test(float a, float b, float c)
{
	float d = 0;
	d = hls::fma(a, b, c);
	return d;
}

I also placed and routed the above code as an IP, hoping that the mapper might be smart enough to fuse the operations, but it wasn't. I also checked the RESOURCE directive in the documentation, but it has no option to override resource usage for FMA, only for add, mul and other operations. I also tried the workaround mentioned here; it seemed to work for integers, but it didn't work for float.

I am using Vivado HLS v2019.2 and my target is the ZCU102 board.
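
For readers comparing the two versions above: apart from resource usage, a fused multiply-add rounds once, while a separate multiply followed by an add rounds twice, so the two can return different results. The following is a minimal host-side sketch of that difference using the standard C++ `std::fma`. It is not HLS code and says nothing about what the HLS tool infers; the function names `mul_then_add`, `fused` and `check_rounding_difference` are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Multiply, round the product to float, then add: two roundings,
// as with separate multiply and add operators.
float mul_then_add(float a, float b, float c)
{
	volatile float p = a * b;	// volatile keeps the compiler from contracting this into an FMA
	return p + c;
}

// Fused multiply-add: a * b + c is computed exactly and rounded once.
float fused(float a, float b, float c)
{
	return std::fma(a, b, c);
}

// Hypothetical self-check with inputs chosen so that the two versions differ.
void check_rounding_difference()
{
	const float a = 1.0f + std::ldexp(1.0f, -12);		// 1 + 2^-12
	const float c = -(1.0f + std::ldexp(1.0f, -11));	// -(1 + 2^-11)

	// Exact product: 1 + 2^-11 + 2^-24, which rounds to 1 + 2^-11 in float.
	assert(mul_then_add(a, a, c) == 0.0f);
	// The fused version keeps the 2^-24 term until the single final rounding.
	assert(fused(a, a, c) == std::ldexp(1.0f, -24));
}
```

So swapping one form for the other is not always bit-exact, which is worth keeping in mind when comparing the two implementations in simulation.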

3 Replies
Teacher
636 Views
Registered: ‎03-31-2012

Have you also compared LUTs and DFFs? The xczu9 single-precision FMA reports the following LUT/DFF/DSP usage:

LUT: 704, DFF: 1076, DSP: 2

Maybe it's possible to trade off LUTs/DFFs against DSPs?

Another option is to limit the use of DSP resources for this module and see if it helps.

Observer
556 Views
Registered: ‎12-03-2019

@muzaffer Actually, I did that recently by generating the IPs in Vivado 2019.2. For a fixed total number of DSPs (e.g. 2), the FMA core uses more logic and FFs than the combination of FMUL and FADD (e.g. the 2-DSP variation of FMUL plus the 0-DSP variation of FADD), which is very counterintuitive. Hence, even if it were possible to infer FMA using HLS, there would be no point in doing so from an area-utilization point of view. Now I am wondering what the point of having the FMA core is in the first place, since the peak operating frequency of the cores is also hardly different. The 4-DSP variation of the FMA core uses even more logic and FFs than the 2-DSP version, which is even more counterintuitive.

xilinx.jpg

P.S. For future reference, I am fairly sure that Xilinx HLS is, at this point in time, incapable of inferring FMA blocks; not that there would be any point in doing so, considering the above results.

Adventurer
317 Views
Registered: ‎05-29-2018

I'm working with Vitis, targeting an Alveo board, and I'm facing the same "problem" and wondering about the same questions. In my C+HLS kernel I have a sequence of floating-point multiply-accumulate operations, which should map perfectly to a sequence of FMA cores, but they get mapped to a sequence of Mul and Add cores instead.

Is there any comment from Xilinx on this point?
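
For context, the kind of multiply-accumulate chain described in this reply might look roughly like the sketch below. The actual kernel was not posted, so this is only an assumed reconstruction of the pattern: the names `mac_chain`, `mac_chain_example` and `N` are hypothetical, and no HLS pragmas are included.

```cpp
#include <cassert>

const int N = 4;	// hypothetical fixed trip count

// Each iteration is one multiply followed by one add into the accumulator,
// i.e. the a * b + c shape from the first post, chained N times.
float mac_chain(const float x[N], const float w[N])
{
	float acc = 0;
	for (int i = 0; i < N; i++) {
		acc = acc + x[i] * w[i];
	}
	return acc;
}

// Hypothetical usage with values that are exact in float.
void mac_chain_example()
{
	const float x[N] = {1.0f, 2.0f, 3.0f, 4.0f};
	const float w[N] = {0.5f, 0.5f, 0.5f, 0.5f};
	assert(mac_chain(x, w) == 5.0f);	// 0.5 + 1.0 + 1.5 + 2.0
}
```

Every loop iteration here is a candidate for one FMA, which is why the mapping to separate Mul and Add cores stands out in the resource report.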
