6,454 Views
Registered: ‎03-26-2010

## FDiv floating point divide produces garbage results in HLS

Hi all,

I ran into an issue with single-precision floating-point division in HLS 2016.x (tried both .2 and .4). I used pragmas to map the division to an FDiv core with 1 cycle of latency and 1 cycle of iteration, and compared the synthesized result with that of the standard Vivado floating-point IP core configured for the same performance.

Looking at the Analysis view in HLS, I do see that the computation is performed in 1 clock cycle, but the timing summary is an order of magnitude off from the requirement. The synthesized result of the HLS logic shows why: it's a long chain of muxes and shifters, 267 levels of logic! The Vivado IP is 3 levels or less. No wonder the HLS result couldn't meet timing.

Is there a way to force HLS to use the right IP core to perform the division, or is this really the performance to be expected? An example can be attached, but it's really just basic division.
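For readers following along, the setup described above might look roughly like the sketch below. This is not the poster's actual code (no example was attached in the post). The function name `fdiv_top` is hypothetical, and the pragma spelling assumes the Vivado HLS 2016.x `RESOURCE` and `PIPELINE` directives, with "1 cycle of iteration" read as an initiation interval of 1.

```cpp
// Hypothetical top-level function; the name fdiv_top is illustrative only.
float fdiv_top(float a, float b) {
    // Assumed reading of "1 cycle of iteration": accept new inputs every cycle.
#pragma HLS PIPELINE II=1
    float q;
    // Ask HLS to bind the division to the FDiv core with a latency of 1 cycle.
    // (Assumed directive syntax for Vivado HLS 2016.x.)
#pragma HLS RESOURCE variable=q core=FDiv latency=1
    q = a / b;
    return q;
}
```

A plain C++ compiler ignores the HLS pragmas, so the same function can be used for C simulation. Requesting `latency=1` only tells the tool how many register stages it may use; it does not make the combinational divider any faster, which is consistent with the very deep logic chain reported above.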

1 Solution

Accepted Solutions
Teacher
11,014 Views
Registered: ‎03-31-2012

@dima2882 you are basing your conclusions on an inadequately constrained design. Add a set of input and output registers so that no I/O is connected directly to the FP IP, then implement your design again.

Give Kudos to a post which you think is helpful and reply oriented.
13 Replies
Teacher
6,442 Views
Registered: ‎03-31-2012

@dima2882

>> FDiv core with 1 cycle of latency and 1 cycle of iteration

Did you really simulate the Vivado FP IP core and verify its performance for these numbers? I suspect you did not. What you probably got is an iterative divider which retires 1-2 bits per cycle for some number of cycles. Dividers are complicated, and I'd love to see the magical Xilinx IP which can do a single-precision floating-point divide in one cycle with 3 levels of logic. What you got from HLS is closer to the truth, although the number of levels seems a little excessive.

Explorer
6,411 Views
Registered: ‎03-26-2010

@muzaffer wrote:
Did you really simulate the Vivado FP IP core and verify its performance for these numbers? I suspect you did not. What you probably got is an iterative divider which retires 1-2 bits per cycle for some number of cycles. Dividers are complicated, and I'd love to see the magical Xilinx IP which can do a single-precision floating-point divide in one cycle with 3 levels of logic. What you got from HLS is closer to the truth, although the number of levels seems a little excessive.

Fair enough - I wrote a test bench and ran a simulation. I divided 7034.5429 by 589.2358, and it took one clock cycle to execute. Happy to provide the simulation. To create IEEE 754 32-bit test vectors from decimal floats, I used the excellent float-to-binary converter available here: http://www.binaryconvert.com/result_float.html
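The same kind of test vector can also be produced locally instead of through a web converter. The following is a minimal sketch, assuming a platform where `float` is an IEEE 754 binary32 value; the helper names `float_to_bits` and `bits_to_float` are made up for this illustration.

```cpp
#include <cstdint>
#include <cstring>

// Return the raw 32-bit pattern of a float (assumes IEEE 754 binary32).
std::uint32_t float_to_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined way to reinterpret bytes
    return bits;
}

// Inverse helper: rebuild the float from a 32-bit pattern read back from simulation.
float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, `float_to_bits(1.0f)` yields `0x3F800000`. Printing `float_to_bits(7034.5429f)` and `float_to_bits(589.2358f)` in hexadecimal would give the operand vectors for the division above, and `bits_to_float` can decode the divider's output for comparison against `a / b` computed in software.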

Played with other numbers, also got single cycle execution times.

My contention that HLS is producing garbage results still stands.

Teacher
6,398 Views
Registered: ‎03-31-2012

@dima2882 what about implementation? what do your area & timing numbers look like?

Explorer
6,387 Views
Registered: ‎03-26-2010

Just did an implementation. Keep in mind that the FP divider was the only thing in the design and on the chip...

The clock was set to 200 MHz and utilization was tiny: 748 LUTs, 69 FFs. That's not much different from the utilization of the HLS version, but the topology is of course very different.

Things seem to have worked quite well with the FP core, not so with the HLS...

Teacher
6,365 Views
Registered: ‎03-31-2012

@dima2882 can you show a timing report which has the fp_div in it? I ran an implementation with my brand-new 2016.4 install, and my implementation area and timing reports don't show the FP block at all, but the synthesis timing report fails with -52 ns and 218 levels of logic. I am curious whether this is specific to 2016.4. I'll try with 2015.4 too.

Explorer
6,353 Views
Registered: ‎03-26-2010

I'm using 2016.4 as well. The floating point core is definitely in there. Timing report is attached...

Teacher
6,326 Views
Registered: ‎03-31-2012

@dima2882 then you will have to do a timing back-annotated gate-level simulation. I still maintain that it's not possible to have a single-cycle fp32 divider at 5 ns with 3 levels of logic in an FPGA.

Explorer
6,318 Views
Registered: ‎03-26-2010

In the timing report, what I'm seeing is that the TDATA ports are properly constrained by the 5ns constraint, but the TVALIDs are not covered by this and are unconstrained... That is where the long carry chains are located.

My project is attached - the sim is in there, shows all the single cycle goodness...

Teacher
11,015 Views
Registered: ‎03-31-2012

@dima2882 you are basing your conclusions on an inadequately constrained design. Add a set of input and output registers so that no I/O is connected directly to the FP IP, then implement your design again.

Scholar
4,930 Views
Registered: ‎01-28-2014

@muzaffer is certainly right; there is simply no way you have a floating point core that small with a latency of one cycle at 200 MHz. Look at the datasheet for the core generator to get a feel for the various trade-offs you can make.

Teacher
4,923 Views
Registered: 03-31-2012

@jprice actually, around 800 LUTs is the proper size for a 32-bit fdiv. The problem is timing. With proper constraints, a single-cycle divider should come in at a path delay of around 50 ns on most recent Xilinx chips.
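To put the numbers in this thread side by side: a 200 MHz clock leaves 5 ns per cycle, so a combinational path of roughly 50 ns cannot fit in one cycle. The back-of-the-envelope sketch below makes that arithmetic explicit. The function names are made up for this illustration, and the stage count is an idealized lower bound that ignores register setup time, clock skew, and uneven stage balancing.

```cpp
#include <cassert>
#include <cmath>

// Clock period in nanoseconds for a clock frequency given in MHz.
double clock_period_ns(double clock_mhz) {
    assert(clock_mhz > 0.0);
    return 1000.0 / clock_mhz;
}

// Idealized minimum number of pipeline stages needed to split a
// combinational delay so that each stage fits within one clock period.
int min_pipeline_stages(double comb_delay_ns, double clock_mhz) {
    return static_cast<int>(std::ceil(comb_delay_ns / clock_period_ns(clock_mhz)));
}

// Highest clock frequency (MHz) at which the whole path fits in one cycle.
double single_cycle_fmax_mhz(double comb_delay_ns) {
    assert(comb_delay_ns > 0.0);
    return 1000.0 / comb_delay_ns;
}
```

With the figures quoted here, `min_pipeline_stages(50.0, 200.0)` gives 10 and `single_cycle_fmax_mhz(50.0)` gives 20 MHz. Real designs need extra margin on top of the ideal stage count.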
Explorer
4,917 Views
Registered: ‎03-26-2010

Finally we're getting somewhere...

Putting FFs around the Vivado IP core I/O does indeed make it fail timing, as @muzaffer predicted. Brilliant how the core will let you build a physically unrealizable system, although I suppose that's my fault for specifying something it couldn't do... I changed it to 20 cycles of latency and 1 iteration cycle, and that's what made it pass timing. Going to try the same thing with the HLS variant, will see what happens...

Scholar
4,894 Views
Registered: ‎01-28-2014

I objected more to the complete lack of flip flops and 1 cycle of latency :). 800 LUTs seems reasonable depending on how they're configured.