Explorer

FDiv floating point divide produces garbage results in HLS


Hi all,

 

I ran into an issue with single precision floating point divide in HLS 2016.x (tried both .2 and .4). I used pragmas to map the division to an FDiv core with 1 cycle of latency and 1 cycle of iteration, and compared the synthesized result with that of the normal Vivado floating point IP core set for the same performance.

 

Looking at the Analysis view in HLS I indeed see that the computation in HLS is performed in 1 clock cycle, but the timing summary is an order of magnitude off from the requirement. The synthesized result of the HLS logic shows why - it's a looooong chain of muxes and shifters - 267 levels of logic!!!! The Vivado IP is 3 levels or less... No wonder the HLS result couldn't meet timing.

 

Is there a way to force HLS to use the right IP core to perform the division, or is this really the performance to be expected? An example can be attached, but it's really just basic division.

Accepted Solutions

Teacher

@dima2882 you are basing your conclusions on an inadequately constrained design. Add a set of input & output registers so that no I/O is connected directly to the FP IP, then implement your design again.

- Please mark the Answer as "Accept as solution" if information provided is helpful.
Give Kudos to a post which you think is helpful and reply oriented.

13 Replies

Teacher

@dima2882

 

>> FDiv core with 1 cycle of latency and 1 cycle of iteration

 

Did you really simulate the Vivado FP IP core and verify its performance for these numbers? I suspect you did not. What you probably got is an iterative divider which retires 1-2 bits per cycle for some number of cycles. Dividers are complicated and I'd love to see the magical Xilinx IP which can do a single precision floating point divide in one cycle with 3 levels of logic. What you got from HLS is closer to the truth, although the number of levels seems a little excessive.

Explorer

@muzaffer wrote:
Did you really simulate the Vivado FP IP core and verify its performance for these numbers? I suspect you did not. What you probably got is an iterative divider which retires 1-2 bits per cycle for some number of cycles. Dividers are complicated and I'd love to see the magical Xilinx IP which can do a single precision floating point divide in one cycle with 3 levels of logic. What you got from HLS is closer to the truth, although the number of levels seems a little excessive.

Fair enough - I wrote a test bench and ran a simulation. Divided 7034.5429 by 589.2358. It took one clock cycle to execute. Happy to provide the simulation. To create IEEE 754 32-bit test vectors from decimal floats, I used the excellent float to binary converter available here: http://www.binaryconvert.com/result_float.html
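For anyone who would rather generate such test vectors locally than through a web converter, here is a minimal C++ sketch. The helper names `float_to_bits` and `bits_to_float` are made up for illustration, and the sketch assumes the host's `float` is an IEEE 754 binary32 value, which is true on mainstream platforms but not guaranteed by the language.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the raw 32-bit pattern of a float.
// Assumes float is IEEE 754 binary32 on the host.
std::uint32_t float_to_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "float is expected to be 32 bits wide");
    std::uint32_t bits;
    // memcpy avoids the undefined behaviour of pointer type punning.
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// Hypothetical helper: the inverse, for decoding a result vector
// captured from a simulation back into a float.
float bits_to_float(std::uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

Printing `float_to_bits(7034.5429f)` with a `%08X` format gives the hex word to drive onto the operand bus, and `bits_to_float` turns the captured result word back into a number to compare against the expected quotient.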

 

Played with other numbers, also got single cycle execution times.

 

Contention that HLS is producing garbage results still stands.

Teacher

@dima2882 what about implementation? What do your area & timing numbers look like?

Explorer

Just did an implementation. Keep in mind that the FP divider was the only thing in the design and on the chip...

 

The clock was set to 200 MHz and utilization was tiny: 748 LUTs, 69 FFs. That is not much different from the HLS utilization, but the topology is of course very different.

 

Things seem to have worked quite well with the FP core, not so with the HLS...

Teacher

@dima2882 can you show a timing report which has the fp_div in it? I did an implementation with my brand spanking new 2016.4, and my implementation area & timing reports don't show the fp block at all, but the synthesis timing report fails with -52 ns and 218 levels of logic. I am curious if this is specific to 2016.4. I'll try with 2015.4 too.

 

Explorer

I'm using 2016.4  as well. The floating point core is definitely in there. Timing report is attached...

Teacher

@dima2882 then you will have to do a timing back-annotated gate-level simulation. I still maintain that it's not possible to have a single cycle fp32 divider at 5 ns and 3 levels of logic in an FPGA.

Explorer

In the timing report, what I'm seeing is that the TDATA ports are properly constrained by the 5ns constraint, but the TVALIDs are not covered by this and are unconstrained... That is where the long carry chains are located.

 

My project is attached - the sim is in there, shows all the single cycle goodness...

Teacher

@dima2882 you are basing your conclusions on an inadequately constrained design. Add a set of input & output registers so that no I/O is connected directly to the FP IP, then implement your design again.

Scholar

@muzaffer is certainly right, there is simply no way you have a floating point core that small with a latency of one cycle at 200 MHz. Look at the datasheet for the core generator to get a feel for the various trade-offs you can make.

Teacher

@jprice actually around 800 LUTs is the proper size for a 32 bit fdiv. The problem is timing. With proper constraints, a single cycle divider should come out at around 50 ns in most recent Xilinx chips.
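
To put that figure next to the 200 MHz (5 ns) clock used in this thread, here is a back-of-the-envelope sketch in C++. The function name `min_pipeline_stages` is hypothetical, and the calculation assumes the combinational delay can be split evenly across stages while ignoring register setup, clock-to-out, and routing overhead, so the result is only a lower bound.

```cpp
#include <cmath>

// Hypothetical helper: lower bound on the number of pipeline stages
// needed to fit a combinational path into a given clock period.
// Assumes an ideal, even split of the delay and zero register overhead.
int min_pipeline_stages(double comb_delay_ns, double clock_period_ns) {
    return static_cast<int>(std::ceil(comb_delay_ns / clock_period_ns));
}
```

With the numbers quoted here, `min_pipeline_stages(50.0, 5.0)` returns 10, so a divider with roughly 50 ns of combinational delay needs at least ten stages at 200 MHz, and realistically more once real register and routing overhead is included.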

Explorer

Finally we're getting somewhere...

 

Putting FFs around the Vivado IP core I/O does indeed make it fail timing, as @muzaffer predicted. Brilliant how the core will let you build a physically unrealizable system, although I suppose that's my fault for specifying something it couldn't do... I changed it to 20 cycles of latency and 1 iteration cycle, and that made it pass timing. Going to try the same thing with the HLS variant, will see what happens...
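
For reference, the HLS variant of that change might look roughly like the sketch below. This is an assumption about how the pragmas would be written, not the code from the attached project: the function name `fdiv_top` is made up, the latency of 20 simply mirrors the setting that passed timing on the Vivado IP, and the `RESOURCE` pragma syntax should be checked against the Vivado HLS user guide for the tool version in use.

```cpp
// Hypothetical top-level function for a pipelined single precision divide.
// A regular C++ compiler ignores the HLS pragmas, so this also builds
// and runs as plain software for a quick functional check.
float fdiv_top(float a, float b) {
// Accept a new pair of operands every clock cycle.
#pragma HLS PIPELINE II=1
    float q = a / b;
// Assumed syntax: bind the divide to the FDiv core and request 20 cycles
// of latency instead of 1, so the long combinational path gets registered.
#pragma HLS RESOURCE variable=q core=FDiv latency=20
    return q;
}
```

The idea is the same as on the IP side: keep the one-result-per-cycle throughput, but give the tool enough pipeline stages to break up the divider's logic.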

Scholar

I objected more to the complete lack of flip flops and 1 cycle of latency :). 800 LUTs seems reasonable depending on how they're configured.
