aixpower

Visitor

03-05-2014 09:31 AM

Registered:
12-13-2011

Is ap_fixed slower and larger than arithmetic with scaled primitive data types?

I have a very simple test entity which divides one number (a) by another (b). It uses two alternative functions and returns both results (c and d). This is the definition in *test_fp.h*:

    #ifndef INCL_TEST_FP_H
    #define INCL_TEST_FP_H

    #include <ap_fixed.h>

    #define NUMBER_OF_BITS 32
    #define NUMBER_OF_DECIMAL_DIGITS 8

    typedef ap_fixed<NUMBER_OF_BITS,(NUMBER_OF_BITS-NUMBER_OF_DECIMAL_DIGITS)> val1_t;
    typedef int val2_t;

    void test_fp (double *a, double *b, val1_t *c, val2_t *d);

    #endif

Of the two alternative functions, one performs the division using the ap_fixed data type and the other uses a scaled int. The following is the content of *test_fp.cpp*. The pragmas have been added in order to get a report for each function:

    #include "test_fp.h"

    val1_t div_ap_fixed(val1_t a, val1_t b) {
    #pragma HLS INLINE off
        return a / b;
    }

    val2_t div_cz_fixed(val2_t a, val2_t b) {
    #pragma HLS INLINE off
        return (a << NUMBER_OF_DECIMAL_DIGITS) / b;
    }

    /**
     * simple test which performs a "a / b" and assign it to c and d.
     * c is the result of an ap_fixed<32,8> operation, d of my version
     */
    void test_fp(double *a, double *b, val1_t *c, val2_t *d) {
        *c = div_ap_fixed(*a, *b);
        *d = div_cz_fixed((val2_t) (*a * (1 << NUMBER_OF_DECIMAL_DIGITS)),
                          (val2_t) (*b * (1 << NUMBER_OF_DECIMAL_DIGITS)));
    }

In a simple testbench *tb_test_fp.cpp* we divide 1 by 5 and print the two results:

    #include <cstdio>   // for printf
    #include "test_fp.h"

    int main () {
        double a = 1;
        double b = 5;
        val1_t c;
        val2_t d;

        test_fp (&a, &b, &c, &d);
        printf ("%10.8lf / %10.8lf = %10.8lf and %10.8lf\n",
                a, b, c.to_double(), (d * 1.0) / (1 << NUMBER_OF_DECIMAL_DIGITS));
    }

As expected, both variants report the same result:

1.00000000 / 5.00000000 = 0.19921875 and 0.19921875
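The printed value is 0.19921875 rather than 0.2 because only 8 fractional bits are kept and the quotient is truncated to a multiple of 1/256. The following is a minimal sketch of the scaled-int path in plain C++, without the HLS headers. The helper names `to_q8`, `div_q8` and `from_q8` are invented for this illustration and do not appear in the original code; unlike the original `div_cz_fixed`, the sketch scales the dividend in 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only.
// FRAC_BITS mirrors NUMBER_OF_DECIMAL_DIGITS from test_fp.h.
constexpr int FRAC_BITS = 8;

// Convert a real value to its scaled-integer form (truncating toward zero).
inline int32_t to_q8(double x) {
    return static_cast<int32_t>(x * (1 << FRAC_BITS));
}

// Divide two scaled integers. The dividend is scaled by 2^FRAC_BITS first so
// that the quotient keeps FRAC_BITS fractional bits. The scaling is done in
// 64 bits (an assumption of this sketch, not the original code).
inline int32_t div_q8(int32_t a, int32_t b) {
    assert(b != 0);
    const int64_t scaled = static_cast<int64_t>(a) * (int64_t{1} << FRAC_BITS);
    return static_cast<int32_t>(scaled / b);
}

// Convert a scaled integer back to a real value.
inline double from_q8(int32_t raw) {
    return static_cast<double>(raw) / (1 << FRAC_BITS);
}
```

With these helpers, 1.0 becomes 256 and 5.0 becomes 1280; the scaled dividend is 65536, and 65536 / 1280 truncates to 51, which is 51/256 = 0.19921875.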

It is interesting that the variant with the scaled int seems to be smaller and faster (the device used is xc7z020clg484-1):

    == Vivado HLS Report for 'div_ap_fixed'

    +-----+-----+-----+-----+---------+
    |  Latency  |  Interval | Pipeline|
    | min | max | min | max |   Type  |
    +-----+-----+-----+-----+---------+
    |   42|   42|   42|   42|   none  |
    +-----+-----+-----+-----+---------+

    +-----------------+---------+-------+--------+-------+
    |       Name      | BRAM_18K| DSP48E|   FF   |  LUT  |
    +-----------------+---------+-------+--------+-------+
    |Expression       |        -|      -|       -|      -|
    |FIFO             |        -|      -|       -|      -|
    |Instance         |        -|      -|    2332|   2374|
    |Memory           |        -|      -|       -|      -|
    |Multiplexer      |        -|      -|       -|      -|
    |Register         |        -|      -|       6|      -|
    |ShiftMemory      |        -|      -|       -|      -|
    +-----------------+---------+-------+--------+-------+
    |Total            |        0|      0|    2338|   2374|
    +-----------------+---------+-------+--------+-------+
    |Available        |      280|    220|  106400|  53200|
    +-----------------+---------+-------+--------+-------+
    |Utilization (%)  |        0|      0|       2|      4|
    +-----------------+---------+-------+--------+-------+

    == Vivado HLS Report for 'div_cz_fixed'

    +-----+-----+-----+-----+---------+
    |  Latency  |  Interval | Pipeline|
    | min | max | min | max |   Type  |
    +-----+-----+-----+-----+---------+
    |   34|   34|   34|   34|   none  |
    +-----+-----+-----+-----+---------+

    +-----------------+---------+-------+--------+-------+
    |       Name      | BRAM_18K| DSP48E|   FF   |  LUT  |
    +-----------------+---------+-------+--------+-------+
    |Expression       |        -|      -|       -|      -|
    |FIFO             |        -|      -|       -|      -|
    |Instance         |        -|      -|    1712|   1736|
    |Memory           |        -|      -|       -|      -|
    |Multiplexer      |        -|      -|       -|      -|
    |Register         |        -|      -|       6|      -|
    |ShiftMemory      |        -|      -|       -|      -|
    +-----------------+---------+-------+--------+-------+
    |Total            |        0|      0|    1718|   1736|
    +-----------------+---------+-------+--------+-------+
    |Available        |      280|    220|  106400|  53200|
    +-----------------+---------+-------+--------+-------+
    |Utilization (%)  |        0|      0|       1|      3|
    +-----------------+---------+-------+--------+-------+

Is this the usual behavior, or am I missing something?

Thanks,

Christian

6 Replies


remi.girard

Observer


03-06-2014 01:06 AM

Registered:
03-05-2014

That looks a little bit strange... o_0

Did you try with a more complex calculation?

I am optimising my current design with the ap_fixed type and I don't have any problems like that.

muzaffer

Teacher

03-06-2014 02:39 PM

Registered:
03-31-2012

I am very interested in this question too. I am trying to replicate the behavior of an existing RTL design and am seeing a significant area and latency increase with my ap_fixed code. I would like to know if there are ways to get better results.

>> I am optimising my current design with the ap_fixed type and I don't have any problems like that.

I am not sure whether you can quantify this. In the original poster's case we have two pieces of code which are very similar in behavior (if not identical) but have very different QoR. Do you have a way to compare your ap_fixed results to something else?

- Please mark the Answer as "Accept as solution" if information provided is helpful.

Give Kudos to a post which you think is helpful and reply oriented.


muzaffer

Teacher

03-06-2014 02:50 PM

Registered:
03-31-2012

>> typedef ap_fixed<NUMBER_OF_BITS,(NUMBER_OF_BITS-NUMBER_OF_DECIMAL_DIGITS)> val1_t;

Actually, there is a problem with this definition. ap_fixed takes W and I, where I is the number of integer bits. Your definition seems to assume W and F (which is what MATLAB does in quantizer). So your ap_fixed is describing a 24.8 number instead of the 8.24 which you intended (these are in I.F format).
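To make the W/I convention concrete, here is a small sketch in plain C++ that derives the number of fractional bits, the resolution and the range of a signed fixed-point type from W and I. It assumes that the sign bit is counted in I, as for signed ap_fixed; `FixedFormat` and `describe_signed_fixed` are hypothetical names for this illustration and are not part of the ap_fixed API.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type for illustration; not part of ap_fixed.h.
struct FixedFormat {
    int frac_bits;      // W - I
    double resolution;  // 2^-(W - I)
    double min_value;   // -2^(I - 1)
    double max_value;   // 2^(I - 1) - resolution
};

// Describe a signed fixed-point format with w total bits and i integer bits
// (sign bit included in i). Restricted to 0 < i <= w to keep the sketch simple.
inline FixedFormat describe_signed_fixed(int w, int i) {
    assert(w > 0 && i > 0 && i <= w);
    FixedFormat f;
    f.frac_bits = w - i;
    f.resolution = std::ldexp(1.0, -(w - i));
    f.min_value = -std::ldexp(1.0, i - 1);
    f.max_value = std::ldexp(1.0, i - 1) - f.resolution;
    return f;
}
```

Under these assumptions, `describe_signed_fixed(32, 24)` gives 8 fractional bits, a resolution of 1/256 and a range from -8388608 to 8388607.99609375, while `describe_signed_fixed(32, 8)` gives 24 fractional bits and a range starting at -128.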



herver

Xilinx Employee


03-07-2014 07:39 AM

Registered:
08-17-2011

hello everybody!

You need to be very careful about what you are using / describing.

I've only read the post; I didn't try to verify that what I'm saying is 100% correct.

__So please do those checks for yourself__ - ideally, if you could report back, that would be awesome too.

First, your top-level function is void test_fp (double *a, double *b, val1_t *c, val2_t *d); so it accepts *double* values, which are 64-bit floating-point quantities.

The dividers are integer or ap_fixed dividers, so I expect to see one (or two) double-to-integer converters in the RTL, which take some resources and latency.

Second, as val2_t is an integer, the C rules say that in

    val2_t div_cz_fixed(val2_t a, val2_t b) {
    #pragma HLS INLINE off
        return (a << NUMBER_OF_DECIMAL_DIGITS) / b;
    }

(a << NUMBER_OF_DECIMAL_DIGITS) is 32 bits wide and NOT 32+NUMBER_OF_DECIMAL_DIGITS bits. NUMBER_OF_DECIMAL_DIGITS is 8, so my guess is that the tool is inferring a divider that takes the 24 LSBs of a, padded with zeros, and divides them by b.

Basically, div_cz_fixed is a 24b_divided_by_32b divider.

On the other side, the other function

    val1_t div_ap_fixed(val1_t a, val1_t b) {
    #pragma HLS INLINE off
        return a / b;
    }

is a "true" 32b_divided_by_32b divider, with the fixed-point scaling happening anyway.

Summary:

Since 'div_ap_fixed' has 8 more bits to divide than 'div_cz_fixed', 'div_ap_fixed' will take up more resources and more latency.

I believe that VHLS uses an iterative method, so 8 extra bits <=> 8 extra loops == 8 extra cycles ... and I can convince myself that the difference in latency between the two is 8 cycles ... and indeed 34 + 8 == 42.
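The "one cycle per dividend bit" argument can be illustrated with a textbook restoring divider, sketched below in plain C++. This is only an assumption about the general shape of an iterative divider, not the actual core that Vivado HLS instantiates; `DivResult` and `restoring_divide` are names invented for this sketch. The point is that the loop count equals the dividend width, so a 32-bit dividend needs 8 more iterations than a 24-bit one.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for illustration.
struct DivResult {
    uint64_t quotient;
    uint64_t remainder;
    int iterations;  // one per dividend bit; stands in for clock cycles
};

// Textbook restoring division on unsigned operands: each iteration shifts in
// one dividend bit (MSB first) and produces one quotient bit. Only the low
// dividend_bits bits of the dividend are used.
inline DivResult restoring_divide(uint64_t dividend, uint32_t divisor,
                                  int dividend_bits) {
    assert(divisor != 0);
    assert(dividend_bits > 0 && dividend_bits <= 64);
    DivResult r{0, 0, 0};
    for (int bit = dividend_bits - 1; bit >= 0; --bit) {
        r.remainder = (r.remainder << 1) | ((dividend >> bit) & 1u);
        r.quotient <<= 1;
        if (r.remainder >= divisor) {  // subtraction succeeds: keep it
            r.remainder -= divisor;
            r.quotient |= 1u;
        }
        ++r.iterations;
    }
    return r;
}
```

Calling `restoring_divide(65536, 1280, 24)` and `restoring_divide(65536, 1280, 32)` both yield quotient 51 and remainder 256, in 24 and 32 iterations respectively. The reported latencies (34 and 42) are larger than these counts, presumably because of additional stages around the loop, but their difference is the same 8.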

Can you confirm, Christian aka aixpower?

The easiest way is probably to check the report (the names are explicit) and/or the generated RTL.

have fun...

- Hervé

SIGNATURE:

* New Dedicated Vivado HLS forums* http://forums.xilinx.com/t5/High-Level-Synthesis-HLS/bd-p/hls

* Readme/Guidance* http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

* Please mark the Answer as "Accept as solution" if information provided is helpful.

* Give Kudos to a post which you think is helpful and reply oriented.



aixpower

Visitor


03-11-2014 11:51 PM

Registered:
12-13-2011

Thank you all for your fast feedback. Just to make sure we all have the same understanding, the intended format is

- 32 bit word

- 32-8=24 bit integer part

- 32-24=8 bit fractional part

So ap_fixed<32,24> should exactly represent this.

@Hervé, the double part will add some overhead to the top-level function, but this should not influence the two functions I'm looking at (which is why I used #pragma HLS INLINE off). The ports of the two resulting entities also look as expected:

    entity div_ap_fixed is
    port (
        ...
        a_V : IN STD_LOGIC_VECTOR (31 downto 0);
        b_V : IN STD_LOGIC_VECTOR (31 downto 0);
        ap_return : OUT STD_LOGIC_VECTOR (31 downto 0) );
    end;

    entity div_cz_fixed is
    port (
        ...
        a : IN STD_LOGIC_VECTOR (31 downto 0);
        b : IN STD_LOGIC_VECTOR (31 downto 0);
        ap_return : OUT STD_LOGIC_VECTOR (31 downto 0) );
    end;

But you are right, the main issue is the expression

(a << NUMBER_OF_DECIMAL_DIGITS) / b;

which works fine from a logical point of view, but fails if the dividend is greater than or equal to 2^16 (with a signed int, already from 2^15, where the shift reaches the sign bit). E.g. dividing 8388607.0 (2^23-1, represented by 0x7FFFFF00) by itself would not result in 1.0 (0x100), as I lose the MSBs.
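The loss of the MSBs can be reproduced in plain C++. The sketch below models the 32-bit shift explicitly and compares it with a variant that widens the dividend to 64 bits before scaling; `div_narrow`, `div_wide` and `kFracBits` are hypothetical names, and the widened variant is one possible fix, not something proposed in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors NUMBER_OF_DECIMAL_DIGITS; the name is invented for this sketch.
constexpr int kFracBits = 8;

// Models (a << 8) / b on a 32-bit int: bits shifted past bit 31 are dropped.
// The shift is done on uint32_t because an overflowing signed shift is
// undefined before C++20; converting back to int32_t wraps on mainstream
// compilers (and is guaranteed to from C++20 on).
inline int32_t div_narrow(int32_t a, int32_t b) {
    assert(b != 0);
    const uint32_t shifted = static_cast<uint32_t>(a) << kFracBits;
    return static_cast<int32_t>(shifted) / b;
}

// Widens the dividend to 64 bits before scaling, so all 40 bits survive.
// The final cast assumes the quotient fits in 32 bits.
inline int32_t div_wide(int32_t a, int32_t b) {
    assert(b != 0);
    const int64_t scaled = static_cast<int64_t>(a) * (int64_t{1} << kFracBits);
    return static_cast<int32_t>(scaled / b);
}
```

For the raw value 0x7FFFFF00 divided by itself, `div_narrow` returns 0 (the shifted dividend wraps to -65536), whereas `div_wide` returns 0x100, i.e. 1.0. For small operands such as 256 and 1280 both variants agree on 51.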

So you are absolutely right, I generated a 24b_divided_by_32b divider. As VHLS generated HDL code is doing it in an iterative manner, the 40b_divided_by_32b needs 8 more cycles.

BTW, do you know of any way to control the generated code for the divider? As an extreme example, you can also perform the division in one clock cycle, at a reduced clock speed and with more area. So it would be nice to be able to fine-tune the generated HDL in this way.

Thanks a lot for improving my understanding,

Christian


aixpower

Visitor


03-11-2014 11:55 PM

Registered:
12-13-2011

Sorry for the typo:

> So you are absolutely right, I generated a 24b_divided_by_32b divider. As VHLS generated HDL

> code is doing it in an iterative manner, the 40b_divided_by_32b needs 8 more cycles.

should be

... iterative manner, the 32b_divided_by_32b needs 8 more cycles.