Visitor
12,365 Views
Registered: ‎12-13-2011

Is ap_fixed slower and larger than arithmetic with scaled primitive data types?

I have a very simple test entity, which divides one number (a) by another (b). It uses two alternative functions and returns both results (c and d). This is the definition in test_fp.h:

#ifndef INCL_TEST_FP_H
#define INCL_TEST_FP_H

#include <ap_fixed.h>

#define NUMBER_OF_BITS				32
#define NUMBER_OF_DECIMAL_DIGITS	8

typedef ap_fixed<NUMBER_OF_BITS,(NUMBER_OF_BITS-NUMBER_OF_DECIMAL_DIGITS)> val1_t;
typedef int	val2_t;

void test_fp (double *a, double *b, val1_t *c, val2_t *d);

#endif

Of the two alternative functions, one performs the division using the ap_fixed data type, the other with a scaled int. The following is the content of test_fp.cpp. The pragmas have been added in order to get a report for each function:

#include "test_fp.h"

val1_t div_ap_fixed(val1_t a, val1_t b) {
#pragma HLS INLINE off
	return a / b;
}

val2_t div_cz_fixed(val2_t a, val2_t b) {
#pragma HLS INLINE off
	return (a << NUMBER_OF_DECIMAL_DIGITS) / b;
}

/**
 * Simple test which performs "a / b" and assigns the result to c and d.
 * c is the result of an ap_fixed<32,24> operation (8 fractional bits), d of my version
 */
void test_fp(double *a, double *b, val1_t *c, val2_t *d) {
	*c = div_ap_fixed(*a, *b);

	*d = div_cz_fixed((val2_t) (*a * (1 << NUMBER_OF_DECIMAL_DIGITS)),
			(val2_t) (*b * (1 << NUMBER_OF_DECIMAL_DIGITS)));

}

 In a simple testbench tb_test_fp.cpp we divide 1 by 5 and print the two results:

#include <cstdio>
#include "test_fp.h"

int main ()
{
	double a = 1;
	double b = 5;
	val1_t c;
	val2_t d;

	test_fp (&a, &b, &c, &d);

	printf ("%10.8lf / %10.8lf = %10.8lf and %10.8lf\n", a, b, c.to_double(), (d * 1.0) / (1 << NUMBER_OF_DECIMAL_DIGITS));
}

 As expected, both variants report the same result:

1.00000000 / 5.00000000 = 0.19921875 and 0.19921875
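
The scaled-int half of this result can be reproduced by hand. The following is only a minimal sketch in plain C++ (it does not use ap_fixed.h), assuming 8 fractional bits as in the post; the names FRAC_BITS, to_fixed, div_scaled and show are invented for the illustration.

```cpp
#include <cstdio>

// Names below (FRAC_BITS, to_fixed, div_scaled, show) are invented for this sketch.
constexpr int FRAC_BITS = 8;  // 8 fractional bits, as in the post

// Convert a double to the scaled-int representation (value * 2^8, truncated).
constexpr int to_fixed(double x) { return static_cast<int>(x * (1 << FRAC_BITS)); }

// Same expression as div_cz_fixed in the post.
constexpr int div_scaled(int a, int b) { return (a << FRAC_BITS) / b; }

// Print the raw quotient and the real value it represents for a / b.
void show(double a, double b) {
    int q = div_scaled(to_fixed(a), to_fixed(b));
    std::printf("raw = %d, value = %10.8f\n", q,
                static_cast<double>(q) / (1 << FRAC_BITS));
}
```

For show(1.0, 5.0) the operands become 256 and 1280, the shifted dividend is 65536, and 65536 / 1280 = 51.2 is truncated to 51. 51 / 256 is exactly 0.19921875, which matches the printed value.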

It is interesting that the variant with the scaled int seems to be smaller and faster (the device used is xc7z020clg484-1):

== Vivado HLS Report for 'div_ap_fixed'
    +-----+-----+-----+-----+---------+
    |  Latency  |  Interval | Pipeline|
    | min | max | min | max |   Type  |
    +-----+-----+-----+-----+---------+
    |   42|   42|   42|   42|   none  |
    +-----+-----+-----+-----+---------+
+-----------------+---------+-------+--------+-------+
|       Name      | BRAM_18K| DSP48E|   FF   |  LUT  |
+-----------------+---------+-------+--------+-------+
|Expression       |        -|      -|       -|      -|
|FIFO             |        -|      -|       -|      -|
|Instance         |        -|      -|    2332|   2374|
|Memory           |        -|      -|       -|      -|
|Multiplexer      |        -|      -|       -|      -|
|Register         |        -|      -|       6|      -|
|ShiftMemory      |        -|      -|       -|      -|
+-----------------+---------+-------+--------+-------+
|Total            |        0|      0|    2338|   2374|
+-----------------+---------+-------+--------+-------+
|Available        |      280|    220|  106400|  53200|
+-----------------+---------+-------+--------+-------+
|Utilization (%)  |        0|      0|       2|      4|
+-----------------+---------+-------+--------+-------+

 

== Vivado HLS Report for 'div_cz_fixed'
    +-----+-----+-----+-----+---------+
    |  Latency  |  Interval | Pipeline|
    | min | max | min | max |   Type  |
    +-----+-----+-----+-----+---------+
    |   34|   34|   34|   34|   none  |
    +-----+-----+-----+-----+---------+
+-----------------+---------+-------+--------+-------+
|       Name      | BRAM_18K| DSP48E|   FF   |  LUT  |
+-----------------+---------+-------+--------+-------+
|Expression       |        -|      -|       -|      -|
|FIFO             |        -|      -|       -|      -|
|Instance         |        -|      -|    1712|   1736|
|Memory           |        -|      -|       -|      -|
|Multiplexer      |        -|      -|       -|      -|
|Register         |        -|      -|       6|      -|
|ShiftMemory      |        -|      -|       -|      -|
+-----------------+---------+-------+--------+-------+
|Total            |        0|      0|    1718|   1736|
+-----------------+---------+-------+--------+-------+
|Available        |      280|    220|  106400|  53200|
+-----------------+---------+-------+--------+-------+
|Utilization (%)  |        0|      0|       1|      3|
+-----------------+---------+-------+--------+-------+

 

Is this the usual behavior, or am I missing something?

 

Thanks,

Christian

6 Replies
Observer
12,354 Views
Registered: ‎03-05-2014

That looks a little bit strange... o_0

Did you try with a more complex calculation?

 

I am optimising my current design with the ap_fixed type and I don't have any problems like that.

Teacher
12,344 Views
Registered: ‎03-31-2012

I am very interested in this question too. I am trying to duplicate an existing RTL design's behavior and seeing significant area/latency increase with my ap_fixed code. I would like to know if there are ways to get better results.

>> I am optimising my current design with the ap_fixed type and I don't have any problems like that.

I am not sure whether you can quantify this. In the other poster's case we have two pieces of code which are very similar in behavior (if not exact) but have very different QOR. Do you have a way to compare your ap_fixed results to something else?
Teacher
12,342 Views
Registered: ‎03-31-2012

>> typedef ap_fixed<NUMBER_OF_BITS,(NUMBER_OF_BITS-NUMBER_OF_DECIMAL_DIGITS)> val1_t;


Actually there is a problem with this definition. ap_fixed takes W and I, where I is the number of integer bits. Your definition seems to assume W and F (which is what MATLAB does in quantizer). So your ap_fixed is describing a 24.8 number instead of the 8.24 which you intended (these are in I.F format).
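
To make the W/I convention concrete, here is a small sketch in plain C++ (it does not use ap_fixed.h). FixedFormat is a hypothetical helper invented for this illustration; it only assumes that, for a signed ap_fixed<W,I>, I counts the bits left of the binary point including the sign bit.

```cpp
// Hypothetical helper, not part of any Xilinx header.
struct FixedFormat {
    int W;  // total width in bits
    int I;  // integer bits (left of the binary point, sign bit included)

    // Fractional bits are whatever is left over.
    constexpr int frac_bits() const { return W - I; }
    // Smallest representable step: 2^-(W-I).
    constexpr double resolution() const {
        return 1.0 / static_cast<double>(1LL << (W - I));
    }
    // Largest representable value for a signed format: 2^(I-1) - resolution.
    constexpr double max_value() const {
        return static_cast<double>(1LL << (I - 1)) - resolution();
    }
};

constexpr FixedFormat posted{32, 32 - 8};  // ap_fixed<32,24>, as the typedef expands
constexpr FixedFormat swapped{32, 8};      // ap_fixed<32,8>, i.e. an 8.24 number

static_assert(posted.frac_bits() == 8, "typedef as posted has 8 fractional bits");
static_assert(swapped.frac_bits() == 24, "ap_fixed<32,8> has 24 fractional bits");
```

With these numbers the typedef as posted has a step of 1/256 and a range just under 2^23, while ap_fixed<32,8> would have a step of 2^-24 and a range just under 128.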
Xilinx Employee
12,330 Views
Registered: ‎08-17-2011

Hello everybody!

You need to be very careful about what you are using / describing.

I've only read the post; I didn't try to verify that what I'm saying is 100% correct, so please do those checks for yourself. Ideally, if you could report back, that would be awesome too.

First, your top-level function is void test_fp (double *a, double *b, val1_t *c, val2_t *d); so it accepts *double* values, which are 64-bit floating-point quantities.

The dividers are integer or ap_fixed dividers, so I'm expecting to see one (or two) double-to-integer converters in the RTL, which take some resources and latency.

Second, as val2_t is an integer, the C rules say that, with respect to

val2_t div_cz_fixed(val2_t a, val2_t b) {
#pragma HLS INLINE off
return (a << NUMBER_OF_DECIMAL_DIGITS) / b;
}

(a << NUMBER_OF_DECIMAL_DIGITS) is 32 bits wide and NOT 32+NUMBER_OF_DECIMAL_DIGITS bits. NUMBER_OF_DECIMAL_DIGITS is 8, so my guess is that the tool is inferring a divider that takes the 24 LSBs of a, padded with 0s, and divides by b.

Basically, div_cz_fixed is a 24b_divided_by_32b divider.
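
The point about the 32-bit shift can be checked in plain C++. This is only a sketch with invented names (SHIFT, shift32, shift64); it uses unsigned types so that the wrap-around is well defined, whereas the posted code shifts a signed int.

```cpp
#include <cstdint>

// Hypothetical names for illustration only.
constexpr int SHIFT = 8;  // NUMBER_OF_DECIMAL_DIGITS in the post

// 32-bit shift: the top 8 bits of `a` fall off the end.
constexpr std::uint32_t shift32(std::uint32_t a) { return a << SHIFT; }

// 64-bit shift: all 32 bits of `a` are kept, giving a 40-bit dividend.
constexpr std::uint64_t shift64(std::uint32_t a) {
    return static_cast<std::uint64_t>(a) << SHIFT;
}

// Only the low 24 bits of `a` influence the 32-bit result.
static_assert(shift32(0xAB123456u) == shift32(0x00123456u), "top 8 bits are lost");
static_assert(shift32(0xAB123456u) == 0x12345600u, "24 LSBs padded with zeros");
static_assert(shift64(0xAB123456u) == 0xAB12345600ull, "all 32 bits survive");
```

Since the upper 8 bits of the operand never reach the divider, a synthesis tool is free to treat the dividend as 24 significant bits, which is consistent with the guess above.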

 

On the other hand, the other function

val1_t div_ap_fixed(val1_t a, val1_t b) {
#pragma HLS INLINE off
return a / b;
}
-> that's a "true" 32b_divided_by_32b divider, with the fixed-point scaling happening anyway.

 

 

Summary:

Since 'div_ap_fixed' has 8 more bits to divide than 'div_cz_fixed', 'div_ap_fixed' will take up more resources and more latency.

I believe that VHLS uses an iterative method, so 8 extra bits <=> 8 extra iterations == 8 extra cycles. I can convince myself that the difference in latency between the two is 8 cycles, and indeed 34 + 8 == 42.
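
To illustrate why an iterative divider costs one step per dividend bit, here is a textbook restoring (shift-and-subtract) division in plain C++. It is a sketch only: whether the Vivado HLS divider core is actually built this way is an assumption, and DivResult and restoring_div are names invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch.
struct DivResult {
    std::uint64_t quotient;
    std::uint64_t remainder;
    int iterations;  // one per dividend bit
};

// Restoring division of a `dividend_bits`-wide dividend by a 32-bit divisor.
DivResult restoring_div(std::uint64_t dividend, std::uint32_t divisor, int dividend_bits) {
    assert(divisor != 0 && dividend_bits > 0 && dividend_bits <= 64);
    DivResult r{0, 0, 0};
    for (int i = dividend_bits - 1; i >= 0; --i) {
        // Bring down the next dividend bit, MSB first.
        r.remainder = (r.remainder << 1) | ((dividend >> i) & 1u);
        r.quotient <<= 1;
        if (r.remainder >= divisor) {  // trial subtraction succeeds
            r.remainder -= divisor;
            r.quotient |= 1u;
        }
        ++r.iterations;
    }
    return r;
}
```

Calling restoring_div(65536, 1280, 24) and restoring_div(65536, 1280, 32) gives the same quotient (51), but the second call runs 8 more iterations, which lines up with the 34 versus 42 cycles in the reports if each iteration maps to one clock cycle.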

 

 

Can you confirm, Christian aka aixpower?

The easiest way is probably to check the report, where the instance names are explicit, and/or to check the generated RTL.

 

have fun...

 

- Hervé

Visitor
12,277 Views
Registered: ‎12-13-2011

Thank you all for your fast feedback. Just to make sure we all have the same understanding, the intended format is

- 32 bit word

- 32-8=24 bit integer part

- 32-24=8 bit fractional part

So ap_fixed<32,24> should exactly represent this.

 

@Hervé, the double part will add some overhead to the top-level function, but this should not influence the two functions I'm looking at (which is why I used #pragma HLS INLINE off). The ports of the two resulting entities also look as expected:

entity div_ap_fixed is
port (
    ...
    a_V : IN STD_LOGIC_VECTOR (31 downto 0);
    b_V : IN STD_LOGIC_VECTOR (31 downto 0);
    ap_return : OUT STD_LOGIC_VECTOR (31 downto 0) );
end;

 

entity div_cz_fixed is
port (
    ...
    a : IN STD_LOGIC_VECTOR (31 downto 0);
    b : IN STD_LOGIC_VECTOR (31 downto 0);
    ap_return : OUT STD_LOGIC_VECTOR (31 downto 0) );
end;

 

But you are right, the main issue is the expression

 (a << NUMBER_OF_DECIMAL_DIGITS) / b;

which works fine from a logical point of view, but does not work when the dividend is greater than or equal to 2^15, because the shifted value no longer fits in a signed 32-bit int. E.g. dividing 8388607.0 (2^23-1, represented by 0x7FFFFF00) by itself would not result in 1.0 (0x100), as I lose the MSBs.
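
The 0x7FFFFF00 case can be reproduced in plain C++. The sketch below uses invented names (div_narrow, div_wide). The narrow version shifts in unsigned arithmetic and converts back to mimic the 32-bit wrap-around without signed-overflow undefined behavior; that conversion is modular on common compilers and guaranteed from C++20 on. The wide version is one possible fix, assuming a 64-bit intermediate is acceptable.

```cpp
#include <cstdint>

// Hypothetical names for illustration only.
constexpr int FRAC_BITS = 8;

// 32-bit intermediate, as in div_cz_fixed: the top 8 bits of `a` are lost.
constexpr std::int32_t div_narrow(std::int32_t a, std::int32_t b) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(a) << FRAC_BITS) / b;
}

// 64-bit intermediate: the full 40-bit dividend is kept before dividing.
// Multiplication is used instead of << so negative values are handled too.
constexpr std::int32_t div_wide(std::int32_t a, std::int32_t b) {
    return static_cast<std::int32_t>(
        (static_cast<std::int64_t>(a) * (std::int64_t{1} << FRAC_BITS)) / b);
}

// Small operands: both versions agree (1.0 / 5.0 -> raw 51).
static_assert(div_narrow(256, 1280) == 51, "small operands are fine");
static_assert(div_wide(256, 1280) == 51, "small operands are fine");

// 0x7FFFFF00 divided by itself should give 1.0, i.e. raw 0x100.
static_assert(div_wide(0x7FFFFF00, 0x7FFFFF00) == 0x100, "widened divide gives 1.0");
static_assert(div_narrow(0x7FFFFF00, 0x7FFFFF00) != 0x100, "32-bit shift loses the MSBs");
```

The widened version corresponds to a 40b_divided_by_32b divider, so in hardware it would be expected to cost more area and latency than the 32-bit shift.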

 

So you are absolutely right, I generated a 24b_divided_by_32b divider. As VHLS generated HDL code is doing it in an iterative manner, the 40b_divided_by_32b needs 8 more cycles.

 

BTW, do you know of any way to control the generated code for the divider? As an extreme example, you can also perform the division in one clock cycle, at a reduced clock speed and with more area. So it would be nice to be able to fine-tune the generated HDL in this way.

 

Thanks a lot for improving my understanding,

Christian

Visitor
12,276 Views
Registered: ‎12-13-2011

Sorry for the typo:

> So you are absolutely right, I generated a 24b_divided_by_32b divider. As VHLS generated HDL

> code is doing it in an iterative manner, the 40b_divided_by_32b needs 8 more cycles.

 

should be

 

... iterative manner, the 32b_divided_by_32b needs 8 more cycles.

 
