Visitor
Registered: ‎09-04-2020

opencl kernel: "casting" uint512_dt type to array of 16 floats


Hello,

I would like to use the 512-bit data type on the U250 to perform floating-point operations.

I am modifying the wide_vadd tutorial code:

https://github.com/GrokImageCompression/latke/tree/master/tests/wide_vadd

I am using the 512-bit type to take advantage of the 512-bit bus on the U250 and transfer 512 bits at a time to local memory efficiently.

What I really want, though, is to perform 16 floating-point operations on each 512-bit piece of data once it is stored in local memory.

Is this possible? How can I access the 512 bits as floats?


Accepted Solutions
Xilinx Employee
Registered: ‎08-17-2011

Hello  @boxerab , 

 

Your general question has several possible solutions. One is to read the 512-bit bus and store the values as floats in an array, which could be 1D or 2D. A 2D array is the easiest way to picture it: myfloatstorage[..][16], completely partitioned on dimension 2, so that each of the 16 lanes is independent.
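As a rough illustration of that 2D layout, here is a minimal sketch. The names `ROWS`, `LANES`, `sum_lanes` and `check_sum_lanes` are made up for this example, and the per-lane sum is only a placeholder computation. The inputs are assumed to be floats already, so the raw-bit conversion is not shown here. An ordinary C++ compiler ignores the HLS pragma, so the function can be tried on the host.

```cpp
#include <cassert>

constexpr int ROWS = 4;    // hypothetical number of buffered 512-bit words
constexpr int LANES = 16;  // 512 bits / 32 bits per float

// Adds up each of the 16 lanes across all buffered rows.
void sum_lanes(const float in[ROWS][LANES], float out[LANES]) {
    float myfloatstorage[ROWS][LANES];
    // Fully partition dimension 2 so that each lane gets its own storage.
    // An ordinary C++ compiler ignores this pragma.
#pragma HLS array_partition variable=myfloatstorage complete dim=2

    // Copy the incoming rows into the local, partitioned buffer.
    for (int r = 0; r < ROWS; r++)
        for (int l = 0; l < LANES; l++)
            myfloatstorage[r][l] = in[r][l];

    // Each lane is reduced independently of the others.
    for (int l = 0; l < LANES; l++) {
        float acc = 0.0f;
        for (int r = 0; r < ROWS; r++) acc += myfloatstorage[r][l];
        out[l] = acc;
    }
}

// Host-side check: every row holds the lane index, so lane l sums to ROWS * l.
void check_sum_lanes() {
    float in[ROWS][LANES];
    for (int r = 0; r < ROWS; r++)
        for (int l = 0; l < LANES; l++) in[r][l] = static_cast<float>(l);
    float out[LANES];
    sum_lanes(in, out);
    for (int l = 0; l < LANES; l++) assert(out[l] == static_cast<float>(ROWS * l));
}
```

Whether complete partitioning is the right trade-off depends on how many rows you buffer and on the resources available, so treat the pragma placement as a starting point.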

You can convert raw bits to float using a union of unsigned int and float, *but not* a union of ap_uint and float, because ap_uint is a class.

Some of this is (or was) shown in the Vivado HLS / Vitis HLS user guide:

template <typename T> // ap_uint or unsigned int can be used 
float uint_to_float(T x) {
    union {
        unsigned int i;
        float f;
    } conv;
    conv.i=(unsigned int)x;
    return conv.f;
}

unsigned int float_to_uint(float x) {
    union {
        unsigned int i;
        float f;
    } conv;
    conv.f=x;
    return conv.i;
}
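For host-side test code, a `std::memcpy`-based variant produces the same bit pattern without relying on union type punning, which ISO C++ does not guarantee even though compilers commonly support it. The sketch below is an illustration only: the names `bits_to_float`, `float_to_bits` and `check_round_trip` are hypothetical, and it assumes a 32-bit IEEE 754 `float`. Whether to prefer it over the union inside the kernel itself is not something the answer above addresses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");

// Reinterprets a raw 32-bit pattern as a float (no numeric conversion).
float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Returns the raw 32-bit pattern of a float.
std::uint32_t float_to_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Quick sanity checks, assuming IEEE 754 single precision.
void check_round_trip() {
    assert(float_to_bits(1.0f) == 0x3F800000u);  // bit pattern of 1.0f
    assert(bits_to_float(0x40000000u) == 2.0f);  // bit pattern of 2.0f
    assert(bits_to_float(float_to_bits(-0.75f)) == -0.75f);
}
```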

 

Now, regarding the source code that you point to: the buffers there are ap_uint<512>, so let's leave that unchanged and instead write a vector add that operates directly on S-bit slices (ranges) of a larger W-bit bus.

Let's look first at the integer version, which takes W=512 bits and adds them in slices of, say, S=32 bits.

For this, let's (re)use a templated function:

template <int S, int W>
void vadd(
        ap_uint<W> a,
        ap_uint<W> b,
        ap_uint<W> &res
        ) {
    // pipelining the function implicitly unrolls the loop
    #pragma HLS pipeline II=1
vaddloop:
    for(int i=0; i<W/S; i++) {
        // add slice i of a and b; each slice is S bits wide
        res.range( S*(i+1)-1, S*i ) = a.range( S*(i+1)-1, S*i )   +   b.range( S*(i+1)-1, S*i ) ;
    }
}

// example usage
    ap_uint<512> a;
    ap_uint<512> b;
    ap_uint<512> res_vadd_8bits_words;
    ap_uint<512> res_vadd_32bits_words;

    // adds 8 bits at a time, i.e. 512/8=64 adders of 8-bit data
    vadd<8>(a,b,res_vadd_8bits_words);
    // adds 32 bits at a time, i.e. 512/32=16 adders of 32-bit data
    vadd<32>(a,b,res_vadd_32bits_words);

In the examples above, the compiler can deduce W=512 from the function arguments, so only S (the slice size) needs to be provided as an explicit template argument.

 

Now you can see how the two can be merged, for example:

template <int W> // hardcoding in this version because float means S=32
void vadd_as_float(
        ap_uint<W> a,
        ap_uint<W> b,
        ap_uint<W> &res
        ) {
    const int S =32;
    // pipelining the function implicitly unrolls the loop
    #pragma HLS pipeline II=1
vaddloop:
    for(int i=0; i<W/S; i++) {
        // reinterpret each 32-bit slice as a float, add, and store the raw bits back
        res.range( S*(i+1)-1, S*i ) = float_to_uint (
               uint_to_float(a.range( S*(i+1)-1, S*i ) )
           +   uint_to_float(b.range( S*(i+1)-1, S*i ) ) 
           );
    }
}

// example usage
    ap_uint<512> a;
    ap_uint<512> b;
    ap_uint<512> res_vadd_as_floats;

    // adds 32 bits at a time, i.e. 512/32=16 float adders on 32-bit data
    vadd_as_float(a,b,res_vadd_as_floats);
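When adapting this, a plain C++ reference model can be handy for checking the kernel output in a testbench. The following is a sketch under stated assumptions: `Word512` (an array of 16 raw 32-bit lanes) stands in for `ap_uint<512>`, lane i is assumed to correspond to bits 32*i+31 down to 32*i as in the range() calls above, and `pack_floats`, `vadd_as_float_ref` and `check_vadd_as_float_ref` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for ap_uint<512>: lane i holds bits [32*i+31 : 32*i].
using Word512 = std::array<std::uint32_t, 16>;

// Packs 16 floats into a 512-bit word, one raw bit pattern per lane.
Word512 pack_floats(const std::array<float, 16>& f) {
    Word512 w{};
    for (std::size_t i = 0; i < w.size(); i++)
        std::memcpy(&w[i], &f[i], sizeof(float));
    return w;
}

// Lane-wise float addition on the raw bits of a and b.
Word512 vadd_as_float_ref(const Word512& a, const Word512& b) {
    Word512 res{};
    for (std::size_t i = 0; i < res.size(); i++) {
        float fa, fb;
        std::memcpy(&fa, &a[i], sizeof fa);
        std::memcpy(&fb, &b[i], sizeof fb);
        const float sum = fa + fb;
        std::memcpy(&res[i], &sum, sizeof sum);
    }
    return res;
}

void check_vadd_as_float_ref() {
    std::array<float, 16> fa{}, fb{}, expected{};
    for (std::size_t i = 0; i < fa.size(); i++) {
        fa[i] = 0.5f * static_cast<float>(i);
        fb[i] = 1.0f;
        expected[i] = fa[i] + fb[i];
    }
    // The result word must hold the bit patterns of the lane-wise sums.
    assert(vadd_as_float_ref(pack_floats(fa), pack_floats(fb)) == pack_floats(expected));
}
```

Comparing the raw lanes rather than converted floats keeps the check exact.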

 

I hope this helps; it should fully answer your question, perhaps after some small adaptations.

- Hervé


Visitor
Registered: ‎09-04-2020
Fantastic, thank you very much!