mmyhui
Contributor
Registered: ‎10-13-2020

How to copy an int into every lane of a v4int32?

Neither of these methods of creating max_val_mid works, because no intrinsic has a matching signature:

int32     max_val;
v4int32   max_val_mid  = undef_v4int32();
v16int32  max_val_wide = undef_v16int32();

max_val_mid  = upd_v ( max_val_mid, 0, max_val);
max_val_mid  = upd_v ( max_val_mid, 1, max_val);
max_val_mid  = upd_v ( max_val_mid, 2, max_val);
max_val_mid  = upd_v ( max_val_mid, 3, max_val);

max_val_mid = concat ( max_val, max_val, max_val, max_val );

max_val_wide = concat ( max_val_mid, max_val_mid, max_val_mid, max_val_mid );

The last statement works because it matches an existing function with an identical signature.

I want to load the same integer into all lanes of a v16int32. What is the fastest way to do that? The integer is determined at run time.

 


Accepted Solutions
florentw
Moderator
Registered: ‎11-09-2015

Hi @mmyhui 

Sorry for the delay on this. I believe there is a tradeoff depending on what you want. First, did you check how many cycles this takes? The compiler might be able to optimise it.

You might get better performance by creating a v8int32-sized array directly in memory and loading the data from there (since 256-bit wide loads are available). Something like:

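// Table aligned for a 256-bit (v8int32) load, holding eight copies of the value
int32 chess_storage(%chess_alignof(v8int32)) weights[8] = {max_val, max_val, max_val, max_val, max_val, max_val, max_val, max_val};

void yourkernel(...){
    const int32 * restrict Wptr = weights;
    // Keep the const qualifier when viewing the table as a vector
    const v8int32 * vWptr = (const v8int32 *)Wptr;
    // Write the same 256-bit word into each half of the v16int32
    max_val_wide = upd_w(max_val_wide, 0, *vWptr);
    max_val_wide = upd_w(max_val_wide, 1, *vWptr);
    ...
}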

 

I haven't tested this code but this is what I would try.
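
For readers without the AI Engine toolchain at hand, the data flow of this memory-based approach can be modelled in portable C++. This is only a sketch under assumptions: `Vec8`, `Vec16`, `fill_weights`, `load8`, `update_half` and `broadcast16_from_memory` are hypothetical stand-ins, not AI Engine intrinsics, and the table is filled inside a function on the assumption that `max_val` is only known at run time. It shows where each lane ends up and says nothing about cycle counts on the hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for the 8-lane and 16-lane vector types.
using Vec8  = std::array<std::int32_t, 8>;
using Vec16 = std::array<std::int32_t, 16>;

// 32-byte alignment so the table could be read as a single 256-bit word.
alignas(32) static std::int32_t weights[8];

// Assumption: max_val is a run-time value, so the table is filled here
// rather than through a static initializer.
void fill_weights(std::int32_t max_val) {
    for (std::int32_t& w : weights) {
        w = max_val;
    }
}

// Model of one 256-bit load: copies eight consecutive int32 values.
Vec8 load8(const std::int32_t* p) {
    Vec8 v{};
    std::memcpy(v.data(), p, v.size() * sizeof(std::int32_t));
    return v;
}

// Model of overwriting one 8-lane half (idx 0 or 1) of a 16-lane vector.
Vec16 update_half(Vec16 wide, std::size_t idx, const Vec8& half) {
    assert(idx < 2);
    for (std::size_t i = 0; i < half.size(); ++i) {
        wide[8 * idx + i] = half[i];
    }
    return wide;
}

// Fill the table, load it once, and place it in both halves.
Vec16 broadcast16_from_memory(std::int32_t max_val) {
    fill_weights(max_val);
    const Vec8 half = load8(weights);
    Vec16 wide{};
    wide = update_half(wide, 0, half);
    wide = update_half(wide, 1, half);
    return wide;
}
```

Calling `broadcast16_from_memory(max_val)` returns a 16-lane value with `max_val` in every lane. Note that in this model the eight scalar stores in `fill_weights` are part of the cost, which is one side of the tradeoff mentioned above.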


Florent
Product Application Engineer - Xilinx Technical Support EMEA
mmyhui
Contributor
Registered: ‎10-13-2020

I believe this works, although I hope there is a faster way:

    max_val_mid  = shft_elem ( max_val_mid, max_val );
    max_val_mid  = shft_elem ( max_val_mid, max_val );
    max_val_mid  = shft_elem ( max_val_mid, max_val );
    max_val_mid  = shft_elem ( max_val_mid, max_val );
    max_val_wide = concat ( max_val_mid, max_val_mid, max_val_mid, max_val_mid );
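
To see why four shifts are enough even though max_val_mid starts out undefined, here is a small portable C++ model of the pattern. It is a sketch, not the real intrinsics: `Vec4`, `Vec16`, `shift_in`, `concat4` and `broadcast16` are hypothetical names, and the shift direction chosen in `shift_in` is an assumption. The result does not depend on that direction, because after four shifts every original lane of a four-lane vector has been pushed out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the 4-lane and 16-lane vector types.
using Vec4  = std::array<std::int32_t, 4>;
using Vec16 = std::array<std::int32_t, 16>;

// Model of a one-element shift-in: each lane moves up by one position and
// the new value enters at lane 0. The direction is an assumption.
Vec4 shift_in(Vec4 v, std::int32_t x) {
    for (std::size_t i = v.size() - 1; i > 0; --i) {
        v[i] = v[i - 1];
    }
    v[0] = x;
    return v;
}

// Model of concatenating four 4-lane vectors into one 16-lane vector.
Vec16 concat4(const Vec4& a, const Vec4& b, const Vec4& c, const Vec4& d) {
    Vec16 out{};
    const Vec4* parts[4] = {&a, &b, &c, &d};
    for (std::size_t p = 0; p < 4; ++p) {
        for (std::size_t i = 0; i < 4; ++i) {
            out[4 * p + i] = (*parts[p])[i];
        }
    }
    return out;
}

// Broadcast a run-time scalar to all 16 lanes: four shifts, then one concat.
Vec16 broadcast16(std::int32_t x) {
    Vec4 mid = {-1, -2, -3, -4};  // arbitrary values standing in for "undefined"
    for (int n = 0; n < 4; ++n) {
        mid = shift_in(mid, x);   // each shift discards one stale lane
    }
    const Vec16 wide = concat4(mid, mid, mid, mid);
    for (std::int32_t lane : wide) {
        assert(lane == x);        // every lane now holds the scalar
    }
    return wide;
}
```

The model only checks lane contents. Whether the four dependent shifts are slower than a single wide load on the actual core is something only a cycle count from the simulator or hardware can answer.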
mmyhui
Contributor
Registered: ‎10-13-2020

@florentw :

I will try your solution next. My code fragment above works, but I don't know how many cycles it takes.

Is there a document or webpage showing what tools are available for benchmarking scalar or vector code? I need to see pipeline stalls so that I can rearrange the code to speed it up.

I see this: https://github.com/Xilinx/Vitis-Tutorials/tree/master/AI_Engine_Development/Feature_Tutorials/09-debug-walkthrough 

I have seen the debug mode in Vitis, and it is impressive, but I need to refer back to my course notes on how to use it properly. Is there new documentation on how to use it?

florentw
Moderator
Registered: ‎11-09-2015

Hi @mmyhui 

The GitHub example that you found is the best reference for this. UG1076 and UG1079 have some references, but the tutorial is the best in my opinion.

