cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Highlighted
Explorer
Explorer
544 Views
Registered: ‎08-31-2017

How to implement fully parallel adder tree for a for-loop accumulation

Jump to solution

Hi, 

For the following accumulation loop, the target is to explore how to implement in fully parallel adder tree as the following figure shown.

Assume that the x[4] is completely partitioned and thus x[0]-x[3] is available at the same time. 

I try either unroll the loop loop_acc or pipeline the function but still see the accumulation is executed in sequence.

Would you please help to point out how to achieve the goal ? Thank you.

 

---HLS code---

const unsigned int N_IN   = 4;    

float add_tree(float x[N_IN]) {

   float acc_tmp=0;

   loop_acc:for(int j=0;j<N_IN;++j) {
      acc_tmp = acc_tmp + x[j];
   }

   return acc_tmp;
}

parallel_adder.PNG
0 Kudos
1 Solution

Accepted Solutions
Highlighted
Xilinx Employee
Xilinx Employee
413 Views
Registered: ‎09-04-2017

@nanson you have floating point datatypes. HLS by default doesn't do expression balancing on these to avoid any functional mismatches with float types.

You can turn on unsafe_math_optimizations

config_compile  -unsafe_math_optimizations

partition the array fully and unroll the loop. That should work

Thanks,

Nithin

View solution in original post

4 Replies
Highlighted
Explorer
Explorer
448 Views
Registered: ‎08-31-2017

Hi, folks,

Does anyone have a comment/experience on how to implement a fully parallel adder tree for a for-loop accumulation?

Thank you very much.

0 Kudos
Highlighted
Xilinx Employee
Xilinx Employee
414 Views
Registered: ‎09-04-2017

@nanson you have floating point datatypes. HLS by default doesn't do expression balancing on these to avoid any functional mismatches with float types.

You can turn on unsafe_math_optimizations

config_compile  -unsafe_math_optimizations

partition the array fully and unroll the loop. That should work

Thanks,

Nithin

View solution in original post

Highlighted
Explorer
Explorer
373 Views
Registered: ‎08-31-2017

@nithink 

Thank you for the information. It works but I have one question.

When I set loop trip count as 20 and then apply the commands you suggested, I got the following results.

In my expectation, the first layer should have 10 adders run in parallel, it seems to have 8 parallel adders as shown below. Do you know how to explain it?

 

 

Selection_238.png
0 Kudos
Highlighted
Adventurer
Adventurer
290 Views
Registered: ‎07-08-2019

Hi @nanson,

Vivado tries to minimize resource usage while achieving maximum performance. So, it reuses adders of level-1 in subsequent levels of adder tree whenever possible.

When performing 8 additions on 16 inputs (in0,..., in15) in level-1, 8 partial sums are generated and therefore, only four adders are needed to add pairs of these partial sums in level-2. So, we have four free unused adders in level-2 and six free adders in level-3, etc. As a result, vivado can schedule additions of four remaining inputs (in16 - in19) to be performed in level-2 and/or later levels using available unused adders.

You may take a look at the following figure for more clarification.

Nansun Adder Tree.png

 

You can see that using maximum of 8 adders in each level (instead of 10 adders!) the best possible performance (unpipelined) is achieved.

Regards,

Ali