bhavinlapasia
Adventurer
848 Views
Registered: ‎05-18-2018

Vivado HLS Division accelerate

Hello,

I am using Vivado HLS with xfOpenCV for an image processing application. At the end of a nested for loop I divide an ap_uint<24> numerator by an ap_uint<16> denominator using the '/' operator.

Can someone help me accelerate this division? With a pipeline directive on the for loop, the estimated clock period exceeds 10 ns, and I would also like to reduce the latency.

I would like to know whether a DSP slice can do this division faster and, if so, how to assign that resource.

Regards

5 Replies
u4223374
Advisor
775 Views
Registered: ‎04-26-2015

DSP slices cannot do division.

 

If the denominator is constant then you can compute 1/<denominator> at compile-time and then use a DSP to quickly multiply each new numerator by 1/<denominator>. I'm pretty sure HLS will do this automatically for small denominators.

If it's not constant but doesn't change often (e.g. it changes once per frame but is used for every pixel in that frame), then you can compute 1/<denominator> slowly (e.g. over tens of cycles) and use the same process as above to apply it quickly.
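
To make the "compute the reciprocal slowly, apply it quickly" idea concrete, here is a minimal sketch in plain C++. The function names and the 40-bit fixed-point scale are assumptions chosen for illustration, and standard integer types stand in for ap_uint<24> and ap_uint<16>. The one true division sits outside the pixel loop, so the per-pixel work is a multiply and a shift. Rounding the reciprocal up with 40 fractional bits should reproduce the integer quotient for numerators below 2^24 and nonzero 16-bit denominators.

```cpp
#include <cstdint>

// Fixed-point scale for the reciprocal (an assumed choice for this sketch).
const int kFracBits = 40;

// Slow step: a single real division, run only when the denominator changes.
// Returns ceil(2^40 / den). The caller must ensure den != 0.
uint64_t make_reciprocal(uint16_t den) {
    return ((uint64_t(1) << kFracBits) + den - 1) / den;
}

// Fast step: one multiply and one shift per pixel. Assumes num < 2^24.
uint32_t div_by_reciprocal(uint32_t num, uint64_t recip) {
    return static_cast<uint32_t>((uint64_t(num) * recip) >> kFracBits);
}

// Hypothetical per-frame usage: the reciprocal is hoisted out of the pixel loop.
void scale_frame(const uint32_t *in, uint32_t *out, int n, uint16_t den) {
    const uint64_t recip = make_reciprocal(den);
    for (int i = 0; i < n; ++i) {
        out[i] = div_by_reciprocal(in[i], recip);
    }
}
```

Whether the tool maps the wide multiply onto DSP slices, and how many it uses, depends on the device and settings, so it is worth checking the synthesis report.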

 

If the denominator isn't at all constant, then you have to look into other approaches. There are numerous approximate methods that use a series expansion with a lookup table.
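
As one illustration of the lookup-table family of methods, the sketch below seeds a reciprocal from a small table and refines it with a single Newton-Raphson step. This particular combination is an assumption picked for the example, not necessarily the method meant above, and all names and bit widths are hypothetical. The result is approximate: it can fall slightly below the true quotient.

```cpp
#include <cstdint>

// Seed for 1/D with D in [1, 2), returned as a Q0.16 value. In hardware this
// would be a 128-entry ROM; it is computed on the fly here to keep the sketch short.
static uint32_t recip_seed(unsigned idx) {
    const double mid = 1.0 + (idx + 0.5) / 128.0;  // midpoint of the table bin
    return static_cast<uint32_t>(65536.0 / mid + 0.5);
}

// Approximate num / den for num < 2^24 and den in [1, 65535].
uint32_t approx_div(uint32_t num, uint16_t den) {
    if (den == 0) return 0;  // placeholder policy for an invalid denominator

    // Normalise: shift den left until bit 15 is set, so D = dn / 2^15 lies in [1, 2).
    uint32_t dn = den;
    int s = 0;
    while ((dn & 0x8000u) == 0) { dn <<= 1; ++s; }

    // Lookup: the 7 bits below the leading one select the seed x0, roughly 2^16 / D.
    const uint64_t x0 = recip_seed((dn >> 8) & 0x7Fu);

    // One Newton-Raphson step: x1 = x0 * (2 - D * x0).
    const uint64_t t  = uint64_t(dn) * x0;                        // D * x0 scaled by 2^31
    const uint64_t x1 = (x0 * ((uint64_t(1) << 32) - t)) >> 31;   // roughly 2^16 / D

    // num / den = num * x1 * 2^(s - 31).
    return static_cast<uint32_t>((uint64_t(num) * x1) >> (31 - s));
}
```

A larger table or a second refinement step would tighten the error at the cost of more memory or another multiply.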

nithink
Xilinx Employee
770 Views
Registered: ‎09-04-2017

@bhavinlapasia  You can try an approach like the one below for integers.

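typedef int data_t;

// Integer division c = a / b by counting multiples of b.
// Assumes a >= 0 and b > 0. The loop runs about a/b times,
// so the latency grows with the quotient.
void top(data_t a, data_t b, data_t &c)
{
  data_t temp = 1;

  // Advance temp until temp*b reaches or passes a.
  while (a - temp*b > 0)
  {
#pragma HLS pipeline
    temp++;
  }

  // If temp*b overshot a, step back by one; otherwise temp*b == a exactly.
  if (a < temp*b)
  {
    c = temp - 1;
  }
  else
  {
    c = temp;
  }
}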
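
For comparison, a shift-and-subtract (restoring) divider is a common alternative whose iteration count is fixed by the numerator width rather than by the quotient. The sketch below is not from the posts in this thread. It uses plain C++ types in place of ap_uint<24> and ap_uint<16>, and the function name is hypothetical.

```cpp
#include <cstdint>

// Restoring division: produces one quotient bit per iteration, 24 iterations in total.
// Assumes num < 2^24 and den != 0.
uint32_t shift_sub_div(uint32_t num, uint16_t den) {
    uint32_t rem = 0;  // running remainder
    uint32_t quo = 0;  // quotient built up from the most significant bit
    for (int i = 23; i >= 0; --i) {
        // Bring down the next numerator bit.
        rem = (rem << 1) | ((num >> i) & 1u);
        quo <<= 1;
        // Subtract the denominator when it fits and record a 1 bit.
        if (rem >= den) {
            rem -= den;
            quo |= 1u;
        }
    }
    return quo;
}
```

Because the trip count is a constant, a loop like this can in principle be unrolled or pipelined by the tool. The timing it actually achieves would need to be checked in synthesis.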

 

Thanks,

Nithin

bhavinlapasia
Adventurer
726 Views
Registered: ‎05-18-2018

@u4223374 ,

Thanks for the input. My denominator is not constant, but I checked and it may not change frequently, so I can follow the second option you gave. However, I have tried a few things based on your suggestion and have the following questions:

  • If I use 1/denominator instead of numerator/denominator, will it reduce the clock cycles, given that it still requires a division core? I tried this in a new project, replacing the division with 1/denominator and then multiplying by the numerator, and it takes the same number of clock cycles as the standard division.
  • If I do the division less often, i.e. once every 5 or 10 clocks, will the synthesis tool understand that the division core is needed only once every 5 loop iterations or so? I tried this and synthesised the design, and the clock cycles were almost the same. My guess is that the tool still budgets for the division core's computation time during synthesis and simply uses the core when the condition becomes true, which results in the same number of clock cycles. Please correct me if I am wrong. (PS: my denominator is the standard deviation of a window kernel.)

Regards & Thanks 

bhavinlapasia
Adventurer
723 Views
Registered: ‎05-18-2018

@nithink ,

I understand your method, but will it work in a video processing application alongside other operations? I mean, will it be able to meet timing and improve the overall latency and clock period?

Thanks & regards

nithink
Xilinx Employee
713 Views
Registered: ‎09-04-2017

@bhavinlapasia  This code infers a DSP. You can synthesise it standalone first to see what timing it achieves on your device, and decide accordingly.

Thanks,

Nithin
