Visitor gilg1
Registered: ‎12-18-2016

Loops with variable loop boundaries

Hi,

I'm working with Vivado HLS 2016.3.

I'm trying to synthesize the following loop code with variable boundaries:

 

#include <stdio.h>
#include <math.h>

/* Arguments are parenthesized so that expressions expand safely. */
#define min(a,b) (((a) < (b)) ? (a) : (b))
#define max(a,b) (((a) > (b)) ? (a) : (b))

#define NumSamples 128

int loop(int InputData[NumSamples],
         int half_loop,
         int OutputData)
{
   int start_loop;
   int end_loop;
   int i;

   /* Clamp the window around NumSamples/2 to the array bounds. */
   start_loop = max(0, NumSamples/2 - half_loop);
   end_loop   = min(NumSamples, NumSamples/2 + half_loop);

   OutputData = 0;

   for (i = start_loop; i < end_loop; i++)
   {
      OutputData += <calculation with InputData[i]>;
   }

   return 0;
}

 

My first attempt was:

 

   int valid_loop;

   start_loop = max(0, NumSamples/2 - half_loop);
   end_loop   = min(NumSamples, NumSamples/2 + half_loop);

   for (i = 0; i < NumSamples; i++)
#pragma HLS UNROLL factor=32
   {
      if ((i >= start_loop) && (i < end_loop)) {
         valid_loop = 1;
      }
      else {
         valid_loop = 0;
      }

      OutputData += <calculation with InputData[i]> * valid_loop;
   }

 

The HLS result was a constant loop latency of ~200.

 

My second attempt was:

 

 

   start_loop = max(0, NumSamples/2 - half_loop);
   end_loop   = min(NumSamples, NumSamples/2 + half_loop);

   for (i = 0; i < NumSamples; i++)
#pragma HLS UNROLL factor=32
   {
      if ((i >= start_loop) && (i < end_loop)) {
         OutputData += <calculation with InputData[i]>;
      }
   }

 

The HLS result was a variable loop latency ranging from ~20 to ~1000, with about half the resources of the first attempt.

 

I would like to understand why, in the second attempt, HLS does not use all the resources and gets worse latency than in the first attempt.

Is there a better way to unroll a loop with variable boundaries?

 

Thanks,

 

 Gil

 

 

Scholar u4223374
Registered: ‎04-26-2015

Re: Loops with variable loop boundaries

I suspect that it has something to do with array partitioning (or the lack thereof). In the first case, HLS knows that it needs to access every element in the array. It can't read 32 at once, but it can build a state machine that reads them one per cycle. In the second case, it might be able to skip lots of reads (which would be fast), but it can't plan ahead to have the data ready in advance.

 

If your array is a "normal" one (i.e. block RAM or LUT RAM), you'll get better results just from pipelining the loop: far lower resource usage, a better maximum clock speed, and performance that is likely to be just as good. After all, it can only read one element per cycle anyway.
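A minimal sketch of the pipelined variant, for illustration only. It assumes the elided calculation is simply InputData[i], and the function name sum_window_pipelined and the macro NUM_SAMPLES are hypothetical. The LOOP_TRIPCOUNT pragma is only a hint for the latency report when the loop bounds are variable; it does not change the generated hardware.

```c
#define NUM_SAMPLES 128

/* Hypothetical function: sums the window centred on NUM_SAMPLES/2.
 * Assumption: the per-element calculation is just InputData[i]. */
int sum_window_pipelined(const int InputData[NUM_SAMPLES], int half_loop)
{
    int start_loop = NUM_SAMPLES / 2 - half_loop;
    int end_loop   = NUM_SAMPLES / 2 + half_loop;
    int sum = 0;
    int i;

    /* Clamp the window to the array bounds. */
    if (start_loop < 0) start_loop = 0;
    if (end_loop > NUM_SAMPLES) end_loop = NUM_SAMPLES;

    for (i = start_loop; i < end_loop; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_TRIPCOUNT min=0 max=128
        sum += InputData[i];
    }
    return sum;
}
```

With a pipelined loop the latency still depends on half_loop, but only one element is read per cycle, which matches what a single memory port can deliver.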

 

If you need much higher performance, you need to partition the array to allow 32 simultaneous reads.
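A rough sketch of what partitioning for 32 simultaneous reads could look like, under stated assumptions: the elided calculation is taken to be InputData[i], the name sum_window_unrolled is hypothetical, and cyclic partitioning with factor 32 is just one possible choice (a complete partition is another). With a cyclic partition, element i lands in bank i mod 32, so the 32 consecutive elements touched by one unrolled iteration sit in different banks.

```c
#define NUM_SAMPLES 128

/* Hypothetical function: fixed-bound loop with a per-element select.
 * Assumption: the per-element calculation is just InputData[i]. */
int sum_window_unrolled(const int InputData[NUM_SAMPLES], int half_loop)
{
#pragma HLS ARRAY_PARTITION variable=InputData cyclic factor=32 dim=1
    int start_loop = NUM_SAMPLES / 2 - half_loop;
    int end_loop   = NUM_SAMPLES / 2 + half_loop;
    int sum = 0;
    int i;

    /* Clamp the window to the array bounds. */
    if (start_loop < 0) start_loop = 0;
    if (end_loop > NUM_SAMPLES) end_loop = NUM_SAMPLES;

    for (i = 0; i < NUM_SAMPLES; i++) {
#pragma HLS UNROLL factor=32
        /* Elements outside the window contribute zero. */
        int in_window = (i >= start_loop) && (i < end_loop);
        sum += in_window ? InputData[i] : 0;
    }
    return sum;
}
```

Because the loop bounds are constant, the trip count is known at compile time; the variable window only affects which elements are selected.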

Visitor gilg1
Registered: ‎12-18-2016

Re: Loops with variable loop boundaries

Hi,

I'm already partitioning the InputData array (I forgot to mention it in my code):

 

#pragma HLS ARRAY_PARTITION variable=InputData dim=1

 

So HLS can access all array elements simultaneously (I verified it in simulation).

I still don't understand why option #2 uses fewer resources and gives worse performance than option #1.

 

Thanks,

 

   Gil

Scholar u4223374
Registered: ‎04-26-2015

Re: Loops with variable loop boundaries

I suspect it's the same sort of thing as before. With option #1, HLS just builds a fairly simple structure to apply to all elements of the array. With option #2, HLS has to add an enable line to each of those (and occasionally HLS does this pretty inefficiently, taking a lot of time and space).
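For reference, the two options compute the same value in software; any difference is in the hardware HLS generates for them. A small C sketch that could be used in a test bench to confirm this, assuming the elided calculation is simply InputData[i] (the names sum_masked and sum_guarded are hypothetical):

```c
#define NUM_SAMPLES 128

/* Option #1 style: every element is processed and scaled by a 0/1 flag. */
int sum_masked(const int InputData[NUM_SAMPLES], int start_loop, int end_loop)
{
    int sum = 0;
    int i;
    for (i = 0; i < NUM_SAMPLES; i++) {
        int valid_loop = ((i >= start_loop) && (i < end_loop)) ? 1 : 0;
        sum += InputData[i] * valid_loop;
    }
    return sum;
}

/* Option #2 style: the accumulation itself is guarded by the condition. */
int sum_guarded(const int InputData[NUM_SAMPLES], int start_loop, int end_loop)
{
    int sum = 0;
    int i;
    for (i = 0; i < NUM_SAMPLES; i++) {
        if ((i >= start_loop) && (i < end_loop)) {
            sum += InputData[i];
        }
    }
    return sum;
}
```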

 

I haven't yet found a really good way of unrolling a loop with variable boundaries. Pretty much any approach in hardware is going to be 'nasty' - lots of adders, all with enable inputs. When the array is partitioned you also have a bunch of multiplexers, which allow more throughput but add resources. If you can narrow down the boundaries at all then that helps; in my code I've got a loop that does either 60 or 64 iterations. The solution here was to do a loop over 60 iterations and a separate loop over 4, and sum the results if/when the full 64 are required.
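A minimal sketch of the 60-or-64 split described above. The function name sum_60_or_64, the flag use_all_64, and the plain summation are all assumptions made for illustration; the real code is not shown in this thread.

```c
#define BASE_ITERS  60
#define EXTRA_ITERS 4

/* Hypothetical function: two fixed-bound loops instead of one
 * variable-bound loop. Assumption: the calculation is a plain sum. */
int sum_60_or_64(const int data[BASE_ITERS + EXTRA_ITERS], int use_all_64)
{
    int base_sum  = 0;
    int extra_sum = 0;
    int i;

    /* Always runs exactly 60 iterations. */
    for (i = 0; i < BASE_ITERS; i++) {
#pragma HLS UNROLL
        base_sum += data[i];
    }

    /* Always runs exactly 4 iterations; its result is used only on request. */
    for (i = 0; i < EXTRA_ITERS; i++) {
#pragma HLS UNROLL
        extra_sum += data[BASE_ITERS + i];
    }

    return use_all_64 ? base_sum + extra_sum : base_sum;
}
```

Both loops have constant trip counts, so the only variable part left is the final select between the two sums.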
