Minimizing Initiation Interval of Simple For Loop with Addition/Multiplication

Visitor
Posts: 2
Registered: ‎03-27-2017

Hello,

 

I am currently using a MicroZED to develop a system that does the following:

 

  1. Receives a digitized waveform over Ethernet (consisting of 256 shorts)
  2. DMAs the values of that waveform into an IP core that consists of an algorithm I've implemented with HLS
  3. Returns two values as output to the user via UART

I've been successful in doing this. However, I am now trying to do these steps in the fewest clock cycles possible. I've experimented with the HLS directives and have reduced the latency and initiation interval of my algorithm by pipelining, but I think I could do a lot better.

 

For simplicity, let's say I want to get these 256 shorts in DDR and just multiply each by a constant and add the results all together. Right now my code to do this would look like the following:

 

#include <ap_axi_sdata.h>
#include <hls_stream.h>

// Declare 16-bit unsigned integer with minimum side-channel (Includes TLAST signal)
typedef ap_axiu<16,1,1,1> uintSdCh;

void SimpleAddMult(hls::stream<uintSdCh> &inStream, int* Sum)
{

#pragma HLS INTERFACE axis port=inStream
#pragma HLS INTERFACE s_axilite port=Sum bundle=CRTL_BUS
#pragma HLS INTERFACE s_axilite port=return bundle=CRTL_BUS

// Create storage for the sums and elements
int Gain = 5;
int this_mult = 0;

for (int idx = 0; idx < (256); idx++)
{
#pragma HLS PIPELINE

  // Read and cache (Block here if FIFO sender is empty)
  uintSdCh valIn = inStream.read();

  // Multiply the 16-bit payload (the .data field of the stream word)
  this_mult = Gain * valIn.data;

  // Add to total
  *Sum = *Sum + this_mult;
}
}

I think I could greatly increase the throughput if I created multiple channels that each did the multiplication and kept their own individual sums that were later summed together. I just don't know how to achieve that after everything I've read in UG902.

 

The HP port is 64 bits wide, so it should be able to DMA 4 shorts into the module at once, correct? And in theory I should be able to do the above operation on four separate channels and speed things up by a factor of 4 (plus clock cycles needed to do the final sum of the four channels).
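To make the four-channel idea concrete, here is a minimal software sketch in plain C++ (not HLS code). The function names `sumSingle` and `sumFourLanes` are hypothetical and exist only for this illustration. It shows that keeping four independent partial sums and adding them once at the end gives the same result as a single running sum, assuming the sample count is a multiple of 4.

```cpp
#include <cstddef>
#include <cstdint>

// Reference: one accumulator, one sample per iteration (mirrors the loop above).
int sumSingle(const std::uint16_t* samples, std::size_t count, int gain) {
    int total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += gain * samples[i];
    }
    return total;
}

// Four-lane variant: sample i is accumulated in lane i % 4, and the four
// lane totals are added together once at the end.
// Assumption: count is a multiple of 4 (true for 256 samples).
int sumFourLanes(const std::uint16_t* samples, std::size_t count, int gain) {
    int lane[4] = {0, 0, 0, 0};
    for (std::size_t i = 0; i < count; i += 4) {
        for (std::size_t k = 0; k < 4; ++k) {
            lane[k] += gain * samples[i + k];
        }
    }
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

With 256 samples the outer loop runs 64 times instead of 256, which is where the hoped-for factor of 4 comes from; whether hardware achieves it depends on how the data arrives at the block.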

 

Can anybody tell me if this is possible? And if it is:

 

A) What sort of input interface do I need to use? Is it an AXI stream?

B) Do I need to completely change my C code or can I just use some directives (UNROLL, etc.) to do it on multiple channels?

 

Thanks,

 

Lance

 


Accepted Solutions
Scholar
Posts: 1,950
Registered: ‎04-26-2015

Re: Minimizing Initiation Interval of Simple For Loop with Addition/Multiplication

That should be fairly easy. The problem is that AXI Streams can only do one data transfer per cycle; there's no way around that. The easy solution is to stream a 64-bit value and cut it up into 16-bit elements inside the HLS block.
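As a rough illustration of "cutting up" a 64-bit value, the following plain C++ sketch packs four 16-bit samples into one word and extracts them again with shifts and masks. The names `packWord` and `unpackLane` are hypothetical, and the lane order (first sample in the least-significant 16 bits) is an assumption; the real order depends on how the DMA and the memory layout present the data.

```cpp
#include <cstdint>

// Pack four 16-bit samples into one 64-bit word.
// Assumption: s0 occupies bits 15..0, s1 bits 31..16, s2 bits 47..32, s3 bits 63..48.
std::uint64_t packWord(std::uint16_t s0, std::uint16_t s1,
                       std::uint16_t s2, std::uint16_t s3) {
    return  static_cast<std::uint64_t>(s0)
         | (static_cast<std::uint64_t>(s1) << 16)
         | (static_cast<std::uint64_t>(s2) << 32)
         | (static_cast<std::uint64_t>(s3) << 48);
}

// Extract lane i (0..3): shift the lane down to bit 0, then keep the low 16 bits.
std::uint16_t unpackLane(std::uint64_t word, unsigned i) {
    return static_cast<std::uint16_t>((word >> (16u * i)) & 0xFFFFu);
}
```

For example, `packWord(0x1111, 0x2222, 0x3333, 0x4444)` yields `0x4444333322221111`, and `unpackLane` on that word with `i = 0` returns `0x1111`.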

 

I'd split the loop into two separate ones: a pipelined outer loop and an unrolled inner loop. Something like this:

 

// Declare 64-bit unsigned integer with minimum side-channel (Includes TLAST signal)
typedef ap_axiu<64,1,1,1> uintSdCh;

void SimpleAddMult(hls::stream<uintSdCh> &inStream, int* Sum)
{

#pragma HLS INTERFACE axis port=inStream
#pragma HLS INTERFACE s_axilite port=Sum bundle=CRTL_BUS
#pragma HLS INTERFACE s_axilite port=return bundle=CRTL_BUS

// Create storage for the gain
int Gain = 5;

for (int idx = 0; idx < (256/4); idx++)
{
#pragma HLS PIPELINE

	// Read and cache (Block here if FIFO sender is empty)
	uintSdCh valIn = inStream.read();
	ap_uint<64> value = valIn.data;
	int internalSum = 0;
	for (int i = 0; i < 4; i++) {
	// This loop is automatically unrolled because it's inside a pipelined loop.
		ap_uint<16> dataIn = value.range((i+1)*16 - 1, i*16); // Extract the relevant 16 bits (indexing per the correction in the reply below).
		int mul_result = Gain * dataIn;
		internalSum += mul_result;
	}

	// Add to total
	*Sum += internalSum;
}
}



All Replies
Visitor
Posts: 2
Registered: ‎03-27-2017

Re: Minimizing Initiation Interval of Simple For Loop with Addition/Multiplication

Thank you very much for the concise response and example. I was able to incorporate your suggestion in my algorithm, and it compiled and synthesized just fine in HLS. The II shrank from 280 clock cycles to 89!

 

Now I will see if I can get the DMA transfer working in Vivado.

 

Note: I think there was a slight mistake in the indexing in the inner loop. I believe it should read:

 

ap_uint<16> dataIn = value.range((i+1)*16 - 1, i*16);
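To see why the indexing matters, this small plain C++ sketch tabulates the bit range each formula selects for a given lane. The `BitRange` struct and the functions `correctedRange` and `postedRange` are hypothetical helpers for this illustration only; they just evaluate the two index expressions.

```cpp
// Bit range [hi:lo] selected for one 16-bit lane of the 64-bit word.
struct BitRange { int hi; int lo; };

// Corrected indexing: lane i covers bits (i+1)*16 - 1 down to i*16.
constexpr BitRange correctedRange(int i) { return BitRange{(i + 1) * 16 - 1, i * 16}; }

// Indexing as first posted: i*16 - 1 down to (i-1)*16.
constexpr BitRange postedRange(int i) { return BitRange{i * 16 - 1, (i - 1) * 16}; }

// Corrected form: lanes 0..3 tile the word exactly, from bit 0 up to bit 63.
static_assert(correctedRange(0).hi == 15 && correctedRange(0).lo == 0, "lane 0 is [15:0]");
static_assert(correctedRange(3).hi == 63 && correctedRange(3).lo == 48, "lane 3 is [63:48]");

// Posted form: lane 0 asks for negative bit positions, and bits 63..48 are never read.
static_assert(postedRange(0).hi == -1 && postedRange(0).lo == -16, "lane 0 is out of range");
static_assert(postedRange(3).hi == 47, "top 16 bits are skipped");
```

The corrected form selects [15:0], [31:16], [47:32] and [63:48] for lanes 0 through 3, while the posted form is shifted down by one lane.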
Scholar
Posts: 1,950
Registered: ‎04-26-2015

Re: Minimizing Initiation Interval of Simple For Loop with Addition/Multiplication

@lancesimms

 

Good to hear that it worked, and you're completely correct that there was a mistake in my code. Your proposed change looks like it'll do the job nicely.