Visitor yh_
Visitor
4,465 Views
Registered: ‎09-03-2016

How to run functions sequentially and ignore dependencies?


Hello,

I am new to HLS and I have, I think, a simple question. The code is attached
below. It does cross-correlation of the input with the coeff_vec defined in corr1.h.

The main functions of concern are: corr1_core and shift_reg.

corr1_core loops over N coefficients, multiplies them by the input sample, and
accumulates the result in N registers. This can be fully unrolled.

shift_reg rotates the index register of N values (indexes into an array).
I found this was, for me, easier to think in terms of low latency hardware
implementation, as opposed to an "if" with modulo.

corr1_core reads from ACC_IDX, and shift_reg reads from and writes to ACC_IDX
(ideally shifting the values like a shift register?). My thinking is that once
corr1_core is done in 3 cycles, shift_reg should have no issues running on the
4th cycle.

By itself corr1_core synthesizes very well, with HLS estimated latency of 3
cycles. The shift_reg synthesizes very well by itself as well - just a shift
register.

Now when I combined them, I expected corr1_core to finish before shift_reg
starts and the whole process to take 4 cycles. Well... HLS is doing something
weird. It takes 2 hours to synthesize (it was just a few seconds for the individual
functions) and generates a huge design, which it estimates would take about 40 cycles
and occupy 70% of the FPGA. My target is xc7k410tffg900-2.

Am I missing something simple here?
What am I doing wrong?

Thank you.

 

corr1.h:

#ifndef _CORR1_H_
#define _CORR1_H_

#include <complex>
#include "ap_int.h"

#define N 48

struct axis_cplx_int16 {
    std::complex<short> data;
    ap_uint<1> last;
};

struct axis_cplx_int32 {
    std::complex<int> data;
    ap_uint<1> last;
};

const std::complex<short> coeff_vec[N] = {
		std::complex<short>(1507,-1507),
		std::complex<short>(-4339,-76),
		std::complex<short>(-441,2573),
		std::complex<short>(4677,414),
		std::complex<short>(3014,0),
		std::complex<short>(4677,414),
		std::complex<short>(-441,2573),
		std::complex<short>(-4339,-76),
		std::complex<short>(1507,-1507),
		std::complex<short>(76,4339),
		std::complex<short>(-2573,441),
		std::complex<short>(-414,-4677),
		std::complex<short>(0,-3014),
		std::complex<short>(-414,-4677),
		std::complex<short>(-2573,441),
		std::complex<short>(76,4339),
		std::complex<short>(1507,-1507),
		std::complex<short>(-4339,-76),
		std::complex<short>(-441,2573),
		std::complex<short>(4677,414),
		std::complex<short>(3014,0),
		std::complex<short>(4677,414),
		std::complex<short>(-441,2573),
		std::complex<short>(-4339,-76),
		std::complex<short>(1507,-1507),
		std::complex<short>(76,4339),
		std::complex<short>(-2573,441),
		std::complex<short>(-414,-4677),
		std::complex<short>(0,-3014),
		std::complex<short>(-414,-4677),
		std::complex<short>(-2573,441),
		std::complex<short>(76,4339),
		std::complex<short>(1507,-1507),
		std::complex<short>(-4339,-76),
		std::complex<short>(-441,2573),
		std::complex<short>(4677,414),
		std::complex<short>(3014,0),
		std::complex<short>(4677,414),
		std::complex<short>(-441,2573),
		std::complex<short>(-4339,-76),
		std::complex<short>(1507,-1507),
		std::complex<short>(76,4339),
		std::complex<short>(-2573,441),
		std::complex<short>(-414,-4677),
		std::complex<short>(0,-3014),
		std::complex<short>(-414,-4677),
		std::complex<short>(-2573,441),
		std::complex<short>(76,4339)};

void corr1(axis_cplx_int16 &A, axis_cplx_int32 &out);
std::complex<int> corr1_core(std::complex<int> a);
void shift_reg(int*);

#endif // _CORR1_H_

corr1.cpp:

#include "corr1.h"

static std::complex<int> out_accum[N];
static int curr_idx = 0;
static int ACC_IDX[48] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
						11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
						21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
						31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
						41, 42, 43, 44, 45, 46, 47};


void corr1(axis_cplx_int16 &A, axis_cplx_int32 &out) {
  #pragma HLS ARRAY_PARTITION variable=coeff_vec complete
  #pragma HLS ARRAY_PARTITION variable=out_accum complete

  //#pragma HLS dataflow

  // Remove ap ctrl ports (ap_start, ap_ready, ap_idle, etc) since we only use the AXI-Stream ports
  #pragma HLS INTERFACE ap_ctrl_none port=return
  // Set ports as AXI-Stream
  #pragma HLS INTERFACE axis register port=A
  #pragma HLS INTERFACE axis register port=out
  // Need to pack our complex<short int> into a 32-bit word
  // Otherwise, compiler complains that our AXI-Stream interfaces have two data fields (i.e. data.real, data.imag)
  #pragma HLS DATA_PACK variable=A.data
  #pragma HLS DATA_PACK variable=out.data

  //#pragma HLS DATAFLOW
  
  out.data = corr1_core(std::complex<int>(A.data));
  shift_reg(ACC_IDX);

  // Pass through tlast
  out.last = A.last;
}

std::complex<int> corr1_core(std::complex<int> a) {
  int x;
  int acc_idx = 0;
  std::complex<int> output;
  LOOP_CORR:for (x=0;x<N; x++) {
  #pragma HLS unroll
    out_accum[ACC_IDX[x]] += a*std::complex<int>(coeff_vec[x]);
  }
  output = out_accum[curr_idx];

  out_accum[curr_idx] = std::complex<int>(0,0);
  if(++curr_idx == N) {
   curr_idx = 0;
  }
  return output;
}

void shift_reg(int* idx_array) {
//#pragma HLS ARRAY_PARTITION variable=ACC_IDX complete
	int tmp = idx_array[0];
	int x;
	SHIFT_LOOP:for (x = 0; x < N-1; x++) {
	#pragma HLS unroll
		idx_array[x] = idx_array[x+1];
	}
	idx_array[N-1] = tmp;
}

Accepted Solutions
Scholar u4223374
Scholar
8,341 Views
Registered: ‎04-26-2015

Re: How to run functions sequentially and ignore dependencies?


@yh_

 

The obvious problem that occurs to me is that ACC_IDX is used in unrolled loops, but the array itself is not partitioned.

 

When running corr1_core alone, HLS can probably see that nothing actually changes ACC_IDX - so those values can be hard-coded into the unrolled loop. When running shift_reg alone, it looks like you had ACC_IDX fully partitioned (there's a commented-out pragma in there) which completely resolves this problem.

 

However, when you run them together, HLS is faced with two problems:

 

(1) In corr1_core you're trying to use the elements of a changing array as indices. As a result, rather than each "iteration" of the unrolled loop essentially being a single adder (which adds an incoming value to a register and writes back to the same register), each iteration now requires a 48-input multiplexer to select from the 48 possible inputs. Each iteration has its own hardware, so you've got 48 copies of those. This will be really, really expensive. However, because ACC_IDX is not fully partitioned, the function can only access one (or possibly two) values from that in each cycle - so even though it's got 48 copies of the hardware, only one or two can be used at a time. Now HLS has to carefully schedule all of those hardware blocks, which needs even more hardware!

 

(2) In shift_reg you're trying to shift a 48-element array down by one position, which requires access to all 48 elements at once. Without the array partitioned, you can't do this in a single cycle.
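To make point (1) concrete, here is a minimal plain C++ sketch (no HLS headers) of what shift_reg does to the index array over successive calls. The array size of 8 and the helper name indices_follow_call_count are assumptions for illustration only; the thread uses N = 48. It shows that slot x holds a different value on every call, which is presumably why the tool can no longer treat ACC_IDX[x] as a compile-time constant once both functions are synthesized together.

```cpp
#include <cassert>

// Illustrative size only; the code in the thread uses N = 48.
constexpr int N = 8;

// Same rotation as shift_reg in the question: every element moves down
// one slot and the old element 0 wraps around to the end.
void shift_reg(int* idx_array) {
    int tmp = idx_array[0];
    for (int x = 0; x < N - 1; x++) {
        idx_array[x] = idx_array[x + 1];
    }
    idx_array[N - 1] = tmp;
}

// Hypothetical helper: starts from idx[x] = x, rotates `calls` times and
// checks that slot x then holds (x + calls) % N.
bool indices_follow_call_count(int calls) {
    int idx[N];
    for (int x = 0; x < N; x++) {
        idx[x] = x;
    }
    for (int k = 0; k < calls; k++) {
        shift_reg(idx);
    }
    for (int x = 0; x < N; x++) {
        if (idx[x] != (x + calls) % N) {
            return false;
        }
    }
    return true;
}
```

Because the value in each slot depends on how many times the function has been called, every out_accum[ACC_IDX[x]] access is a runtime-indexed access into a fully partitioned array, which is where the per-iteration multiplexers come from.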

 

 

 

You'll have to check whether it works for your situation, but in this case it may well make more sense to shift out_accum rather than ACC_IDX. out_accum will have to be fully partitioned, but you're then accessing only constant indices within it. Your correlation loop would look like:

 

 

 LOOP_CORR:for (x=0;x<N; x++) {
  #pragma HLS unroll
    out_accum[x] += a*std::complex<int>(coeff_vec[x]);
  }

 

 

No more massive multiplexers, which will drastically cut down on resources.
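For anyone who wants to check the idea in software first, below is a rough plain C++ model (no HLS headers or pragmas) that compares the original index-rotating scheme with a version that shifts the accumulators instead. It is only a sketch under several assumptions: N is reduced to 4, the coefficients are just the first four entries of coeff_vec, and the names IndexedCorr, ShiftedCorr and variants_agree are made up for this example. The shift-and-clear step at the end of ShiftedCorr::step is my reading of what "shift out_accum" means; it is not code from the thread.

```cpp
#include <cassert>
#include <complex>

// Illustrative size only; the thread uses N = 48.
constexpr int N = 4;
typedef std::complex<int> cplx;

// First four entries of coeff_vec from corr1.h, used as stand-in data.
const cplx coeff[N] = {cplx(1507, -1507), cplx(-4339, -76),
                       cplx(-441, 2573), cplx(4677, 414)};

// Model of the original design: fixed accumulators, rotating index array.
struct IndexedCorr {
    cplx acc[N] = {};
    int idx[N];
    int curr = 0;
    IndexedCorr() {
        for (int x = 0; x < N; x++) idx[x] = x;
    }
    cplx step(cplx a) {
        for (int x = 0; x < N; x++) acc[idx[x]] += a * coeff[x];
        cplx out = acc[curr];
        acc[curr] = cplx(0, 0);
        if (++curr == N) curr = 0;
        // Rotate the indices, as shift_reg does in the question.
        int tmp = idx[0];
        for (int x = 0; x < N - 1; x++) idx[x] = idx[x + 1];
        idx[N - 1] = tmp;
        return out;
    }
};

// Model of the suggested design: constant indices, shifting accumulators.
struct ShiftedCorr {
    cplx acc[N] = {};
    cplx step(cplx a) {
        // Every access uses a constant index once the loop is unrolled.
        for (int x = 0; x < N; x++) acc[x] += a * coeff[x];
        cplx out = acc[0];
        // Shift the accumulators down one slot and clear the last one.
        for (int x = 0; x < N - 1; x++) acc[x] = acc[x + 1];
        acc[N - 1] = cplx(0, 0);
        return out;
    }
};

// Hypothetical check: feeds the same small samples to both models and
// reports whether every output matches.
bool variants_agree(int num_samples) {
    IndexedCorr ref;
    ShiftedCorr alt;
    for (int k = 0; k < num_samples; k++) {
        cplx a(k % 7 - 3, 5 - k % 11);
        if (ref.step(a) != alt.step(a)) return false;
    }
    return true;
}
```

In the HLS version both loops in ShiftedCorr::step would presumably carry #pragma HLS unroll, with out_accum fully partitioned as it already is in corr1. curr_idx and ACC_IDX are then no longer needed.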

 

2 Replies
Visitor yh_
Visitor
4,353 Views
Registered: ‎09-03-2016

Re: How to run functions sequentially and ignore dependencies?


Thank you for your reply!

 

Your suggestion worked great. I did not think about shifting the whole out_accum. No more indexes to worry about.

 

Thank you.
