Observer
632 Views
Registered: ‎11-20-2018

## Unable to fully parallelize a simple loop using loop unrolling and pipelining

My Vivado HLS version is 2018.2 and the FPGA device is xc7z010clg400-1.

I have a simple for loop that does element-by-element multiplication of two arrays, A[48] and B[48]. The result is stored in another array, C[48].

```A[48]=[a1, a2....,a48]
B[48]=[b1, b2....,b48]
C[i] = A[i] * B[i] , i = 1  to 48```

I want to parallelize the above multiplication. I use 16 DSP multipliers to do this, with each multiplier doing 48/16 = 3 multiplications.

I cyclically partition each array into 16 sub-arrays (factor=16) so that each sub-array has 48/16 = 3 elements. Each of the 16 multipliers then handles the 3 multiplications for one pair of corresponding sub-arrays from A and B.

```A[48] after cyclic partition (factor=16) should produce following 16 sub arrays [a1,a17, a33][a2,a18, a34]...[a15, a31,a47][a16, a32,a48]
B[48] after cyclic partition (factor=16) should produce following 16 sub arrays[b1,b17, b33][b2,b18, b34]...[b15, b31,b47][b16, b32,b48]
Total number of sub-arrays produced should be 16+16=32```
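The mapping listed above can be sketched in plain C++ as follows. This is only an illustration of how a cyclic partition with factor 16 distributes indices, not tool output; the helper names `bank_of` and `offset_of` are hypothetical, and indices are 0-based, so a1 is index 0.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t N = 48;       // array length from the post
const std::size_t FACTOR = 16;  // cyclic partition factor from the post

// Sub-array (bank) that holds 0-based element i under cyclic partitioning.
inline std::size_t bank_of(std::size_t i) {
    assert(i < N);
    return i % FACTOR;
}

// Position of 0-based element i inside its sub-array.
inline std::size_t offset_of(std::size_t i) {
    assert(i < N);
    return i / FACTOR;
}
```

With this mapping, a1, a17 and a33 (indices 0, 16, 32) all land in bank 0 at offsets 0, 1 and 2, and 16 consecutive indices always fall in 16 different banks, which is what allows 16 accesses in the same cycle.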

I try to do this in the following code, where I also unroll and pipeline loop L1, which does the multiplication:

```#include <ap_int.h>
#define N 48

// bit size of SUM is 16+16+log2(N)
void hls_accel(ap_uint<16> A[N], ap_uint<16> B[N], ap_uint<38> *SUM) {
#pragma HLS ARRAY_PARTITION variable=A cyclic factor=16 dim=1
#pragma HLS ARRAY_PARTITION variable=B cyclic factor=16 dim=1
#pragma HLS INTERFACE s_axilite port=SUM bundle=gp0
#pragma HLS INTERFACE s_axilite port=return bundle=gp0
#pragma HLS STREAM variable=A depth=48 dim=1
#pragma HLS STREAM variable=B depth=48 dim=1

int i;
static ap_uint<32> C[N];
static ap_uint<38> result;

L1: for (i = 0; i < N; ++i) {
#pragma HLS PIPELINE
#pragma HLS UNROLL factor=16
C[i] = A[i] * B[i];
}

L2: for (i = 0; i < N; ++i) {
#pragma HLS PIPELINE
result += C[i];
}

*SUM = result;
}```

The HLS compiler generally decomposes the statement

`C[i] = A[i] * B[i];`

into the following sub-operations, each of which usually completes in 1 clock cycle (assuming a sufficiently long clock period):

3. temp = A[i] * B[i]
4. C[i] = temp

At any given clock cycle, I expect HLS to schedule in parallel the following sub-operations of the ith iteration of loop L1:

• temp[i-1] = A[i-1] * B[i-1]
• C[i-2] = temp[i-2]

I was hoping my code would generate a schedule with II = 1, iteration latency = 3 and loop latency = 5.
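The hoped-for numbers are consistent with the usual relation for a pipelined loop, sketched below under the assumption that unrolling by 16 leaves 48/16 = 3 loop trips. The function names are hypothetical, and the tool's own report may count cycles slightly differently.

```cpp
#include <cassert>

// Trip count left after partial unrolling (assumes factor divides n evenly).
inline unsigned trips_after_unroll(unsigned n, unsigned factor) {
    assert(factor > 0 && n % factor == 0);
    return n / factor;
}

// The first iteration takes iter_latency cycles; every later iteration
// starts ii cycles after the previous one.
inline unsigned pipelined_loop_latency(unsigned trips, unsigned ii,
                                       unsigned iter_latency) {
    assert(trips > 0);
    return (trips - 1) * ii + iter_latency;
}
```

For example, `pipelined_loop_latency(trips_after_unroll(48, 16), 1, 3)` evaluates to (3 - 1) * 1 + 3 = 5 cycles.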

However, HLS generates a schedule that is very different from what I expected: the actual II = 8, iteration latency = 11 and loop latency = 26.

According to the schedule generated by HLS, only two multiplications are done in any given clock cycle, even though the resource profile shows that 16 DSP slices (multipliers) are synthesized.

Only 2 block RAMs are generated (instead of the expected 32) to store the 32 sub-arrays of 3 elements each. With 2 BRAMs, only two writes can be done in any given clock cycle, which forces the HLS tool to schedule only two multiplications in the clock cycle before each write cycle.

My question is: what should I fix in my code to force the HLS tool to generate 32 BRAMs for the 32 sub-arrays, so that it can schedule 16 multiplications in parallel during each iteration?

Or is something else missing in the code that prevents my multiplier loop from being completely parallelized and pipelined with II = 1?

Thanks for the help.

N.


Accepted Solutions
Xilinx Employee
621 Views
Registered: ‎09-04-2017

Hi,

You should partition array C in the same way as arrays A and B. Because C is not partitioned, it is implemented as a single memory, so only 2 writes can happen per cycle.

You will not see memories inferred for the A and B ports, because they read from memory outside the HLS block.
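The effect of the write limit can be checked with a little arithmetic. The sketch below assumes, as stated above, that an unpartitioned memory accepts at most 2 writes per cycle; the function name `min_ii_from_ports` is hypothetical.

```cpp
#include <cassert>

// Lower bound on the initiation interval imposed by one memory:
// accesses needed per loop iteration divided by the accesses it can
// serve per cycle, rounded up.
inline unsigned min_ii_from_ports(unsigned accesses_per_iter, unsigned ports) {
    assert(ports > 0);
    return (accesses_per_iter + ports - 1) / ports;  // ceiling division
}
```

With 16 writes to C per unrolled iteration and 2 writes per cycle, `min_ii_from_ports(16, 2)` gives 8, which is consistent with the reported II = 8. Once C is cyclically partitioned by 16, each bank receives one write per iteration, and the same bound drops to 1.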

Thanks,

Nithin

Observer
600 Views
Registered: ‎11-20-2018

Thanks for the reply. I found the same solution just a minute ago.

```#pragma HLS RESOURCE variable=C core=RAM_S2P_BRAM
#pragma HLS ARRAY_PARTITION variable=C cyclic factor=16 dim=1
static ap_uint<38> result;```

Also, since the HLS tool was generating no BRAMs, I explicitly directed it to generate a 2-port BRAM. Interestingly, when no BRAM was generated (assuming HLS was generating FFs instead), it was able to do the multiply and the write in the same clock cycle, as an FF sits immediately after the multiplier. The estimated clock cycle time was 6.5 ns.

However, when I explicitly directed it to use BRAM, the multiply and the write happen in different clock cycles, perhaps because the BRAM is clock-synchronous. The estimated clock cycle time also increased to 8.1 ns.
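For reference, here is a minimal sketch of where the ARRAY_PARTITION pragma for C goes in the original function. It rests on several assumptions: plain `uint16_t`/`uint32_t`/`uint64_t` stand in for the `ap_uint` types so that it compiles without `ap_int.h`, the INTERFACE, STREAM and RESOURCE pragmas are left out, `result` is reset on each call instead of being static, and the name `hls_accel_sketch` is hypothetical. It has not been run through synthesis.

```cpp
#include <cassert>
#include <cstdint>

#define N 48

// An ordinary C++ compiler ignores the HLS pragmas, so the function can
// also be checked in software.
void hls_accel_sketch(const uint16_t A[N], const uint16_t B[N], uint64_t *SUM) {
#pragma HLS ARRAY_PARTITION variable=A cyclic factor=16 dim=1
#pragma HLS ARRAY_PARTITION variable=B cyclic factor=16 dim=1
    assert(SUM != nullptr);

    uint32_t C[N];
    // Partition C like A and B so the 16 products of one unrolled
    // iteration go to 16 different banks.
#pragma HLS ARRAY_PARTITION variable=C cyclic factor=16 dim=1
    uint64_t result = 0;  // assumption: reset per call (static in the original)

L1: for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE
#pragma HLS UNROLL factor=16
        // Widen one operand first so the product is computed in 32 bits.
        C[i] = static_cast<uint32_t>(A[i]) * B[i];
    }

L2: for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE
        result += C[i];
    }

    *SUM = result;
}
```

Whether C ends up in registers or in BRAM is left to the tool here; adding the RESOURCE pragma shown above would force BRAM, with the timing trade-off already described.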