stefanoribes
Contributor
1,412 Views
Registered: ‎07-25-2016

Limiting function allocation when implementing double buffering


Hi everybody,

I'm implementing a module which calls two functions, let's call them A and B. Function B requires the output of A, and both are called several times over a batch. I've implemented something that looks like this:

data_t buffer[2][N];
ap_uint<1> buf_ptr = 0;

#pragma HLS ALLOCATION instances=A limit=1 function
#pragma HLS ALLOCATION instances=B limit=1 function

A(a_inputs[0], ..., buffer[buf_ptr]); // Write to the 'first' buffer.
for (int i = 1; i < batch_size; ++i) { // notice i = 1
    A(a_inputs[i], ..., buffer[~buf_ptr]);
    B(b_inputs[i - 1], ..., buffer[buf_ptr]); // Read from 'previous' buffer.
    buf_ptr = ~buf_ptr; // Swap pointers.
}
B(b_inputs[batch_size - 1], ..., buffer[buf_ptr]); // Last call to B.

It does what it's supposed to do; however, the allocation directive doesn't work. Function A is actually replicated twice. This makes sense, since the second call works on different/independent data with respect to the initial call. That's why I tried to use the ALLOCATION directive. I know having two instances saves some cycles, but the design ends up ~50% bigger than it should be.

The problem is that the directive is ignored, leaving me with this log message:

WARNING: [HLS 200-463] Ignoring allocation pragma because function 'A_1' is missing or was optimized away

I tried to guard in my code with an if-statement, like the following:

for (int i = 0; i < batch_size; ++i) {
    if (i == 0) {
        A();
    } else {
        A(); // like above
        B(); // like above
        buf_ptr = ~buf_ptr;
    }
}
B(); // like above

It didn't work out, still two instances.

 

Other details:

- the inputs are always of the same type; I'm just reading array ports at different indexes (which, by the way, should make the two instances of A 'dependent', i.e. able to share the same resource)

- I don't like a DATAFLOW solution because I'm sharing a buffer (randomly accessed), and with double buffering I should be able to perfectly overlap the execution of A and B (see the sketch after this list)

- inlining increases area (I'm quite sure it creates two 'copies' of the same code in this case too) and the execution cycles go up quite a lot

- Using Vivado HLS 2018.3
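
To make the double-buffering point in the list above concrete, this is the schedule I'm after (a hand-drawn sketch of the intended overlap, not tool output; N = batch_size):

step:  0      1      2      ...  N-1     N
A:     A(0)   A(1)   A(2)   ...  A(N-1)  -
B:     -      B(0)   B(1)   ...  B(N-2)  B(N-1)

A(i) writes buffer[i % 2] while B(i-1) reads buffer[(i-1) % 2], so with a single instance of each function the total should be roughly A + tripcount * max(A, B) + B instead of tripcount * (A + B).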

 

My question, then, is: how do I force VHLS to use only one instance of my function? This situation is starting to sound like a bug to me.

Thank you for your time, BR,

Stefano

12 Replies
nithink
Xilinx Employee
1,399 Views
Registered: ‎09-04-2017

@stefanoribes Did you turn off inlining on functions A and B?

Thanks,

Nithin

stefanoribes
Contributor
1,363 Views
Registered: ‎07-25-2016

Hi Nithin, 

Not explicitly with the directive; I'm going to try that. However, from the reports I see that the sub-functions (A and B in the example) are NOT inlined, as they should be.

Thanks, 

Stefano 

stefanoribes
Contributor
1,337 Views
Registered: ‎07-25-2016

Hi again @nithink,

I've just tried putting the following INLINE pragma in both A and B functions:

void A(...) {
#pragma HLS INLINE off
...
}

void B(...) {
#pragma HLS INLINE off
...
}


And VHLS still generates two instances of A.
BR,
Stefano

nithink
Xilinx Employee
1,322 Views
Registered: ‎09-04-2017

@stefanoribes , is it possible to share the code?

Thanks,

Nithin

stefanoribes
Contributor
1,317 Views
Registered: ‎07-25-2016

Hi @nithink,

Sort of. I'll prepare some code, but you can reasonably treat them as matrix-matrix multipliers. Unfortunately, I bet the problem would still show up.

BR, 

Stefano 

stefanoribes
Contributor
1,290 Views
Registered: ‎07-25-2016

Hi @nithink,

I've prepared the following snippet, which reproduces the issue (just set dummy_gemm as the top function for synthesis):

 

#include "ap_int.h"

template <int M, int N, int K> void gemm_a(const int *A, const int *B, int C[M][N]) { #pragma HLS INLINE off int sum = 0; int a_bram[M][K]; int b_bram[K][N]; #pragma HLS ARRAY_PARTITION variable=a_bram complete dim=2 #pragma HLS ARRAY_PARTITION variable=b_bram complete dim=1 for (int i = 0; i < M; ++i) { for (int j = 0; j < K; ++j) { #pragma HLS PIPELINE II=1 a_bram[i][j] = A[i * K + j]; } } for (int i = 0; i < K; ++i) { for (int j = 0; j < N; ++j) { #pragma HLS PIPELINE II=1 b_bram[i][j] = B[i * N + j]; } } for(int i = 0; i < M; i++) { for(int j = 0; j < N; ++j) {
#pragma HLS PIPELINE II=1 for(int k = 0; k < K; ++k) { if (k == 0) { sum = 0; } sum += a_bram[i][k] * b_bram[k][j]; #pragma HLS RESOURCE variable=sum core=DSP48 latency=3 if (k == K-1) { C[i][j] = sum; } } } } } template <int M, int N, int K> void gemm_b(const int A[M][K], const int *B, int *C) { #pragma HLS INLINE off int b_bram[K][N]; int c_bram[M][N]; #pragma HLS ARRAY_PARTITION variable=b_bram complete dim=1 #pragma HLS ARRAY_PARTITION variable=c_bram complete dim=2 int sum = 0; for (int i = 0; i < M; ++i) { for (int j = 0; j < N; ++j) { #pragma HLS PIPELINE II=1 for (int k = 0; k < K; ++k) { if (k == 0) { sum = 0; } sum += A[i][k] * b_bram[k][j]; #pragma HLS RESOURCE variable=sum core=DSP48 latency=3 if (k == K - 1) { c_bram[i][j] = sum; } } } } for (int i = 0; i < M; ++i) { for (int j = 0; j < N; ++j) { #pragma HLS PIPELINE II=1 C[i * N + j] = c_bram[i][j]; } } } void dummy_gemm(const int batch_size, const int *a, const int *b, const int *w, int *c) {
assert(batch_size > 2); const int m = 64; const int n = 16; const int k = 8; const int q = 24; const int cosim_batch_size = 8; const int a_size = m * k; const int b_size = n * k; const int w_size = n * q; const int c_size = m * q; #pragma HLS INTERFACE s_axilite port=return bundle=ctrl #pragma HLS INTERFACE s_axilite port=batch_size bundle=ctrl #pragma HLS INTERFACE m_axi port=a offset=slave depth=a_size*cosim_batch_size bundle=a_dmem #pragma HLS INTERFACE m_axi port=b offset=slave depth=b_size bundle=b_dmem #pragma HLS INTERFACE m_axi port=w offset=slave depth=w_size bundle=w_dmem #pragma HLS INTERFACE m_axi port=c offset=slave depth=c_size*cosim_batch_size bundle=c_dmem ap_uint<1> buf_ptr = 0; int buffer[2][m][n]; #pragma HLS ARRAY_PARTITION variable=buffer complete dim=1 #pragma HLS ARRAY_PARTITION variable=buffer complete dim=3 // for unrolled reads in gemm_b gemm_a<m, n, k>(a, b, buffer[buf_ptr]); // (m, k) @ (k, n) = (m, n) for (int i = 1; i < batch_size; ++i) { #pragma HLS LOOP_TRIPCOUNT min=cosim_batch_size-1 max cosim_batch_size-1 gemm_a<m, n, k>(&a[i * a_size], b, buffer[~buf_ptr]); // (m, k) @ (k, n) = (m, n) gemm_b<m, q, n>(buffer[buf_ptr], w, &c[(i - 1) * c_size]); // (m, n) @ (n, q) = (m, q) buf_ptr = ~buf_ptr; } gemm_b<m, q, n>(buffer[buf_ptr], w, &c[(batch_size - 1) * c_size]); // (m, k) @ (k, n) = (m, n) #pragma HLS ALLOCATION instances=gemm_a limit=1 function #pragma HLS ALLOCATION instances=gemm_b limit=1 function }

The ALLOCATION pragma gets "totally ignored", i.e. it isn't even mentioned in the logs, and two instances of the gemm_a() function get synthesized. In particular, I get this report:

+ Latency (clock cycles): 
    * Summary: 
    +-------+-------+-------+-------+---------+
    |    Latency    |    Interval   | Pipeline|
    |  min  |  max  |  min  |  max  |   Type  |
    +-------+-------+-------+-------+---------+
    |  38353|  38353|  38353|  38353|   none  |
    +-------+-------+-------+-------+---------+

    + Detail: 
        * Instance: 
        +------------------------------+-------------------+------+------+------+------+---------+
        |                              |                   |   Latency   |   Interval  | Pipeline|
        |           Instance           |       Module      |  min |  max |  min |  max |   Type  |
        +------------------------------+-------------------+------+------+------+------+---------+
        |grp_gemm_a_64_16_8_s_fu_303   |gemm_a_64_16_8_s   |  1689|  1689|  1689|  1689|   none  |
        |grp_gemm_b_64_24_16_s_fu_393  |gemm_b_64_24_16_s  |  3101|  3101|  3101|  3101|   none  |
        |grp_gemm_a_64_16_8_1_fu_483   |gemm_a_64_16_8_1   |  1690|  1690|  1690|  1690|   none  |
        +------------------------------+-------------------+------+------+------+------+---------+

 

Which tells me that double buffering is not implemented either, since we have:

tripcount = 8
A_cycles = 1689
B_cycles = 3101

expected cycles = A_cycles + tripcount * max(A_cycles, B_cycles) + B_cycles = ~29598
obtained cycles = tripcount * (A_cycles + B_cycles) = ~38320
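
As a quick sanity check of that arithmetic (plain C++, not HLS code; the latencies are taken from the report above):

#include <algorithm>
#include <cstdio>

int main() {
  const int tripcount = 8;    // cosim_batch_size
  const int a_cycles = 1689;  // gemm_a latency from the report
  const int b_cycles = 3101;  // gemm_b latency from the report
  // Overlapped (double-buffered): prologue A, steady state of max(A, B), epilogue B.
  const int expected = a_cycles + tripcount * std::max(a_cycles, b_cycles) + b_cycles;
  // Fully sequential: every iteration pays A and B back to back.
  const int obtained = tripcount * (a_cycles + b_cycles);
  std::printf("expected = %d, obtained = %d\n", expected, obtained);  // 29598 vs 38320
  return 0;
}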

Any clue?

Thanks in advance, BR,

Stefano

 

stefanoribes
Contributor
1,266 Views
Registered: ‎07-25-2016

By the way, a possible workaround would be something that looks like this:

for (int i = 0; i < batch_size + 1; ++i) {
#pragma HLS LOOP_TRIPCOUNT min=cosim_batch_size+1 max=cosim_batch_size+1
  int a_idx = 0;
  int b_idx = 0;
  if (i == batch_size) {
    a_idx = 0;
  } else {
    a_idx = i;
  }
  if (i == 0) {
    b_idx = 0;
  } else {
    b_idx = i - 1;
  }
  if (i % 2 == 0) {
    gemm_a<m, n, k>(&a[a_idx * a_size], b, buffer[0]); // (m, k) @ (k, n) = (m, n)
    gemm_b<m, q, n>(buffer[1], w, &c[b_idx * c_size]); // (m, n) @ (n, q) = (m, q)
  } else {
    gemm_a<m, n, k>(&a[a_idx * a_size], b, buffer[1]); // (m, k) @ (k, n) = (m, n)
    gemm_b<m, q, n>(buffer[0], w, &c[b_idx * c_size]); // (m, n) @ (n, q) = (m, q)
  }
}

Which generates only one instance of both A and B. The problem is now that, for i == 0, B runs on dummy data anyway (A does the same when i == batch_size), wasting memory bandwidth (and possibly execution cycles).


Rewriting the code in the following way saves those two extra calls and keeps only one instance of A and B, but won't schedule A and B to run in parallel:

  for (int i = 0; i < batch_size+1; ++i) {
#pragma HLS LOOP_TRIPCOUNT min=cosim_batch_size+1 max=cosim_batch_size+1
    if (i % 2 == 0) {
      if (i != batch_size) {
        gemm_a<m, n, k>(&a[i * a_size], b, buffer[0]); // (m, k) @ (k, n) = (m, n)
      }
      if (i != 0) {
        gemm_b<m, q, n>(buffer[1], w, &c[(i-1) * c_size]); // (m, n) @ (n, q) = (m, q)
      }
    } else {
      if (i != batch_size) {
        gemm_a<m, n, k>(&a[i * a_size], b, buffer[1]); // (m, k) @ (k, n) = (m, n)
      }
      if (i != 0) {
        gemm_b<m, q, n>(buffer[0], w, &c[(i-1) * c_size]); // (m, n) @ (n, q) = (m, q)
      }
    }
  }

Thanks, BR,

Stefano

 

nithink
Xilinx Employee
1,246 Views
Registered: ‎09-04-2017
for (int i = 0; i < batch_size + 1; ++i) {
#pragma HLS LOOP_TRIPCOUNT min=cosim_batch_size+1 max=cosim_batch_size+1
  if (i % 2 == 0) {
    gemm_a<m, n, k>(&a[i * a_size], b, buffer[0]); // (m, k) @ (k, n) = (m, n)
    gemm_b<m, q, n>(buffer[1], w, &c[(i - 1) * c_size]); // (m, n) @ (n, q) = (m, q)
  } else {
    gemm_a<m, n, k>(&a[i * a_size], b, buffer[1]); // (m, k) @ (k, n) = (m, n)
    gemm_b<m, q, n>(buffer[0], w, &c[(i - 1) * c_size]); // (m, n) @ (n, q) = (m, q)
  }
}

 

With this code, we get around 27919 cycles. That looks similar to what you expect, doesn't it?

Thanks,

Nithin

stefanoribes
Contributor
1,239 Views
Registered: ‎07-25-2016

Yes, that's right, it's similar, but it's still a sub-optimal solution. Again, I'm trading off execution cycles (I'm still running an extra iteration) and, especially, precious memory bandwidth.
So it seems to be a limitation of the tool.

Any further help will be appreciated. Thanks, BR,
Stefano

 

p.s. Your code trimmed away the checks on the first and last iterations, but without them it would go out of bounds on the inputs. Luckily, the checks don't add area or timing overhead.

nithink
Xilinx Employee
1,157 Views
Registered: ‎09-04-2017

You can try something like this: move the condition that decides whether to execute into the function itself.

 

for (int i = 0; i < batch_size + 1; ++i) {
#pragma HLS LOOP_TRIPCOUNT min=cosim_batch_size+1 max=cosim_batch_size+1
  // The old if (i != batch_size) / if (i != 0) guards around the calls are
  // now passed into the callees as the if_true flag.
  const bool i_neq_batch = (i != batch_size);
  const bool i_neq_0 = (i != 0);
  if (i % 2 == 0) {
    gemm_a<m, n, k>(&a[i * a_size], b, buffer[0], i_neq_batch); // (m, k) @ (k, n) = (m, n)
    gemm_b<m, q, n>(buffer[1], w, &c[(i - 1) * c_size], i_neq_0); // (m, n) @ (n, q) = (m, q)
  } else {
    gemm_a<m, n, k>(&a[i * a_size], b, buffer[1], i_neq_batch); // (m, k) @ (k, n) = (m, n)
    gemm_b<m, q, n>(buffer[0], w, &c[(i - 1) * c_size], i_neq_0); // (m, n) @ (n, q) = (m, q)
  }
}

template <int M, int N, int K>
void gemm_a(const int *A, const int *B, int C[M][N], bool if_true) {
#pragma HLS ALLOCATION instances=gemm_a limit=1 function
#pragma HLS INLINE off
  int sum = 0;
  int a_bram[M][K];
  int b_bram[K][N];
#pragma HLS ARRAY_PARTITION variable=a_bram complete dim=2
#pragma HLS ARRAY_PARTITION variable=b_bram complete dim=1
  if (if_true) {
    for (int i = 0; i < M; ++i) {
      for (int j = 0; j < K; ++j) {
#pragma HLS PIPELINE II=1
        a_bram[i][j] = A[i * K + j];
      }
    }

    for (int i = 0; i < K; ++i) {
      for (int j = 0; j < N; ++j) {
#pragma HLS PIPELINE II=1
        b_bram[i][j] = B[i * N + j];
      }
    }

    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; ++j) {
#pragma HLS PIPELINE II=1
        for (int k = 0; k < K; ++k) {
          if (k == 0) {
            sum = 0;
          }
          sum += a_bram[i][k] * b_bram[k][j];
#pragma HLS RESOURCE variable=sum core=DSP48 latency=3
          if (k == K - 1) {
            C[i][j] = sum;
          }
        }
      }
    }
  }
}


template <int M, int N, int K>
void gemm_b(const int A[M][K], const int *B, int *C, bool if_true) {
#pragma HLS ALLOCATION instances=gemm_b limit=1 function
#pragma HLS INLINE off
  int b_bram[K][N];
  int c_bram[M][N];
#pragma HLS ARRAY_PARTITION variable=b_bram complete dim=1
#pragma HLS ARRAY_PARTITION variable=c_bram complete dim=2

  if (if_true) {
    int sum = 0;
    for (int i = 0; i < M; ++i) {
      for (int j = 0; j < N; ++j) {
#pragma HLS PIPELINE II=1
        for (int k = 0; k < K; ++k) {
          if (k == 0) {
            sum = 0;
          }
          sum += A[i][k] * b_bram[k][j];
#pragma HLS RESOURCE variable=sum core=DSP48 latency=3
          if (k == K - 1) {
            c_bram[i][j] = sum;
          }
        }
      }
    }

    for (int i = 0; i < M; ++i) {
      for (int j = 0; j < N; ++j) {
#pragma HLS PIPELINE II=1
        C[i * N + j] = c_bram[i][j];
      }
    }
  }
}

 


nithink
Xilinx Employee
1,116 Views
Registered: ‎09-04-2017

@stefanoribes Can you let me know whether the last suggestion works for you?

Thanks,

Nithin

stefanoribes
Contributor
1,072 Views
Registered: ‎07-25-2016
Hi again,
Despite not being an elegant solution (well, HLS per se is not elegant, after all), it worked. In the end: no allocation pragma (which I still believe is broken), rewriting the for loop, and setting an internal flag in the functions fixed the issue.
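
For anyone landing on this thread later, here is a minimal distilled version of the pattern that worked. It is a plain C++ sketch, not HLS code: A and B are hypothetical stand-ins for gemm_a and gemm_b, with the guards from the accepted answer:

#include <cstdio>

// Stand-ins for gemm_a / gemm_b: the execute-or-not guard is checked *inside*
// the callee, so the caller has exactly one call site per buffer parity.
void A(int in, int buf[4], bool run) {
  if (!run) return;                             // predicated off: do nothing
  for (int j = 0; j < 4; ++j) buf[j] = in + j;  // produce into this buffer
}

void B(const int buf[4], int batch, bool run) {
  if (!run) return;                             // predicated off: do nothing
  int sum = 0;
  for (int j = 0; j < 4; ++j) sum += buf[j];    // consume the other buffer
  std::printf("batch %d -> %d\n", batch, sum);
}

int main() {
  const int batch_size = 5;
  int buffer[2][4] = {};
  // One extra iteration: the first B and the last A are predicated off.
  for (int i = 0; i < batch_size + 1; ++i) {
    const bool run_a = (i != batch_size);
    const bool run_b = (i != 0);
    if (i % 2 == 0) {
      A(i, buffer[0], run_a);
      B(buffer[1], i - 1, run_b);
    } else {
      A(i, buffer[1], run_a);
      B(buffer[0], i - 1, run_b);
    }
  }
  return 0;
}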
Thanks, BR,
Stefano