Registered: ‎06-15-2016

HLS overlapping computation and memory transfer process?

Hi All,


I am trying to use HLS for matrix multiplication. The design basically reads data from DDR into block RAM and then does the computation. I was wondering whether it is possible to ping-pong the data transfer and the computation in C++.


For example, at the moment I have matrices A, B and C. Each time, I read A and B from DDR and store them in block RAM, then do the matrix multiplication and save the result to C.


Is it possible to have A1, B1, C1 and A2, B2, C2 instead? Once I have finished reading A1 and B1 and start computing C1, the design would keep reading data into A2 and B2. After C1 is done, it would go on to compute C2.



2 Replies
Xilinx Employee
Registered: ‎09-05-2018

Hey @3008202060,

Vivado HLS has a library of linear algebra functions that includes a matrix_multiply function, and the provided examples show how to call these functions. I highly recommend you check out "matrix_multiply" and "matrix_multiply_alt" in the "linear algebra" folder, under "design examples". You can open example projects from the welcome page, which you can find by going to "Welcome" under the "Help" menu item.

The optimization you describe sounds like a pipelined loop. You can read about this under "Loops" in UG902. Below is an example that I wrote up that I think does what you're looking for. If you synthesize the function, you should see that A_i and B_i are loaded in the first clock cycle, then the multiply and the accumulate are completed in the second.

MATRIX_T A_i, B_i, prod, mult;
ITER_T r, c, i;
for( r = 0 ; r < C_ROWS ; r ++ ) { // over rows of C
	for( c = 0 ; c < C_COLS ; c ++ ) { // over cols of C
		prod = 0;
		for( i = 0 ; i < B_ROWS ; i ++ ) { // dot product of row r of A and column c of B
			A_i = A[r][i];
			B_i = B[i][c];
			mult = A_i*B_i;
			prod += mult;
		}
		C[r][c] = prod; // store the finished element of C
	}
}
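
For anyone who wants to try this outside the library, the loop nest above can be wrapped into a self-contained function. This is only a minimal sketch: the matrix sizes, the int element type and the function name matmul are assumptions made for illustration, not taken from the HLS library. The pragma is the standard Vivado HLS PIPELINE directive; an ordinary C++ compiler ignores it, so the same file can be checked in C simulation first.

```cpp
#include <cassert>

// Hypothetical sizes and element type; the post above does not define them.
typedef int MATRIX_T;
const int A_ROWS = 4, A_COLS = 4;
const int B_ROWS = A_COLS, B_COLS = 4;
const int C_ROWS = A_ROWS, C_COLS = B_COLS;

// C = A * B, with the innermost (dot-product) loop marked for pipelining.
void matmul(const MATRIX_T A[A_ROWS][A_COLS],
            const MATRIX_T B[B_ROWS][B_COLS],
            MATRIX_T C[C_ROWS][C_COLS]) {
    for (int r = 0; r < C_ROWS; r++) {          // over rows of C
        for (int c = 0; c < C_COLS; c++) {      // over cols of C
            MATRIX_T prod = 0;
            for (int i = 0; i < B_ROWS; i++) {  // dot product
#pragma HLS PIPELINE II=1
                prod += A[r][i] * B[i][c];
            }
            C[r][c] = prod;
        }
    }
}
```

Whether the tool actually reaches an initiation interval of 1 depends on the element type and on how the arrays are mapped to memory, so it is worth checking the synthesis report rather than assuming it.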


However, I would always recommend reusing code rather than implementing your own. In the HLS library, the matrix_multiply function not only pipelines that inner for loop but also partitions the arrays and unrolls the loop appropriately, depending on the optimization factor you set.

Nicholas Moellers

Xilinx Worldwide Technical Support
Registered: ‎04-26-2015

@3008202060 You can certainly overlap computation and data transfer. The normal way to do this is to set up a loop in the code that calls two functions: one to perform processing, one to perform I/O. Within a loop iteration the two calls should not share any input/output arrays, although having both read from the same scalar inputs (e.g. matrix size) is acceptable.


My basic layout is:

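int matrix_in_0A[1024];
int matrix_in_1A[1024];
int matrix_in_0B[1024];
int matrix_in_1B[1024];
int matrix_out_0[1024];
int matrix_out_1[1024];

for (int i = 0; i < 100; i++) {
	if ((i & 1) == 1) {
		process(matrix_in_0A, matrix_in_0B, matrix_out_0);
		dataIO(matrix_in_1A, matrix_in_1B, matrix_out_1); // I/O on the other buffer set (argument list assumed)
	} else {
		process(matrix_in_1A, matrix_in_1B, matrix_out_1);
		dataIO(matrix_in_0A, matrix_in_0B, matrix_out_0); // I/O on the other buffer set (argument list assumed)
	}
}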

In each iteration, dataIO has to write out the matrix_out buffer that process filled in the previous iteration, and read in the two new input matrices. In the next iteration, these new inputs are fed to the processing function so it can produce a new output.
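
To make the bookkeeping concrete, here is a minimal sketch of that scheme that can be run as plain C++. Everything beyond the process/dataIO split is an assumption for illustration: the 4x4 tile size, the flat ddr_a/ddr_b/ddr_c arrays standing in for external memory, the block-index arguments to dataIO, and the name pingpong_top. The loop runs two extra iterations so that the first block can be loaded before anything is computed and the last result can be written out after the final computation.

```cpp
#include <cassert>

const int DIM = 4;            // hypothetical tile dimension
const int TILE = DIM * DIM;   // elements per matrix block

// Compute c = a * b for one DIM x DIM block held in local buffers.
void process(const int a[TILE], const int b[TILE], int c[TILE]) {
    for (int r = 0; r < DIM; r++)
        for (int col = 0; col < DIM; col++) {
            int acc = 0;
            for (int k = 0; k < DIM; k++)
                acc += a[r * DIM + k] * b[k * DIM + col];
            c[r * DIM + col] = acc;
        }
}

// Write out the result of block store_blk (if valid), then read in the
// inputs of block load_blk (if valid). Touches one buffer set only.
void dataIO(const int *ddr_a, const int *ddr_b, int *ddr_c,
            int load_blk, int store_blk, int n_blocks,
            int in_a[TILE], int in_b[TILE], const int out[TILE]) {
    if (store_blk >= 0 && store_blk < n_blocks)
        for (int j = 0; j < TILE; j++)
            ddr_c[store_blk * TILE + j] = out[j];
    if (load_blk >= 0 && load_blk < n_blocks)
        for (int j = 0; j < TILE; j++) {
            in_a[j] = ddr_a[load_blk * TILE + j];
            in_b[j] = ddr_b[load_blk * TILE + j];
        }
}

void pingpong_top(const int *ddr_a, const int *ddr_b, int *ddr_c, int n_blocks) {
    int in_a0[TILE], in_b0[TILE], out0[TILE];   // buffer set 0
    int in_a1[TILE], in_b1[TILE], out1[TILE];   // buffer set 1

    // Iteration i: I/O on set (i & 1) stores block i-2 and loads block i,
    // while the other set computes block i-1.
    for (int i = 0; i < n_blocks + 2; i++) {
        bool compute = (i >= 1 && i <= n_blocks);
        if ((i & 1) == 0) {
            dataIO(ddr_a, ddr_b, ddr_c, i, i - 2, n_blocks, in_a0, in_b0, out0);
            if (compute) process(in_a1, in_b1, out1);
        } else {
            dataIO(ddr_a, ddr_b, ddr_c, i, i - 2, n_blocks, in_a1, in_b1, out1);
            if (compute) process(in_a0, in_b0, out0);
        }
    }
}
```

In software the two calls simply run one after the other; the point of the sketch is that they never touch the same arrays within an iteration, which is the condition described above for the tool to be able to run them in parallel. Whether it does so in a given design is best confirmed in the synthesis report.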


