akokha (Adventurer)

Synthesize IO arrays to BRAM blocks with pipelined operations in Vivado HLS


Hi all,

I want to write a C++ function in Vivado HLS (e.g., a matrix-vector multiplication) whose inputs and output are three arrays: a 2D input matrix, a 1D input vector, and a 1D output vector.

MY GOAL: I want the design to be pipelined and/or parallelized as much as possible, especially the main multiplication loop, while the input arrays are synthesized to BRAM blocks.

In order to obtain enough parallelism while still using BRAMs, I want Vivado to assign a distinct BRAM block to each row of the input matrix as well as to the input vector. However, each element of the output vector must be synthesized to a distinct register (NOT to a BRAM) so that the MAC operations can run in parallel.

 

void mat_vec_multiply
(
	float vec_out[10],
	float mat_in[10][20],
	float vec_in[20]
) {

#pragma HLS PIPELINE

#pragma HLS ARRAY_PARTITION variable=vec_out complete
#pragma HLS ARRAY_PARTITION variable=mat_in  block factor=10 dim=1
#pragma HLS RESOURCE variable=mat_in  core=RAM_2P_BRAM
#pragma HLS RESOURCE variable=vec_in  core=RAM_2P_BRAM

	int row, col, i;

	for (i=0; i<10; i++) {
		vec_out[i] = 0;
	}

	for (col=0; col<20; col++) {
	#pragma HLS PIPELINE
		for (row=0; row<10; row++) {
			#pragma HLS PIPELINE
			// #pragma HLS UNROLL
			vec_out[row] += mat_in[row][col] * vec_in[col];
		}
	}
}

 

 

I think I have provided all the necessary directives to achieve the goals above. But when I declare the IO arrays as function parameters, they are not synthesized to BRAMs (and it seems that enough pipelined parallelism is not reached).

I also tried declaring the IO arrays as local variables, but in that case nothing is synthesized: the utilization report shows ZERO for everything, and so do the clock estimates!

Can anybody help me solve this problem?

Thanks in advance

Ali Kokhazadeh, PhD Candidate

 

cc: @nmoeller 

6 Replies
u4223374 (Advisor) [Accepted Solution]

Right, a few issues:

 

(1) You've got pipeline directives at three different levels. Only the top-level one counts; every loop under that gets fully unrolled and therefore can't be pipelined. I suggest removing the top one and the bottom one, and leaving just the middle one (in your "col" loop). This alone will largely prevent HLS from complaining about RAM ports, because now it's only trying to access ten elements in a cycle rather than 200.
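For reference, a minimal sketch of that change on the loop nest from your post (same arrays and bounds; only the pragma placement differs, and the partition/RESOURCE directives stay where they are):

	// Only the col loop carries a PIPELINE directive. PIPELINE fully unrolls
	// the row loop underneath it, so each col iteration issues ten MACs in parallel.
	for (col = 0; col < 20; col++) {
	#pragma HLS PIPELINE
		for (row = 0; row < 10; row++) {
			vec_out[row] += mat_in[row][col] * vec_in[col];
		}
	}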

(2) You're doing this in floating-point. The floating-point addition takes five clock cycles to complete, and it has to complete before the next loop iteration can start (because the result gets used for the next loop iteration). I'm not sure if there's a good fix for this.

 

 

The reason it's not actually using block RAM is that the default interface type is ap_memory (you haven't specified a different interface). This just tells HLS to build something that can drive an external memory - the memory will not be inside the block.
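To make that concrete, here is a sketch of the two directions (the pragmas are the ones that appear later in this thread; whether either fits your larger design is up to you):

	// Option 1: keep the arrays as top-level arguments and request BRAM-style ports.
	// HLS still only builds the interface; the RAM itself lives outside the IP.
	#pragma HLS INTERFACE bram port=mat_in
	#pragma HLS INTERFACE bram port=vec_in

	// Option 2: make the arrays internal and bind them to block RAM. These BRAMs
	// do show up in the IP's utilization report, but you then need a way to fill them.
	float mat_in[10][20];
	#pragma HLS RESOURCE variable=mat_in core=RAM_2P_BRAM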


akokha (Adventurer)

Thank you @u4223374

I modified the code so that different iterations of the outer loop (the one to be pipelined) have no dependency. In fact, I swapped the col and row loops and used the transpose of the input matrix instead. I modified the directives accordingly.

Here is the modified version:

// This is version 2.0 in order to facilitate and speed up pipelining
void matrix_vector_mult (float vec_out[10])
{

	#pragma HLS INTERFACE bram port=vec_out

	float mat_in[20][10]; // Positions of rows and columns are switched (Transposed) [col][row]
	float vec_in[20];

	int row, col, i;

	#pragma HLS ARRAY_PARTITION variable=vec_in complete
	#pragma HLS ARRAY_PARTITION variable=mat_in  block factor=20 dim=1
	#pragma HLS RESOURCE variable=mat_in  core=RAM_2P_BRAM

	for (i=0; i<10; i++) {
		vec_out[i] = 0;
	}

	for (row=0; row<10; row++) {
	#pragma HLS PIPELINE
		for (col=0; col<20; col++) {
			vec_out[row] += mat_in[col][row] * vec_in[col];
		}
	}
}

Now the inner loop (col) can be unrolled and the outer loop (row) can be fully pipelined, and the timing results are much better than in the previous version.

The only issue is that, despite the INTERFACE directive on the output port vec_out, it is not counted as a BRAM block in the synthesis report: the number of BRAM blocks used is 20 (i.e., it equals the number of matrix columns, and the output BRAM for vec_out is not accounted for).

Any further comments are highly appreciated.

u4223374 (Advisor)

@akokha Is that actually working? I just built the new code and it's reporting a loop initiation interval of 1 ... and an iteration latency of 106 cycles. The total is 114 cycles to complete ten loop iterations (128 cycles for the whole block), and resource usage is pretty high (20 BRAM / 100 DSP / 8431 FF / 14346 LUT). With the previous approach (doing ten additions in parallel) I can get down to 111 cycles for the entire block, with resources 10 BRAM / 10 DSP / 1714 FF / 2605 LUT (code is below).

 

Not using any block RAM for vec_out is still OK. The "bram" interface type just means "build an interface that can connect to block RAM" - it doesn't actually allocate the block RAM.

 

Putting mat_in and vec_in inside the block (rather than on the interface) seems like an odd move. How are you planning to load data into them?
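(If the arrays do stay internal, one common pattern, shown only as an illustration and not something from your post, is to add input ports and copy into the local arrays before the compute loops. The names mat_in_port and vec_in_port below are made up:)

	void matrix_vector_mult (float vec_out[10],
	                         const float mat_in_port[20][10],
	                         const float vec_in_port[20])
	{
		float mat_in[20][10];
		float vec_in[20];

		// Copy the external data into the on-chip arrays once per call.
		load_mat: for (int c = 0; c < 20; c++)
			for (int r = 0; r < 10; r++)
				mat_in[c][r] = mat_in_port[c][r];
		load_vec: for (int c = 0; c < 20; c++)
			vec_in[c] = vec_in_port[c];

		// ... compute loops as before ...
	}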

 

Code that I implemented with the original approach:

void matrix_vector_mult2 (float vec_out[10])
{

	#pragma HLS INTERFACE bram port=vec_out

	float mat_in[20][10]; // Positions of rows and columns are switched (Transposed) [col][row]
	float vec_in[20];

	int row, col, i;

	#pragma HLS ARRAY_PARTITION variable=mat_in dim=2 complete
	#pragma HLS ARRAY_PARTITION variable=vec_out dim=0 complete
	#pragma HLS RESOURCE variable=mat_in  core=RAM_2P_BRAM

	clear_loop:
	for (i=0; i<10; i++) {
#pragma HLS UNROLL
		vec_out[i] = 0;
	}

	col_loop:
	for (col=0; col<20; col++) {
	#pragma HLS PIPELINE
		row_loop:
		for (row=0; row<10; row++) {
			vec_out[row] += mat_in[col][row] * vec_in[col];
		}
	}
}

This has to pipeline at a much longer II (II=5), but because you've got a relatively short loop and the iteration latency is much shorter (14 cycles) it's significantly faster.
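A rough way to see why (standard pipeline arithmetic, so only an approximation): total loop latency ≈ iteration latency + (trip count - 1) × II. That gives about 14 + 19 × 5 = 109 cycles for this version, versus about 106 + 9 × 1 = 115 for the fully pipelined one, which is close to the 114 cycles reported above.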

 

wenchen (Moderator)

Hi @akokha ,

    You can refer to UG902 v2019.1, page 302; there is a similar example showing how using the PIPELINE directive on different loops affects the synthesis report.

    As you can see from the reports below, when you choose the row loop or the column loop to expand, the mat_in array also needs to be partitioned along the corresponding dimension.

    The first report expands the column dimension: HLS adds a large number of DSPs for the multiply-add operations so that II = 1, but the iteration latency is longer because of the data latency of the loop. The second report expands the row dimension and uses fewer resources; II = 5, but the iteration latency is shorter.

    [Attached synthesis report screenshots: pipeline.PNG, pipeline3.PNG]

void matrix_vector_mult2 (float vec_out[10])
{

	#pragma HLS INTERFACE bram port=vec_out

	float mat_in[20][10]; // Positions of rows and columns are switched (Transposed) [col][row]
	float vec_in[20];

	int row, col, i;

//	#pragma HLS ARRAY_PARTITION variable=mat_in dim=2 complete
    #pragma HLS ARRAY_PARTITION variable=mat_in dim=1 complete
//	#pragma HLS ARRAY_PARTITION variable=vec_in complete
	#pragma HLS RESOURCE variable=mat_in  core=RAM_2P_BRAM

	clear_loop:
	for (i=0; i<10; i++) {
#pragma HLS UNROLL
		vec_out[i] = 0;
	}

	row_loop:
	for (row=0; row<10; row++) {
	#pragma HLS PIPELINE
		for (col=0; col<20; col++) {
			vec_out[row] += mat_in[col][row] * vec_in[col];
		}
	}
/*	col_loop:
	for (col=0; col<20; col++) {
	#pragma HLS PIPELINE
		row_loop:
		for (row=0; row<10; row++) {
			vec_out[row] += mat_in[col][row] * vec_in[col];
		}
	}*/
}
akokha (Adventurer)

@u4223374 Yes, it is working to some extent, but I had made a mistake and the improvement is not significant. Note that to achieve a fair comparison, you need to unroll the vec_out initialization loop in my code too (I missed it above and focused on the multiplication loop).

Setting the target clock to 5 ns (uncertainty = 0) in Vivado HLS 2017.4 on a Virtex-7 xc7vx980tffg1930-2l device, your code executes in 111 cycles with an estimated clock period of 5.62 ns (more than the target period?!), while my code needs a total of 121 cycles with an estimated period of 4.97 ns.

In your code, as you mentioned in the previous reply, the iterations have a dependency and this prevents full pipelining (II = 5). On the other hand, my code can be fully pipelined (II = 1), but the latency of each iteration is high (107 cycles); this makes up a large fraction of the total latency of the block, and it also consumes a large amount of resources (specifically DSPs) to build the multi-MAC structure.

You are right! The idea of declaring the input arrays inside the block does not, in general, make sense. I did so just to determine the BRAM usage (requirement) of my module(s) in a larger design. Also, the matrix-vector multiply is not my main design; I used it as an example to learn and test some ideas.

But what can we say about the estimated clock period (5.62 ns) being larger than my target clock period (5 ns)? Is this an ordinary situation?

Thanks for your comments

akokha (Adventurer)

Thanks @wenchen ,

Your comments and reference are good and very instructive; I gave kudos. But regarding the solution, the first reply had already answered the majority of my questions, especially about array parameters vs. local arrays.
