07-22-2019 11:21 PM
Hi all,
I want to write a C++ function in Vivado HLS (e.g., a matrix-vector multiplication) whose inputs and output are three arrays: a 2D matrix and two 1D vectors.
MY GOAL: I want the design to be pipelined and/or parallelized as much as possible, especially the main multiplication loop, while my input arrays are synthesized to BRAM blocks.
In order to obtain enough parallelism while utilizing BRAMs, I want Vivado to assign a distinct BRAM block to each row of the input matrix as well as to the input vector. But each single element of the output vector must be synthesized to a distinct location (NOT to a BRAM) in order to parallelize the MAC operations.
void mat_vec_multiply ( float vec_out[10], float mat_in[10][20], float vec_in[20] )
{
#pragma HLS PIPELINE
#pragma HLS ARRAY_PARTITION variable=vec_out complete
#pragma HLS ARRAY_PARTITION variable=mat_in block factor=10 dim=1
#pragma HLS RESOURCE variable=mat_in core=RAM_2P_BRAM
#pragma HLS RESOURCE variable=vec_in core=RAM_2P_BRAM

    int row, col, i;

    for (i=0; i<10; i++) {
        vec_out[i] = 0;
    }

    for (col=0; col<20; col++) {
#pragma HLS PIPELINE
        for (row=0; row<10; row++) {
#pragma HLS PIPELINE
        // #pragma HLS UNROLL
            vec_out[row] += mat_in[row][col] * vec_in[col];
        }
    }
}
I think I have provided all the necessary directives to reach the above goals. But when I declare the I/O arrays as function parameters, they are not synthesized to BRAMs (and it seems that enough pipelined parallelism is not achieved).
I also tried declaring the I/O arrays as local variables, but in that case nothing is synthesized and all utilization reports, as well as the clock estimates, come out as ZERO!
Can anybody help me solve this problem?
Thanks in advance
Ali Kokhazadeh, PhD Candidate
cc: @nmoeller
07-23-2019 06:05 AM
Right, a few issues:
(1) You've got pipeline directives at three different levels. Only the top-level one counts; every loop under that gets fully unrolled and therefore can't be pipelined. I suggest removing the top one and the bottom one, and leaving just the middle one (in your "col" loop). This alone will largely prevent HLS from complaining about RAM ports, because now it's only trying to access ten elements in a cycle rather than 200.
(2) You're doing this in floating-point. The floating-point addition takes five clock cycles to complete, and it has to complete before the next loop iteration can start (because the result gets used for the next loop iteration). I'm not sure if there's a good fix for this.
The reason it's not actually using block RAM is that the default interface type is ap_memory (you haven't specified a different interface). This just tells HLS to build something that can drive an external memory - the memory will not be inside the block.
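For reference, here is a minimal sketch of how point (1) and the interface comment could look on your original signature. It's untested and the pragma placement is from memory, so treat it as a starting point rather than a verified fix:

// Sketch only: a single PIPELINE at the col level (the row loop underneath
// gets unrolled automatically), and bram-style ports requested explicitly
// instead of the default ap_memory. The BRAMs themselves still sit outside
// this block; partitioning an argument array just splits it into several ports.
void mat_vec_multiply ( float vec_out[10], float mat_in[10][20], float vec_in[20] )
{
#pragma HLS INTERFACE bram port=mat_in
#pragma HLS INTERFACE bram port=vec_in
#pragma HLS ARRAY_PARTITION variable=mat_in block factor=10 dim=1
#pragma HLS ARRAY_PARTITION variable=vec_out complete

    int row, col, i;

    for (i=0; i<10; i++) {
#pragma HLS UNROLL
        vec_out[i] = 0;
    }

    for (col=0; col<20; col++) {
#pragma HLS PIPELINE
        for (row=0; row<10; row++) {
            vec_out[row] += mat_in[row][col] * vec_in[col];
        }
    }
}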
07-23-2019 11:19 PM
Thank you @u4223374
I modified the code so that different iterations of the outer loop (the one to be pipelined) have no dependency. In fact, I switched the col and row loops and used the transpose of the input matrix instead. Correspondingly, I modified the directives as needed.
Here is the modified version:
// This is version 2.0 in order to facilitate and speed up pipelining
void matrix_vector_mult (float vec_out[10])
{
#pragma HLS INTERFACE bram port=vec_out

    float mat_in[20][10];   // Positions of rows and columns are switched (Transposed) [col][row]
    float vec_in[20];
    int row, col, i;

#pragma HLS ARRAY_PARTITION variable=vec_in complete
#pragma HLS ARRAY_PARTITION variable=mat_in block factor=20 dim=1
#pragma HLS RESOURCE variable=mat_in core=RAM_2P_BRAM

    for (i=0; i<10; i++) {
        vec_out[i] = 0;
    }

    for (row=0; row<10; row++) {
#pragma HLS PIPELINE
        for (col=0; col<20; col++) {
            vec_out[row] += mat_in[col][row] * vec_in[col];
        }
    }
}
Now the inner loop (col) can be unrolled, the outer loop (row) can be fully pipelined, and the timing results are much better than in the previous version.
The only issue is that, despite using the INTERFACE directive on the output port vec_out, it is not counted as a BRAM block in the synthesis report: the number of used BRAM blocks is 20 (i.e., it equals the number of matrix columns, and no output BRAM for vec_out is accounted for).
Any more comment(s) are highly appreciated.
07-24-2019 04:52 AM
@akokha Is that actually working? I just built the new code and it's reporting a loop initiation interval of 1 ... and an iteration latency of 106 cycles. Total is 114 cycles to complete ten loop iterations (128 cycles for the whole block), and resources are pretty high (20/100/8431/14346). With the previous approach (doing ten additions in parallel) I can get down to 111 cycles for the entire block, with resources 10/10/1714/2605 (code is below).
Not using any block RAM for vec_out is still OK. The "bram" interface type just means "build an interface that can connect to block RAM" - it doesn't actually allocate the block RAM.
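If my mental model of the pragmas is right, all 20 block RAMs come from mat_in (rough accounting on my side, not the tool's detailed report):

// float mat_in[20][10] with ARRAY_PARTITION block factor=20 dim=1
//   -> 20 banks of 10 floats, one RAM_2P_BRAM each       = 20 BRAMs
// vec_in, completely partitioned into registers           =  0 BRAMs
// vec_out, bram interface only (memory sits outside)      =  0 BRAMs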
Putting mat_in and vec_in inside the block (rather than on the interface) seems like an odd move. How are you planning to load data into them?
Code that I implemented with the original approach:
void matrix_vector_mult2 (float vec_out[10])
{
#pragma HLS INTERFACE bram port=vec_out

    float mat_in[20][10];   // Positions of rows and columns are switched (Transposed) [col][row]
    float vec_in[20];
    int row, col, i;

#pragma HLS ARRAY_PARTITION variable=mat_in dim=2 complete
#pragma HLS ARRAY_PARTITION variable=vec_out dim=0 complete
#pragma HLS RESOURCE variable=mat_in core=RAM_2P_BRAM

clear_loop:
    for (i=0; i<10; i++) {
#pragma HLS UNROLL
        vec_out[i] = 0;
    }

col_loop:
    for (col=0; col<20; col++) {
#pragma HLS PIPELINE
row_loop:
        for (row=0; row<10; row++) {
            vec_out[row] += mat_in[col][row] * vec_in[col];
        }
    }
}
This has to pipeline at a much longer II (II=5), but because the loop is relatively short and the iteration latency is much lower (14 cycles), it ends up significantly faster.
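The rough arithmetic behind that, using the usual pipelined-loop estimate (so take the exact cycle counts with a grain of salt):

// loop_latency ≈ iteration_latency + II * (trip_count - 1)
//   col_loop pipelined:  14  + 5 * (20 - 1) = 109 cycles  -> ~111 for the whole block
//   row_loop pipelined:  106 + 1 * (10 - 1) = 115 cycles  -> ~128 for the whole block
// The long iteration latency of the row_loop version (a 20-element floating-point
// accumulation per iteration) dominates, even though its II is 1.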
07-25-2019 12:48 AM
Hi @akokha ,
You can refer to UG902 v2019.1, page 302; there is a similar example showing how applying the PIPELINE directive to different loops affects the synthesis report.
As you can see from the reports below, when you choose the row loop or the column loop to expand, the mat_in array also needs to be partitioned along the corresponding dimension.
The first report expands the column dimension: HLS adds a large number of DSPs for the multiply-add operations so that II=1, but the iteration latency is longer due to the data latency of the loop body. The second report expands the row dimension and uses fewer resources: II=5, but the iteration latency is shorter.
void matrix_vector_mult2 (float vec_out[10])
{
#pragma HLS INTERFACE bram port=vec_out

    float mat_in[20][10];   // Positions of rows and columns are switched (Transposed) [col][row]
    float vec_in[20];
    int row, col, i;

//  #pragma HLS ARRAY_PARTITION variable=mat_in dim=2 complete
#pragma HLS ARRAY_PARTITION variable=mat_in dim=1 complete
//  #pragma HLS ARRAY_PARTITION variable=vec_in complete
#pragma HLS RESOURCE variable=mat_in core=RAM_2P_BRAM

clear_loop:
    for (i=0; i<10; i++) {
#pragma HLS UNROLL
        vec_out[i] = 0;
    }

row_loop:
    for (row=0; row<10; row++) {
#pragma HLS PIPELINE
        for (col=0; col<20; col++) {
            vec_out[row] += mat_in[col][row] * vec_in[col];
        }
    }

/*
col_loop:
    for (col=0; col<20; col++) {
#pragma HLS PIPELINE
row_loop:
        for (row=0; row<10; row++) {
            vec_out[row] += mat_in[col][row] * vec_in[col];
        }
    }
*/
}
07-25-2019 05:54 AM
@u4223374 Yes, it is working to some extent, but I had made a mistake and the improvement is not significant. Note that, to achieve a fair comparison, you need to unroll the vec_out initialization loop in my code too (I missed it above and focused on the multiplication loop).
Setting the target clock to 5 ns (uncertainty = 0), using Vivado 2017.04 on a virtex7-xc7vx980tffg1930-2l device, your code executes in 111 cycles with an estimated clock period of 5.62 ns (more than the target period?!), while my code needs a total of 121 cycles with an estimated period of 4.97 ns.
In your code, as you mentioned in a previous reply, the iterations have a dependency and this prevents full pipelining (II=5). On the other hand, my code can be fully pipelined (II=1), but the latency of each iteration is high (107 cycles), which constitutes a large fraction of the total latency of the block, and it also consumes a large amount of resources (specifically DSPs) to build the multiple-MAC structure.
You are right! The idea of declaring input arrays inside the block does not make sense in general, but I did so just to determine the BRAM usage (requirement) of my module(s) in a larger design. Also, the matrix-vector multiply is not my main design; I used it as an example to learn and test some ideas.
But what can we say about the estimated clock period (5.62 ns), which is larger than my target clock period (5 ns)? Is this an ordinary case?
Thanks for your comments
07-25-2019 06:18 AM - edited 07-27-2019 03:34 AM
Thanks @wenchen ,
Your comments and reference are good and very informative; I gave kudos. But regarding the solution, the first reply had already answered the majority of my questions, especially about array parameters vs. local arrays.