Observer a1300012709
Registered: 07-11-2016

why this add operation consumed 4 dsp?


[Screenshot attachment: 捕获.PNG]

And this is the code:

for(int col = 0; col < Q; col+=4)
{
    data_tf add_rst[4][4] = {0};
    // out_buf[0][col][to] = 0,out_buf[0][col+1][to] = 0,out_buf[0][col+2][to] = 0,out_buf[0][col+3][to] = 0;
    // out_buf[1][col][to] = 0,out_buf[1][col+1][to] = 0,out_buf[1][col+2][to] = 0,out_buf[1][col+3][to] = 0;
    // out_buf[2][col][to] = 0,out_buf[2][col+1][to] = 0,out_buf[2][col+2][to] = 0,out_buf[2][col+3][to] = 0;
    // out_buf[3][col][to] = 0,out_buf[3][col+1][to] = 0,out_buf[3][col+2][to] = 0,out_buf[3][col+3][to] = 0;
    C_loop: for(int tn = 0; tn < C; tn+=tile_c)
    {
#pragma HLS PIPELINE
        tile_c_loop: for(int tnn = 0; tnn < tile_c; ++tnn)
        {
            data_tf array[4][4];
            data_tf in_buf[6][6];
            int tnnn = tn + tnn;
            for(int i = 0; i < 6; ++i)
            {
                in_buf[0][i] = buf_line_1[col+i][tnnn];
                in_buf[1][i] = buf_line_2[col+i][tnnn];
                in_buf[2][i] = buf_line_3[col+i][tnnn];
                in_buf[3][i] = buf_line_4[col+i][tnnn];
                in_buf[4][i] = buf_line_5[col+i][tnnn];
                in_buf[5][i] = buf_line_6[col+i][tnnn];
            }

The analysis view shows that 'col+i' consumes 4 DSPs.
I am confused: it is just an add operation, so why does the final module report it as a 'mul'?

1 Solution

Accepted Solutions
Scholar u4223374
Registered: 04-26-2015

Re: why this add operation consumed 4 dsp?


It's using the DSPs for array indexing.

 

There's no such thing as a "2D array", in either C or HDL. In both cases it's just a 1D array which the compiler or synthesis tool has accessed in a way that makes it look 2D.

 

Say you're at this line:

 

in_buf[4][i] = buf_line_5[col+i][tnnn];

"buf_line_5" is a (Q+2)*C array and "in_buf" is a 6*6 array. For now, let's say that the variables are:

i = 5;
col = 4;
tnnn = 7;

in_buf[4][i] is therefore in_buf[4][5]. Its position in the "real" 1D array is 4*6 + 5 = 29. Finding this address requires computing 4*6 (row index times row width), which costs one DSP slice unless HLS is smart and recognises that both factors are constants here, so the product folds to 24 at compile time.

 

buf_line_5[col + i][tnnn] is buf_line_5[9][7]. Its position in the "real" 1D array is C*9 + 7. Again, this costs at least one DSP slice - perhaps quite a few more if C is a large number.
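
The row-major arithmetic in these two examples can be checked on an ordinary host compiler. This is only a sketch and is not HLS-specific: flat_index, measured_offset and check_flat_index are names made up for the illustration, and C = 16 is an assumed value, since the thread never states C.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: row-major offset of element [row][col] in an array
// whose rows are `ncols` elements wide.
constexpr std::size_t flat_index(std::size_t row, std::size_t col, std::size_t ncols) {
    return row * ncols + col;
}

// Offset of in_buf[4][5] measured on a real 6*6 array, in units of elements.
inline std::size_t measured_offset() {
    static int in_buf[6][6];
    const char* base = reinterpret_cast<const char*>(&in_buf[0][0]);
    const char* elem = reinterpret_cast<const char*>(&in_buf[4][5]);
    return static_cast<std::size_t>(elem - base) / sizeof(int);
}

// Runs the two worked examples from the text.
inline void check_flat_index() {
    const std::size_t C_assumed = 16;  // assumed value of C, illustration only
    assert(flat_index(4, 5, 6) == 29);                         // in_buf[4][5]
    assert(measured_offset() == flat_index(4, 5, 6));          // the compiler agrees
    assert(flat_index(9, 7, C_assumed) == C_assumed * 9 + 7);  // buf_line_5[9][7]
}
```

The multiplication by the row width is the part that ends up in hardware whenever the row index is not a compile-time constant.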

 

In addition to this, you've unrolled the inner loops (a PIPELINE directive on C_loop makes HLS try to fully unroll every loop nested inside it), so HLS is potentially having to do this multiplication for every element in the array simultaneously, requiring even more DSP slices.

 

 

 

In terms of fixes: if you can make the array dimensions powers of 2, that eliminates the DSP slices used for addressing. For example, in_buf only needs to be 6*6 (36 elements), but if you make it 8*8 instead (or even 6*8, since only the row width enters the multiplication) then the multiplication for addressing is always just "*8", which is the same as a 3-bit shift.
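
A minimal host-side sketch of the padding idea, assuming the 6*8 layout. The names PADDED_COLS, ROW_SHIFT, padded_index and the check functions are invented for this example. It only shows that the shift form produces the same offsets as the multiplication; whether a particular HLS version then drops the DSP is something to confirm in the synthesis report.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t ROWS = 6;         // rows actually used
constexpr std::size_t USED_COLS = 6;    // columns actually used
constexpr std::size_t PADDED_COLS = 8;  // row width padded to a power of two
constexpr unsigned ROW_SHIFT = 3;       // log2(PADDED_COLS)

// Offset of [row][col] in the padded array, with row*8 written as a shift.
constexpr std::size_t padded_index(std::size_t row, std::size_t col) {
    return (row << ROW_SHIFT) + col;
}

// Storage cost of the padding: 6*8 - 6*6 elements.
constexpr std::size_t WASTED = ROWS * PADDED_COLS - ROWS * USED_COLS;

// True if the shift form matches row*PADDED_COLS + col for every used element.
inline bool shift_matches_multiply() {
    for (std::size_t r = 0; r < ROWS; ++r) {
        for (std::size_t c = 0; c < USED_COLS; ++c) {
            if (padded_index(r, c) != r * PADDED_COLS + c) return false;
        }
    }
    return true;
}

inline void check_padding() {
    assert(shift_matches_multiply());
    assert(WASTED == 12);  // 48 - 36 unused elements
}
```

The trade-off is 12 unused elements of storage in exchange for multiplier-free addressing.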

 

 

Edit: as for why the "add" operation is apparently consuming DSPs: the DSP48 slice can perform operations of the form "(A+D)*B+C", which is exactly what you need (port A = col, port D = i, port B = the array dimension C, port C = tnnn). HLS is probably using a single slice to do the entire calculation. Since the first thing that uses that slice is the (A+D) addition, the resource usage is being "assigned" to that operation, even though the slice is actually doing quite a lot more.
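
As a rough behavioural model of that fused pre-add, multiply and add, here is a sketch only. dsp_preadd_mul_add is a hypothetical name, the operand names are descriptive rather than hardware port names, bit widths are ignored, and C = 16 is an assumed value.

```cpp
#include <cassert>
#include <cstdint>

// Behavioural model: add two operands, multiply the sum by a third,
// then add a fourth. Real slices have fixed operand widths; ignored here.
constexpr std::int64_t dsp_preadd_mul_add(std::int64_t pre_a, std::int64_t pre_b,
                                          std::int64_t mul, std::int64_t post_add) {
    return (pre_a + pre_b) * mul + post_add;
}

inline void check_dsp_model() {
    const std::int64_t C_assumed = 16;  // assumed value of C, illustration only
    // col = 4, i = 5, tnnn = 7 gives the flat address of buf_line_5[9][7].
    assert(dsp_preadd_mul_add(4, 5, C_assumed, 7) == 9 * C_assumed + 7);
}
```

Seen this way, the whole address (col + i)*C + tnnn is one fused operation, which would explain a report that lists the DSP against the first addition.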

6 Replies
Xilinx Employee
Registered: 08-01-2008

Re: why this add operation consumed 4 dsp?

The HLS tool ships with coding example templates provided by Xilinx. You can try these examples and see whether you get better performance.

Thanks and Regards
Balkrishan
--------------------------------------------------------------------------------------------
Please mark the post as an answer "Accept as solution" in case it helped resolve your query.
Give kudos in case a post in case it guided to the solution.
Observer a1300012709
Registered: 07-11-2016

Re: why this add operation consumed 4 dsp?


Thank you very much for your reply.

Actually, a constant multiply (like 7*i) doesn't consume a DSP in HLS. For example:

[Screenshot attachment: 捕获.PNG]

data_tf transform1_input[6][6];
data_tf transform_output[4][6];

When this loop was synthesized, no DSP was used (even after replacing *2 and *4 with *5 and *7; the compiler is clever enough).

In addition:

[Screenshot attachment: 捕获1.PNG]

Even when I calculate the index before the loop, it still consumes extra DSPs.

I am really confused.

Xilinx Employee
Registered: 08-01-2008

Re: why this add operation consumed 4 dsp?

Check these links; they may help you:
http://www.wiki.xilinx.com/HLS+Filter2D
https://forums.xilinx.com/t5/High-Level-Synthesis-HLS/hls-mat-to-2d-array/td-p/651195
https://forums.xilinx.com/t5/High-Level-Synthesis-HLS/2015-4-2D-convolution-with-linebuffer-example-RTL-cosim-fails/td-p/690016
https://youtu.be/38lj0VQci7E
https://www.youtube.com/watch?v=4nG68rZaFGs
Thanks and Regards
Balkrishan
Scholar u4223374
Registered: 04-26-2015

Re: why this add operation consumed 4 dsp?


Huh, that's news to me. It's always insisted on using DSP slices for constant multiplication for me, unless the multiplication is a constant power of two (converted to a shift) or both inputs are constants (multiplied at compile-time). You can definitely force it to use a LUT multiplier instead of a DSP slice, and that may well make sense for constant values (after all, multiplying by five is just a free bit-shift and a single addition) but I've never seen HLS do it automatically.
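
The shift-and-add decomposition mentioned here can be sketched as follows. times5 and times7 are hypothetical helper names; this shows the arithmetic identity only and says nothing about what a given HLS version will infer on its own.

```cpp
#include <cassert>
#include <cstdint>

// x*5 = x*4 + x: one shift (free in hardware wiring) plus one addition.
constexpr std::uint32_t times5(std::uint32_t x) { return (x << 2) + x; }

// x*7 = x*8 - x: one shift plus one subtraction.
constexpr std::uint32_t times7(std::uint32_t x) { return (x << 3) - x; }

// True if both decompositions match the plain multiplication on a small range.
inline bool decompositions_match() {
    for (std::uint32_t x = 0; x < 1000; ++x) {
        if (times5(x) != 5 * x || times7(x) != 7 * x) return false;
    }
    return true;
}

inline void check_constant_multiplies() {
    assert(decompositions_match());
}
```

Small constants like these need only an adder, which makes a fabric implementation plausible for constant multiplies.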

 

In your new example, I suspect that it's still having trouble with the unrolled loop. At some point, it has to calculate ((col + <constant>) * C) + tnnn. I don't know of any way to stop it doing that for each constant value individually, apart from switching to a 1D array and doing some minor buffering:

 

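int buf_line_1[(Q+2)*C];	// flattened: element [x][c] is at x*C + c
int buf_line_2[(Q+2)*C];
int buf_line_3[(Q+2)*C];
int buf_line_4[(Q+2)*C];
int buf_line_5[(Q+2)*C];
int buf_line_6[(Q+2)*C];

int index_tmp[6];	// offsets (col+i)*C for i = 0..5
int Q_tmp = 0;	// running value of col*C, updated by addition only

for (int col = 0; col < Q; col += 4) {
	index_tmp[0] = Q_tmp;
	index_tmp[1] = Q_tmp + C;
	index_tmp[2] = Q_tmp + 2*C;
	index_tmp[3] = Q_tmp + 3*C;
	index_tmp[4] = Q_tmp + 4*C;
	index_tmp[5] = Q_tmp + 5*C;
	Q_tmp += 4*C;

	for (int tn = 0; tn < C; tn += tile_c) {
		for (int tnn = 0; tnn < tile_c; tnn++) {
			int tnnn = tn + tnn;	// as in the original code
			data_tf in_buf[6][6];
			for (int i = 0; i < 6; i++) {
				int idx = index_tmp[i] + tnnn;	// = (col+i)*C + tnnn
				in_buf[0][i] = buf_line_1[idx];
				in_buf[1][i] = buf_line_2[idx];
				in_buf[2][i] = buf_line_3[idx];
				in_buf[3][i] = buf_line_4[idx];
				in_buf[4][i] = buf_line_5[idx];
				in_buf[5][i] = buf_line_6[idx];
			}
		}
	}
}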

As far as I can tell, this guarantees that the only multiplication is constant * constant.
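
The offset arithmetic of this running-offset scheme can be sanity-checked on a host compiler before synthesis. This is a sketch under assumed sizes: Q = 8 and C = 16 are invented for the check and do not come from the thread, and running_offsets_match is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Assumed sizes, for illustration only.
constexpr std::size_t Q = 8;
constexpr std::size_t C = 16;

// True if the addition-only running offsets equal (col+i)*C + tnnn
// everywhere and never leave the (Q+2)*C flattened buffer.
inline bool running_offsets_match() {
    std::size_t q_tmp = 0;  // tracks col*C without ever multiplying by col
    for (std::size_t col = 0; col < Q; col += 4) {
        for (std::size_t i = 0; i < 6; ++i) {
            // i*C is constant * constant once the i loop is unrolled.
            const std::size_t row_base = q_tmp + i * C;
            for (std::size_t tnnn = 0; tnnn < C; ++tnnn) {
                const std::size_t idx = row_base + tnnn;
                if (idx != (col + i) * C + tnnn) return false;  // same element as 2D form
                if (idx >= (Q + 2) * C) return false;            // stays in bounds
            }
        }
        q_tmp += 4 * C;  // col advances by 4 rows per iteration
    }
    return true;
}

inline void check_running_offsets() {
    assert(running_offsets_match());
}
```

A check like this catches index mix-ups cheaply, since a host compile takes seconds while a synthesis run does not.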

 

 

Observer a1300012709
Registered: 07-11-2016

Re: why this add operation consumed 4 dsp?

I think this problem may result from the array_partition directives. If the loop is not pipelined and the array is left as a single block, no DSP is used to calculate the index (for example, a[col+i][j] = b[i][j]).