dadduni
Observer
1,164 Views
Registered: ‎10-08-2020

help pragmas covariance code


Hi, I need help setting the correct pragmas in this code. It's a covariance calculation:

#define MAT_ROW 512
#define MAT_COL 16


void cov(int mat_in[MAT_ROW][MAT_COL], int mat_out[MAT_COL][MAT_COL]){
	int i, j, r;
	for(i=0;i<MAT_COL;i++){
		for(j=i; j< MAT_COL; j++){
			int temp=0;
			for(r=0;r<MAT_ROW;r++){
				temp= temp + mat_in[r][i]*mat_in[r][j];
			}
			mat_out[i][j]=temp;
			mat_out[j][i]=temp;
		}
	}
}

I need 16 parallel inputs operating in the most parallel way possible. I tried various ARRAY_RESHAPE and PIPELINE pragmas, but the latency (in Vivado HLS C synthesis) is always about 5 ms, which is a lot.

Any suggestions, please?

 

17 Replies
joancab
Teacher
1,153 Views
Registered: ‎05-11-2015

 

Part of the latency is also moving data in and out; a 512x16 matrix is not transferred in a microsecond.

There is an intrinsic problem in your approach: every output is a summation of products, so even if you compute all the products in parallel at once, you still have the delay of 511 additions, because adders are always two-input. Adding up 512 numbers can be done in 9 steps if you can afford 256+128+64+32+16+8+4+2+1 = 511 adders. Speed has a price.

I think a long latency here is unavoidable, but you can reduce the interval, if that helps, by reading one row at a time, doing all the products of that row in parallel, and adding them to an accumulator. If there is space for you to pipeline 512 of these, you can process matrices back to back.

A final note on integer arithmetic: when you multiply two numbers of up to N bits each, the result may need 2N bits, and if you add up 512 of those you will need 2N+9 bits for a correct result. So if your input values are uint16, you need a uint32 for the product and a custom type with 41 bits or more for the output.

 

dadduni
Observer
1,131 Views
Registered: ‎10-08-2020

Thank you, I really appreciate your answer.

I thought that if I could instantiate 16 multiply-and-adds, I could reduce the time by doing more operations in parallel. The problem is that there seems to be no way to do that with combinations of pragmas. Do you have any ideas about how I can do that? I tried resizing the input matrix, but it seems there will always be just one multiplier. As you also said, I want to do all the multiplications in parallel, waiting only for the 512 serial summations.

How can I do that?

joancab
Teacher
1,114 Views
Registered: ‎05-11-2015

 

Basically, if you have a loop, the UNROLL pragma will parallelize everything inside, EXCEPT if, as in your case, one iteration depends on others.

Again, I insist on looking at the global problem, including the data transfer. What's the point of a magic kitchen that cooks in one second but is half an hour away from your fridge... Let's put some numbers on your case:

Moving 512x16 uint16 values over a 64-bit AXI bus at 200 MHz takes at least 2048 cycles (about 10 us). Then the work begins. Then the results go out: 256 64-bit values will take another 256 cycles. Imagine you reduce your computation time to 1000 cycles by using a lot of parallelism (= silicon = money = power); your total latency will still be about 3300 cycles, which is not very efficient: you have that silicon doing nothing for 2300 cycles out of 3300 (70% idle). You want to speed up a process on a 512-row matrix, but what has that process been doing while all those rows were being copied? Why not start operating as the first row comes in? The best way to use an FPGA is to process data as it comes in: minimum wait, minimum transfer.

Latency and speed can be diffuse and ambiguous terms. Of course everyone wants things "fast", but some context is needed. What is your source data rate? That should be your target. Also consider whether you can buffer: for example, if you get 512 sets of values in 512 clock cycles, but then there are 512 cycles without data, you could set up a buffer and process your data one row every two cycles.

u4223374
Advisor
1,035 Views
Registered: ‎04-26-2015

@dadduni  Can you try this code?

 

#define MAT_ROW 512
#define MAT_COL 16


void cov(int mat_in[MAT_ROW][MAT_COL], int mat_out[MAT_COL][MAT_COL]){
	int i, j, r, r2;
	for(i=0;i<MAT_COL;i++){
		for(j=i; j< MAT_COL; j++){
			int temp=0;
			for(r=0;r<MAT_ROW;r+=16){
				int temp2 = 0;
				#pragma HLS PIPELINE II=1
				for(r2=0;r2<16;r2++) {
					temp2 += mat_in[r][i]*mat_in[r][j];
				}
				temp += temp2;
			}
			mat_out[i][j]=temp;
			mat_out[j][i]=temp;
		}
	}
}

I haven't tested this, but it should force HLS to instantiate 16 32-bit multipliers (each will be four DSP48 slices, I think - so 128 DSP48s in total). Whether that actually gives you a meaningful increase in performance depends on how many additions HLS is willing to perform in one cycle, but 16 additions might be practical.

dadduni
Observer
1,030 Views
Registered: ‎10-08-2020

I have all the data in a BRAM from previous operations. What I thought to do is this: I need 136 scalar products, each one made of 512 MACs.

So if I could read 8 BRAMs in parallel, I could use 36 parallel multipliers to speed up the process.

So after 512 cycles I would have 36 output results.

I'm not an expert, but I think the input streaming is not a problem because, having all the data already in BRAM, I can access the data in parallel using more BRAMs (partitioning them, but it didn't work for me and I don't know why).

With the code attached above I obtain a latency of 394,578 clock cycles. I think this result means I'm not using any parallelization, and that it would be better to have this calculation performed by the processor and not in the FPGA.

Thanks for the answer; it would be great if you could help me a little bit more, I really appreciate it!

dadduni
Observer
1,009 Views
Registered: ‎10-08-2020

It's not clear to me where r2 is used. Maybe I'm not understanding the logic, but it seems to me that only the rows r in steps of 16 will be summed.

Is the r2 variable used in any way?

joancab
Teacher
1,004 Views
Registered: ‎05-11-2015

"the input streaming is not a problem because, having all the data already in BRAM I can access the data parallelly using more BRAMs"

A typical misconception. Even if you have a BRAM per column(16 channels) you still need to send 512 addresses and read data 512 times. And the same number of cycles from whatever provides your data to store it in that BRAM. Can't you see that intermediate storage is a waste of silicon and time?

Gathering all data while doing nothing to then start doing things fast, very fast, is not a problem. Is just a wasteful way of doing things and contrary to the principles for doing things on FPGA.

 

 

dadduni
Observer
995 Views
Registered: ‎10-08-2020

The system until now is: the 512x16 matrix is sent from the processor of a Zynq to the PL. First I need to filter each column separately, and that's not a problem to do in parallel while the data are coming in. Then I need to calculate the covariance of the filtered data. How do you suggest I do that?

Thanks for the answers.

PS: "Why not starting operations as the first row comes in?"

That's exactly what I want to do, and I hoped that my code did that, but it doesn't.

 

"why

joancab
Teacher
933 Views
Registered: ‎05-11-2015

"to filter"... another ambiguous term. Coffee is filtered, engine oil as well. I imagine kind of a digital FIR, IIR filter. In that case, pipelining the processes makes sense to me.

"the 512x16 matrix sent from the processor of a Zynq to the PL" Again, the PS is producing data and keeping it for itself while your PL is waiting for the whole thing then you want things fast at expense of parallelism. Equations in textbooks (like covariance calculation) are not written with execution speed in mind, neither software, at least the plain sort of it. 

I'd suggest you use AXI Stream to send data from the PS, one set of 16 values at a time. Your filter (maybe a x16 bank) will produce its result in 1 clock cycle. Those filtered values go to the covariance block, and at every cycle you can have all the products and accumulations in static variables, no need for BRAM. Your output will come out 2 cycles after the last row is entered. Even better, such a system can be pipelined, so you can send those 512-row packets back to back.

One question that should have come at the very beginning: what is the frequency of your data? How many ns or us for each row of data?

dadduni
Observer
896 Views
Registered: ‎10-08-2020

"One question that should be at the very beginning, what is the frequency of your data? How many ns, us for each row of data?"

I have a sampling system that puts the data in a RAM. Then the raw data will be displayed by the processor and sent to the PL, which will perform a FIR filter and the covariance calculation.

The problem is that I already wrote a 16-channel FIR filter, but the output of the FIR filter could not be passed to the covariance IP in parallel, because it would require (16+1)*(16/2) = 136 multipliers to calculate the covariance in 512 clock cycles. So I thought about storing the data in RAM.

It's okay for me to use AXI to transmit data from the processor to the filter bank, but after that, how can I transmit the data to the covariance calculator?

joancab
Teacher
879 Views
Registered: ‎05-11-2015

"it would require (16+1)(16/2) multiplier to calculate the covariance in 512 clock cycle."

Yes, well, didn't you want speed? Things have a price. The price is silicon, resources, power. That's why I've been trying to stress how to use FPGA resources wisely and effectively, with minimum storage and maximum pipelining to have all parts doing something useful on data as much as possible.

The above ("in 512 clock cycles") is not necessarily true if you feed row by row and calculate each of the 512 iterations in the time between data items arriving. That's why I asked whether your data comes at GHz, MHz or kHz, to get an idea of whether doing so is possible, but you still haven't mentioned a figure in hertz.

The best option for speed, imo, is AXI Stream, because it allows pipelining and you can process every received data item without holding things up.

dadduni
Observer
871 Views
Registered: ‎10-08-2020

Thank you for the answer,

The data are already sampled in memory, so I can choose the frequency at which to process them. It's offline processing, so I don't have the constraint of data coming in; they are already in a memory and I can send them to the PL from the processor at any rate I want.

Let's suppose I want to instantiate 16 MACs and perform the covariance with them, reshaping the input data and processing row by row, waiting the right amount of time before the next row comes in: how can I do that?

Thank you, I really appreciate the help you are giving me

dadduni
Observer
869 Views
Registered: ‎10-08-2020

Now I rewrote the code to accept one row at a time and perform all the multiplications in static variables. I made the nested loops perfect so they can be unrolled. Am I moving in the right direction to parallelize the process or not?

void cov1(float mat_in[MAT_COL], float mat_out[MAT_COL][MAT_COL]){
	volatile static float temp[MAT_COL][MAT_COL];
	int i, j;
	for(i=0;i<MAT_COL;i++){
#pragma HLS UNROLL factor=8
		for(j=0; j< MAT_COL; j++){
			temp[i][j] += mat_in[i]*mat_in[j];
			mat_out[i][j]=temp[i][j];
		}
	}
}
joancab
Teacher
838 Views
Registered: ‎05-11-2015

 

I'd say so. I thought of AXI Stream for passing data, but 16 ints may be passed straight away; you'll see if it blows up at synthesis.

A few points:

1. You need to reset your temp accumulator, either with an external signal (or the block reset) or with a static counter doing it every 512 samples. In the latter case you need to synchronize your data flow with the reset, so the external signal is probably easier.

2. Your output is only valid (meaningful is, I think, the better word) after you have fed in 512 samples. No problem in outputting it earlier, just be aware of that.

3. A matrix is just a series of variables; 2D is an illusion. Your output matrix is symmetric, so there is no need to return 16*16 values; you could just return an array with the calculated values. It saves wires and lowers the risk of the design not being routable after synthesis. One way of keeping the results in a 1D array is:

	int out_i = 0;

	for(i=0;i<MAT_COL;i++){
#pragma HLS UNROLL factor=8
		for(j=i; j< MAT_COL; j++, out_i++){
			temp[out_i] += mat_in[i]*mat_in[j];
			mat_out[out_i]=temp[out_i];
		}
	}

4. You are now using float operations, which typically take a number of cycles, so you may need FIFOs and/or overclocking (for example, if an operation takes 4 cycles, you clock that block at 200 MHz while data comes in/out at 50 MHz).

dadduni
Observer
815 Views
Registered: ‎10-08-2020

I switched to float just for testing purposes; for now it doesn't change the architecture.

I thought that since the input matrix is 2D I would have received one row at a time; I was wrong.

Thank you so much for the help. One last question, if you can: I thought that unrolling is only valid when the inner loop does not have a variable size, but in your code the inner loop starts each iteration from j=i and there is an unroll pragma. How does this work?

 

joancab
Teacher
796 Views
Registered: ‎05-11-2015

Changing ints to floats is not a minor or subtle thing, it is a major change. It affects the resources and timing.

Yes, you are right, variable-length loops cannot be unrolled. It's not gonna work. So the alternatives are:

- use the full matrix and waste half of it

- hand code all the products, good finger exercise...

- maybe, if you use full loops for i and j and skip the duplicated calculations with if (i < j) continue; Vivado is smart enough to unroll and optimize them (?)


u4223374
Advisor
606 Views
Registered: ‎04-26-2015

@joancab Whoops, my bad. See below for corrected code that does use r2.

 

#define MAT_ROW 512
#define MAT_COL 16


void cov(int mat_in[MAT_ROW][MAT_COL], int mat_out[MAT_COL][MAT_COL]){
	int i, j, r, r2;
	for(i=0;i<MAT_COL;i++){
		for(j=i; j< MAT_COL; j++){
			int temp=0;
			for(r=0;r<MAT_ROW;r+=16){
				int temp2 = 0;
				#pragma HLS PIPELINE II=1
				for(r2=0;r2<16;r2++) {
					temp2 += mat_in[r + r2][i]*mat_in[r + r2][j];
				}
				temp += temp2;
			}
			mat_out[i][j]=temp;
			mat_out[j][i]=temp;
		}
	}
}

 

The intention here is to produce a single inner loop that you want unrolled completely. By pipelining the loop outside it, you tell HLS to (a) unroll the sub-loops (i.e. the loop you wanted to unroll), and (b) try to pipeline everything so that the outer loop can start a new iteration on every cycle. This is much more powerful than a partial unroll, because it actually tells HLS to aim for a performance level.
