akokha (Observer)
356 Views
Registered: 07-08-2019

How to parallelize independent inner loops

Hi,

I want to process a convolutional layer of a CNN using a tiling method.
To this end, I use two distinct groups of on-chip memories: group0 = {IFM[0], OFM[0], WGT[0]} and group1 = {IFM[1], OFM[1], WGT[1]}.

Here is my code:

#include "tile.h"

#define MAX_N 28
#define MAX_M 15

#define MAX_TN 7
#define MAX_TM 3
#define MAX_K 5
#define MAX_IH 14
#define MAX_IW 24
#define MAX_OH 10
#define MAX_OW 20

typedef float data_t;

void cnn_layer(
	data_t input_fm[MAX_N][MAX_IH][MAX_IW],
	data_t output_fm[MAX_M][MAX_OH][MAX_OW],
	data_t kernel[MAX_M][MAX_N][MAX_K][MAX_K],
	int TM, int TN)
{

#pragma HLS RESOURCE variable=input_fm  core=RAM_2P_BRAM
#pragma HLS RESOURCE variable=output_fm core=RAM_2P_BRAM
#pragma HLS RESOURCE variable=kernel    core=RAM_2P_BRAM

	data_t IFM[2][MAX_TN][MAX_IH][MAX_IW];
	data_t OFM[2][MAX_TM][MAX_OH][MAX_OW];
	data_t WGT[2][MAX_TM][MAX_TN][MAX_K][MAX_K];

	int mm, nn;
	bool out_even=0, in_even=0;

#pragma HLS RESOURCE variable=IFM core=RAM_2P_BRAM
#pragma HLS RESOURCE variable=OFM core=RAM_2P_BRAM
#pragma HLS RESOURCE variable=WGT core=RAM_2P_BRAM

#pragma HLS array_partition variable=IFM complete dim=1
#pragma HLS array_partition variable=IFM complete dim=2
#pragma HLS array_partition variable=OFM complete dim=1
#pragma HLS array_partition variable=OFM complete dim=2
#pragma HLS array_partition variable=WGT complete dim=1
#pragma HLS array_partition variable=WGT complete dim=2
#pragma HLS array_partition variable=WGT complete dim=3


	loop1: for (mm=0; mm<MAX_M; mm+=MAX_TM, out_even = !out_even) {
//	#pragma HLS dependence variable=OFM intra false


		// FIRST STORE PARTIAL RESULTS OF PREVIOUS ITERATION (output_fm[mm-TM])
		loop1_1: for (int orow=0; orow<MAX_OH; orow++) {
			for (int ocol=0; ocol<MAX_OW; ocol++) {
				for (int tmm=0; tmm<MAX_TM; tmm++) {
					if (out_even==0) {
						output_fm[mm-TM+tmm][orow][ocol] = OFM[0][tmm][orow][ocol];
					}
					else {
						output_fm[mm-TM+tmm][orow][ocol] = OFM[1][tmm][orow][ocol];
					}
				}
			}
		}

		// THEN LOAD A NEW OFM TILE FOR NEXT ITERATION (output_fm[mm+TM])
		loop1_2: for (int orow=0; orow<MAX_OH; orow++) {
			for (int ocol=0; ocol<MAX_OW; ocol++) {
				for (int tmm=0; tmm<MAX_TM; tmm++) {
					if (out_even==0) {
						OFM[0][tmm][orow][ocol] = output_fm[mm+TM+tmm][orow][ocol];
					}
					else {
						OFM[1][tmm][orow][ocol] = output_fm[mm+TM+tmm][orow][ocol];
					}
				}
			}
		}

		// For each iteration of inner loop (nn-loop) we need to load a new IFM tile as well as a new WEIGHT tile
		loop1_3: for (nn=0; nn<MAX_N; nn+=MAX_TN, in_even = !in_even) {

		#pragma HLS dependence variable=IFM intra false
		#pragma HLS dependence variable=WGT intra false

			// Loading next IFM Tile for next iteration (input_fm[nn+TN])
			loop1_3_1: for (int irow=0; irow<MAX_IH; irow++) {
				for (int icol=0; icol<MAX_IW; icol++) {
					for (int tnn=0; tnn<MAX_TN; tnn++) {
						int next_nn = (nn+TN < MAX_N) ? nn + TN : 0; // wrap to the first tile when this is the last tile of the row
						if (in_even==0) {
							IFM[0][tnn][irow][icol] = input_fm[next_nn+tnn][irow][icol];
						}
						else {
							IFM[1][tnn][irow][icol] = input_fm[next_nn+tnn][irow][icol];
						}
					}
				}
			} // for (irow)

			// Loading next WEIGHT tile Corresponding to next IFM tile (kernel[mm+tmm][nn+TN+tnn])
			loop1_3_2: for (int krow=0; krow<MAX_K; krow++) {
				for (int kcol=0; kcol<MAX_K; kcol++) {
					for (int tmm=0; tmm<MAX_TM; tmm++) {
						for (int tnn=0; tnn<MAX_TN; tnn++) {
							/////////////////////////////////////
							int next_nn = nn + TN; // IF kernel array is stored in row-order, then there is no need to reset nn and increase mm for next tile
							if (in_even==0) {
								WGT[0][tmm][tnn][krow][kcol] = kernel[mm+tmm][next_nn+tnn][krow][kcol];
							}
							else {
								WGT[1][tmm][tnn][krow][kcol] = kernel[mm+tmm][next_nn+tnn][krow][kcol];
							}
							/////////////////////////////////////
						}
					}
				}
			} // for (krow)


			//////////////// P R O C E S S I N G     E N G I N E  //////////////////
			loop1_3_3: for (int row=0; row<MAX_OH; row++) {
				for (int col=0; col<MAX_OW; col++) {
					for (int kr=0; kr<MAX_K; kr++)   {
						for (int kc=0; kc<MAX_K; kc++) {
							#pragma HLS PIPELINE
							int tn, tm;

							tile_engine (
									row, col, kr, kc,
									OFM[1-out_even], IFM[1-in_even], WGT[1-in_even]
								);
							/* // The functionality of tile_engine is a pipelined version of the following loops
							for (int tm=0; tm<TM; tm++)  {
								for (int tn=0; tn<TN; tn++) {
									OFM[out_even][tm][row][col] += WGT[in_even][tm][tn][kr][kc] * IFM[in_even][tn][row+kr][col+kc];
								}
							}
							*/
						}
					}
				}
			}

		} // for (nn)
	} // for (mm)
}

In each iteration of loop1_3, the data of the current tile is in one group (e.g., IFM[0] and WGT[0]), and this tile is processed by loop1_3_3.
While the current tile is being processed, the data of the next tile is loaded into the other group of arrays (e.g., IFM[1] and WGT[1]); this loading is done by loop1_3_1 and loop1_3_2.
The tile processing engine is pipelined with latency = 61 and II = 1.

In the next iteration, the roles of the group0 and group1 arrays are exchanged, and so on.

Since the group0 arrays are completely distinct from the group1 arrays, and since the IFM and WGT tiles can be loaded in parallel, I expect all three loops (loop1_3_1, loop1_3_2, and loop1_3_3) to execute fully in parallel. Unfortunately, the synthesis results show that these loops execute sequentially.


I used #pragma HLS dependence variable=... intra false for the arrays IFM and WGT at the start of loop1_3. But again, no parallelism is applied to the three loops. Here is the synthesis result:

[Screenshot: syn_result.png — synthesis report showing loop1_3_1, loop1_3_2, and loop1_3_3 scheduled sequentially]

I will be grateful if anybody can explain how to parallelise these loops (loop1_3_1, loop1_3_2, and loop1_3_3).

Many thanks in advance,

Ali Kokhazadeh

1 Solution

Accepted Solutions
Xilinx Employee
311 Views
Registered: 09-05-2018

Re: How to parallelize independent inner loops

Hey @akokha,

The way to do this is to refactor each loop into a function. After converting the 3 loops into three functions, HLS should be able to run them independently given that their data is independent.

The example design "loop_functions" shows how one can convert loops into functions for parallel execution. I know this technique is also mentioned in either UG871 or UG902, but unfortunately, I can't find the exact passage in there at the moment.
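
For illustration, here is a minimal sketch of the pattern (a toy example of mine, not the shipped "loop_functions" design): two independent loops, each wrapped in its own function operating on disjoint arrays, so that HLS can schedule the two calls in parallel.

void scale_a(float a_out[64], const float a_in[64]) {
	scale_a_loop: for (int i = 0; i < 64; i++)
		a_out[i] = 2.0f * a_in[i];
}

void scale_b(float b_out[64], const float b_in[64]) {
	scale_b_loop: for (int i = 0; i < 64; i++)
		b_out[i] = 3.0f * b_in[i];
}

void top(float a_out[64], float b_out[64],
         const float a_in[64], const float b_in[64]) {
	// Written as two back-to-back loops inside top(), these would be
	// scheduled one after the other. As calls to two functions that
	// touch disjoint arrays, HLS can schedule both calls in parallel.
	scale_a(a_out, a_in);
	scale_b(b_out, b_in);
}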

Nicholas Moellers

Xilinx Worldwide Technical Support
5 Replies
Contributor
299 Views
Registered: 03-31-2017

Re: How to parallelize independent inner loops

Page 305 of UG902 (v2019.1) shows an example of how to convert loops to functions so that they will run in parallel.

Xilinx Employee
283 Views
Registered: 09-05-2018

Re: How to parallelize independent inner loops

Hey @p27803 ,

Thanks for noting that! I knew it was in there somewhere.

Nicholas Moellers

Xilinx Worldwide Technical Support
akokha (Observer)
241 Views
Registered: 07-08-2019

Re: How to parallelize independent inner loops

Thanks @p27803,

Thanks Nicholas @nmoeller,

I rewrote the function in a modular form, cnn_layer_modular(), and moved the parts that I want to run in parallel into distinct functions.

Some parts are parallelized very well, but two of them are not parallelized, despite the use of false-dependence directives!

In the first step, I modularized the three loops inside loop1_3 of the previous code: I defined a function named load_inputs_and_process_tile that corresponds to loop1_3, and then defined three other functions corresponding to its three inner loops. The code is as follows; everything is OK, and the three inner modules are parallelized as I expected:

 

void load_next_tile_ifm (
		data_t layer_ifm[MAX_N][MAX_IH][MAX_IW],
		data_t tile_ifm[2][MAX_TN][MAX_IH][MAX_IW],
		bool in_even,
		int tile_start_nn
)
{
		// Loading next IFM Tile for next iteration (input_fm[nn+TN])
	loop1_3_1: for (int tnn=0; tnn<MAX_TN; tnn++) {
		for (int irow=0; irow<MAX_IH; irow++) {
			for (int icol=0; icol<MAX_IW; icol++) {
				tile_ifm[in_even][tnn][irow][icol] = layer_ifm[tile_start_nn+tnn][irow][icol];
			}
		}
	}

}

void load_next_tile_wgt (
		data_t layer_wgt[MAX_M][MAX_N][MAX_K][MAX_K],
		data_t tile_wgt[2][MAX_TM][MAX_TN][MAX_K][MAX_K],
		bool in_even,
		int tile_start_nn,
		int tile_start_mm
)
{
	// Loading next WEIGHT tile Corresponding to next IFM tile (kernel[mm+tmm][nn+TN+tnn])
	loop1_3_2: for (int tmm=0; tmm<MAX_TM; tmm++) {
		for (int tnn=0; tnn<MAX_TN; tnn++) {
			for (int krow=0; krow<MAX_K; krow++) {
				for (int kcol=0; kcol<MAX_K; kcol++) {
					/////////////////////////////////////
					tile_wgt[in_even][tmm][tnn][krow][kcol] = layer_wgt[tile_start_mm+tmm][tile_start_nn+tnn][krow][kcol];
					/////////////////////////////////////
				}
			}
		}
	} // for (tmm)

}


void process_current_tile (
		data_t tile_ifm[2][MAX_TN][MAX_IH][MAX_IW],
		data_t tile_wgt[2][MAX_TM][MAX_TN][MAX_K][MAX_K],
		data_t tile_ofm[2][MAX_TM][MAX_OH][MAX_OW],
		bool in_even,
		bool out_even
)
{

	//////////////// P R O C E S S I N G     E N G I N E  //////////////////
	loop1_3_3: for (int row=0; row<MAX_OH; row++) {
		for (int col=0; col<MAX_OW; col++) {
			for (int kr=0; kr<MAX_K; kr++)   {
				for (int kc=0; kc<MAX_K; kc++) {
					#pragma HLS PIPELINE
					tile_engine (
							row, col, kr, kc,
							tile_ofm[!out_even], tile_ifm[!in_even], tile_wgt[!in_even]
						);
				}
			}
		}
	}
}


void load_inputs_and_process_tile (
	data_t tile_ofm[2][MAX_TM][MAX_OH][MAX_OW],
	data_t tile_wgt[2][MAX_TM][MAX_TN][MAX_K][MAX_K],
	data_t tile_ifm[2][MAX_TN][MAX_IH][MAX_IW],

	data_t layer_ifm[MAX_N][MAX_IH][MAX_IW],
	data_t layer_wgt[MAX_M][MAX_N][MAX_K][MAX_K],

	bool out_even, int mm,
	int TM, int TN
)
{
	bool in_even = 0;
	// For each iteration of inner loop (nn-loop) we need to load a new IFM tile as well as a new WEIGHT tile
	loop1_3: for (int nn=0; nn<MAX_N; nn+=MAX_TN, in_even = !in_even) {

		#pragma HLS dependence variable=tile_ifm intra false
		#pragma HLS dependence variable=tile_wgt intra false

		int next_tile_nn = nn + TN; 
		int next_tile_mm = mm + TM;

		// Loading next IFM Tile for next iteration (input_fm[nn+TN])
		load_next_tile_ifm (layer_ifm, tile_ifm, in_even, next_tile_nn);

		// Loading next WGT tile for next iteration (layer_wgt[mm+TM][nn+TN])
		load_next_tile_wgt (layer_wgt, tile_wgt, in_even, next_tile_nn, next_tile_mm);

		//////////////// P R O C E S S I N G     E N G I N E  //////////////////
		process_current_tile (tile_ifm, tile_wgt, tile_ofm, in_even, out_even);

	}
}

The synthesis results below demonstrate that the three loops are parallelized well:

 

 

[Screenshot: modular_result_1.png — synthesis report showing the three sub-functions of load_inputs_and_process_tile executing in parallel]

 

In the second step, I decided to parallelize the above function load_inputs_and_process_tile (i.e., loop1_3) with loop1_1 and loop1_2.

loop1_1 and loop1_2 have dependencies between them and must run serially, but they can be executed in parallel with loop1_3.

Based on the above timing report, the aggregated latency of loop1_1 and loop1_2 is about 3,300 cycles, while the latency of the parallelized (modular) version of loop1_3 is 20,260. So, after running loop1_1+loop1_2 in parallel with loop1_3, I expect the iteration latency of loop1 to be the maximum of 3,300 and 20,260 (i.e., about 20,260). Below is the code of loop1 and loop1_1+loop1_2 after the second step of modification:

// This function stores the partial OFM results of the previous output tile and
// then loads the partial OFM of the next output tile, to be updated after processing the current one

void load_store_partial_ofm (
		data_t layer_ofm[MAX_M][MAX_OH][MAX_OW],
		data_t tile_ofm[2][MAX_TM][MAX_OH][MAX_OW],
		int current_tile_mm, int TM,
		bool out_even
)
{
	int next_mm = current_tile_mm + TM;
	int prev_mm = current_tile_mm - TM;

	// FIRST STORE PARTIAL RESULTS OF PREVIOUS ITERATION (output_fm[mm-TM])
	loop1_1: for (int tmm=0; tmm<MAX_TM; tmm++) {
		for (int orow=0; orow<MAX_OH; orow++) {
			for (int ocol=0; ocol<MAX_OW; ocol++) {
				layer_ofm[prev_mm+tmm][orow][ocol] = tile_ofm[out_even][tmm][orow][ocol];
			}
		}
	}

	// THEN LOAD A NEW OFM TILE FOR NEXT ITERATION (output_fm[mm+TM])
	loop1_2: for (int tmm=0; tmm<MAX_TM; tmm++) {
		for (int orow=0; orow<MAX_OH; orow++) {
			for (int ocol=0; ocol<MAX_OW; ocol++) {
					tile_ofm[out_even][tmm][orow][ocol] = layer_ofm[next_mm+tmm][orow][ocol];
			}
		}
	}

}

void cnn_layer_modular (
	data_t layer_ifm[MAX_N][MAX_IH][MAX_IW],
	data_t layer_ofm[MAX_M][MAX_OH][MAX_OW],
	data_t layer_wgt[MAX_M][MAX_N][MAX_K][MAX_K],
	int TM, int TN, int M, int N)
{

#pragma HLS RESOURCE variable=layer_ifm core=RAM_2P_BRAM
#pragma HLS RESOURCE variable=layer_ofm core=RAM_2P_BRAM
#pragma HLS RESOURCE variable=layer_wgt core=RAM_2P_BRAM

	data_t IFM[2][MAX_TN][MAX_IH][MAX_IW];
	data_t OFM[2][MAX_TM][MAX_OH][MAX_OW];
	data_t WGT[2][MAX_TM][MAX_TN][MAX_K][MAX_K];

	int mm, nn;
	bool out_even=0, in_even=0;

	#pragma HLS RESOURCE variable=IFM core=RAM_2P_BRAM
	#pragma HLS RESOURCE variable=OFM core=RAM_2P_BRAM
	#pragma HLS RESOURCE variable=WGT core=RAM_2P_BRAM
	
	#pragma HLS array_partition variable=IFM complete dim=1
	#pragma HLS array_partition variable=IFM complete dim=2
	#pragma HLS array_partition variable=OFM complete dim=1
	#pragma HLS array_partition variable=OFM complete dim=2
	#pragma HLS array_partition variable=WGT complete dim=1
	#pragma HLS array_partition variable=WGT complete dim=2
	#pragma HLS array_partition variable=WGT complete dim=3


	loop1: for (mm=0; mm<MAX_M; mm+=MAX_TM, out_even = !out_even) {

		#pragma HLS dependence variable=OFM intra false
		#pragma HLS dependence variable=mm intra false
		#pragma HLS dependence variable=TM intra false
		#pragma HLS dependence variable=out_even intra false

		load_store_partial_ofm(layer_ofm, OFM, mm, TM, out_even);      // loop1_1+loop1_2

		load_inputs_and_process_tile(OFM, WGT, IFM, layer_ifm, layer_wgt, out_even, mm, TM, TN);   // loop1_3
	} // for (mm)

}

But unfortunately, the synthesis results shown below imply that loop1_1+loop1_2 is not executed in parallel with loop1_3. In fact, the iteration latency of the main loop equals the sum of the two latencies, not the maximum of them.


[Screenshot: modular_result_2.png — synthesis report showing load_store_partial_ofm and load_inputs_and_process_tile scheduled sequentially]

Another unexpected thing can be observed in the synthesis results:
In the final version, I have a main loop (loop1) that executes two parallel functions in each iteration. So, in the synthesis results, I expect to see just two modules (load_store_partial_ofm and load_inputs_and_process_tile) in the Instances table and just one loop (loop1) in the Loops table.

But the report contains an inner loop (loop1_1, which corresponds to the original loop1_3) in the Loops table and four modules (load_store_partial_ofm and the three submodules of load_inputs_and_process_tile) in the Instances table. I think the reason for this is somehow related to the parallelization problem in step 2.

 

Can anybody explain why loop1_1+loop1_2 is not parallelized with loop1_3?

 

Thanks in advance

Ali

akokha (Observer)
208 Views
Registered: 07-08-2019

Re: How to parallelize independent inner loops

Hi all,

I solved the second part myself.

I think the reason was that the function load_inputs_and_process_tile was inlined automatically, and hence the resulting loop (from the body of the inlined function) was not parallelized with the other function.

So, I used the directive #pragma HLS inline off in the body of the load_inputs_and_process_tile function. As a result, load_inputs_and_process_tile is no longer inlined into the main loop, and the synthesis results now show a single loop (the main loop) with the two functions load_store_partial_ofm and load_inputs_and_process_tile in the Instances table, executed in parallel.
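
For reference, the only change is the pragma at the top of the function body; the rest of the function is the same as in my previous post:

void load_inputs_and_process_tile (
	data_t tile_ofm[2][MAX_TM][MAX_OH][MAX_OW],
	data_t tile_wgt[2][MAX_TM][MAX_TN][MAX_K][MAX_K],
	data_t tile_ifm[2][MAX_TN][MAX_IH][MAX_IW],

	data_t layer_ifm[MAX_N][MAX_IH][MAX_IW],
	data_t layer_wgt[MAX_M][MAX_N][MAX_K][MAX_K],

	bool out_even, int mm,
	int TM, int TN
)
{
#pragma HLS inline off   // keep this function as a separate instance so its call can overlap with load_store_partial_ofm

	bool in_even = 0;
	// For each iteration of the inner (nn) loop, load a new IFM tile and a new WEIGHT tile
	loop1_3: for (int nn=0; nn<MAX_N; nn+=MAX_TN, in_even = !in_even) {

		#pragma HLS dependence variable=tile_ifm intra false
		#pragma HLS dependence variable=tile_wgt intra false

		int next_tile_nn = nn + TN;
		int next_tile_mm = mm + TM;

		// Load the next IFM and WGT tiles while the current tile is being processed
		load_next_tile_ifm (layer_ifm, tile_ifm, in_even, next_tile_nn);
		load_next_tile_wgt (layer_wgt, tile_wgt, in_even, next_tile_nn, next_tile_mm);

		// Process the current tile
		process_current_tile (tile_ifm, tile_wgt, tile_ofm, in_even, out_even);
	}
}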

 

[Screenshot: Untitled.png — synthesis report after adding inline off, showing load_store_partial_ofm and load_inputs_and_process_tile running in parallel]

 
