Adventurer
Registered: ‎07-08-2019

How to pipeline storing an array to off-chip memory

Hi,

I want to write a C++ module in Vivado HLS that sums over the rows (the last dimension) of a 3D input array xin[M][N][P] and stores the results in a 2D output array yout[M][N]. For example, with M=30, N=20, and P=10, yout[0][0] = sum(xin[0][0][0], …, xin[0][0][9]).

I want to perform the operations in 6 epochs (M/TM = 30/5) using a buffer ytile[2][TM][N] (TM=5) that is mapped onto on-chip BRAM and completely partitioned along dimensions 1 and 2.
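A plain C++ reference model of the intended result can serve as a golden check in the C testbench. The sketch below is not part of the posted design; `reference_sum` is a hypothetical helper name, and it assumes the same M, N, and P as above.

```cpp
constexpr int M = 30, N = 20, P = 10;

// Golden model (software only, no HLS pragmas):
// yout[m][n] = xin[m][n][0] + ... + xin[m][n][P-1].
void reference_sum(int xin[M][N][P], int yout[M][N]) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            int acc = 0;
            for (int p = 0; p < P; ++p) {
                acc += xin[m][n][p];
            }
            yout[m][n] = acc;
        }
    }
}
```

Comparing the output of the HLS top function against this model in the testbench checks functionality independently of any pipelining result.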

#include <ap_int.h>

#define M 30
#define N 20
#define P 10
#define TM 5

static ap_uint<1> oflag = 0;

static int ytile[2][TM][N];

void compute_ytile(int xin[M][N][P], int yt_start);
void store_ytile(int yout[M][N],     int yt_start);

void top_module(
	int xin[M][N][P],
	int yout[M][N]
)
{
	#pragma HLS ARRAY_PARTITION variable=ytile dim=1
	#pragma HLS ARRAY_PARTITION variable=ytile dim=2
	#pragma HLS RESOURCE variable=ytile core=RAM_2P_BRAM

	#pragma HLS ARRAY_PARTITION variable=xin dim=1
	#pragma HLS RESOURCE variable=xin core=RAM_1P_BRAM  // Single-port is OK

	#pragma HLS ARRAY_PARTITION variable=yout dim=1


	for (int nn=0; nn<N; nn++) {
		#pragma HLS PIPELINE
		for (int tmm=0; tmm<TM; tmm++) {
			ytile[oflag][tmm][nn] = 0;
		}
	}


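	// Note: the store lags the compute by one tile, so with the bound mm<M the
	// tile computed at mm=M-TM is never stored; a flush iteration seems to be needed.
	for (int mm=0; mm<M; mm+=TM) {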
//		#pragma HLS PIPELINE

		compute_ytile(xin, mm);
		store_ytile(yout, mm-TM);

		oflag = 1 - oflag;
	}

}

void compute_ytile(
	int xin[M][N][P],
	int yt_start
)
{
	#pragma HLS INLINE OFF

	if (yt_start < M) {

		for (int pp=0; pp<P; pp++) {
			for (int nn=0; nn<N; nn++) {
				#pragma HLS PIPELINE
				for (int tmm=0; tmm<TM; tmm++) {
					int m_index = yt_start + tmm;
					ytile[oflag][tmm][nn] += xin[m_index][nn][pp];
				}
			}
		}

	}
}

void store_ytile(
	int yout[M][N],
	int yt_start
)
{
	#pragma HLS INLINE OFF

	if (yt_start >= 0) {

		store_loop: for (int nn=0; nn<N; nn++) {
			#pragma HLS PIPELINE
			for (int tmm=0; tmm<TM; tmm++) {
				int m_index = yt_start + tmm;
				yout[m_index][nn] = ytile[1-oflag][tmm][nn];
			}
		}

	}

	if (yt_start + TM < M) {

		for (int nn=0; nn<N; nn++) {
			#pragma HLS PIPELINE
			for (int tmm=0; tmm<TM; tmm++) {
					ytile[1-oflag][tmm][nn] = 0;
			}
		}
	}

}

In each iteration of the main loop of top_module:

  • compute_ytile computes five rows of yout into ytile[oflag] (oflag is either 0 or 1), and then
  • store_ytile writes the previously computed sums from ytile[1-oflag] to yout and clears ytile[1-oflag].
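The two-buffer schedule described above can be modelled in plain software to check its bookkeeping. This is only a sketch: `pingpong_sum` is a hypothetical name, the accumulate-then-clear step is simplified to a direct assignment, and one extra flush iteration (mm == M) is assumed so that the last tile is also written out, since the store runs one epoch behind the compute.

```cpp
constexpr int M = 30, N = 20, P = 10, TM = 5;

// Software-only model of the ping-pong schedule (no HLS pragmas).
void pingpong_sum(int xin[M][N][P], int yout[M][N]) {
    static int ytile[2][TM][N];
    int oflag = 0;  // index of the buffer filled in the current epoch

    // Assumption: the loop runs one epoch past M to drain the final tile.
    for (int mm = 0; mm < M + TM; mm += TM) {
        // Compute phase: rows mm .. mm+TM-1 go into ytile[oflag].
        if (mm < M) {
            for (int tmm = 0; tmm < TM; ++tmm) {
                for (int nn = 0; nn < N; ++nn) {
                    int acc = 0;
                    for (int pp = 0; pp < P; ++pp) {
                        acc += xin[mm + tmm][nn][pp];
                    }
                    ytile[oflag][tmm][nn] = acc;
                }
            }
        }
        // Store phase: drain the tile computed in the previous epoch.
        const int prev = mm - TM;
        if (prev >= 0) {
            for (int tmm = 0; tmm < TM; ++tmm) {
                for (int nn = 0; nn < N; ++nn) {
                    yout[prev + tmm][nn] = ytile[1 - oflag][tmm][nn];
                }
            }
        }
        oflag = 1 - oflag;  // swap buffers
    }
}
```

With this loop bound there are seven passes: six that compute, and a final one that only stores rows 25 to 29.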

For now, parallelizing the main loop of top_module is not my concern; my problem is specifically the first loop (store_loop) in store_ytile. I have added directives to partition the arrays where necessary and to pipeline every loop in the design other than top_module's main loop.

I expect store_loop in store_ytile to be pipelined so that, for example, the elements of ytile_0_0 are read in consecutive cycles and written to yout_0 in consecutive cycles, with one cycle of latency, as shown below:

store pipeline 02.png

 

But the C/RTL co-simulation waveform shows that storing each element takes three cycles, as shown below:

store pipeline 01.png

 

(All necessary files for C/RTL Co-simulation are attached)

Can anybody tell me how I can solve this problem?

 

Thanks in advance

Ali

8 Replies
Advisor
Registered: ‎04-26-2015

What interface are you expecting to use for yout? AXI Master? If so, it's probably a good idea not to partition it.

In this case you'll need to move your pipeline directive down one level, so you're only pipelining the inner loop.

 

Adventurer
Registered: ‎07-08-2019

For now, the interface type is not my concern. However, based on previous papers on this topic, AXI Master could be a good choice.

What I need for now is a mechanism to store BRAM data to off-chip memory in a fully pipelined fashion.

For example, with five (TM=5) parallel BRAM blocks for the ytile array, each holding 20 elements (N=20), I expect the store operation to complete in 21 consecutive cycles (or a few more).

Thanks

Adventurer
Registered: ‎07-08-2019

Hi @u4223374 ,

I applied your suggested changes, but they don't give the result I need.

The inner loop (for-tmm) operates on five partitioned arrays that can be accessed in parallel, so all of its iterations are completely independent and should run in parallel. Moving #pragma HLS PIPELINE into the inner loop makes its iterations run in a pipelined fashion in 6 cycles, instead of all in parallel in two cycles.

I expect the five iterations of the for-tmm loop to run in parallel (two cycles to read and write all five elements) and the 20 iterations of the for-nn loop to run in a pipelined fashion with depth=2 (21 cycles in total).

From the hardware viewpoint this is logical and possible.
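The cycle counts quoted in this thread follow from the usual estimate for a pipelined loop: the first iteration finishes after `depth` cycles, and each further iteration starts `ii` cycles after the previous one. The helper below is a hypothetical illustration of that arithmetic, not a tool report; actual HLS reports may add a few cycles for loop entry and exit.

```cpp
// Estimated cycles for a pipelined loop with the given trip count,
// initiation interval (ii), and iteration latency (depth).
constexpr int pipelined_cycles(int trip_count, int ii, int depth) {
    return (trip_count - 1) * ii + depth;
}

// for-nn loop, 20 iterations, depth 2:
static_assert(pipelined_cycles(20, 1, 2) == 21, "II=1: the expected 21 cycles");
static_assert(pipelined_cycles(20, 3, 2) == 59, "II=3: what an II of 3 would cost");
// for-tmm loop pipelined on its own, 5 iterations, depth 2:
static_assert(pipelined_cycles(5, 1, 2) == 6, "inner loop only: 6 cycles");
```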

So, is there any solution to achieve this goal?

Thanks,

Ali

 

cc: @nithink 

Xilinx Employee
Registered: ‎09-04-2017

Hi Ali,

  Let me take a look to understand the issue.

Thanks,

Nithin

 

Adventurer
Registered: ‎07-08-2019

Hi Nithin,

Please let me know if any extra files or comments about the issue are needed.

Thanks,

Ali

Xilinx Employee
Registered: ‎09-04-2017

Hi Ali,

  The address generation in store_ytile seems to be causing the issue. If we remove the resetting of ytile, then also I see II=1. The inner loop is getting unrolled, but on subsequent calls the address into yout is generated dynamically (from yt_start), and this seems to be what prevents the store from being scheduled every cycle.
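One way to picture the dynamic-address problem: with yout partitioned along dim 1, the bank that each unrolled write targets depends on the run-time value yt_start + tmm. The sketch below shows a hypothetical alternative layout in which the bank is chosen by tmm alone (a constant after unrolling) and only the word inside each bank depends on the tile number. The names `store_tile`, `yout_t`, `tile_buf`, and `TILES` are assumptions for illustration, and it has not been synthesized, so whether it reaches II=1 in a given Vivado HLS version is not confirmed.

```cpp
constexpr int M = 30, N = 20, TM = 5;
constexpr int TILES = M / TM;

// Row m of the original yout[M][N] maps to yout_t[m / TM][m % TM][...];
// the flat row-major layout is unchanged.
void store_tile(int yout_t[TILES][TM][N], int tile_buf[TM][N], int tile) {
#pragma HLS ARRAY_PARTITION variable=yout_t complete dim=2
#pragma HLS ARRAY_PARTITION variable=tile_buf complete dim=1
    for (int nn = 0; nn < N; ++nn) {
#pragma HLS PIPELINE II=1
        // Unrolled by the enclosing PIPELINE: each tmm writes its own bank.
        for (int tmm = 0; tmm < TM; ++tmm) {
            yout_t[tile][tmm][nn] = tile_buf[tmm][nn];
        }
    }
}
```

A standard C++ compiler ignores the HLS pragmas, so the function can be checked for functional correctness in an ordinary testbench before trying it in the tool.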

Thanks,

Nithin

Xilinx Employee
Registered: ‎09-04-2017

@akokha I found one workaround. Please see if this helps:

#include <ap_int.h>

#define M 30
#define N 20
#define P 10
#define TM 5

static ap_uint<1> oflag = 0;

static int ytile[2][TM][N];

void compute_ytile(int xin[M][N][P], int yt_start);
void store_ytile(int yout[M][N], int yt_start);

void top(
	int xin[M][N][P],
	int yout[M][N]
)
{
	#pragma HLS ARRAY_PARTITION variable=ytile dim=1
	#pragma HLS ARRAY_PARTITION variable=ytile dim=2
	#pragma HLS RESOURCE variable=ytile core=RAM_2P_BRAM

	#pragma HLS ARRAY_PARTITION variable=xin dim=1
	#pragma HLS RESOURCE variable=xin core=RAM_1P_BRAM  // Single-port is OK

	#pragma HLS ARRAY_PARTITION variable=yout dim=1


	for (int nn=0; nn<N; nn++) {
		#pragma HLS PIPELINE
		for (int tmm=0; tmm<TM; tmm++) {
			ytile[oflag][tmm][nn] = 0;
		}
	}


	for (int mm=0; mm<M; mm+=TM) {
//		#pragma HLS PIPELINE

		compute_ytile(xin, mm);
		store_ytile(yout, mm-TM);

		oflag = 1 - oflag;
	}

}

void compute_ytile(
	int xin[M][N][P],
	int yt_start
)
{
	#pragma HLS INLINE OFF

	if (yt_start < M) {

		for (int pp=0; pp<P; pp++) {
			for (int nn=0; nn<N; nn++) {
				#pragma HLS PIPELINE
				for (int tmm=0; tmm<TM; tmm++) {
					int m_index = yt_start + tmm;
					ytile[oflag][tmm][nn] += xin[m_index][nn][pp];
				}
			}
		}

	}
}

void store_ytile(
	int yout1[M][N],
	int yt_start
)
{
	int yout[M][N];
	#pragma HLS ARRAY_PARTITION variable=yout1 dim=1

	#pragma HLS ARRAY_PARTITION variable=yout dim=1
	#pragma HLS INLINE OFF

	if (yt_start >= 0) {

		store_loop: for (int nn=0; nn<N; nn++) {
			#pragma HLS PIPELINE
			for (int tmm=0; tmm<TM; tmm++) {
				int m_index = yt_start + tmm;
				yout[m_index][nn] = ytile[1-oflag][tmm][nn];
			}
		}

	}

	if (yt_start + TM < M) {

		for (int nn=0; nn<N; nn++) {
			#pragma HLS PIPELINE
			for (int tmm=0; tmm<TM; tmm++) {
				ytile[1-oflag][tmm][nn] = 0;
			}
		}
	}

	Output: for (int nn=0; nn<N; nn++) {
		#pragma HLS PIPELINE
		for (int tmm=0; tmm<TM; tmm++) {
			yout1[tmm][nn] = yout[tmm][nn];
		}
	}

}

I have used a temporary array inside store_ytile and then copied it to the output.

Thanks,

Nithin

 

 

 

Adventurer
Registered: ‎07-08-2019

Hi @nithink ,

I tested your suggested solution, but unfortunately it doesn't work. Not only does the II of 3 cycles remain, but the temporary array also adds some extra cycles.

Furthermore, I removed the PIPELINE pragma from every loop other than the store loop, to rule out any possible conflict with it. The problem still remains: the loop that stores ytile data into the yout memories is pipelined with II=3.

 

#include <ap_int.h>

#define M 30
#define N 20
#define P 10
#define TM 5

static ap_uint<1> oflag = 0;

static int ytile[2][TM][N];

void compute_ytile(int xin[M][N][P], int yt_start);
void store_ytile(int yout[M][N],     int yt_start);

void top_module(
	int xin[M][N][P],
	int yout[M][N]
)
{
	#pragma HLS ARRAY_PARTITION variable=ytile dim=1
	#pragma HLS ARRAY_PARTITION variable=ytile dim=2
	#pragma HLS RESOURCE variable=ytile core=RAM_2P_BRAM

	#pragma HLS ARRAY_PARTITION variable=xin dim=1
	#pragma HLS RESOURCE variable=xin core=RAM_1P_BRAM  // Single-port is OK

	#pragma HLS ARRAY_PARTITION variable=yout dim=1

	for (int nn=0; nn<N; nn++) {
		// #pragma HLS PIPELINE
		for (int tmm=0; tmm<TM; tmm++) {
			ytile[oflag][tmm][nn] = 0;
		}
	}

	for (int mm=0; mm<M; mm+=TM) {
		// #pragma HLS PIPELINE
		compute_ytile(xin, mm);
		store_ytile(yout, mm-TM);

		oflag = 1 - oflag;
	}

}

void compute_ytile(
	int xin[M][N][P],
	int yt_start
)
{
	#pragma HLS INLINE OFF

	if (yt_start < M) {

		for (int pp=0; pp<P; pp++) {
			for (int nn=0; nn<N; nn++) {
				// #pragma HLS PIPELINE
				for (int tmm=0; tmm<TM; tmm++) {
					int m_index = yt_start + tmm;
					ytile[oflag][tmm][nn] += xin[m_index][nn][pp];
				}
			}
		}

	}
}

void store_ytile(
	int yout[M][N],
	int yt_start
)
{
	#pragma HLS INLINE OFF

	if (yt_start >= 0) {

		store_loop: for (int nn=0; nn<N; nn++) {
			#pragma HLS PIPELINE
			for (int tmm=0; tmm<TM; tmm++) {
				int m_index = yt_start + tmm;
				yout[m_index][nn] = ytile[1-oflag][tmm][nn];
			}
		}
	}

	if (yt_start + TM < M) {

		for (int nn=0; nn<N; nn++) {
			// #pragma HLS PIPELINE
			for (int tmm=0; tmm<TM; tmm++) {
				ytile[1-oflag][tmm][nn] = 0;
			}
		}
	}

}

This seems like a straightforward problem with a straightforward solution. I think there must be some way to write store_loop so that Vivado HLS synthesizes it as a pipeline with II=1.

So, what is the source of the problem, and where is my mistake?

Note that all TM arrays accessed in the loop are partitioned (i.e., dim 1 of yout and dim 2 of ytile) and are expected to be processed in parallel.

cc: @u4223374 

Thanks a lot,

Ali

 
