Participant akokha
Participant
556 Views
Registered: 07-08-2019

VivadoHLS Cannot parallelize operations on independent BRAM memories

Hi,

I am implementing a design that processes input data (the xin array) stored in off-chip memory. The result is supposed to be stored in the yout array, also in off-chip memory.

int xin[P][R];

int yout[M][N];

Due to limited on-chip memory, the yout data must be computed in smaller on-chip units of size TMxN, called tiles (TM << M). (The same applies to the input data xin.)

To overlap the computation of yout tiles with data transfers between off-chip and on-chip memory, I use two blocks of BRAM to hold the on-chip tiles of yout, as below:

int ytile[2][TM][N];

With this memory structure, while the ytile[1] array is being computed, the previously computed tile in ytile[0] is moved to off-chip memory. In the next epoch, ytile[1] is gradually stored to off-chip memory while a new tile is computed in ytile[0], and so on. In other words, in each iteration (i.e., epoch) the roles of the ytile[0] and ytile[1] memories are exchanged.

Below is the code of this design:

void top_module(
	int xin[P][R],
	int yout[M][N]
)
{
	int ytile[2][TM][N];
	ap_uint<1> oflag;

	#pragma HLS ARRAY_PARTITION variable=ytile dim=1
	#pragma HLS RESOURCE variable=ytile core=RAM_2P_BRAM

	init_ytile(ytile[0]);
	load_inputs_and_compute_ytile(xin, ytile[0]);
	init_ytile(ytile[1]);
	oflag = 1;

	int tmm;
	for (tmm = 0; tmm < M-TM; tmm += TM) {
		load_inputs_and_process_ytile(xin, ytile[oflag]);
		store_ytile_then_init_ytile(yout, ytile[1-oflag], tmm);
		oflag = 1 - oflag;
	}

	store_ytile(yout, ytile[1-oflag], tmm); // Storing last ytile
}

Here, the processing of the xin array is not my concern; the focus is on the for loop shown above.

Since the ytile array is partitioned along its first dimension, I expect that the processing of ytile[0] and ytile[1] can be done in parallel. In each iteration, a tile is computed in ytile[oflag]. Simultaneously, the previously computed tile in ytile[1-oflag] is stored to the off-chip memory yout and then initialized for the next iteration.

I simulated this design using C/RTL co-simulation in Vivado HLS, but unfortunately the operations on ytile[0] and ytile[1] are performed serially.

I tried using #pragma HLS dependence variable=ytile intra false to assert that there is no intra-iteration dependence between ytile[0] and ytile[1]. But in this case, the results computed in ytile[0] and ytile[1] are all incorrect (all zeros).

Can anybody tell me what is wrong with my design and how I can fix it?

Thanks in advance,

Ali

0 Kudos

8 Replies
Participant akokha
Participant
484 Views
Registered: 07-08-2019

Re: VivadoHLS Cannot parallelize operations on independent BRAM memories

Hi @nithink ,

Hi @u4223374 ,

Unfortunately, nobody has responded to this post of mine yet.

Since you have answered the majority of my past questions, I would be grateful if you could help me solve this issue, too.

The design I am working on is rather complex. But if a more realistic yet concise example is needed, please tell me and I will prepare it ASAP.

Thanks in advance for your time and consideration.

Ali

0 Kudos
Xilinx Employee nithink
Xilinx Employee
479 Views
Registered: 09-04-2017

Re: VivadoHLS Cannot parallelize operations on independent BRAM memories

Hi Ali, can you share a simple piece of code that shows this behaviour? It would be easy to start with that.

Thanks,

Nithin

Scholar u4223374
Scholar
443 Views
Registered: 04-26-2015

Re: VivadoHLS Cannot parallelize operations on independent BRAM memories

There is a way to do this, although it's not very pretty. I'll try to remember to post it tomorrow (because it's midnight here); if I forget then send me a message.

0 Kudos
Scholar u4223374
Scholar
403 Views
Registered: 04-26-2015

Re: VivadoHLS Cannot parallelize operations on independent BRAM memories

@akokha OK, the trick is that HLS can't share an array between functions (except in dataflow regions). Even if it's partitioned and you feed one section to one function and the other section to another function, it's not going to work. So, you need two arrays:

void top_module(
	int xin[P][R],
	int yout[M][N]
)
{
	int ytile0[TM][N];
	int ytile1[TM][N];
	ap_uint<1> oflag;

	#pragma HLS RESOURCE variable=ytile0 core=RAM_2P_BRAM
	#pragma HLS RESOURCE variable=ytile1 core=RAM_2P_BRAM

	init_ytile(ytile0);
	load_inputs_and_compute_ytile(xin, ytile0);
	init_ytile(ytile1);
	oflag = 1;

	int tmm;
	for (tmm = 0; tmm < M-TM; tmm += TM) {
		if (oflag) {
			// Compute the new tile in ytile1 while the previous tile in ytile0 is stored
			load_inputs_and_process_ytile(xin, ytile1);
			store_ytile_then_init_ytile(yout, ytile0, tmm);
		} else {
			// Compute the new tile in ytile0 while the previous tile in ytile1 is stored
			load_inputs_and_process_ytile(xin, ytile0);
			store_ytile_then_init_ytile(yout, ytile1, tmm);
		}
		oflag = 1 - oflag;
	}

	if (oflag) {
		store_ytile(yout, ytile0, tmm); // Storing Last ytile
	} else {
		store_ytile(yout, ytile1, tmm); // Storing Last ytile
	}
}

Like I said, it's not pretty - but it works. It can be a bit neater than this - I generally put an enable input on each function, which means I can do everything in the loop and just disable whichever functions aren't needed for the first/last iterations.
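
For illustration, here is a minimal sketch of that enable-input variant. This is only an interpretation of the suggestion above, not code from the original post: the extra ap_uint<1> enable arguments on the helpers are hypothetical, each helper is assumed to do nothing when its enable is 0, and M is assumed to be a multiple of TM.

	int ytile0[TM][N];
	int ytile1[TM][N];
	#pragma HLS RESOURCE variable=ytile0 core=RAM_2P_BRAM
	#pragma HLS RESOURCE variable=ytile1 core=RAM_2P_BRAM

	init_ytile(ytile0);
	init_ytile(ytile1);

	ap_uint<1> oflag = 0;
	int tmm;
	for (tmm = 0; tmm <= M; tmm += TM) {       // one extra pass to drain the last tile
		ap_uint<1> do_compute = (tmm < M);     // no new tile on the final pass
		ap_uint<1> do_store   = (tmm > 0);     // nothing to store on the first pass
		if (oflag) {
			load_inputs_and_process_ytile(xin, ytile1, do_compute);
			store_ytile_then_init_ytile(yout, ytile0, tmm - TM, do_store); // offset ignored when do_store == 0
		} else {
			load_inputs_and_process_ytile(xin, ytile0, do_compute);
			store_ytile_then_init_ytile(yout, ytile1, tmm - TM, do_store);
		}
		oflag = 1 - oflag;                     // swap the buffer roles each pass
	}

With the enables, the prologue and epilogue collapse into the same loop body; whether the two calls actually overlap still depends on them not sharing any other resource.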

 

Participant akokha
Participant
389 Views
Registered: 07-08-2019

Re: VivadoHLS Cannot parallelize operations on independent BRAM memories

Hi @u4223374 ,

Thanks for your comments.

But here another question arises. Doesn't this duplicate hardware, since there are now two calls to load_inputs_and_process_ytile(), store_ytile_then_init_ytile(), etc. with different arguments?

If this is the case, it is not desirable, because many resources (DSPs, internal BRAMs, etc.) within these modules would be duplicated.

Thanks,

Ali

0 Kudos
Xilinx Employee nithink
Xilinx Employee
315 Views
Registered: 09-04-2017

Re: VivadoHLS Cannot parallelize operations on independent BRAM memories

Hi Ali,

  The function calls are mutually exclusive, so HLS will use one instance of the function and schedule as needed.
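
As an illustration only (a sketch using the function names from the code above, with the Vivado HLS form of the pragma), the ALLOCATION directive can be placed inside top_module to state that single-instance limit explicitly:

	// Sketch: cap each helper to a single shared RTL instance.
	#pragma HLS ALLOCATION instances=load_inputs_and_process_ytile limit=1 function
	#pragma HLS ALLOCATION instances=store_ytile_then_init_ytile limit=1 function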

Thanks,

Nithin

Participant akokha
Participant
260 Views
Registered: 07-08-2019

Re: VivadoHLS Cannot parallelize operations on independent BRAM memories

Hi Nithin,

It's reasonable and sensible.

Thanks a lot,
Ali

0 Kudos
Scholar u4223374
Scholar
236 Views
Registered: 04-26-2015

Re: VivadoHLS Cannot parallelize operations on independent BRAM memories

As above, HLS is pretty smart about this. It won't duplicate a function's hardware (by default) unless duplicating it is actually cheaper (e.g. if the function is so small that it takes less space/time to duplicate it than to implement the multiplexers needed to feed two different arrays into it). I've never seen it duplicate a function without being told to.
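
For completeness, a sketch of one way duplication can be requested explicitly: an INLINE pragma inside a function dissolves its hierarchy, so its logic is copied into every caller instead of being shared as one block. The helper below is hypothetical and only illustrates the pragma.

// Hypothetical small helper, not from this thread; TM and N as in the code above.
void scale_tile(int buf[TM][N], int factor)
{
	#pragma HLS INLINE
	// Inlined into each caller, so every call site gets its own copy of this logic.
	for (int i = 0; i < TM; i++)
		for (int j = 0; j < N; j++)
			buf[i][j] *= factor;
}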

 

It'd be fantastic if HLS could do this automatically. I'd really like a way of saying "ytile is a TDP BRAM. Give port A to function A, and give port B to function B, and I'll take responsibility for any collisions". That would also eliminate the multiplexers that are needed by the current approach. Sort of like what the dataflow pragma does, but dataflow is too high-level for me - I'd like more control.