Observer stefanoribes

Parallel PEs sharing partitioned array

Hi everyone,

I'm trying to parallelize a piece of code involving some processing elements (PE) that should run in parallel on different data. HLS is apparently unable to understand the parallelism and doesn't schedule the PEs to run in parallel.

 

Basically, I have a shared 2-dimensional buffer fully partitioned in the first dimension. For a certain number of iterations, each PE will operate on a different 'tile' of the buffer. Which tile to operate on, i.e. the buffer index, is stored separately for each PE. Here is a snippet of the design:

  float shared_buffer[kTiles][V / kTiles];
#pragma HLS ARRAY_PARTITION variable=shared_buffer complete dim=1
  
  int index_buffer[kNumPEs][N];
#pragma HLS ARRAY_PARTITION variable=index_buffer complete dim=1

  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < kNumPEs; ++j) {
#pragma HLS LOOP_FLATTEN
#pragma HLS PIPELINE II=1
      // The idea is that each PE gets a different index to work on.
      index_buffer[j][i] = get_buffer_tile(...);
      assert(index_buffer[j][i] < kTiles);
    }
  }

  for (int pe = 0; pe < kNumPEs; ++pe) {
#pragma HLS UNROLL
#pragma HLS DEPENDENCE variable=shared_buffer inter false
#pragma HLS DEPENDENCE variable=index inter false
    for (int n = 0; n < N; ++n) {
      for (int i = 0; i < V / kTiles; ++i) {
#pragma HLS LOOP_FLATTEN
#pragma HLS PIPELINE II=1
        if (i == 0) {
          index[pe] = index_buffer[pe][n];
          if (pe > 0) {
            assert(index[pe] != index[pe - 1]);
          }
          assert(index[pe] < kTiles);
        }
        // PE logic:
        shared_buffer[index[pe]][i] = shared_buffer[index[pe]][i] * ...;
#pragma HLS DEPENDENCE variable=shared_buffer inter false
      }
    }
  }

So, by design, we know that:

  • The index buffer is filled before all PEs start.
  • All the PEs start at the same time and have exactly the same code, and therefore the same latency and II.
  • When it's time to read from the 'shared' buffer, all PEs know their index at the same time, and the indices are always different across PEs, so the PEs should be able to run in parallel (in lockstep, so to speak).

How do I enforce parallelism on the kNumPEs loop? I tried putting asserts and DEPENDENCE pragmas throughout the code, but none of them helped.

Any suggestion or tip would be highly appreciated!

Thanks, Regards,

Stefano

 

p.s. Working with Vivado HLS 2018.3

Scholar u4223374

Re: Parallel PEs sharing partitioned array

Hmm, that's an interesting one. Normally I use the pipeline pragma to tell HLS "I really, really want this to run in parallel" (because it both unrolls sub-loops and tells you why it can't do everything fast enough - which then gives useful debugging information). In order to make that work, could we switch the order of some of your loops?

 

    for (int n = 0; n < N; ++n) {
      for (int i = 0; i < V / kTiles; ++i) {
#pragma HLS LOOP_FLATTEN
#pragma HLS PIPELINE II=1
        for (int pe = 0; pe < kNumPEs; ++pe) {
          if (i == 0) {
            index[pe] = index_buffer[pe][n];
            if (pe > 0) {
              assert(index[pe] != index[pe - 1]);
            }
            assert(index[pe] < kTiles);
          }
          // PE logic:
          shared_buffer[index[pe]][i] = shared_buffer[index[pe]][i] * ...;
        }
      }
    }

(I took out some of your dependence pragmas - better to try it without them, and add them if HLS complains). This approach will mean that HLS still unrolls the "pe" loop, but it now has to really try to pipeline the unrolled loop - and tell you why that's not possible.

Observer stefanoribes

Re: Parallel PEs sharing partitioned array

Hi, thanks for the quick reply!

So, I tried moving the PE_Loop 'below' as you suggested (swapping the order is fine btw), but it didn't work out.

Now HLS issues a warning that the PE_Loop has II=kNumPEs and Depth=V/kTiles. And on top of that, there are other warnings on the shared buffer (and a variable related to it, see the snippet below):

  • WARNING: [SCHED 204-68] The II Violation in module '...': Unable to enforce a carried dependence constraint (II = 1, distance = 1, offset = 1) between 'store' operation of variable 'mac_val.V' on local variable 'mac_val.V' and 'store' operation of variable 'mac_val.V' on local variable 'mac_val.V'
  • WARNING: [SCHED 204-69] Unable to schedule 'load' operation ('shared_buffer_0_V_loa_6') on array 'shared_buffer1[0].V' due to limited memory ports. Please consider using a memory core with more ports or partitioning the array 'shared_buffer_0_V'.

Now the code looks something like this (I included the PE logic; it's essentially a MAC):

  float shared_buffer[kTiles][V / kTiles];
#pragma HLS ARRAY_PARTITION variable=shared_buffer complete dim=1

  int index_buffer[kNumPEs][N];
#pragma HLS ARRAY_PARTITION variable=index_buffer complete dim=1

  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < kNumPEs; ++j) {
#pragma HLS LOOP_FLATTEN
#pragma HLS PIPELINE II=1
      // The idea is that each PE gets a different index to work on.
      index_buffer[j][i] = j;
    }
  }

  for (int n = 0; n < N; ++n) {
    for (int i = 0; i < V / kTiles; ++i) {
#pragma HLS LOOP_FLATTEN
#pragma HLS PIPELINE II=1
      for (int pe = 0; pe < kNumPEs; ++pe) {
        if (i == 0) {
          index[pe] = index_buffer[pe][n];
          a[pe] = a_buffer[pe][n];
          b[pe] = b_buffer[pe][n];
          if (pe > 0) {
            assert(index[pe] != index[pe - 1]);
          }
          assert(index[pe] < kTiles);
        }
        // PE logic:
        auto mac_val = shared_buffer[index[pe]][i] + a[pe] * b[pe];
#pragma HLS DEPENDENCE variable=shared_buffer inter false
        shared_buffer[index[pe]][i] = mac_val;
      }
    }
  }

I'm actually not entirely sure what to try next; perhaps creating a separate mac_val per PE...
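
Just to make that idea concrete, something like the following is what I have in mind (only a rough sketch of the per-PE registers, not verified in HLS):

  float mac_val[kNumPEs];  // one MAC register per PE
#pragma HLS ARRAY_PARTITION variable=mac_val complete dim=1

  // ... inside the pipelined (n, i) loop body:
      for (int pe = 0; pe < kNumPEs; ++pe) {
        // Each unrolled PE copy updates its own register.
        mac_val[pe] = shared_buffer[index[pe]][i] + a[pe] * b[pe];
        shared_buffer[index[pe]][i] = mac_val[pe];
      }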

In the meantime, as usual, any more comments are welcome!

BR,

Stefano

Scholar u4223374

Re: Parallel PEs sharing partitioned array

It's certainly a tricky problem!

 

The dependence thing is probably a result of using floating-point and block RAM. The delays implied by either of those mean that the read-update-write cycle on the memory is going to take a couple of clock cycles - but with the PIPELINE II=1 directive there, the loop needs to be able to begin a new iteration every cycle. This means that the next read would occur before the previous write completes, which HLS will not allow (because it'd create incorrect results). I'm not sure if there's a good solution here.
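
To make that concrete, the pattern boils down to something like this (a made-up minimal example, not your actual code; 'buf', 'idx' and 'x' are just placeholder names):

    // Made-up illustration of the carried dependence, not the real design.
    // The floating-point add plus the BRAM access takes several clock cycles,
    // so at II=1 iteration n+1 would have to read buf[idx] before iteration n
    // has written its result back - exactly the dependence HLS complains about.
    float buf[1024];
    for (int n = 0; n < N; ++n) {
#pragma HLS PIPELINE II=1
        buf[idx] = buf[idx] + x[n];  // read-update-write on the same address
    }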

 

The issue with the number of ports is, I think, just because HLS is confused by your access patterns. We discussed a fix for this one in this thread, and I'm pretty sure it would work for you too. This would at least allow all the PEs to run in parallel, even if (due to the dependence issue) they can't run at II=1.
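
I don't have the exact code from that thread at hand, but the general idea (only a sketch, and not necessarily identical to what was posted there) is to loop over the tiles instead of the PEs, so that after unrolling every partition of shared_buffer is addressed with a compile-time-constant index:

      // Sketch only: iterate over tiles, not PEs, inside the pipelined body.
      // After unrolling, shared_buffer[t][i] has a constant partition index,
      // so each partition sees at most one load and one store per iteration.
      for (int t = 0; t < kTiles; ++t) {
#pragma HLS UNROLL
        bool hit = false;
        float contrib = 0.0f;
        for (int pe = 0; pe < kNumPEs; ++pe) {
#pragma HLS UNROLL
          if (index[pe] == t) {  // at most one PE targets tile 't'
            contrib = a[pe] * b[pe];
            hit = true;
          }
        }
        if (hit) {
          shared_buffer[t][i] = shared_buffer[t][i] + contrib;
        }
      }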
