Visitor gdeest
Registered: 01-22-2015

Achieving II=1 with accesses on 2D array

Hi,

 

I am struggling to achieve II=1 when pipelining a loop nest performing accesses on a 2D array.

 

Here is a simplified version of my code illustrating the problem:

 

#include <hls_stream.h>

#define S0 256
#define S1 256

using namespace hls;
typedef float data_t;

// Offset accessor: logical index (x0,x1) maps to reuse[x0+2][x1+2],
// so offsets down to -2 stay in bounds.
#define REUSE(x0,x1) reuse[(x0)+2][(x1)+2]

void compute(int tt, stream<data_t> &out, data_t reuse[S0+2][S1+2]) {

    // Map 'reuse' onto dual-port RAMs and split it cyclically by 3 in
    // both dimensions, so the 5 reads below should land in distinct banks.
    #pragma HLS RESOURCE variable=reuse core=RAM_2P
    #pragma HLS ARRAY_PARTITION variable=reuse dim=1 cyclic factor=3
    #pragma HLS ARRAY_PARTITION variable=reuse dim=2 cyclic factor=3

    data_t d[5];
    #pragma HLS ARRAY_PARTITION variable=d complete

    for (int xx0=0; xx0<S0; xx0++) {
        for (int xx1=0; xx1<S1; xx1++) {
            #pragma HLS DEPENDENCE variable=reuse inter=false
            #pragma HLS PIPELINE II=1

            // Five reads in a 3x3 window, all at distinct (mod 3) offsets.
            d[0] = REUSE(xx0, xx1-1);
            d[1] = REUSE(xx0-1, xx1);
            d[2] = REUSE(xx0-1, xx1-1);
            d[3] = REUSE(xx0-1, xx1-2);
            d[4] = REUSE(xx0-2, xx1-1);

            data_t ret = (d[0] + d[1] + d[2] + d[3] + d[4])/5.02;
            REUSE(xx0, xx1) = ret;

            out.write(ret);
        }
    }
}

 

 

Thanks to the cyclic partitioning pragmas, the use of dual-port BRAMs, and the fact that within a single loop iteration all read/write accesses fall on different cells of a 3x3 window, I would expect Vivado HLS to schedule all the reads in parallel and achieve II=1. As it turns out, it does not seem to see that, and instead schedules them sequentially, as if all the d[i] could come from the same BRAM; this causes memory access conflicts during pipelining and prevents it from achieving II=1.

 

Is there an easy work-around, besides performing the partitioning by hand?

 

Thanks in advance,

 

Gaël

5 Replies
Participant jehandad
Registered: 06-08-2016

Re: Achieving II=1 with accesses on 2D array

If you are sure the accesses always fall on different cells, you can force a no-dependence assertion using the HLS DEPENDENCE pragma. If, however, a dependence does exist and is violated, your co-simulation will hang.
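
For illustration, here is a minimal sketch of that pattern (the array, loop bounds, and 'delta' parameter are invented for this example, not taken from your code). For any delta >= 1 the read of an element always happens before the write to it, but since delta is a runtime value the tool cannot prove that and conservatively raises the II; the pragma asserts the independence instead:

#include <hls_stream.h>

// Caller must guarantee 1 <= delta <= 256. With delta <= 0 a real
// loop-carried dependence exists, the pragma below becomes a lie,
// results are wrong and co-simulation typically hangs.
void shift_scale(float buf[512], int delta, hls::stream<float> &out) {
    for (int i = 0; i < 256; i++) {
        // Assert: no inter-iteration dependence on 'buf'.
        // HLS does not verify this claim.
        #pragma HLS DEPENDENCE variable=buf inter=false
        #pragma HLS PIPELINE II=1
        buf[i] = buf[i + delta] * 0.5f;
        out.write(buf[i]);
    }
}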

 

Regards,

Jehandad

Visitor gdeest
Registered: 01-22-2015

Re: Achieving II=1 with accesses on 2D array

Hi jehandad,

 

Thanks for your reply.

 

As a matter of fact, I am already asserting the absence of dependences on the reuse array between iterations. The pipelining issue has to do with the limited number of memory ports on each BRAM. The tool fails to see that the reads always target different BRAMs, whatever the values of the iterators, and thus schedules them sequentially instead of performing them all at once. As the reads are initiated at different pipeline stages, this causes access conflicts and prevents efficient pipelining.
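
For completeness, here is roughly what the "partitioning by hand" I mentioned would look like (the names and bank layout here are illustrative only): split reuse into a 3x3 grid of physically separate arrays, so each access in the window provably targets a distinct memory. The modulo/divide addressing is the obvious downside:

typedef float data_t;
#define S0 256
#define S1 256

// (S0+2) = (S1+2) = 258 divides evenly by 3, so each bank is 86x86.
// Logical reuse[x0][x1] lives in bank[x0 % 3][x1 % 3][x0 / 3][x1 / 3];
// any two distinct cells of a 3x3 window then land in different banks.
static data_t bank[3][3][(S0 + 2) / 3][(S1 + 2) / 3];

data_t read_reuse(int x0, int x1) {
    return bank[x0 % 3][x1 % 3][x0 / 3][x1 / 3];
}

void write_reuse(int x0, int x1, data_t v) {
    bank[x0 % 3][x1 % 3][x0 / 3][x1 / 3] = v;
}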

 

Gaël

Xilinx Employee herver
Registered: 08-17-2011

Re: Achieving II=1 with accesses on 2D array

Hi @gdeest

 

Rather than cyclic partitioning - which creates more arrays, and VHLS then fails to see that the accesses to them are mutually exclusive - it would be better to use array reshape, so that reading one location actually reads the others at the same time. That's the gist of it.

Note that you would basically store old indexes 0, 1, 2 at new index 0, old indexes 3, 4, 5 at new index 1, etc., and you would still have issues: reading 3 consecutive old indexes may straddle 2 new locations, so you would still need some local caching. Note also that addressing modulo 3 is not very HW friendly.
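
As a concrete (untested) sketch of that suggestion, using the sizes from the original post and assuming a block reshape by 3 along the column dimension:

#include <hls_stream.h>

#define S0 256
#define S1 256
typedef float data_t;

void row_sums(data_t reuse[S0 + 2][S1 + 2], hls::stream<data_t> &out) {
    // Pack old column indexes {0,1,2} into new word 0, {3,4,5} into
    // word 1, etc.: one physical read now returns 3 neighbouring values.
    #pragma HLS ARRAY_RESHAPE variable=reuse block factor=3 dim=2

    for (int r = 0; r < S0 + 2; r++) {
        for (int c = 0; c < S1; c += 3) {
            #pragma HLS PIPELINE II=1
            // These three reads fall in the same reshaped word, so they
            // cost a single access. A window starting at a non-multiple
            // of 3 (e.g. columns 2,3,4) would straddle two words and
            // need the previous word cached locally, as noted above.
            out.write(reuse[r][c] + reuse[r][c + 1] + reuse[r][c + 2]);
        }
    }
}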

 

In the general context of what you are doing, it looks like you are processing an input frame in raster scan and applying a filter on a 3x3 moving window.

Your implementation does 5 input reads per 1 output write, and you are trying to make more memory ports available rather than fixing that read-to-write imbalance.

The way to fix it is to read one pixel, store it into delay lines and into a moving window; the other values of the moving window come from the delay lines. Then apply the filter kernel on the window to produce the output value, as in the sketch below.

-> read 1 pixel & write 1 pixel.
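
To make that concrete, here is an (untested) sketch of the structure, using the sizes and the 5-point kernel from the original post; the original's in-place write-back is deliberately dropped, and border handling is skipped for brevity:

#include <hls_stream.h>

#define W 256
#define H 256
typedef float data_t;

void stencil(hls::stream<data_t> &in, hls::stream<data_t> &out) {
    // Delay lines: the previous two rows, one BRAM each.
    static data_t line0[W], line1[W];
    #pragma HLS RESOURCE variable=line0 core=RAM_2P_BRAM
    #pragma HLS RESOURCE variable=line1 core=RAM_2P_BRAM

    // 3x3 moving window, fully partitioned into registers.
    data_t win[3][3];
    #pragma HLS ARRAY_PARTITION variable=win complete dim=0

    for (int r = 0; r < H; r++) {
        for (int c = 0; c < W; c++) {
            #pragma HLS PIPELINE II=1
            data_t px = in.read();              // the single input read

            // Shift the window one column to the left.
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 2; j++)
                    win[i][j] = win[i][j + 1];

            // New right-hand column: two values from the delay lines
            // plus the fresh pixel.
            win[0][2] = line0[c];
            win[1][2] = line1[c];
            win[2][2] = px;

            // Age the delay lines for the next row.
            line0[c] = line1[c];
            line1[c] = px;

            // 5-point cross kernel, centred on win[1][1], valid once
            // two full rows and two columns have been seen.
            if (r >= 2 && c >= 2) {
                data_t ret = (win[0][1] + win[1][0] + win[1][1]
                              + win[1][2] + win[2][1]) / 5.02f;
                out.write(ret);                 // the single output write
            }
        }
    }
}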

 

PS: your mock-up algorithm is strange, as you are writing location (xx0, xx1) that you will read again in future iterations.

 

I hope this helps

- Hervé

SIGNATURE:
* New Dedicated Vivado HLS forums: http://forums.xilinx.com/t5/High-Level-Synthesis-HLS/bd-p/hls
* Readme/Guidance: http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369
Visitor gdeest
Registered: 01-22-2015

Re: Achieving II=1 with accesses on 2D array

Hi @herver,

 

Thank you very much for your reply.

 

Yes, the mock-up example is indeed a bit artificial: as you accurately observed, the dependence pragma is in fact erroneous. It did illustrate my problem quite well, though.

 

I'll see how I can apply your insights to the problem at hand.

 

Best regards,

 

Gaël

Scholar u4223374
Registered: 04-26-2015

Re: Achieving II=1 with accesses on 2D array

For what it's worth, I recently had to do something vaguely similar, and HLS's behaviour with the delay lines (as suggested by @herver) was quite spectacular. I was expecting to have to do a huge amount of manual adjustment and hard-coded optimisation to get everything working right, but HLS figured it out straight away. The result is still fairly flexible (i.e. minimal hard-coded optimisation), resource-efficient, quick to write, and fast in the clock speed it achieves.

 

The only major failure was the timing estimate: HLS (2015.4) claimed 8.78ns (the target was 6ns), but at implementation Vivado managed to hit 4.3ns. The block is running perfectly well at 200MHz.

 

Your initial approach can probably be made to work with some effort, but all those dividers (for the modulo-3 addressing) will be expensive and slow - and the extra RAM needed due to the array partitioning will be annoying too. Having two line buffers (for the previous two lines) and a single delay line (for the middle line) shouldn't be much more work, and it'll produce a far more efficient bit of hardware.