Observer y.lee0320
Registered: 01-02-2019

External DDR access is a II bottleneck ?

Hi,

I am working on an image processing project with Vivado HLS. To save BRAM, I am using external DDR as a buffer to store some information from the previous image for use on the current frame. However, it seems that memcpy (a burst read, I suppose) has a large impact on the II. Is that normal? Any suggestions? Thanks in advance.

 

void top_function(hls::stream<uint_16_side_channel> &inStream, hls::stream<uint_8_side_channel> &outStream, int width, int height, unsigned int DRAM[ 135 ][ 240 ])
{
    for (Row=0;Row<1080;Row++) {
        for (Col=0;Col<1920;Col++)  {
            if( Col == W - 1 ){
                memcpy((unsigned char *)&Mean_Buf_ITP[7][0], (unsigned char *) &DRAM[ MN_Row + 4 ][0], 240);
            }

            //
        }  // for Col
    }  // for Row
}

 

WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 72, distance = 1, offset = 1)
between bus read on port 'DRAM' (ExtraProc.cpp:526) and bus read on port 'DRAM' (ExtraProc.cpp:448).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 103, distance = 1, offset = 1)
between bus read on port 'DRAM' (ExtraProc.cpp:526) and bus read on port 'DRAM' (ExtraProc.cpp:448).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 118, distance = 1, offset = 1)
between bus read on port 'DRAM' (ExtraProc.cpp:526) and bus read on port 'DRAM' (ExtraProc.cpp:448).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 120, distance = 1, offset = 1)
between bus read on port 'DRAM' (ExtraProc.cpp:526) and bus read on port 'DRAM' (ExtraProc.cpp:448).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 121, distance = 1, offset = 1)
between bus read on port 'DRAM' (ExtraProc.cpp:526) and bus read on port 'DRAM' (ExtraProc.cpp:448).
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 122, Depth = 198.


Accepted Solutions
Scholar u4223374
Registered: 04-26-2015

Re: External DDR access is a II bottleneck ?

I'd just ignore that warning. Failing to flatten the loop will cost you ... maybe 1% performance. Probably less.

 

The memcpy is taking around 120 cycles to run (I assume it's 120 cycles of data transfer plus a cycle or two of overhead to set up the transfer). The HLS analysis goes like this:

How long does an iteration of the inner loop take to run? When Col != (W-1), it takes one cycle. When Col == W-1, it takes 122 cycles (because it has to perform that whole memcpy operation). So the best II that can be met in every case is II=122.

How long does an iteration of the outer loop take to run? This is where HLS is not particularly smart. It's saying "well, one iteration takes up to 122 cycles, and there are 1920 iterations, so the total time is going to be 234240 cycles".

How long does the outer loop take to run? This is going to be (to a good approximation) 1080 times the time for one iteration. 234240 * 1080 = 252,979,200 cycles.

 

By moving the memcpy outside the inner loop, the analysis goes like this:

How long does an iteration of the inner loop take to run? It's always 1 cycle, nothing else takes any longer.

How long does an iteration of the outer loop take to run? It'll be 1920 iterations of the inner loop (1920 * 1 = 1920) plus one memcpy (122 cycles). Total time is 2042 cycles.

How long does the outer loop take to run? This is going to be (to a good approximation) 1080 times the time for one iteration. 2042 * 1080 = 2,205,360 cycles. Somewhat more than a hundred times quicker than before.
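
If you want to sanity-check the two estimates, here is a minimal sketch of the arithmetic in plain C++. The constant and function names are made up for illustration, and the 122-cycle memcpy cost is just the figure from the synthesis log above, not something the code measures. With the memcpy hoisted out, each row costs 1920 + 122 = 2042 cycles.

```cpp
#include <cstdint>

// Figures taken from the thread: 1080 rows, 1920 columns,
// and roughly 122 cycles for one memcpy (the reported Final II).
constexpr std::uint64_t kRows = 1080;
constexpr std::uint64_t kCols = 1920;
constexpr std::uint64_t kMemcpyCycles = 122;

// memcpy inside the inner loop: every inner iteration is costed at the
// worst case, so each row costs kCols * kMemcpyCycles cycles.
constexpr std::uint64_t cycles_memcpy_inside() {
    return kRows * (kCols * kMemcpyCycles);
}

// memcpy moved to the outer loop: the inner loop runs at II=1, and each
// row pays for exactly one memcpy on top of its kCols pixel iterations.
constexpr std::uint64_t cycles_memcpy_outside() {
    return kRows * (kCols * 1 + kMemcpyCycles);
}

// Integer speed-up ratio between the two estimates.
constexpr std::uint64_t speedup() {
    return cycles_memcpy_inside() / cycles_memcpy_outside();
}
```

These are estimates of what the scheduler reports, not measurements of the hardware; the real latency also depends on the DDR controller and the interconnect.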

 

I honestly don't know if HLS will actually use 122 cycles when Col != (W-1). Maybe it'll see that there's no more work to do and move on to the next iteration straight away. However, it's also possible that the pipelining is much easier to organize if it always takes exactly the same time (regardless of whether the memcpy occurs).

 

Replies
Observer y.lee0320
Registered: 01-02-2019

Re: External DDR access is a II bottleneck ?

Sorry, I forgot to mention: here are line 526 and line 448. They are in the same loop.

 

line 526:

    memcpy((unsigned char *)&Mean_Line_Buf[0], (unsigned char *) &DRAM[ MN_Row ][0], 240);

line 448:         

    memcpy((unsigned char *)&Mean_Buf_ITP[7][0], (unsigned char *) &DRAM[ MN_Row + 4 ][0], 240);

Scholar u4223374
Registered: 04-26-2015

Re: External DDR access is a II bottleneck ?

When HLS is trying to find the II for a loop, it calculates the most stuff that could possibly happen in a single loop iteration - and that defines the II. As far as I can tell, it doesn't do min/max as it does for function latency. I'm not sure whether it'll actually run the loop faster if the slow operations are skipped in one iteration, or whether it'll enforce the same II always.

 

In this case, you're copying 240 bytes, which HLS seems to be handling as 16-bit transfers (not sure why...) - so 120 elements. It might only happen once per row, but that still sets the worst case and therefore determines the calculated II.

 

The easy change here would be to just move this outside the inner loop. Does it actually matter if the copy happens at a specific column, rather than after each row is finished? That'll get the II back down to 1 for the inner loop, and add ~120 to the II for the outer loop (which is already going to be II=1920 or so anyway, so an extra 120 won't hurt too much).
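
A rough sketch of that restructuring, written as plain C++ so it compiles without the HLS headers. Everything not in the original post is an assumption: pixel_work() and process_frame() are hypothetical names, the per-pixel work is a placeholder, and the row / 8 mapping is only a guess based on one mean row per 8 image rows (the original uses MN_Row, whose definition isn't shown).

```cpp
#include <cstring>

constexpr int kRows = 1080;
constexpr int kCols = 1920;
constexpr int kDramRows = 135;   // DRAM[135][240], as in the original signature
constexpr int kDramCols = 240;
constexpr int kLineBytes = 240;  // byte count used by the original memcpy

// Hypothetical stand-in for the real per-pixel processing (not shown in the post).
static inline unsigned pixel_work(unsigned acc, int row, int col) {
    return acc + static_cast<unsigned>((row ^ col) & 1);
}

unsigned process_frame(const unsigned int dram[kDramRows][kDramCols],
                       unsigned char line_buf[kLineBytes]) {
    unsigned acc = 0;
    for (int row = 0; row < kRows; ++row) {
        for (int col = 0; col < kCols; ++col) {
            // In Vivado HLS, "#pragma HLS PIPELINE II=1" would go here.
            // The loop body no longer contains the memcpy, so nothing in
            // it forces a worst-case II of ~122.
            acc = pixel_work(acc, row, col);
        }
        // One burst read per row, issued after the pixel loop has finished.
        const int mn_row = row / 8;  // assumption: 8 image rows per mean row
        std::memcpy(line_buf, &dram[mn_row][0], kLineBytes);
    }
    return acc;
}
```

The only structural change is that the conditional memcpy has become an unconditional statement in the outer loop body, so the inner loop is a plain per-pixel loop again.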

 

With a small amount of extra buffering, you could set this up so that the memcpy occurs while the next line is being processed, eliminating that extra delay on the outer loop.
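
Something like the following ping-pong arrangement is what I have in mind. This is a sketch only, in plain C++: consume_line() and process_with_prefetch() are hypothetical names, and I'm assuming whole rows of 240 unsigned ints are copied (the original copies 240 bytes). In software the prefetch and the processing simply run one after the other; whether the tool actually overlaps them in hardware depends on how the design is structured (for example with a DATAFLOW region), so check the schedule.

```cpp
#include <cstring>

constexpr int kMeanRows = 135;
constexpr int kMeanCols = 240;

// Hypothetical consumer: stands in for whatever processing uses one buffered row.
static unsigned consume_line(const unsigned int line[kMeanCols]) {
    unsigned sum = 0;
    for (int c = 0; c < kMeanCols; ++c) {
        sum += line[c];
    }
    return sum;
}

unsigned process_with_prefetch(const unsigned int dram[kMeanRows][kMeanCols]) {
    unsigned int buf[2][kMeanCols];  // ping-pong pair of line buffers
    unsigned total = 0;

    // Prime the first buffer before the loop starts.
    std::memcpy(buf[0], dram[0], sizeof(buf[0]));

    for (int r = 0; r < kMeanRows; ++r) {
        const int cur = r & 1;   // buffer holding the row being processed
        const int nxt = cur ^ 1; // buffer being filled for the next row

        // Fetch row r+1 into the idle buffer. It does not touch buf[cur],
        // so it is independent of the processing below.
        if (r + 1 < kMeanRows) {
            std::memcpy(buf[nxt], dram[r + 1], sizeof(buf[nxt]));
        }

        total += consume_line(buf[cur]);
    }
    return total;
}
```

The cost is one extra line buffer; in exchange, the fetch for the next row no longer has to wait for the current row to finish.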

Observer y.lee0320
Registered: 01-02-2019

Re: External DDR access is a II bottleneck ?

Thanks, it does help. The warning "Unable to enforce a carried dependence constraint" disappears. However, the outer loop then fails to flatten, and so fails to pipeline, with the following warning message ...

WARNING: [XFORM 203-542] Cannot flatten a loop nest 'Loop-3' (ExtraProc.cpp:263:7) in function 'YYYY2RGB' :
more than one sub loop.

 

I am still curious why moving the memcpy outside the inner loop helps. What is the difference between putting memcpy() inside and outside the inner loop, from a synthesis point of view?

Observer y.lee0320
Registered: 01-02-2019

Re: External DDR access is a II bottleneck ?

Thanks a lot.

The buffer in DRAM (uint32 DRAM[135][240]) is used as a mean buffer. Each entry in the array is the calculated mean of an 8-column by 8-row block of pixels from the previous image frame. I also keep an 8*240 array in my design as a line buffer of the DRAM data, to do some complicated operations in a convolutional way. And I keep a 1*240 line buffer as well, to hold the calculated means for the current frame and avoid mistakenly overwriting the DRAM data. Basically, all the DRAM accesses are in sequential order, convolution-style. So single uint32 accesses (instead of burst reads/writes), with a FIFO between DRAM and the top-level module, might be a better approach. That way, the outer loop can be pipelined and the code structure will be more reasonable. What do you think?
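
To make the idea concrete, here is a minimal sketch of what I mean, in plain C++. std::queue is only a software stand-in for an hls::stream FIFO, and read_means() and consume_means() are hypothetical names, not functions from my design. The point is just that the reader touches DRAM one word at a time in address order, and the consumer only ever sees the FIFO.

```cpp
#include <queue>

constexpr int kMeanRows = 135;
constexpr int kMeanCols = 240;

// Producer: reads DRAM one uint32 at a time, strictly in address order,
// and pushes each word into the FIFO.
void read_means(const unsigned int dram[kMeanRows][kMeanCols],
                std::queue<unsigned int>& fifo) {
    for (int r = 0; r < kMeanRows; ++r) {
        for (int c = 0; c < kMeanCols; ++c) {
            fifo.push(dram[r][c]);
        }
    }
}

// Consumer: pops one word per iteration. It never accesses DRAM directly,
// so its loop contains no bus transaction at all.
unsigned long long consume_means(std::queue<unsigned int>& fifo) {
    unsigned long long sum = 0;
    while (!fifo.empty()) {
        sum += fifo.front();
        fifo.pop();
    }
    return sum;
}
```

Since the reader loop is a simple sequential walk over the array, the tool may still be able to infer a burst from it, but I would need to confirm that in the synthesis report.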
