y.lee0320 (Participant)

Is external DDR access an II bottleneck?


Hi,

I am working on an image processing project with Vivado HLS. To save BRAM, I am using external DDR as a buffer to hold some information from the previous frame that is needed for the current frame. However, it seems that the memcpy (a burst read, I assume) has a large impact on the II. Is that normal? Any suggestions? Thanks in advance.

 

void top_function(hls::stream<uint_16_side_channel> &inStream,
                  hls::stream<uint_8_side_channel>  &outStream,
                  int width, int height,
                  unsigned int DRAM[135][240])
{
    for (int Row = 0; Row < 1080; Row++) {
        for (int Col = 0; Col < 1920; Col++) {
            // Refill one line of the mean buffer from external DDR
            // once per row, at the last column.
            if (Col == W - 1) {
                memcpy((unsigned char *)&Mean_Buf_ITP[7][0],
                       (unsigned char *)&DRAM[MN_Row + 4][0], 240);
            }

            // ... per-pixel processing ...

        }  // for Col
    }  // for Row
}

 

WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 72, distance = 1, offset = 1)
between bus read on port 'DRAM' (ExtraProc.cpp:526) and bus read on port 'DRAM' (ExtraProc.cpp:448).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 103, distance = 1, offset = 1)
between bus read on port 'DRAM' (ExtraProc.cpp:526) and bus read on port 'DRAM' (ExtraProc.cpp:448).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 118, distance = 1, offset = 1)
between bus read on port 'DRAM' (ExtraProc.cpp:526) and bus read on port 'DRAM' (ExtraProc.cpp:448).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 120, distance = 1, offset = 1)
between bus read on port 'DRAM' (ExtraProc.cpp:526) and bus read on port 'DRAM' (ExtraProc.cpp:448).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 121, distance = 1, offset = 1)
between bus read on port 'DRAM' (ExtraProc.cpp:526) and bus read on port 'DRAM' (ExtraProc.cpp:448).
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 122, Depth = 198.

y.lee0320 (Participant)

Sorry, I forgot to mention: here are lines 526 and 448; they are in the same loop.

 

line 526:

    memcpy((unsigned char *)&Mean_Line_Buf[0],   (unsigned char *)&DRAM[MN_Row][0],     240);

line 448:

    memcpy((unsigned char *)&Mean_Buf_ITP[7][0], (unsigned char *)&DRAM[MN_Row + 4][0], 240);

u4223374 (Advisor)

When HLS is trying to find the II for a loop, it calculates the most stuff that could possibly happen in a single loop iteration - and that defines the II. As far as I can tell, it doesn't do min/max as it does for function latency. I'm not sure whether it'll actually run the loop faster if the slow operations are skipped in one iteration, or whether it'll enforce the same II always.

 

In this case, you're copying 240 bytes, which HLS seems to be handling as 16-bit transfers (not sure why...) - so 120 elements. The copy might only happen once per row, but it still sets the worst case and therefore determines the reported II.

 

The easy change here would be to just move the memcpy outside the inner loop. Does it actually matter whether the transfer happens at a specific column, rather than after each row is finished? That gets the II back down to 1 for the inner loop, and adds ~120 cycles to each outer-loop iteration (which is already going to take ~1920 cycles anyway, so an extra 120 won't hurt much).
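
For example, something like this (just a sketch - Mean_Buf_ITP, MN_Row and the per-pixel work are stand-ins from your code, and whether the burst belongs before or after the pixel loop depends on when that line of the mean buffer is actually consumed):

    for (int Row = 0; Row < 1080; Row++) {
        // One burst per row, hoisted out of the pixel loop.
        memcpy((unsigned char *)&Mean_Buf_ITP[7][0],
               (unsigned char *)&DRAM[MN_Row + 4][0], 240);

        for (int Col = 0; Col < 1920; Col++) {
    #pragma HLS PIPELINE II=1
            // ... per-pixel processing, now with no DDR access inside ...
        }
    }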

 

With a small amount of extra buffering, you could set this up so that the memcpy occurs while the next line is being processed, eliminating that extra delay on the outer loop.
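
Roughly like this (again only a sketch - the ping-pong indexing and the way the next row's DRAM index is computed are assumptions, and you'd want to check the schedule viewer, or wrap things in a DATAFLOW region, to confirm that HLS really overlaps the burst with the pixel loop):

    unsigned char line_buf[2][240];               // ping-pong copies of one 240-byte line
    #pragma HLS ARRAY_PARTITION variable=line_buf complete dim=1

    // Pre-load the line needed by the first row; index 0 is a placeholder for
    // however MN_Row is really derived.
    memcpy(&line_buf[0][0], (unsigned char *)&DRAM[0][0], 240);

    for (int Row = 0; Row < 1080; Row++) {
        int cur  = Row & 1;                       // buffer this row reads from
        int next = 1 - cur;                       // buffer being filled for the next row

        // Fetch the next row's line while the pixel loop below works on 'cur'.
        if (Row < 1079) {
            int next_mn_row = (Row + 1) / 8;      // placeholder index calculation
            memcpy(&line_buf[next][0], (unsigned char *)&DRAM[next_mn_row][0], 240);
        }

        for (int Col = 0; Col < 1920; Col++) {
    #pragma HLS PIPELINE II=1
            // ... per-pixel processing reading line_buf[cur][...] ...
        }
    }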

y.lee0320 (Participant)

Thanks, it does help. The "Unable to enforce a carried dependence constraint" warning disappears; however, the outer loop then fails to flatten, and therefore fails to pipeline, with the following warning message ...

WARNING: [XFORM 203-542] Cannot flatten a loop nest 'Loop-3' (ExtraProc.cpp:263:7) in function 'YYYY2RGB' :

more than one sub loop.

 

I am still curious why moving the memcpy outside the inner loop helps. What is the difference, from a synthesis point of view, between putting memcpy() inside and outside the inner loop?

u4223374 (Advisor) - Accepted Solution

I'd just ignore that warning. Failing to flatten the loop will cost you ... maybe 1% performance. Probably less.

 

The memcpy is taking around 120 cycles to run (I assume it's 120 cycles of data transfer plus a cycle or two of overhead to set up the transfer). The HLS analysis goes like this:

How long does an iteration of the inner loop take to run? When Col != (W-1), it takes one cycle. When Col == W-1, it takes 122 cycles (because it has to perform that whole memcpy operation). So the best II that can be met in every case is II=122.

How long does an iteration of the outer loop take to run? This is where HLS is not particularly smart. It's saying "well, one iteration takes up to 122 cycles, and there are 1920 iterations, so the total time is going to be 234240 cycles".

How long does the outer loop take to run? This is going to be (to a good approximation) 1080 times the time for one iteration. 234240 * 1080 = 252,979,200 cycles.

 

By moving the memcpy outside the inner loop, the analysis goes like this:

How long does an iteration of the inner loop take to run? It's always 1 cycle, nothing else takes any longer.

How long does an iteration of the outer loop take to run? It'll be 1920 iterations of the inner loop (1920 * 1 = 1920) plus one memcpy (122 cycles). Total time is 2042 cycles.

How long does the outer loop take to run? This is going to be (to a good approximation) 1080 times the time for one iteration. 2042 * 1080 = 2,205,360 cycles. Somewhat more than a hundred times quicker than before.
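
As a quick back-of-envelope check on those numbers (standalone arithmetic, not HLS output):

    #include <cstdio>

    int main() {
        // memcpy inside the pixel loop: worst-case II applied to every pixel.
        const long long row_before   = 1920LL * 122;        // 234,240 cycles per row
        const long long frame_before = row_before * 1080;   // 252,979,200 cycles per frame

        // memcpy hoisted out of the pixel loop: one burst per row.
        const long long row_after    = 1920LL * 1 + 122;    // 2,042 cycles per row
        const long long frame_after  = row_after * 1080;    // 2,205,360 cycles per frame

        std::printf("before = %lld, after = %lld, speedup = %.1fx\n",
                    frame_before, frame_after, (double)frame_before / frame_after);
        return 0;
    }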

 

I honestly don't know if HLS will actually use 122 cycles when Col != (W-1). Maybe it'll see that there's no more work to do and move on to the next iteration straight away. However, it's also possible that the pipelining is much easier to organize if it always takes exactly the same time (regardless of whether the memcpy occurs).

 

y.lee0320 (Participant)

Thanks a lot.

The buffer in the DRAM space (the uint32 array DRAM[135][240]) is used as a mean buffer: each entry is the calculated mean of an 8x8 (8 columns by 8 rows) block of pixels from the previous frame. In the design I also keep an 8*240 array as a line buffer of the DRAM data, to do some fairly complicated operations in a convolutional way, plus a 1*240 line buffer that holds the calculated means for the current frame so that I don't overwrite the DRAM data too early. Basically, all DRAM accesses are sequential, following the convolution order. So a single uint32 access per element (instead of burst read/write), with a FIFO between DRAM and the top-level module, might be a better approach: that way the outer loop can be pipelined and the code structure becomes more reasonable. What do you think?
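
For concreteness, a rough sketch of the structure I have in mind (the function names, the stream, and the read pattern in the consumer are placeholders, not actual code):

    #include <hls_stream.h>

    // Reader: streams the mean array out of DDR in sequential order.
    static void dram_reader(const unsigned int DRAM[135][240],
                            hls::stream<unsigned int> &mean_fifo) {
        for (int r = 0; r < 135; r++) {
            for (int c = 0; c < 240; c++) {
    #pragma HLS PIPELINE II=1
                mean_fifo.write(DRAM[r][c]);      // sequential access, easy to burst
            }
        }
    }

    // Consumer: pops one mean per 8x8 block, so the pixel loops never touch the
    // DDR bus directly. The number of reads must match the number of writes
    // above, otherwise the dataflow region will stall.
    static void process(hls::stream<unsigned int> &mean_fifo
                        /* plus the existing pixel streams */) {
        for (int Row = 0; Row < 1080; Row++) {
            for (int Col = 0; Col < 1920; Col++) {
    #pragma HLS PIPELINE II=1
                unsigned int mean = 0;
                if ((Row % 8 == 0) && (Col % 8 == 0))
                    mean = mean_fifo.read();      // 135 * 240 reads in total
                // ... per-pixel processing using 'mean' ...
            }
        }
    }

    void top_with_fifo(const unsigned int DRAM[135][240]) {
    #pragma HLS DATAFLOW
        hls::stream<unsigned int> mean_fifo("mean_fifo");
    #pragma HLS STREAM variable=mean_fifo depth=240
        dram_reader(DRAM, mean_fifo);
        process(mean_fifo);
    }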
