Adventurer
431 Views
Registered: ‎04-11-2018

Loop splitting to improve pipelining

Hi,

I have two for loops with 3000 and 100 iterations, and I would like to know how to split them to decrease the processing time.

Currently, the loops take almost 5 seconds to process the image. Can someone please suggest a way to decrease the processing time and make it real time?

Thanks!

 

0 Kudos
8 Replies
Xilinx Employee
421 Views
Registered: ‎09-05-2018

Re: Loop splitting to improve pipelining

Hey @pranju,

Do you think you could provide some more details and maybe some code? With just the information you've provided, I think the most helpful thing I can do is point you to the Vivado HLS Optimization Methodology Guide, UG1270.

Nicholas Moellers

Xilinx Worldwide Technical Support
0 Kudos
Explorer
409 Views
Registered: ‎07-18-2018

Re: Loop splitting to improve pipelining

Are these loops nested? I would pipeline the top loop, and unroll the inner loop.

But there might be better ways to do it if we know what the loops look like. You don't need to give the full code; pseudocode of the loops and a description of the kind of operation done in each loop would be enough.

E.g.:

 

for (i = 0; i < 3000; i++) {
    num = num + other_num;
    for (j = 0; j < 100; j++) {
        num = num * other_num;
    }
}

It also helps to know whether the loops read from an array a lot, or do a lot of multiplies, divides, or adds, and whether it makes sense for the loops to be the sizes they are.

But to start: generally pipeline top-level loops and unroll inner loops.
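As a rough illustration of that rule of thumb, here is a minimal sketch that pipelines an outer loop and fully unrolls the inner one. The function name row_sums, the constants OUTER and INNER, and the row-sum operation itself are assumptions made up for the example; only the loop sizes come from the question. An ordinary C++ compiler ignores the HLS pragmas, so the function can be checked in software first.

```cpp
#include <cstdint>

// Hypothetical sizes, chosen to match the 3000 x 100 loops in the question.
constexpr int OUTER = 3000;
constexpr int INNER = 100;

// Sums each row of a 2-D array (an invented workload for illustration).
void row_sums(const uint8_t in[OUTER][INNER], uint32_t out[OUTER]) {
// Split the inner dimension into separate elements so that all INNER
// values of a row can be read in the same cycle once the loop is unrolled.
#pragma HLS ARRAY_PARTITION variable=in complete dim=2
    for (int i = 0; i < OUTER; i++) {
// Ask the tool to start a new outer iteration every clock cycle.
#pragma HLS PIPELINE II=1
        uint32_t sum = 0;
        for (int j = 0; j < INNER; j++) {
// Replicate the loop body INNER times instead of iterating.
#pragma HLS UNROLL
            sum += in[i][j];
        }
        out[i] = sum;
    }
}
```

Whether the requested initiation interval is actually met depends on the memory layout: an unrolled inner loop needs as many parallel reads as it has iterations, which is why the sketch partitions the array along the inner dimension.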

 

0 Kudos
Adventurer
376 Views
Registered: ‎04-11-2018

Re: Loop splitting to improve pipelining

Hi,

The code goes like this. I don't understand how to split the loops further to decrease the processing time. Initially, I tried pipelining the outer loop and unrolling the inner loop, but there were errors reporting an enormous number of load/store instructions. Can someone please tell me how to solve this?

Thank you,

if(col2 == 30 ){ 
                    for(int hist_ii=0;hist_ii<3000;hist_ii++){

#pragma HLS pipeline
#pragma HLS loop_merge               
#pragma HLS INTERFACE axis port=buf_LBP
#pragma HLS INTERFACE axis port=buf_HSV
#pragma HLS DEPENDENCE variable=buf_LBP array intra false
#pragma HLS DEPENDENCE variable=buf_HSV array intra false
#pragma HLS latency max=3

                        {
                            #pragma HLS latency min=0 max=0
                            #pragma HLS pipeline
                            hist_i = hist_ii%30;                                    
                            hist_j = hist_ii/30;
                        }

                        {
                            #pragma HLS latency min=0 max=0
                            #pragma HLS pipeline
                            bin_pointer = buf_LBP[hist_j][ col2 - hist_i];
                              pi2=buf_HSV[hist_j][col2 - hist_i]; // H
                        }
                        {
                            #pragma HLS latency min=0 max=0
                            #pragma HLS pipeline
                              histogram_virtrude[bin_pointer] =  histogram_virtrude[bin_pointer] + 1 ;
                               histogram_virtrude[59+pi2] =  histogram_virtrude[59+pi2] + 1 ;
                        }
                    }
                    P0= histogram_virtrude[1];
                }else if(col2 < (SV_WIDTH-30) ){
                    
                
                    for(int hist_ii=0;hist_ii<100;hist_ii++){

                    #pragma HLS LOOP_TRIPCOUNT min=100 max=100
                    #pragma HLS pipeline
                    #pragma HLS loop_merge
                    #pragma HLS INTERFACE axis port=buf_LBP
                   #pragma HLS INTERFACE axis port=buf_HSV
                    #pragma HLS DEPENDENCE variable=buf_LBP array intra false
                    #pragma HLS DEPENDENCE variable=buf_HSV array intra false
                    #pragma HLS latency max=4

                                            {
                                            // lbp
                                                #pragma HLS latency min=0 max=0
                                                #pragma HLS pipeline
                                                bin_pointer = buf_LBP[hist_ii][ col2 - 30];
                                                  pi2=buf_HSV[hist_ii][col2 - 30]; // H
                                            }
                                            {
                                                #pragma HLS latency min=0 max=0
                                                #pragma HLS pipeline
                                                  histogram_virtrude[bin_pointer] =  histogram_virtrude[bin_pointer] + 1 ;
                                                   histogram_virtrude[59+pi2] =  histogram_virtrude[59+pi2] + 1 ;
                                            }
                                            {
                                            // lbp
                                                #pragma HLS latency min=0 max=0
                                                #pragma HLS pipeline
                                                bin_pointer = buf_LBP[hist_ii][ col2 ];
                                                  pi2=buf_HSV[hist_ii][col2 ]; // H
                                            }
                                            {
                                                #pragma HLS latency min=0 max=0
                                                #pragma HLS pipeline
                                                  histogram_virtrude[bin_pointer] =  histogram_virtrude[bin_pointer] - 1 ;
                                                   histogram_virtrude[59+pi2] =  histogram_virtrude[59+pi2] - 1 ;
                                            }}
                    P0= histogram_virtrude[1];

                }else{
                    P0 = 0;
                }
                data_hist.write(P0);
            }

0 Kudos
Scholar u4223374
369 Views
Registered: ‎04-26-2015

Re: Loop splitting to improve pipelining

@pranju As far as I can tell, at one specific column (in an image? So it'll happen on every row at that column) you want to do 3000 loop iterations. That's obviously going to be very slow. For most of the rest of the columns you're asking for 100 iterations, which will also be slow. My general rule is to do one iteration per pixel ... or two if I really have to (often required for histogram construction).

Without fully understanding your code, it appears that you read from the buf_LBP and buf_HSV arrays to get a total of 200 pointers (2 for each of 100 iterations). With each of those pointers you increment an element in the histogram. Then you get 200 more pointers (2 for each of 100 iterations) corresponding to a place 30 columns further on and decrement those histogram elements.

Would it be reasonable to just buffer data for 30 columns in a shift register, so you can do the decrement and increment without hitting the array? Then just write the result to the array after you last access the data.
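To make the shift-register idea concrete, here is a simplified one-dimensional sketch: it keeps the last 30 pixels of a stream in a small shift register and updates a histogram by incrementing the bin of the pixel entering the window and decrementing the bin of the pixel leaving it. The function name window_bin_count, the constants, and the 8-bit pixel range are assumptions for illustration; the code in the question works on a 100 x 30 region, which this sketch does not reproduce.

```cpp
#include <cstdint>

constexpr int WINDOW = 30;   // window width taken from the thread
constexpr int BINS   = 256;  // assumption: 8-bit pixel values

// For every input position, writes how many of the most recent WINDOW
// pixels fall into the bin `bin`.
void window_bin_count(const uint8_t in[], int n, uint8_t bin, uint16_t out[]) {
    uint8_t window[WINDOW] = {};  // shift register of recent pixels
// Keep the shift register in individual registers, not in a memory.
#pragma HLS ARRAY_PARTITION variable=window complete
    uint16_t hist[BINS] = {};

    for (int i = 0; i < n; i++) {
        uint8_t incoming = in[i];
        uint8_t outgoing = window[WINDOW - 1];  // oldest pixel in the window

        // Shift everything along by one position and insert the new pixel.
        for (int k = WINDOW - 1; k > 0; k--) {
#pragma HLS UNROLL
            window[k] = window[k - 1];
        }
        window[0] = incoming;

        hist[incoming]++;      // pixel entering the window
        if (i >= WINDOW) {
            hist[outgoing]--;  // pixel leaving the window
        }
        out[i] = hist[bin];
    }
}
```

The point of the design is that each pixel costs one increment and at most one decrement, with the outgoing value coming from registers and not from a second read of the large buffer. The sketch still performs two histogram updates per pixel, so it does not by itself resolve port conflicts on the histogram memory.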

0 Kudos
Adventurer
356 Views
Registered: ‎04-11-2018

Re: Loop splitting to improve pipelining

@u4223374

Thank you so much for your reply. Yes, I am trying to perform one iteration per pixel to construct a histogram. Initially, I had two for loops with 30 and 100 iterations, which took a bit longer. So I merged them into one loop of 3000 iterations, thinking this would reduce the processing time.

Okay, just to clarify: I need one more buffer storing the 30 columns, instead of doing the increment and decrement in both loops?

I would also like to know whether loop splitting will help further to make it real time. If yes, could you please tell me how?

0 Kudos
Scholar u4223374
326 Views
Registered: ‎04-26-2015

Re: Loop splitting to improve pipelining

@pranju For what it's worth, I have never found a good way of doing even a simple image histogram at 1 pixel/cycle (which is not to say that I haven't found a way, but it's not pretty). 1 pixel every second cycle is trivial, so it may make sense to aim for that and then double your clock speed.

 

Would it be possible to reorganize this so you process the whole image into one or more histograms, and then afterwards do the analysis on those histograms? That way the operations performed at every pixel can be very small, and you have maybe a 5000-cycle overhead (irrelevant compared to the size of the image) for analysis at the end.
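A minimal sketch of that split, combined with the "one pixel every second cycle" target mentioned above, might look like the following. The names image_histogram and dominant_bin, the tiny image size, and the choice of "find the fullest bin" as the analysis step are all assumptions for illustration; whether the tool actually achieves II=2 depends on the device and clock.

```cpp
#include <cstdint>

constexpr int HEIGHT = 4;    // hypothetical small image
constexpr int WIDTH  = 8;
constexpr int BINS   = 256;  // assumption: 8-bit pixel values

// Pass 1: accumulate the histogram, doing very little work per pixel.
void image_histogram(const uint8_t pixel[HEIGHT][WIDTH], uint32_t hist[BINS]) {
    for (int b = 0; b < BINS; b++) {
        hist[b] = 0;
    }
    for (int i = 0; i < HEIGHT; i++) {
        for (int j = 0; j < WIDTH; j++) {
// Request one pixel every second cycle, leaving a cycle for the
// read-modify-write on the histogram memory.
#pragma HLS PIPELINE II=2
            uint8_t val = pixel[i][j];
            hist[val] = hist[val] + 1;
        }
    }
}

// Pass 2: analysis that runs once per image, after accumulation.
// Returns the index of the fullest bin (lowest index wins on ties).
int dominant_bin(const uint32_t hist[BINS]) {
    int best = 0;
    uint32_t best_count = hist[0];
    for (int b = 1; b < BINS; b++) {
#pragma HLS PIPELINE II=1
        uint32_t count = hist[b];
        if (count > best_count) {
            best = b;
            best_count = count;
        }
    }
    return best;
}
```

The second pass costs a few hundred cycles per image here, regardless of how many pixels the image has.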

 

A buffer storing 30 columns of data that get shifted along one space for each pixel would make sense. I've used this layout in the past (although not for histograms) and it's worked very well.

0 Kudos
Xilinx Employee
314 Views
Registered: ‎01-09-2008

Re: Loop splitting to improve pipelining

The problem with the histogram is that you need to read the histogram value for the current pixel value, increment it, and write it back.

for(int i=0;i<Height;i++)
    for(int j=0;j<Width;j++)
        {
        val = pixel[i][j];
        hist[val] = hist[val] + 1;  // read, increment, write back
        }

The problem is that the histogram is stored in a BRAM. If 2 consecutive pixels have the same value, you have to read, write, read, write at the same address in 2 clock cycles.

In order to avoid this memory access burden you must avoid this scheme.

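oldval = 0;
acc = 0;
for(int i=0;i<Height;i++)
    for(int j=0;j<Width;j++)
        {
        val = pixel[i][j];
        if(oldval == val)
            acc++;              // same value as before: count in the register
        else
            {
            hist[oldval] = acc; // flush the count of the previous value
            acc = hist[val]+1;  // load the count of the new value
            }
        oldval = val;           // remember the value for the next pixel
        }
hist[oldval] = acc;             // flush the last run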

With this code we use a temporary variable acc to hold the running count for the old value:

if (oldval == val) acc++;

Then we update the real histogram only when we get a new value val:

else { hist[oldval] = acc;

In this case we also reload acc with the count for the new value val, plus one for the current pixel, and remember val as the old value:

acc = hist[val]+1; } oldval = val;

With this coding style we can never get two writes to the same address twice in a row, because hist is only written when val differs from oldval.

You still need to issue a DEPENDENCE false directive to tell VHLS that the address will never be twice the same in the loop.
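Putting the pieces together, a self-contained version of the scheme might look like the sketch below. The function name histogram_cached and the small image size are assumptions for illustration. The exact form of the directive is also an assumption: the sketch uses the intra RAW false variant, on the reasoning that within one iteration the write goes to hist[oldval] and the read to hist[val], and the else branch is only taken when those differ. Check it against the tool documentation before relying on it.

```cpp
#include <cstdint>

constexpr int HEIGHT = 4;    // hypothetical small image
constexpr int WIDTH  = 8;
constexpr int BINS   = 256;  // assumption: 8-bit pixel values

void histogram_cached(const uint8_t pixel[HEIGHT][WIDTH], uint32_t hist[BINS]) {
    for (int b = 0; b < BINS; b++) {
        hist[b] = 0;
    }

    uint8_t  oldval = 0;  // value whose count is currently held in acc
    uint32_t acc    = 0;  // running count for oldval

    for (int i = 0; i < HEIGHT; i++) {
        for (int j = 0; j < WIDTH; j++) {
#pragma HLS PIPELINE II=1
// Assumed directive form: the write to hist[oldval] and the read of
// hist[val] in the same iteration never touch the same address.
#pragma HLS DEPENDENCE variable=hist intra RAW false
            uint8_t val = pixel[i][j];
            if (oldval == val) {
                acc++;               // repeated value: no memory access
            } else {
                hist[oldval] = acc;  // flush the previous run
                acc = hist[val] + 1; // start from the stored count
            }
            oldval = val;
        }
    }
    hist[oldval] = acc;              // flush the final run
}
```

Because the pragmas are ignored by a normal compiler, the result can be compared in software against the straightforward hist[val] = hist[val] + 1 loop before synthesis.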

==================================
Olivier Trémois
XILINX EMEA DSP Specialist
0 Kudos
Adventurer
158 Views
Registered: ‎04-11-2018

Re: Loop splitting to improve pipelining

Hello,

I am trying to compute the histogram of the LBP image and the HSV image. I get an output frame rate of 3.10 fps, whereas the input frame rate is 30 fps. I don't understand why the output frame rate is 10 times slower than the input. I have attached the relevant piece of code. Could someone give me a hint or help me find the cause of the performance lag?

 

Thank you in advance!

#pragma SDS data mem_attribute("lbp_input.data":NON_CACHEABLE|PHYSICAL_CONTIGUOUS)
#pragma SDS data mem_attribute("srcLeft_ENC.data":NON_CACHEABLE|PHYSICAL_CONTIGUOUS)
#pragma SDS data mem_attribute("ClassifiedImage.data":NON_CACHEABLE|PHYSICAL_CONTIGUOUS)
#pragma SDS data access_pattern("lbp_input.data":SEQUENTIAL,"srcLeft_ENC.data":SEQUENTIAL, "ClassifiedImage.data":SEQUENTIAL )
#pragma SDS data copy("lbp_input.data"[0:"lbp_input.size"],"srcLeft_ENC.data"[0:"srcLeft_ENC.size"],"ClassifiedImage.data"[0:"ClassifiedImage.size"])
void LBP_virtrude_histogram_v2(xf::Mat<XF_8UC1, SV_HEIGHT, SV_WIDTH, XF_NPPC1> &lbp_input, xf::Mat<XF_8UC1, SV_HEIGHT, SV_WIDTH, XF_NPPC1> &srcLeft_ENC,xf::Mat<XF_8UC1, SV_HEIGHT, SV_WIDTH, XF_NPPC1> &ClassifiedImage, int pcnt){
    hls::stream< XF_TNAME(XF_8UC1,XF_NPPC1)> _srcLBP;
    hls::stream< XF_TNAME(XF_8UC1,XF_NPPC1)> _srcENC;
    hls::stream< XF_TNAME(XF_8UC1,XF_NPPC1)> _dst1;

//copy data to HLS STREAM
#pragma HLS INLINE OFF
#pragma HLS DATAFLOW

    for(int i=0; i<lbp_input.rows;i++)
    {
#pragma HLS LOOP_TRIPCOUNT min=1 max=SV_HEIGHT
        for(int j=0; j<(lbp_input.cols)>>(XF_BITSHIFT(XF_NPPC1));j++)
        {
#pragma HLS LOOP_TRIPCOUNT min=1 max=SV_WIDTH/XF_NPPC1
#pragma HLS LOOP_FLATTEN off
#pragma HLS PIPELINE
            _srcLBP.write( *(lbp_input.data + i*(lbp_input.cols>>(XF_BITSHIFT(XF_NPPC1))) +j) );
            _srcENC.write( *(srcLeft_ENC.data + i*(srcLeft_ENC.cols>>(XF_BITSHIFT(XF_NPPC1))) +j) );

        }
    }

 
    LBP_virtrude_calchist_v2(_srcLBP,_srcENC,_dst1);
    for(int i=0; i<ClassifiedImage.rows;i++)
    {
#pragma HLS LOOP_TRIPCOUNT min=1 max=SV_HEIGHT
        for(int j=0; j<(ClassifiedImage.cols)>>(XF_BITSHIFT(XF_NPPC1));j++)
        {
#pragma HLS LOOP_TRIPCOUNT min=1 max=SV_WIDTH/XF_NPPC1
#pragma HLS PIPELINE
#pragma HLS LOOP_FLATTEN off
            XF_TNAME(XF_8UC1,XF_NPPC1) value = _dst1.read();
            *(ClassifiedImage.data + i*(ClassifiedImage.cols>>(XF_BITSHIFT(XF_NPPC1))) +j) = value;
        }
    }


}

void LBP_virtrude_calchist_v2(hls::stream<XF_TNAME(XF_8UC1,XF_NPPC1)> &data_lbp,hls::stream<XF_TNAME(XF_8UC1,XF_NPPC1)> &data_enc, hls::stream<XF_TNAME(XF_8UC1,XF_NPPC1)> &data_hist){

    int hist_i, hist_j;
    ap_uint<24> histogram_virtrude[95];//bin
    ap_uint<24> histogram_virtrude_H1[95];
    ap_uint<24> histogram_virtrude_H2[95];
    ap_uint<24> histogram_virtrude_H3[95];
    ap_uint<24> histogram_virtrude_H4[95];
    XF_TNAME(XF_8UC1,XF_NPPC1) buf0, buf1, buf2, buf3, buf4, buf5,buf6,buf7;
    ap_uint<13> row, col2, row_ind, h1_x, h1_y, h2_x, h2_y, h3_x, h3_y;
    ap_uint<24> pixel_counter;
    XF_TNAME(XF_8UC1,XF_NPPC1) P0;
   
    XF_TNAME(XF_8UC1,XF_NPPC1) buf_LBP[100][SV_WIDTH];
    XF_TNAME(XF_8UC1,XF_NPPC1) buf_HSV[100][SV_WIDTH];
    #pragma HLS RESOURCE variable=buf_LBP core=RAM_S2P_BRAM
    #pragma HLS ARRAY_PARTITION variable=buf_LBP complete dim=1
    #pragma HLS RESOURCE variable=buf_HSV core=RAM_S2P_BRAM
    #pragma HLS ARRAY_PARTITION variable=buf_HSV complete dim=1

    CLEAN_BUFFER_M:
    for(int ti=0; ti<93;ti++){
            #pragma HLS UNROLL
        histogram_virtrude[ti]=0;
        histogram_virtrude_H4[ti]=0;
        histogram_virtrude_H1[ti]=0;
        histogram_virtrude_H2[ti]=0;
        histogram_virtrude_H3[ti]=0;
    }

    row_ind = 0;
    pixel_counter=0;

    ROWLOOP_M:
        for(row = 0; row < SV_HEIGHT; row++){                                    // Height of the image
#pragma HLS LOOP_TRIPCOUNT min=SV_HEIGHT max=SV_HEIGHT

//the entire image data of the row is taken into consideration
#pragma HLS INLINE
            colLoop1_N:
                for(col2 = 0; col2 < SV_WIDTH; col2++){   // Width of the image

#pragma HLS PIPELINE II=1

                    // UPDATE Pointer for buffer histogram - integral.
                 pixel_counter++;
                    if(col2<30){

                        h3_x=row_ind-1;
                        if(row_ind==0) h3_x=99;
                        h1_x = h2_x-1;
                        if(h2_x==0) h1_x=99;

                        h3_y=SV_WIDTH-(30-col2);
                        h1_y=h3_y;
                    }else{

                        h3_x=row_ind;
                        h1_x=h2_x;
                        h3_y=col2;
                        h1_y=col2;
                    }

                  buf_LBP[row_ind][col2]=data_lbp.read();
                  buf_HSV[row_ind][col2]=data_enc.read();


                    // Update histogram integral
                    buf0=buf_LBP[row_ind][col2];
                    histogram_virtrude_H4[ buf0 ]++;
                    buf1=buf_HSV[row_ind][col2];
                    histogram_virtrude_H4[ 59 + buf1 ]++;


                    // Update histogram integral (shifted H1, H2, H3)

                if(pixel_counter > 30 ){
                    #pragma HLS latency min=0 max=0
                        buf2=buf_LBP[h3_x][h3_y];
                        histogram_virtrude_H3[buf2]++;
                        buf3=buf_HSV[h3_x][h3_y];
                        histogram_virtrude_H3[ 59 + buf3]++;

                       if(pixel_counter > 126720){//(99*1280=126720)
                        #pragma HLS latency min=0 max=0
                           buf4=buf_LBP[h2_x][col2];
                             histogram_virtrude_H2[buf4]++;
                             buf5=buf_HSV[h2_x][col2];
                            histogram_virtrude_H2[ 59 + buf5 ]++;

                            if(pixel_counter > 126750 ){//99*1280+30=126750
                            #pragma HLS latency min=0 max=0
                                buf6=buf_LBP[h1_x][h1_y];
                                histogram_virtrude_H1[buf6]++;
                                buf7=buf_HSV[h1_x][h1_y];
                                histogram_virtrude_H1[ 59 + buf7 ]++;


                                if(col2>30){
                                    // GET histogram for the region 100*30
                                    for(int ti=0; ti<93;ti++){
        #pragma HLS UNROLL
                                        histogram_virtrude[ti]=histogram_virtrude_H4[ti]+histogram_virtrude_H1[ti]-histogram_virtrude_H2[ti]-histogram_virtrude_H3[ti];
                                    }
                                }

                            P0= histogram_virtrude[1];
                    }
                   }

                }else{
                        P0=0;
                }

                    data_hist.write(P0);
            }

            row_ind = row_ind + 1;
            if (row_ind > 99){
                row_ind=0;
            }
            h2_x=99-row_ind;
     }
}

0 Kudos