Explorer
Registered: 02-08-2018

Unrolling a for loop has little effect on overall loop latency

I am trying to unroll the for loop shown below. I expected that unrolling by a factor of 10 would improve the overall loop latency by roughly a factor of 10, but it has little effect. I confirmed in the log that the loop was unrolled by the specified factor. Note that the variable lfps is an array of structs that was partitioned along all dimensions by a factor of 10 at its declaration, and that MAX_NMAXIMA = 20.

Unrolling the loop by a factor of 2 also has little effect on loop latency.

double error_array[MAX_NMAXIMA];
#pragma HLS ARRAY_PARTITION variable=error_array complete
loop22c: for (int m3 = 0; m3 < MAX_NMAXIMA; m3++)
{
     if ((m3 < (m2+1)) || (m3 >= nmaxima)) 
         continue;
     int i3 = maxima[m3];
     double err23, err30, mse23, mse30;
     double params23[4], params30[4];

     fit_line(lfps, num_points, i2, i3, params23, &err23, &mse23);
     if (mse23 > td->qtp.max_line_fit_mse)
         continue;

     fit_line(lfps, num_points, i3, i0, params30, &err30, &mse30);
     if (mse30 > td->qtp.max_line_fit_mse)
         continue;
     error_array[m3] = err01 + err12 + err23 + err30;
}

/************Fit_line function definition*******************/

void fit_line(struct line_fit_pt lfps[MAX_CLUSTER_SIZE], unsigned int num_points, int i0, int i1, double *lineparm, double *err, double *mse)
{
    assert(i0 != i1);
    assert(i0 >= 0 && i1 >= 0 && i0 < num_points && i1 < num_points);

    double Mx, My, Mxx, Myy, Mxy, W;
    int N; // how many points are included in the set?

    if (i0 < i1) {
        N = i1 - i0 + 1;

        Mx = lfps[i1].Mx;
        My = lfps[i1].My;
        Mxx = lfps[i1].Mxx;
        Mxy = lfps[i1].Mxy;
        Myy = lfps[i1].Myy;
        W = lfps[i1].W;

        if (i0 > 0) {
            Mx -= lfps[i0-1].Mx;
            My -= lfps[i0-1].My;
            Mxx -= lfps[i0-1].Mxx;
            Mxy -= lfps[i0-1].Mxy;
            Myy -= lfps[i0-1].Myy;
            W -= lfps[i0-1].W;
        }

    } else {
        // i0 > i1, e.g. [15, 2]. Wrap around.
        assert(i0 > 0);

        Mx = lfps[num_points-1].Mx - lfps[i0-1].Mx;
        My = lfps[num_points-1].My - lfps[i0-1].My;
        Mxx = lfps[num_points-1].Mxx - lfps[i0-1].Mxx;
        Mxy = lfps[num_points-1].Mxy - lfps[i0-1].Mxy;
        Myy = lfps[num_points-1].Myy - lfps[i0-1].Myy;
        W = lfps[num_points-1].W - lfps[i0-1].W;

        Mx += lfps[i1].Mx;
        My += lfps[i1].My;
        Mxx += lfps[i1].Mxx;
        Mxy += lfps[i1].Mxy;
        Myy += lfps[i1].Myy;
        W += lfps[i1].W;

        N = num_points - i0 + i1 + 1;
    }

    assert(N >= 2);

    double Ex = Mx / W;
    double Ey = My / W;
    double Cxx = Mxx / W - Ex*Ex;
    double Cxy = Mxy / W - Ex*Ey;
    double Cyy = Myy / W - Ey*Ey;

    double nx, ny;
    double normal_theta = .5 * atan2f(-2*Cxy , Cyy - Cxx);
    nx = cosf(normal_theta);
    ny = sinf(normal_theta);

    if (1) { // (lineparm)
        lineparm[0] = Ex;
        lineparm[1] = Ey;
        lineparm[2] = nx;
        lineparm[3] = ny;
    }

    // sum of squared errors
    *err = nx*nx*N*Cxx + 2*nx*ny*N*Cxy + ny*ny*N*Cyy;

    // mean squared error
    *mse = nx*nx*Cxx + 2*nx*ny*Cxy + ny*ny*Cyy;
}

6 Replies
Explorer
Registered: 07-18-2018

Re: Unrolling a for loop has little effect on overall loop latency

Hi agailey,

From the example provided, I assume the loop being unrolled is loop22c?

Since unrolling is not giving the expected speedup, there may be a data dependency that isn't being handled. It's not obvious to me what that would be, so I would use the Analysis view to compare the steps in each iteration of the loop with and without unrolling. That might give more insight into what keeps the loop latency long regardless of unrolling, which is likely access to some variable that is being seen as dependent.

If it still isn't obvious, share the Analysis view of the loop before and after unrolling and maybe some other eyes can catch something.

Xilinx Employee
Registered: 01-09-2008

Re: Unrolling a for loop has little effect on overall loop latency

Before UNROLLING, you should try to PIPELINE.

This does not increase the hardware too much, and you will have a much better view of data dependencies.

If the pipeline length is > 10 you can also get a much higher acceleration ratio.
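
For illustration, here is a minimal sketch of where the PIPELINE directive would go in a loop shaped like loop22c. The helper work_pipelined is a hypothetical stand-in for the two fit_line calls, and II=1 is only a requested target, not a guarantee that the tool will reach it. A regular C compiler ignores the HLS pragma, so the function can still be checked in software.

```c
#include <assert.h>

#define MAX_NMAXIMA 20

/* Hypothetical stand-in for the per-iteration work (the real loop calls fit_line). */
static double work_pipelined(int m3) { return 0.5 * m3; }

/* Fills error_array[m3] for m2 < m3 < nmaxima; returns the number of entries written. */
int loop22c_pipelined(double error_array[MAX_NMAXIMA], int m2, int nmaxima)
{
    int written = 0;

    assert(nmaxima <= MAX_NMAXIMA);

loop22c:
    for (int m3 = 0; m3 < MAX_NMAXIMA; m3++) {
#pragma HLS PIPELINE II=1
        /* Same guard as the original loop: skip indices outside (m2, nmaxima). */
        if ((m3 < (m2 + 1)) || (m3 >= nmaxima))
            continue;
        error_array[m3] = work_pipelined(m3);
        written++;
    }
    return written;
}
```

The pragma sits inside the loop body, so it applies to loop22c itself. Any functions called from the body would have to be pipelined or inlined by the tool for the requested interval to be achievable.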

 

==================================
Olivier Trémois
XILINX EMEA DSP Specialist
Explorer
Registered: 02-08-2018

Re: Unrolling a for loop has little effect on overall loop latency

@oliviert

I did try pipelining as well, and it leads to the best outcome so far. Without parallel optimizations, the overall loop latency ranges from 40 to 4420. When unrolling by a factor of 4, loop latency ranges from 25 to 4425. When pipelining, loop latency decreases to 261, but LUT utilization for the whole program increases from 59% of available LUT resources to 69%.

Resource consumption when unrolling by a factor of 4 is lower than when pipelining, but that may be because unrolling by a factor of 4 is not effective at facilitating parallel processing. I tried checking for dependency issues, and none seem to be mentioned in the log. Based on the results so far, pipelining seems like the best option.
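
As a rough check on these numbers, the sketch below divides a reported worst-case latency by the number of loop iterations. It assumes the worst case is simply the trip count times the per-iteration latency, with the trip count taken as MAX_NMAXIMA = 20 (and 5 after unrolling by 4). The name cycles_per_iteration is a hypothetical helper, not part of the original code.

```c
#include <assert.h>

/* Per-iteration latency implied by a total loop latency and a trip count.
 * Assumption: total latency = trip_count * iteration latency, no other overhead. */
int cycles_per_iteration(int total_latency, int trip_count)
{
    assert(trip_count > 0);
    return total_latency / trip_count;
}
```

Under those assumptions, cycles_per_iteration(4420, 20) gives 221 and cycles_per_iteration(4425, 5) gives 885, which is 4 * 221 + 1. That would suggest the four unrolled copies of the loop body are being scheduled one after another rather than in parallel, although the Analysis view would be needed to confirm it.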

Scholar u4223374
Registered: 04-26-2015

Re: Unrolling a for loop has little effect on overall loop latency

I've often found that splitting a loop into a two-level nested version works well, because then you can pipeline the outer loop (which automatically unrolls the inner one). This has two related advantages:

 

(1) It sets the performance target for the unrolled loops. If you just unroll a loop normally, there's nothing to say that HLS should try to make them run faster - it should just do everything in parallel. With the outer loop pipelined at II=1, HLS is trying to ensure that it can start the inner loop in every clock cycle.

(2) HLS will tell you why it can't reach II=1 - which is normally because of congestion around a RAM. You can then go and investigate the RAM that it's complaining about.
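
A minimal sketch of that two-level split, applied to a loop shaped like loop22c, might look as follows. The inner trip count INNER = 4 and the helper work_nested (a stand-in for the fit_line calls) are assumptions made for illustration only. A regular C compiler ignores the HLS pragma, so the restructured loop can be checked against the original in software.

```c
#include <assert.h>

#define MAX_NMAXIMA 20
#define INNER 4                       /* hypothetical inner trip count */
#define OUTER (MAX_NMAXIMA / INNER)   /* 5 outer iterations */

/* Hypothetical stand-in for the per-iteration work (the real loop calls fit_line). */
static double work_nested(int m3) { return 2.0 * m3; }

void loop22c_nested(double error_array[MAX_NMAXIMA], int m2, int nmaxima)
{
    assert(MAX_NMAXIMA % INNER == 0);  /* the split must cover every index exactly once */

outer:
    for (int mo = 0; mo < OUTER; mo++) {
#pragma HLS PIPELINE II=1
        /* Pipelining the outer loop causes the inner loop to be fully unrolled. */
    inner:
        for (int mi = 0; mi < INNER; mi++) {
            int m3 = mo * INNER + mi;  /* recover the original loop index */
            if ((m3 < (m2 + 1)) || (m3 >= nmaxima))
                continue;
            error_array[m3] = work_nested(m3);
        }
    }
}
```

The guard and the index arithmetic keep the results identical to the single-level loop. Only the schedule requested from the tool changes.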

Xilinx Employee
Registered: 01-09-2008

Re: Unrolling a for loop has little effect on overall loop latency

What is the initiation interval of the pipelined loop?

Olivier

==================================
Olivier Trémois
XILINX EMEA DSP Specialist
Explorer
Registered: 02-08-2018

Re: Unrolling a for loop has little effect on overall loop latency

@oliviert

The initiation interval of the pipelined loop is 1. I notice that overall resource consumption increases more with pipelining this loop than with unrolling. Currently the Vivado HLS block is predicted to occupy 26% BRAM, 14% DSP, 20% FF and 57% LUT. I am not sure of the highest allowable percentage for each category. I am using the MicroBlaze, which also occupies some space, and some of our collaborators said that the FPGA program should not occupy more than 70% of the total available memory space.
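
For reference, a common estimate for a pipelined loop is one full iteration plus II cycles for every further iteration. The sketch below encodes that estimate. The function name is hypothetical, the trip count of 20 is assumed from MAX_NMAXIMA, and the figure the tool reports may differ by a cycle or two of loop entry and exit overhead.

```c
#include <assert.h>

/* Estimated latency of a pipelined loop:
 * the first iteration takes iteration_latency cycles,
 * and each later iteration starts ii cycles after the previous one. */
int pipelined_latency(int iteration_latency, int ii, int trip_count)
{
    assert(ii >= 1 && trip_count >= 1);
    return iteration_latency + ii * (trip_count - 1);
}
```

With II = 1 and 20 iterations, pipelined_latency(242, 1, 20) returns 261, so under this estimate the reported latency of 261 would correspond to a pipeline depth of roughly 242 cycles. Almost all of the remaining latency would then be a single pass through the loop body rather than the iteration count.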
