sneutrino
Visitor
Registered: ‎08-04-2014

Pipelining triple loops

Dear all,

I've noticed some interesting behavior when trying to pipeline triple loops, which I can reproduce using the simplified code below : 

#include <iostream>
#include "ap_int.h"

void tripleloop( const ap_uint<16> (&in1)[64],
                 const ap_uint<16> (&in2)[64],
                 ap_uint<16> (&out)[64] ) {

#pragma HLS ARRAY_PARTITION variable=in1 complete dim=0
#pragma HLS ARRAY_PARTITION variable=in2 complete dim=0
#pragma HLS ARRAY_PARTITION variable=out complete dim=0

 loop1 : for( unsigned int i=0; i<4; i++ ) {

    //    #pragma HLS PIPELINE // II=1 succeeds                                                                                                                

  loop2 : for( unsigned int j=0; j<4; j++ ) {

#pragma HLS PIPELINE // carry dependence                                                                                                                       

    loop3 : for( unsigned int k=0; k<4; k++ ) {

        //#pragma HLS PIPELINE // II=1 succeeds                                                                                                                


        // dummy calculation                                                                                                                                   
        int value = in1[i] - in2[j];
        if( value>0 ) value += k;
        else value = 0;
        // model a more complicated calculation                                                                                                                
        #pragma HLS LATENCY min=3 max=3


        // write to a unique location                                                                                                                          
        unsigned index = 16*i + 4*j + k;
//unsigned index = 4*j + k; // loop2 can achieve II=1 with this ...
        ap_uint<16> output = value;
        out[index] = output;
    }
  }
 }
}

The arrays are completely partitioned and the index for the output array is unique between iterations.  Pipelining of either loop1 or loop3 succeeds with II=1.  When I pipeline loop2 I receive warnings:

 

 

INFO: [SCHED 204-61] Pipelining loop 'loop1_loop2'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 1, distance = 1, offset = 1)
between wire write on port 'out_63_V' (tripleloop.cc:36) and wire write on port 'out_63_V' (tripleloop.cc:36).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 2, distance = 1, offset = 1)
between wire write on port 'out_63_V' (tripleloop.cc:36) and wire write on port 'out_63_V' (tripleloop.cc:36).
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 3, Depth = 4.
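
For scale, the reported result can be turned into a rough cycle count. The sketch below assumes the usual approximation that a pipelined loop takes (trip count - 1) * II + depth cycles, that the flattened 'loop1_loop2' runs 4*4 = 16 iterations (with loop3 unrolled by the PIPELINE directive on loop2), and that the depth would stay at 4 if II=1 were achieved. The function name pipelined_latency is made up for illustration and is not part of any HLS API.

```cpp
#include <cassert>

// Approximate cycle count of a pipelined loop: the first iteration needs
// `depth` cycles, and every further iteration starts `ii` cycles after
// the previous one. This is an estimate, not a tool-reported figure.
inline unsigned pipelined_latency(unsigned trip_count, unsigned ii, unsigned depth) {
    assert(trip_count > 0 && ii > 0);
    return (trip_count - 1) * ii + depth;
}

// Assumed values: 16 iterations of the flattened loop, "Depth = 4" from the log.
inline unsigned latency_at_final_ii()  { return pipelined_latency(16, 3, 4); }
inline unsigned latency_at_target_ii() { return pipelined_latency(16, 1, 4); }
```

Under these assumptions the estimate is 49 cycles at the final II of 3, against 19 cycles at the target II of 1.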

 

 

Interestingly, if I remove the "i" dependence of the "index" variable (as shown in the commented line), loop2 will pipeline with II=1.  In reality, I will probably pipeline loop1 or loop3 in my actual application, but I'm curious as to why the pipelining of loop2 fails.  I'd appreciate any insight you might offer.
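
The difference between the two index expressions can be checked in plain software. This is only a sketch, independent of the HLS tool: the helper names count_writes and distinct_slots and the stride_i parameter are invented for illustration. It counts how often each of the 64 output slots is written by the 4x4x4 loop nest.

```cpp
#include <array>
#include <cassert>

// Count how many times each of the 64 output slots is written.
// `stride_i` is the multiplier on i: 16 for the original index,
// 0 for the commented-out variant (index = 4*j + k).
std::array<unsigned, 64> count_writes(unsigned stride_i) {
    std::array<unsigned, 64> writes{};  // zero-initialised
    for (unsigned i = 0; i < 4; i++) {
        for (unsigned j = 0; j < 4; j++) {
            for (unsigned k = 0; k < 4; k++) {
                unsigned index = stride_i * i + 4 * j + k;
                assert(index < 64);
                writes[index]++;
            }
        }
    }
    return writes;
}

// Number of distinct slots that are written at least once.
unsigned distinct_slots(const std::array<unsigned, 64>& writes) {
    unsigned n = 0;
    for (unsigned w : writes) n += (w > 0) ? 1u : 0u;
    return n;
}
```

With stride_i = 16 every one of the 64 slots is written exactly once, so no two iterations touch the same element. With stride_i = 0 only slots 0 to 15 are written, four times each, once per value of i.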

Thanks, Kristian 

3 Replies
nmoeller
Xilinx Employee
Registered: ‎09-05-2018

@sneutrino,

I admittedly don't know why synthesis only detects a dependency when you pipeline the second loop, but you can try to fix it with a false dependence directive:

#pragma HLS DEPENDENCE variable=out intra false

We're relying on HLS knowing that index will produce 64 different values; sometimes it can work this out by itself and sometimes we have to tell it. Even though it's only a small change to us, it changes the code that HLS sees significantly, so it's curious but not totally surprising.
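
As a sketch of where such a directive would sit, the version below puts it inside loop2 next to the PIPELINE pragma. The placement is an assumption on my part, and std::uint16_t stands in for ap_uint<16> so that the code compiles without the HLS headers (a regular compiler ignores the HLS pragmas, possibly with an unknown-pragma warning). The LATENCY directive from the original is left out for brevity, and the name tripleloop_sketch is made up. Note that a follow-up in this thread reports the directive did not change the achieved II in 2018.2 or 2018.3.

```cpp
#include <cstdint>

// Software-only stand-in for the kernel in the original post.
void tripleloop_sketch(const std::uint16_t (&in1)[64],
                       const std::uint16_t (&in2)[64],
                       std::uint16_t (&out)[64]) {
#pragma HLS ARRAY_PARTITION variable=in1 complete dim=0
#pragma HLS ARRAY_PARTITION variable=in2 complete dim=0
#pragma HLS ARRAY_PARTITION variable=out complete dim=0

    for (unsigned i = 0; i < 4; i++) {          // loop1
        for (unsigned j = 0; j < 4; j++) {      // loop2, the pipelined loop
#pragma HLS PIPELINE
// Assumed placement: assert that writes to 'out' do not conflict.
#pragma HLS DEPENDENCE variable=out intra false
            for (unsigned k = 0; k < 4; k++) {  // loop3
                // Same dummy calculation as in the original post.
                int value = in1[i] - in2[j];
                if (value > 0) value += k;
                else value = 0;

                // Each (i, j, k) maps to a unique output slot.
                unsigned index = 16 * i + 4 * j + k;
                out[index] = static_cast<std::uint16_t>(value);
            }
        }
    }
}
```

A false dependence directive is a promise to the tool rather than a check, so it should only be used when the accesses really are independent, as they are for this index expression.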

I think the commented-out code works because HLS realizes it can optimize by skipping the first 48 writes and performing only the last 16, but that's just a guess.

Nicholas Moellers

Xilinx Adaptive Computing Tools
sneutrino
Visitor
Registered: ‎08-04-2014

Dear Nicholas,

Thanks very much for your reply.  Apologies, I failed to mention in my original post that I had also tried adding the false dependence pragma, and that it didn't change the situation:

 

INFO: [HLS 200-111] Finished Pre-synthesis Time (s): cpu = 00:00:16 ; elapsed = 00:00:17 . Memory (MB): peak = 582.770 ; gain = 132.344 ; free physical = 8868 ; free virtual = 35169
INFO: [XFORM 203-541] Flattening a loop nest 'loop1' (tripleloop.cc:13:45) in function 'tripleloop'.
WARNING: [ANALYSIS 214-52] Found false intra dependency for variable 'out[22].V' (tripleloop.cc:6).
WARNING: [ANALYSIS 214-52] Found false intra dependency for variable 'out[34].V' (tripleloop.cc:6).
WARNING: [ANALYSIS 214-52] Found false intra dependency for variable 'out[14].V' (tripleloop.cc:6).
WARNING: [ANALYSIS 214-52] Found false intra dependency for variable 'out[25].V' (tripleloop.cc:6).

<... snip ...>

INFO: [SCHED 204-61] Pipelining loop 'loop1_loop2'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 1, distance = 1, offset = 1)
   between wire write on port 'out_63_V' (tripleloop.cc:41) and wire write on port 'out_63_V' (tripleloop.cc:41).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 2, distance = 1, offset = 1)
   between wire write on port 'out_63_V' (tripleloop.cc:41) and wire write on port 'out_63_V' (tripleloop.cc:41).
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 3, Depth = 4.

I originally tried this with 2018.2, and again just now with 2018.3.  I tried specifying both false inter (because of the carried dependence warning) and false intra dependencies on array out, but neither allows me to pipeline loop2 at II=1.  It seems the tool acknowledges the specified false dependence, yet ultimately ignores it?

Thanks, Kristian

   

sneutrino
Visitor
Registered: ‎08-04-2014

To follow up, I believe this issue is connected to the i dependence of the array accesses in the inner loops.  We're seeing similar behavior with a simple, single loop implementation, as described here.

Best, Kristian
