hanqiu
Visitor
Registered: ‎10-01-2016

Trying to achieve a PIPELINE interval of 1


Hello, I want to achieve a PIPELINE interval of 1, but the current interval is 8.

For simplicity, I just put the inner loop of my code here.

#pragma HLS PIPELINE
#define C 2
#define PARA 16
for(int l=0;l<PARA;++l){
	int ker_num = l;
	if(init&&j==0){
		sum = 0;
	} else {
		sum = out_buf[ker_num][i];
	}
	for(int m=0;m<C;++m){
		if(i+j-PAD >=0&&i+j-PAD<W){
			sum += mul_bi(buf_line[m][i+j-PAD],f_buf[ker_num][m][j]);
		}
	}
	out_buf[ker_num][i] = sum;
}

The data type of buf_line and f_buf is ap_uint<64>.

The data type of out_buf is ap_int<16>.

The mul_bi function is just a pop_count.

I've partitioned f_buf, buf_line and out_buf so that they can be fetched in one cycle.

#pragma HLS ARRAY_PARTITION variable=buf_line complete dim=1
#pragma HLS ARRAY_PARTITION variable=f_buf cyclic factor=16 dim=1
#pragma HLS ARRAY_PARTITION variable=f_buf complete dim=2
#pragma HLS ARRAY_PARTITION variable=out_buf cyclic factor=16 dim=1

Accepted Solutions
martizih
Visitor
Registered: ‎12-02-2016

First of all, consider moving this topic to the HLS board.

I do not know how expensive that many logic branches are once the code is compiled to a bitstream; they might blow up your design by being implemented as large multiplexers.

 

Now to the problem in question:

 

The PIPELINE pragma unrolls all loops nested inside it by default. I think what you might want to consider is PIPELINE with the REWIND option in the innermost loop. This tells HLS that there is no need to flush the pipeline between operations, and that it should instead keep the pipeline always full. If this doesn't work, you might have dependencies in your code.

To make sure that there is no inter-iteration dependency between the data used in the pipeline, the synthesized design will simply flush the pipeline and fill it again, which defeats the REWIND option. The minimum distance between dependencies required to keep the pipeline always full depends on the depth of your pipeline; e.g. if your depth is 24, make sure that the minimum distance between actual dependencies is larger. If you have made sure that there is no inter-iteration dependency (or that the distance is sufficient) but still receive a warning about it, consider telling HLS so with the dependence pragma and its "inter false" options, as in the code below.

 

Below is the core of my convolutional layer. A few remarks on the code and its surroundings:

 

- padding is done in software together with some other dataset transformations

- currently single precision floating point data

- multiply accumulate buffer (mac) allows for larger distance between dependencies

- output buffered convolution to increase size of mac for free.

- everything is stored in bram

 

for(nin = 0; nin < _NIN_; nin++)
{
    for(wkern = 0; wkern < _WK0_; wkern++)
    {
        /* inter independence of mac no longer true above this point */
        for(nout = 0; nout < _NOUT_PW_; nout++)
        {
            for(hkern = 0; hkern < _HK0_; hkern++)
            {
                for(wout = 0; wout < _WEFF_; wout++)
                {
                    #pragma HLS PIPELINE II=1 REWIND

                    win = wkern + wout;
                    i = in[nin][win];

                    /* bram shift register => circular access pattern */
                    mac_height = (shreg + _HK0_ - hkern - 1) % _HS_;

                    for(nworker = 0; nworker < _NWORKER_; nworker++)
                    {
                        #pragma HLS UNROLL
                        #pragma AP dependence variable=in inter false
                        #pragma AP dependence variable=mac inter false
                        #pragma AP dependence variable=kern inter false

                        k[nworker] = kern[nworker][nin][nout][hkern][wkern];
                        p[nworker] = i * k[nworker];
                        mac[nworker][nout][mac_height][wout] +=  p[nworker];
                    }
                }
            }
        }
    }
}

problem in the above code
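
The remark about the multiply-accumulate buffer can be illustrated in isolation. The following is not the convolution code above but a minimal plain C++ sketch of the same idea, assuming a hypothetical adder latency `LAT` of 8 cycles and made-up names (`accumulate_interleaved`, `partial`). Each partial sum is touched only once every `LAT` iterations, so the loop-carried dependence has distance `LAT` instead of 1. Whether the tool recognises that distance on its own or needs a dependence hint probably depends on the tool version. An ordinary compiler ignores the pragmas.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical adder latency in cycles (assumption); the real value would
// come from the HLS schedule report.
constexpr std::size_t LAT = 8;

// Sums n floats using LAT interleaved partial sums.
// partial[k] is read and written only on every LAT-th iteration, so
// back-to-back iterations of the pipelined loop never share an accumulator.
float accumulate_interleaved(const float *data, std::size_t n) {
    assert(data != nullptr || n == 0);

    float partial[LAT];
#pragma HLS ARRAY_PARTITION variable=partial complete dim=1

    for (std::size_t k = 0; k < LAT; ++k) {
#pragma HLS UNROLL
        partial[k] = 0.0f;
    }

    for (std::size_t idx = 0; idx < n; ++idx) {
#pragma HLS PIPELINE II=1
        partial[idx % LAT] += data[idx];
    }

    // Final reduction over the partial sums; this runs once per call.
    float total = 0.0f;
    for (std::size_t k = 0; k < LAT; ++k) {
        total += partial[k];
    }
    return total;
}
```

The cost is `LAT` accumulators plus a short reduction at the end, which is the same trade the mac buffer makes: more storage in exchange for a larger distance between dependent updates.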

hanqiu
Visitor
Registered: ‎10-01-2016

I think maybe I should put a simplified version of my total design. I will do that later.

u4223374
Advisor
Registered: ‎04-26-2015

In my experience, HLS can have trouble with conditions within loops when those conditions affect whether or not a RAM access has to occur. You're better off always doing the RAM access, and then using a condition to determine whether the resulting value gets used.

 

Give this a try:

#pragma HLS PIPELINE
#define C 2
#define PARA 16
for(int l=0;l<PARA;++l){
	int ker_num = l;
	
	int tmp = out_buf[ker_num][i]; // RAM is always accessed.
	if(init&&j==0){
		sum = 0;
	} else {
		sum = tmp;
	}
	for(int m=0;m<C;++m){
		int tmp_result = mul_bi(buf_line[m][i+j-PAD],f_buf[ker_num][m][j]); // RAM is always accessed.
		if(i+j-PAD >=0&&i+j-PAD<W){
			sum += tmp_result;
		}
	}
	out_buf[ker_num][i] = sum;
}
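
One thing to watch with the always-read pattern is that `i+j-PAD` can be negative or reach `W`, so an unconditional read can leave the array during C simulation. A minimal self-contained sketch of one way to handle that is shown here; it clamps the index and lets the bounds check only select the result. The values of `W` and `PAD`, the helper `mul_bi_stub` (XNOR plus population count) and the function `accumulate_column` are assumptions for illustration, not the original poster's code, and plain `uint64_t` stands in for `ap_uint<64>`.

```cpp
#include <cassert>
#include <cstdint>

constexpr int C = 2;    // input channels, as in the thread
constexpr int W = 8;    // line width (illustrative value, an assumption)
constexpr int PAD = 1;  // padding (illustrative value, an assumption)

// Stand-in for mul_bi (assumption): XNOR followed by a population count.
// The thread only says that mul_bi "is just a pop_count".
static int mul_bi_stub(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t x = ~(a ^ b);
    int count = 0;
    for (int bit = 0; bit < 64; ++bit) {  // fixed trip count
        count += static_cast<int>((x >> bit) & 1u);
    }
    return count;
}

// Accumulates one output position. The buffer is read unconditionally with a
// clamped index; the bounds check only decides whether the product is added.
int accumulate_column(const std::uint64_t buf_line[C][W],
                      const std::uint64_t weights[C],
                      int i, int j, int sum) {
    const int col = i + j - PAD;
    const bool in_range = (col >= 0) && (col < W);
    const int safe_col = in_range ? col : 0;  // keeps the read inside the array

    for (int m = 0; m < C; ++m) {
#pragma HLS UNROLL
        const int product = mul_bi_stub(buf_line[m][safe_col], weights[m]);
        sum += in_range ? product : 0;
    }
    return sum;
}
```

The memory access no longer depends on the condition, which is the point of the suggestion above, and the extra select on the index should be cheap compared with the popcount.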
hanqiu
Visitor
Registered: ‎10-01-2016

@u4223374 Thank you for your reply. It doesn't solve the problem, but it does help get rid of this warning:

CRITICAL WARNING: [SDSoC 0-0] Timing constraints were not met

 

u4223374
Advisor
Registered: ‎04-26-2015

Does it say what the problem is during synthesis? Normally HLS will produce a warning that it can't meet the requested pipeline interval, and tell you exactly where the problem is. It occurs to me that in your design, you might want to use the ALLOCATION pragma to ensure it makes enough copies of mul_bi; if it's trying to squeeze the whole lot through a single copy then that's going to be a problem.
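
As a rough sketch of the ALLOCATION suggestion: the pragma spelling below follows the older Vivado HLS form (instances, limit, then the type), and newer tool versions order the arguments differently, so check it against the documentation for your release. `mul_bi_stub` and `popcount_all` are hypothetical names, the XNOR-popcount body is an assumption, and plain `uint64_t` stands in for `ap_uint<64>`. An ordinary compiler ignores the pragmas.

```cpp
#include <cassert>
#include <cstdint>

constexpr int PARA = 16;  // number of parallel kernels, as in the thread

// Stand-in for mul_bi (assumption): XNOR followed by a population count.
static int mul_bi_stub(std::uint64_t a, std::uint64_t b) {
#pragma HLS INLINE off  // keep it a separate function so instances can be limited
    const std::uint64_t x = ~(a ^ b);
    int count = 0;
    for (int bit = 0; bit < 64; ++bit) {  // fixed trip count
        count += static_cast<int>((x >> bit) & 1u);
    }
    return count;
}

// Runs PARA independent popcounts. The ALLOCATION pragma permits up to 16
// copies of mul_bi_stub, so the unrolled calls need not share one instance.
void popcount_all(const std::uint64_t x[PARA],
                  const std::uint64_t w[PARA],
                  int out[PARA]) {
#pragma HLS ALLOCATION instances=mul_bi_stub limit=16 function
    for (int l = 0; l < PARA; ++l) {
#pragma HLS UNROLL
        out[l] = mul_bi_stub(x[l], w[l]);
    }
}
```

If the synthesis report shows fewer instances than unrolled calls, that sharing is a likely reason the requested interval cannot be met.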
