wijm02
Contributor
Registered: 06-09-2020

Correct usage of ap_ctrl_chain to apply backpressure between functions


I am using Vitis 2019.2.

I have a top-level function structured as follows:

MM2S --> fast_function --> slow_function --> S2MM

fast_function has an internal loop that buffers the input image in a sliding window (stored in BRAM) and, once the window is ready, signals slow_function through an hls::stream so that slow_function can run.

slow_function processes the contents of the sliding window and produces the output data. While slow_function is processing the window, I would like it to backpressure fast_function so that the window contents do not change. I tried to use the hls::stream FIFO between fast_function and slow_function to apply this backpressure.

Currently, either fast_function runs to completion before slow_function even begins (if the FIFO between them is deep enough), or the system deadlocks once the FIFO is full. I would like fast_function to be backpressured when the FIFO between them is full. Is this possible at all?

I understand that the tool uses the ap_ctrl_hs block-level interface protocol by default to synchronise functions in dataflow designs. I tried adding the following pragma at the top level so that the synthesised functions would get the additional ap_continue signal:

#pragma HLS INTERFACE ap_ctrl_chain port=return bundle=control

It would appear that I have not used it correctly.

Below is a simplified version of the code, in which slow_function merely reads the data from the window and streams it out instead of processing it. If the backpressure works correctly, img_out should be identical to img_inp. When I combine fast_function and slow_function into a single function, the output is correct, but I need the two to be decoupled for performance.

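#include <ap_int.h>
#include <hls_stream.h>
#include "common/xf_common.hpp"   // xf::cv::Mat
#include "common/xf_utility.hpp"  // xf::cv::Array2xfMat, xf::cv::xfMat2Array

// WIDTH, HEIGHT, NPIX, IN_T, INPUT_PTR_WIDTH and OUTPUT_PTR_WIDTH are assumed
// to come from a project configuration header that is not shown here.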

typedef struct{
	unsigned int h;
	unsigned int w;
	bool kp;
	bool last;
} pixel_t;

const unsigned int PATCHSIZE = 48;

// mat2stream
template <int SRC_T, int ROWS, int COLS, int NPC>
//static
void mat2stream(xf::cv::Mat<SRC_T, ROWS, COLS, NPC> &img, hls::stream<unsigned char> &imgStream, int size) {
#pragma HLS inline off

	img_conv: for (unsigned int k = 0; k < size; k++) { //vertical, rows
	#pragma HLS loop_tripcount min=8192 max=2073600
	#pragma HLS PIPELINE II=1

		unsigned char tmp = img.read(k);

		imgStream.write(tmp);
	}

}

void fast_function(hls::stream<unsigned char> &blurStream, unsigned char window_[WIDTH][PATCHSIZE+1], hls::stream<pixel_t> &pixelStream, unsigned int IMAGE_HEIGHT, unsigned int IMAGE_WIDTH){
#pragma HLS inline off

	circ_buffer: for (unsigned int h = 0; h < IMAGE_HEIGHT; h++){
#pragma HLS loop_tripcount max=1080
//	#pragma HLS PIPELINE II=1


		window_wr_: for (unsigned int w = 0; w < IMAGE_WIDTH; w++){
		#pragma HLS loop_tripcount max=1920
//		#pragma HLS PIPELINE II=1

			unsigned char tmp_blur;

			blurStream.read(tmp_blur);
			window_[w][h % (PATCHSIZE+1)] = tmp_blur;

			if ((h == IMAGE_HEIGHT - 1) && (w == IMAGE_WIDTH - 1)) {
				pixel_t px2;
				px2.last = 1;
				px2.w = w;
				px2.h = h;
				px2.kp = 0;
				pixelStream.write(px2);
			}
			else
			{
				pixel_t px;
				px.w = w;
				px.h = h;
				px.last = 0;
				px.kp = 1;

				pixelStream.write(px);
			}

		} // w

	}
}

template <int SRC_T, int ROWS, int COLS, int NPC>
void slow_function(unsigned char window_[WIDTH][PATCHSIZE+1], hls::stream<pixel_t> &pixelStream, xf::cv::Mat<SRC_T, ROWS, COLS, NPC> &dstMat, unsigned int IMAGE_HEIGHT, unsigned int IMAGE_WIDTH){

#pragma HLS inline off

	comp: while(1){

	#pragma HLS PIPELINE II=2

		pixel_t temp;
		static bool image_end = 0;

		if (image_end){
			image_end = 0;
			break;
		}
		else
		{
			pixelStream.read(temp);

			image_end = temp.last;

			unsigned int h = temp.h;
			unsigned int w = temp.w;

			unsigned char window_rd =  window_[w][h % (PATCHSIZE+1)];
			dstMat.write(w  + h*IMAGE_WIDTH, window_rd);

		}
	}
}



extern "C" {

void kernel_accel(ap_uint<INPUT_PTR_WIDTH>* img_inp, ap_uint<OUTPUT_PTR_WIDTH>* img_out, int rows, int cols) {


    #pragma HLS INTERFACE m_axi     port=img_inp  offset=slave bundle=gmem1
    #pragma HLS INTERFACE m_axi     port=img_out  offset=slave bundle=gmem2

    #pragma HLS INTERFACE s_axilite port=rows     bundle=control
    #pragma HLS INTERFACE s_axilite port=cols     bundle=control
    #pragma HLS INTERFACE s_axilite port=return   bundle=control

	#pragma HLS INTERFACE ap_ctrl_chain port=return bundle=control

    const int pROWS = HEIGHT;
    const int pCOLS = WIDTH;
    const int pNPC1 = NPIX;


    xf::cv::Mat<IN_T, HEIGHT, WIDTH, NPIX> img_mat(rows, cols);
	#pragma HLS stream variable=img_mat.data depth=16384

    xf::cv::Mat<IN_T, HEIGHT, WIDTH, NPIX> _dst(rows, cols);
	#pragma HLS stream variable=_dst.data depth=16384

    hls::stream<pixel_t> pixel_st;
	//#pragma HLS STREAM variable = pixel_st depth = 2
	#pragma HLS STREAM variable = pixel_st depth = 16384

    hls::stream<unsigned char> img_st;
	#pragma HLS STREAM variable = img_st depth = 16384

    const unsigned int IMAGE_HEIGHT = img_mat.rows;
    const unsigned int IMAGE_WIDTH = img_mat.cols;

    unsigned char window[WIDTH][PATCHSIZE+1];

    #pragma HLS DATAFLOW

    xf::cv::Array2xfMat<INPUT_PTR_WIDTH, IN_T, HEIGHT, WIDTH, NPIX>(img_inp, img_mat);
    mat2stream<IN_T, HEIGHT, WIDTH, NPIX>(img_mat, img_st, IMAGE_HEIGHT*IMAGE_WIDTH);
    fast_function(img_st, window, pixel_st, IMAGE_HEIGHT, IMAGE_WIDTH);
    slow_function<IN_T, HEIGHT, WIDTH, NPIX>(window, pixel_st, _dst, IMAGE_HEIGHT, IMAGE_WIDTH);
    xf::cv::xfMat2Array<OUTPUT_PTR_WIDTH, IN_T, HEIGHT, WIDTH, NPIX>(_dst, img_out);


}
}

Accepted Solution
wijm02
Contributor
Registered: 06-09-2020

Thanks for all the replies. After further investigation, I have realised that I need to rewrite the code so that two functions never access the same RAM concurrently; instead, a single function should perform both the reads and the writes on a given RAM.
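A minimal sketch of that kind of restructuring, under stated assumptions and not the poster's actual rewrite: the line buffer lives inside a single function (`window_owner`), which both writes each incoming pixel into it and reads the buffered column back, so the downstream stage only ever receives stream data. `Column`, `window_owner`, `consumer`, `top`, `MAX_W` and `ROWS_K` are hypothetical names and sizes (`ROWS_K` plays the role of PATCHSIZE+1). The fallback `hls::stream` class is a software-only stand-in used when `<hls_stream.h>` is unavailable; it is not the Xilinx implementation.

```cpp
#include <cassert>
#include <vector>
#if __has_include(<hls_stream.h>)
#include <hls_stream.h>
#else
#include <queue>
namespace hls {
// Software-only stand-in for hls::stream (NOT the Xilinx implementation).
template <typename T>
class stream {
public:
    void write(const T& v) { q_.push(v); }
    T read() { T v = q_.front(); q_.pop(); return v; }
    bool empty() const { return q_.empty(); }
private:
    std::queue<T> q_;
};
}  // namespace hls
#endif

const int MAX_W = 64;  // hypothetical maximum image width
const int ROWS_K = 3;  // hypothetical window height (PATCHSIZE+1 in the post)

// One column of the sliding window, oldest row first.
struct Column {
    unsigned char v[ROWS_K];
};

// The only function that touches the line buffer: it both writes and reads `buf`.
void window_owner(hls::stream<unsigned char>& in, hls::stream<Column>& out,
                  int height, int width) {
    assert(width <= MAX_W);
    unsigned char buf[MAX_W][ROWS_K] = {};  // rows before row 0 read as zero
#pragma HLS ARRAY_PARTITION variable=buf complete dim=2
    for (int h = 0; h < height; ++h) {
        for (int w = 0; w < width; ++w) {
#pragma HLS PIPELINE II=1
            buf[w][h % ROWS_K] = in.read();  // newest row overwrites the oldest
            Column c;
            for (int k = 0; k < ROWS_K; ++k) {
                c.v[k] = buf[w][(h + 1 + k) % ROWS_K];  // row h - ROWS_K + 1 + k
            }
            // In hardware this write stalls while `cols` is full (backpressure).
            out.write(c);
        }
    }
}

// Downstream (slow) stage: sees only stream data. Here it forwards the newest
// row, so for this pass-through test the output should equal the input.
void consumer(hls::stream<Column>& in, hls::stream<unsigned char>& out,
              int height, int width) {
    for (int i = 0; i < height * width; ++i) {
#pragma HLS PIPELINE II=2
        Column c = in.read();
        out.write(c.v[ROWS_K - 1]);
    }
}

void top(hls::stream<unsigned char>& in, hls::stream<unsigned char>& out,
         int height, int width) {
#pragma HLS DATAFLOW
    hls::stream<Column> cols;
#pragma HLS STREAM variable=cols depth=4
    window_owner(in, cols, height, width);
    consumer(cols, out, height, width);
}
```

Because each column is copied into the stream, the slow stage works on its own snapshot, and the only channel between the two stages is a single-producer, single-consumer FIFO that can stall the fast stage when it fills.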


7 Replies
scampbell
Moderator
Registered: 10-04-2011

Hello wijm02,

I think what you are describing is the default behavior between functions in a non-dataflow region. That is, one function will completely process its inputs and generate all of its outputs before the next function begins. The DATAFLOW directive is used to specify that you would like function-level pipelining. With DATAFLOW, as soon as one function has output data available, the next function can begin processing. Care must be taken to use canonical forms (specific code structures) to ensure this is applied correctly. This is documented in UG902, beginning on page 142:
https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug902-vivado-high-level-synthesis.pdf#nameddest=xApplyingOptimizationDirectives

DATAFLOW will apply the correct backpressure on the streaming channels in your multi-rate application. However, as you mentioned, you must ensure that the FIFO between these functions is deep enough to hold the output data of the fast function until the slow function can process it. This is discussed briefly on page 221 of UG902.

OK, I hope this helps,
Scott
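The backpressure behaviour described above can be illustrated with a small cycle-level software model (plain C++, not HLS code, and not a model of the HLS scheduler). It assumes a producer that attempts one write per cycle, a consumer that reads once every `consumer_ii` cycles, and a FIFO holding at most `depth` values; `simulate` and `SimStats` are hypothetical names for this sketch, which only covers the case where both processes are already running concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Software model (not HLS code): a producer that tries to write one value per
// cycle feeds a consumer that reads one value every `consumer_ii` cycles
// through a FIFO holding at most `depth` values.
struct SimStats {
    std::vector<int> out;       // values in the order the consumer read them
    std::size_t max_occupancy;  // highest FIFO fill level observed
    std::size_t stall_cycles;   // cycles in which the producer was blocked
};

SimStats simulate(const std::vector<int>& in, std::size_t depth, int consumer_ii) {
    assert(depth > 0 && consumer_ii > 0);
    std::deque<int> fifo;
    SimStats s{{}, 0, 0};
    std::size_t next = 0;  // index of the next value the producer wants to write
    for (long cycle = 0; s.out.size() < in.size(); ++cycle) {
        // Consumer: one read every consumer_ii cycles, if data is available.
        if (cycle % consumer_ii == 0 && !fifo.empty()) {
            s.out.push_back(fifo.front());
            fifo.pop_front();
        }
        // Producer: one write attempt per cycle; a full FIFO stalls it.
        if (next < in.size()) {
            if (fifo.size() < depth) {
                fifo.push_back(in[next++]);
            } else {
                ++s.stall_cycles;  // backpressure: the write waits for space
            }
        }
        if (fifo.size() > s.max_occupancy) s.max_occupancy = fifo.size();
    }
    return s;
}
```

With a depth of 4 and a consumer II of 2, the producer is throttled to the consumer's rate once the FIFO fills, and no data is lost. This relies on both processes running at the same time; if the consumer cannot start until the producer has finished, a FIFO shallower than the data set deadlocks instead.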

 

tedbooth
Scholar
Registered: 03-28-2016

@wijm02 

Could you post your code or a simplified example? That makes it much easier for people to offer suggestions.

For streams, you should use the "axis" interface type on the stream ports.

Ted Booth | Tech. Lead FPGA Design Engineer | DesignLinx Solutions
https://www.designlinxhs.com
wijm02
Contributor
Registered: 06-09-2020

Hi Scott,
Thanks for the quick reply. I have used the dataflow pragma, so I think I have made a mistake somewhere else in my code. Perhaps I have not correctly implemented the canonical form?
Regards,
Marlon

wijm02
Contributor
Registered: 06-09-2020

Hi Ted,
My understanding was that the "axis" interface can be used on the top-level function to communicate with the host program. Is it possible to use it in sub-functions too?
Regards,
Marlon
scampbell
Moderator
Registered: 10-04-2011

Hi Marlon,

You can use the hls::stream class to communicate between functions, but not AXIS. The two can be combined on the top-level interface, but not between functions. One other thing I now see in your code is that you are using the Vivado HLS video library. This library is no longer supported and has been replaced by the Vitis Vision library, located here:

Code: https://github.com/Xilinx/Vitis_Libraries/tree/master/vision

Documentation: https://xilinx.github.io/Vitis_Libraries/vision/2020.1/index.html

There is a migration guide to help in transitioning between them that is located here:

https://xilinx.github.io/Vitis_Libraries/vision/2020.1/overview.html#migrating-hls-video-library-to-vitis-vision

Given that, I would recommend migrating your code as soon as possible so that your design can continue to be supported going forward.

Scott
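To illustrate the distinction between AXIS on the interface and hls::stream between functions, here is a minimal sketch assuming Vivado/Vitis HLS pragmas: the `axis` INTERFACE pragma appears only on the top-level ports, while the two sub-functions inside the DATAFLOW region are connected by a plain local `hls::stream` with a single producer and a single consumer. The function names, the `int` element type and the depth of 2 are illustrative assumptions, and the fallback `hls::stream` class is a software-only stand-in (not the Xilinx implementation) used when `<hls_stream.h>` is unavailable.

```cpp
#include <cassert>
#if __has_include(<hls_stream.h>)
#include <hls_stream.h>
#else
#include <queue>
namespace hls {
// Software-only stand-in for hls::stream (NOT the Xilinx implementation).
template <typename T>
class stream {
public:
    void write(const T& v) { q_.push(v); }
    T read() { T v = q_.front(); q_.pop(); return v; }
    bool empty() const { return q_.empty(); }
private:
    std::queue<T> q_;
};
}  // namespace hls
#endif

// Hypothetical stage 1: adds one to every value.
static void stage_add(hls::stream<int>& in, hls::stream<int>& out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out.write(in.read() + 1);
    }
}

// Hypothetical stage 2: doubles every value.
static void stage_scale(hls::stream<int>& in, hls::stream<int>& out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out.write(in.read() * 2);
    }
}

void stream_top(hls::stream<int>& in, hls::stream<int>& out, int n) {
    // AXI4-Stream interfaces: top-level ports only.
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE s_axilite port=n bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
#pragma HLS DATAFLOW
    // Between sub-functions: a plain local hls::stream (a FIFO channel) with
    // one producer and one consumer; no INTERFACE pragma is used here.
    hls::stream<int> mid;
#pragma HLS STREAM variable=mid depth=2
    stage_add(in, mid, n);
    stage_scale(mid, out, n);
}
```

In C simulation the two stages simply run one after the other; in hardware the DATAFLOW region lets them overlap, with `mid` stalling `stage_add` whenever it is full.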

wijm02
Contributor
Registered: 06-09-2020

Hi Scott,

Thanks for the message.

I am currently using Vitis Vision 2019.2, and moving to 2020.1 is definitely something I would like to do sooner rather than later. I am planning to move to 2020.1 once the embedded platforms are released.

The pixel_t type I used is a struct that I defined myself (I'll add it to the code for clarity; apologies for not including it), so it may look as though I am using the deprecated library if that library had a type with the same name. I also defined the window manually, rather than using a predefined window class.

typedef struct{
	unsigned int h;
	unsigned int w;
	bool kp;
	bool last;
} pixel_t;

I would like to apply video processing functions in slow_function, but right now it can't read the window correctly, because fast_function runs to completion before slow_function begins.
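One plausible explanation for both symptoms in this thread: in a DATAFLOW region, an array passed from one process to another (here `window`) is implemented by default as a ping-pong (PIPO) buffer, and the consuming process can only start once the producing process has finished. The plain C++ model below is not HLS code; `simulate_pipo`, `Token` and `PipoResult` are hypothetical names, and `K` stands for PATCHSIZE+1. It applies that ordering to a small image: fast_function must push every (h, w) token into a FIFO of depth `fifo_depth` before slow_function may start reading the window.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Software model (not HLS code) of the original design, assuming `window`
// behaves as a ping-pong buffer: the consumer starts only after the producer
// has returned.
struct Token { int h; int w; };

struct PipoResult {
    bool deadlock;         // producer blocked on a full FIFO; consumer never starts
    std::vector<int> out;  // row-major output image (empty on deadlock)
};

PipoResult simulate_pipo(const std::vector<int>& img, int H, int W, int K,
                         std::size_t fifo_depth) {
    assert(static_cast<int>(img.size()) == H * W && K > 0);
    std::vector<int> window(static_cast<std::size_t>(W) * K);  // window[w][h % K]
    std::deque<Token> fifo;

    // Producer (fast_function): must run to completion before the consumer starts.
    for (int h = 0; h < H; ++h) {
        for (int w = 0; w < W; ++w) {
            window[w * K + h % K] = img[h * W + w];
            if (fifo.size() == fifo_depth) return {true, {}};  // blocked forever
            fifo.push_back({h, w});
        }
    }

    // Consumer (slow_function): reads the window only after it has been overwritten.
    std::vector<int> out(img.size());
    while (!fifo.empty()) {
        Token t = fifo.front();
        fifo.pop_front();
        out[t.h * W + t.w] = window[t.w * K + t.h % K];
    }
    return {false, out};
}
```

In this model, a deeper pixel_st only trades the deadlock for incorrect output once the image has more rows than the window holds: each row is replaced by the last row that shares its slot.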
