jreinauld (Observer)
Registered: 02-24-2017

Achieving maximum data rate with FIFO/stream interfaces

Hello everyone,

I am working with HLS, and something bothers me. Maybe someone will be able to help.

I am building a very simple block.
The block has an input interface, which is a FIFO of pixels.
The block has an output interface, which is also a FIFO of pixels.
The block forwards every input pixel onto its output; and once every 9 pixels, it inserts a sequence of 6 blank pixels in the stream.

Here is the C++ code:

#include "ap_int.h"
#include "hls_stream.h"

typedef ap_uint<8>              pixel_t;
typedef hls::stream<pixel_t>    pixel_stream_t;

void pad(
    pixel_stream_t & input,
    pixel_stream_t & output
){

    static int pixel_counter = 0;

    pixel_t pixel = input.read();

    if (pixel_counter == 0) {
        pad_loop: for (int pad_counter = 0; pad_counter < 6; pad_counter++) {
            output.write((pixel_t) 0);
        }
    }

    output.write(pixel);

    if (pixel_counter == 8) {
        pixel_counter = 0;
    } else {
        pixel_counter++;
    }
}


The RTL is pretty straightforward:
- a scheduling FSM with 2 states -- let's call them IDLE and XMIT
- a pixel counter that goes from 0 to 8, indicating when blank pixels must be inserted
- a pad counter that goes from 0 to 5, used to insert the blank pixels
This closely mirrors the C++ code, so it is easy to understand.

When I synthesize the C++ in RTL, I get the following results: latency min 1, latency max 7, interval min 2, interval max 8.

synthesis.png

Here are the waves of the simulation:

waves.png

The interval of 2 is due to the fact that the input FIFO can only be read while in the IDLE state.
I am not happy with that, since as a consequence the data rate is halved.
Is there a directive that would optimize this, so that the FIFO can be read directly from the XMIT state if it is not empty?
In other words, is there a way for the block to consume one input pixel every clock cycle (instead of one input pixel every 2 clock cycles)?

Many thanks!

- Julien

P.S. Please find the code and project attached

3 Replies

Xilinx Employee
Registered: 09-05-2018

Re: Achieving maximum data rate with FIFO/stream interfaces

Hey @jreinauld,

Would you be able to use non-blocking reads and/or writes? That should help achieve the desired data rate, though of course I don't know whether that is feasible for your application.

Nicholas Moellers

Xilinx Worldwide Technical Support
jreinauld (Observer)
Registered: 02-24-2017

Re: Achieving maximum data rate with FIFO/stream interfaces

Hi,

I am not sure I understand what a 'non-blocking read/write' is in the context of Vivado HLS.

Could you tell me how to change the C code so that I can try and see if I get an improvement?

Thanks,

- Julien

Xilinx Employee
Registered: 09-05-2018

Re: Achieving maximum data rate with FIFO/stream interfaces

@jreinauld,

The difference is that blocking reads and writes stall until they can complete, whereas a non-blocking access will either pass or fail, but continues either way. Non-blocking reads are usually paired with error checking. Here's some code:

 

// pixel_t and pixel_stream_t as defined in the original post
enum { PIXEL = 0, PAD = 1 };  // FSM states

void pad(
    pixel_stream_t & input,
    pixel_stream_t & output
){
#pragma HLS LATENCY max=1
#pragma HLS stream depth=2 variable=input
#pragma HLS stream depth=2 variable=output

    static int state = PAD;
    static int pixel_counter = 0;
    static int pad_counter = 0;
    pixel_t pixel;

    if (state == PAD) {
        pixel = (pixel_t) 0;
        if (output.write_nb(pixel)) {
            if (pad_counter == 5) {
                pad_counter = 0;
                state = PIXEL;
            } else {
                pad_counter++;
            }
        }
    } else {
        // Note: if the read succeeds but the write fails, the pixel is
        // dropped. With the depth=2 streams and a downstream that always
        // accepts, the write cannot fail here, but keep the caveat in mind.
        if (input.read_nb(pixel) && output.write_nb(pixel)) {
            if (pixel_counter == 8) {
                pixel_counter = 0;
                state = PAD;
            } else {
                pixel_counter++;
            }
        }
    }
}

For what it's worth, if you get rid of the non-blocking reads in the code above, synthesis does give you a latency of 1, but the simulation still shows the stalls. If you can confirm that the stalls come from the testbench and not the IP, you could get rid of all the _nb's and error checking in the code above. But I was unsure, which is why I recommended non-blocking reads and writes as a solution.

The reason I refactored the code like this was so that I could add the #pragma HLS LATENCY max=1 into the function; you cannot combine that directive with a for loop. But I think it ends up being extraneous anyway.

Nicholas Moellers

Xilinx Worldwide Technical Support