
MicroZed Chronicles: Focusing in on HLS Timing and Initiation Interval Violations

Xilinx Employee

Editor’s Note: This content is republished from the MicroZed Chronicles, with permission from the author and Hackster.io.

 

As you will be aware, I do a lot of High-Level Synthesis (HLS) design for clients, especially for image processing applications. One of the great things about HLS is the productivity it brings when creating and verifying the application.

However, when the performance of our HLS block is not as expected, being able to find the critical path behind the violation is crucial. We have looked before at the analysis view and the potential optimizations which can be used to increase performance.

In this blog, we are going to examine how we can focus on finding the timing and initiation interval (II) violations within our HLS designs and, of course, correcting them.

Let’s take a simple example of a Test Pattern Generator (TPG). The custom TPG will load an image over the S AXI link from the PS of a Zynq device and then output this image at very fast frame rates. Such an approach is often used to verify image processing algorithms. The image, once loaded, is stored in BRAM / URAM, depending upon the device which has been selected for implementation. Crucially, to achieve the high frame rates required, we are going to output multiple pixels per clock.

This code can be written simply, using memcpy to read in the image from the PS DDR and two for loops to output the data correctly.

Source Code.

#include "Blog_352.h"   // axis, image_in, VIDEO_COMP, MAX_WIDTH, MAX_HEIGHT; tpg_gen() is assumed to be declared here
#include "ap_utils.h"
#include <string.h>     // memcpy

void tpg(axis& OUTPUT_STREAM,
         int lines,
         int pixels,
         int line_start,
         int pixel_start,
         image_in* image){

#pragma HLS INTERFACE m_axi depth=327680 port=image offset=slave bundle=image
#pragma HLS INTERFACE axis register both port=OUTPUT_STREAM
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE s_axilite port=lines
#pragma HLS INTERFACE s_axilite port=pixels
#pragma HLS INTERFACE s_axilite port=line_start
#pragma HLS INTERFACE s_axilite port=pixel_start

    // Local frame buffer, implemented in BRAM / URAM.
    image_in frame[MAX_WIDTH*(MAX_HEIGHT)];

    // Copy the whole image from PS DDR into the local buffer.
    memcpy(frame,image,(MAX_WIDTH*(MAX_HEIGHT))*sizeof(image_in));

    // Stream the stored image out, two pixels per transfer.
    tpg_gen(OUTPUT_STREAM,lines,pixels,line_start,pixel_start,frame);
}

 

void tpg_gen(axis& OUTPUT_STREAM,
             int lines,
             int pixels,
             int y_start,
             int x_start,
             image_in* frame){

    VIDEO_COMP tpg_gen;
    int y = 0;
    int x = 0;

    outer_loop:for (y = 0; y < lines; y++){
#pragma HLS LOOP_TRIPCOUNT max=513
        inner_loop:for (x = 0; x < pixels; x += 2) {
#pragma HLS LOOP_TRIPCOUNT max=640
#pragma HLS PIPELINE II=1
            // Address of the even pixel of the pair within the stored frame.
            int idx = ((y+y_start)*MAX_WIDTH)+(x+x_start);

            // Drive both sideband flags on every transfer so that no stale
            // or uninitialized value carries over from an earlier iteration.
            tpg_gen.user = (y == 0 && x == 0) ? 1 : 0;   // start of frame
            tpg_gen.last = (x == (pixels-2)) ? 1 : 0;    // end of line

            // Two array reads per iteration: odd pixel in the upper 16 bits,
            // even pixel in the lower 16 bits.
            tpg_gen.data = ((frame[idx+1] << 16) | frame[idx]);

            OUTPUT_STREAM.write(tpg_gen);
        }
    }
}
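To make the inner loop easier to reason about away from the tools, here is a minimal software sketch of what each iteration produces. It is an illustration only: Beat, pack_pixels and make_beat are hypothetical names, the plain struct stands in for the VIDEO_COMP type from Blog_352.h (which is not shown here), and the pixels are assumed to be unsigned 16-bit values. The sketch sets user and last independently on every transfer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical plain-C++ stand-in for one AXI Stream transfer.
struct Beat {
    std::uint32_t data;  // two packed 16-bit pixels
    bool user;           // set on the first transfer of the frame
    bool last;           // set on the final transfer of each line
};

// Pack a pixel pair: even pixel in bits 15:0, odd pixel in bits 31:16.
inline std::uint32_t pack_pixels(std::uint16_t even_px, std::uint16_t odd_px) {
    return (static_cast<std::uint32_t>(odd_px) << 16) | even_px;
}

// Build the transfer for loop position (y, x), where x advances in steps of 2.
// `width` is the stored line length and `pixels` the number of pixels output per line.
inline Beat make_beat(const std::uint16_t* frame, int width, int y, int x, int pixels) {
    assert(x % 2 == 0 && x + 1 < width);  // x must address a complete pixel pair
    const int idx = y * width + x;
    Beat b;
    b.data = pack_pixels(frame[idx], frame[idx + 1]);  // two reads per iteration
    b.user = (y == 0 && x == 0);
    b.last = (x == pixels - 2);
    return b;
}
```

Note that make_beat reads frame[idx] and frame[idx + 1] in the same iteration, which is exactly the pair of memory accesses the rest of this blog is concerned with.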

However, the desire to output two (or more) pixels per clock creates a bottleneck when reading from the array which stores the image.

This bottleneck exists because the image is stored as 16-bit words, one pixel per memory location. Reading out two pixels requires reading two memory locations, and this cannot be achieved in one clock cycle unless we correctly partition the BRAM.

When we open the analysis view, the module hierarchy indicates which module, if any, has a timing violation or an initiation interval (II) violation.

If we only want to focus on the violations, we can click on the timing or II violation button at the top of the module hierarchy.

As it stands, our design reports an II violation in the tpg_gen function, which is the core of the design: it reads the memory and outputs the data over the AXI Stream.

352_Fig1.png
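To get a feel for what a missed II costs, the following rough sketch estimates the cycle count of the pipelined loop nest. It is an approximation under stated assumptions: approx_loop_cycles is a hypothetical helper, the trip counts are the LOOP_TRIPCOUNT maxima from the source code (513 and 640), and pipeline fill latency and loop entry / exit overhead are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Approximate cycles for a pipelined loop nest: outer trips x inner trips x II.
// Ignores pipeline depth and loop entry / exit overhead, so treat the result
// as a lower-bound style estimate rather than a tool-accurate figure.
inline std::uint64_t approx_loop_cycles(std::uint64_t outer_trips,
                                        std::uint64_t inner_trips,
                                        std::uint64_t ii) {
    assert(ii >= 1);  // an initiation interval below 1 is not meaningful
    return outer_trips * inner_trips * ii;
}
```

With these numbers, approx_loop_cycles(513, 640, 1) gives 328,320 cycles, while an achieved II of 2 would give 656,640, which is to say the frame rate would roughly halve.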

At this point we know we have an II violation, but we need to find the root cause within our design and correct it. We can do this by setting the analysis focus to II violation, which focuses the analysis view on the design element causing the violation.

352_Fig2.png

Once the focus is set to II violation in our example, we see that the root cause of the issue is the reads from the BRAM blocks. The failing element will be colored blue. This will be the case for all analysis views indicating there is an issue. Knowing the failing element, we need to identify which line of the code is causing the issue. We can right-click the operation / control step and select “goto source”. The source line will be cross probed.

352_Fig3.png

Now that we know what the issue is and have identified the source line of code causing it, we can begin to implement optimization strategies. For this case, we can partition the array cyclically by a factor of two, such that it is implemented as two Block RAMs with the data interleaved between them. For example, BRAM A contains data elements 0, 2, 4, 6 etc., while BRAM B contains 1, 3, 5, 7 etc. This allows the two pixel values to be read in parallel and the desired initiation interval to be achieved.

352_Fig4.png
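The cyclic mapping described above can be sketched in plain C++. This is a model of the idea rather than the tool's implementation: BankSlot, cyclic_slot and pair_reads_in_parallel are hypothetical names. The pragma shown in the comment is the usual way to request such a partition, but its exact placement and arguments here are an assumption, as the final source is not reproduced in this post.

```cpp
#include <cassert>
#include <cstddef>

// In the HLS source, the partition would typically be requested next to the
// array declaration in tpg(), for example (assumed, not taken from the article):
//   #pragma HLS ARRAY_PARTITION variable=frame cyclic factor=2 dim=1

// Where an element of the original array lands after cyclic partitioning.
struct BankSlot {
    std::size_t bank;    // which memory holds the element
    std::size_t offset;  // address within that memory
};

// Cyclic partitioning interleaves elements across `factor` memories.
inline BankSlot cyclic_slot(std::size_t index, std::size_t factor) {
    assert(factor > 0);
    return BankSlot{index % factor, index / factor};
}

// True when elements index and index + 1 sit in different memories,
// so both can be fetched in the same clock cycle.
inline bool pair_reads_in_parallel(std::size_t index, std::size_t factor) {
    return cyclic_slot(index, factor).bank != cyclic_slot(index + 1, factor).bank;
}
```

With a factor of two, elements 6 and 7 map to offset 3 of bank 0 and bank 1 respectively, and any pair of neighbouring elements falls in different banks regardless of where the pair starts. With a factor of one (no partitioning), both reads target the same memory.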

Of course, more complex algorithms may need a little more analysis and optimization, but at least we now know how we can focus in on what the root cause might be.

 

See My FPGA / SoC Projects: Adam Taylor on Hackster.io

Get the Code: ATaylorCEngFIET (Adam Taylor)