cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
mstamler3037
Contributor
Contributor
9,766 Views
Registered: ‎08-30-2011

HLS Synthesis Problem

Attached is a simple HLS project that we are developing. When the #If on line 109 is '0' then it compiles cleanly and nicely. However, when I add a simple comparator by enablling the #if on line 109 then it fails. I have invested three days to figure it out without success. What am I missing here?

 

BR

Michael Stamler

 

0 Kudos
13 Replies
balkris
Xilinx Employee
Xilinx Employee
9,758 Views
Registered: ‎08-01-2008

what is the error you are getting . I will try to run your design
Thanks and Regards
Balkrishan
--------------------------------------------------------------------------------------------
Please mark the post as an answer "Accept as solution" in case it helped resolve your query.
Give kudos in case a post in case it guided to the solution.
0 Kudos
mstamler3037
Contributor
Contributor
9,701 Views
Registered: ‎08-30-2011

It's not so much an error than it does not succeed to synthesize the design. WIthout the compare it synthesizes it quickly and well. With it it gets confused and stops.

0 Kudos
muzaffer
Teacher
Teacher
9,690 Views
Registered: ‎03-31-2012

honestly I don't want to unarchive a design and go through it. Could you post the relevant portion of the design where this "comparator" exists?
- Please mark the Answer as "Accept as solution" if information provided is helpful.
Give Kudos to a post which you think is helpful and reply oriented.
0 Kudos
mstamler3037
Contributor
Contributor
9,680 Views
Registered: ‎08-30-2011

Hi. The 'design' is a single, small file in the Src directory. I gave you the project so that you can both peruse the file and synthesize it 'as is' if you like. But anyway here is the code below.

 

Look at the line: if (x4MinLocl > X4_R_B)

If I remove this line then it synthesizes well. Otherwise it doesn't

 

Cheers

Michael

 

 

 

 

 

 

 

 

 

 

 

// #include "hls_yuv2rgb.h"

#include "ap_int.h"
#include <iostream>
#include <iomanip>
using namespace std;

#include "X4.h"


typedef struct {
ap_uint<24> data; //
bool tuser; // ap_uint<1>
bool tlast; // ap_uint<1>
} STREAM_DATA;


typedef struct s_rgb
{
unsigned char r,g,b;
} RGB;

typedef unsigned short image_dim_t;
typedef unsigned char image_pix_t;
typedef signed short yuv_intrnl_t;
typedef signed short yuv2rgb_coef_t;

#define X4_0ShftFac 2
#define BShftFac 8
#define X4ShftFac 30

#define WIDTH 720
#define LENGTH 576
#define N WIDTH*LENGTH

#define CLIP(x) (((x)>255) ? 255 : (((x)<0) ? 0 : (x)))

void X4(AXI_STREAM_24 video_in[], ap_uint<16> corr_B_R[N],ap_uint<16> corr_B_G[N],ap_uint<16> corr_B_B[N],
ap_uint<16> corr_A_R[N],ap_uint<16> corr_A_G[N], ap_uint<16> corr_A_B[N],
AXI_STREAM_48 video_out[], ap_int<32> *x4Min, ap_int<32> *x4Max);


void X4(AXI_STREAM_24 video_in[], ap_uint<16> corr_B_R[N],ap_uint<16> corr_B_G[N],ap_uint<16> corr_B_B[N],
ap_uint<16> corr_A_R[N],ap_uint<16> corr_A_G[N], ap_uint<16> corr_A_B[N],
AXI_STREAM_48 video_out[], ap_int<32> *x4Min, ap_int<32> *x4Max)
{
//#pragma HLS PIPELINE II=1 enable_flush

// AXI streaming interfaces
#pragma HLS interface axis port=video_in
#pragma HLS interface axis port=video_out

// ap_uint<24> tmp;

int X4_0_R, X4_0_G, X4_0_B;
int X4_R, X4_G, X4_B;

int X4_R_B;

int tmpA, tmpB, tmpC;

int A_invShftFac = 16;
RGB imgIn;
ap_uint<48> outV;
ap_uint<24> inV;


int width = 720;
int height = 576;
int ij;

int x4MaxLocl; //0;
int x4MinLocl; //255;


x4MaxLocl = 0; //-5000; //0;
x4MinLocl = 0; // 5000; //255;


YUV2RGB_LOOP_Y: for (int y=0; y<height; y++)
{
#pragma HLS PIPELINE
#pragma HLS loop_tripcount min=200 max=576
YUV2RGB_LOOP_X: for (int x=0; x<width; x++)
{
// #pragma HLS UNROLL
#pragma HLS PIPELINE
#pragma HLS loop_tripcount min=200 max=720
ij = y*x;

#ifdef __SYNTHESIS__
inV = video_in->data;
#else
inV = video_in[ij].data;
#endif
imgIn.r = inV >>16;
imgIn.g = inV >>8;
imgIn.b = inV & 0xFF;

X4_0_R = ((((int)imgIn.r) << BShftFac) - (int) corr_B_R[ij]) >> (BShftFac - X4_0ShftFac);
X4_0_G = ((((int)imgIn.g) << BShftFac) - (int) corr_B_G[ij]) >> (BShftFac - X4_0ShftFac);
X4_0_B = ((((int)imgIn.b) << BShftFac) - (int) corr_B_B[ij]) >> (BShftFac - X4_0ShftFac);

X4_R = X4_0_R * corr_A_R[ij];
X4_G = X4_0_G * corr_A_G[ij];
X4_B = X4_0_B * corr_A_B[ij];
outV = (X4_R <<24) | (X4_G <<8) | X4_B;

#if 0

X4_R_B = X4_R;
if (x4MinLocl > X4_R_B)
x4MinLocl = X4_R_B;
*x4Min = x4MinLocl;

/**
if (X4_R > x4MaxLocl)
x4MaxLocl = X4_R;
else if (X4_G > x4MaxLocl)
x4MaxLocl = X4_G;
else if (X4_B > x4MaxLocl)
x4MaxLocl = X4_B;
*/


#endif
#ifdef __SYNTHESIS__
video_out->data = outV;
#else
video_out[ij].data = outV;
#endif
video_out->tuser = video_in->tuser;
video_out->tlast = video_in->tlast;
}
}

/// x4MaxLocl -= x4MinLocl;

//Rani moved to normalize
//x4MaxLocl = (1<<(x4Max_invShftFac - X4ShftFac + 8))/(x4MaxLocl);

*x4Min = x4MinLocl;
//*x4Max = x4MaxLocl;

 

}

0 Kudos
muzaffer
Teacher
Teacher
9,677 Views
Registered: ‎03-31-2012

Try to calculate min & max locally and write them once after the loop ends instead of writing at every iteration. ie declare a local variable, calculate min/max over the loop and write it back once at the end.

- Please mark the Answer as "Accept as solution" if information provided is helpful.
Give Kudos to a post which you think is helpful and reply oriented.
0 Kudos
mstamler3037
Contributor
Contributor
9,665 Views
Registered: ‎08-30-2011

I did that too. Same results.

 

I have been playing with the compare statement for two days now. Fascinating as to why the synthesis fails when such a simple compare is made. Any other ideas?

 

BR

Michael

 

0 Kudos
muzaffer
Teacher
Teacher
9,655 Views
Registered: ‎03-31-2012

Actually it's not a simple compare. It is something which lasts over iterations. I guess it's possible that VHLS is trying to unroll your loop and coming up with a very large comparator tree.
Also if your width is fixed, don't make the trip count minimum a smaller number. Fix it as it is already fixed. That might help.
- Please mark the Answer as "Accept as solution" if information provided is helpful.
Give Kudos to a post which you think is helpful and reply oriented.
0 Kudos
mstamler3037
Contributor
Contributor
9,647 Views
Registered: ‎08-30-2011

HI. 

 

This code compares the current pixel value against a local variable. It is streaming so this comparator logic receives one pixel at a time and compares it one pixel at a time. That's it.

 

Unless I am missing someting there is no complicated logic that spans multiple cycles.

 

This code is a test case for us as we evaluate the HLS tool. I see it as rather serious if I can't make such a simple comparison statement. Can you help me clarify this point.

 

Many thanks

Michael

 

 

0 Kudos
muzaffer
Teacher
Teacher
9,644 Views
Registered: ‎03-31-2012

I was correct in my earlier assessment. You can't just let hls unroll/pipeline without any control.
Remove the pipeline pragma from LOOP_Y and add it to LOOP_X. Then you can synthesize.

- Please mark the Answer as "Accept as solution" if information provided is helpful.
Give Kudos to a post which you think is helpful and reply oriented.
0 Kudos
mstamler3037
Contributor
Contributor
8,580 Views
Registered: ‎08-30-2011

OK. when I apply your instructions it indeed synthesizes but with a latency of 414,700 cycles. When I apply the PIPELINE as before I get 758 cycles latency and it works nicely. Adding the compare causes it to fail and the application of PIPELINE to the inner loop creates a latency that is too high. The addition of the compare logic is very, very simple. A simple peak detector using a local variable, not at all on the critical path.

 

I am a bit at odds as to what to do here. The functionality is simple but the impact on performance is tremendous. It seems that HLS is very sensitive to small changes in code. Any suggestions?

 

 

0 Kudos
muzaffer
Teacher
Teacher
8,578 Views
Registered: ‎03-31-2012

my suggestion would be to think about how the old logic  & the new logic would be implemented in rtl and try to explain where the extra latency is coming from.

The 414700 number sounds like width*height of your image. If you look at the image one byte at a time, that would be the latency. Maybe you need to read one whole line and do the compare but you should notice that an compare tree of a whole line would be quite big.

As far as I can tell you are not really considering what your logic implies for hardware. The change you make, albeit seems small in C, has a different structure in implementation.

As I mentioned, try a flow chart of your logic & see if the latency can be explained. If it can (which looks like it) see how you can optimize it in C, if not create a web case with Xilinx to get them to fix the issue.

- Please mark the Answer as "Accept as solution" if information provided is helpful.
Give Kudos to a post which you think is helpful and reply oriented.
muzaffer
Teacher
Teacher
8,508 Views
Registered: ‎03-31-2012

@herver: I try my best to contribute. I was wondering if you could take a look at case # 10321140 which is giving me large difference between HLS timing estimation & vivado evaluation. I really need to understand where this is coming from, how to analyze the generated code & how to recode to get better timing.
- Please mark the Answer as "Accept as solution" if information provided is helpful.
Give Kudos to a post which you think is helpful and reply oriented.
0 Kudos
mstamler3037
Contributor
Contributor
8,490 Views
Registered: ‎08-30-2011

I'm sorry but I only just saw this. Is this still relevant? How do I access this case. I don't see a search for the case number?

0 Kudos