mstamler3037
Contributor
8,281 Views
Registered: ‎08-30-2011

HLS 2015.2 Problem

I have a simple YUV-to-RGB filter that I have been using successfully, with good performance, under HLS 2014.4 for a while now. With HLS 2015.2 and 2015.2.1, however, synthesis never completes; the tool simply gets stuck. I also created a fresh project from the same source, and it still fails, which is very strange. The project is attached. In the meantime I am rolling back to 2014.4.

 

 

6 Replies
austin
Scholar
8,271 Views
Registered: ‎02-27-2008

m,

 

Perhaps you would post the logfile?

 

Wading through the archive takes time, and reading the log might reveal the problem far more easily.

 

Austin Lesea
Principal Engineer
Xilinx San Jose
herver
Xilinx Employee
8,243 Views
Registered: ‎08-17-2011

hi @mstamler3037

 

your solution1.log file shows:

@W [XFORM-561] Updating loop lower bound from 200 to 720 for loop 'YUV2RGB_LOOP_X' (src/hls_yuv2rgb.cpp:58:1) in function 'hls_yuv2rgb'.
@W [XFORM-561] Updating loop lower bound from 200 to 576 for loop 'YUV2RGB_LOOP_Y' (src/hls_yuv2rgb.cpp:55:1) in function 'hls_yuv2rgb'.
@I [XFORM-501] Unrolling loop 'YUV2RGB_LOOP_Y' (src/hls_yuv2rgb.cpp:54) in function 'hls_yuv2rgb' completely.
@W [XFORM-504] Stop unrolling loop 'YUV2RGB_LOOP_Y' (src/hls_yuv2rgb.cpp:54) in function 'hls_yuv2rgb' because it may cause large runtime and excessive memory usage due to increase in code size. Please avoid unrolling the loop or form sub-functions for code in the loop body.
@I [XFORM-501] Unrolling loop 'YUV2RGB_LOOP_X' (src/hls_yuv2rgb.cpp:57) in function 'hls_yuv2rgb' completely.
@I [XFORM-101] Partitioning array 'Wyuv'  in dimension 2 completely.

 

Cross-referencing the C code, it looks like you want to completely unroll the inner loop body 576*720 times. I think this is not reasonable!

 

Note: a directive pipeline II=1 means: I want this code (or region) to be callable every clock cycle.

You need to understand that pipelining will unroll all sub-loops contained at that level in order to achieve the required II.

 

2014.4 tells you

@W [XFORM-504] Stop unrolling loop 'YUV2RGB_LOOP_Y' (src/hls_yuv2rgb.cpp:54) in function 'hls_yuv2rgb' because it may cause large runtime and excessive memory usage due to increase in code size. Please avoid unrolling the loop or form sub-functions for code in the loop body.

 

I have not run the tool, but my guess is that 2015.x is more optimistic and actually attempts the full unroll, then gets stuck because the result is not manageable.

 

Do you want to move the pipeline directive inside the inner loop body?
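
To illustrate the question above, here is a minimal sketch of the directive placed inside the inner loop body, so that only the per-pixel work is pipelined instead of the whole function. The function name `pixel_filter`, the placeholder `process_one_pixel` and the flat array interface are assumptions for illustration; this is not the original hls_yuv2rgb.cpp.

```cpp
// Frame size taken from the thread (720x576).
const int WIDTH = 720;
const int HEIGHT = 576;

// Placeholder per-pixel operation (assumption); the real YUV-to-RGB
// arithmetic from hls_yuv2rgb.cpp is not reproduced here.
static unsigned char process_one_pixel(unsigned char in) {
    return static_cast<unsigned char>(255 - in);
}

// Hypothetical top-level function, not the original hls_yuv2rgb.
void pixel_filter(const unsigned char in[HEIGHT * WIDTH],
                  unsigned char out[HEIGHT * WIDTH]) {
    YUV2RGB_LOOP_Y: for (int y = 0; y < HEIGHT; y++) {
        YUV2RGB_LOOP_X: for (int x = 0; x < WIDTH; x++) {
// Pipeline only the inner loop body: one pixel is requested per clock
// cycle, and the loops themselves stay rolled.
#pragma HLS PIPELINE II=1
            out[y * WIDTH + x] = process_one_pixel(in[y * WIDTH + x]);
        }
    }
}
```

An ordinary C++ compiler ignores the HLS pragma (possibly with a warning), so the same source can be run in C simulation before synthesis.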

 

I hope this helps.

- Hervé

SIGNATURE:
* New Dedicated Vivado HLS forums* http://forums.xilinx.com/t5/High-Level-Synthesis-HLS/bd-p/hls
* Readme/Guidance* http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

* Please mark the Answer as "Accept as solution" if information provided is helpful.
* Give Kudos to a post which you think is helpful and reply oriented.
mstamler3037
Contributor
8,232 Views
Registered: ‎08-30-2011

Thanks for your response.

 

I want to emphasize that in 2014.4 this file synthesizes to a latency of 1054 cycles, meets timing after implementation, integrates into my Vivado video project, and works like a charm on our Kintex-7 board.

 

This very same project compiled under 2015.2 doesn't even finish synthesis. The project has a single PIPELINE directive at the function header, which indeed instructs HLS to unroll. 2014.4 handles this with no problem, but 2015.2 gets stuck.

 

Should I open a webcase for this issue?

 

BR

Michael

 

herver
Xilinx Employee
8,214 Views
Registered: ‎08-17-2011

hi @mstamler3037

 

 

there is something not right somewhere. Your code has 2 loops like so:

 

   width = 720;
   height = 576;

   YUV2RGB_LOOP_Y: for (y=0; y<height; y++)
   {
#pragma HLS loop_tripcount min=200 max=576
    YUV2RGB_LOOP_X: for (x=0; x<width; x++)
    {
#pragma HLS loop_tripcount min=200 max=720

 

there is no way that this is taking only 1054 cycles; the tool is wrong if this number comes from there.

What's the latency in cosim and on hardware?

I didn't see a C testbench (TB). Do you have one?

If you read my other posts on this forum, you will notice that I always insist on not putting anything on a board before cosim passes with a self-checking TB.
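
As a rough idea of what "self-checking" means here, the sketch below drives a design under test, compares every output against an independent golden model, and reports the number of mismatches. Everything in it is hypothetical: `dut_invert` merely stands in for the synthesized function, and `golden_invert`, `run_selfcheck` and `N_PIXELS` are names made up for this illustration.

```cpp
#include <cstdio>

const int N_PIXELS = 16;  // small, hypothetical test size

// Hypothetical design under test; stands in for the synthesized top function.
void dut_invert(const unsigned char in[N_PIXELS], unsigned char out[N_PIXELS]) {
    for (int i = 0; i < N_PIXELS; i++) {
        out[i] = static_cast<unsigned char>(255 - in[i]);
    }
}

// Independent golden model, written differently from the DUT on purpose.
unsigned char golden_invert(unsigned char v) {
    return static_cast<unsigned char>(v ^ 0xFF);
}

// Drives the DUT, compares every output with the golden model and
// returns the number of mismatches (0 means pass).
int run_selfcheck() {
    unsigned char in[N_PIXELS];
    unsigned char out[N_PIXELS];
    for (int i = 0; i < N_PIXELS; i++) {
        in[i] = static_cast<unsigned char>(i * 17);  // 0, 17, ..., 255
    }
    dut_invert(in, out);

    int errors = 0;
    for (int i = 0; i < N_PIXELS; i++) {
        unsigned char expected = golden_invert(in[i]);
        if (out[i] != expected) {
            std::printf("Mismatch at %d: got %d, expected %d\n", i, out[i], expected);
            errors++;
        }
    }
    return errors;
}
```

In an actual testbench, `main()` would typically end with `return run_selfcheck();`, because C/RTL cosimulation treats a nonzero return value from the testbench as a failure.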

 

if that works on a board, then i don't see how the latency can be less than 576*720 ~= 400k cycles.

 

Either it's only a reporting issue or the generated RTL somehow works by chance.

 

So yes, please open a webcase, and make sure to include a self-checking C TB so that it can be handled productively.

- Hervé

mstamler3037
Contributor
8,188 Views
Registered: ‎08-30-2011

There is no problem with the latency being less than 400k. Unrolling and pipelining easily achieve this; I have done it many times, and it works like a charm on my board. In fact, if the latency were 400k cycles, the design would not work in real time.

 

Keep in mind that latency is the delay from input to output for each sample/pixel, not the entire frame.

 

Any more insights here? I have been working extensively with HLS for five months now, and I must say that it has many problems, especially with AXIS streaming. HLS is very sensitive.

 

BR

 

herver
Xilinx Employee
8,175 Views
Registered: ‎08-17-2011


@mstamler3037 wrote:

 

Keep in mind that latency is the delay from input to output for each sample/pixel, not the entire frame.

 

There is an issue with the terminology here.

 

with this:

 

  YUV2RGB_LOOP_Y: for (y=0; y<576; y++)
    YUV2RGB_LOOP_X: for (x=0; x<720; x++) {
            process_one_pixel_body_function(......)
  }

 

 

and that

      process_one_pixel_body_function(....)

they will not have the same latency.

 

You are talking about the latency on a per-pixel basis, but you describe the design with loops.

The tool reports the latency to go through the whole code, i.e. all the loops; somewhere in the reports there is a separate entry for process_one_pixel_body_function only.

 

If the II of this function is 3, then the clock cycles needed to execute all the loops are 3*576*720.

 

Usually people want II=1, meaning their pixel processing pipeline can accept one pixel every clock cycle.

A pixel processing pipeline can have a latency of 10 but an II of 1, and the overall latency of this function will then be something like 1*576*720+10.
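
The arithmetic above can be written out as a small helper. This is only a back-of-the-envelope estimate using the rule of thumb from this post (II times the total trip count, plus the pipeline depth); the name `total_cycles` and its parameters are made up for illustration, and the cycle counts in the actual HLS report may differ by a few cycles.

```cpp
// Rough cycle estimate for two nested loops around a pipelined body:
// cycles ~= II * rows * cols + pipeline depth (assumed rule of thumb).
constexpr long total_cycles(long ii, long rows, long cols, long depth) {
    return ii * rows * cols + depth;
}

// II=1, pipeline latency 10, 576x720 frame: 414,720 + 10 cycles.
static_assert(total_cycles(1, 576, 720, 10) == 414730, "II=1 estimate");

// II=3, depth ignored: 3 * 414,720 cycles.
static_assert(total_cycles(3, 576, 720, 0) == 1244160, "II=3 estimate");
```

Either way the whole-frame figure is several hundred thousand cycles, far above the 1054 cycles quoted earlier in the thread.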

 

I stand by my previous statement: there is something strange in the numbers reported. The whole function represented by the two nested loops cannot execute in fewer than 576*720 cycles.

 

Disclaimer: I mean this within a reasonable area. Of course one could unroll the inner loop, read 720 pixels every clock cycle, process them with some latency, and write back 720 pixels N cycles later, but this would not be practical given the FPGA area and throughput needed.

-> Is that what you actually did?

- Hervé
