Registered: ‎11-21-2013

HLS 2016.2 generates much higher resource consumption than 2015.3 version

I recently revisited some old Vivado HLS C++ code, which mainly consists of AXI Stream interfaces, some AXI Master interfaces, and read/write control logic to move data between PL and DDR.


For my code, using Vivado HLS 2016.2 more than doubles the usage of LUTs (~12%) and FFs (~7%) compared to the results from HLS 2015.3 (~4% LUTs, ~3% FFs).


It is the same code with the same directives, targeting the same device with the same clock frequency constraint.


Has anyone else had a similar experience? Or could someone who still has both versions test similar code that includes both AXI Stream and AXI Master communication to move data between PL and DDR?


If I didn't do anything wrong, this is a really big increase.

0 Kudos
23 Replies
Registered: ‎11-21-2013

Just a quick additional note: for computation-only code, the resource consumption is the same.
0 Kudos
Registered: ‎01-28-2014



I can confirm I've seen the same issue through 2016.4. I've primarily used 2015.4 because any 2016.x version has produced poor results. This is frustrating because there are definitely bug fixes in the newer versions but the increase in latency and resources isn't acceptable. I've heard 2017.1 is no better but I haven't yet tried it myself.

0 Kudos
Registered: ‎11-21-2013

@jprice, thank you for confirming this with your own experience.


This kind of increase in resource consumption is indeed hard to accept, so if anyone else has had similar experiences, please comment here : )

0 Kudos
Registered: ‎04-26-2015

Yes, seen the same thing here. 2015.2 is my fallback, although 2015.4 is also common because it includes a few handy bug-fixes and new features. 2015.4 is also nice because it's the first version where HLS was included with Vivado WebPACK.


2016.1/2016.2 tend to be a similar size, sometimes a little bit larger, although they achieve better timing results. 2016.3/2016.4 are significantly larger, with no obvious benefits found (all of my designs produce absolutely identical results in 2016.3 and 2016.4). In 2017.1, my largest block grew by 67% in LUTs and 273% in FFs - while taking more clock cycles and requiring a lower clock speed. I haven't tested 2017.2 yet.


As much as I would like to just stick with 2015.4 permanently, for UltraScale+ support that's not really suitable.

0 Kudos
Registered: ‎02-07-2008

@xubintan, @jprice, @u4223374, if you can post here, or PM me, examples of where the Vivado HLS resources have dramatically increased, I'll run some tests and investigate further.

0 Kudos
Registered: ‎04-26-2015

@peadard I've already submitted a few of my blocks to Xilinx via one of the other Xilinx moderators.


For what it's worth, I just did a build with my biggest block (this one has been sent to Xilinx) in HLS 2017.2...



I've removed the Y-axis labels, but the differences seen here between 2015.2 and 2017.2 are in the region of 150,000 FFs and 30,000 LUTs. This is for the same code, with the same settings. Target period is 10ns (only 2016.1 and 2016.2 have ever achieved that in the HLS estimates). All of these were done for a Zynq 7045, but testing with a ZU9EG suggests similar behaviour.


In practical terms, I can put three of these blocks (with room to spare) on a Zynq 7045 using HLS 2015.4 or 2016.1. A direct reading of the datasheet suggests that the Zynq UltraScale+ ZU7EG would be a fine upgrade over the Zynq 7045 - a bit more logic, far more DSPs, and the usual speed gains from a newer production process. However, to fit three of these blocks on a Zynq UltraScale+ with HLS 2017.1 or 2017.2 I'd actually need to buy a ZU15EG.

Registered: ‎01-28-2014



I'm out of the office today but I might be able to create an example test case on Wednesday. I've seen similar results to @u4223374.

0 Kudos
Registered: ‎01-28-2014



I've had some trouble reproducing the issues in simple test cases (as is often the case, unfortunately). I mucked around and created a test case where the LUT count went up substantially in 2016.4 vs 2015.4 but the FF count went down a little, and the actual clock rate improved substantially (everything else was essentially identical). It is worth noting that the C-Synthesis estimates were inaccurate in both cases, but 2016.4 was much closer to the truth than 2015.4. Just a friendly reminder to everyone to base sizing numbers on actual synthesized netlists and not on the C-Synthesis estimates. I've only been able to reproduce the full scope of the issue with my full designs. I'll continue to try to reproduce a test case but that'll take time.

0 Kudos
Registered: ‎04-26-2015

@jprice Good point, and I may need to retract my above statement about resource consumption rising. The data above is from the HLS estimates, which in the past have generally been reasonably reliable (+/- maybe 20%). I use them as much as possible because these large blocks take a very long time to build.


However, a recent test suggests that when it goes through the full synthesis process Vivado is able to remove a vast amount of hardware, which brings it back down to much more reasonable levels. This still suggests a serious issue in HLS (after all, if HLS thinks the design needs 180K FFs and the synthesis tool deletes 140K of them as unnecessary then that's not a good sign) but at least it's usable.


I'm regenerating the data now ... very slowly. Should have some figures sometime next week.



Edit: from going through the HDL, it appears that HLS has decided to build a 256-to-1 20-bit multiplexer ... and then replicate that roughly 200 times. HLS has then connected these multiplexers up in a particularly odd way...



Every single data input except for din226 is connected to the same thing. So what HLS has built here is equivalent to an equality test (i.e. mainIndex == 226) followed by a 2-input multiplexer. As far as I can tell, the others it has built are very similar - every one that I've checked has been a two-input multiplexer constructed very inefficiently.
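
To make that equivalence concrete, here is a minimal plain C++ sketch (not HLS code, and not taken from the generated HDL). It models a 256-input selection where every input except index 226 carries the same value, and checks it against an equality test feeding a 2-input multiplexer. The names wide_mux, narrow_mux, muxes_agree and check_mux_equivalence are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kInputs = 256;                // data inputs of the generated multiplexer
constexpr int kSpecial = 226;               // the only input wired differently (din226)
constexpr std::uint32_t kMask = 0xFFFFF;    // 20-bit data path

// Model of the wide structure: select one of 256 data inputs.
std::uint32_t wide_mux(const std::uint32_t (&din)[kInputs], int sel) {
    return din[sel] & kMask;
}

// Model of the compact structure: equality test feeding a 2-input multiplexer.
std::uint32_t narrow_mux(std::uint32_t common, std::uint32_t special, int sel) {
    return (sel == kSpecial ? special : common) & kMask;
}

// Returns true when both models agree for every select value.
bool muxes_agree(std::uint32_t common, std::uint32_t special) {
    std::uint32_t din[kInputs];
    for (int i = 0; i < kInputs; ++i) {
        din[i] = common;                    // every input tied to the same source...
    }
    din[kSpecial] = special;                // ...except din226
    for (int sel = 0; sel < kInputs; ++sel) {
        if (wide_mux(din, sel) != narrow_mux(common, special, sel)) {
            return false;
        }
    }
    return true;
}

// Small self-check; call it from main() or a unit test.
void check_mux_equivalence() {
    assert(muxes_agree(0x12345, 0xABCDE));
    assert(muxes_agree(0x00000, 0xFFFFF));
}
```

Since the two models agree for all 256 select values, the other 254 data inputs add no functionality, which is presumably why Vivado can strip them out later.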


The function this is in does indeed have a fully-partitioned 256-element 20-bit array connected to it, but that array is only used as an output - there should never be any reason to read from the array at this stage (when it's used later, it's treated like a shift register so each element only needs to be able to read from the one next to it, so there's no need for a multiplexer).
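
For readers unfamiliar with the pattern, a shift register over a fully partitioned array can be sketched like this. It is plain C++ written for illustration only, not the code from the block under discussion; the names ShiftRegister, kDepth, kElemMask and check_shift_register are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDepth = 256;            // number of elements, as in the post
constexpr std::uint32_t kElemMask = 0xFFFFF;   // 20-bit elements

struct ShiftRegister {
    // In Vivado HLS, a function using this array would typically request full
    // partitioning with a directive such as
    //   #pragma HLS ARRAY_PARTITION variable=regs complete
    // (kept as a comment so the sketch builds with any C++ compiler).
    std::uint32_t regs[kDepth] = {};

    // Advance by one step: element i takes the value of element i-1, the new
    // sample enters at index 0, and the oldest value is returned.
    std::uint32_t shift(std::uint32_t sample) {
        const std::uint32_t oldest = regs[kDepth - 1];
        for (std::size_t i = kDepth - 1; i > 0; --i) {
            regs[i] = regs[i - 1];             // each element reads only its neighbour
        }
        regs[0] = sample & kElemMask;
        return oldest;
    }
};

// Small self-check; call it from main() or a unit test.
void check_shift_register() {
    ShiftRegister sr;
    for (std::uint32_t v = 1; v <= kDepth; ++v) {
        assert(sr.shift(v) == 0);              // register starts out zeroed
    }
    assert(sr.shift(0) == 1);                  // first sample emerges after kDepth shifts
}
```

Every element is written only from its immediate neighbour and no index is computed at run time, so nothing in this access pattern calls for a wide multiplexer.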


In contrast, the 2016.1 implementation has no multiplexers here at all, which lines up with what I'd expect.

0 Kudos
Registered: ‎01-28-2014



That is quite a disparity. Normally HLS is within...100%, which I did not think was very good but is certainly better than 400%. You also gave an interesting example of why HLS might be so misled. The synthesis tools can definitely optimize that away (though that slows the whole process considerably). I need to re-evaluate all my sizing numbers, as it's quite possible I've fooled myself on sizing (I always used the C-Synthesis reports to gauge relative orders of magnitude).

0 Kudos
Registered: ‎04-26-2015

@jprice Yes, it now makes a lot of sense why HLS thinks it needs so much hardware.


Why it thinks it needs a 256-input multiplexer to select between two values is another question. Perhaps HLS used to do an optimization run to correct this situation, which has been left out as the main synthesis tool can do it instead. That might slightly improve HLS synthesis times, but when it completely destroys the estimates it's not really worthwhile.


The other thing I noticed was that these multiplexers are listed as Instances in the synthesis report, rather than in the Multiplexers tab. Possibly HLS has managed to incorrectly classify them, resulting in optimizations being skipped.


0 Kudos
Registered: ‎04-26-2015

@peadard I've now completed my analysis of this block in newer Vivado versions.




This is the same block following synthesis and implementation. I did all the synthesis and implementation in 2016.1, because (a) I don't have a license for 2017.1/2017.2 yet, and (b) this seems like the fairest way since I only want to evaluate HLS here, not changes in the synthesis/implementation tools. All of them passed timing except for the 2015.4 version; I suspect that this is because the 2015.4 one has done the same work in far fewer clock cycles.


Results are much better than the previous estimates would suggest. However, we're still looking at something like 20% more flip-flops and 10% more LUTs in 2017.1/2017.2 than in 2016.1/2016.2 - for no change in functionality.




Registered: ‎02-24-2016

Hi @u4223374,


I must join the club with my worries: the difference between HLS versions is actually quite surprising.

I am working on the migration of several IPs from 2015.2 to 2017.2. Our IPs are basically simple image-processing correction blocks that use AXI-Stream interfaces.


Here I post the results for one block. I'm seeing similar results with other blocks, sometimes not very positive ones. I haven't made such nice graphs, but I hope a hand-made table will do ;-).

In this table I present the estimates for the two versions I use, 2015.2 and 2017.2. The results are for exactly the same code. In 2017.2 I have to set "#pragma HLS INTERFACE axis off ..." to disable the in/out registers and make the comparison totally fair (2015.2 doesn't add these registers by default). Moreover, in 2015.2 I can select the 'evaluate' option. Now in 2017.2 I can also select "evaluate, synthesis, place&route".

Here the results:

Version || t(ns) | BRAM | DSP |    FF |   LUT | SLICE |  SRL | Source
2015.2  ||  4.5  |    0 | 261 | 13917 | 12356 |     - |    - | HLS estimation
2017.2  ||  5.22 |    0 | 581 | 68568 | 22851 |     - |    - |
2015.2  ||  4.83 |    0 | 261 | 20394 |  5354 |  4538 |  635 | With "Export RTL-> Evaluate"
2017.2  ||  3.67 |    0 | 245 | 24254 |  9829 |     0 | 1834 |
2017.2  ||  4.62 |    0 | 245 | 24470 |  6667 |  6591 | 1148 | With "Export RTL-> Evaluate, Vivado synthesis, place&route"

The first thing that surprises me is the horrible initial estimate from the new HLS. The resources are multiplied for no apparent reason; moreover, the timing estimate does not meet the requirement (5 ns). On the other hand, the initial estimates from 2015.2 are not perfect either, but they are not that far off.

The funny thing comes when I use the 'evaluate' option. I can see that the timing results are now better with the new HLS, but of course at the cost of more resources.
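
As a rough way to quantify how far off the estimates are, the table figures can be turned into estimate-to-export ratios. The sketch below is illustrative only: the struct and function names are made up, and the choice of reference rows is an assumption (the "Export RTL-> Evaluate" row for 2015.2 and the place&route row for 2017.2).

```cpp
#include <cassert>
#include <cstdio>

struct Utilization {
    double ff;
    double lut;
    double dsp;
};

// Figures copied from the table above.
const Utilization kEstimate2015 = {13917, 12356, 261};   // 2015.2, HLS estimation
const Utilization kExported2015 = {20394, 5354, 261};    // 2015.2, Export RTL -> Evaluate
const Utilization kEstimate2017 = {68568, 22851, 581};   // 2017.2, HLS estimation
const Utilization kExported2017 = {24470, 6667, 245};    // 2017.2, after place&route

// Ratio of the HLS estimate to the exported figure; above 1 means overestimation.
double estimate_ratio(double estimated, double exported) {
    assert(exported > 0);
    return estimated / exported;
}

// Prints the three ratios for one tool version.
void print_ratios(const char* version, const Utilization& est, const Utilization& actual) {
    std::printf("%s: FF x%.2f, LUT x%.2f, DSP x%.2f\n", version,
                estimate_ratio(est.ff, actual.ff),
                estimate_ratio(est.lut, actual.lut),
                estimate_ratio(est.dsp, actual.dsp));
}

void print_all_ratios() {
    print_ratios("2015.2", kEstimate2015, kExported2015);
    print_ratios("2017.2", kEstimate2017, kExported2017);
}
```

With these inputs the 2017.2 estimate comes out at roughly 2.8 times the placed-and-routed FF count and about 3.4 times the LUT count, while the 2015.2 estimate is about 0.7 times the FF count and 2.3 times the LUT count reported by 'evaluate'.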


With this I can conclude a few things for myself.

1. It seems that you cannot trust the initial estimate. The design gets optimized much further afterwards. You have to add one extra step to the workflow and always check with 'evaluate'.

2. Since the tool believes it is not meeting timing, it adds more resources to fix this (and still believes it is failing). As a result, more resources are used and the design achieves better timing than required. That's bad, because resources are valuable!


Of course, I have other examples where these conclusions are not totally valid...


I'll continue with my migraine-tion to Vivado 2017.2!  



0 Kudos
Registered: ‎04-26-2015

@garbisingla Good to see that others are facing the same issue.


I was amazed at how horribly inefficient some of the code produced by the new versions of HLS was. Vivado optimizes it away and gets the resource usage down to a reasonable level - but surely Vivado should not need to do that. The inefficient code messes up both resource usage and timing estimates because the hardware that HLS has specified bears little resemblance to what Vivado is actually building. Since HLS relies on the timing estimate internally, the result is that it wastes a lot of time trying to optimize timing for code that will pass easily after Vivado's optimizations - exactly as you've found.

0 Kudos
Registered: ‎01-28-2014


I believe you were going to send some of your designs to Xilinx for them to analyze why the newer versions perform so differently. Did you ever get any feedback? I've been very hesitant to use newer versions because of this kind of problem. Would you recommend newer versions in your experience so far? My next design I'll probably use 2017.2 but I'm a little leery.

0 Kudos
Registered: ‎04-26-2015

@jprice No feedback as yet, although I suppose I should ask for some. So far I've just been hoping that any issues they find in HLS's interaction with my code will be corrected in HLS 2017.3.


I have not been using HLS at all recently (other projects occupying my time) but right now I'd be using either 2015.4 or 2016.2.

0 Kudos
Registered: ‎07-21-2014

Hi All,


We were able to reproduce this issue, and we now have CR# 987295 to track it and fix it in a future release.

After debugging, we found that HLS Csynth is not handling DSP generation properly, so the DSP logic is pushed into flops and LUTs.


The utilization mismatch can be checked in the post-Csynth Analysis view.


We are looking for a possible workaround, and I will update this thread soon.




0 Kudos
Registered: ‎04-26-2015

@anusheel Excellent, thanks for the update!


For what it's worth, HLS 2017.3's estimated FF usage (for the block discussed earlier) has greatly decreased compared to 2017.1 and 2017.2; I suspect that this is because it has met timing fairly easily and therefore hasn't needed to add thousands of extra FFs. However, it's still 20K FFs above 2016.1 and 2016.2.


LUT estimated usage is the highest yet, at just over 106,000 (compared to just over 64,000 for 2016.1 and 2016.2).


I'll do an implementation run in 2016.1 (used for all previous testing to provide a fair comparison) shortly and will see how that goes.


[Edit: or maybe not. The new 2017.3 license manager appears to have broken my implementation license.]


Also, I've seen a bunch of unexpected warnings in synthesis:

WARNING: [HLS 200-40] Directive 'RESOURCE' for core 'ROM_1P_LUTRAM' cannot be applied: Variable 'second_order_float::cos_K1' can not be recognized(UNKNOWN VARIABLE) in 'sincos_approximation'.
WARNING: [HLS 200-40] Directive 'RESOURCE' for core 'ROM_1P_LUTRAM' cannot be applied: Variable 'second_order_float::sin_K2' can not be recognized(UNKNOWN VARIABLE) in 'sincos_approximation'.
WARNING: [HLS 200-40] Directive 'RESOURCE' for core 'ROM_1P_LUTRAM' cannot be applied: Variable 'second_order_float::sin_K1' can not be recognized(UNKNOWN VARIABLE) in 'sincos_approximation'.
WARNING: [HLS 200-40] Directive 'RESOURCE' for core 'ROM_1P_LUTRAM' cannot be applied: Variable 'first_order_fixed_16::sin_cos_K0' can not be recognized(UNKNOWN VARIABLE) in 'sin_or_cos_approximation'.
WARNING: [HLS 200-40] Directive 'RESOURCE' for core 'ROM_1P_LUTRAM' cannot be applied: Variable 'second_order_float::sin_cos_K0' can not be recognized(UNKNOWN VARIABLE) in 'sin_or_cos_approximation'.
WARNING: [HLS 200-40] Directive 'RESOURCE' for core 'ROM_1P_LUTRAM' cannot be applied: Variable 'second_order_float::sin_cos_K1' can not be recognized(UNKNOWN VARIABLE) in 'sin_or_cos_approximation'.
WARNING: [HLS 200-40] Directive 'RESOURCE' for core 'ROM_1P_LUTRAM' cannot be applied: Variable 'second_order_float::sin_cos_K2' can not be recognized(UNKNOWN VARIABLE) in 'sin_or_cos_approximation'.
WARNING: [HLS 200-40] Directive 'RESOURCE' for core 'ROM_1P_LUTRAM' cannot be applied: Variable 'second_order_float::cos_K2' can not be recognized(UNKNOWN VARIABLE) in 'sincos_approximation'.
WARNING: [HLS 200-40] Directive 'RESOURCE' for core 'ROM_1P_LUTRAM' cannot be applied: Variable 'second_order_float::cos_K0' can not be recognized(UNKNOWN VARIABLE) in 'sincos_approximation'.
WARNING: [HLS 200-40] Directive 'RESOURCE' for core 'ROM_1P_LUTRAM' cannot be applied: Variable 'first_order_fixed_16::sin_cos_K1' can not be recognized(UNKNOWN VARIABLE) in 'sin_or_cos_approximation'.
WARNING: [HLS 200-40] Directive 'RESOURCE' for core 'ROM_1P_LUTRAM' cannot be applied: Variable 'second_order_float::sin_K0' can not be recognized(UNKNOWN VARIABLE) in 'sincos_approximation'.

This block does not use sine or cosine at any point.


WARNING: [XFORM 203-152] Cannot apply array mapping directives with instance name 'sPixelBuffer.pixel.V' (block_V2/src/window.cpp:44): cannot find another array to be merged with.
WARNING: [XFORM 203-152] Cannot apply array mapping directives with instance name 'lPixelBuffer0.pixel.V' (block_V2/src/window.cpp:59): cannot find another array to be merged with.
WARNING: [XFORM 203-152] Cannot apply array mapping directives with instance name 'lPixelBuffer1.pixel.V' (block_V2/src/window.cpp:60): cannot find another array to be merged with.
WARNING: [XFORM 203-152] Cannot apply array mapping directives with instance name 'lPixelBuffer2.pixel.V' (block_V2/src/window.cpp:61): cannot find another array to be merged with.

I don't actually have any array mapping directives in here (just a few array partitions).

0 Kudos
Registered: ‎02-24-2016

Hi @anusheel,


Any updates regarding CR# 987295?


Looking forward to a fix! We still have plenty of issues with the instantiation of multipliers and the large increase in resources in 2017.2.




0 Kudos
Registered: ‎08-31-2017


May we know whether this is only a matter of the accuracy of the HLS estimates, rather than a functional issue in the generated RTL code? Is that the case? I am currently using 2017.2.

0 Kudos
Registered: ‎02-24-2016



I have experienced many issues with HLS 2017.2 regarding the instantiation of DSPs, specifically for Zynq 7000 devices. It seems that for UltraScale+ the generation is OK.

My issue is that the tool does not instantiate the correct number of DSPs for an operation. I've seen two kinds of issues:

   1* HLS does not instantiate any DSP at all and performs the multiplications with logic! 

   2* HLS instantiates two times the expected number of DSPs.


In both cases the output HDL is of bad quality. In case of 1*, the resource utilization just explodes...

But in case of 2*, Vivado optimizes away the extra DSPs, resulting in the expected number of DSPs and ok-ish resource usage. However, I just experienced serious issues in one IP where I had logic errors in the middle of the frame. Image totally corrupted!


So, in our case this is really hurting our development. We have now decided to roll back the image processing IPs that make heavy use of multiplications and build them in 2015.2 (the previous version we were using).


The bug is really serious. A Xilinx employee replied that the CR is accepted and that it will be 'tentatively fixed for 2018.2 version'.


Looking forward to hearing something else from Xilinx. 



0 Kudos
Registered: ‎08-31-2017



We target the xc7z010clg400-1.


As for the functional issue, can C/RTL co-simulation uncover it in your cases? I'm wondering whether a passing C/RTL co-simulation implies that the RTL is functionally OK.


On top of that, I only have 2017.2 installed on my Linux machine. Can I also install an older stable version like 2015.2? Which stable HLS version do you suggest: 2015.4, 2016.2, or 2015.2?

0 Kudos
Registered: ‎02-24-2016



For some reason, the co-simulation is successful. I believe that the RTL is functionally valid. I presume the errors are introduced during Vivado optimization; the resource usage is reduced drastically compared to the actual RTL.


I must say that the frame corruption error I mentioned has only appeared in one of our designs. Since I had already had problems with other HLS designs in 2017.2 (*), I decided to quickly try another HLS version. This did the trick, and I have not invested much time in the problem yet.

(*) For example, another issue I faced was that HLS failed to unroll a loop, something really straightforward (nothing fancy). That issue was also acknowledged and fixed in 2017.3. We have not tried later versions, though.


We are using Vivado 2017.2 for the top levels, implementation and so on (Vivado 2017.2 is WAY better).

We tried HLS 2017.2 and it worked 'ok-ish' for a while, but after further investigation we realized that there was something wrong. For simplicity, we are now using HLS 2015.2, just because it was the one we were using before (an old, known version...). I've heard that 2015.4 also works fine.


In our case, the new versions of HLS do not seem to add much. They add features, of course (like the handy button for opening the co-sim! that's indeed nice :-) ), and synthesis goes way faster, but we prefer a reliable design...


And yes, it is possible to install as many versions of Vivado as you need. No problem, just source the proper configuration file in your .bashrc and you're all set. I have a very simple script in my .bashrc that asks me which version of Vivado I want to use every time I open a new terminal ;-). Easy-peasy.

You can also generate IPs with HLS in any version you want and mix and match them in your block design with Vivado. A limitation I've seen is that the Vivado version cannot be older than the HLS version (e.g. HLS IP v2017.2 and Vivado 2015.2 gave me errors when adding IPs to the block design).
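
If you want to guard against that in a build script or test, the rule of thumb can be written as a simple version comparison. This is only a sketch of the observation above, not a documented Xilinx compatibility rule; parse_release and ip_importable are hypothetical helper names, and release strings are assumed to look like "2017.2".

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <utility>

// Parses a release string such as "2017.2" into (year, release).
// Returns {0, 0} when the text does not match that shape.
std::pair<int, int> parse_release(const std::string& text) {
    int year = 0;
    int release = 0;
    if (std::sscanf(text.c_str(), "%d.%d", &year, &release) != 2) {
        return {0, 0};
    }
    return {year, release};
}

// Rule of thumb from the post: the Vivado release should not be older than
// the HLS release that generated the IP.
bool ip_importable(const std::string& hls_release, const std::string& vivado_release) {
    // std::pair compares lexicographically: year first, then release number.
    return parse_release(vivado_release) >= parse_release(hls_release);
}

// Small self-check; call it from main() or a unit test.
void check_ip_importable() {
    assert(ip_importable("2015.2", "2017.2"));    // older HLS IP in newer Vivado
    assert(!ip_importable("2017.2", "2015.2"));   // the failing combination from the post
}
```

The combination used in this thread, HLS 2015.2 IPs inside a Vivado 2017.2 block design, passes this check.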




0 Kudos