01-16-2019 02:05 AM
New to the board here. I am curious to hear what people think the biggest challenges are with using HLS. What part of the process do people find most difficult and/or time-consuming? Is it taking non-synthesizable code and making it synthesizable? Parallelism identification? Something else?
What is preventing HLS from becoming more widely adopted?
Would love to hear others' opinions. I have some ideas of my own, but wanted to hear from the community first.
01-16-2019 04:33 AM
The following post might provide some answers - and point you to other places with answers.
01-17-2019 03:45 AM
is it taking non-synthesizable code and making it synthesizable?
In my experience, that's not a "challenge" so much as a "pointless waste of time". I have never come across a situation where massaging CPU-friendly code into HLS-friendly code made sense. It's not possible to do a line-by-line conversion of regular code; instead you have to understand exactly what the whole function/module/program is doing and then write an equivalent in an HLS-friendly way. While you could theoretically gain this understanding by reading and analyzing the C code, for any non-trivial program it's vastly easier to just read the relevant research. After all, the whole point of academic papers is to share understanding...
The key challenges are:
- Finding appropriate algorithms to implement, both in terms of functionality and HLS-friendliness. A solid understanding of what HLS can do is required before starting this.
- Math optimization. Where can I use fixed-point? Can I avoid division entirely, or at least do one big division at the end of a line instead of one per pixel? Which approximation for sine/cosine is appropriate? Often this is not very hard, but I've had to do a few proofs that (for example) the sum of my 1000+ 18-bit values cannot exceed the storage capacity of a 24-bit value, given the process that generates those values. That in turn lets me squeeze the multiplication of this sum by a smaller value into a single DSP slice...
- Pragmas/directives to explain to HLS exactly what I want to do. Mostly I've had success, but there are some blocks where I've just given up, and some things that you simply can't do.
- Bugs. Sometimes HLS just does something really weird, and without the ability to dig through the HDL code (because it's a mess), actually identifying and fixing the problem can be a real challenge.
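To make the math-optimization point above concrete, here is a minimal sketch of the division-hoisting idea: rather than one divide per pixel, compute a fixed-point reciprocal once per line and multiply inside the pipelined loop. The function name, bit widths, and Q16 format are illustrative assumptions, not taken from the post.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical per-line normalization kernel. One division per line
// replaces one division per pixel; the inner loop is then just a
// multiply-and-shift, which maps cheaply onto DSP slices.
void normalize_line(const std::uint16_t *in, std::uint16_t *out,
                    std::size_t width, std::uint32_t scale) {
    // Q16 fixed-point reciprocal, computed once per line.
    const std::uint32_t recip_q16 = (1u << 16) / scale;

    for (std::size_t i = 0; i < width; ++i) {
#pragma HLS PIPELINE II=1 // one pixel per clock under HLS; a plain C++ compiler ignores this
        // Fixed-point multiply-and-shift replaces the per-pixel divide.
        // Truncation means the result can be one LSB below a true
        // divide when `scale` is not a power of two.
        out[i] = static_cast<std::uint16_t>(
            (static_cast<std::uint64_t>(in[i]) * recip_q16) >> 16);
    }
}
```

For power-of-two scales the result is exact; otherwise the acceptable rounding error is exactly the kind of proof obligation the post mentions.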
01-17-2019 09:27 AM
My company has been trying to use HLS exclusively to replace Verilog for our main applications, and we really enjoy the fast simulation speed and automated dataflow generation. As much as I love the HLS tool, there are some annoying aspects that make me hope the HLS team will reconsider the purpose of the tool.
1. I am guessing that the HLS tool is now mainly driven by the SDx tools, because some simple features that we consider crucial to all hardware designers never catch their attention. The SDx tools always make strong assumptions, e.g. the OpenCL kernel model is not very applicable if you simply want to build a filter with a network in and a network out. I can understand that Xilinx is trying to target software designers without much hardware background in order to gain popularity. However, just as @u4223374 points out, a really optimized version always needs to be written in an HLS-friendly way, and the ironic fact is that people use FPGAs mostly for performance (per watt). So in my experience over these years, HLS has always been caught in a dilemma: a) software users cannot gain enough performance to make switching worthwhile, and b) hardware designers like us are second-class citizens who have to fight through difficult workarounds to get things working. The complexity for hardware designers is, from my perspective, really unwise, since any truly impactful FPGA project is likely to be designed by hardware designers. Sometimes I suspect my company is the most HLS-intensive company in the world, simply because our designs typically have over ten different kinds of non-trivial HLS cores in one block diagram. Most of our cores can hardly reuse the same function with a loop, the way OpenCL kernels do. I have been posting lots of improvements that are quite crucial for hardware designers who want to use HLS to replace Verilog for IP design without the limitations of the OpenCL framework or Zynq; unfortunately, I am seeing lots of these issues recorded in the user guide as "unsupported" instead of being fixed. If you are interested, you can check my posts. I do hope that Xilinx will target hardware designers more, to make HLS a replacement for Verilog in large-scale projects without sacrificing performance. A simple example: reading a BRAM in a state machine is not supported, even when you know you could add a state that waits for the data.
2. Also, since HLS cannot hide hardware-specific problems like FIFO depth, deadlock can happen quite frequently if you don't know all the details. Unfortunately, when this happens there is no easy way to debug it; even adding ChipScope is a difficult task. When this happens in a large project, you will wish your code were in Verilog instead, and it is quite possible that you will spend more time and effort on debugging than you would with well-structured SystemVerilog. If you are a student who just needs to run an application and get some results for a paper, it might be OK. But if you are thinking about adopting it as a replacement for HDL in an industry-level application, I would suggest reconsidering unless you have some years of HLS experience. As a reference point, it took me about two years of intensive HLS design to reach the point where I can comfortably say my HLS achieves the same results as my HDL and can be used in production, including lots of hacks for debugging etc.
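To illustrate the FIFO-depth pitfall mentioned above, here is a minimal sketch of a DATAFLOW region: producer and consumer run concurrently and communicate through a channel whose default depth is small, and if their rates ever diverge by more than that depth the design can stall in hardware even though C simulation passes. The function names, the depth of 64, and the trivial arithmetic are all illustrative assumptions.

```cpp
#include <cstddef>

// Writer side of the channel.
static void producer(const int *src, int *chan, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        chan[i] = src[i] * 2;
}

// Reader side of the channel.
static void consumer(const int *chan, int *dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = chan[i] + 1;
}

// Top-level function; n must be <= 64 in this sketch.
void top(const int *src, int *dst, std::size_t n) {
#pragma HLS DATAFLOW
    int chan[64];
#pragma HLS STREAM variable=chan depth=64 // size the FIFO explicitly rather
                                          // than trusting the small default
    producer(src, chan, n);
    consumer(chan, dst, n);
}
```

Sizing the channel explicitly (or proving a bound on the rate mismatch) is exactly the kind of hardware-specific detail the post says HLS cannot make disappear.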
3. The poor regression testing of the HLS tool! Even if you think you know the tool, a new version can change some signal handling and cause random deadlocks. This is what I have been fighting against over these years; I sometimes have to stick with version 2015.x to avoid problems. The mysterious block-control signals between functions make it extremely easy to deadlock when your core has feedback. There seems to be a pragma called "occurrence" to disable this, but it is not in the UG, and thus not officially supported.
Finally, I really do see the potential of HLS replacing HDL, and the simulation ability is amazing, so I would highly recommend that everyone embrace it. On the other hand, I do hope Xilinx will hear the voices of hardware designers like me and make it a better alternative to HDL. I have heard enough statements like "HLS is slow" and "HLS cannot do fine-grained control".