06-15-2020 12:53 AM
I have a Boosting Decision Tree algorithm in HLS, and it passes simulation and synthesis. The only problem is that it uses too many LUTs. I tried reducing the number of trees by 20%-40% and changed the fixed-point precision from <18,8> to <15,5>. These changes improved the BRAM and FF counts, but reduced the LUT count by only 1%-2%.
I am new to FPGA and HLS development, and I think I need some help on which pragmas to add to the C++ code so that the design uses more FFs and RAMs and fewer LUTs. The project code is attached.
Thanks a lot,
07-07-2020 12:44 PM - edited 07-07-2020 12:47 PM
I looked at your design, and I think the source of the very large LUT utilization is the array partition directives. You are following the right method: unroll the loops, then partition the arrays so that data can be supplied to and consumed by the parallel operations that result. All of the array partitions I saw were complete. I like to make a baseline performance solution in which I remove all but the interface directives. When I did that and compared it to your results, I saw the following performance and resource impact:
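A baseline run like that can be sketched as follows. This is a hypothetical top-level function (the name, array size, and interface modes are assumptions, not taken from the attached project): only the interface pragmas are kept, and all unroll/partition directives are removed so HLS schedules everything sequentially.

```cpp
#include <cstddef>

// Hypothetical baseline top function: interface pragmas only, no
// optimization directives. HLS schedules the loop sequentially, which
// gives the minimum-resource, maximum-latency reference point.
void decision_function_baseline(const int features[16], int &score) {
#pragma HLS INTERFACE ap_memory port=features
#pragma HLS INTERFACE ap_vld port=score
    int acc = 0;
    for (std::size_t i = 0; i < 16; ++i) {
        // No unroll or array_partition here, so one array read per cycle.
        acc += features[i];
    }
    score = acc;
}
```

Comparing every later solution against this baseline makes it easy to see which directive is responsible for a given jump in resources.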
                         Baseline    Partition
Latency (cycles), min      58024          146
Interval (cycles), min     58024          120

Utilization Estimates:
BRAM_18K                     213         1202
DSP48E                         0            0
FF                         69138       383660
LUT                        29138       275317
URAM                           0            0
So, we have about a 10x increase in LUT utilization to achieve a roughly 500x reduction in latency. To find out where the LUTs are going, the analysis view is a good place to look. There we can see the "decision function" consumes the majority of the LUTs, and that within that function most of them are in instances and MUXes rather than expressions or memories. If they were in expressions or memories, we might be able to use a resource directive to push them into DSP48s or BRAMs respectively. In this case, however, it is simply too much parallelization causing the large LUT utilization. Unfortunately, the next step is an iterative exploration: unroll the loops with a factor, and use partial (rather than complete) array partitioning matched to that factor, to find the performance-versus-resource trade-off your design and device can tolerate. I would recommend a binary-search-style exploration to converge quickly on what works for you.
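A minimal sketch of that pairing is below. The function name, array size, and factor are hypothetical; the point is that the unroll factor and the cyclic partition factor match, so each parallel copy of the loop body gets its own memory bank, without the LUT cost of a complete partition.

```cpp
#include <cstddef>

constexpr std::size_t N = 64;

// Hypothetical accumulation loop with a matched unroll/partition factor.
// Sweep the factor (e.g. 2, 4, 8, ...) by binary search until the
// latency/resource trade-off fits the device.
int accumulate_scores(const int scores[N]) {
    // Cyclic partition with factor=4 splits the array into 4 banks,
    // interleaved, supplying exactly the bandwidth the unroll consumes.
#pragma HLS ARRAY_PARTITION variable=scores cyclic factor=4
    int acc = 0;
    for (std::size_t i = 0; i < N; ++i) {
        // Partial unroll: 4 loop bodies execute in parallel per iteration,
        // instead of N bodies as with a full unroll + complete partition.
#pragma HLS UNROLL factor=4
        acc += scores[i];
    }
    return acc;
}
```

Mismatched factors waste one side or the other: a partition factor smaller than the unroll factor creates a memory-port bottleneck, while a larger one spends BRAM/LUTs on bandwidth the loop never uses.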
OK, I hope this helps then,
07-10-2020 01:01 PM
Dear Scott@scampbell ,
Thank you so much for your help! I tried many combinations of array partitioning and loop pipelining, and I found that the current code gives me the best speed-to-hardware-utilization ratio. I have decided to go with it.
Also, after I implemented the design as an IP in Vivado, I found that the LUT utilization is only about 50% on my ZCU102 (compared to the Vivado HLS estimate of 99%, which is not accurate).
07-10-2020 02:07 PM
You are right. Sometimes the HLS resource utilization estimates are not completely accurate. Vivado synthesis performs a variety of physical optimizations that HLS cannot account for, such as LUT combining, where different functions are placed into a single LUT. The best way to get an accurate resource estimate from HLS is to export the design with Vivado synthesis selected for evaluation.
Glad I could help,