Visitor
327 Views
Registered: ‎06-09-2020

Any good ways to reduce the LUT utilization?

Hi,

I have a Boosted Decision Tree algorithm in HLS, and it passes simulation and synthesis. The only problem is that it uses too many LUTs. I tried reducing the number of trees by 20%-40% and changing the fixed-point precision from <18,8> to <15,5>. Those changes reduced the BRAM and FF counts, but only improved the LUT count by 1%-2%.

I am new to FPGA and HLS development, and I think I need some help on which pragmas to add to the C++ code to use more FFs and RAMs and fewer LUTs. The project code is attached.
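(For reference, one directive that can shift storage from LUTs to block RAM is Vivado HLS's RESOURCE pragma, which binds an array to a specific memory core. A minimal sketch, with made-up names (`thresholds`, `N_NODES`) standing in for whatever arrays the attached project actually uses, not taken from it:)

```cpp
#include <cassert>

#define N_NODES 256  // hypothetical number of tree nodes

// Sum a table of node thresholds; the local copy is bound to a
// dual-port block RAM instead of distributed (LUT) RAM.
int sum_thresholds(const short *thresholds, int n) {
    short local[N_NODES];
#pragma HLS RESOURCE variable=local core=RAM_2P_BRAM
    for (int i = 0; i < n; ++i)
        local[i] = thresholds[i];

    int acc = 0;
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        acc += local[i];
    }
    return acc;
}
```

Note this only helps when the LUTs are actually being spent on memories; as the reply below-in-thread explains, LUTs spent on muxing from heavy parallelization need a different fix.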

Thanks a lot,
xjc

3 Replies
Moderator
187 Views
Registered: ‎10-04-2011

Hello xjcwd1101,

I looked at your design, and I think the very large LUT utilization comes from the array partition directives. You are following the right method: unroll the loops, then partition the arrays so that data can be supplied to and consumed from the parallel operations that result. However, all of the array partitions I saw were complete. I like to make a baseline solution in which I remove all but the interface directives. When I did that and compared it to your results, I saw the following performance and resource impact:

Latency:
                            Baseline    Partition
Latency (cycles)     min    58024       146
Interval (cycles)    min    58024       120

Utilization Estimates:

            Baseline    Partition
BRAM_18K    213         1202
DSP48E      0           0
FF          69138       383660
LUT         29138       275317
URAM        0           0

 

So, we have about a 10x increase in LUT utilization to achieve a roughly 500x reduction in latency. To find out where the LUTs are going, the Analysis view is a good place to look. There we can see that the decision function consumes the majority of the LUTs, and within that function most of them are in instances and MUXes rather than expressions or memories. If they were in expressions or memories, we might be able to use a RESOURCE directive to push them into DSP48s or BRAMs respectively. In this case, however, it is simply too much parallelization that causes the large LUT utilization. Unfortunately, the next step is an iterative exploration: unroll the loops with a factor, and use partial array partitioning (rather than complete) matched to that factor, until you find the performance-versus-resource trade-off your design and device can tolerate. I would recommend a binary-search style exploration of the factor to converge quickly on what works for you.
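(A minimal sketch of the pairing described above: a bounded UNROLL factor matched by a cyclic ARRAY_PARTITION of the same factor, so each of the parallel reads per cycle hits a different bank. The loop body and the factor of 4 are illustrative placeholders for whatever the binary-search exploration converges on, not code from the attached project:)

```cpp
#include <cassert>

#define N 64   // hypothetical array length
#define UF 4   // candidate unroll factor from the binary search

// Accumulate over a buffer using partial unrolling with a matching
// partial (cyclic) partition, instead of full unroll + complete partition.
int accumulate(const int in[N]) {
    int buf[N];
    // Cyclic partitioning by the unroll factor gives UF banks,
    // one per parallel read in the unrolled loop body.
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=4
    for (int i = 0; i < N; ++i)
        buf[i] = in[i];

    int acc = 0;
    for (int i = 0; i < N; ++i) {
#pragma HLS UNROLL factor=4
        acc += buf[i];
    }
    return acc;
}
```

The exploration then amounts to sweeping the factor (1, 2, 4, 8, ...) and re-synthesizing, watching where latency gains flatten out against LUT growth.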

I hope this helps,

Regards,
Scott

 

Visitor
139 Views
Registered: ‎06-09-2020

Dear Scott (@scampbell),

 

Thank you so much for your help! I tried many combinations of array partitioning and loop pipelining, and I found that the current code gives me the best speed-to-hardware-utilization ratio, so I have decided to go with it.

Also, after I implemented the design as an IP in Vivado, I found that the LUT utilization is only about 50% on my ZCU102 (compared to the Vivado HLS estimate of 99%, which is not accurate).

 

Thank you,
xjc

Moderator
130 Views
Registered: ‎10-04-2011

Hi XJC,

You are right: the HLS resource estimates are sometimes not completely accurate. Vivado synthesis performs a variety of physical optimizations that HLS cannot account for, such as LUT combining, where different functions are placed into a single LUT. The best way to get an accurate resource estimate from HLS is to export the design with Vivado synthesis selected.

Glad I could help,
Scott
