HLS works well if you understand hardware synthesis, directives, and Boolean logic optimization in the context of programmable logic (LUTs, F7/F8/F9 muxes, carry chains, pipelining). It takes some trial and error, but you can get near-optimal latency for a given function, part, and clock speed fairly quickly compared to hand-tuning pipelined RTL.
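To give a rough idea of what "directives" means here, below is a minimal sketch, not taken from any real design. The function name `dot8`, the tap count, and the data widths are all made up for illustration. `PIPELINE` and `UNROLL` are standard Vitis HLS pragmas; whether the tool actually reaches II=1 at your clock depends on the part and the interfaces, so treat the pragmas as a starting point for the trial and error. A regular C++ compiler does not recognize the pragmas (it may warn about them), which lets you check the function on the desktop first.

```cpp
#include <cstdint>

// Hypothetical example: 8-tap dot product. Name and sizes are illustrative only.
constexpr int N = 8;

// 16x16-bit products need 32 bits; summing 8 of them needs 35 bits,
// so a 64-bit accumulator is used here to rule out overflow.
int64_t dot8(const int16_t a[N], const int16_t b[N]) {
// Ask the tool to accept a new set of inputs every clock cycle.
#pragma HLS PIPELINE II=1
    int64_t acc = 0;
    for (int i = 0; i < N; ++i) {
// Fully unroll so the multiplies are built in parallel and the tool
// is free to balance the adds and place registers where timing needs them.
#pragma HLS UNROLL
        acc += static_cast<int64_t>(a[i]) * static_cast<int64_t>(b[i]);
    }
    return acc;
}
```

From there you would typically change the clock constraint or the II target, re-run synthesis, and compare the reported latency between runs.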
If you want to stick to RTL, then experimenting with the number of pipeline registers and with retiming can get you close.
Otherwise, you have to go purely combinational and use the fastest logic possible: infinitesimally small propagation delays, regardless of the number of inputs to the Boolean expressions. Instantaneous propagation would be even better (but is unrealizable).