03-28-2020 03:48 AM - edited 03-28-2020 03:59 AM
I have a rather complex design in which I have to square a lot of parallel signals. Currently I am using the Multiplier IP core with two signed inputs of equal length, both fed with the same signal (hence, squaring it).
As for resource utilization, I am wondering whether the synthesizer and implementer recognize that both inputs are the same.
For example, take a 3-bit signed vector, which covers -4 to 3. With normal multiplication the outputs range from -4*3 = -12 to (-4)*(-4) = 16, with 18 distinct values (-12, -9, -8, -6, -4, -3, -2, -1, 0, 1, 2, 3, 4, 6, 8, 9, 12, 16), if I calculated right...
But with squaring, the only outputs are 0,1,4,9,16.
That is far fewer outcomes for squaring, so a lookup table could already be enough.
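The counting argument above can be checked by brute force. The following is a minimal Python sketch, not anything a synthesis tool runs; the function names are made up for illustration. It enumerates every input pair of a two's-complement vector and compares the distinct results of a general signed multiply with those of a square.

```python
def signed_range(bits):
    """All values representable by a two's-complement vector of this width."""
    return range(-(1 << (bits - 1)), 1 << (bits - 1))


def distinct_products(bits):
    """Sorted distinct results of multiplying two independent signed inputs."""
    values = signed_range(bits)
    return sorted({a * b for a in values for b in values})


def distinct_squares(bits):
    """Sorted distinct results when both multiplier inputs carry the same value."""
    return sorted({a * a for a in signed_range(bits)})


if __name__ == "__main__":
    # Print width, number of distinct products, number of distinct squares.
    for bits in (3, 4, 8):
        print(bits, len(distinct_products(bits)), len(distinct_squares(bits)))
```

For an n-bit input the square only depends on the magnitude, so there are 2^(n-1) + 1 distinct squares, while the general multiplier has to cover all 2^(2n) input pairs.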
So my question is: can the optimizer take this into account and optimize for squaring instead of implementing a general signed multiplication?
Thank you for your input in advance!
03-28-2020 04:34 AM
Thank you for your answer.
That is absolutely true, although it would need a lot of re-coding on my part, which is why I wondered if anyone knew the answer directly.
03-28-2020 06:57 AM
03-28-2020 07:12 AM
What the optimizer can do depends somewhat on the problem. If you are squaring 8-bit numbers, the optimizer should be able to convert that into a series of 16 lookup tables, assuming you have no other logic in the same block. Fewer than 8 bits should also result in a simple lookup table implementation. If you are squaring 18-bit numbers (or more), the optimizer should map your logic onto one (or more) hard multiplier resources (DSP) in your chip. If you go much wider, there's not much the hardware can do in a single clock tick. In the middle, there are possibilities that include using block RAM. It is in this middle range that I'm not sure where the cutoff lies between optimizing with LUTs and optimizing with DSPs.
Hence the answer above suggesting that you just try it out and see what happens.
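To get a feel for how much a LUT-based squarer can shrink, it helps to look at the output bits one at a time. The Python sketch below is only a model of the squaring function, with a hypothetical helper name; it says nothing about what any particular synthesis tool actually does. It classifies each bit of the 2n-bit result as constant or varying over all signed n-bit inputs.

```python
def square_output_bits(bits):
    """Classify each bit of the 2*bits-wide square of a signed input.

    Returns a list indexed by output bit position whose entries are
    "const0", "const1", or "varies".
    """
    out_width = 2 * bits
    squares = [x * x for x in range(-(1 << (bits - 1)), 1 << (bits - 1))]
    kinds = []
    for pos in range(out_width):
        # Collect every value this output bit takes over all inputs.
        seen = {(s >> pos) & 1 for s in squares}
        if seen == {0}:
            kinds.append("const0")
        elif seen == {1}:
            kinds.append("const1")
        else:
            kinds.append("varies")
    return kinds


if __name__ == "__main__":
    for pos, kind in enumerate(square_output_bits(8)):
        print(pos, kind)
```

For 8-bit inputs, bit 1 and bit 15 of the result are always 0: a square is 0 or 1 modulo 4, and the largest square, (-128)^2 = 16384, fits in 15 bits. Bit 0 of the result simply equals bit 0 of the input, so a squarer has fewer non-trivial output functions than its 16-bit width suggests.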
03-28-2020 01:44 PM
Thank you for your answers! I really appreciate it.
I took the time to replace the instantiated IP cores with my own lookup tables, as described before, with the same latency as the IPs.
And the result was far fewer resources used! So I'll stick with my own module. Here's a good chance for Xilinx to improve the Multiplier IP with a "square" option.
And yes, I know about DSP48, but my design is quite large, so it would use up >1000 DSP48 units, which is not feasible with my FPGA.
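The layout of the lookup tables is not spelled out in this thread, so the following Python model is only one possible arrangement, and every name in it is hypothetical. It assumes the table is indexed by the magnitude of the input, which roughly halves the number of entries because x and -x have the same square.

```python
def build_square_table(bits):
    """Table of squares indexed by magnitude, 0 .. 2**(bits-1) inclusive."""
    return [m * m for m in range((1 << (bits - 1)) + 1)]


def square_via_table(x, table, bits):
    """Square a signed value by looking up its magnitude in the table."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= x <= hi:
        raise ValueError(f"{x} does not fit in {bits} signed bits")
    return table[abs(x)]


def to_hex_lines(table, bits):
    """Format entries as fixed-width hex strings, e.g. for inspection or a ROM init file."""
    digits = (2 * bits + 3) // 4  # enough hex digits for a 2*bits-wide result
    return [format(v, f"0{digits}x") for v in table]


if __name__ == "__main__":
    table = build_square_table(8)
    print(len(table), square_via_table(-128, table, 8))
    print("\n".join(to_hex_lines(build_square_table(3), 3)))
```

In hardware the magnitude step costs a negation for negative inputs, so whether folding the table actually saves resources compared with a full 2^n-entry table is something to check in the utilization report rather than assume.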
03-28-2020 02:52 PM
03-29-2020 11:00 AM
Thank you for your response and input.
Yes, I am aware of those high-end products, but I can't work with a chip that costs more than $1000 apiece (Digikey).
That is why I am sticking with Artix and need to adapt my solutions. But it is also fun to evaluate and research alternative algorithms.