Observer
1,068 Views
Registered: ‎07-27-2016

## Squaring signals with IP-core

Hi all,

I have a rather complex design where I have to square a lot of parallel signals. Right now I am using the Multiplier IP core with the same signal wired to both signed inputs (hence, squaring the signal).

As for resource utilization, I am wondering whether synthesis and implementation recognize that both inputs are the same.

For example, take a 3-bit signed vector. With normal multiplication the outputs range from -4*3 = -12 to (-4)*(-4) = 16: (-12, -9, -8, -6, -4, -3, -2, -1, 0, 1, 2, 3, 4, 6, 8, 9, 12, 16), if I calculated right...

But with squaring, the only outputs are 0,1,4,9,16.

That is far fewer outcomes for the squaring case, so a look-up table could already be enough.
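To make the comparison concrete, here is a small Python sketch that enumerates the distinct outputs of a general signed multiply versus a square for a given bit width. It is only a software model for counting outcomes; the function names are made up for this illustration and it says nothing about what the tools actually infer.

```python
def signed_range(bits):
    """All values representable by a two's complement vector of this width."""
    return range(-(1 << (bits - 1)), 1 << (bits - 1))


def distinct_products(bits):
    """Distinct outputs of a signed multiplier with two independent inputs."""
    values = signed_range(bits)
    return sorted({a * b for a in values for b in values})


def distinct_squares(bits):
    """Distinct outputs when both multiplier inputs carry the same signal."""
    return sorted({a * a for a in signed_range(bits)})


if __name__ == "__main__":
    for bits in (3, 4, 8):
        # Columns: bit width, distinct products, distinct squares
        print(bits, len(distinct_products(bits)), len(distinct_squares(bits)))
```

For 3 bits this gives 18 distinct products against 5 distinct squares (0, 1, 4, 9, 16).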

So my question is: can the optimizer take into account that this is a squaring operation rather than a general signed multiplication, and optimize accordingly?

Jonas

7 Replies
Voyager
1,052 Views
Registered: ‎06-28-2018

You can simply synthesize/implement the design and see how much and what kind of resources the design uses.

Observer
1,047 Views
Registered: ‎07-27-2016

That is absolutely true, but it would need a lot of re-coding on my part, which is why I wondered whether anyone knew the answer directly.

Cheers,

Jonas

Teacher
1,017 Views
Registered: ‎07-09-2009
Taking a step back,

on the assumption you're going to target an FPGA with this

in the FPGA, the DSP blocks on the later 7 series devices have multipliers that can square in a one-clock pipeline

look here, page 9
https://www.xilinx.com/support/documentation/user_guides/ug579-ultrascale-dsp.pdf
Scholar
1,013 Views
Registered: ‎05-21-2015

What the optimizer can do is somewhat dependent upon the problem. If you are squaring 8-bit numbers, then the optimizer should be able to convert that into a series of 16 lookup tables, one per output bit, assuming you have no other logic in the same block. Fewer bits than 8 should also result in a simple lookup table implementation. If you are squaring 18-bit numbers (or more), the optimizer should convert your logic to use one (or more) hard multiplication resources (DSP) in your chip. If you go much wider, there's not much the hardware can do in a single clock tick. In the middle, the possibilities include using block RAM. It's in this middle range that I'm not sure where the cutoff lies between optimizing with LUTs and optimizing with DSPs.

Hence the answer above suggesting that you just try it out and see what happens.
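The "one lookup table per output bit" view of an 8-bit square can be sketched in Python as below. This is a software model only, with illustrative names. Each table has 256 entries, so how it maps onto physical LUTs depends on the device and is not modeled here.

```python
def square_bit_tables(bits=8):
    """Return one truth table per output bit of the signed square.

    tables[k][pattern] is bit k of x*x, where pattern is the two's
    complement encoding of x. The square needs 2*bits output bits.
    """
    out_bits = 2 * bits
    tables = [[0] * (1 << bits) for _ in range(out_bits)]
    for pattern in range(1 << bits):
        # Reinterpret the raw bit pattern as a signed value.
        x = pattern - (1 << bits) if pattern >> (bits - 1) else pattern
        sq = x * x
        for k in range(out_bits):
            tables[k][pattern] = (sq >> k) & 1
    return tables


def is_constant(table):
    """True if the output bit never changes, so it needs no logic at all."""
    return len(set(table)) == 1


if __name__ == "__main__":
    tables = square_bit_tables(8)
    print("output bits:", len(tables))
    for k, table in enumerate(tables):
        if is_constant(table):
            print(f"bit {k} is constant {table[0]}")
```

Two things fall out of this model for the 8-bit signed case: output bit 1 is always 0, and output bit 0 is just a copy of the input's least significant bit. Bit 15 is also always 0, since the largest square is (-128)^2 = 16384. A general multiplier cannot assume any of this, which is one reason a dedicated squaring table can come out smaller.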

Dan

Observer
957 Views
Registered: ‎07-27-2016

I took the time to replace the instantiated IP-cores with my own lookup tables as described before with the same latency as the IPs.

And the result was far fewer resources used! So I'll stick with my own module. Here's a good chance for Xilinx to improve the Multiplier IP with a "square" option.
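The module itself is not shown in this thread. As a rough illustration of how the contents of such a squaring table might be generated, here is a Python sketch that prints one fixed-width hex word per line, indexed by the two's complement bit pattern of the input. The function names and the one-word-per-line text layout are assumptions made for this example (it is the kind of layout a Verilog `$readmemh` call can load), not the poster's actual flow.

```python
def square_table(bits):
    """Squares indexed by the two's complement bit pattern of the input."""
    size = 1 << bits
    half = size >> 1
    # Patterns at or above 'half' encode negative values.
    return [(p - size if p >= half else p) ** 2 for p in range(size)]


def to_hex_lines(table, out_bits):
    """Format each entry as a fixed-width hex word, one per line."""
    digits = (out_bits + 3) // 4
    return [format(value, f"0{digits}x") for value in table]


if __name__ == "__main__":
    bits = 3
    # A square of a 'bits'-wide signed value fits in 2*bits bits.
    for line in to_hex_lines(square_table(bits), 2 * bits):
        print(line)
```

For the 3-bit example from the first post, the table entries are 0, 1, 4, 9, 16, 9, 4, 1 for input patterns 000 through 111.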

And yes, I know about the DSP48, but my design is quite large and would use up more than 1000 DSP48 units, which is not feasible with my FPGA.

Jonas

Teacher
937 Views
Registered: ‎07-09-2009
Just for reference,

even the smallest KU3P has 1368 DSP blocks,
the VU13P has 12288 DSP blocks,

so a thousand is very possible,

and they can run at a 700 MHz clock

No problem using a LUT if that's what you want, I'm all for that,
BUT

just be aware of what can be done