shaikon
Voyager
Registered: ‎04-12-2012

Multiply - Accumulate of two 32 bit numbers

Hello,

Can I use a single DSP48 Macro to do a MAC operation on two 32-bit numbers?
If yes, what DSP48 instruction should I use?

Accepted Solutions
avrumw
Guide
Registered: ‎01-23-2009

To do 32x32 you need more than 2 DSP48 cells - I think you may need 4 (apparently so - see below); @olupj hinted at this when he told you how to break down a 32-bit value into smaller ones.

Specifically (and especially if this is unsigned), you need to break down your operands so that each partial product is no larger than 24x17. So to get 32 bits you need (24+8)*(15+17). You then get partial products for the 24x15, the 8x15, the 24x17 and the 8x17 (remember FOIL? First-Outer-Inner-Last). When these are added together (with the appropriate shifts), this gives you your 64-bit output.
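
As a rough sketch of that decomposition (not a verified DSP48 mapping), the four partial products can be written out in plain Verilog. The module name `mul32x32_partial` and its ports are made up for illustration, the operands are assumed unsigned, and the logic is purely combinational, so it shows the arithmetic only; whether a given tool packs it into 4 DSP48 cells is not guaranteed.

```verilog
// Hypothetical module: unsigned 32x32 multiply built from four partial products.
module mul32x32_partial (
  input  wire [31:0] a,
  input  wire [31:0] b,
  output wire [63:0] p
);
  // Split a into 8 + 24 bits and b into 15 + 17 bits.
  wire [23:0] a_lo = a[23:0];
  wire [7:0]  a_hi = a[31:24];
  wire [16:0] b_lo = b[16:0];
  wire [14:0] b_hi = b[31:17];

  // Four partial products, none wider than 24x17.
  wire [40:0] p_ll = a_lo * b_lo;  // 24x17
  wire [38:0] p_lh = a_lo * b_hi;  // 24x15
  wire [24:0] p_hl = a_hi * b_lo;  // 8x17
  wire [22:0] p_hh = a_hi * b_hi;  // 8x15

  // Zero-extend each product to 64 bits, then apply its weight as a shift.
  assign p = {23'b0, p_ll}
           + ({25'b0, p_lh} << 17)   // b_hi carries weight 2^17
           + ({39'b0, p_hl} << 24)   // a_hi carries weight 2^24
           + ({41'b0, p_hh} << 41);  // 2^24 * 2^17 = 2^41
endmodule
```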

For some operations, the tools can do this automatically - I know that if you try to do a 24x34 multiply, it will properly break it into 24x(17+17). This can be done with the cascaded paths internally, and is the reason that there is an OPMODE for PCIN >> 17 (OPMODE[6:4]=3'b101). Simply doing the multiply with enough pipeline stages in your RTL code is enough for the tools to infer the two DSP48 cells.
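
To see why a 17-bit right shift on the cascade is useful, here is a behavioral model of the 24x(17+17) split. The module name is hypothetical, the operands are assumed unsigned, and this models the arithmetic only, not the actual DSP48 primitive or its pipeline registers.

```verilog
// Hypothetical behavioral model of a 24x34 unsigned multiply split as 24x(17+17).
module mul24x34_cascade_model (
  input  wire [23:0] a,
  input  wire [33:0] b,
  output wire [57:0] p
);
  // First cell: a times the low 17 bits of b.
  wire [40:0] low  = a * b[16:0];
  // Second cell's multiplier: a times the high 17 bits of b.
  wire [40:0] high = a * b[33:17];

  // The second cell adds the first product shifted right by 17
  // (the role played by the PCIN >> 17 path).
  wire [40:0] acc = high + (low >> 17);

  // The low 17 bits of the first product are already final result bits.
  assign p = {acc, low[16:0]};
endmodule
```

The shift works because the high partial product has weight 2^17, so only the bits of the low product above bit 16 overlap with it.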

As for the other operand being larger than 24 (or 25 including sign), I am not sure whether the tools can break this down automatically, resulting in 4 DSP48 cells. You should try this first - simply do

reg [31:0] a;
reg [31:0] b;
reg [63:0] out, out_s1, out_s2, out_s3;

always @(posedge clk) begin
  out    <= a * b;
  // Extra register stages for the tools to pull into the multiplier
  out_s1 <= out;
  out_s2 <= out_s1;
  out_s3 <= out_s2;
  // I don't know how many are required - apparently 6, see below
end

Then synthesize it and see what the schematic gives.

If this doesn't work, or if you want another solution, then use the "Multiplier" IP - go to the IP catalog and search for "Multiplier". This allows you to specify the input and output widths, whether you want to use LUTs or MULTs (which are really DSP48) and, on the second page, how much pipelining you can tolerate. When I tried it for an unsigned 32x32 it told me that 6 pipeline stages is optimal. Building this requires 17 LUTs, 67 FFs, and 4 DSP48 (on an UltraScale device). This core will accept one set of 32x32 operands per clock and will output a 64-bit result 6 clocks later (pipelined). This should run at ridiculous speeds - approaching 600MHz...

Avrum

9 Replies
olupj
Explorer
Registered: ‎01-27-2008

Hi @shaikon ,

As the DSP48 is 18x25, the immediate answer seems to be no.

However, you can do it in multiple cycles using one DSP48; this would be similar to using the Multiplier core to multiply any two numbers up to 64b each.

Consider breaking the numbers up into two sets of 16b values,

a = a1*2^n + a0
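
Spelled out for n = 16, and assuming b is split the same way (b = b1*2^16 + b0), the schoolbook expansion (not Karatsuba itself) would be:

```latex
\begin{aligned}
a \cdot b &= (a_1 2^{16} + a_0)(b_1 2^{16} + b_0) \\
          &= a_1 b_1 \, 2^{32} + (a_1 b_0 + a_0 b_1)\, 2^{16} + a_0 b_0
\end{aligned}
```

Each of the four products is 16x16 unsigned, which fits within a single DSP48 multiplier, so one DSP48 could in principle compute them on successive cycles, with the shifted results summed afterwards.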

Check this out: https://en.wikipedia.org/wiki/Karatsuba_algorithm

Jerry

 

shaikon
Voyager
Registered: ‎04-12-2012

I must do it in a single clock cycle.

drjohnsmith
Teacher
Registered: ‎07-09-2009
If you multiply two 32-bit numbers together, you get a 64-bit answer.
If you add two 64-bit numbers together, you get a 65-bit answer.

As the DSP48 is 48 bits wide, QED: in a single DSP block you cannot do a MAC in a single cycle.

Your choice is

a) round down the input vector size,
b) use multiple clock cycles
c) use multiple DSP's

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>
bruce_karaffa
Scholar
Registered: ‎06-21-2017

Even when using multiple DSPs, this will be very difficult to complete in one clock cycle unless it is a very slow clock. @shaikon why must this be done in one clock cycle instead of pipelining the operation?

olupj
Explorer
Registered: ‎01-27-2008

@shaikon 

So a single cycle, to the best of my knowledge, is not achievable. I believe the minimum latency of the DSP48 is 3, possibly 2 clocks.

That's why, if you were limited to one converter you could treat your 32b numbers as polynomials... and iterate but not a single cycle (latency = 1).

Jerry

shaikon
Voyager
Registered: ‎04-12-2012

When I said "single cycle" I meant the throughput should be single cycle per operation. Of course it would be pipelined and will have multicycle latency.

Is there a way to concatenate 2 DSP48 macros to be able to accept two 32-bit numbers?

shaikon
Voyager
Registered: ‎04-12-2012
Thanks Avrum.
The multiplier IP does support 32x32 (and wider operations).
But the reason I want to adhere to DSP48 is the extra instructions it supports natively with zero external logic.
For example - as far as I understand a DSP48 can be configured to support simultaneous instructions such as A*B, A+B, A*B+old (MAC) in a single instantiation of the IP without extra logic.
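
For reference, a MAC that does fit the 18x25 multiplier and 48-bit accumulator mentioned earlier in the thread might be described like this. This is only a sketch: the module name, ports, and register staging are assumptions, and whether synthesis maps it onto exactly one DSP48 depends on the tool and device.

```verilog
// Hypothetical signed 25x18 multiply-accumulate with a 48-bit accumulator.
module mac_25x18 (
  input  wire               clk,
  input  wire               clr,   // synchronous clear of the accumulator
  input  wire signed [24:0] a,
  input  wire signed [17:0] b,
  output reg  signed [47:0] acc
);
  reg signed [24:0] a_r;   // input register stage
  reg signed [17:0] b_r;   // input register stage
  reg signed [42:0] m_r;   // registered 25x18 product (43 bits)

  always @(posedge clk) begin
    a_r <= a;
    b_r <= b;
    m_r <= a_r * b_r;
    // clr resets only the accumulator; products already in a_r, b_r and m_r
    // will still be added on the following cycles.
    if (clr)
      acc <= 48'sd0;
    else
      acc <= acc + m_r;    // A*B + old value
  end
endmodule
```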
drjohnsmith
Teacher
Registered: ‎07-09-2009
@shaikon,

The DSP48 block has a bunch of options, as you have seen, which "can" be reconfigured on the fly.

The DSP48 gains its significant speed by including built-in pipelining registers, which can be switched in and out.

The downside of the registers is that, as you change mode, you need to ensure the configuration is also pipelined accordingly.

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>