Observer
1,938 Views
Registered: ‎01-12-2018

## Buffer for Integer Multiplication with High Frequency in Automatically-Generated Verilog Code

Hi all,

I am confused by the generated Verilog for a simple multiplication in a pipelined for-loop.

When the clock period is 10ns, the latency of the multiplication is 8.47ns (within one cycle). When the clock period is 3ns, the latency of the multiplication turns out to be 6 cycles plus 2.13ns (i.e. 6*3+2.13 = 20.13ns).

My first question is: why is the latency of the operation different?

I then looked into the generated Verilog for these two situations. With clock_period=10ns, the multiplication is simply a " * " operation. However, when the clock period is 3ns, there is a shift buffer that stores the result of the multiplication for 4 extra cycles.

Therefore, my second question is: why is the buffer necessary?

Best regards,

-----------------------

Tingyuan

1 Solution

Accepted Solutions
Xilinx Employee
1,795 Views
Registered: ‎01-09-2008

Your data are 32-bit integers.

DSP48E1 (7-series) have 18x25 multipliers.

DSP48E2 (Ultrascale and Ultrascale+) have 18x27 multipliers.

Let's have a look at the latter (UG579), but the process is almost the same for the previous generations (back to Virtex-4).

On p13, you can see this multiplier followed by multiplexers. The last one has 2 inputs with an integrated 17-bit shift (to the right, losing the LSBs). This allows you to implement wide multipliers using 1 or more DSP slices. Typically, for a 32x32-bit multiplication:

A = A2:A1 with A on 32 bits, A1 on 17 bits and A2 on 15 bits

B = B2:B1 with B on 32 bits, B1 on 17 bits and B2 on 15 bits

In order to multiply A by B (M = AxB) you must operate in 4 main steps:

1. Tmp1 = (0:A1)x(0:B1)
• A 0 is prepended as the MSB of each 17-bit low part (A1, B1) so that it is treated as a positive number
• Tmp1 is on 18+18 = 36 bits (in a 48-bit container because P is a 48-bit register)
• M1 = LSB17(Tmp1) : the last 17 bits of Tmp1
2. Tmp2 = A2x(0:B1) + (Tmp1>>17)
3. Tmp2 = (0:A1)xB2 + Tmp2
• M2 = LSB17(Tmp2) : the last 17 bits of Tmp2
4. Tmp3 = A2xB2 + (Tmp2>>17)
• M3 = Tmp3
• M3 is on 15+15+1 bits

At the end: M = M3:M2:M1

This technique uses 4 DSP slices and no LUTs (just FFs to store the intermediate results).
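The four steps above can be sketched in C as follows. This only illustrates the arithmetic; it is not the RTL that Vivado HLS generates. The function names `mul32_split` and `check_mul32_split` are hypothetical, and the sketch assumes unsigned operands so that the shifts are well defined in C (in the DSP slice the high parts A2 and B2 are signed, which is why a 0 is prepended to the low parts).

```c
#include <assert.h>
#include <stdint.h>

/* Keep the 17 least significant bits (LSB17 in the text). */
#define LOW17(x) ((x) & 0x1FFFFu)

/* 32x32 multiplication built from 17/15-bit pieces.
 * Assumption: unsigned operands, so every partial product is non-negative. */
uint64_t mul32_split(uint32_t a, uint32_t b)
{
    uint64_t a1 = LOW17(a), a2 = a >> 17;   /* A = A2:A1, 15 + 17 bits */
    uint64_t b1 = LOW17(b), b2 = b >> 17;   /* B = B2:B1, 15 + 17 bits */

    /* Step 1: low x low. */
    uint64_t tmp1 = a1 * b1;
    uint64_t m1   = LOW17(tmp1);

    /* Steps 2 and 3: the two cross products plus the carry out of step 1. */
    uint64_t tmp2 = a2 * b1 + (tmp1 >> 17);
    tmp2          = a1 * b2 + tmp2;
    uint64_t m2   = LOW17(tmp2);

    /* Step 4: high x high plus the carry out of the cross products. */
    uint64_t m3 = a2 * b2 + (tmp2 >> 17);

    /* M = M3:M2:M1 */
    return (m3 << 34) | (m2 << 17) | m1;
}

/* Compare against the native 64-bit product. */
static void check_mul32_split(uint32_t a, uint32_t b)
{
    assert(mul32_split(a, b) == (uint64_t)a * b);
}
```

Calling `check_mul32_split(x, y)` for a few values is a quick way to confirm that the concatenation M3:M2:M1 matches the full product. Each of the four multiply-accumulate lines corresponds to one DSP slice in the description above.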

Now, if the clock rate is lower, you can use fewer DSP slices. If you split B into 6+26 bits (B2 on 6 bits, B1 on 26 bits), the first 2 stages can use DSP slices, but the other 2 can be done in LUTs because B2 is quite small.

1. Tmp1 = (0:A1)x(0:B1) : use DSP slice
2. Tmp2 = A2x(0:B1) + Tmp1>>17 : use DSP slice
3. Tmp3 = (0:A1)xB2
4. Tmp4 = A2xB2 + Tmp3>>17

Then there is a large adder at the end to add the 2 partial results; it will use a DSP slice (to achieve the clock rate).
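A possible reading of this variant, sketched in C: Tmp2 and the low 17 bits of Tmp1 together form A x B1, Tmp4 and the low 17 bits of Tmp3 together form A x B2, and the final adder combines them with a 26-bit shift. The thread does not spell out that final combination, so the shift by 26 and the function name `mul32_split_6_26` are assumptions, and the operands are again taken as unsigned to keep the C shifts well defined.

```c
#include <stdint.h>

/* Keep the 17 least significant bits. */
#define LOW17(x) ((x) & 0x1FFFFu)

/* 32x32 multiplication with A split 15+17 and B split 6+26.
 * Assumption: unsigned operands; this models the arithmetic only. */
uint64_t mul32_split_6_26(uint32_t a, uint32_t b)
{
    uint64_t a1 = LOW17(a), a2 = a >> 17;        /* A = A2:A1, 15 + 17 bits */
    uint64_t b1 = b & 0x3FFFFFFu, b2 = b >> 26;  /* B = B2:B1,  6 + 26 bits */

    /* Steps 1 and 2 (DSP slices in the text): partial result A x B1. */
    uint64_t tmp1  = a1 * b1;
    uint64_t tmp2  = a2 * b1 + (tmp1 >> 17);
    uint64_t p_low = (tmp2 << 17) | LOW17(tmp1);

    /* Steps 3 and 4 (LUTs in the text, B2 is only 6 bits): A x B2. */
    uint64_t tmp3   = a1 * b2;
    uint64_t tmp4   = a2 * b2 + (tmp3 >> 17);
    uint64_t p_high = (tmp4 << 17) | LOW17(tmp3);

    /* Final wide adder: align A x B2 with the weight of B2 (2^26). */
    return p_low + (p_high << 26);
}
```

Under this reading, two DSP slices do the wide multiplications and a third does the final addition, which would be consistent with the 3 DSPs reported for the 10ns clock.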

On p45-46-47 you will see some operations requiring 1 to 8 DSP slices.

You can also have a look at some older user guides. The one for the Virtex-5, UG193, describes what I have just done on p70.

Regards

Olivier

==================================
Olivier Trémois
XILINX SW Marketing AI Engine Tools
Don't forget to reply, give kudos, and accept as solution.
7 Replies
Xilinx Employee
1,899 Views
Registered: ‎09-05-2018

In order to meet the clock period requirement, Vivado HLS has to introduce pipeline registers to break the task up into small pieces. When the task is broken up into 6 pieces, HLS must introduce 6 loads and stores into registers, and this adds to the total latency of the IP. Unfortunately, I don't know enough about the details of C synthesis or the project in question to say why a shift buffer is required to store the result of the multiplication across the last 4 cycles. But 2 cycles at 3ns is still shorter than the 8.47ns of the original single-cycle solution; I'd guess other calculations are still being done or the multiplication is still being accumulated.
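To make the latency effect concrete, here is a minimal cycle-level model in C of a multiplier followed by a chain of registers. It is only a sketch: `PIPE_DEPTH`, `mul_pipe_t`, `mul_pipe_reset` and `mul_pipe_clock` are hypothetical names, the depth of 6 is taken from the cycle count quoted in the question, and the model does not claim to match the register placement in the generated Verilog.

```c
#include <stdint.h>
#include <string.h>

#define PIPE_DEPTH 6   /* hypothetical number of register stages */

typedef struct {
    int64_t stage[PIPE_DEPTH];   /* stage[0] is the newest value */
} mul_pipe_t;

/* Clear all registers, as a reset would. */
static void mul_pipe_reset(mul_pipe_t *p)
{
    memset(p, 0, sizeof *p);
}

/* One clock edge: return the oldest value, shift the chain by one stage,
 * and capture the product of the new inputs in the first register. */
static int64_t mul_pipe_clock(mul_pipe_t *p, int32_t a, int32_t b)
{
    int64_t out = p->stage[PIPE_DEPTH - 1];
    for (int i = PIPE_DEPTH - 1; i > 0; i--)
        p->stage[i] = p->stage[i - 1];
    p->stage[0] = (int64_t)a * b;
    return out;
}
```

Each call accepts a new pair of operands, so the throughput stays at one multiplication per cycle, but the product of a given pair only comes out `PIPE_DEPTH` calls later. That is the trade-off between a short clock period and a longer latency in cycles.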

Nicholas Moellers

Xilinx Worldwide Technical Support
Observer
1,875 Views
Registered: ‎01-12-2018

Dear Nicholas,

Thanks a lot for your prompt reply! I show the source code below:

=====================

for (i = 0; i < 100; i++)
    data_out[i] = data_in[i] * data_in[i] + 1;

=====================

This problem is the same even without any directives: the overall latency of multiplication increases as the frequency increases.

I am confused because this multiplication is actually done by combinational logic (it takes 8.47ns, less than one 10ns cycle, to complete), so it should not be necessary to break it up into 6 pieces when the frequency gets higher.

Moreover, another interesting phenomenon is that only 3 DSPs are needed when the clock period is 10ns, but 4 DSPs are needed when the clock period is 3ns.

Besides, I wonder whether there is any way for me to figure out this practical technical problem, or how I can report this confusing behavior to the development team?

Thanks again for your time and suggestion!

Best Regards,

-------------------------------------

Tingyuan LIANG

Xilinx Employee
1,858 Views
Registered: ‎01-09-2008

Hi Tingyuan

Could you share the data type of your data_in and data_out arrays?

The estimated DSP count can be slightly off. You should push on to the 'export' stage and 'evaluate' the generated RTL (synthesis, place and route) to get a much closer estimate.

Regards

Olivier TREMOIS

Observer
1,817 Views
Registered: ‎01-12-2018

@oliviert

Hi Olivier,

Thanks a lot for your suggestions! The datatype is int (32-bit integer).

Actually, when I read the generated Verilog code, HLS does instantiate extra DSPs in the source code.

It seems that when the frequency is high, the generated Verilog does the multiplication as something like A*B*1 instead of A*B.

Thanks again for your time and further explanation!

Best Regards,

---------------------------

Tingyuan

Observer
1,764 Views
Registered: ‎01-12-2018

Dear Olivier,

Thanks a lot for your detailed explanation! I get it!

I still have one more question: why might the multiplication be implemented as something like A*B*1 in the generated Verilog?

Thanks again !

Best Regards,

------------------------------

Tingyuan