Visitor zslwyuan
Registered: 01-12-2018

Buffer for Integer Multiplication with High Frequency in Automatically-Generated Verilog Code

Hi all,

 

    I am confused by the generated Verilog for a simple multiplication in a pipelined for-loop.

    When the clock period is 10ns, the latency of the multiplication is 8.47ns (within one cycle). When the clock period is 3ns, the latency of the multiplication becomes 6 cycles plus 2.13ns (i.e. 6*3+2.13 = 20.13ns).

    My first question is: why is the latency of the operation different?

    Then I looked into the generated Verilog for these two situations. With clock_period=10ns, the multiplication is simply a " * " operation. However, when the clock period is 3ns, there is a shift buffer that stores the result of the multiplication for 4 extra cycles.

    Therefore, my second question is: why is the buffer necessary?

    Thanks in advance for your explanation and suggestion!!! ^_^

 

Best regards,

-----------------------

Tingyuan


Accepted Solutions
Xilinx Employee
Registered: 01-09-2008

Re: Buffer for Integer Multiplication with High Frequency in Automatically-Generated Verilog Code

Your data are 32 bit integers.

DSP48E1 (7-series) have 18x25 multipliers.

DSP48E2 (Ultrascale and Ultrascale+) have 18x27 multipliers.

Let's have a look at the latter (UG579); the process is almost the same for the previous generations (back to Virtex-4).

On p13, you can see this multiplier followed by multiplexers. The last one has 2 inputs with an integrated 17-bit shift (to the right, losing the LSBs). This allows you to implement wide multipliers using 1 or more DSP slices. Typically, for a 32x32 bit multiplication:

A = A2:A1 with A on 32 bits, A1 on 17 bits and A2 on 15 bits

B = B2:B1 with B on 32 bits, B1 on 17 bits and B2 on 15 bits

In order to multiply A by B (M = AxB) you must operate in 4 main steps:

  1. Tmp1 = (0:A1)x(0:B1)
    • A 0 is prepended as the MSB of each 17-bit low part (A1, B1) so that the signed multiplier sees positive numbers
    • Tmp1 is on 18+18 = 36 bits (in a 48-bit container because P is a 48-bit register)
    • M1 = LSB17(Tmp1) : the lowest 17 bits of Tmp1
  2. Tmp2 = A2x(0:B1) + (Tmp1>>17)
  3. Tmp2 = (0:A1)xB2 + Tmp2
    • M2 = LSB17(Tmp2) : the lowest 17 bits of Tmp2
  4. Tmp3 = A2xB2 + (Tmp2>>17)
    • M3 = Tmp3
    • M3 is on 15+15+1 bits

At the end: M = M3:M2:M1

This technique will use 4 DSP slices and no LUTs (just FF to store intermediate registers)
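The four steps above can be checked in software. The sketch below is only an illustration, assuming 32-bit signed inputs and a full 64-bit result. The names `lsb17`, `shr17` and `mul32_split` are made up for this example and do not come from UG579 or from the HLS output. Each `tmpN` line corresponds to what one DSP slice would compute.

```c
#include <stdint.h>

#define SPAN17 ((int64_t)1 << 17)   /* 2^17 */

/* LSB17(x): lowest 17 bits of x as a non-negative value. */
static int64_t lsb17(int64_t x)
{
    int64_t r = x % SPAN17;          /* C remainder can be negative */
    return (r < 0) ? r + SPAN17 : r;
}

/* x >> 17 with sign extension, written as an exact division. */
static int64_t shr17(int64_t x)
{
    return (x - lsb17(x)) / SPAN17;
}

/* 32x32 -> 64 bit signed multiply built from 17/15-bit pieces. */
int64_t mul32_split(int32_t a, int32_t b)
{
    int64_t a1 = lsb17(a), a2 = shr17(a);   /* A = A2:A1 */
    int64_t b1 = lsb17(b), b2 = shr17(b);   /* B = B2:B1 */

    int64_t tmp1 = a1 * b1;                 /* step 1 */
    int64_t m1   = lsb17(tmp1);
    int64_t tmp2 = a2 * b1 + shr17(tmp1);   /* step 2 */
    tmp2         = a1 * b2 + tmp2;          /* step 3 */
    int64_t m2   = lsb17(tmp2);
    int64_t tmp3 = a2 * b2 + shr17(tmp2);   /* step 4, M3 = Tmp3 */

    /* M = M3:M2:M1 */
    return tmp3 * SPAN17 * SPAN17 + m2 * SPAN17 + m1;
}
```

In hardware, the 32-bit `int` result of the posted loop would correspond to the low 32 bits of this product.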

Now if the clock rate is lower, you can use fewer DSP slices. If you split B into 6+26 bits (B2 on 6 bits, B1 on 26 bits), then you can do the first 2 stages in DSP slices, but the 2 others can be done in LUTs because B2 is pretty small.

  1. Tmp1 = (0:A1)x(0:B1) : use DSP slice
  2. Tmp2 = A2x(0:B1) + (Tmp1>>17) : use DSP slice
  3. Tmp3 = (0:A1)xB2
  4. Tmp4 = A2xB2 + (Tmp3>>17)

Then there is a large adder at the end to add the 2 partial results, which will use a DSP slice (to achieve the clock rate).
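The same kind of software check works for this second split. The sketch below assumes that B2 is the upper 6 bits and B1 the lower 26 bits of B, and that the final adder combines the two partial results with a 26-bit offset; neither detail is spelled out above, so treat both as assumptions. The function names are hypothetical.

```c
#include <stdint.h>

/* Lowest n bits of x as a non-negative value. */
static int64_t low_bits(int64_t x, int n)
{
    int64_t span = (int64_t)1 << n;
    int64_t r = x % span;            /* C remainder can be negative */
    return (r < 0) ? r + span : r;
}

/* x >> n with sign extension, written as an exact division. */
static int64_t high_part(int64_t x, int n)
{
    return (x - low_bits(x, n)) / ((int64_t)1 << n);
}

/* 32x32 -> 64 bit signed multiply with A split 15+17 and B split 6+26. */
int64_t mul32_split_26_6(int32_t a, int32_t b)
{
    const int64_t two17 = (int64_t)1 << 17;
    const int64_t two26 = (int64_t)1 << 26;

    int64_t a1 = low_bits(a, 17), a2 = high_part(a, 17);
    int64_t b1 = low_bits(b, 26), b2 = high_part(b, 26);

    int64_t tmp1 = a1 * b1;                         /* DSP slice */
    int64_t tmp2 = a2 * b1 + high_part(tmp1, 17);   /* DSP slice */
    int64_t tmp3 = a1 * b2;                         /* small: LUTs */
    int64_t tmp4 = a2 * b2 + high_part(tmp3, 17);   /* small: LUTs */

    /* Two partial results: A*B1 and A*B2. */
    int64_t part_lo = tmp2 * two17 + low_bits(tmp1, 17);
    int64_t part_hi = tmp4 * two17 + low_bits(tmp3, 17);

    /* Final large adder: A*B = A*B1 + (A*B2 << 26). */
    return part_lo + part_hi * two26;
}
```

Two multiplier slices plus one adder slice would match the 3 DSPs reported for the 10ns case, although the thread does not confirm that this is exactly what the tool does.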

On p45-46-47 you will see some operations requiring 1 to 8 DSP slices.

You can also have a look at some older user guides. The one for the Virtex-5, UG193, describes what I have just done on p70.

Regards

Olivier

 

==================================
Olivier Trémois
XILINX EMEA DSP Specialist
7 Replies
Xilinx Employee
Registered: 01-09-2008

Re: Buffer for Integer Multiplication with High Frequency in Automatically-Generated Verilog Code

@zslwyuan,

In order to meet the clock period requirement, Vivado HLS has to introduce pipeline registers that break the operation up into smaller pieces. When the task is broken up into 6 pieces, HLS must introduce 6 loads and stores into registers, and this adds to the total latency of the IP. Unfortunately, I don't know enough about the details of C synthesis or the project in question to say why a shift buffer is required to hold the result of the multiplication across the last 4 cycles. But 2 cycles at 3ns is still shorter than the 8.47ns of the original single-cycle solution; I'd guess other calculations are still being done or the multiplication is still being accumulated.
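To make the "shift buffer" concrete, here is a rough behavioural model in C of a multiply followed by a 4-deep register chain, as described in the question. It is only a sketch: the type and function names are hypothetical, the depth of 4 is taken from the question, and this is not the RTL that HLS actually emits. The result is kept as the low 32 bits of the product.

```c
#include <stdint.h>
#include <string.h>

#define EXTRA_STAGES 4   /* "4 extra cycles" from the question */

typedef struct {
    uint32_t stage[EXTRA_STAGES];   /* stage[0] is the newest value */
} mul_pipe_t;

/* Clear all pipeline registers. */
void mul_pipe_reset(mul_pipe_t *p)
{
    memset(p, 0, sizeof *p);
}

/* One clock edge: output the oldest register, shift, capture a*b. */
uint32_t mul_pipe_clock(mul_pipe_t *p, int32_t a, int32_t b)
{
    uint32_t out = p->stage[EXTRA_STAGES - 1];

    for (int i = EXTRA_STAGES - 1; i > 0; i--)
        p->stage[i] = p->stage[i - 1];

    /* Unsigned multiply keeps the low 32 bits without overflow issues. */
    p->stage[0] = (uint32_t)a * (uint32_t)b;
    return out;
}
```

In this model a product written on one call comes out 4 calls later, so the throughput stays at one result per clock while the latency grows.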

Nicholas Moellers

Xilinx Worldwide Technical Support
Visitor zslwyuan
Registered: 01-12-2018

Re: Buffer for Integer Multiplication with High Frequency in Automatically-Generated Verilog Code

Dear Nicholas,

 

    Thanks a lot for your prompt reply! I show the source code below:

=====================

    for (i = 0; i < 100; i++)

        data_out[i] = data_in[i] * data_in[i] + 1;

=====================
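For anyone who wants to reproduce this, a self-contained version of the loop might look like the sketch below. The function name `square_plus_one` and the array-size macro are hypothetical, and the `int` element type is the one confirmed later in the thread. The pipelining directive is only mentioned in a comment so that the file also compiles with an ordinary C compiler.

```c
#define N 100

/* Hypothetical top-level function wrapping the posted loop. */
void square_plus_one(const int data_in[N], int data_out[N])
{
    int i;

    /* In Vivado HLS, a PIPELINE directive on this loop would request
       one new iteration per clock cycle. */
    for (i = 0; i < N; i++)
        data_out[i] = data_in[i] * data_in[i] + 1;
}
```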

    The behaviour is the same even without any directives: the overall latency of the multiplication increases as the frequency increases.

    I am confused because the multiplication can be done by combinational logic (it takes 8.47ns, less than one 10ns cycle), so it does not seem necessary to break it up into 6 pieces when the frequency gets higher.

    Moreover, another interesting phenomenon is that only 3 DSPs are needed when the clock period is 10ns, but 4 DSPs are needed when the clock period is 3ns.

    Besides, I wonder whether there is any way for me to figure out this practical problem, or how I can report such a confusing situation to the development team?

    Thanks again for your time and suggestion!

 

Best Regards,

-------------------------------------

Tingyuan LIANG

Xilinx Employee
Registered: 01-09-2008

Re: Buffer for Integer Multiplication with High Frequency in Automatically-Generated Verilog Code

Hi Tingyuan

Could you share the datatype of your data_in and data_out arrays?

The estimated DSP count can be slightly off. You should go on to the 'export' stage and 'evaluate' the generated RTL (synthesis, place and route) to get a much closer estimate.

 

Regards

Olivier TREMOIS

==================================
Olivier Trémois
XILINX EMEA DSP Specialist
Visitor zslwyuan
Registered: 01-12-2018

Re: Buffer for Integer Multiplication with High Frequency in Automatically-Generated Verilog Code

@oliviert

Hi Olivier,

 

    Thanks a lot for your suggestions! The datatype is int (32-bit integer).

    Actually, when I read the generated Verilog code, HLS does instantiate extra DSPs in the source code.

    It seems that when the frequency is high, the generated Verilog does the multiplication in a way like A*B*1 instead of A*B.

    Thanks again for your time and further explanation!

 

Best Regards,

---------------------------

Tingyuan

Visitor zslwyuan
Registered: 01-12-2018

Re: Buffer for Integer Multiplication with High Frequency in Automatically-Generated Verilog Code

Dear Olivier,

 

    Thanks a lot for your detailed explanation! I get it!

    I still have one more question: why might the multiplication be implemented in a way like A*B*1 in the generated Verilog?

    Thanks again !

 

Best Regards,

------------------------------

Tingyuan
