- Buffer for Integer Multiplication with High Freque...

zslwyuan

Observer

01-02-2019 06:39 AM

1,938 Views

Registered:
01-12-2018

Hi all,

I am confused by the generated Verilog for a simple multiplication in a pipelined for-loop.

When the clock period is 10ns, the latency of the multiplication is 8.47ns (within one cycle). When the clock period is 3ns, the latency of the multiplication turns out to be 6 cycles plus 2.13ns (i.e. 6*3 + 2.13 = 20.13ns).

My first question: why does the latency of the operation differ?

I then looked into the generated Verilog for the two cases. With clock_period=10ns, the multiplication is simply a " * " operation. However, with a clock period of 3ns, there is a shift buffer that stores the result of the multiplication for 4 extra cycles.

My second question: why is the buffer necessary?

Thanks in advance for your explanation and suggestion!!! ^_^

Best regards,

-----------------------

Tingyuan

1 Solution

Accepted Solutions

oliviert

Xilinx Employee

01-07-2019 02:15 AM

1,795 Views

Registered:
01-09-2008

Your data are 32-bit integers.

DSP48E1 slices (7 series) have 18x25 multipliers.

DSP48E2 slices (UltraScale and UltraScale+) have 18x27 multipliers.

Let's have a look at the latter (UG579); the process is almost the same for the previous generations (back to Virtex-4).

On p13, you can see this multiplier followed by multiplexers. The last one has 2 inputs with an integrated 17-bit shift (to the right, losing the LSBs). This allows you to implement wide multipliers using 1 or more DSP slices. Typically, for a 32x32-bit multiplication:

A = A2:A1 with A on 32 bits, A1 on 17 bits and A2 on 15 bits

B = B2:B1 with B on 32 bits, B1 on 17 bits and B2 on 15 bits

In order to multiply A by B (M = AxB), you operate in 4 main steps, one multiplication each:

- Tmp1 = (0:A1)x(0:B1)
  - A 0 is prepended as the MSB of each 17-bit part so that the operands are positive numbers
  - Tmp1 is on 18+18 = 36 bits (in a 48-bit container, because P is a 48-bit register)
  - M1 = LSB17(Tmp1): the 17 least significant bits of Tmp1
- Tmp2 = A2x(0:B1) + (Tmp1>>17)
- Tmp2 = (0:A1)xB2 + Tmp2
  - M2 = LSB17(Tmp2): the 17 least significant bits of Tmp2
- Tmp3 = A2xB2 + (Tmp2>>17)
  - M3 = Tmp3
  - M3 is on 15+15+1 bits

At the end: M = M3:M2:M1

This technique will use 4 DSP slices and no LUTs (just FF to store intermediate registers)
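
As a rough illustration, the four steps above can be checked in software. This is only a sketch of the arithmetic, not the RTL that Vivado HLS generates: the function name `mul32_split` and the helper `lsb17` are made up for this example, and it assumes that `>>` on a negative signed value is an arithmetic shift (true on mainstream compilers, guaranteed since C++20).

```cpp
#include <cassert>
#include <cstdint>

// LSB17 from the post: the 17 least significant bits of a value.
static inline int64_t lsb17(int64_t v) { return v & 0x1FFFF; }

// Signed 32x32 -> 64-bit multiply built from four narrow multiplications,
// following the Tmp1/Tmp2/Tmp3 steps. Every operand fits in 18 signed bits;
// the >>17 models the shifted input that feeds the next step.
int64_t mul32_split(int32_t a, int32_t b) {
    const int64_t a1 = a & 0x1FFFF;  // (0:A1): 17 LSBs, zero-extended
    const int64_t b1 = b & 0x1FFFF;  // (0:B1)
    const int64_t a2 = a >> 17;      // A2: 15 MSBs, carries the sign
    const int64_t b2 = b >> 17;      // B2

    const int64_t tmp1 = a1 * b1;    // step 1
    assert(tmp1 >= 0);               // both operands are non-negative
    const int64_t m1 = lsb17(tmp1);

    int64_t tmp2 = a2 * b1 + (tmp1 >> 17);  // step 2
    tmp2 = a1 * b2 + tmp2;                  // step 3
    const int64_t m2 = lsb17(tmp2);

    const int64_t m3 = a2 * b2 + (tmp2 >> 17);  // step 4

    // M = M3:M2:M1 (multiplication instead of << keeps negative m3 well defined)
    return m3 * (int64_t{1} << 34) + (m2 << 17) + m1;
}
```

For any pair of `int32_t` inputs the result should equal `int64_t{a} * b`. Note that the C loop in this thread keeps only the low 32 bits of the product, so a synthesis tool may need less hardware than the full-width version sketched here.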

Now, if the clock rate is lower, you can use fewer DSP slices. If you split B into 6+26 bits, you can perform the first 2 stages using DSP slices, but the other 2 can be done in LUTs because B2 is pretty small.

- Tmp1 = (0:A1)x(0:B1) : use DSP slice
- Tmp2 = A2x(0:B1) + (Tmp1>>17) : use DSP slice
- Tmp3 = (0:A1)xB2
- Tmp4 = A2xB2 + (Tmp3>>17)

Then there is a large adder at the end to add the 2 partial results, which will use a DSP slice (to achieve the clock rate).

On p45-46-47 you will see some operations requiring 1 to 8 DSP slices.

You can also have a look at some older UGs. The Virtex-5 one, UG193, describes what I have just done on p70.

Regards

Olivier

==================================

**Olivier Trémois**

XILINX SW Marketing AI Engine Tools

*Don't forget to reply, give kudos, and accept as solution.*

7 Replies

nmoeller

Xilinx Employee

01-02-2019 12:40 PM

1,899 Views

Registered:
09-05-2018

In order to meet the clock period requirement, Vivado HLS has to introduce pipeline buffers to break the task up into smaller pieces. When the task is broken up into 6 pieces, HLS must introduce 6 loads and stores into registers, and this adds to the total latency of the IP. Unfortunately, I don't know enough about the details of C synthesis or the project in question to say why a shift buffer is required to store the result of the multiplication across the last 4 cycles. But 2 cycles at 3ns is still shorter than the 8.47ns of the original single-cycle solution; I'd guess other calculations are still being done or the multiplication is still being accumulated.
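
To make the latency effect concrete, here is a small cycle-level model of a multiplier followed by a register chain. It is a sketch under stated assumptions, not the Verilog that HLS emits: the class name `PipelinedMul` and its `clock` method are invented for this example, and the stage count is a template parameter so that the "4 extra cycles" from the question can be tried.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Cycle-level model: a product is written into the first register and moves
// one register further on every clock edge. STAGES is the number of registers.
template <std::size_t STAGES>
class PipelinedMul {
    static_assert(STAGES >= 1, "at least one register stage is required");

public:
    // One clock edge: accept a new operand pair and return the value that
    // leaves the last register on this edge.
    int32_t clock(int32_t a, int32_t b) {
        const int32_t out = regs_[STAGES - 1];
        for (std::size_t i = STAGES - 1; i > 0; --i) {
            regs_[i] = regs_[i - 1];  // shift every stage by one position
        }
        // Keep the low 32 bits of the product, as int * int does in the C loop.
        regs_[0] = static_cast<int32_t>(static_cast<int64_t>(a) * b);
        return out;
    }

private:
    std::array<int32_t, STAGES> regs_{};  // all stages start at 0
};
```

With `PipelinedMul<4>`, the product of the operands passed to the first `clock` call comes out of the fifth call. A new operand pair is still accepted on every call, so the added registers cost latency but not throughput.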

Nicholas Moellers

Xilinx Worldwide Technical Support

zslwyuan

Observer

01-03-2019 12:42 AM

1,875 Views

Registered:
01-12-2018

Dear Nicholas,

Thanks a lot for your prompt reply! I show the source code below:

=====================

for (i = 0; i < 100; i++)
    data_out[i] = data_in[i] * data_in[i] + 1;

=====================
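
For readers who want to reproduce this, the loop above could sit in a top-level function along these lines. This is a minimal sketch: the function name `square_plus_one`, the constant `N`, and the array interface are assumptions for illustration, and only the loop body comes from the post. The element type is a 32-bit integer, as stated later in the thread.

```cpp
#include <cstdint>

constexpr int N = 100;  // loop bound taken from the snippet above

// Hypothetical top-level function; name and interface are assumptions.
void square_plus_one(const int32_t data_in[N], int32_t data_out[N]) {
    for (int i = 0; i < N; i++) {
        // A "#pragma HLS PIPELINE" directive could be placed here; per the
        // post, the latency behaviour is the same with or without directives.
        data_out[i] = data_in[i] * data_in[i] + 1;
    }
}
```

Compiled as ordinary C++, this simply computes `x * x + 1` for each element. The target clock period (10ns or 3ns in this thread) is a setting of the HLS project and does not appear in the source.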

The problem is the same even without any directives: the overall latency of the multiplication increases as the frequency increases.

I am confused because the multiplication can actually be done in combinational logic (it takes 8.47ns, less than one 10ns cycle), so it seems unnecessary to break it up into 6 pieces when the frequency gets higher.

Moreover, another interesting phenomenon is that only 3 DSPs are needed when the clock period is 10ns, but 4 DSPs are needed when the clock period is 3ns.

Besides, I wonder whether there is any way for me to figure out this practical technical problem, or how I can report such a confusing situation to the development team?

Thanks again for your time and suggestion!

Best Regards,

-------------------------------------

Tingyuan LIANG

zslwyuan

Observer

01-03-2019 12:52 AM

1,867 Views

Registered:
01-12-2018

oliviert

Xilinx Employee

01-03-2019 05:40 AM

1,858 Views

Registered:
01-09-2008

Hi Tingyuan,

Could you share the data type of your data_in and data_out arrays?

The estimated DSP count can be slightly off. You should go on to the 'export' stage and 'evaluate' the generated RTL (synthesis, place and route) to get a much closer estimate.

Regards

Olivier TREMOIS

zslwyuan

Observer

01-05-2019 09:57 PM

1,817 Views

Registered:
01-12-2018

Hi Olivier,

Thanks a lot for your suggestions! The data type is int (32-bit integer).

Actually, when I read the generated Verilog code, HLS does instantiate extra DSPs in the source code.

It seems that when the frequency is high, the generated Verilog does the multiplication as something like A*B*1 instead of A*B.

Thanks again for your time and further explanation!

Best Regards,

---------------------------

Tingyuan

zslwyuan

Observer

01-07-2019 11:48 PM

1,764 Views

Registered:
01-12-2018

Dear Olivier,

Thanks a lot for your detailed explanation! I get it!

I still have one more question: why might the multiplication be implemented as something like A*B*1 in the generated Verilog?

Thanks again !

Best Regards,

------------------------------

Tingyuan