ross@bitbybitsp.com

02-22-2018 07:48 PM

DSP48 inference in Vivado

I'm trying to implement a multiply with symmetric round-toward-zero in a DSP48 using inference from source code, and obtain the highest possible operational speed. I could use some help getting Vivado DSP48 inference to work properly.

Note that round-to-even would be OK for my application also if it didn't compromise speed, and there are examples of code that infers to round-to-even, but these examples use the pattern detect and I understand that using the pattern detect lowers maximum speed. So I believe I should avoid this.

I've found no examples of round-toward-zero, but piecing together the round-to-even example with other examples has led to partial success. The code below does round-toward-zero, but it does not reach the highest speed, because I'm unable to get Vivado to infer a register on the DSP48 carry input. The carry input is currently the slowest path in my entire design, achieving only around 450 MHz in an UltraScale device. Code below, then more comments:

`define w assign
`define r always@(posedge clk)

//
// This implements symmetric round-towards-zero. It implements in a single DSP48E2,
// with a small amount of outside CLBs for early sign determination that is used
// for the rounding.
//
module mult_real_const
#(
    parameter REAL_WIDTH = 27,
    parameter CONSTANT_WIDTH = 18,
    parameter CONSTANT_FRAC_BITS = CONSTANT_WIDTH-1,
    parameter MULT_PROCESSING_DELAY = 4
)
(
    input clk,
    input [REAL_WIDTH-1:0] in_real_i,
    input [CONSTANT_WIDTH-1:0] in_const_i,
    output [REAL_WIDTH-1:0] out_real_o
);

    reg sign_out_r;
    reg sign_out_rr;
    reg sign_out_rrr;

    `r sign_out_r <= in_real_i[REAL_WIDTH-1] ^ in_const_i[CONSTANT_WIDTH-1];
    `r sign_out_rr <= sign_out_r;
    `r sign_out_rrr <= sign_out_rr;

    // DSP48E2
    mult_real_const_inner #(.REAL_WIDTH(REAL_WIDTH), .CONSTANT_WIDTH(CONSTANT_WIDTH), .CONSTANT_FRAC_BITS(CONSTANT_FRAC_BITS), .MULT_PROCESSING_DELAY(MULT_PROCESSING_DELAY))
        dsp48(.clk(clk), .in_real_i(in_real_i), .in_const_i(in_const_i), .cin_i(sign_out_rrr), .out_real_o(out_real_o));

endmodule // mult_real_const

//
// This implements as a single DSP48E2. It has a DONT_TOUCH property because otherwise
// it gets flattened into the next level up, which causes Vivado to implement it in two
// DSP48E2s due to some Vivado error.
//
(* DONT_TOUCH = "YES" *)
module mult_real_const_inner
#(
    parameter REAL_WIDTH = 27,
    parameter CONSTANT_WIDTH = 18,
    parameter MULT_WIDTH = REAL_WIDTH + CONSTANT_WIDTH,
    parameter CONSTANT_FRAC_BITS = CONSTANT_WIDTH-1,
    parameter ROUNDING_VALUE = (1<<(CONSTANT_FRAC_BITS-1))-1,
    parameter MULT_PROCESSING_DELAY = 4, // The value the external world says it needs
    parameter THIS_MULT_PROCESSING_DELAY = 4 // The value we actually supply in the code below
)
(
    input clk,
    input [REAL_WIDTH-1:0] in_real_i,
    input [CONSTANT_WIDTH-1:0] in_const_i,
    input cin_i,
    output [REAL_WIDTH-1:0] out_real_o
);

    generate
        if(MULT_PROCESSING_DELAY != THIS_MULT_PROCESSING_DELAY)
            error_module(clk, bad_processing_delay); // No such module. Just to throw an error in this case
    endgenerate

    reg signed [CONSTANT_WIDTH-1:0] in_const_r;
    reg signed [CONSTANT_WIDTH-1:0] in_const_rr;
    reg signed [REAL_WIDTH-1:0] in_real_r;
    reg signed [REAL_WIDTH-1:0] in_real_rr;
    reg signed [MULT_WIDTH-1:0] mult_r;
    reg signed [MULT_WIDTH-1:0] pre_rounded_r;

    // DSP48E2 begins

    // Delay 1
    `r in_real_r <= in_real_i;
    `r in_const_r <= in_const_i;

    // Delay 2
    `r in_real_rr <= in_real_r;
    `r in_const_rr <= in_const_r;

    // Delay 3
    `r mult_r <= in_real_rr * in_const_rr;

    // Delay 4
    `r pre_rounded_r <= mult_r + ROUNDING_VALUE + cin_i;

    `w out_real_o = pre_rounded_r[REAL_WIDTH+CONSTANT_FRAC_BITS-1:CONSTANT_FRAC_BITS];

    // DSP48E2 ends

endmodule // mult_real_const_inner
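
For anyone checking the arithmetic rather than the inference, here is a minimal Python sketch of what the multiply-and-round stage above computes: add `ROUNDING_VALUE` plus a carry equal to the XOR of the operand signs, then drop the fractional bits. The function name `round_ties_toward_zero` is hypothetical, pipeline latency is ignored, and the truncation of the result to `REAL_WIDTH` bits is not modeled; it is an illustration under those assumptions, not a bit-exact replacement for simulation.

```python
def round_ties_toward_zero(a: int, b: int, frac_bits: int) -> int:
    """Software model of: pre_rounded = a*b + ROUNDING_VALUE + cin,
    followed by discarding the low frac_bits bits."""
    # Same expression as the ROUNDING_VALUE parameter: one LSB less than one half.
    rounding_value = (1 << (frac_bits - 1)) - 1
    # Carry-in is 1 when the operand signs differ (XOR of the two sign bits).
    cin = int((a < 0) != (b < 0))
    # Python's >> on negative ints is an arithmetic (flooring) shift,
    # matching the bit-slice of a signed two's-complement value.
    return (a * b + rounding_value + cin) >> frac_bits


if __name__ == "__main__":
    half = 1 << 16  # 0.5 when frac_bits = 17
    print(round_ties_toward_zero(3, half, 17))   # 1.5 is a tie  -> 1
    print(round_ties_toward_zero(-3, half, 17))  # -1.5 is a tie -> -1
```

With a positive product the constant is one LSB short of one half, so exact ties round down; with a negative product the carry brings it up to exactly one half, so exact ties round up. In both cases a tie moves toward zero, while every other value goes to the nearest integer.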

In this code, I had to separate it into two modules and put DONT_TOUCH on the inner module to keep Vivado from messing up. Without the DONT_TOUCH, or if you put it all in a single module, Vivado implements it as two DSP48s when it should implement it as a single DSP48. Also, if I move one of the registers on the carry from the outside module into the inner module, what *should* happen is that Vivado should put that register into the DSP48 to enable the register on the carry input. What *actually* happens is that Vivado implements this code using two DSP48s if I do that.

What I want to do to achieve maximum speed is be able to turn on the register that's in the carry chain input of the DSP48. I'm currently stumped, and clearly Vivado is doing some very weird things which makes it harder. Can anyone help, either with a solution or some insight into why Vivado is doing what it's doing? I'd really like to stay with inference from RTL code if I can.

Thanks!

Ross

3 Replies


maps-mpls

02-23-2018 09:12 AM - edited 02-23-2018 09:13 AM

I don't think rounding to even will affect speed. It may use an additional DSP48 and add latency, depending on what you're doing. Did you look at UG901, the synthesis user guide, which has sample code? You might also reverse engineer what the FIR Compiler in the IP catalog does when you build a simple FIR with symmetric rounding toward 0, for clues.

Quality support from Xilinx: (1) since: 09-17-2020, (2) since 9/13/2020, (3) since: 09-05-2020, (4) since: 05-14-2020, (5) since: 04-29-2018


ross@bitbybitsp.com

02-23-2018 12:50 PM


Thanks for the reply, maps-mpls. I understand that rounding-to-even reduces achievable FPGA speed because of what it says in UG479. Here's a clip; note the last sentence, which I highlighted in red:

Pattern Detect Logic

The pattern detector is connected to the output of the add/subtract/logic unit in the DSP48E1 slice (see Figure 2-14). The pattern detector is best described as an equality check on the output of the adder/subtracter/logic unit that produces its result on the same cycle as the P output. There is no extra latency between the pattern detect output and the P output of the DSP48E1 slice. The use of the pattern detector leads to a moderate speed reduction due to the extra logic on the pattern detect path (see Figure 2-17).

I can only assume that the user guide is correct, and that using the pattern detector will cause slower operation. Round-to-even uses the pattern detector, so I assume round-to-even incurs a speed hit.
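
To make the comparison concrete, here is a minimal Python sketch of round-half-to-even on a fixed-point product. It adds one half, truncates, and then clears the result LSB when the discarded fractional bits of the sum are all zero. That equality check is the tie test which, as discussed above, the pattern detector would perform in hardware. The function name `round_half_even` and the use of unbounded Python integers are assumptions for illustration; nothing here is taken from UG479 or from inferred Vivado output.

```python
def round_half_even(product: int, frac_bits: int) -> int:
    """Round product / 2**frac_bits to the nearest integer, ties to even."""
    half = 1 << (frac_bits - 1)
    frac_mask = (1 << frac_bits) - 1
    total = product + half
    quotient = total >> frac_bits  # arithmetic (flooring) shift
    # A tie is exactly the case where the fractional bits of the sum are zero.
    # This comparison is the extra step that round-toward-zero does not need.
    if (total & frac_mask) == 0:
        quotient &= ~1  # force the LSB to 0 so the result is even
    return quotient


if __name__ == "__main__":
    # With frac_bits = 1 the inputs below are 0.5, 1.5, 2.5 and -1.5.
    print([round_half_even(p, 1) for p in (1, 3, 5, -3)])  # [0, 2, 2, -2]
```

The tie test depends on the adder output, which is why it cannot be folded into a constant plus carry-in the way the round-toward-zero scheme can.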

I would assume the FIR Compiler uses a template instantiation of the DSP. Are you sure it uses inferred code with symmetric rounding? If so I should see if I can find it and take a look. For the moment I've found a workaround where I simply avoid using the carry_in. It costs extra LUTs, but it's fast.


maps-mpls

02-23-2018 05:00 PM


When I use the FIR Compiler and enable symmetric rounding toward 0, it uses an extra DSP in addition to the DSPs needed for the FIR filter itself. I also have optimize for speed enabled. I haven't tested all combinations of options, though. That's the first time I've noticed the note you pointed out. If you experiment with this, make sure to set your sample frequency and clock frequency to the same value, and pay attention to the Implementation details tab on the left.

However, I suppose that just because I've never noticed a performance hit doesn't mean there isn't one.
