bfung

Explorer

06-19-2017 12:07 PM

How to Optimize Addition Logic?

Hello,

Design: A single data-receive module accepts one *trial* of data, composed of contiguous 8-bit *samples*. The module adds N trials together sample by sample: sample k of every trial is summed to form sample k of the result, so the output is a single summed trial with the same layout as the inputs. This summed trial is then sent to the next module for further processing.

Fig. 1: Trial 1[SAMPLE1-1, SAMPLE1-2, SAMPLE1-3, ...] + Trial 2[SAMPLE2-1, SAMPLE2-2, SAMPLE2-3, ...] = Summed Trial[(SAMPLE1-1+SAMPLE2-1), (SAMPLE1-2+SAMPLE2-2), (SAMPLE1-3+SAMPLE2-3), ...]

Currently in my RTL, I use a for loop within a generate block to instantiate the adders required to add samples across trials. I designed the adder myself, and it uses just the simple '+' operation, so the synthesis tool (Vivado) decides which primitives to use.

I am looking for techniques that use the fewest CLBs and logic resources to perform this addition, whether through optimizations in my RTL, instantiating primitives directly, or other means. Any suggestions would be greatly appreciated. Thanks!

PS: I am targeting the Virtex UltraScale+ XCVU9P-L2FLGA2104E

10 Replies

markcurry

Scholar

06-19-2017 03:27 PM

bfung,

Good start at the algorithmic description - but you'll need to go into more detail of what the hardware / IO looks like.

Where are the input samples coming from? (External pins? RAM on the FPGA? Another module?) Where are the calculated results going? What are the sample rates? What differentiates the Trial sets i.e "Sample1" sets from "Sample2" sets? Is the number of samples "N" in each set fixed or variable?

Regards,

Mark

bfung

Explorer

06-19-2017 03:57 PM

Certainly -

Input data: Each sample (8 bits) represents the magnitude of a parameter, and samples are arranged contiguously within a trial according to their occurrence in time (e.g. the earliest sample received by a transceiver is Sample M within a single trial of M samples, while the latest is Sample 1). Samples at the same position in different trials are therefore repeated "measurements" of the same quantity, and the idea of adding them is to increase the quality of the information.

OK, so these trials enter into this "Trial Adding" module which takes all the samples of a single trial, and adds them to a rolling summation (as previously described) destined to be the addition of N trials. The final summation is then output to another module for further signal processing.

I would like the sample additions to be performed all in parallel (hence the generated parallel adders). If possible, all additions should complete within one clock period (6.4 ns at 156.25 MHz).

anusheel

Moderator

06-19-2017 10:32 PM

Check whether any resources (the '+' operators in this case) can be shared, and make sure the `-resource_sharing` synthesis option is set to on or auto.

If you can share the relevant code and schematic, we can check whether any further optimizations are possible.

Thanks,

Anusheel

a_chami

Explorer

06-20-2017 08:39 AM

Did you consider using DSP48 blocks?

Avi Chami MSc

FPGA Site

a_chami

Explorer

06-20-2017 10:07 AM - edited 06-20-2017 10:11 AM

I don't think the synthesis tool will infer a DSP48 primitive unless it is specifically told to.

Recently I made an adder, and the only way I found to get the synthesizer to infer a DSP48 was to use attributes. Without attributes, and without using an IP wizard to instantiate a DSP48, the synthesizer chose LUTs. There may be a synthesis option that achieves a similar result (using DSP blocks where possible), but I am not aware of one.

One advantage is that DSP48 blocks are dedicated hard silicon rather than LUT fabric. So if your design is LUT hungry (almost all of them are), you can be better off using hard resources like block RAM and DSP blocks, freeing LUTs and FFs in the process.

Another advantage of the DSP48 is speed. I made an adder with four operands. In logic, I would have had to pipeline it to meet timing. With the DSP48, I got the result in a single clock.

So I saved not only the LUTs for the adders, but also the pipeline registers.
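For readers who want to try the attribute approach described above, here is a minimal sketch. It assumes the Vivado `use_dsp` synthesis attribute (older releases name it `use_dsp48`). The module name, port names, and widths are hypothetical; the widths are chosen to match the 9-bit and 8-bit adder operands that appear later in this thread.

```verilog
// Hypothetical example: ask synthesis to map this adder onto a DSP slice.
// Assumption: the tool honors the use_dsp attribute ("use_dsp48" in older Vivado releases).
(* use_dsp = "yes" *)
module add_samples_dsp #(
    parameter A_BITS = 9,   // width of the running sum (assumed)
    parameter B_BITS = 8    // width of the incoming sample (assumed)
) (
    input  wire              clock,
    input  wire [A_BITS-1:0] a,
    input  wire [B_BITS-1:0] b,
    output reg  [A_BITS-1:0] s
);
    // Registered sum; the result is truncated to A_BITS bits,
    // so the caller must size A_BITS to avoid overflow.
    always @(posedge clock)
        s <= a + b;
endmodule
```

Whether the adder actually lands in a DSP slice can depend on the tool version and operand widths, so it is worth confirming in the post-synthesis utilization report.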

Avi Chami MSc

FPGA Site

markcurry

Scholar

06-20-2017 11:15 AM

bfung,

I'm still not clear on your algorithm's "goes intos" and "goes outtas". If you're just receiving one 8-bit sample per clock cycle at 156.25 MHz, then nothing stated so far seems to indicate that you need anything more than ONE adder. (I'm still not clear on your output rates.) I.e. receive a new 8-bit sample and accumulate it with the one adder. Add some sort of control for resetting the accumulator, and/or other control.

You speak about M samples and N trials, but we're not clear on when/where/how all those samples are arriving. I.e. in one clock cycle, you're not receiving a full M*N samples, are you?
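To make the one-adder idea concrete, here is a rough sketch under stated assumptions: exactly one sample arrives per clock when `in_valid` is high, M and N are fixed and at least 2, and samples arrive in the same order in every trial. The module, its ports, and its parameters are all hypothetical names for illustration, not taken from the posted design. A small memory holds one running sum per sample position, and a single adder is reused for every sample.

```verilog
// Hypothetical sketch: one shared adder plus a memory of running sums.
// Assumptions: one sample per clock, fixed M and N (both >= 2).
module trial_accumulator #(
    parameter SAMPLE_BITS = 8,                        // width of each incoming sample
    parameter M           = 16,                       // samples per trial
    parameter N           = 3,                        // trials to sum
    parameter SUM_BITS    = SAMPLE_BITS + $clog2(N)   // holds N * (2^SAMPLE_BITS - 1)
) (
    input  wire                   clock,
    input  wire                   reset,
    input  wire                   in_valid,
    input  wire [SAMPLE_BITS-1:0] sample_in,
    output reg                    out_valid,
    output reg  [SUM_BITS-1:0]    sum_out
);
    reg [SUM_BITS-1:0]  sums [0:M-1];   // one running sum per sample position
    reg [$clog2(M)-1:0] sample_idx;     // position within the current trial
    reg [$clog2(N)-1:0] trial_idx;      // index of the trial currently arriving

    // During the first trial the stored value is ignored, so no clear pass is needed.
    wire [SUM_BITS-1:0] base     = (trial_idx == 0) ? {SUM_BITS{1'b0}} : sums[sample_idx];
    wire [SUM_BITS-1:0] next_sum = base + sample_in;  // the single adder

    always @(posedge clock) begin
        out_valid <= 1'b0;
        if (reset) begin
            sample_idx <= 0;
            trial_idx  <= 0;
        end else if (in_valid) begin
            sums[sample_idx] <= next_sum;
            sum_out          <= next_sum;
            out_valid        <= (trial_idx == N-1);   // final trial: sums stream out
            if (sample_idx == M-1) begin
                sample_idx <= 0;
                trial_idx  <= (trial_idx == N-1) ? 0 : trial_idx + 1;
            end else begin
                sample_idx <= sample_idx + 1;
            end
        end
    end
endmodule
```

With this arrangement the summed trial streams out one sample per clock during the last of the N trials, rather than appearing as one wide word. If several samples arrive per clock, the same structure could be replicated once per sample lane instead of once per sample in the trial.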

--Mark

bfung

Explorer

06-20-2017 11:22 AM

Thanks for the suggestion, I will follow up shortly with code and schematic.

bfung

Explorer

06-20-2017 03:58 PM

```verilog
`include "Macros.v"

/*
Add and Decimate Trials
==============

This module adds a specific number of trials together and decimates the result by outputing the same sized blocks (with the same sized samples) as the input data.

Parameters
----------

| Parameter | Default | Explanation |
|-----------|:-------:|------------:|
| `COUNTER_BITS` | 32 | Bitwidth of the internal trial and block counters |
| `BLOCKS_IN_TRIAL` | 4 | The number of blocks within a single trial |
| `BITS_IN_SAMPLE` | 8 | The number of bits each sample correlation is encoded with|
| `SAMPLES_IN_BLOCK` | 2 | The number of samples within a single block|
| `TRIALS_TO_SUM` | 3 | The number of trials to sum|

Interface
---------

                       +----------+
clock    >-----------> |>         |
reset    >-----------> | Add/Dec  |
                       | Trials   |
                       |          |
in_data  >-----------> |          | >-----|Decimate|-------> out_data
in_ready <-----------< |          | <-----------< out_ready
in_valid >-----------> |          | >-----------> out_valid
                       +----------+

Notes
-----

- `out_valid` is asserted one clock cycle after last trial is loaded to allow one clock cycle duration for summations to finish.
*/

module Add_Dec_Trials (
    clock,
    reset,
    data_in,
    data_out,
    in_valid,
    out_valid,
    in_ready,
    out_ready
);

// ### Parameters
parameter
    COUNTER_BITS     = 32,
    BLOCKS_IN_TRIAL  = 4,
    BITS_IN_SAMPLE   = 1,
    SAMPLES_IN_BLOCK = 8,
    TRIALS_TO_SUM    = 3;

// ### Internal Parameters
localparam
    TRIAL_SIZE            = BLOCKS_IN_TRIAL * SAMPLES_IN_BLOCK * BITS_IN_SAMPLE,
    BLOCK_BITS            = BITS_IN_SAMPLE * SAMPLES_IN_BLOCK,
    MAX_SAMPLE_SIZE_INNER = TRIALS_TO_SUM * (`pow2(BITS_IN_SAMPLE)-1),
    MAX_SAMPLE_SIZE       = `flog2(MAX_SAMPLE_SIZE_INNER)+1,
    MAX_TRIAL_SIZE        = MAX_SAMPLE_SIZE * SAMPLES_IN_BLOCK * BLOCKS_IN_TRIAL;

// ### State Encoding
localparam
    S_RESET     = 2'd0,
    S_IDLE      = 2'd1,
    S_SHIFT_ADD = 2'd2;

// ### I/O
input  [0:0] clock;
input  [0:0] reset;
input  [BLOCK_BITS-1:0] data_in;
input  [0:0] in_valid;
input  [0:0] out_ready;
output [TRIAL_SIZE-1:0] data_out;
output [0:0] out_valid;
output [0:0] in_ready;

// ### Internal Wires and Registers
reg [COUNTER_BITS-1:0] block_count;
reg [COUNTER_BITS-1:0] next_block_count;
reg [COUNTER_BITS-1:0] trial_count;
reg [COUNTER_BITS-1:0] next_trial_count;
reg [1:0] state;
reg [1:0] next_state;
reg [MAX_TRIAL_SIZE-1:0] trial_sum;
reg [TRIAL_SIZE-1:0] shifted_trial_out;
reg [MAX_TRIAL_SIZE-1:0] prev_trial_sum;
integer j=0;
wire [0:0] out_valid;
wire [0:0] in_ready;
wire [MAX_TRIAL_SIZE-1:0] adder_summed_trial;

// ### Combinational logic
assign in_ready = ((state == S_IDLE) || (state == S_SHIFT_ADD)) ? 1 : 0;
assign out_valid = (trial_count == TRIALS_TO_SUM) ? 1 :0;
assign data_out = (trial_count == TRIALS_TO_SUM) ? decimate_trial_sum(trial_sum) : 0;

// ### Sequential Logic
// ### STATE TRANSITIONS ---------------------------------------
always @(posedge clock) begin : STATE_TRANSITIONS
    if (reset) state <= S_RESET;
    else state <= next_state;
end

// ### STATE TRANSITION LOGIC ---------------------------------------
always @(*) begin : STATE_TRANSITION_LOGIC
    next_state = state;
    case (state)
        S_RESET: begin
            if (!reset && !in_valid) next_state = S_IDLE;
            else if (!reset && in_valid) next_state = S_SHIFT_ADD;
            else next_state = S_RESET;
        end
        S_IDLE: begin
            if (in_valid) next_state = S_SHIFT_ADD;
            else next_state = S_RESET;
        end
        S_SHIFT_ADD: begin
            if (in_valid) next_state = S_SHIFT_ADD;
            else next_state = S_IDLE;
        end
    endcase
end

// ### STATE MACHINE OUTPUTS ---------------------------------------
always @(posedge clock) begin : STATE_OUTPUTS
    if (reset) begin
        shifted_trial_out <= 0;
    end
    else begin
        if (state == S_IDLE) begin
            shifted_trial_out <= 0;
        end
        else if (state == S_SHIFT_ADD) begin
            shifted_trial_out <= {data_in, shifted_trial_out[BLOCK_BITS +: TRIAL_SIZE-BLOCK_BITS]};
        end
    end
end

// ### TRIAL SUMMATION LOGIC---------------------------------------
always @(posedge clock) begin
    prev_trial_sum <= trial_sum;
end

always @(trial_count) begin : TRIAL_SUMMATION
    trial_sum <= (trial_count == 1) ? save_first_trial(shifted_trial_out) : adder_summed_trial;
end

// ### BLOCK AND TRIAL COUNTERS ---------------------------------------
always @(posedge clock) begin : COUNTERS
    block_count <= (reset) ? 0 : next_block_count;
    trial_count <= (reset) ? 0 : next_trial_count;
end

always @(*) begin : INCREMENT_COUNTERS
    next_block_count = block_count;
    next_trial_count = trial_count;
    if (state == S_SHIFT_ADD && in_valid) begin
        if (block_count == BLOCKS_IN_TRIAL) begin
            next_block_count = 1;
            if (trial_count == TRIALS_TO_SUM) next_trial_count = 0;
        end
        else next_block_count = block_count + 1;
        if (block_count == BLOCKS_IN_TRIAL-1) next_trial_count = trial_count + 1;
    end
end
// ---------------------------------------

//### Function Descriptions ---------------------------------------
function [MAX_TRIAL_SIZE-1:0] save_first_trial;
    input [TRIAL_SIZE-1:0] first_shifted_trial;
    begin
        for (j=1; j<=SAMPLES_IN_BLOCK * BLOCKS_IN_TRIAL; j=j+1) begin
            save_first_trial[(MAX_SAMPLE_SIZE*(j-1)) +: MAX_SAMPLE_SIZE] = shifted_trial_out[(BITS_IN_SAMPLE*(j-1)) +: BITS_IN_SAMPLE];
        end
    end
endfunction

function [TRIAL_SIZE-1:0] decimate_trial_sum;
    input [MAX_TRIAL_SIZE-1:0] sum_to_decimate;
    begin
        for (j=1; j<=SAMPLES_IN_BLOCK * BLOCKS_IN_TRIAL; j=j+1) begin
            decimate_trial_sum[BITS_IN_SAMPLE*(j-1) +: BITS_IN_SAMPLE] = sum_to_decimate[(MAX_SAMPLE_SIZE*(j-1) + (MAX_SAMPLE_SIZE-BITS_IN_SAMPLE)) +: BITS_IN_SAMPLE];
        end
    end
endfunction

//### Generate DSP48 Adders ---------------------------------------
genvar i;
generate
    for (i=1; i<=(SAMPLES_IN_BLOCK * BLOCKS_IN_TRIAL); i=i+1) begin : Sample_Adder
        wire [MAX_SAMPLE_SIZE-1:0] sum_out;
        Add_Samples Sample_Add (
            .A(prev_trial_sum[(MAX_SAMPLE_SIZE*(i-1)) +: MAX_SAMPLE_SIZE]), // input wire [8 : 0] A
            .B(shifted_trial_out[(BITS_IN_SAMPLE*(i-1)) +: BITS_IN_SAMPLE]), // input wire [7 : 0] B
            .S(sum_out) // output wire [8 : 0] S
        );
        assign adder_summed_trial[(MAX_SAMPLE_SIZE*(i-1)) +: MAX_SAMPLE_SIZE] = sum_out;
    end
endgenerate
// ---------------------------------------

endmodule
```

This is the complete module. The code for instantiating the DSP48 adders is located at the bottom. Using the parameter values as stated in the comments in the header, the following resources are used after synthesis:

- ARITHMETIC/DSP: 4096
- CLB/CARRY: 8
- CLB/LUT: 628998
- REGISTER/SDR: 71837

I am not surprised by the number of DSPs required, as this is the number of sample adders needed. What surprises me is the number of LUTs needed. Perhaps other aspects of my module need optimization. Any pointers would help.