How to Optimize Addition Logic?

Hello,

 

Design: A single data-receive module accepts one trial of data, composed of contiguous 8-bit samples. The module adds N of these trials together sample by sample (sample k of every trial is summed to form sample k of the result), producing a single summed trial of contiguous summed samples, and then sends that summed trial to the next module for further processing.

 

Fig. 1: Trial 1[SAMPLE1-1, SAMPLE1-2, SAMPLE 1-3...] + Trial 2[SAMPLE2-1, SAMPLE2-2, SAMPLE 2-3...] = Summed Trial[(SAMPLE1-1+SAMPLE2-1), (SAMPLE1-2+SAMPLE2-2), (SAMPLE1-3+SAMPLE2-3), ...]

 

Currently in my RTL, I use a for loop inside a generate block to instantiate the adders required to add samples across trials. Each adder is my own module that uses just the simple '+' operator, and I let the synthesis tool (Vivado) decide which primitives to use.
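
A minimal sketch of the generate-loop approach described above, assuming a whole trial is presented in parallel and the running sum is registered. The module name `trial_accumulate`, its ports, and its parameters are hypothetical and not taken from the original RTL; `SUM_W` is assumed to be strictly larger than `SAMPLE_W`.

```verilog
// Hypothetical sketch: sample-wise accumulation of one trial into a running sum.
// All names here are illustrative assumptions, not the poster's actual code.
module trial_accumulate #(
  parameter SAMPLES  = 4,   // samples per trial (M)
  parameter SAMPLE_W = 8,   // bits per input sample
  parameter SUM_W    = 10   // bits per summed sample; assumed > SAMPLE_W
) (
  input  wire                        clock,
  input  wire                        clear,     // start a new sum with this trial
  input  wire                        in_valid,  // a full trial is on trial_in
  input  wire [SAMPLES*SAMPLE_W-1:0] trial_in,
  output reg  [SAMPLES*SUM_W-1:0]    sum_out
);
  genvar i;
  generate
    for (i = 0; i < SAMPLES; i = i + 1) begin : g_add
      // Zero-extend sample i to the accumulator width
      wire [SUM_W-1:0] sample_ext =
        {{(SUM_W-SAMPLE_W){1'b0}}, trial_in[i*SAMPLE_W +: SAMPLE_W]};

      // One '+' per sample position; the tool chooses carry chains or DSPs
      always @(posedge clock) begin
        if (in_valid) begin
          if (clear) sum_out[i*SUM_W +: SUM_W] <= sample_ext;
          else       sum_out[i*SUM_W +: SUM_W] <= sum_out[i*SUM_W +: SUM_W] + sample_ext;
        end
      end
    end
  endgenerate
endmodule
```

Each loop iteration produces one independent adder and one accumulator register slice, so resource use scales linearly with `SAMPLES`.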

 

I am looking for techniques that use the fewest CLBs and logic resources to perform this addition, whether through optimizations in my RTL, instantiating primitives directly, or other means. Any suggestions would be greatly appreciated. Thanks!

 

PS: I am targeting the Virtex UltraScale+ XCVU9P-L2FLGA2104E

Scholar markcurry

Re: How to Optimize Addition Logic?

bfung,

 

Good start at the algorithmic description - but you'll need to go into more detail of what the hardware / IO looks like.

Where are the input samples coming from? (External pins? RAM on the FPGA? Another module?) Where are the calculated results going? What are the sample rates? What differentiates the trial sets, i.e., the "Sample1" set from the "Sample2" set? Is the number of samples "N" in each set fixed or variable?

 

Regards,

 

Mark


Re: How to Optimize Addition Logic?

Certainly - 

 

Input data: Each 8-bit sample represents the magnitude of a parameter, and samples are arranged contiguously within a trial according to their occurrence in time (e.g., the earliest sample received by a transceiver is Sample M within a single trial of M samples, while the latest is Sample 1). As such, samples at the same position across trials are duplicate "measurements", and the idea of adding them is to increase the quality of the information.

 

OK, so these trials enter this "Trial Adding" module, which takes all the samples of a single trial and adds them to a running summation (as previously described) that will become the sum of N trials. The final summation is then output to another module for further signal processing.

 

I want the sample additions to be performed in parallel (hence the generated parallel adders). If possible, all additions should complete within one clock period at 156.25 MHz (6.4 ns).

 

 


Re: How to Optimize Addition Logic?

The reason for trying to minimize the synthesized logic is to be able to replicate this adder module across more receiver channels.

Re: How to Optimize Addition Logic?

@bfung

 

Check whether there is any possibility of resource sharing (of '+' in this case), and make sure resource_sharing is set to on or auto.

If you can share the code and schematic, we can check whether any further optimization is possible.

 

Thanks,
Anusheel

 

 


Re: How to Optimize Addition Logic?

Did you consider using DSP48 blocks?

Avi Chami MSc
FPGA Site

Re: How to Optimize Addition Logic?

Thanks for the suggestion - I had thought to just use the "+" operator and let the synthesis tool determine the best resource to use. Would you mind explaining why instantiating a DSP48 primitive myself would be better?


Re: How to Optimize Addition Logic?

I don't think the synthesis tool will infer the DSP48 primitive unless it is specifically told to.

Actually, I recently made an adder, and the only way I found to get the synthesizer to infer a DSP48 was to use attributes. Without attributes, and without using an IP Wizard to instantiate a DSP48, the synthesizer chose to use LUTs. Maybe there is a switch in the synthesis options that achieves a similar result (inferring DSP blocks where possible), but I am not aware of such an option.

 

One advantage is that DSP48 blocks are dedicated hard blocks rather than LUT fabric. So if your design is LUT hungry (almost all of them are), you can be better off using hard resources like block RAM and DSP blocks, freeing LUTs and FFs in the process.

 

Another advantage of the DSP48 is that it is very fast. I made an adder with four operands. In fabric logic I would have had to pipeline it, or it would not have met timing. When I used the DSP48, I got the result in a single clock.

 

So I did not only save myself the LUTs for the adders, but also the registers of the pipeline.
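
As a rough sketch of the attribute approach mentioned above: Vivado synthesis accepts a `use_dsp` attribute (named `use_dsp48` in older releases) that asks the tool to map arithmetic into DSP slices. The module `add4_dsp` and its ports are hypothetical; how many DSP slices a four-operand add actually consumes depends on the tool version and device, so the utilization report should be checked.

```verilog
// Hypothetical four-operand adder; module and port names are illustrative.
// The attribute requests DSP mapping for the arithmetic in this module.
(* use_dsp = "yes" *)
module add4_dsp #(
  parameter W = 8                 // operand width
) (
  input  wire         clock,
  input  wire [W-1:0] a,
  input  wire [W-1:0] b,
  input  wire [W-1:0] c,
  input  wire [W-1:0] d,
  output reg  [W+1:0] sum         // two extra bits hold the carry of four operands
);
  // The W+2-bit left-hand side widens the whole expression, so no carry is lost
  always @(posedge clock)
    sum <= a + b + c + d;
endmodule
```

Setting the attribute to `"no"` or `"logic"` instead steers the same code away from multiplier/adder DSP mapping, which makes it easy to compare the two utilization reports.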

Avi Chami MSc
FPGA Site
Scholar markcurry

Re: How to Optimize Addition Logic?

bfung,

 

I'm still not clear on your algorithm's "goes intos" and "goes outtas". If you're just receiving one 8-bit sample per clock at 156.25 MHz, then nothing stated so far indicates that you need more than ONE adder. (I'm still not clear on your output rates.) I.e., receive a new 8-bit sample and accumulate it with the one adder. Add some sort of control for resetting the accumulator, and/or other control.
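
A minimal sketch of the single-adder accumulator described here, assuming one sample arrives per clock. The module name `sample_accumulator`, the `clear` control, and the widths are hypothetical. Summing across trials at the same sample position would additionally need per-sample storage for the running sums, which is not shown.

```verilog
// Hypothetical single-adder accumulator; names and widths are assumptions.
module sample_accumulator #(
  parameter SAMPLE_W = 8,    // bits per incoming sample
  parameter SUM_W    = 16    // accumulator width
) (
  input  wire                clock,
  input  wire                reset,      // synchronous reset
  input  wire                clear,      // load the sample instead of accumulating
  input  wire                in_valid,   // sample_in is valid this cycle
  input  wire [SAMPLE_W-1:0] sample_in,
  output reg  [SUM_W-1:0]    sum
);
  always @(posedge clock) begin
    if (reset)
      sum <= {SUM_W{1'b0}};
    else if (in_valid)
      // sample_in is unsigned, so it is zero-extended to SUM_W bits
      sum <= clear ? sample_in : sum + sample_in;
  end
endmodule
```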

 

You speak about M samples and N trials, but we're not clear on when/where/how all those samples are arriving, i.e., in one clock cycle you're not receiving a full M*N samples, are you?

 

--Mark

 

 

 

 


Re: How to Optimize Addition Logic?

Thanks for the suggestion, I will follow up shortly with code and schematic.

Re: How to Optimize Addition Logic?

`include "Macros.v"

/*
Add and Decimate Trials
==============
This module adds a specific number of trials together and decimates the result by outputting the same sized blocks (with the same sized samples) as the input data.

Parameters
----------
| Parameter | Default | Explanation |
|-----------|:-------:|------------:|
| `COUNTER_BITS` | 32 | Bitwidth of the internal trial and block counters |
| `BLOCKS_IN_TRIAL` | 4 | The number of blocks within a single trial |
| `BITS_IN_SAMPLE` | 8 | The number of bits each sample correlation is encoded with|
| `SAMPLES_IN_BLOCK` | 2 | The number of samples within a single block|
| `TRIALS_TO_SUM` | 3 | The number of trials to sum|

Interface
---------
                           +----------+
   clock     >-----------> |>         |
   reset     >-----------> | Add/Dec  |
                           | Trials   |
                           |          |
   data_in   >-----------> |          | >-----|Decimate|-------> data_out
   in_ready  <-----------< |          | <-----------< out_ready
   in_valid  >-----------> |          | >-----------> out_valid
                           +----------+
Notes
-----
- `out_valid` is asserted one clock cycle after last trial is loaded to allow one clock cycle duration for summations to finish.
*/

module Add_Dec_Trials (
  clock, reset,
  data_in, data_out,
  in_valid, out_valid,
  in_ready, out_ready
);

// ### Parameters
  parameter
    COUNTER_BITS = 32,
    BLOCKS_IN_TRIAL = 4,
    BITS_IN_SAMPLE = 1,
    SAMPLES_IN_BLOCK = 8,
    TRIALS_TO_SUM = 3;

// ### Internal Parameters
  localparam
    TRIAL_SIZE = BLOCKS_IN_TRIAL * SAMPLES_IN_BLOCK * BITS_IN_SAMPLE,
    BLOCK_BITS = BITS_IN_SAMPLE * SAMPLES_IN_BLOCK,
    MAX_SAMPLE_SIZE_INNER = TRIALS_TO_SUM * (`pow2(BITS_IN_SAMPLE)-1),
    MAX_SAMPLE_SIZE = `flog2(MAX_SAMPLE_SIZE_INNER)+1,
    MAX_TRIAL_SIZE = MAX_SAMPLE_SIZE * SAMPLES_IN_BLOCK * BLOCKS_IN_TRIAL;

// ### State Encoding
  localparam
    S_RESET = 2'd0,
    S_IDLE = 2'd1,
    S_SHIFT_ADD = 2'd2;

// ### I/O
  input [0:0] clock;
  input [0:0] reset;
  input [BLOCK_BITS-1:0] data_in;
  input [0:0] in_valid;
  input [0:0] out_ready;

  output [TRIAL_SIZE-1:0] data_out;
  output [0:0] out_valid;
  output [0:0] in_ready;

// ### Internal Wires and Registers
  reg [COUNTER_BITS-1:0] block_count;
  reg [COUNTER_BITS-1:0] next_block_count;
  reg [COUNTER_BITS-1:0] trial_count;
  reg [COUNTER_BITS-1:0] next_trial_count;
  reg [1:0] state;
  reg [1:0] next_state;
  reg [MAX_TRIAL_SIZE-1:0] trial_sum;
  reg [TRIAL_SIZE-1:0] shifted_trial_out;
  reg [MAX_TRIAL_SIZE-1:0] prev_trial_sum;
  integer j=0;

  wire [0:0] out_valid;
  wire [0:0] in_ready;
  wire [MAX_TRIAL_SIZE-1:0] adder_summed_trial;

// ### Combinational logic
assign in_ready = ((state == S_IDLE) || (state == S_SHIFT_ADD)) ? 1 : 0;
assign out_valid = (trial_count == TRIALS_TO_SUM) ? 1 :0;
assign data_out = (trial_count == TRIALS_TO_SUM) ? decimate_trial_sum(trial_sum) : 0;

// ### Sequential Logic
  // ### STATE TRANSITIONS ---------------------------------------
  always @(posedge clock) begin : STATE_TRANSITIONS
     if (reset)   state <= S_RESET;
     else         state <= next_state;
  end
 // ### STATE TRANSITION LOGIC ---------------------------------------
  always @(*) begin : STATE_TRANSITION_LOGIC
    next_state = state;
    case (state)
      S_RESET: begin
          if (!reset && !in_valid)            next_state = S_IDLE;
          else if (!reset && in_valid)        next_state = S_SHIFT_ADD;
          else                                next_state = S_RESET;
        end
      S_IDLE: begin
          if (in_valid)                       next_state = S_SHIFT_ADD;
          else                                next_state = S_RESET;
        end
      S_SHIFT_ADD: begin
          if (in_valid)                       next_state = S_SHIFT_ADD;
          else                                next_state  = S_IDLE;
        end
    endcase
  end

// ### STATE MACHINE OUTPUTS ---------------------------------------
  always @(posedge clock) begin : STATE_OUTPUTS
    if (reset) begin
          shifted_trial_out <= 0;
    end else begin
      if (state == S_IDLE) begin
          shifted_trial_out <= 0;
      end else if (state == S_SHIFT_ADD) begin
          shifted_trial_out <= {data_in, shifted_trial_out[BLOCK_BITS +: TRIAL_SIZE-BLOCK_BITS]};
      end
    end
 end

// ### TRIAL SUMMATION LOGIC---------------------------------------
  always @(posedge clock) begin
      prev_trial_sum <= trial_sum;
  end

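  // Synthesis treats this block as combinational, so use a complete
  // sensitivity list and a blocking assignment to keep simulation consistent.
  always @(*) begin : TRIAL_SUMMATION
    trial_sum = (trial_count == 1) ? save_first_trial(shifted_trial_out) : adder_summed_trial;
  end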

// ### BLOCK AND TRIAL COUNTERS ---------------------------------------
  always @(posedge clock) begin : COUNTERS
        block_count <= (reset) ? 0 : next_block_count;
        trial_count <= (reset) ? 0 : next_trial_count;
   end

  always @(*) begin : INCREMENT_COUNTERS
     next_block_count = block_count;
     next_trial_count = trial_count;
     if (state == S_SHIFT_ADD && in_valid) begin
        if (block_count == BLOCKS_IN_TRIAL) begin
          next_block_count = 1;
          if (trial_count == TRIALS_TO_SUM) next_trial_count = 0;
        end else begin
          next_block_count = block_count + 1;
        end
        // Checked on every accepted block, independent of the if/else above
        if (block_count == BLOCKS_IN_TRIAL-1) next_trial_count = trial_count + 1;
     end
  end
// ---------------------------------------
//### Function Descriptions ---------------------------------------
function    [MAX_TRIAL_SIZE-1:0] save_first_trial;
input       [TRIAL_SIZE-1:0] first_shifted_trial;
  begin
    for (j=1; j<=SAMPLES_IN_BLOCK * BLOCKS_IN_TRIAL; j=j+1) begin
      // Read from the function argument rather than the module-level register
      save_first_trial[(MAX_SAMPLE_SIZE*(j-1)) +: MAX_SAMPLE_SIZE] = first_shifted_trial[(BITS_IN_SAMPLE*(j-1)) +: BITS_IN_SAMPLE];
    end
  end
endfunction

function [TRIAL_SIZE-1:0] decimate_trial_sum;
input    [MAX_TRIAL_SIZE-1:0] sum_to_decimate;
  begin
    for (j=1; j<=SAMPLES_IN_BLOCK * BLOCKS_IN_TRIAL; j=j+1) begin
              decimate_trial_sum[BITS_IN_SAMPLE*(j-1) +: BITS_IN_SAMPLE] = sum_to_decimate[(MAX_SAMPLE_SIZE*(j-1) + (MAX_SAMPLE_SIZE-BITS_IN_SAMPLE)) +: BITS_IN_SAMPLE];
    end
  end
endfunction

//### Generate DSP48 Adders ---------------------------------------
genvar i;
generate
    for (i=1; i<=(SAMPLES_IN_BLOCK * BLOCKS_IN_TRIAL); i=i+1) begin : Sample_Adder
    
       wire [MAX_SAMPLE_SIZE-1:0] sum_out;
       
        Add_Samples Sample_Add (
                 .A(prev_trial_sum[(MAX_SAMPLE_SIZE*(i-1)) +: MAX_SAMPLE_SIZE]),    // input wire [8 : 0] A
                 .B(shifted_trial_out[(BITS_IN_SAMPLE*(i-1)) +: BITS_IN_SAMPLE]),    // input wire [7 : 0] B
                 .S(sum_out)    // output wire [8 : 0] S
               );
               
      assign adder_summed_trial[(MAX_SAMPLE_SIZE*(i-1)) +: MAX_SAMPLE_SIZE] = sum_out;
    end
endgenerate

// ---------------------------------------
endmodule
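
As a worked check of the summed-sample width that `MAX_SAMPLE_SIZE` is meant to capture, here is a small simulation-only sketch using the header-comment values (8-bit samples, 3 trials). It assumes the `flog2` macro from Macros.v means floor(log2(x)), which the thread does not show; the module name `sum_width_example` and the use of `$clog2` (Verilog-2005) are substitutions for illustration.

```verilog
// Hypothetical check module, not part of the original design.
module sum_width_example;
  localparam BITS_IN_SAMPLE = 8;
  localparam TRIALS_TO_SUM  = 3;

  // Largest possible summed sample: 3 * (2^8 - 1) = 765
  localparam MAX_SUM  = TRIALS_TO_SUM * ((1 << BITS_IN_SAMPLE) - 1);

  // Bits needed to hold MAX_SUM: $clog2(766) = 10,
  // the same as floor(log2(765)) + 1 = 9 + 1
  localparam SUM_BITS = $clog2(MAX_SUM + 1);

  // Prints: MAX_SUM=765 SUM_BITS=10
  initial $display("MAX_SUM=%0d SUM_BITS=%0d", MAX_SUM, SUM_BITS);
endmodule
```

So each summed sample needs 10 bits under these values, two more than the 8-bit input samples.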

This is the complete module. The code for instantiating the DSP48 adders is located at the bottom. Using the parameter values as stated in the comments in the header, the following resources are used after synthesis:

 

ARITHMETIC/DSP - 4096
CLB/CARRY - 8
CLB/LUT - 628998
REGISTER/SDR - 71837

 

I am not surprised by the number of DSPs, as this is the number of sample adders needed. What surprises me is the number of LUTs. Perhaps other aspects of my module need optimization. Any pointers would help.

 
