Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

- Community Forums
- :
- Forums
- :
- Vivado RTL Development
- :
- Timing Analysis
- :
- Re: Reduce Adder Speed Delay

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

nadaumtimuj

Adventurer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 04:48 AM

1,150 Views

Registered:
01-29-2021

I have a design where I need to add 22 9-bit binary numbers. In performance of my design, this adder delay is becoming significant. Is there a way to reduce or makeup for the adder delay? Any buffer or anything? Please help!

`I = J1*s1+ J2*s2+...............+ J21*s21`

Please note in my equation,

J is 9-bit number,

s is 1-bit number that can be either 0 or 1 and

I is extended to 14-bit.

1 Solution

Accepted Solutions

avrumw

Guide

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-23-2021 05:59 AM - edited 03-23-2021 06:02 AM

397 Views

Registered:
01-23-2009

(I am just guessing here as to why this thread is going in circles...)

My suspicion is that this is some kind of hardware implementation of a neural network or something similar. The goal is on each clock cycle to perform an update of all nodes simultaneously, where the value of one node is the result of this 22 valued sum of the values from the 22 nearest neighbors (or some subset thereof). Thus, there is one clock cycle latency feedback loop in this addition. This is why it can't be pipelined.

Consequently, there is a fixed amount of work to be done per iteration of this update. That work appears to be these 22 gated additions. Since there is a fixed amount of work for each cycle, this implies that there is a maximum frequency that can be obtained per iteration based on the technology. By definition the maximum speed of each iteration (hence the maximum frequency) is the delay through these adders.

So, lets take pipelining off the table - let's assume it can't be done.

The only way to speed this up is to find a better implementation for the multiple additions.

One is to use a tree of adders combinatorially - add pairs together then add the resulting pairs and so on (similar to what @joancab suggested, but combinatorial).

always @(*) begin

sum01 = d0 + d1;

sum23 = d2 + d3;

sum0123 = sum01 + sum23;

...

end

(if these things are all arrays then you can write loops to do this with less code, but the result will be the same)

This will lead to a binary tree of adders of depth 5. This **may** help, but the tools are pretty intelligent and may already have effectively implemented this on their own.

Another approach is a "Carry Save Adder". This is a mechanism that reduces the sum of 3 values to one level of logic and the sum of two values. This can be applied recursively to reduce your sum of 22 values to 21 levels of logic plus one adder. It's unclear whether this would be faster; the carry chain in the FPGA is quite fast with respect to LUT based operations.

You can also try combining the two - grouping the additions into groups of 3, reducing them to 2 with carry save adders, and then using a tree of these (or any other combination). Or groups of 4 or 5 or...

You probably also need to experiment with your zero-ing out process - maybe using the ternary operation would help

sum = (j0 ? s0 : 9'b0) + (j1 ? s1 : 9'b0) + ...

But it will be up to you to do these experiments. This is an unusual problem and has no "known" solution that is better than the others.

And, ultimately, you will find the "best" solution, and it may still not be as fast as you want. But that's what you get. If you are building a system with one clock cycle feedback, you are limited in performance by the length of the critical loop. If you have optimized the critical loop to be as fast as it can be, then that's that. That is your maximum frequency in the chosen technology.

Avrum

34 Replies

dpaul24

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 05:07 AM - edited 03-16-2021 05:08 AM

975 Views

Registered:
08-07-2014

*Is there a way to reduce or makeup for the adder delay?*

You are doing a structural design. So break up the equation in the RTL code and then see if there is improvement.

*Any buffer or anything?*

In a sequential design (involving clock and flops), adding flops would have improved design throughput, but it is a different story.

------------FPGA enthusiast------------

Consider giving "Kudos" if you like my answer. Please mark my post "Accept as solution" if my answer has solved your problem

bruce_karaffa

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 05:10 AM

972 Views

Registered:
06-21-2017

nadaumtimuj

Adventurer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 05:15 AM

966 Views

Registered:
01-29-2021

My design is something like this:

```
always @* begin
Iin_temp = J_n0_temp*s_in0 + J_n1_temp*s_in1 + J_n2_temp*s_in2 + J_n3_temp*s_in3 + J_n4_temp*s_in4 + J_n5_temp*s_in5 + J_n6_temp*s_in6 + J_n7_temp*s_in7 + J_n8_temp*s_in8 + J_n9_temp*s_in9 + J_n10_temp*s_in10 + J_n11_temp*s_in11 + J_n12_temp*s_in12 + J_n13_temp*s_in13 + J_n14_temp*s_in14 + J_n15_temp*s_in15 + J_n16_temp*s_in16 + J_n17_temp*s_in17 + J_n18_temp*s_in18 + J_n19_temp*s_in19 + J_n20_temp*s_in20 + J_n21_temp*s_in21 + h_temp;
if ($signed(Iin_temp) > $signed(p63_75))
Iin <= 9'b011111111;
else if ($signed(Iin_temp) < $signed(n64))
Iin <= 9'b100000000;
else
Iin <= Iin_temp[8:0];
end
```

The temp variables are used for sign extension.

@dpaul24 Thanks

@bruce_karaffa Thank you too

drjohnsmith

Teacher

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 05:17 AM

964 Views

Registered:
07-09-2009

Your performing a multiply accumilate there,

I'm guessing your background is software,

In RTL, you have to remember you are describing hardware,

The hardware you have described there is a chain of multiply accumulators ( MAC ) that complete in one clock !

This sounds like a DSP block , may be a filter.

If so you need to be looking at the DSP48 blocks

which are optimised to construct DSP functions such as MAC

https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf

in particular look at chapter 3.

DSP architectures are very specific and fast ,

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>

nadaumtimuj

Adventurer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 05:27 AM

949 Views

Registered:
01-29-2021

richardhead

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 05:32 AM

944 Views

Registered:
08-01-2012

Do you really need to add all 22 values in a single clock cycle? This is going to be very slow. For example, wouldnt you rather have the result calculated over 4 clocks at 300Mhz rather than a single clock at 50 Mhz ? (note, the first answer returns in less actual time than the 2nd). So why does it have to be a single cycle?

What chip are you targeting? what clock speed are you targetting? why cant it be broken down into an adder tree?

drjohnsmith

Teacher

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 05:36 AM

932 Views

Registered:
07-09-2009

sorry missed that the multiplication was by one bit

So why are you multiplying ?

if you multiply a 7 bit number by a 1 bit number you end up with an 8 bit result ,

then you have to consider signed / unsigned...

What happens to a number if you multiply it by 0 ?

What you need is a bit wise AND not a multiply,

000 0000 and ABC DEFG is 000 0000

111 1111 and ABC DEFG is ABC DEFG

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>

nadaumtimuj

Adventurer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 05:55 AM

896 Views

Registered:
01-29-2021

@drjohnsmith So do you think the following code will be a speed improvement?

```
always @* begin
Iin_temp = J_n0_temp&s_in0 + J_n1_temp&s_in1 + J_n2_temp&s_in2 + J_n3_temp&s_in3 + J_n4_temp&s_in4 + J_n5_temp&s_in5 + J_n6_temp&s_in6 + J_n7_temp&s_in7 + J_n8_temp&s_in8 + J_n9_temp&s_in9 + J_n10_temp&s_in10 + J_n11_temp&s_in11 + J_n12_temp&s_in12 + J_n13_temp&s_in13 + J_n14_temp&s_in14 + J_n15_temp&s_in15 + J_n16_temp&s_in16 + J_n17_temp&s_in17 + J_n18_temp&s_in18 + J_n19_temp&s_in19 + J_n20_temp&s_in20 + J_n21_temp&s_in21 + h_temp;
if ($signed(Iin_temp) > $signed(p63_75))
Iin <= 9'b011111111;
else if ($signed(Iin_temp) < $signed(n64))
Iin <= 9'b100000000;
else
Iin <= Iin_temp[8:0];
end
```

dpaul24

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 06:20 AM

876 Views

Registered:
08-07-2014

I meant breaking up the structural equation of the adder in your RTL so that synthesis produces a better result.

Consider giving "Kudos" if you like my answer. Please mark my post "Accept as solution" if my answer has solved your problem

drjohnsmith

Teacher

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 08:42 AM

824 Views

Registered:
07-09-2009

not really

Draw out the hardware you want.

Why are you setting Iin to 011111111 , 100000000 or Iin_temp ?

where is s_in0 coming from ? is this a signal ?

remember your describing hardware,

Draw how you would you design the hardware , then code that description .

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>

bruce_karaffa

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 08:50 AM

820 Views

Registered:
06-21-2017

nadaumtimuj

Adventurer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 09:59 AM

778 Views

Registered:
01-29-2021

@drjohnsmith After using & instead of * LUT resource utilization got doubled and had a lot of timing failpoints!

Why are you setting Iin to 011111111 , 100000000 or Iin_temp ?

---because Iin_temp is sign extended to 14 bits for addition and then again needs to be passed to original 9-bit system.

where is s_in0 coming from ? is this a signal ?

-It's coming from equations from another modules. Yes, it's a signal.

nadaumtimuj

Adventurer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 10:00 AM

777 Views

Registered:
01-29-2021

Yes it must be completely asynchronous.

richardhead

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 10:04 AM

775 Views

Registered:
08-01-2012

You can only have timing failures if you're using a clock, so this clearly isnt "completely asynchronous". The output of this block is going to a register

Why does it have to return in a single clock? what interface are you using to accept the instruction?

nadaumtimuj

Adventurer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 10:10 AM

754 Views

Registered:
01-29-2021

richardhead

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 10:43 AM - edited 03-16-2021 11:44 AM

742 Views

Registered:
08-01-2012

Thats not really how synchronous design works. You said you are using AXI IP, hence it has ready and valid signals to flag when a data signal is valid and accepted. Hence you can quite simply delay the valid for a few clocks. The addition is not "slow" or "fast", it either meets the timing requirements or it does not. Having 22 adds in a single clock is going to be a great big logic chain and hence have a large delay. Having a requirement to do the sum in a single clock seems rather strange and silly to me. If this really is a requirement, you're going to have to deal with slow clock speeds. Meenwhile, others who dont have a strange requirement will be able to do the sum in several clocks and be able to do many many many more calculations than you can do simply because its far easier to throw far more data through the design.

I suggest re-appraising the requirements. Either reduce the number of additions or increase the number of clocks allowed (ie. increase pipelining)

drjohnsmith

Teacher

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 10:54 AM

737 Views

Registered:
07-09-2009

think you miss understood,

you either set the number to be added to zero or the actual number,

then add all the answers in a tree,

Draw out the circuit, then write the code to implement that

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>

drjohnsmith

Teacher

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 10:55 AM

720 Views

Registered:
07-09-2009

Are these '0' / '1' bits coming from each of the ring oscillators ?

are you assuming not to worry about metastability / or will it help in the random number generator ?

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>

nadaumtimuj

Adventurer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-16-2021 03:15 PM

625 Views

Registered:
01-29-2021

@drjohnsmith Are these '0' / '1' bits coming from each of the ring oscillators ? - They are coming from a logic that is triggered at posedge of ring oscillator. I don't worry much about the quality of the clock.

As soon as the ring oscillator triggered module sends these 0 and 1....I want to do the summation and fed that again in a logic in that ring oscillator triggered module. Both of these processes need to happen in the same clock cycle of the ring osc. It's a combinatorial loop and I am pretty much aware to allow it.

drjohnsmith

Teacher

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-17-2021 02:26 AM

538 Views

Registered:
07-09-2009

Can you provide a drawing ?

You have more than one rind oscillator to make the random number gen ?

and on a change of a ring positive edge, then you want to make a new additoin and feed that back to the ring oscilator ?

What happens if two ring oscillators produce a positive edge within say 50 ps of each other,

if they are asynchronous as you need why would they not at some point ?

Then you have to do this addition in 50 ps ?

Without a block diagram of the system,

it sounds to me as if your randomness will be just because of the metastability / propagation delay uncertainty of the system, as your adder is liable to not have a stable output .

not the random number gen you are expecting I think.

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>

dpaul24

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-17-2021 02:31 AM

537 Views

Registered:
08-07-2014

This thread has already 20+ comments.

It is high time that you provide us a COMPLETE description of your design. This means give us the details of the design inputs, the design functionality and desired outputs. Maybe then we can help you better.

Consider giving "Kudos" if you like my answer. Please mark my post "Accept as solution" if my answer has solved your problem

drjohnsmith

Teacher

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-22-2021 01:35 AM

430 Views

Registered:
07-09-2009

Another thought came to me at the weekend.

Your system, is either adding or not each sample of the "random" clocks

is there not a thrum that if you add multiple white noise sources you get gaussian noise.

if you look at the clocks as being random 1/0 generators,

and you adding them together,

then your going to get out gaussian noise, not white random noise,

QED, you making the number out more predictable as the probability of the center number is much higher than those at the extremes,

Sorry , I've made a few random white noise systems, they are incredibly hard to get through the tests of randomness.

generally , the more complicated you make them, the more likely it is that the answers are not really random.

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>

joancab

Advisor

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-22-2021 01:50 AM - edited 03-22-2021 01:52 AM

420 Views

Registered:
05-11-2015

An addition is an addition no matter how you write it. Summing up N values is always done in pairs, and even with a single sum, the MSbit depends on the carry from the other bits.

My approach would be to pipeline the sums. Take your 22 values and sum up (register) 11 pairs. That can be fast. Those 11 values will become 6, 3, 2 and 1 in the 5th stage. So you have an interval of 1 and latency of 5 cycles.

drjohnsmith

Teacher

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-22-2021 02:19 AM

409 Views

Registered:
07-09-2009

The Op wants it all done asynchronously , one "period"

I think from fact the OP has disappeared, they are having a re think and not got back to us,

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>

joancab

Advisor

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-22-2021 02:41 AM

403 Views

Registered:
05-11-2015

nadaumtimuj

Adventurer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-22-2021 03:37 AM

373 Views

Registered:
01-29-2021

Hi @drjohnsmith Sorry the reason I disappeared is I am not able to share the design since it's a confidential research project. But I sincerely appreciate your and everyone else's suggestions.

My design is actually a 278 node graph and each node has 22 nearest neighbors. What my design does is, it adds 22 9-bit numbers (J) each multiplied by corresponding 22 neighbor's 1-bit output (s). This action happens in an always @* block and should be triggered by s.

So the previous action happens whenever there is a change in any one of the 278 node's output (s). This action happens in an always block at posedge of clock.

My problem is, if the adder delay is too much, then the node outputs can flip again before the adder is done with its job which I do no want. And I also do not want to slow down my clocks.

Note: The node output always block is clocked but the adder always block is not clocked, rather triggered by * i.e. node output.

I apologize that the thread has gone so big and I cannot share my design. Thanks everyone.

joancab

Advisor

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-22-2021 03:55 AM - edited 03-22-2021 04:00 AM

361 Views

Registered:
05-11-2015

1. Combinational designs (always @*) are not the best way of using FPGAs

2. If your input changes at a rate F, you need to register it and process at that rate.

3. Not everything needs to be done in one cycle. Tasks can be split in a chain of sub-tasks to be performed at a rate F. As long as you have an II of 1 you will be able to process data as it is generated. You don't need more speed.

4. Latency. You seem to want everything done in one cycle. Check your specs. Do you need that? What frequency do you aim for? 500 MHz. That's 2 ns. Is it going to be a problem if your processed data comes out after, say, 200 ns (100 cycles at 500 MHz or 20 at 100 MHz)? From my experience, that's usually not a problem.

nadaumtimuj

Adventurer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-23-2021 01:00 AM

307 Views

Registered:
01-29-2021

Hi @joancab Thanks. What did you mean by an II of 1? How can I pipeline an adder?

joancab

Advisor

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-23-2021 01:29 AM - edited 03-23-2021 03:07 AM

295 Views

Registered:
05-11-2015

II = Initiation Interval. Time between data getting in or out (not necessarily the same as latency = time between data getting in and out of a process)

for example:

```
reg <type> sum12, sum34, sum56, sum78...
always @(posedge clock){
sum12 = d1 + d2;
sum34 = d3 + d4;
...
}
always @(posedge clock){
sum 1234 = sum12 + sum34;
sum 5678 = ...
}
```

Actually all the additions can be in the same process. The key thing is the signals to be register type (as opposed to wire).

You have your 0/1 switches, that can be implemented in the first round of additions.