cancel
Showing results for
Show  only  | Search instead for
Did you mean:
1,150 Views
Registered: ‎01-29-2021

I have a design where I need to add 22 9-bit binary numbers. In performance of my design, this adder delay is becoming significant. Is there a way to reduce or makeup for the adder delay? Any buffer or anything? Please help!

``I = J1*s1+ J2*s2+...............+ J21*s21``

J is 9-bit number,

s is 1-bit number that can be either 0 or 1 and

I is extended to 14-bit.

Tags (1)
1 Solution

Accepted Solutions
Guide
397 Views
Registered: ‎01-23-2009

(I am just guessing here as to why this thread is going in circles...)

My suspicion is that this is some kind of hardware implementation of a neural network or something similar. The goal is on each clock cycle to perform an update of all nodes simultaneously, where the value of one node is the result of this 22 valued sum of the values from the 22 nearest neighbors (or some subset thereof). Thus, there is one clock cycle latency feedback loop in this addition. This is why it can't be pipelined.

Consequently, there is a fixed amount of work to be done per iteration of this update. That work appears to be these 22 gated additions. Since there is a fixed amount of work for each cycle, this implies that there is a maximum frequency that can be obtained per iteration based on the technology. By definition the maximum speed of each iteration (hence the maximum frequency) is the delay through these adders.

So, lets take pipelining off the table - let's assume it can't be done.

The only way to speed this up is to find a better implementation for the multiple additions.

One is to use a tree of adders combinatorially - add pairs together then add the resulting pairs and so on (similar to what @joancab suggested, but combinatorial).

always @(*) begin
sum01 = d0 + d1;
sum23 = d2 + d3;
sum0123 = sum01 + sum23;
...
end

(if these things are all arrays then you can write loops to do this with less code, but the result will be the same)

This will lead to a binary tree of adders of depth 5. This may help, but the tools are pretty intelligent and may already have effectively implemented this on their own.

Another approach is a "Carry Save Adder". This is a mechanism that reduces the sum of 3 values to one level of logic and the sum of two values. This can be applied recursively to reduce your sum of 22 values to 21 levels of logic plus one adder. It's unclear whether this would be faster; the carry chain in the FPGA is quite fast with respect to LUT based operations.

You can also try combining the two - grouping the additions into groups of 3, reducing them to 2 with carry save adders, and then using a tree of these (or any other combination). Or groups of 4 or 5 or...

You probably also need to experiment with your zero-ing out process - maybe using the ternary operation would help

sum = (j0 ? s0 : 9'b0) + (j1 ? s1 : 9'b0) + ...

But it will be up to you to do these experiments. This is an unusual problem and has no "known" solution that is better than the others.

And, ultimately, you will find the "best" solution, and it may still not be as fast as you want. But that's what you get. If you are building a system with one clock cycle feedback, you are limited in performance by the length of the critical loop. If you have optimized the critical loop to be as fast as it can be, then that's that. That is your maximum frequency in the chosen technology.

Avrum

34 Replies
Scholar
975 Views
Registered: ‎08-07-2014

Is there a way to reduce or makeup for the adder delay?

You are doing a structural design. So break up the equation in the RTL code and then see if there is improvement.

Any buffer or anything?

In a sequential design (involving clock and flops), adding flops would have improved design throughput, but it is a different story.

------------FPGA enthusiast------------

Scholar
972 Views
Registered: ‎06-21-2017

Are you concerned about clock speed or latency?  Do you care if it requires multiple clock cycles to produce a result as long as you can accept and produce a result on each clock cycle?  If you can live with multiple clock latency, break the addition into an adder tree where you add sets of two numbers on each clock cycle, then add two results and so on.  you can still bring in new inputs on each clock and produce a result on each clock.  If you need the results of nine additions with only one cycle of latency, you will probably need to live with what the synthesizer is giving you.

966 Views
Registered: ‎01-29-2021

My design is something like this:

``````always @*  begin

Iin_temp = J_n0_temp*s_in0 + J_n1_temp*s_in1 + J_n2_temp*s_in2 + J_n3_temp*s_in3 + J_n4_temp*s_in4 + J_n5_temp*s_in5 + J_n6_temp*s_in6 + J_n7_temp*s_in7 + J_n8_temp*s_in8 + J_n9_temp*s_in9 + J_n10_temp*s_in10 + J_n11_temp*s_in11 + J_n12_temp*s_in12 + J_n13_temp*s_in13 + J_n14_temp*s_in14 + J_n15_temp*s_in15 + J_n16_temp*s_in16 + J_n17_temp*s_in17 + J_n18_temp*s_in18 + J_n19_temp*s_in19 + J_n20_temp*s_in20 + J_n21_temp*s_in21 + h_temp;

if (\$signed(Iin_temp) > \$signed(p63_75))
Iin <= 9'b011111111;
else if (\$signed(Iin_temp) < \$signed(n64))
Iin <= 9'b100000000;
else
Iin <= Iin_temp[8:0];
end``````

The temp variables are used for sign extension.

@dpaul24  Thanks What did you mean by break RTL?

@bruce_karaffa Thank you too I cannot wait for multiple clock cycles unfortunately. I need it as fast as possible with all inputs coming at the same time.

Teacher
964 Views
Registered: ‎07-09-2009

Your performing a multiply accumilate there,

I'm guessing your background is software,

In RTL, you have to remember you are describing hardware,

The hardware you have described there is a chain of multiply accumulators ( MAC )  that complete in one clock !

This sounds like a DSP block , may be a filter.

If so you need to be looking at the DSP48 blocks

which are optimised to construct DSP functions such as MAC

https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf

in particular look at chapter 3.

DSP architectures are very specific and fast ,

949 Views
Registered: ‎01-29-2021

@drjohnsmith Please note that in my code s is only a 1 bit number that can be either 0 or 1 where J is 9-bit number. Do I really need to go to DSP just for such multiplication of 7-bit * 1-bit and then take the sum ? Thanks

Scholar
944 Views
Registered: ‎08-01-2012

Do you really need to add all 22 values in a single clock cycle? This is going to be very slow. For example, wouldnt you rather have the result calculated over 4 clocks at 300Mhz rather than a single clock at 50 Mhz ? (note, the first answer returns in less actual time than the 2nd). So why does it have to be a single cycle?

What chip are you targeting? what clock speed are you targetting? why cant it be broken down into an adder tree?

Teacher
932 Views
Registered: ‎07-09-2009

sorry missed that the multiplication was by one bit

So why are you multiplying ?

if you multiply a 7 bit number by a 1 bit number you end up with an 8 bit result ,

then you have to consider signed / unsigned...

What happens to a number if you multiply it by 0 ?

What you need is a bit wise AND not a multiply,

000 0000 and ABC DEFG is 000 0000

111 1111  and ABC DEFG is ABC DEFG

896 Views
Registered: ‎01-29-2021

@drjohnsmith So do you think the following code will be a speed improvement?

``````always @*  begin

Iin_temp = J_n0_temp&s_in0 + J_n1_temp&s_in1 + J_n2_temp&s_in2 + J_n3_temp&s_in3 + J_n4_temp&s_in4 + J_n5_temp&s_in5 + J_n6_temp&s_in6 + J_n7_temp&s_in7 + J_n8_temp&s_in8 + J_n9_temp&s_in9 + J_n10_temp&s_in10 + J_n11_temp&s_in11 + J_n12_temp&s_in12 + J_n13_temp&s_in13 + J_n14_temp&s_in14 + J_n15_temp&s_in15 + J_n16_temp&s_in16 + J_n17_temp&s_in17 + J_n18_temp&s_in18 + J_n19_temp&s_in19 + J_n20_temp&s_in20 + J_n21_temp&s_in21 + h_temp;

if (\$signed(Iin_temp) > \$signed(p63_75))
Iin <= 9'b011111111;
else if (\$signed(Iin_temp) < \$signed(n64))
Iin <= 9'b100000000;
else
Iin <= Iin_temp[8:0];
end``````

Scholar
876 Views
Registered: ‎08-07-2014

I meant breaking up the structural equation of the adder in your RTL so that synthesis produces a better result.

------------FPGA enthusiast------------

Teacher
824 Views
Registered: ‎07-09-2009

not really

Draw out the hardware you want.

Why are you setting Iin to 011111111 , 100000000 or Iin_temp ?

where is s_in0 coming from ? is this a signal ?

Draw how you would you design the hardware , then code that description .

Scholar
820 Views
Registered: ‎06-21-2017

I don't see any reference to a clock in this code.  Are you using a clock, or is this design completely asynchronous?

778 Views
Registered: ‎01-29-2021

@drjohnsmith After using & instead of * LUT resource utilization got doubled and had a lot of timing failpoints!

Why are you setting Iin to 011111111 , 100000000 or Iin_temp ?

---because Iin_temp is sign extended to 14 bits for addition and then again needs to be passed to original 9-bit system.

where is s_in0 coming from ? is this a signal ?

-It's coming from equations from another modules. Yes, it's a signal.

777 Views
Registered: ‎01-29-2021

Yes it must be completely asynchronous.

Scholar
775 Views
Registered: ‎08-01-2012

You can only have timing failures if you're using a clock, so this clearly isnt "completely asynchronous". The output of this block is going to a register

Why does it have to return in a single clock? what interface are you using to accept the instruction?

754 Views
Registered: ‎01-29-2021

@richardhead I have system clock that drives my AXI IP only. For the rest, I have ring oscillator clocks. However, the addition part needs to be async. Whoever will receive the summation will have a clock and needs to catch the summation within its one clock cycle. Therefore, if the addition is slow, it can miss the summation. I need fast addition so that a fast clock can catch it.

Scholar
742 Views
Registered: ‎08-01-2012

Thats not really how synchronous design works. You said you are using AXI IP, hence it has ready and valid signals to flag when a data signal is valid and accepted. Hence you can quite simply delay the valid for a few clocks. The addition is not "slow" or "fast", it either meets the timing requirements or it does not. Having 22 adds in a single clock is going to be a great big logic chain and hence have a large delay. Having a requirement to do the sum in a single clock seems rather strange and silly to me. If this really is a requirement, you're going to have to deal with slow clock speeds. Meenwhile, others who dont have a strange requirement will be able to do the sum in several clocks and be able to do many many many more calculations than you can do simply because its far easier to throw far more data through the design.

I suggest re-appraising the requirements. Either reduce the number of additions or increase the number of clocks allowed (ie. increase pipelining)

Teacher
737 Views
Registered: ‎07-09-2009

think you miss understood,

you either set the number to be added to zero or the actual number,

Draw out the circuit, then write the code to implement that

Teacher
720 Views
Registered: ‎07-09-2009

Are these '0' / '1' bits coming from each of the ring oscillators ?

are you assuming not to worry about metastability / or will it help in the random number generator ?

625 Views
Registered: ‎01-29-2021

@drjohnsmith  Are these '0' / '1' bits coming from each of the ring oscillators ? - They are coming from a logic that is triggered at posedge of ring oscillator. I don't worry much about the quality of the clock.

As soon as the ring oscillator triggered module sends these 0 and 1....I want to do the summation and fed that again in a logic in that ring oscillator triggered module. Both of these processes need to happen in the same clock cycle of the ring osc. It's a combinatorial loop and I am pretty much aware to allow it.

Teacher
538 Views
Registered: ‎07-09-2009

Can you provide a drawing ?

You have more than one rind oscillator to make the random number gen ?

and on a change of a ring positive edge, then you want to make a new additoin and feed that back to the ring oscilator ?

What happens if two ring oscillators produce a positive edge within say 50 ps of each other,

if they are asynchronous as you need why would they not at some point ?

Then you have to do this addition in 50 ps ?

Without a block diagram of the system,

it sounds to me as if your randomness will be just because of the metastability / propagation delay uncertainty of the system, as your adder is liable to not have a stable output .

not the random number gen you are expecting I think.

Scholar
537 Views
Registered: ‎08-07-2014

It is high time that you provide us a COMPLETE description of your design. This means give us the details of the design inputs, the design functionality and desired outputs. Maybe then we can help you better.

------------FPGA enthusiast------------

Teacher
430 Views
Registered: ‎07-09-2009

Another thought came to me at the weekend.

Your system, is either adding or not each sample of the "random" clocks

is there not a thrum that if you add multiple white noise sources you get gaussian noise.

if you look at the clocks as being random 1/0 generators,

then your going to get out gaussian noise, not white random noise,

QED, you making the number out more predictable as the probability of the center number is much higher than those at the extremes,

Sorry , I've made a few random white noise systems, they are incredibly hard to get through the tests of randomness.

generally , the more complicated you make them, the more likely it is that the answers are not really random.

420 Views
Registered: ‎05-11-2015

An addition is an addition no matter how you write it. Summing up N values is always done in pairs, and even with a single sum, the MSbit depends on the carry from the other bits.

My approach would be to pipeline the sums. Take your 22 values and sum up (register) 11 pairs. That can be fast. Those 11 values will become 6, 3, 2 and 1 in the 5th stage. So you have an interval of 1 and latency of 5 cycles.

Teacher
409 Views
Registered: ‎07-09-2009

The Op wants it all done asynchronously , one "period"

I think from fact the OP has disappeared, they are having a re think and not got back to us,

403 Views
Registered: ‎05-11-2015

One of the words typically associated with FPGA is 'fast'. That is true, but an oversimplification. FPGA clocks are actually modest, 100 - 200 MHz They are not really that 'fast' but overall more capable because of parallelization and pipelining. The art of achieving high data rates is not by faster clocks/ shorter delays, that are already very constrained physically. Wonderful designs use data, time and silicon in a wise manner, not brute force.

373 Views
Registered: ‎01-29-2021

Hi @drjohnsmith Sorry the reason I disappeared is I am not able to share the design since it's a confidential research project. But I sincerely appreciate your and everyone else's suggestions.

My design is actually a 278 node graph and each node has 22 nearest neighbors. What my design does is, it adds 22 9-bit numbers (J) each multiplied by corresponding 22 neighbor's 1-bit output (s). This action happens in an always @* block and should be triggered by s.

So the previous action happens whenever there is a change in any one of the 278 node's output (s). This action happens in an always block at posedge of clock.

My problem is, if the adder delay is too much, then the node outputs can flip again before the adder is done with its job which I do no want. And I also do not want to slow down my clocks.

Note: The node output always block is clocked but the adder always block is not clocked, rather triggered by * i.e. node output.

I apologize that the thread has gone so big and I cannot share my design. Thanks everyone.

361 Views
Registered: ‎05-11-2015

1. Combinational designs (always @*)  are not the best way of using FPGAs

2. If your input changes at a rate F, you need to register it and process at that rate.

3. Not everything needs to be done in one cycle.  Tasks can be split in a chain of sub-tasks to be performed at a rate F. As long as you have an II of 1 you will be able to process data as it is generated. You don't need more speed.

4. Latency. You seem to want everything done in one cycle. Check your specs. Do you need that? What frequency do you aim for? 500 MHz. That's 2 ns. Is it going to be a problem if your processed data comes out after, say, 200 ns (100 cycles at 500 MHz or 20 at 100 MHz)? From my experience, that's usually not a problem.

307 Views
Registered: ‎01-29-2021

Hi @joancab Thanks. What did you mean by an II of 1? How can I pipeline an adder?

295 Views
Registered: ‎05-11-2015

II = Initiation Interval. Time between data getting in or out (not necessarily the same as latency = time between data getting in and out of a process)

for example:

``````reg <type> sum12, sum34, sum56, sum78...

always @(posedge clock){
sum12 = d1 + d2;
sum34 = d3 + d4;
...
}
always @(posedge clock){
sum 1234 = sum12 + sum34;
sum 5678 = ...
}``````

Actually all the additions can be in the same process. The key thing is the signals to be register type (as opposed to wire).

You have your 0/1 switches, that can be implemented in the first round of additions.