nadaumtimuj
Adventurer
692 Views
Registered: ‎01-29-2021

Measure a 14-bit combinatorial addition delay

Recently I had a post where I asked how to reduce adder delay when the addition is fully combinatorial (always @*).

I = J1*s1 + J2*s2 + ... + J21*s21

 

Now I want to measure the delay, to know how long the addition takes. I have read some posts that say to check the unconstrained paths, but in Vivado I cannot see any such path. Is it possible to do it with some counter? Please give me some ideas. Thanks!

 

Please note in my equation,

J is 9-bit number,

s is 1-bit number that can be either 0 or 1 and

I is extended to 14-bit.
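
As a quick sanity check on the 14-bit width, the worst case can be worked out directly: every s bit is 1 and every J is at its 9-bit maximum. This is only a minimal sketch, assuming J is unsigned (the post does not say); the function name `bits_needed` is made up for illustration.

```python
def bits_needed(num_terms: int, term_width: int) -> int:
    """Width needed to hold the sum of num_terms unsigned term_width-bit values."""
    max_term = (1 << term_width) - 1   # largest unsigned value, 511 for 9 bits
    max_sum = num_terms * max_term     # worst case: every s_i is 1
    return max_sum.bit_length()        # number of bits required for max_sum


# 21 terms of 9 bits each: worst-case sum is 21 * 511 = 10731
print(bits_needed(21, 9))  # 14
```

Under that assumption 10731 is below 2^14 = 16384, so a 14-bit result cannot overflow.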

1 Solution

Accepted Solutions
avrumw
Guide
675 Views
Registered: ‎01-23-2009

"Measure" how? 

Are you trying to experiment with different implementations to determine which is the fastest? Thus the "measure" is through static timing analysis?

If so, then the best thing to do is to put everything between flip-flops - create a wrapper module that instantiates the module with your combinatorial function in it and has flip-flops for all the inputs of the module and all the outputs of the module. Clock these with a clock (let's call it "clk"). 

Then write constraints for the clock - just as simple as

create_clock -name "clk" -period 3 [get_ports clk]

Now synthesize the wrapper design. The only way to know how fast it is is to constrain it for "faster than it can run" - a sufficiently small period so that the design fails timing. The tools work only hard enough to get the design to pass timing - after that they stop optimizing speed and start optimizing area and power, so in order to know the maximum speed, it must fail.

For your first estimate, comparing different results after synthesis is probably sufficient, but you might want to place and route the final candidates - the number of carry chains can affect placement. But doing place and route on a module can be complicated. If you just do it "normally" you end up with a design that is totally skewed by the paths from the inputs and to the outputs (even though these are to your wrapper flops). So the best way to really test this is with "out of context" mode. Unfortunately out-of-context synthesis is complicated to make work...

And, of course, testing one module in isolation is going to be very optimistic; if you instantiate a large number of these (over 200?) then they are going to compete for resources, and any shared signals are going to have very large fanout, so the performance can drop significantly. So once you have the "best" adder implementation you should move to a full synthesis and implementation of your complete design.

Avrum

14 Replies
nadaumtimuj
Adventurer
655 Views
Registered: ‎01-29-2021

Thanks @avrumw. Yes, I only need an estimate. I don't necessarily want to compare multiple designs. I am happy with a design and want to know how fast it can do the addition. I am trying to implement your idea. I have wrapped my addition module in an addition_wrapper module. In the wrapper module I am doing the following:

 

input a;
output reg b;
input test_clk;

addition............. (instantiate)

always @(posedge test_clk) begin
  b <= a;
end

 

I have created the test_clk as 

create_clock -name test_clk -period 3 [get_ports test_clk]

 

Now my goal is to synthesize this, generate the timing report, and see whether any path related to the addition module (and not any other path) fails. Am I following you correctly? Thank you.

avrumw
Guide
644 Views
Registered: ‎01-23-2009

Let me be clearer.

You have a module that implements your 22 term addition - it has inputs j1, j2, j3... j22, s1, s2, s3... s22, and a single output I, which is the sum.

So make your wrapper have inputs j1_in, j2_in, j3_in... j22_in, s1_in, s2_in... s22_in and an output I_out.

Then instantiate flip-flops for all of these:

always @(posedge test_clk) begin
  j1 <= j1_in;
  j2 <= j2_in;
  ...
  j22 <= j22_in;
  s1 <= s1_in;
  ...
  s22 <= s22_in;

  I_out <= I;
end

Now synthesize your design. You should have timing paths from all your j1, j2, j3... j22, s1, s2, s3... s22 registers (they will be named j1_reg... s22_reg) to your I_out register (I_out_reg). These are the timing paths you are interested in.

Avrum

nadaumtimuj
Adventurer
620 Views
Registered: ‎01-29-2021

Thanks and I did exactly what you said. I couldn't find the j1_reg to I_out_reg paths (I don't know how to find them). But I found that under test_clk, the following paths are failing. Can I safely conclude that my addition_wrapper module (instantiated as weight1) requires 10.554 ns to perform everything inside it?


[Attached screenshot: Capture.JPG]

avrumw
Guide
580 Views
Registered: ‎01-23-2009

"Thanks and I did exactly what you said."

Well, obviously not. My suggestion was that you create a project with ONLY your instantiated module (weight1) and the wrapper. This way the only timing paths you will have are the ones associated with this module. What you are showing here is an unrelated path associated with a block RAM.

Avrum

joancab
Mentor
568 Views
Registered: ‎05-11-2015

That's not the way.

FPGA stuff is mostly based on registers, not combinatorial gates.

If you have a trivial solution for a problem (like adding values) you can make it faster (for example with a faster clock). But you will eventually reach a physical limit on the silicon. Most of the time there is a faster solution than the fastest straightforward approach. We have already suggested pipelining.

OK, let me suggest another approach: let's assume your addition takes 8 ns and you cannot shorten it any further (physical limit), but you need, say, an addition every 2 ns. What you need is four adders and multiplexers. The first data goes to adder1 at t = 0, the second to adder2 at t = 2, and so on. The fourth goes to adder4 at t = 6, and the fifth, at t = 8, goes back to adder1, which has already delivered its result. Similarly you multiplex the outputs and get one result every 2 ns, even though each of them needs 8 ns.
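
The round-robin timing described above can be tabulated with a few lines of Python. This is only a sketch of the schedule, not of the hardware; the function name and its parameters are invented for illustration, and the 8 ns and 2 ns figures are the assumed values from the post.

```python
def interleave_schedule(num_inputs, issue_interval_ns=2, latency_ns=8):
    """Return (adder_index, start_ns, done_ns) for each input, round-robin."""
    # Enough adders that one is always free again when its turn comes round:
    # ceiling(latency / issue interval), i.e. 4 for 8 ns and 2 ns.
    num_adders = -(-latency_ns // issue_interval_ns)
    schedule = []
    for i in range(num_inputs):
        start = i * issue_interval_ns          # a new input arrives every interval
        schedule.append((i % num_adders, start, start + latency_ns))
    return schedule


for adder, start, done in interleave_schedule(6):
    print(f"adder{adder + 1}: input at t={start} ns, result at t={done} ns")
```

After the initial 8 ns of latency, results appear every 2 ns, and the fifth input reuses adder1 exactly when its first result is due.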

 

nadaumtimuj
Adventurer
533 Views
Registered: ‎01-29-2021

@avrumw Sorry for the misunderstanding, and thanks for correcting me. I hope I did it correctly this time. I removed all the other modules, leaving just the wrapper and the weight module. And I can see that the delay from s_in to I_out is approximately 14 ns. I hope that's what I wanted to know. Thanks again.

[Attached screenshot: Capture.JPG]

drjohnsmith
Teacher
527 Views
Registered: ‎07-09-2009

We went through all this in your other post.

The tools do not have a fixed delay for "an OR gate" or the like; they synthesize your design to meet your timing constraints and then stop.

So there is no absolute timing. It depends upon the chip, the constraints, and the design.

The tools will take your lovely constructed design and synthesize it, just as you write lovely C++ code and the tools pump out multi-threaded ASM code for the processor. As in a CPU, you cannot say how long A + B will take; it depends.

 

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>
bin_imad
Visitor
312 Views
Registered: ‎04-06-2021

Thank you avrumw. I highly appreciate your detailed answers.

You helped me a lot.

I have an example if anyone needs it.

bin_imad
Visitor
291 Views
Registered: ‎04-06-2021

I have attached all my work here. Please check and tell me if my work is missing anything.

I have set the clock period to 1 ns.

Worst Negative Slack after implementation is -0.186 ns, so I calculated the maximum delay for the whole design as 1 + 0.186 = 1.186 ns.
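
The arithmetic above is the constrained period minus the (negative) slack. A small Python sketch of that calculation, with an equivalent maximum clock frequency added for reference; the function names are hypothetical, and the result bounds the whole register-to-register path (flip-flop overhead included), not the adder logic alone.

```python
def min_period_ns(constrained_period_ns: float, wns_ns: float) -> float:
    """Shortest period the worst path could meet: period minus worst slack."""
    return constrained_period_ns - wns_ns


def fmax_mhz(period_ns: float) -> float:
    """Clock frequency in MHz corresponding to a period in ns."""
    return 1000.0 / period_ns


period = min_period_ns(1.0, -0.186)   # values reported in the post
print(f"{period:.3f} ns")             # 1.186 ns
print(f"{fmax_mhz(period):.2f} MHz")
```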

Thank you

Attachments: Implementation Timing Summary.jpg, RTL_Schematic.jpg, Implemented_Design.jpg
drjohnsmith
Teacher
256 Views
Registered: ‎07-09-2009

Well, yes and no.

As you show, the clock is coming directly in. Is this into a chip, or into the module inside the chip?

When this is synthesised into your chip, what happens to the FDEs? Are they not in the LUTs?

This looks like pre-synthesis, so the timings are only an indication until you place and route.

Also, you have constrained the clock at 1 GHz (1 ns). In a real design this will be impossible, but the tool does not know that, so synthesis will try its best, and probably expand the design with parallel routes until it gives up.

Over-constraining is not a way to design.

binimad10
Newbie
204 Views
Registered: ‎01-23-2021

I have done the place and route and run the FA on ZedBoard with switches for the inputs and leds for the outputs. Elaborated design and Implementation schematics have not changed.

How can I find the design delay now?

joe306
Scholar
168 Views
Registered: ‎12-07-2018

Hello, did you get an answer to your question?

binimad10
Newbie
132 Views
Registered: ‎01-23-2021

Not yet.

Regards,
