03-25-2021 05:57 PM
Recently I had a post where I asked how to reduce adder delay when the addition is fully combinatorial (always @*).
I = J1*s1 + J2*s2 + ... + J21*s21
Now I want to measure the delay to find out how long the addition takes. I have read some posts that say to check the unconstrained paths, but in Vivado I cannot see any such path. Is it possible to do it with some counter? Please give me some ideas.
Please note in my equation,
J is 9-bit number,
s is 1-bit number that can be either 0 or 1 and
I is extended to 14-bit.
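As a quick sanity check on the widths stated above, here is a minimal Python reference model of the sum. It assumes the J values are unsigned 9-bit numbers (the post does not say whether they are signed), and the names `weighted_sum` and `bits_needed` are illustrative only.

```python
def weighted_sum(j_values, s_values):
    """Reference model of I = J1*s1 + J2*s2 + ... (each s is 0 or 1)."""
    if len(j_values) != len(s_values):
        raise ValueError("need one s bit per J value")
    total = 0
    for j, s in zip(j_values, s_values):
        if not 0 <= j < 2**9:
            raise ValueError("J must fit in 9 unsigned bits (assumption)")
        if s not in (0, 1):
            raise ValueError("s must be 0 or 1")
        total += j * s
    return total


def bits_needed(value):
    """Number of bits needed to hold a non-negative integer."""
    return max(1, value.bit_length())


# Worst case for 21 terms: every J is 511 and every s is 1.
worst_case = weighted_sum([511] * 21, [1] * 21)
print(worst_case, bits_needed(worst_case))
```

Under the unsigned assumption the worst case is 21 x 511 = 10731, which fits in 14 bits, so the stated width of I is sufficient.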
03-25-2021 06:48 PM
"Measure" how?
Are you trying to experiment with different implementations to determine which is the fastest? So the "measure" is through static timing analysis?
If so, then the best thing to do is to put everything between flip-flops - create a wrapper module that instantiates the module with your combinatorial function in it and has flip-flops for all the inputs of the module and all the outputs of the module. Clock these with a clock (let's call it "clk").
Then write constraints for the clock - just as simple as
create_clock -name "clk" -period 3 [get_ports clk]
Now synthesize the wrapper design. The only way to know how fast it is is to constrain it for "faster than it can run" - a sufficiently small period so that the design fails timing. The tools work only hard enough to get the design to pass timing - after that they stop optimizing speed and start optimizing area and power, so in order to know the maximum speed, it must fail.
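Once the design fails timing, the shortest period the path could meet can be estimated from the constrained period and the worst negative slack (WNS) in the timing summary: estimated minimum period = constrained period minus WNS. A minimal sketch of that arithmetic follows. The function names and the numbers are made up for illustration and are not taken from any report in this thread.

```python
def min_period_ns(constrained_period_ns, wns_ns):
    """Estimate the smallest period the path can meet.

    WNS is negative when timing fails, so subtracting it adds the
    shortfall to the constrained period.
    """
    return constrained_period_ns - wns_ns


def fmax_mhz(constrained_period_ns, wns_ns):
    """Convert the estimated minimum period to a frequency in MHz."""
    return 1000.0 / min_period_ns(constrained_period_ns, wns_ns)


# Hypothetical numbers: a 3 ns constraint that fails by 2 ns.
period = min_period_ns(3.0, -2.0)
print(period, fmax_mhz(3.0, -2.0))
```

If the slack is positive, the same formula only gives an upper bound on the period, because the tools stop optimizing for speed once timing is met.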
For your first estimate, comparing different results after synthesis is probably sufficient, but you might want to place and route the final candidates - the number of carry chains can affect placement. But doing place and route on a module can be complicated. If you just do it "normally" you end up with a design that is totally skewed by the paths from the inputs and to the outputs (even though these are to your wrapper flops). So the best way to really test this is with "out of context" mode. Unfortunately out-of-context synthesis is complicated to make work...
And, of course, testing one module in isolation is going to be very optimistic; if you instantiate a large number of these (over 200?) then they are going to compete for resources, and with any shared signals (especially since these are all going to have very large fanout) the performance can drop significantly. So once you have the "best" adder implementation you should move to a full synthesis and implementation of your complete design.
Avrum
03-25-2021 08:36 PM
Thanks @avrumw. Yes, I only need an estimate. I don't necessarily want to compare multiple designs; I am happy with one design and want to know how fast it can do the addition. I am trying to implement your idea. I have wrapped my addition module in an addition_wrapper module. In the wrapper module I am doing the following:
input a;
output reg b;
input test_clk;
addition............. (instantiate)
always @ (posedge test_clk) begin
b <=a;
end
I have created the test_clk as
create_clock -name test_clk -period 3 [get_ports test_clk]
Now my goal is to synthesize this, generate the timing report, and see whether any path related to the addition module fails (not any other path). Am I following you correctly? Thank you.
03-25-2021 09:03 PM
Let me be clearer.
You have a module that implements your 22 term addition - it has inputs j1, j2, j3... j22, s1, s2, s3... s22, and a single output I, which is the sum.
So make your wrapper have inputs j1_in, j2_in, j3_in... j22_in, s1_in, s2_in... s22_in and an output I_out.
Then instantiate flip-flops for all of these:
always @(posedge test_clk) begin
j1 <= j1_in;
j2 <= j2_in;
...
j22 <= j22_in;
s1 <= s1_in;
...
s22 <= s22_in;
I_out <= I;
end
Now synthesize your design. You should have timing paths from all your j1, j2, j3... j22, s1, s2, s3... s22 registers (they will be named j1_reg... s22_reg) to your I_out register (I_out_reg). These are the timing paths you are interested in.
Avrum
03-25-2021 10:21 PM
Thanks and I did exactly what you said. I couldn't find j1_reg to I_out_reg paths (I don't know how to find it). But I found that under test_clk, the following paths are failing. Can I safely conclude that my addition_wrapper module (instantiated as weight1) requires 10.554 ns to perform everything inside it?
03-26-2021 07:14 AM
"Thanks and I did exactly what you said."
Well, obviously not. My suggestion was that you create a project with ONLY your instantiated module (weight1) and the wrapper. This way the only timing paths you will have are the ones associated with this module. What you are showing here is an unrelated path associated with a block RAM.
Avrum
03-26-2021 07:46 AM
That's not the way.
FPGA stuff is mostly based on registers, not combinatorial gates.
If you have a trivial solution for a problem (like adding values) you can make it faster (for example with a faster clock). But you will eventually reach a physical limit of the silicon. Most of the time there is a faster solution than the fastest straightforward approach. We have already suggested pipelining.
Ok, let me suggest another approach: let's assume your addition takes 8 ns, and you cannot shorten it anymore (physical limit) but you need, say, an addition every 2 ns. What you need is four adders and multiplexers. First data goes to adder1 at t = 0. Second data to adder2 at t = 2, etc. Fourth data goes to adder4 at t = 6 and fifth, at t = 8, goes back to adder1 that has already delivered its result. Similarly you multiplex the outputs and get one result every 2 ns, even if each of them needs 8 ns.
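The interleaving scheme described above can be sketched as a small Python schedule model. This is only an illustration of the round-robin idea, not HDL; the function name `interleave_schedule` is hypothetical, and the model assumes every adder has the same fixed latency.

```python
import math


def interleave_schedule(num_items, adder_latency_ns, issue_interval_ns):
    """Round-robin schedule of inputs over several identical adders.

    Returns (num_adders, rows) where each row is
    (item index, adder index, start time, result time), times in ns.
    """
    # Enough adders that each one has finished before it is reused.
    num_adders = math.ceil(adder_latency_ns / issue_interval_ns)
    rows = []
    for item in range(num_items):
        adder = item % num_adders
        start = item * issue_interval_ns
        done = start + adder_latency_ns
        rows.append((item, adder, start, done))
    return num_adders, rows


# Numbers from the post: 8 ns per addition, new data every 2 ns.
num_adders, rows = interleave_schedule(6, 8, 2)
for item, adder, start, done in rows:
    print(f"data {item + 1} -> adder{adder + 1}: in at t={start}, out at t={done}")
```

With these numbers the model uses four adders, and the fifth data item enters adder1 at t = 8, exactly when adder1 delivers its first result. Results then appear every 2 ns even though each one still takes 8 ns.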
03-26-2021 10:33 AM - edited 03-26-2021 10:34 AM
@avrumw Sorry for the misunderstanding, and thanks for correcting me. I hope I did it correctly this time. I removed all other modules, leaving just the wrapper and the weight module. I can see that the delay from s_in to I_out is approximately 14 ns. I hope that's what I wanted to know. Thanks again.
03-26-2021 10:47 AM
We went through all this in your other post.
The tools do not have a fixed delay for "an OR gate" or the like;
they synthesize your design to meet your timing constraints and then stop.
So there is no absolute timing; it depends upon the chip, the constraints, and the design.
The tools will take your lovely constructed design and synthesize it,
just like when you write lovely C++ code and the tools pump out multi-threaded ASM code for the processor.
As in a CPU, you cannot say how long A + B will take; it depends.
04-06-2021 05:20 PM
Thank you avrumw. I highly appreciate your detailed answers.
You helped me a lot.
I have an example if anyone needs it.
04-06-2021 07:15 PM
04-07-2021 04:22 AM
Well, yes and no.
As you show, the clock is coming directly in;
is this into a chip or into a module inside the chip?
When this is synthesized into your chip,
what happens to the FDEs? Are they not in the LUTs?
This looks like pre-synthesis, so the timings are only an indication until you place and route.
Also, you have constrained the clock at 1 GHz (1 ns). In a real design this will be impossible, but the tool does not know that, so synthesis will try its best and probably expand the design with parallel routes until it gives up.
Over-constraining is not a way to design.
04-07-2021 08:35 PM
I have done the place and route and run the FA on a ZedBoard with switches for the inputs and LEDs for the outputs. The elaborated design and implementation schematics have not changed.
How can I find the design delay now?
04-08-2021 02:18 PM
Hello, did you get an answer to your question?
04-09-2021 06:03 PM
Not yet.
Regards,