01-16-2013 06:06 PM
Hi,
I wrote RTL that includes a multiplier with an asynchronous reset, and set certain XST properties to ensure that the DSP48E1 is used.
The target device is xc7vx485t.
The resource utilization report confirms that DSP48E1 slices are used.
However, in the timing report, which contains all paths, I cannot find any "Location" or "Delay type" related to the DSP. I am confused about why there is nothing about the DSP in the timing report.
The complete timing report is attached.
1.RTL:
module multtop(clk,
rst,
a,
b,
c);
input signed [31:0] a;
input signed [31:0] b;
input clk,rst;
output reg signed [63:0] c;
reg signed [63:0] r_c;
reg signed [63:0] rr_c;
always@(posedge clk or posedge rst)begin
if(rst)
begin
r_c<=0;
rr_c<=0;
c<=0;
end
else
begin
r_c<=a*b;
rr_c<=r_c;
c<=rr_c;
end
end
endmodule
2.The fragment of resource Utilization:
Number of DSP48E1s: 4 out of 2,800 1%
3.The fragment of timing report:
Paths for end point rr_c_13 (SLICE_X83Y255.CX), 1 path
--------------------------------------------------------------------------------
Slack (setup path): 1.239ns (requirement - (data path - clock path skew + uncertainty))
Source: r_c_13 (FF)
Destination: rr_c_13 (FF)
Requirement: 4.000ns
Data Path Delay: 2.699ns (Levels of Logic = 0)
Clock Path Skew: -0.027ns (0.996 - 1.023)
Source Clock: clk_BUFGP rising at 0.000ns
Destination Clock: clk_BUFGP rising at 4.000ns
Clock Uncertainty: 0.035ns
Clock Uncertainty: 0.035ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ): 0.070ns
Total Input Jitter (TIJ): 0.000ns
Discrete Jitter (DJ): 0.000ns
Phase Error (PE): 0.000ns
Maximum Data Path at Slow Process Corner: r_c_13 to rr_c_13
Location Delay type Delay(ns) Physical Resource
Logical Resource(s)
------------------------------------------------- -------------------
SLICE_X90Y246.DQ Tcko 0.216 r_c<13>
r_c_13
SLICE_X83Y255.CX net (fanout=1) 2.464 r_c<13>
SLICE_X83Y255.CLK Tdick 0.019 rr_c<14>
rr_c_13
------------------------------------------------- ---------------------------
Total 2.699ns (0.235ns logic, 2.464ns route)
(8.7% logic, 91.3% route)
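For reference, the slack and clock uncertainty in the fragment above can be reproduced from the formulas printed in the report itself, using only the numbers it lists:

```latex
\begin{align*}
\text{uncertainty} &= \frac{\sqrt{\mathrm{TSJ}^2 + \mathrm{TIJ}^2} + \mathrm{DJ}}{2} + \mathrm{PE}
                    = \frac{\sqrt{0.070^2 + 0.000^2} + 0.000}{2} + 0.000 = 0.035\ \text{ns} \\
\text{slack} &= \text{requirement} - (\text{data path} - \text{clock path skew} + \text{uncertainty}) \\
             &= 4.000 - \bigl(2.699 - (-0.027) + 0.035\bigr) = 4.000 - 2.761 = 1.239\ \text{ns}
\end{align*}
```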
01-17-2013 06:20 AM
DSP48E1 Slices only have synchronous resets (not async)
01-17-2013 07:55 AM - edited 01-17-2013 07:56 AM
The timing report is correct.
The paths that you are seeing are all the paths that you have constrained. The timing analysis will only show you constrained paths.
In your design, you multiply "a" and "b". In the RTL code you gave, these are primary inputs to your system. Thus, the tools are routing the a and b directly to the DSP48 for the multiplication.
As bwiec stated, the DSP48E's flip-flops only have synchronous resets. Since your code calls for async resets, the tools have no choice but to not use the DSP48 cell's flip-flops and instead use slice flip flops in the fabric. So (and you can look at this in FPGA Editor) the a and b inputs go directly to the inputs of the DSP48. The multiplication result from the DSP48 comes out combinatorially from the DSP48 and goes to the r_c FFs in the fabric. These are then FF'ed again into rr_c and again into c.
So, let's look at the paths. From your timing report it looks like you only specified a period constraint. Therefore, in this design, only the paths from r_c->rr_c and rr_c->c are constrained. The paths from a&b to r_c are not constrained (since you didn't put an OFFSET IN constraint) and hence are not reported.
So, let's say that you fix this. You change all your FFs to use synchronous resets, and you put an extra set of flip-flops on a and b (let's call them r_a and r_b) to allow them to be flopped once before the multiply. Even here, you will NOT get any report on the multiplication itself. When done this way, the r_a, r_b and r_c FFs will all be pulled into the FFs in the DSP48E slice. The timing engine does not show the paths internal to the block, so it will only show the paths from r_c to rr_c. If you also have an OFFSET IN, then it will show the path from the a&b inputs to r_a and r_b. The timing inside the DSP48 is "guaranteed by design" as long as the clock isn't too fast for the DSP48E in this configuration (where the M pipeline register is not used). Even if it is, this will fail a "pulse width" check - saying that the DSP48 is running too fast - not a setup check.
Avrum
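A minimal sketch of the change described in the post above: synchronous resets plus one extra register stage on each input. The module name mult_sync is hypothetical, and the 25x18 operand widths are an assumption chosen to match the signed multiplier of a single DSP48E1 (the original 32x32 design used four of them). Whether the registers actually get packed into the DSP48E1 is up to the tools and the synthesis settings.

```verilog
// Hypothetical module name; operand widths are an assumption chosen so the
// multiply fits the 25x18 signed multiplier of one DSP48E1.
module mult_sync (
    input  wire               clk,
    input  wire               rst,  // synchronous, active high
    input  wire signed [24:0] a,
    input  wire signed [17:0] b,
    output reg  signed [42:0] c
);
    reg signed [24:0] r_a;  // input register on a
    reg signed [17:0] r_b;  // input register on b
    reg signed [42:0] r_c;  // registered product

    // rst is not in the sensitivity list, so it is sampled only on the
    // rising clock edge (synchronous reset).
    always @(posedge clk) begin
        if (rst) begin
            r_a <= 0;
            r_b <= 0;
            r_c <= 0;
            c   <= 0;
        end else begin
            r_a <= a;          // flop the inputs once before the multiply
            r_b <= b;
            r_c <= r_a * r_b;  // 25 x 18 signed multiply, 43-bit result
            c   <= r_c;        // final output register
        end
    end
endmodule
```

With only a PERIOD constraint, the paths from the a and b pins to r_a and r_b would still be unconstrained; an OFFSET IN constraint would be needed for them to show up in the report.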
01-18-2013 03:37 AM
01-18-2013 03:39 AM
Thank you very much for your reply. I have a follow-up question. I have another project, and after synthesis there is one path that contains a DSP48E. A fragment of the synthesis report is shown below.
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 2.850ns (frequency: 350.853MHz)
Total number of paths / destination ports: 52461 / 2773
------------------------------------------------------------------------------------------------------
Delay: 2.850ns (Levels of Logic = 0)
Source: bx1/yi_11 (FF)
Destination: t_nextb/Maddsub_yiyix (DSP)
Source Clock: clk rising
Destination Clock: clk rising
Data Path: bx1/yi_11 to t_nextb/Maddsub_yiyix
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- -------------------------------------------- ------------
FDR:C->Q 5 0.232 0.298 bx1/yi_11 (bx1/yi_11)
DSP48E1:A20 2.320 t_nextb/Maddsub_yiyix
--------------------------------------------------------------------------------------------------
Total 2.850ns (2.552ns logic, 0.298ns route)
(89.5% logic, 10.5% route)
Since the timing engine does not show the paths internal to the DSP48E block, what does the Gate Delay of DSP48E1:A20 mean? Thank you.
01-18-2013 06:11 AM
In this case, you have a path that starts outside the DSP48 and ends at a FF inside the DSP48. This path is timed and reported. In my previous post what I meant is that the tools will not report paths entirely in the DSP48 cell.
In this case the DSP48E1:A20 shows the propagation delay through the DSP48 pin A20 to an internal FF in the DSP48 (I can't tell which one from this report). This is (effectively) the setup time requirement of this pin of the DSP48 cell.
Avrum
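As a check on the synthesis report quoted earlier, the delay entries add up to the reported total, with the 2.320ns attributed to pin A20 counted as logic (gate) delay rather than routing:

```latex
\begin{align*}
\text{logic} &= 0.232 + 2.320 = 2.552\ \text{ns} \\
\text{route} &= 0.298\ \text{ns} \\
\text{total} &= 2.552 + 0.298 = 2.850\ \text{ns}
\end{align*}
```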
01-19-2013 03:58 AM
Thank you for your reply. The attachment is the architecture of the DSP48E1 slice from the datasheet. So what you mean is that the delay through the MULT25*18 block will not be reported, and is "guaranteed by design" as long as the clock isn't too fast for the DSP48E in this configuration. Is that right?
But I am confused about why the delay from pin A20 to the internal FF can be so long, and why the MULT25*18 can run so fast, because it is generally acknowledged that a path including a multiplication will be the critical path of an architecture.
Thank you again for your reply. I have learned so much from it.
01-19-2013 04:06 AM
Here is the attachment of the DSP48E1 slice. I forgot to include it in my last reply. Looking forward to your reply.
01-19-2013 07:48 AM
Almost all timing paths in an FPGA start at one element (or outside the FPGA) and end in another element. For these paths, the tool enumerates all the cells the path traverses through and gives you detailed timing for each component. The ISE timing analyzer is slice based, so each component in a timing report will be the time through a slice (i.e. from the input of the slice through a LUT and to the output of the slice).
There are very few components that have complete paths within them. I can only think of a few:
- the paths inside the IDDR and ODDR cells when the cell is in SAME_EDGE or SAME_EDGE_PIPELINED mode
- the paths inside an ISERDES and OSERDES
- the paths inside the block RAM, particularly from the output latches to the output FF (in pipelined mode)
- many paths within the DSP48 - all of these depend on whether the corresponding FFs are enabled or not
  - between the FFs in the dual A and B registers
  - from the A, B, and D FFs to the M FFs
  - from the M FFs to the P FFs
For these paths, the tool does not explicitly describe the paths - they are completely contained in a single slice, so are not broken down. Instead of explicitly timing the cells, the tools know how fast the clock can be and still meet the internal timing of these paths. For example, if the slowest path is from the A FF to the M FF, and it is 3ns, then the tool knows that the cell can run at a clock period of 3ns or higher. So instead, it puts a pulse width check on the clock port of the DSP48 requiring the clock period to be at least 3ns.
For all other paths, it generates timing reports. This includes paths that start outside the DSP cell and end inside it, or vice versa.
As for the relative speed of the multiplier vs. the other stuff, that is entirely the point of the DSP48 cell. Logic implemented in the FPGA fabric has inherent limitations due to the reprogrammability of the FPGA. Notably, the programmable interconnect between cells is "slow". Multipliers are relatively common and extremely complex combinatorial functions. If implemented in the fabric they would be VERY SLOW. As a result, Xilinx has included the DSP48 cell. This is a hard block (i.e. an ASIC block) that implements the multiplication. Since it doesn't use programmable logic for implementing the actual multiplication, it is much faster than logic implemented in the fabric (maybe by a factor of 10 or more).
Avrum
01-19-2013 10:47 PM
Thank you for your reply. I don't understand the purpose of the Component Switching Limit Checks in the timing report. Do they have something to do with the pulse width check you mentioned?
Here's the fragment of the Component Switching Limit Checks in the timing report:
Component Switching Limit Checks: TS_clk = PERIOD TIMEGRP "clk" 4 ns HIGH 50%;
--------------------------------------------------------------------------------
Slack: 1.574ns (period - min period limit)
Period: 4.000ns
Min period limit: 2.426ns (412.201MHz) (Tdspper_AREG_PREG_MULT)
Physical resource: Mmult_r_a[25]_r_b[18]_MuLt_6_OUT_submult_0/CLK
Logical resource: Mmult_r_a[25]_r_b[18]_MuLt_6_OUT_submult_0/CLK
Location pin: DSP48_X14Y81.CLK
Clock network: clk_BUFGP
--------------------------------------------------------------------------------
Slack: 2.651ns (period - min period limit)
Period: 4.000ns
Min period limit: 1.349ns (741.290MHz) (Tbcper_I(Fmax))
Physical resource: clk_BUFGP/BUFG/I0
Logical resource: clk_BUFGP/BUFG/I0
Location pin: BUFGCTRL_X0Y0.I0
Clock network: clk_BUFGP/IBUFG
--------------------------------------------------------------------------------
Slack: 3.200ns (period - (min low pulse limit / (low pulse / period)))
Period: 4.000ns
Low pulse: 2.000ns
Low pulse limit: 0.400ns (Tcl)
Physical resource: r_b_17_7/CLK
Logical resource: r_b_17_4/CK
Location pin: SLICE_X129Y201.CLK
Clock network: clk_BUFGP
01-20-2013 05:26 AM
Yes, I meant the component switching limit. You can see the requirement for the DSP48 in the first one - in this configuration, the DSP48 can run at a clock period of 2.426ns, or 412.201MHz.
Avrum
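The figures in the Component Switching Limit Checks quoted earlier can be reproduced from the formulas printed next to each slack value, using only the numbers in that report:

```latex
\begin{align*}
\text{DSP48 (Tdspper\_AREG\_PREG\_MULT):}\quad
  \text{slack} &= 4.000 - 2.426 = 1.574\ \text{ns}, \qquad
  f_{\max} = \frac{1}{2.426\ \text{ns}} \approx 412.2\ \text{MHz} \\
\text{BUFG (Tbcper\_I):}\quad
  \text{slack} &= 4.000 - 1.349 = 2.651\ \text{ns} \\
\text{Slice FF low pulse (Tcl):}\quad
  \text{slack} &= 4.000 - \frac{0.400}{2.000 / 4.000} = 4.000 - 0.800 = 3.200\ \text{ns}
\end{align*}
```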
01-20-2013 10:08 PM
Thank you for your reply. But the Component Switching Limit Checks of my other project didn't show the paths from AREG, BREG or DREG to MREG (the paths that contain the 25*18 MULT cell). They only show the path from MREG to PREG and one other path.
Here's the RTL code:
module multtop(clk,
rst,
a,
b,
c
);
input signed [24:0] a;
input signed [17:0] b;
input clk,rst;
output reg signed [42:0] c;
reg signed [42:0] r_c;
reg signed [24:0] r_a;
reg signed [17:0] r_b;
always@(posedge clk)begin
if(rst)
begin
r_a<=0;
r_b<=0;
end
else
begin
r_a<=a;
r_b<=b;
end
end
always@(posedge clk)begin
if(rst)
begin
r_c<=0;
c<=0;
end
else
begin
r_c<=r_a*r_b;
c<=r_c;
end
end
endmodule
Here's the fragment of timing report:
Component Switching Limit Checks: TS_clk = PERIOD TIMEGRP "clk" 4 ns HIGH 50%;
--------------------------------------------------------------------------------
Slack: 2.651ns (period - min period limit)
Period: 4.000ns
Min period limit: 1.349ns (741.290MHz) (Tbcper_I(Fmax))
Physical resource: clk_BUFGP/BUFG/I0
Logical resource: clk_BUFGP/BUFG/I0
Location pin: BUFGCTRL_X0Y0.I0
Clock network: clk_BUFGP/IBUFG
--------------------------------------------------------------------------------
Slack: 2.652ns (period - min period limit)
Period: 4.000ns
Min period limit: 1.348ns (741.840MHz) (Tdspper_MREG_PREG)
Physical resource: Mmult_r_a[24]_r_b[17]_MuLt_6_OUT/CLK
Logical resource: Mmult_r_a[24]_r_b[17]_MuLt_6_OUT/CLK
Location pin: DSP48_X3Y83.CLK
Clock network: clk_BUFGP
--------------------------------------------------------------------------------
01-21-2013 09:43 AM
Basically, the tool does what it does. How it deals with the Component Switching Limit isn't exactly clearly described anywhere that I know of. What I suspect it does, is that for every cell it figures out what the limit is based on all internal paths (in a given configuration), and then uses that as the minimum requirement. In this case, it (correctly) identified that the longest used path in the DSP48 is the one through the multiplier, and therefore this sets the lower limit on the whole cell based on this.
I had hoped that speedprint would show you all of the Tdspper* parameters, but it doesn't appear to.
I think you pretty much have to accept that the tool does what it does. It may not be giving you explicit information through all paths of the DSP, but I am sure it is properly timing your design.
Avrum
01-22-2013 04:41 AM
Thanks again for your help. I have learned a lot about FPGAs from you. Thank you.
04-13-2013 07:16 PM
Hi avrumw.
I have run into another problem, and I have to ask for your help.
I have a large project with a case statement in my source. But I found something strange after synthesis. If I keep the default in my case statement, the Maximum Frequency is 249.308MHz. But if I remove the default, the Maximum Frequency is 324.317MHz. As far as I know, the synthesizer will add an extra latch when it comes across a case statement without a default. How can this lead to an increase in the Maximum Frequency? Thank you.
My device is a Virtex-7 XC7VX485T, and the case implementation style option in the synthesis properties is None.
The source containing the case statement is shown below. (Since my project is rather large, I am not attaching the whole project.)
module tw512_stage1_part(d_in,
p1,
d1_out,
d2_out);
parameter wd=14;
parameter fd=13;
input signed [wd-1:0] d_in;
input [3:0] p1;
output signed [wd-1:0] d1_out,d2_out;
reg signed [24:0] d1_outx,d2_outx; //25_24
wire signed [16:0] d_in_p2; //17_15
wire signed [15:0] d_in_m2; //16_15
wire signed [17:0] d_in_p3; //18_16
wire signed [16:0] d_in_m3; //17_16
wire signed [17:0] d_in_m4; //18_17
wire signed [23:0] add_1_d1; //24_23
wire signed [24:0] add_1_d2; //25_24
wire signed [21:0] add_2_d1_0; //22_21
wire signed [24:0] add_2_d1; //25_24
wire signed [20:0] add_2_d2; //21_20
wire signed [21:0] add_3_d1; //22_21
wire signed [20:0] add_3_d2_0; //21_20
wire signed [24:0] add_3_d2; //25_24
wire signed [22:0] add_4_d1; //23_22
wire signed [20:0] add_4_d2; //21_20
wire signed [23:0] add_5_d1; //24_23
wire signed [18:0] add_5_d2_0; //19_18
wire signed [24:0] add_5_d2; //25_24
wire signed [19:0] add_6_d1_0; //20_19
wire signed [24:0] add_6_d1; //25_24
wire signed [23:0] add_6_d2; //24_23
wire signed [20:0] add_7_d1_0; //21_20
wire signed [24:0] add_7_d1; //25_24
wire signed [22:0] add_7_d2_0; //23_22
wire signed [24:0] add_7_d2; //25_24
wire signed [17:0] add_8_d1_0; //18_17
wire signed [21:0] add_8_d1; //22_21
wire signed [21:0] add_8_d2; //22_21
assign d_in_p2={d_in[wd-1],d_in,{2{1'b0}}}+{{3{d_in[wd-1]}},d_in};
assign d_in_m2={d_in,{2{1'b0}}}-{{2{d_in[wd-1]}},d_in};
assign d_in_p3={d_in[wd-1],d_in,{3{1'b0}}}+{{4{d_in[wd-1]}},d_in};
assign d_in_m3={d_in,{3{1'b0}}}-{{3{d_in[wd-1]}},d_in};
assign d_in_m4={d_in,{4{1'b0}}}-{{4{d_in[wd-1]}},d_in};
assign add_2_d1_0={d_in,{8{1'b0}}}-{{5{d_in_p2[16]}},d_in_p2};
assign add_3_d2_0={{2{d_in[wd-1]}},d_in,{5{1'b0}}}+{{4{d_in_p2[16]}},d_in_p2};
assign add_5_d2_0={d_in[wd-1],d_in,{4{1'b0}}}-{{5{d_in[wd-1]}},d_in};
assign add_6_d1_0={d_in_m2,{4{1'b0}}}+{{3{d_in_p2[16]}},d_in_p2};
assign add_7_d1_0={d_in_m2,{5{1'b0}}}+{{5{d_in_m2[15]}},d_in_m2};
assign add_7_d2_0={d_in_p2,{6{1'b0}}}+{{6{d_in_p2[16]}},d_in_p2};
assign add_8_d1_0={d_in_m2,{2{1'b0}}}-{{4{d_in[wd-1]}},d_in};
assign add_1_d1={d_in,{10{1'b0}}}-{{7{d_in_p2[16]}},d_in_p2}; //csd code 1 0 0 0 0 0 0 0 -1 0 -1 0
assign add_1_d2={{3{d_in_m2[15]}},d_in_m2,{6{1'b0}}}+{{7{d_in_p3[17]}},d_in_p3};//csd code 0 0 0 1 0 -1 0 0 1 0 0 1
assign add_2_d1={add_2_d1_0,{3{1'b0}}}+{{11{d_in[wd-1]}},d_in};//csd code 1 0 0 0 0 0 -1 0 -1 0 0 1
assign add_2_d2={{2{d_in_m2[15]}},d_in_m2,{3{1'b0}}}+{{7{d_in[wd-1]}},d_in};//csd code 0 0 1 0 -1 0 0 1 0 0 0 0
assign add_3_d1={{5{d_in_p2[16]}},d_in_p2}+{d_in_m4,{4{1'b0}}}; //csd code 1 0 0 0 -1 0 1 0 1 0 0 0
assign add_3_d2={add_3_d2_0,{4{1'b0}}}+{{9{d_in_m2[15]}},d_in_m2};//csd code 0 0 1 0 0 1 0 1 0 1 0 -1
assign add_4_d1={d_in_m4,{5{1'b0}}}-{{6{d_in_m3[16]}},d_in_m3};//csd code 1 0 0 0 -1 0 -1 0 0 1 0 0
assign add_4_d2={d_in_m2[15],d_in_m2,{4{1'b0}}}+{{7{d_in[wd-1]}},d_in};//csd code 0 1 0 -1 0 0 0 1 0 0 0 0
assign add_5_d1={d_in_m3,{7{1'b0}}}+{{7{d_in_m3[16]}},d_in_m3};//csd code 1 0 0 -1 0 0 0 1 0 0 -1 0
assign add_5_d2={add_5_d2_0,{6{1'b0}}}+{{8{d_in_p2[16]}},d_in_p2};//csd code 0 1 0 0 0 -1 0 0 0 1 0 1
assign add_6_d1={add_6_d1_0,{5{1'b0}}}+{{8{d_in_m3[16]}},d_in_m3};//csd code 1 0 -1 0 1 0 1 0 1 0 0 -1
assign add_6_d2={d_in_p3,{6{1'b0}}}-{{7{d_in_m3[16]}},d_in_m3};//csd code 0 1 0 0 1 0 0 -1 0 0 1 0
assign add_7_d1={add_7_d1_0,{4{1'b0}}}-{{11{d_in[wd-1]}},d_in};//csd code 1 0 -1 0 0 1 0 -1 0 0 0 -1
assign add_7_d2={add_7_d2_0,{2{1'b0}}}-{{11{d_in[wd-1]}},d_in};//csd code 0 1 0 1 0 0 0 1 0 1 0 -1
assign add_8_d1={add_8_d1_0,{4{1'b0}}}+{{5{d_in_p2[16]}},d_in_p2};//csd code 1 0 -1 0 -1 0 1 0 1 0 0 0
assign add_8_d2={add_8_d1_0,{4{1'b0}}}+{{5{d_in_p2[16]}},d_in_p2}; //csd code 1 0 -1 0 -1 0 1 0 1 0 0 0
always@*begin
case(p1)
4'b0000:
begin
d1_outx={d_in,{11{1'b0}}};
d2_outx=0;
end
4'b0001:
begin
d1_outx={add_1_d1,1'b0};
d2_outx=~add_1_d2+1;
end
4'b0010:
begin
d1_outx=add_2_d1;
d2_outx=~{add_2_d2,{4{1'b0}}}+1;
end
4'b0011:
begin
d1_outx={add_3_d1,{3{1'b0}}};
d2_outx=~add_3_d2+1;
end
4'b0100:
begin
d1_outx={add_4_d1,{2{1'b0}}};
d2_outx=~{add_4_d2,{4{1'b0}}}+1;
end
4'b0101:
begin
d1_outx={add_5_d1,1'b0};
d2_outx=~add_5_d2+1;
end
4'b0110:
begin
d1_outx=add_6_d1;
d2_outx=~{add_6_d2,1'b0}+1;
end
4'b0111:
begin
d1_outx=add_7_d1;
d2_outx=~add_7_d2+1;
end
4'b1000:
begin
d1_outx={add_8_d1,{3{1'b0}}};
d2_outx=~{add_8_d2,{3{1'b0}}}+1;
end
default:
begin
d1_outx='bx;
d2_outx='bx;
end
endcase
end
rns1 #(.W_x(25),.F_x(24),.W_y(wd),.F_y(fd)) rns1_1(.x_in(d1_outx),.y_out(d1_out));
rns1 #(.W_x(25),.F_x(24),.W_y(wd),.F_y(fd)) rns1_2(.x_in(d2_outx),.y_out(d2_out));
endmodule
And here's the fragment of the synthesis report with the default kept in the case statement:
Timing Summary:
---------------
Speed Grade: -3
Minimum period: 4.011ns (Maximum Frequency: 249.308MHz)
Minimum input arrival time before clock: 2.884ns
Maximum output required time after clock: 0.511ns
Maximum combinational path delay: No path found
Timing Details:
---------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 4.011ns (frequency: 249.308MHz)
Total number of paths / destination ports: 32098287 / 40874
-------------------------------------------------------------------------
Delay: 4.011ns (Levels of Logic = 20)
Source: path0/bf_ctrl/Mmult_n0083 (DSP)
Destination: path0/stage1_tw512/r_sg1_2_13_1 (FF)
Source Clock: clk rising
Destination Clock: clk rising
Data Path: path0/bf_ctrl/Mmult_n0083 to path0/stage1_tw512/r_sg1_2_13_1
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
DSP48E1:CLK->P0 4 0.324 0.293 path0/bf_ctrl/Mmult_n0083 (path0/p1_512<0>)
BUF:I->O 22 0.053 0.375 path0/bf_ctrl/Mmult_n0083_98 (path0/bf_ctrl/Mmult_n0083_98)
BUF:I->O 21 0.053 0.519 path0/bf_ctrl/Mmult_n0083_2 (path0/bf_ctrl/Mmult_n0083_2)
LUT6:I3->O 5 0.043 0.561 path0/stage1_tw512/tw512_ctrl/p12<4>1 (path0/stage1_tw512/p1<1>)
LUT6:I1->O 1 0.043 0.428 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/Mmux_d2_outx61 (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/Mmux_d2_outx6)
LUT5:I2->O 1 0.043 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/Mmux_d2_outx63 (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/d2_outx<11>)
MUXCY:S->O 1 0.230 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<11> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<11>)
MUXCY:CI->O 1 0.013 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<12> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<12>)
MUXCY:CI->O 1 0.013 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<13> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<13>)
MUXCY:CI->O 1 0.013 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<14> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<14>)
MUXCY:CI->O 1 0.012 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<15> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<15>)
MUXCY:CI->O 1 0.012 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<16> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<16>)
MUXCY:CI->O 1 0.012 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<17> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<17>)
MUXCY:CI->O 1 0.012 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<18> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<18>)
MUXCY:CI->O 1 0.012 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<19> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<19>)
MUXCY:CI->O 1 0.012 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<20> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<20>)
MUXCY:CI->O 1 0.012 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<21> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<21>)
MUXCY:CI->O 1 0.012 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<22> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<22>)
MUXCY:CI->O 0 0.012 0.000 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<23> (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_cy<23>)
XORCY:CI->O 4 0.251 0.293 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_xor<24> (path0/stage1_tw512/sg1_2<13>)
BUF:I->O 5 0.053 0.298 path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_xor<24>_1 (path0/stage1_tw512/tw512_stage1/tw512_stage1_part1/rns1_2/Madd_temp_round_xor<24>_1)
FDR:D -0.001 path0/stage1_tw512/r_sg1_2_13_9
----------------------------------------
Total 4.011ns (1.243ns logic, 2.768ns route)
(31.0% logic, 69.0% route)