10-20-2018 01:51 PM
Hi,
Setup:
Vivado 2018.2
Ultrascale+ Architecture.
I have always had trouble understanding the right way to use BUFG primitives to clock gate part of a design, so I would really appreciate expert advice here.
In this design I am just trying to gate a clock with a BUFGCE through its CE input, which is driven by a FF.
Both the FF and the BUFGCE are "clocked" from the same source. The routing delay between them is very small: 0.248ns.
However, the tool claims there is a much bigger clock delay from the same CLOCK ROOT (X2Y2) to the FF (3.237ns) than to the BUFGCE (0.426ns), hence the timing violation.
Here is the path:
Max Delay Paths
--------------------------------------------------------------------------------------
Slack (VIOLATED) :        -1.570ns  (required time - arrival time)
  Source:                 CDC_STEP[2].u_cdc_step/enable_r_reg/C
                            (rising edge-triggered cell FDRE clocked by clk_3 {rise@0.000ns fall@0.833ns period=1.667ns})
  Destination:            CDC_STEP[2].u_cdc_step/BUFGCE_inst/CE
                            (rising edge-triggered cell BUFGCE clocked by clk_3 {rise@0.000ns fall@0.833ns period=1.667ns})
  Path Group:             clk_3
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            1.667ns  (clk_3 rise@1.667ns - clk_3 rise@0.000ns)
  Data Path Delay:        0.327ns  (logic 0.079ns (24.159%)  route 0.248ns (75.841%))
  Logic Levels:           0
  Clock Path Skew:        -2.794ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    0.426ns = ( 2.093 - 1.667 )
    Source Clock Delay (SCD):         3.237ns
    Clock Pessimism Removal (CPR):    0.017ns
  Clock Uncertainty:      0.057ns  ((TSJ^2 + DJ^2)^1/2) / 2 + PE
    Total System Jitter (TSJ):        0.071ns
    Discrete Jitter (DJ):             0.088ns
    Phase Error (PE):                 0.000ns
  Clock Net Delay (Source):      3.237ns (routing 1.675ns, distribution 1.562ns)
  Clock Net Delay (Destination): 0.426ns (routing 1.522ns, distribution -1.096ns)

    Location            Delay type                        Incr(ns)  Path(ns)    Partition  Netlist Resource(s)
  -------------------------------------------------------------------    ----------------------------------
                        (clock clk_3 rise edge)              0.000     0.000 r
    BUFGCTRL_X1Y42      BUFGCTRL                             0.000     0.000 r  static     BUFGCE[3].BUFGMUX_sel_1/O
    X2Y2 (CLOCK_ROOT)   net (fo=144244, routed)              3.237     3.237    static     CDC_STEP[2].u_cdc_step/clk_3
    SLICE_X112Y330      FDRE                                                 r  static     CDC_STEP[2].u_cdc_step/enable_r_reg/C
  -------------------------------------------------------------------    ----------------------------------
    SLICE_X112Y330      FDRE (Prop_FFF_SLICEL_C_Q)           0.079     3.316 r  static     CDC_STEP[2].u_cdc_step/enable_r_reg/Q
                        net (fo=1, routed)                   0.248     3.564    static     CDC_STEP[2].u_cdc_step/enable_r
    BUFGCE_X1Y132       BUFGCE                                               r  static     CDC_STEP[2].u_cdc_step/BUFGCE_inst/CE
  -------------------------------------------------------------------    ----------------------------------
                        (clock clk_3 rise edge)              1.667     1.667 r
    BUFGCTRL_X1Y42      BUFGCTRL                             0.000     1.667 r  static     BUFGCE[3].BUFGMUX_sel_1/O
    X2Y2 (CLOCK_ROOT)   net (fo=144244, routed)              0.426     2.093    static     CDC_STEP[2].u_cdc_step/clk_3
    BUFGCE_X1Y132       BUFGCE                                               r  static     CDC_STEP[2].u_cdc_step/BUFGCE_inst/I
                        clock pessimism                      0.017     2.110
                        clock uncertainty                   -0.057     2.053
    BUFGCE_X1Y132       BUFGCE (Setup_BUFCE_BUFGCE_I_CE)    -0.059     1.994    static     CDC_STEP[2].u_cdc_step/BUFGCE_inst
  -------------------------------------------------------------------
                        required time                                  1.994
                        arrival time                                  -3.564
  -------------------------------------------------------------------
                        slack                                         -1.570
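Working through the numbers in the report, the data path itself does not look like the problem. A minimal Python sketch of the arithmetic, with values in ps copied from the report (the variable names are only for illustration, they are not Vivado names):

```python
# All values in picoseconds, copied from the timing report above.
period_ps = 1667        # Requirement: clk_3 rise to next clk_3 rise
scd_ps = 3237           # Source Clock Delay: clock root to enable_r_reg/C
dcd_ps = 426            # Destination Clock Delay: clock root to BUFGCE_inst/I
cpr_ps = 17             # Clock Pessimism Removal
data_path_ps = 327      # FF clock-to-Q (79) plus routing to CE (248)
uncertainty_ps = 57     # Clock Uncertainty
setup_ps = 59           # Setup_BUFCE_BUFGCE_I_CE

# Clock Path Skew = DCD - SCD + CPR, as stated in the report header.
skew_ps = dcd_ps - scd_ps + cpr_ps

arrival_ps = scd_ps + data_path_ps
required_ps = period_ps + dcd_ps + cpr_ps - uncertainty_ps - setup_ps
slack_ps = required_ps - arrival_ps

# Slack the same path would have if the two clock pins saw zero skew.
slack_without_skew_ps = period_ps - data_path_ps - uncertainty_ps - setup_ps

print(skew_ps, slack_ps, slack_without_skew_ps)  # -2794 -1570 1224
```

So the path would pass with about 1.2ns to spare if the two clock pins were aligned; the whole violation comes from the -2.794ns clock path skew.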
Could someone please explain what is going on?
Cheers,
Arsen.
10-20-2018 06:54 PM - edited 01-09-2019 11:41 AM
Take a look at this post on using the BUFGCE. It was written for the 7 series, so the BUFHCE part doesn't apply, but the mechanism of using the BUFGCE in UltraScale is the same: the "base" clock (which clocks the flip-flops that generate the CE) needs to be on one BUFG/BUFGCE and the divided clock needs to be on another. In your architecture, you have your BUFGCE in series (not in parallel) with the BUFGCTRL, which is what is causing the extra skew.
In UltraScale/UltraScale+ you also need to ensure that the output nets of the two BUFGCEs are placed in the same CLOCK_DELAY_GROUP so that their CLOCK_ROOTs are co-located. Take a look at this post on the CLOCK_DELAY_GROUP.
[Edit:
So referring to the image of the BUFGCE from the first referenced post, you would want to add
set_property CLOCK_DELAY_GROUP <unique_name> [get_nets {fastClk slowClk}]
where <unique_name> is any string that is not the name of another clock delay group
]
Also, there is something odd with your constraints - this timing path starts at the output of a BUFGCTRL - it looks like there is a primary clock defined there. This is "bad form" (or incorrect) - primary clocks (create_clock) should only be defined on input clock ports. The only exceptions are
Avrum
10-21-2018 11:04 PM
Hi,
I would suggest going through UG949 to understand the correct methodology for clock gating.
Thanks,
Yash
10-22-2018 02:38 PM
Thanks @avrumw for your response and detailed explanation.
I was hoping you would reply to my issue.
Dear @yashp, thanks a lot for the link. I had definitely missed this doc. I went through it quickly, but will read it in much more detail.
Avrum,
Basically I went through each of the steps you have suggested and here is what I have now:
Max Delay Paths
--------------------------------------------------------------------------------------
Slack (VIOLATED) :        -2.603ns  (required time - arrival time)
  Source:                 CDC_STEP[3].u_cdc_step/clk_on_reg/C
                            (rising edge-triggered cell FDCE clocked by clk_4 {rise@0.000ns fall@0.833ns period=1.666ns})
  Destination:            BUFGCE_GATED[4].u_BUFGCE/CE
                            (rising edge-triggered cell BUFGCE clocked by clk_out3_clk_wiz_0 {rise@0.000ns fall@0.833ns period=1.666ns})
  Path Group:             clk_out3_clk_wiz_0
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            1.666ns  (clk_out3_clk_wiz_0 rise@1.666ns - clk_4 rise@0.000ns)
  Data Path Delay:        0.327ns  (logic 0.079ns (24.159%)  route 0.248ns (75.841%))
  Logic Levels:           0
  Clock Path Skew:        -3.833ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    0.711ns = ( 2.377 - 1.666 )
    Source Clock Delay (SCD):         4.331ns
    Clock Pessimism Removal (CPR):    -0.213ns
  Clock Uncertainty:      0.050ns  ((TSJ^2 + DJ^2)^1/2) / 2 + PE
    Total System Jitter (TSJ):        0.071ns
    Discrete Jitter (DJ):             0.071ns
    Phase Error (PE):                 0.000ns
  Clock Net Delay (Source):      2.666ns (routing 0.874ns, distribution 1.792ns)
  Clock Net Delay (Destination): 0.425ns (routing 0.004ns, distribution 0.421ns)

    Location               Delay type                                            Incr(ns)  Path(ns)    Partition  Netlist Resource(s)
  -------------------------------------------------------------------    ----------------------------------
                           (clock clk_4 rise edge)                                  0.000     0.000 r
    F32                                                                             0.000     0.000 r  static     clk_in1_p (IN)
                           net (fo=0)                                               0.000     0.000    static     u_pll/inst/clkin1_ibufds/I
    HPIOBDIFFINBUF_X0Y204  DIFFINBUF (Prop_DIFFINBUF_HPIOBDIFFINBUF_DIFF_IN_P_O)     0.569     0.569 r  static     u_pll/inst/clkin1_ibufds/DIFFINBUF_INST/O
                           net (fo=1, routed)                                       0.050     0.619    static     u_pll/inst/clkin1_ibufds/OUT
    F32                    IBUFCTRL (Prop_IBUFCTRL_HPIOB_M_I_O)                     0.000     0.619 r  static     u_pll/inst/clkin1_ibufds/IBUFCTRL_INST/O
                           net (fo=1, routed)                                       0.384     1.003    static     u_pll/inst/clk_in1_clk_wiz_0
    MMCM_X0Y8              MMCME4_ADV (Prop_MMCM_CLKIN1_CLKOUT2)                   -2.416    -1.413 r  static     u_pll/inst/mmcme4_adv_inst/CLKOUT2
                           net (fo=1, routed)                                       0.248    -1.165    static     u_pll/inst/clk_out3_clk_wiz_0
    BUFGCE_X0Y204          BUFGCE (Prop_BUFCE_BUFGCE_I_O)                           0.028    -1.137 r  static     u_pll/inst/clkout3_buf/O
                           net (fo=8, routed)                                       1.106    -0.031    static     clk_600mhz
    BUFGCTRL_X0Y76         BUFGCTRL (Prop_BUFGCTRL_I1_O)                            0.093     0.062 r  static     BUFG_MUX[4].BUFGMUX_sel_1_0/O
                           net (fo=2, routed)                                       1.575     1.637    static     clk_l[4]
    BUFGCE_X0Y134          BUFGCE (Prop_BUFCE_BUFGCE_I_O)                           0.028     1.665 r  static     BUFGCE[4].u_BUFGCE/O
    X1Y4 (CLOCK_ROOT)      net (fo=2316, routed)                                    2.666     4.331    static     CDC_STEP[3].u_cdc_step/out[0]
    SLICE_X55Y571          FDCE                                                                     r  static     CDC_STEP[3].u_cdc_step/clk_on_reg/C
  -------------------------------------------------------------------    ----------------------------------
    SLICE_X55Y571          FDCE (Prop_DFF_SLICEL_C_Q)                               0.079     4.410 r  static     CDC_STEP[3].u_cdc_step/clk_on_reg/Q
                           net (fo=1, routed)                                       0.248     4.658    static     clk_on[4]
    BUFGCE_X0Y236          BUFGCE                                                                   r  static     BUFGCE_GATED[4].u_BUFGCE/CE
  -------------------------------------------------------------------    ----------------------------------
                           (clock clk_out3_clk_wiz_0 rise edge)                     1.666     1.666 r
    F32                                                                             0.000     1.666 r  static     clk_in1_p (IN)
                           net (fo=0)                                               0.000     1.666    static     u_pll/inst/clkin1_ibufds/I
    HPIOBDIFFINBUF_X0Y204  DIFFINBUF (Prop_DIFFINBUF_HPIOBDIFFINBUF_DIFF_IN_P_O)     0.472     2.138 r  static     u_pll/inst/clkin1_ibufds/DIFFINBUF_INST/O
                           net (fo=1, routed)                                       0.040     2.178    static     u_pll/inst/clkin1_ibufds/OUT
    F32                    IBUFCTRL (Prop_IBUFCTRL_HPIOB_M_I_O)                     0.000     2.178 r  static     u_pll/inst/clkin1_ibufds/IBUFCTRL_INST/O
                           net (fo=1, routed)                                       0.333     2.511    static     u_pll/inst/clk_in1_clk_wiz_0
    MMCM_X0Y8              MMCME4_ADV (Prop_MMCM_CLKIN1_CLKOUT2)                   -1.888     0.623 r  static     u_pll/inst/mmcme4_adv_inst/CLKOUT2
                           net (fo=1, routed)                                       0.217     0.840    static     u_pll/inst/clk_out3_clk_wiz_0
    BUFGCE_X0Y204          BUFGCE (Prop_BUFCE_BUFGCE_I_O)                           0.024     0.864 r  static     u_pll/inst/clkout3_buf/O
                           net (fo=8, routed)                                       1.008     1.872    static     clk_600mhz
    BUFGCTRL_X0Y76         BUFGCTRL (Prop_BUFGCTRL_I1_O)                            0.080     1.952 r  static     BUFG_MUX[4].BUFGMUX_sel_1_0/O
    X2Y9 (CLOCK_ROOT)      net (fo=2, routed)                                       0.425     2.377    static     clk_l[4]
    BUFGCE_X0Y236          BUFGCE                                                                   r  static     BUFGCE_GATED[4].u_BUFGCE/I
                           clock pessimism                                         -0.213     2.164
                           clock uncertainty                                       -0.050     2.114
    BUFGCE_X0Y236          BUFGCE (Setup_BUFCE_BUFGCE_I_CE)                        -0.059     2.055    static     BUFGCE_GATED[4].u_BUFGCE
  -------------------------------------------------------------------
                           required time                                                      2.055
                           arrival time                                                      -4.658
  -------------------------------------------------------------------
                           slack                                                             -2.603
1) First of all, I was indeed creating a clock on the output of the BUFGCTRL, so that is now removed.
2) I have parallelised the BUFGs. Basically, in my design the MMCM generates 3 different clocks. Two cascaded BUFGCTRLs select one of them, and the selected clock drives two independent BUFGCEs. One of them has CE tied constantly to 1 (BUFGCE_X0Y134) and the other one (BUFGCE_X0Y236) is controlled by a FF, which is clocked by BUFGCE_X0Y134.
3) I have set up the following constraint for CLOCK ROOT balancing:
set_property CLOCK_DELAY_GROUP clk_and_clk_gated_4 [get_nets {clk[4] clk_gated[4]}]
clk[4] is the net driven by BUFGCE_X0Y134, whereas clk_gated[4] is the net driven by BUFGCE_X0Y236.
I don't know whether this makes any difference, but just to let you know: clk[4] drives logic in both the Partial Reconfiguration region and the Static region, whereas clk_gated[4] drives logic located only in the Static region.
Honestly speaking, simple clock gating should not be this complicated. I must still be doing something fundamentally wrong, so I would appreciate your further comments.
Cheers,
Arsen.
10-23-2018 09:19 AM
First, you still have too many clock buffers in this design. Normally the path would go straight from the MMCM to the two parallel BUFGCEs. You say you are also doing clock MUXing - this will add one more BUFGCTRL, but you have a total of 3 in series - that is one more than you need.
But the real problem is the route from your BUFGCTRL to the two BUFGCEs - the routes are VERY unbalanced; one is 1.575ns and the other is 0.425ns. These are not timed at the same corner - [SLOW_MAX] vs. [SLOW_MIN] - but the difference between them is much larger than what one would expect from the different timing corners (they should be closer to 10% different). I don't know why this is - it is the same net going to two different BUFGCEs - maybe the BUFGCEs are too far apart? If so, you may want to consider placing LOC constraints on the BUFGCTRL and the 2 BUFGCEs.
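To put rough numbers on that imbalance, here is a small Python sketch using only the two net delays from the report (values in ps; the variable names are illustrative):

```python
# Net delay from BUFGCTRL_X0Y76/O to each BUFGCE input, from the report (ps).
route_to_ungated_ps = 1575  # to BUFGCE_X0Y134, on the source clock path
route_to_gated_ps = 425     # to BUFGCE_X0Y236, on the destination clock path
period_ps = 1666

imbalance_ps = route_to_ungated_ps - route_to_gated_ps

# Imbalance as a whole-number percentage of the longer route and of the period.
pct_of_longer_route = imbalance_ps * 100 // route_to_ungated_ps
pct_of_period = imbalance_ps * 100 // period_ps

print(imbalance_ps, pct_of_longer_route, pct_of_period)  # 1150 73 69
```

A 1.150ns gap is about 73% of the longer route, far beyond the roughly 10% one would attribute to the corner difference, and it eats about 69% of the clock period on its own.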
But, ultimately, you are trying to do clock gating at 600MHz - that simply may not be possible. The clock tree of your ungated clock (BUFGCE_X0Y134) takes 2.666ns to get to the gating flip-flop - this is significantly larger than your period of 1.666ns. Generally clock gating is used to generate really slow clocks from medium speed clocks - it may be impossible to use a clock this fast as the source for clock gating.
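As a rough illustration of that point, the sketch below recomputes the slack from the report figures, and then again with the 1.575ns route hypothetically shortened to match the 0.425ns one. Everything other than that route is held fixed, which is an assumption (pessimism removal and the delay corners would change in a real run), and the names are illustrative only.

```python
# Values in ps from the second report. The source path is timed at max delay
# and the destination path at min delay, so this is only a rough what-if.
period_ps = 1666
bufgctrl_out_ps = 62     # source path arrival at BUFGCTRL_X0Y76/O
bufgce_ps = 28           # BUFGCE_X0Y134 I-to-O delay
clock_tree_ps = 2666     # BUFGCE_X0Y134/O to clk_on_reg/C
dcd_ps = 711             # Destination Clock Delay to BUFGCE_X0Y236/I
cpr_ps = -213            # Clock Pessimism Removal
data_path_ps = 327
uncertainty_ps = 50
setup_ps = 59            # Setup_BUFCE_BUFGCE_I_CE

def slack_ps(route_to_ungated_ps):
    """Setup slack for a given BUFGCTRL-to-BUFGCE_X0Y134 route delay."""
    scd = bufgctrl_out_ps + route_to_ungated_ps + bufgce_ps + clock_tree_ps
    arrival = scd + data_path_ps
    required = period_ps + dcd_ps + cpr_ps - uncertainty_ps - setup_ps
    return required - arrival

reported = slack_ps(1575)  # route as reported
balanced = slack_ps(425)   # ASSUMPTION: route matched to the 0.425ns one
print(reported, balanced)  # -2603 -1453
```

Even in that what-if the path still misses by roughly 1.45ns, because the 2.666ns clock tree alone is longer than the 1.666ns period.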
Why don't you just generate the slower clock in the MMCM? If you need it to be MUXed, then generate both sets of clocks you need and MUX them both...
Avrum