Adventurer
Registered: 11-22-2016

Advice needed on timing closure problem

I have the following path failing timing, and I would really appreciate some advice in understanding what I'm seeing here and how to fix it.

 

First some observations:

  1. The clock speed is 500 MHz and my device is speed grade -2.  I know that I'm pushing my luck at this speed, but I usually meet timing with this design.
  2. The failing path is a hard-wired internal path between two DSP48E1 units, specifically the MULTSIGN and PC paths; I'm astonished to see a failure here!
  3. Placement has decided on this occasion to put the two DSP units in separate clock regions (X0Y3 and X0Y4), to my misfortune.

My build options are default except for these settings:

set_property strategy Flow_PerfOptimized_high [get_runs synth_1]
set_property STEPS.SYNTH_DESIGN.ARGS.ASSERT true [get_runs synth_1]
set_property STEPS.PHYS_OPT_DESIGN.IS_ENABLED true [get_runs impl_1]

 

path.png

 

I am quite unfamiliar with how to read the clock path here, and I have to admit to being completely baffled by the fact that although the source and destination paths are identical up to BUFGCTRL_X0Y16 there is already a huge skew (2.541 - 2 - 0.930 = -0.389 if I'm reading this right) -- does this make sense?  I think I understand that there is a -0.306 (1.162-1.468?) routing skew ... but very little else makes much sense to me.  Clearly the signal routing delay itself is tiny (0.383), so my problem is entirely clock skew across clock tiles.

 

So two questions:

 

  1. How can I usefully understand the path report I'm seeing here?  Which documentation should I be reading?
  2. How can I get more information about *why* this problem has arisen and how to resolve it?
17 Replies
Instructor
Registered: 08-14-2007

Re: Advice needed on timing closure problem

Since this appears to be a placement issue, it would be worth finding out why the tools have placed the offending DSP48s in different clock regions.  You said that the design usually meets timing.  Have you added more DSP48s since the last working placement?  If not, is it possible to recreate the last working placement, find out where these DSP48s were located, and then lock them to those locations?  The problem with that approach is that if the design has changed significantly, especially in DSP48 usage, forcing the placement of these two DSP48s may just cause an issue to pop up elsewhere.

-- Gabor
Adventurer

Re: Advice needed on timing closure problem

The only change since the last synthesis was to widen an address bus from 9 to 10 bits in an unrelated component some distance away!  However, timing on that occasion was tight (+5ps slack, not a good sign).

 

At the moment my best guess is that my register control interface doesn't have enough relaxation in it -- large parts of my design do seem to be pulled towards the centre of the die, causing unnecessary crowding, and these tight timings are definitely down to crowding.

 

Unfortunately when faced with a global problem like this, the timing reports seem to be almost useless, and it's incredibly difficult to figure out which forces are actually causing the trouble.  At the moment *all* my tight paths are logic free routings which cross clock tiles!  Very annoying...

 

One idea I might try is to pblock the offending sub-system some distance away from the centre and see exactly which paths break.  I'm certainly not ready to tie parts down for production, thank you very much!  Think I'm going to sleep on this problem for now.

Scholar markcurry
Registered: 09-16-2009

Re: Advice needed on timing closure problem

 

The DSP48's are already effectively RLOC'd since you're using the dedicated PCIN->PCOUT carry chains between DSP48s.  You're guaranteed that the two DSP48's will be right next to each other.

 

So the only variable is clock domain differences, as you've found out.

 

I can't offer much advice on solving clock skew problems, however. Other than driving the two DSP48's with the same BUFG, which I believe you're already doing.

 

Good Luck

 

--Mark

 

 

 

Historian
Registered: 01-23-2009

Re: Advice needed on timing closure problem

So, first...

 

there is already a huge skew (2.541 - 2 - 0.930 = -0.389 if I'm reading this right)

 

You are reading it right, but you also aren't... What you are implying (correctly) is that this makes no sense - the output of the BUFGCTRL is the same path in the source and destination, so how can it have skew? The answer is, it doesn't.

 

Vivado does attribute different delays to these components in the Source Clock Delay (at [SLOW_MAX]) and the Destination Clock Delay (at [SLOW_MIN]), and therefore does end up with this 0.389ns of skew. But this is overly pessimistic (and the tools know it), so this skew (and a bit more) is added back into the budget by the "clock pessimism" line, which adds back 0.401ns (the 0.389ns plus a small portion of the routing delay of 0.383ns). The resulting actual clock skew is shown in the report header as "Clock Path Skew: -0.293" - you can hover over this to see the calculation it used.

 

But as for the rest, your analysis is correct. There is essentially a jump discontinuity in clock skew as you pass between clock regions, so the skew between the last DSP48 in one clock region and the first DSP48 in the next will be larger than all the other adjacent pairs. At 500MHz, this can (as you see) make a big difference.

 

The next question is "how many DSP48s do you need to chain together" - is it possible to keep all your longer chains within one clock region? (You don't tell us what architecture you are using - I will presume it is a 7 series device...). There are 20 DSP48 cells in a single clock region, so you should be able to create chains of 20 cells without incurring this penalty. In fact, you will find that the timing is slightly better in the upper 10 of the 20 than the lower 10 (since the PCOUT->PCIN propagation will be going in the same direction as the clock skew, rather than the opposite direction - the clocks fan out from the middle of the clock region...)

 

So, in the worst case, you can either put the related DSP48s into their own PBLOCK and then assign the PBLOCK area to a clock region, or you can actually LOC the DSP48s to a given location. If you have some designs that pass timing, you can consider extracting the locations of the DSP48s in that placement, and using that placement (as LOC constraints) for your subsequent implementations (this is a pretty common approach when timing is critical around BRAMs or DSP48s).
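As a sketch of the pblock approach (the instance pattern and clock region here are hypothetical - adjust them to your design and device):

```tcl
# Constrain a hypothetical DSP chain instance to a single clock region.
create_pblock pb_dsp_chain
add_cells_to_pblock [get_pblocks pb_dsp_chain] \
    [get_cells -hier -filter {NAME =~ *dsp_chain*}]
resize_pblock [get_pblocks pb_dsp_chain] -add {CLOCKREGION_X0Y3:CLOCKREGION_X0Y3}
```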

 

Avrum

Adventurer

Re: Advice needed on timing closure problem

@avrumw, thank you for a very helpful and illuminating answer.  I have to confess that I'm still quite baffled when trying to read the timing report.  I'm guessing my best way to get any further details is to look at UG472?  Am I correct in understanding the edge-to-edge propagation time in my case to be 0.383+1.276 (path delay plus logic and setup time)?  This now seems to add up:

 

Logic+routing delay = 1.659

Clock skew = 0.293

Clock uncertainty = 0.053

Total edge to edge delay = 2.005 (5ps too late)

 

So I presume that 293ps is the clock skew between my two regions, and the 53ps uncertainty is jitter.  Do I have any influence whatsoever on these numbers, or are they intrinsic to my device?  Indeed, it is a Virtex-7, a 690 in my case, so I am swimming in DSP slices!

 

In this particular corner of my design my DSPs are in pairs, so I'm dismayed the placing has split this.  In another part of the design I have four separate chains which are 8 units long, so I will keep an eye on them.  Your notes on how the clocks behave through the regions and between regions are eye opening for me.

 

I'm not ready to start pinning parts down, except for debugging, but I'll resort to that when I need to...

 

Can you recommend any reading other than UG472, and is there anything I can do in my clocking to improve skew and jitter?

Historian

Re: Advice needed on timing closure problem

  I'm guessing my best way to get any further details is to look at UG472?

 

UG472 gives details on the MMCM and PLL, but not really on timing analysis (you should definitely read it, though).

 

More detail on timing constraints are available in UG903 and timing analysis in UG906.

 

So I presume that 293ps is the clock skew between my two regions, and the 53ps uncertainty is jitter. Do I have any influence whatsoever on these numbers, or are they intrinsic to my device?  Indeed, it is a Virtex-7, a 690 in my case, so I am swimming in DSP slices!

 

The 293ps is the skew between the two DSP48 cells - the skew between any two cells will be different, but you incur more skew when the source and destination are not in the same clock region. This skew is based purely on the location of the cells - other than moving the cells there is nothing you can do to change it.

 

The 53ps is the jitter. You have some control over this. First of all, you need to make sure that you have constrained the jitter of your input clock correctly. When you do a create_clock on a primary pin, you should also specify the correct input jitter on the clock with the set_input_jitter on that clock. The jitter required here is peak-to-peak (or cycle-cycle jitter); most crystals specify RMS jitter - you need to extract cycle-cycle jitter from RMS jitter based on your desired bit error rate (BER) - you can take a look at this paper from Maxim.

 

However, the input jitter is only a (small) component of the total jitter calculation; the MMCM/PLL significantly attenuates input jitter. But the choice of MMCM vs. PLL and the choice of the BANDWIDTH parameter as well as the settings of the M and D have a significant effect on the output jitter. In general PLLs are better than MMCMs, so that is an easy way to reduce jitter if you don't need the features of the MMCM (fractional multiply/divide, dynamic phase shift, fine phase shift, the extra outputs...). You are already using a PLL, so that doesn't apply. For the others, use the clocking wizard to see the effect of setting different parameters for the PLL (particularly the BANDWIDTH) - the wizard automatically chooses the M and D that minimize jitter.

 

Finally, I see that you have an IDELAY before the PLL. This is a bit unusual; the phase shifting of the MMCM is usually preferable over the IDELAY, and even the PLL with its 1/8 VCO phase shifting usually provides enough granularity for phase shifting. I don't think removing the IDELAY will reduce your jitter, though (but it is unusual).

 

Avrum

Adventurer

Re: Advice needed on timing closure problem

Many thanks again @avrumw.

 

I guess the IDELAY (rather than PLL control) shows my lack of FPGA training!  You've given me plenty to digest.

Contributor
Registered: 11-21-2016

Re: Advice needed on timing closure problem

Hi @araneidae

You seem to have a setup violation. In the summary, it says that most of the path delay comes from logic cells, i.e. logic delay. Routing seems to be fine.

 

Since the setup slack is only slightly negative, these steps may come in handy.

 

In the synthesis settings:

1. Set -flatten_hierarchy to none

2. Enable -keep_equivalent_registers

 

https://www.xilinx.com/support/answers/52257.html

 

Try not to force the implementation tool too hard. You might want to keep the implementation optimization settings minimal and at their defaults. Use power_opt, route_opt and phys_opt only when paths and timing are critical.

Keep opt_design enabled, and run synthesis and implementation again.
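For reference, settings along these lines can also be applied from project Tcl (a sketch assuming the default run names synth_1/impl_1):

```tcl
set_property STEPS.SYNTH_DESIGN.ARGS.FLATTEN_HIERARCHY none [get_runs synth_1]
set_property STEPS.SYNTH_DESIGN.ARGS.KEEP_EQUIVALENT_REGISTERS true [get_runs synth_1]
set_property STEPS.OPT_DESIGN.IS_ENABLED true [get_runs impl_1]
```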

 

Let me know if this solution helps.

 

Thanks.

 

Regards,

Suraj Kothari

Adventurer

Re: Advice needed on timing closure problem

So.  My design seems to have moved into a very difficult space, and I think I have two questions:

 

  1. Is it possible for me to write my design in such a way that place_design will reliably produce a routeable placement?
  2. Is there any way to get any detailed insight into the forces which are causing placement to be unrouteable?

I am now in a position where any change causes unpredictable timing errors ... but always across clock region boundaries!

 

My big problem is that it would seem that placement is doing the hard work, but in the end the only detailed information I get is the final placement and the failing paths.  None of the failing paths (typically register to RAMB, or in one egregious case, DSP to DSP) would be challenging if they didn't cross a clock region.  Unfortunately I cannot see any way to get a global view of the forces guiding placement ... and I'm starting to wonder if the tool is actually able to solve the problem.

 

So how on earth do I figure out what is wrong with my design, and is there any way to get reliably routeable placement of a moderately sized design running at 500 MHz, or is pblock placement the only way forward?  At the moment I make tiny changes in one part of the design, and the next timing failure is somewhere completely unrelated.  Predictably enough, the DSP->DSP failure is long behind me, but the underlying problem remains.

 

Any suggestions?

Historian

Re: Advice needed on timing closure problem

Achieving 500MHz is ALWAYS going to be challenging.

 

Placement instability is ALWAYS going to happen.

 

If the clock region clock skew is really your biggest problem, then it sounds like your RTL is fundamentally sound.

 

So you are stuck in the (uncomfortable) place where you do not get consistently passing implementation runs.

 

The tools you have available have pretty much all been mentioned:

  - tool options (although I don't know any that will make much difference with this stuff)

  - PBLOCKS

  - LOCing critical resources (like DSP blocks or BRAMs or both)

      - this one is often the most successful and least invasive - if you get one run that passes, extract the "big block" placements from it (DSP and BRAM) and turn them into LOC constraints for your next runs

  - multiple place and route runs

      - this is tricky to do in Vivado since there are no seeds, but there are mechanisms to make this work

 

Avrum

Adventurer

Re: Advice needed on timing closure problem

Hi @avrumw.  I've taken a week's break, so am picking up this thread (and the problem) again.

 

Yes, I'm definitely starting to regret my decision to run at 500 MHz.  To be honest, I think I got seduced into it.  When the design was smaller I was pleasantly surprised at how well everything mapped to 500 MHz, and all the issues I encountered made sense and were instructive (correct pipelining, managing logic complexity, treating DSP units properly), except for an interesting bit of nonsense with block rams (default READ_FIRST mode?)  And of course not having to manage two lanes of *everything* is just one less headache ... which I appear to have traded for a migraine.

 

Unfortunately, I seem to have hit an unpleasant threshold.  I have the horrible feeling that place_design doesn't actually understand the implications of some of its placement choices, and now my design is large enough I'm really struggling to meet timing.  I fear I'm reaching the limits of what I can do with RTL coding, though I can continue to worry at the pipelines -- the trouble is, it's incredibly difficult to distinguish between places where a longer pipeline helps and where it does nothing.

 

I'm playing a little with tool options -- setting ExtraTimingOpt for place_design was very interesting, it made the layout take up much more room on the device (suits me fine, there's plenty of room), but still makes some embarrassing clock region crosses (always always FDRE -> RAMB!)

 

So on to the more desperate options.  I'm really sad that the tool is letting me down here.  I am of course hugely reluctant to resort to placement of logic, but it's clearly worth a try.  In particular I think it would make quite a bit of sense for me to pin down the MIG and PCIe components, as when they fail timing this is just useless nonsense.

 

For extracting LOC constraints from a run, what is the right way to export what I want from the design?  I've looked at running File -> Export -> Export Constraints (which generates an *enormous* file) and then extracting the constraints I'll need, but this seems kind of clumsy!  Searching did find this link: How to export ... constraints ... but the suggested menu entry (Tools -> Directed Routing Constraints) doesn't actually seem to exist (Vivado 2016.4).

 

Alas, PBLOCKs do look like the way forward :(  Unfortunately Vivado's treatment of my constraint files is dreadfully unsanitary when it comes to actually exporting a PBLOCK setup (for some reason it insists on rewriting *everything* in a "normalised" format), and of course physical placement is just storing up trouble...

Historian

Re: Advice needed on timing closure problem

@araneidae,

 

It seems that you have stumbled on a shortcoming of the tools. I suspect that there is no heuristic in the current placer to try and force chains of DSP48 cells to be in the same clock region where possible. This is a pretty small corner case - it will only show up when trying to do cascaded chains of DSP48 cells at near the maximum frequency of the DSPs.

 

You could try filing a webcase if you have access (otherwise maybe a moderator can help here) - this is worth opening a case for.

 

However, that is not going to help you in the short term - so you will have to work around it. Again PBLOCKs or LOC constraints are going to be your best bets here.

 

The other timing error you mentioned (between FF and RAMB) is not uncommon - particularly if you have a module that needs a fair amount of RAM (or wide RAM). The RAMB cells are located in columns. When a module gets placed (particularly at high speeds), the tools need to keep the cells associated with the module near each other. This makes placed modules tend to look like balls (or amoebas). However, if a module needs RAMs, then it needs to reach up and/or down a RAMB column to get access to enough RAMBs. If the number of RAMs is more than 10, then this is going to require a clock region crossing (since there are only 10 RAMB36 per column per clock region). And, again, I don't think the tool has a bias toward staying in the same clock region vs. traversing clock regions...

 

As for extracting LOC locations, I seem to remember that there used to be a way, but I can't find it. But everything can be done with Tcl! With a placed design open, you can execute the script below - it will put out the LOC constraints for all the block RAMs (replace RAMB* with DSP48* to get the DSP48s) - you can then cut and paste this into an XDC file.

 

# With a placed design open: emit a LOC constraint for every block RAM
# (replace RAMB* with DSP48* to get the DSP48s instead)
foreach i [get_cells -hier -filter {REF_NAME =~ RAMB*}] {
  set loc [get_property LOC $i]
  set name [get_property NAME $i]
  puts "set_property LOC $loc \[get_cells $name\]"
}

Alas, PBLOCKs do look like the way forward :(  Unfortunately Vivado's treatment of my constraint files is dreadfully unsanitary when it comes to actually exporting a PBLOCK setup (for some reason it insists on rewriting *everything* in a "normalised" format), and of course physical placement is just storing up trouble...

 

I never let the tools overwrite my (hand written) XDC files - yes, they can mess up formatting... I always make my XDC changes by hand (cut and paste from the tool).

 

However, for the PBLOCKs, it does make sense to do them in the GUI and then write them out. What you can do for this is add a new xdc file to the project (maybe call it <project>_loc.xdc) and make it the "Target constraint file" (right click on the file and select "Set as Target Constraint File"). All newly created constraints will get written to this file (leaving your "main" one alone, unless you change any of the constraints from your main file in the GUI).
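As a sketch, the same can be done from Tcl, assuming the file has already been added to the project's constraint fileset:

```tcl
# Direct newly created constraints into the named XDC file
set_property target_constrs_file <project>_loc.xdc [current_fileset -constrset]
```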

 

Avrum

Adventurer

Re: Advice needed on timing closure problem

@avrumw,

 

Actually, the split DSP cascade (with which I started this thread) is long behind me -- haven't seen it in a while -- so I doubt there's a web case for me to make.  (Not sure I do have access; I think I have to go through Europractice support.)

 

The troublesome part of my design consists of alternating DSP -> RAMB -> DSP chains (16 DSP units long, 36 bits wide), currently two FF on the path between each stage (but 1 or 2 FF doesn't seem to make much difference).

 

Because I think it's quite illuminating, I've attached an image of the device layout near my latest timing failure:

 

layout.png

Key: Purple: DSP->RAMB->DSP->... chain

White: Timing failure

Green: Associated supporting logic (immediate parent entity)

 

The placer doesn't really seem to have been compelled to straddle the clock region boundary here.  Grr.  And there's a second copy of this unit immediately adjacent (in fact, the two columns of RAMB to the left, and about half the column of DSP) which has routed without problems ... and which straddles four clocking regions!  Oh, I'm losing the plot now.

 

So.  My routing always fails across the clock boundary ... but an identical configuration routes without problems across four clock boundaries.  I've no idea how to interpret this!

 

Thank you for your other comments, very helpful.

Adventurer

Re: Advice needed on timing closure problem

Hi @suraj.kothari,

 

I've been doing some experiments along the lines you suggested, and some of the results look interesting.  Alas, my only measurement of "quality" is the worst slack, but it'll do as a proxy for brittleness.  The point in the design where timing always fails is a chain of the form (DSP -> FF -> FF -> RAMB -> FF -> DSP...).  At this point I can report (with my current design snapshot):

 

  1. With Flow_PerfOptimized_high strategy and phys_opt_design enabled I fail timing
  2. If I omit phys_opt_design, I meet timing with 1ps margin!
  3. If I shorten the DSP->RAMB chain by dropping a FF I fail timing ... despite all the FF I can see being adjacent, though there are rather a lot of them
  4. If I set -flatten_hierarchy=none (on top of 1,2 above) I meet timing with 8ps margin.  That's a surprise; guess it's a sign my design is sound ;')
  5. Setting the default strategy + keep_equivalent_register + hierarchy none, as you suggested, fails timing more spectacularly with -48ps margin.

This is all very confusing.  I guess my design is pretty congested near the block rams.

 

Still, I currently meet timing (for now) without having to resort to constraints, and I now need to write the control system, so this will have to do.

Contributor

Re: Advice needed on timing closure problem

Hi @araneidae,

 

Is it possible for you to share the timing report, also the snapshot of synthesis/implementation settings window?

 

Thanks.

 

Regards,

Suraj Kothari

Adventurer

Re: Advice needed on timing closure problem

Sure, will do so tomorrow morning.

 

If you can give me the TCL commands to output the information you're interested in that will be helpful.

 

I've encountered one interesting snag which I'll dig into tomorrow: setting flatten_hierarchy=none has caused my IBUFs to forget their IO standard setting, causing the bit generation stage to fail!  Rather unexpected; I'll investigate tomorrow.

Adventurer

Re: Advice needed on timing closure problem

@suraj.kothari

 

As requested.  First, here are my synthesis and implementation settings (screenshots are a desperately inefficient way of communicating this!).  I've collapsed the implementation settings for which is_enabled is not checked, otherwise the image is rather too large.  These settings are set as part of my project generation by these two TCL commands:

 

set_property strategy Flow_PerfOptimized_high [get_runs synth_1]
set_property STEPS.SYNTH_DESIGN.ARGS.ASSERT true [get_runs synth_1]

 

Note that I've set -flatten_hierarchy to rebuilt for now, see the posting I've just made on this subject...  Amusingly enough, setting this to full (which I did by accident) completely wrecks my timing closure, with failure deep in the MIG (good luck solving that one!)

 

synthesis.png

implementation.png

 

For the timing report I'm not sure what you're interested in, so I've just run Tools->Timing->Report Timing...

And then because the result goes into a window (not so helpful), I ran the command again but trimming the -name parameter:

 

report_timing -delay_type min_max -max_paths 10 -sort_by group -input_pins -file timing.txt

 

As the resulting file is rather large, I've attached a zipped version, still pretty enormous I'm afraid.  Also 99% of the timing report is utterly uninteresting: everything involving interconnect_inst is Xilinx components over which I have little control: MIG, PCIe bridge, AXI interconnect.
