
Timing Closure at 500 MHz

Adventurer
Posts: 92
Registered: ‎11-22-2016

Timing Closure at 500 MHz

[ Edited ]

As I am all too aware, timing closure at 500 MHz is tricky at best.  Sometimes, however, the points of failure are a complete mystery to me; this posting describes one such case.  If anyone can give any advice on how to avoid this kind of issue I'd be extremely grateful.

 

This is kind of a reprise of an earlier posting of mine, Advice needed on timing problem, but the problem here looks completely different.

 

I have a design where a large part of the logic runs at 500 MHz (yes, I have learnt, not a smart choice), and I'm finding timing closure astonishingly fragile.  In this particular query I have exactly one failing path in my design, and I'd love to understand why this particular path has failed.

 

So here is the failing path (from a 250 MHz-clocked FF to a 500 MHz FF):

 

failing.png

 

I think the important numbers are the following:

  • Clock edge-to-edge delay (after skew, pessimism, uncertainty): 4.190 - 2.715 = 1.475 ns
  • Source FF C->Q delay: 0.223 ns
  • Net propagation: 1.250 ns
  • Destination setup C->D: 0.022 ns
  • Data edge-to-edge delay: 0.223 + 1.250 + 0.022 = 1.495 ns
  • Clock-to-data slack: 1.475 - 1.495 = -0.020 ns (failure)
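As a sanity check, the slack arithmetic works out like this (values in ns, taken from the report above):

```python
# Setup-slack arithmetic from the failing timing report (all values in ns).
clock_capture = 4.190   # capture clock edge arrival at destination FF
clock_launch = 2.715    # launch clock edge arrival at source FF
clk_to_q = 0.223        # source FF C->Q delay
net_delay = 1.250       # net propagation (routing) delay
setup = 0.022           # destination FF setup requirement (C->D)

required = clock_capture - clock_launch      # time available: 1.475 ns
arrival = clk_to_q + net_delay + setup       # time consumed:  1.495 ns
slack = required - arrival                   # -0.020 ns -> timing failure

print(f"required {required:.3f} ns, arrival {arrival:.3f} ns, slack {slack:.3f} ns")
```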

Hmm.  Let's compare with an *identical* path elsewhere in the design.

success.png

The relevant numbers are:

  • Clock edge-to-edge: 1.478 ns
  • Source delay: 0.259 ns
  • Net delay: 1.157 ns
  • Destination setup: 0.002 ns
  • Data edge-to-edge: 0.259 + 1.157 + 0.002 = 1.418 ns
  • Slack: 1.478 - 1.418 = +0.060 ns (success)

 

Let me ask one obvious question at this point: why on earth are the setup and propagation delays so different in the two cases (0.022/0.223 ns vs 0.002/0.259 ns)?  As it happens I appear to gain in the failing case, so this is really a bit of a red herring.

 

Ok, so the problem is the difference between the 1.157 ns and 1.250 ns routing delays.  So let's have a peek at the placement in the failing case.  The slice locations are a hint: SLICE_X69Y96 => SLICE_X69Y97.  Hmm.  Let's have a picture of the environment, shall we:

 

placement.png

 

This is where I started.  If this kind of routing fails ... then what on earth can I do?

 

I am out of ideas.

 

For reference, I am targeting an xc7vx690tffg1761-2 device using Vivado 2016.4.

 

EDIT: Had to replace first image.

Historian
Posts: 4,495
Registered: ‎01-23-2009

Re: Timing Closure at 500 MHz

Even without the device view, it was clear that these two cells were adjacent (just from their numbering). So that means that the 1.250 ns of routing delay is not due to distance. You should turn on the detailed route view so that you can see the actual route - I assume you will see it taking a serpentine route from the source to the destination (since if it went direct, the route would be WAY shorter).

 

So the question is why.

 

In some cases, we see things like this when the tool is trying to fix a hold violation on the same path. I would ask it directly with

 

report_timing -hold -from <source> -to <destination>

 

See how much slack there is on the hold path. If the slack is near 0, then the tool added this extra delay to fix the hold time (and in the process violated the setup).

 

That being said, there isn't really any reason there should be a hold violation here - two different outputs of the same PLL are supposed to be within 120ps of each other, plus a little bit of extra skew due to the different BUFGs. Even using the 3:1 rule of thumb (fast process corner to slow process) this should be no problem even at 2ns.

 

The only other reason I can think of a route like this is due to congestion. You can try looking at the congestion map of your design and seeing if it is particularly congested in this area. If so, you can try one of the congestion reduction strategies.
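For a text-based view of congestion, something like the following from the Tcl console is a sketch of what I mean (the `-congestion` switch of `report_design_analysis` exists in recent Vivado releases; check the Tcl command reference for your version):

```tcl
# Text-based congestion report; regions reported with high congestion levels
# are where the router is struggling. Written to a file for later inspection.
report_design_analysis -congestion -file congestion.rpt
```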

 

As an aside (and probably unrelated to your problem) why do you have an IDELAY before your PLL? If you need to delay your clock and you are already using a clock modifying block, you can use the phase shift capability of the PLL (which is pretty coarse) or if you need finer control you can use an MMCM. There is nothing "wrong" with what you are doing, but it is a little unusual.

 

Of course, it is worth noting that (by default) the delay caused by an MMCM is a phase delay (or a "WAVEFORM" delay) whereas the delay through the IDELAY is a propagation delay (or "LATENCY" delay). This changes the way the tool decides launch and capture edges, and hence your constraints on your input interface will fail. You can either fix your constraints or (new feature) change the mechanism that the tool uses to deal with delay on the MMCM. This is done in your XDC file with

 

set_property PHASESHIFT_MODE LATENCY [get_cells <instance_name_of_MMCM>]

 

(The default is WAVEFORM in the 7 series and UltraScale, but LATENCY in UltraScale+)

 

Avrum

Adventurer
Posts: 92
Registered: ‎11-22-2016

Re: Timing Closure at 500 MHz


@avrumw wrote:
As an aside (and probably unrelated to your problem) why do you have an IDELAY before your PLL? If you need to delay your clock and you are already using a clock modifying block, you can use the phase shift capability of the PLL (which is pretty coarse) or if you need finer control you can use an MMCM. There is nothing "wrong" with what you are doing, but it is a little unusual.

In all honesty, the reason is pure ignorance and inexperience.  I am a software engineer in FPGA land (and finding the experience maddening!)

 

I probably wouldn't need fine control over my clock phase at all if I knew how to constrain my inputs properly: my data and clock source is an ADC (AD9684) with DDR data synchronous with a data clock, and I'm just empirically aligning my clock and data with the IDELAY.  I imagine that with the correct constraints I could throw this entire engine away ... but I have bigger problems biting me, and to date I have got away with minimal clocking constraints.  Basically all I have are clock definitions with defined periods and a few strategically chosen asynchronous clock groups and false paths.  I have literally no other project-level clock constraints at present, and so I remain ignorant of what is possible.
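For concreteness, the kind of "minimal" constraint set I mean looks something like this in XDC (all port, clock, and cell names here are hypothetical placeholders, not from the actual project):

```tcl
# Hypothetical clock definitions (periods in ns).
create_clock -period 2.000 -name adc_dclk [get_ports adc_dclk_p]
create_clock -period 8.000 -name sys_clk  [get_ports sys_clk_p]

# Declare unrelated domains asynchronous so cross-domain paths are not timed.
set_clock_groups -asynchronous -group [get_clocks adc_dclk] \
                               -group [get_clocks sys_clk]

# A false path from a quasi-static control register.
set_false_path -from [get_cells ctrl_mode_reg*]
```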

 

 

How do I get to the "congestion map"?  I've often wished for such a view, and never managed to find it.  If this is a feature introduced since 2016.4 it's a good enough reason to upgrade!

Historian
Posts: 4,495
Registered: ‎01-23-2009

Re: Timing Closure at 500 MHz

If I understand you correctly, you are using a 500MHz DDR ADC interface with no timing constraints? If so, I potentially have some very bad news for you...

 

A 500MHz DDR interface is VERY tricky. In many (actually, I think pretty much all) cases that is too fast to be captured statically with any clocking structure - and the clocking structure you are using is hardly the fastest one (assuming you are using the PLL clock to capture the ADC data).

 

At these speeds you need to do dynamic calibration. If you don't do so, then your design will not work across process, voltage and temperature - it may work in the lab on one or a few boards, but there is no way you can go to production with a system like this...

 

Avrum

Scholar
Posts: 1,186
Registered: ‎02-24-2014

Re: Timing Closure at 500 MHz

I strongly suggest you examine the Analog Devices reference design for the AD9434, which is a 12 bit ADC running at 500 MHz.   It has very similar performance requirements, and they do it very simply,  by using the ISERDES elements to perform a 1:4 sampling ratio, producing parallel data at 125 MHz instead of 500 MHz.   Timing closure becomes trivial in this case.

 

https://wiki.analog.com/resources/fpga/xilinx/fmc/ad9434

 

 

Don't forget to close a thread when possible by accepting a post as a solution.
Adventurer
Posts: 92
Registered: ‎11-22-2016

Re: Timing Closure at 500 MHz

@avrumw, forgive me, but I think this conversation has rapidly gone off the rails.  I would love to discuss the issues of DDR data capture with you (I feel you're hugely overstating the problem: the only relevant variable is clock to data skew, which can only be statically managed during operation, because there is no dynamic feedback mechanism available without disrupting data flow), but this really doesn't feel like the right place for that.

 

I will try to open some kind of support channel to Xilinx through STFC Europractice, our tool provider.

 

@prathikm, this was fundamentally a question about synthesis, which has become derailed.

Adventurer
Posts: 92
Registered: ‎11-22-2016

Re: Timing Closure at 500 MHz

[ Edited ]

@avrumw, you wrote:

and the clocking structure you are using is hardly the fastest one (assuming you are using the PLL clock to capture the ADC data).

 

Can you expand on this at all?  Maybe my underlying problem is that I'm not doing anything special about "clocking structure", I'm just using a BUFG to drive my entire design.  Are there other options?  What's the best way to learn about options for "clocking structure"?

 

P.S. I have made a small logic change in unrelated code (adding untimed control of two I/O pins on the other side of the device) ... and my worst slack has jumped to +43 ps (positive).  This is what bugs me, the utterly unpredictable routing behaviour of Vivado.  Right now I have a working system.  Next time I breathe on it I'll get more mystery problems.  Something is fundamentally wrong, and I can't see what it is.

Scholar
Posts: 1,155
Registered: ‎09-16-2009

Re: Timing Closure at 500 MHz

 

I'd just like to reiterate Avrum's post regarding hold times and congestion.  Focus on those points he made - I think there's something to learn about your design here and why the tools are doing what they are doing.  (Why is that route longer than expected?)

 

Vivado is actually much, much better than its predecessor at giving predictable results - i.e. fewer cases of unrelated code changes flipping pass/fail results such as what you're seeing here.

 

But then 500 MHz is really tough.  You're in a rare place here.

 

Regards,

 

Mark

Historian
Posts: 4,495
Registered: ‎01-23-2009

Re: Timing Closure at 500 MHz

This is what bugs me, the utterly unpredictable routing behaviour of Vivado.  Right now I have a working system.  Next time I breathe on it I'll get more mystery problems.  Something is fundamentally wrong, and I can't see what it is.

 

Take a look at this post on the chaotic nature of synthesis, place and route.  It was written in the days of ISE, but it is equally true for Vivado (as I mentioned in the post, it is fundamental to the problem, not a side effect of the way the tool solves the problem).

 

As @markcurry said, Vivado is less chaotic than ISE, or maybe Vivado is just more consistently able to find "better" solutions in spite of the chaos, but Vivado is still chaotic. As I said in the other thread, this is true now, has always been true, and probably always will be true...

 

Avrum

Historian
Posts: 4,495
Registered: ‎01-23-2009

Re: Timing Closure at 500 MHz

[ Edited ]

I would love to discuss the issues of DDR data capture with you (I feel you're hugely overstating the problem: the only relevant variable is clock to data skew, which can only be statically managed during operation,

 

I'm sorry - but you are wrong...

 

The performance of every silicon-based transistor varies over three variables:

  - Process (P): the variation in manufacturing from die to die

  - Voltage (V): the variation within the legal range of voltages allowed for the device

      - for example, for the VCCint supply in 7 series devices, the nominal voltage is 1.0V, but it can vary between 0.97V and 1.03V

      - virtually all power supply designs and on-board power distribution networks need this tolerance; it is impossible to design a production system that delivers exactly 1.00V +/- 0.00V

  - Temperature (T): the variation in die temperature allowed for the device

      - for example, commercial devices can operate with die temperatures of 0C to 85C

 

These three parameters (PVT) have a significant impact on the performance of a transistor (and hence gate, LUT, switch matrix, etc...) - the rule of thumb is the ratio of slowest to fastest for any transistor based delay is 3:1.

 

When you perform static timing analysis, the tools are verifying that the design will operate at all legal corners of the PVT cube. If Vivado says your slack is 0 or positive, then the design will work at any legal combination of PVT. When there is negative slack that means that there is at least some part of this PVT cube where your device will fail.

 

So, when you are looking at your 0.02ns negative slack on your 2ns path, this is a 1% violation on the requirement. Almost certainly at most PVT corners this will work.

 

Now let's look at your input interface. This is a 500 MHz DDR interface, therefore the requirement on each capture is 1 ns. The source device itself (the ADC) and the board (board skew, signal integrity) probably consume a portion of that - let's say the best case is that there is 900 ps left for the FPGA. This is the data valid window at the FPGA.

 

Now let's look at the capture requirements of the FPGA. If you had timing constraints, the tools would report the magnitude of the violations on these paths, but you don't. So let's look at the datasheet.

 

Related to this topic, there are several different clock architectures for capturing input data. You can see a description of them in this post on capture architectures. Each of them has a different "Pin to Pin" requirement specified in the datasheet - these are shown in the post.

 

Right now, you are using a global buffer with a PLL - at its base, this is Tpspllcc/Tphpllcc - yours is probably a bit worse since you are also using the IDELAY... But if we look up these numbers in the datasheet for your device (Virtex-7 690T-2), the timing is 3.40/-0.10. Thus, to statically capture an input interface with this clocking structure, you need a 3.3 ns data eye. At best, your system is providing a 0.9 ns data valid window. This isn't even close. Other clocking structures are better, but none get anywhere near 0.9 ns.

 

If you look more closely at the datasheet, you will also see Tsamp - this is the minimum data window you can capture with "the perfect" dynamic calibration (Tsamp_bufio if you are using a BUFIO rather than an MMCM/BUFG). For the MMCM (they don't even specify it with a PLL) this is 0.560ns, and even this includes some sources of uncertainty. What we can draw from this is that the actual setup/hold width required for the IDDR flip-flop is small - well under 0.560ns (let's assume it is 250ps). The problem is that with PVT variation this critical 250ps window can occur anywhere within the 3.3ns window of Tpspllcc/Tphpllcc.

 

In your system, with manual adjustment of the IDELAY taps, you are adjusting timing so that one of these 250ps critical windows is landing within your 900ps data valid window. But since this 250ps window can move with PVT by more than 3ns, there is no way that the value you are using will work at different PVTs.
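To summarize the budget numerically (a rough sketch; the 0.1 ns source/board consumption is an illustrative assumption, and the datasheet values are the ones quoted above):

```python
# Data-eye budget for a 500 MHz DDR capture (all values in ns).
bit_period = 1.0 / (2 * 0.5)    # 500 MHz clock, DDR -> 1 ns per data bit
source_and_board = 0.1          # assumed ADC + board consumption (illustrative)
valid_window = bit_period - source_and_board   # ~0.9 ns valid at the FPGA pins

# Static capture requirement for the PLL + BUFG clocking structure, from the
# Virtex-7 -2 datasheet numbers quoted above: setup 3.40 ns, hold -0.10 ns.
required_eye = 3.40 + (-0.10)   # 3.30 ns data eye needed for static capture

print(f"valid window {valid_window:.2f} ns vs required eye {required_eye:.2f} ns")
print("static capture feasible:", valid_window >= required_eye)
```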

 

So, while you may succeed at finding a value that seems to work on a given board under relatively narrow conditions (i.e. in the lab), this capture mechanism (in fact any static capture mechanism) is simply not viable at this speed (at least for anything other than one or a handful of boards in the lab).

 

Avrum