Adventurer
1,555 Views
Registered: ‎11-22-2016

Timing Closure at 500 MHz

As I am all too aware, timing closure at 500 MHz is tricky at best.  Sometimes, however, the points of failure are a complete mystery to me; this posting describes one such case.  If anyone can give any advice on how to avoid this kind of issue, I'd be extremely grateful.

 

This is kind of a reprise of an earlier posting of mine, Advice needed on timing problem, but the problem here looks completely different.

 

I have a design where a large part of the logic runs at 500 MHz (yes, I have learnt, not a smart choice), and I'm finding timing closure astonishingly fragile.  In this particular query I have exactly one failing path in my design, and I'd love to understand why this particular path has failed.

 

So here is the failing path (this is from a 250MHz clocked FF to a 500MHz FF):

 

failing.png

 

I think the important numbers are the following:

  • Clock edge to edge delay (after skew, pessimism, uncertainty): 4.190 - 2.715 = 1.475 ns
  • Source FF C->Q delay: 0.223 ns
  • Net propagation: 1.250 ns
  • Destination setup C->D: 0.022 ns
  • Data edge to edge delay: 0.223 + 1.250 + 0.022 = 1.495 ns
  • Clock to data slack = 1.475 - 1.495 = -0.020 ns (failure)

Hmm.  Let's compare with an *identical* path elsewhere in the design.

success.png

The relevant numbers are:

  • Clock edge to edge: 1.478 ns
  • Source delay: 0.259 ns
  • Net delay: 1.157 ns
  • Destination setup: 0.002 ns
  • Data edge to edge: 1.418 ns
  • Slack: +0.060 ns (success)
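
For reference, both path reports can be pulled out with Vivado Tcl along these lines (the cell names below are placeholders, not my real instance names):

# Setup-analysis report for the failing 250 MHz -> 500 MHz path
report_timing -delay_type max \
    -from [get_cells u_core/src_reg_250m] \
    -to   [get_cells u_core/dst_reg_500m] \
    -path_type full_clock_expanded -nworst 1

# Same query again for the "identical" passing path elsewhere in the design
report_timing -delay_type max \
    -from [get_cells u_other/src_reg_250m] \
    -to   [get_cells u_other/dst_reg_500m] \
    -path_type full_clock_expanded -nworst 1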

 

Let me ask one obvious question at this point: why on earth are the setup and propagation delays so different (0.022/0.223 vs 0.002/0.259) in these two cases?  As it happens I appear to gain in my failing case, so this is really a bit of a red herring.

 

Ok, so the problem is the difference between 1.157 and 1.250 routing delay.  So let's have a peek at the placement in the failing case.  The slice locations are a hint: SLICE_X69Y96 => SLICE_X69Y97.  Hmm.  Let's have a look at the environment, shall we:

 

placement.png

 

This is where I started.  If this kind of routing fails ... then what on earth can I do?

 

I am out of ideas.

 

For reference, I am targeting an xc7vx690tffg1761-2 device using Vivado 2016.4.

 

EDIT: Had to replace first image.

Historian
1,510 Views
Registered: ‎01-23-2009

Re: Timing Closure at 500 MHz

Even without the device view, it was clear that these two cells were adjacent (just from their numbering). So that means that the 1.250ns of routing delay is not due to distance. You should turn on the detailed route view so that you can see the actual route - I assume you will see it taking a serpentine route from the source to the destination (since if it went directly, the delay would be WAY smaller).
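
If you prefer Tcl to the GUI, something along these lines should dump the route of the offending net (the net name here is just a placeholder - take the real one from the timing report):

# Print the routed resources of the suspect net and highlight it in the device view
set net [get_nets u_core/failing_net]
puts [get_property ROUTE $net]
select_objects $net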

 

So the question is why.

 

In some cases, we see things like this when the tool is trying to fix a hold violation on the same path. I would ask it directly with

 

report_timing -hold -from <source> -to <destination>

 

See how much slack there is on the hold path. If the slack is near 0, then the tool added this extra delay to fix the hold time (and in the process violated the setup).

 

That being said, there isn't really any reason there should be a hold violation here - two different outputs of the same PLL are supposed to be within 120ps of each other, plus a little bit of extra skew due to the different BUFGs. Even using the 3:1 rule of thumb (fast process corner to slow process) this should be no problem even at 2ns.

 

The only other reason I can think of a route like this is due to congestion. You can try looking at the congestion map of your design and seeing if it is particularly congested in this area. If so, you can try one of the congestion reduction strategies.
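
If you want a text version of the congestion information, report_design_analysis can generate one (I believe the -congestion switch is available in the 2016.x releases, but check your version):

# Run on the placed or routed design
report_design_analysis -congestion -file congestion.rpt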

 

As an aside (and probably unrelated to your problem) why do you have an IDELAY before your PLL? If you need to delay your clock and you are already using a clock modifying block, you can use the phase shift capability of the PLL (which is pretty coarse) or if you need finer control you can use an MMCM. There is nothing "wrong" with what you are doing, but it is a little unusual.

 

Of course, it is worth noting that (by default) the delay caused by an MMCM is a phase delay (or a "WAVEFORM" delay) whereas the delay through the IDELAY is a propagation delay (or "LATENCY" delay). This changes the way the tool decides launch and capture edges, and hence your constraints on your input interface will fail. You can either fix your constraints or (new feature) change the mechanism that the tool uses to deal with delay on the MMCM. This is done in your XDC file with

 

set_property PHASESHIFT_MODE LATENCY [get_cells <instance_name_of_MMCM>]

 

(The default is WAVEFORM in the 7 series and UltraScale, but LATENCY in UltraScale+)

 

Avrum

Adventurer
1,506 Views
Registered: ‎11-22-2016

Re: Timing Closure at 500 MHz


@avrumw wrote:
As an aside (and probably unrelated to your problem) why do you have an IDELAY before your PLL? If you need to delay your clock and you are already using a clock modifying block, you can use the phase shift capability of the PLL (which is pretty coarse) or if you need finer control you can use an MMCM. There is nothing "wrong" with what you are doing, but it is a little unusual.

In all honesty, the reason is pure ignorance and inexperience.  I am a software engineer in FPGA land (and finding the experience maddening!)

 

I probably wouldn't need fine control over my clock phase at all if I only knew how to constrain my inputs properly: my data and clock source is an ADC (AD9684) with DDR data synchronous with a data clock, and I'm just empirically aligning my clock and data with the IDELAY.  I imagine that with the correct constraints I could throw this entire engine away ... but I have bigger problems biting me, and to date I have got away with minimal clocking constraints.  Basically all I have are clock definitions with defined periods and a few strategically chosen asynchronous clock groups and false paths (roughly as sketched below).  I have literally no other project level clock constraints at present, and so I remain ignorant of what is possible.
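
To be concrete, those constraints are of roughly this shape (names and periods here are illustrative placeholders, not my real ones):

create_clock -period 2.000 -name adc_clk [get_ports adc_clk_p]
create_clock -period 4.000 -name sys_clk [get_ports sys_clk_p]

# clocks that never exchange timed data
set_clock_groups -asynchronous -group [get_clocks adc_clk] -group [get_clocks sys_clk]

# a handful of individually chosen false paths
set_false_path -from [get_cells u_ctrl/static_config_reg*] -to [get_clocks adc_clk]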

 

 

How do I get to the "congestion map"?  I've often wished for such a view, and never managed to find it.  If this is a feature introduced since 2016.4 it's a good enough reason to upgrade!

Historian
1,500 Views
Registered: ‎01-23-2009

Re: Timing Closure at 500 MHz

If I understand you correctly, you are using a 500MHz DDR ADC interface with no timing constraints? If so, I potentially have some very bad news for you...

 

A 500MHz DDR interface is VERY tricky. In many (actually, I think pretty much all) cases that is too fast to be captured statically with any clocking structure - and the clocking structure you are using is hardly the fastest one (assuming you are using the PLL clock to capture the ADC data).

 

At these speeds you need to do dynamic calibration. If you don't do so, then your design will not work across process, voltage and temperature - it may work in the lab on one or a few boards, but there is no way you can go to production with a system like this...

 

Avrum

Scholar jmcclusk
1,491 Views
Registered: ‎02-24-2014

Re: Timing Closure at 500 MHz

I strongly suggest you examine the Analog Devices reference design for the AD9434, which is a 12 bit ADC running at 500 MHz.   It has very similar performance requirements, and they do it very simply,  by using the ISERDES elements to perform a 1:4 sampling ratio, producing parallel data at 125 MHz instead of 500 MHz.   Timing closure becomes trivial in this case.

 

https://wiki.analog.com/resources/fpga/xilinx/fmc/ad9434

 

 

Don't forget to close a thread when possible by accepting a post as a solution.
Adventurer
1,461 Views
Registered: ‎11-22-2016

Re: Timing Closure at 500 MHz

@avrumw, forgive me, but I think this conversation has rapidly gone off the rails.  I would love to discuss the issues of DDR data capture with you (I feel you're hugely overstating the problem: the only relevant variable is clock to data skew, which can only be statically managed during operation, because there is no dynamic feedback mechanism available without disrupting data flow), but this really doesn't feel like the right place for that.

 

I will try to open some kind of support channel to Xilinx through STFC Europractice, our tool provider.

 

@prathikm, this was fundamentally a question about synthesis, which has become derailed.

Adventurer
1,459 Views
Registered: ‎11-22-2016

Re: Timing Closure at 500 MHz

@avrumw, you wrote:

and the clocking structure you are using is hardly the fastest one (assuming you are using the PLL clock to capture the ADC data).

 

Can you expand on this at all?  Maybe my underlying problem is that I'm not doing anything special about "clocking structure", I'm just using a BUFG to drive my entire design.  Are there other options?  What's the best way to learn about options for "clocking structure"?

 

P.S. I have made a small logic change in unrelated code (adding untimed control of two I/O pins on the other side of the device) ... and my worst slack has jumped to +43ps (in the good).  This is what bugs me: the utterly unpredictable routing behaviour of Vivado.  Right now I have a working system.  Next time I breathe on it I'll get more mystery problems.  Something is fundamentally wrong, and I can't see what it is.

Scholar markcurry
1,423 Views
Registered: ‎09-16-2009

Re: Timing Closure at 500 MHz

 

I'd just like to reiterate Avrum's post regarding hold times and congestion.  Focus on those points he made - I think there's something to learn about your design here and why the tools are doing what they are doing.  (Why is that route longer than expected?)

 

Vivado is actually much, much better than its predecessor at giving predictable results - i.e. at avoiding cases where unrelated code changes flip a design between pass and fail, such as what you're seeing here.

 

But then 500 MHz is really tough.  You're in a rare place here.

 

Regards,

 

Mark

Historian
1,406 Views
Registered: ‎01-23-2009

Re: Timing Closure at 500 MHz

This is what bugs me, the utterly unpredictable routing behaviour of Vivado.  Right now I have a working system.  Next time I breath on it I'll get more mystery problems.  Something is fundamentally wrong, and I can't see what it is.

 

Take a look at this post on the chaotic nature of synthesis, place and route.  It was written in the days of ISE, but it is equally true for Vivado (as I mentioned in the post, it is fundamental to the problem, not a side effect of the way the tool solves the problem).

 

As @markcurry said, Vivado is less chaotic than ISE, or maybe Vivado is just more consistently able to find "better" solutions in spite of the chaos, but Vivado is still chaotic. As I said in the other thread, this is true now, has always been true, and probably always will be true...

 

Avrum

Historian
1,388 Views
Registered: ‎01-23-2009

Re: Timing Closure at 500 MHz

I would love to discuss the issues of DDR data capture with you (I feel you're hugely overstating the problem: the only relevant variable is clock to data skew, which can only be statically managed during operation,

 

I'm sorry - but you are wrong...

 

The performance of every silicon-based transistor varies over three variables:

  - Process (P) - the variation in manufacturing from die to die

  - Voltage (V) - the variation within the legal range of voltages allowed for the device

      - for example for the VCCint supply in 7 series devices, the nominal voltage is 1.0V, but it can vary between 0.97V and 1.03V

      - virtually all power supply designs and on board power distribution networks need this variability - it is impossible to design a production system that can deliver exactly 1.00V +/- 0.00V

   - Temperature (T) - the variation in die temperature allowed for the device

      - for example, commercial devices can operate with die temperatures of 0C to 85C

 

These three parameters (PVT) have a significant impact on the performance of a transistor (and hence gate, LUT, switch matrix, etc...) - the rule of thumb is the ratio of slowest to fastest for any transistor based delay is 3:1.

 

When you perform static timing analysis, the tools are verifying that the design will operate at all legal corners of the PVT cube. If Vivado says your slack is 0 or positive, then the design will work at any legal combination of PVT. When there is negative slack that means that there is at least some part of this PVT cube where your device will fail.

 

So, when you are looking at your 0.02ns negative slack on your 2ns path, this is a 1% violation on the requirement. Almost certainly at most PVT corners this will work.

 

Now let's look at your input interface. This is a 500MHz DDR interface, therefore the requirement on each capture is 1ns. The source device itself (the ADC) and the board (board skew, signal integrity) probably use a small portion of that - let's say the best case is that there is 900ps left for the FPGA. This is the data valid window at the FPGA.

 

Now let's look at the requirements of the capture of the FPGA. If you had timing constraints, the tools would report the magnitude of the violations on these paths, but you don't. So let's look at the datasheet.
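
For reference, a DDR source-synchronous input like this is normally constrained with set_input_delay applied to both clock edges - a rough sketch, assuming a clock named adc_clk has already been created on the clock input port, and with placeholder port names and values (the real numbers come from the ADC's output timing spec):

set_input_delay -clock adc_clk -max  0.400 [get_ports {adc_data[*]}]
set_input_delay -clock adc_clk -min -0.400 [get_ports {adc_data[*]}]
set_input_delay -clock adc_clk -max  0.400 [get_ports {adc_data[*]}] -clock_fall -add_delay
set_input_delay -clock adc_clk -min -0.400 [get_ports {adc_data[*]}] -clock_fall -add_delay

With constraints like these in place, the timing reports would show exactly how badly the capture paths fail.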

 

Related to this topic, there are several different clock architectures for capturing input data. You can see a description of them in this post on capture architectures. Each of them has a different "Pin to Pin" requirement specified in the datasheet - these are shown in the post.

 

Right now, you are using a global buffer with PLL - at its base, this is Tpspllcc/Tphpllcc - yours is probably a bit worse since you are also using the IDELAY... But if we look at the datasheet for these numbers in your device (Virtex-7 690T-2), the timing is 3.40/-0.1. Thus, to statically capture an input interface with this clocking structure, you need a 3.3ns data eye. At best, your system is providing a 0.9ns data valid window. This isn't even close. Other clocking structures are better, but none are as good as 0.9ns.

 

If you look more closely at the datasheet, you will also see Tsamp - this is the minimum data window you can capture with "the perfect" dynamic calibration (Tsamp_bufio if you are using a BUFIO rather than an MMCM/BUFG). For the MMCM (they don't even specify it with a PLL) this is 0.560ns, and even this includes some sources of uncertainty. What we can draw from this is that the actual setup/hold width required for the IDDR flip-flop is small - well under 0.560ns (let's assume it is 250ps). The problem is that with PVT variation this critical 250ps window can occur anywhere within the 3.3ns window of Tpspllcc/Tphpllcc.

 

In your system, with manual adjustment of the IDELAY taps, you are adjusting timing so that one of these 250ps critical windows is landing within your 900ps data valid window. But since this 250ps window can move with PVT by more than 3ns, there is no way that the value you are using will work at different PVTs.

 

So, while you may succeed at finding a value that seems to work on a given board under relatively narrow conditions (i.e. in the lab), this capture mechanism (in fact any static capture mechanism) is simply not viable at this speed (at least for anything other than one or a handful of boards in the lab).

 

Avrum

Adventurer
1,182 Views
Registered: ‎11-22-2016

Re: Timing Closure at 500 MHz

@avrumw, thank you very much for your thoughtful response.

 

Inevitably I have questions.

 

You end by saying "any static capture mechanism ... is not viable ... at this speed ... for [more than] a handful of boards".  Let me focus on this point, and see if we're talking about the same thing.  I think my biggest concern is that I don't think the data stream provides any other option.

 

First of all, let's talk about PVT and the manual adjustment of IDELAY.  Perhaps the one thing that might be saving me is the fact that my application is laboratory scale: I will have two cards in production, a third in the lab, and probably a fourth as a standby spare.  I am collaborating with two, three, maybe four other similar organisations (I don't need to be coy, but names will only distract), so we're looking at a total population of no more than a dozen boards.

 

My manual measurement and adjustment of IDELAY effectively compensates for process variation, so we're down to VT;  I imagine this is dominated by T.  However, I expect the system to run for approximately six months without interruption, and although they are installed in air-cooled racks, I'm sure there will be some significant temperature variation.

 

So now, as I understand you, the problem is dynamic compensation during operation, and here I think there is a fundamental problem: looking at the datasheet for the AD9684 ADC I'm not sure that there is a signal I can use to measure variation.  Certainly I can't touch the data stream; it needs to flow without interruption.  There is a STATUS signal which is shockingly poorly documented (I'm getting so tired of gappy datasheets) ... I'll have to capture it and see if I can use it as a calibration signal.  I'll update this post once I have some data on this.

 

So let's assume I do have a calibration signal available to me.  Is it correct for me to infer from what you're saying that I should track this calibration signal (presumably by passing it through an IDELAY and shifting it to metastability against my clock) and use this to adjust my data clock?

 

Otherwise, you appear to be saying that reliable DDR ADC capture is literally impossible.  I presume that with DDR DRAM the data stream can be regularly interrupted for rounds of timing calibration, obviously for a continuously operational ADC this is not an option!

 

You do hint that more complex clocking architectures would reduce the timing uncertainty, but not enough to solve the problem; is there really any point in me looking at doing the data capture with BUFIO clocking?

 

P.S.  I don't see constraint specifications playing any useful role in this problem!

Adventurer
1,175 Views
Registered: ‎11-22-2016

Re: Timing Closure at 500 MHz

No, there is no separate signal available to calibrate against, so static alignment is my only option. The only extra available output (STATUS) has no edges to work with!
Adventurer
1,169 Views
Registered: ‎11-22-2016

Re: Timing Closure at 500 MHz

Returning to my original theme.  For a quick debug (the STATUS signal above) I rewired the bottom bit of my ADC input, and got a 20ps timing failure elsewhere:

device.png

I think this is worth posting because there's a reasonable amount to chew on here.

 

First of all, the failing path is highlighted in white at the top of this view, and the IO paths where I changed the wiring of one IDDR are highlighted at the bottom.  Again, we have a tiny change in one part of the design breaking timing in an apparently completely unrelated part of the design.

 

Secondly, the failing path itself is of some interest.  The blocks highlighted in green are all Xilinx IP, with AXI interconnect to the left, and an AXI-PCIe bridge to the right; the failing path is deep inside an AXI buffer clocked at 250 MHz; there's no clock crossing here.  This is the second class of timing failure I get where I have no control over the failing path (we've already seen the first, I can't do better than FF->FF).  My best guess is that the register linkage (from the top to the middle of this figure) is "stretching" things, but I don't seem to have any tools to help me evaluate this.

 

Below is the routing view of the failing path.  Not sure whether you'd call this circuitous or not; kind of academic, as there's nothing I can do about it.  The logic is quite deep, but again, this is deep inside Xilinx IP.

routing.png

@avrumw, you did suggest I look at the "congestion map" ... but I've not been able to find it.

 

EDIT: Right-clicking in the device view brings up Metric -> Horizontal/Vertical congestion (found via AR# 66698), but there's little of interest: all the congestion is around the PCIe hard core and the fast DRAM MIG, neither of which is near my 500 MHz logic, and both of which are deep inside Xilinx IP.  My design is not congested, as far as I can tell.

 

I'd love to progress from inspired guesswork to something more reliable.

Scholar jmcclusk
1,159 Views
Registered: ‎02-24-2014

Re: Timing Closure at 500 MHz

Welcome to the sausage factory!  As I am sure you are discovering, pushing the tools and design to their limits results in random failures for no apparent reason.  This is normal and standard.  The usual approach is to try different things to produce a design that will meet timing for 99% (or at least 90%) of the place and route attempts.  You need to lower your peak clock speeds by running more channels of data in parallel at a lower clock speed.  FPGAs have vast numbers of parallel resources, and this is the resource you should be exploiting.  But if you can't do that, there is one last thing you can do: throw massive amounts of computation at the problem by running a LOT of place and route jobs with slightly different parameters.  By the laws of chance, you might get some slim percentage of place & route attempts that meet timing.  I've done this before on projects that used ASIC code (hence frozen) for ASIC emulation.
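
One way to throw that computation at the problem is a non-project Tcl sweep over placer directives - a rough sketch, where post_opt.dcp stands in for your own post-opt checkpoint:

# Sweep placer directives from a common post-opt checkpoint
set directives {Explore ExtraNetDelay_high ExtraTimingOpt ExtraPostPlacementOpt}
foreach d $directives {
    open_checkpoint post_opt.dcp
    place_design -directive $d
    phys_opt_design
    route_design -directive Explore
    report_timing_summary -file timing_${d}.rpt
    write_checkpoint -force routed_${d}.dcp
    close_design
}

Run enough of these and, with luck, one of them closes timing.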

 

If you have some critical section of the design that MUST operate at 500 MHz, and it has frequent timing failures in the same place,  you can try directed routing constraints to lock down the placement and routing to guarantee that this critical logic is not affected by random variables.    This is very labor intensive, and usually only worth it for truly critical sections of logic.
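
The placement half of that can be done with a pblock - a minimal sketch, where the hierarchy pattern and slice range are placeholders:

# Pin the critical 500 MHz logic into a fixed slice region
create_pblock pblock_fast_core
add_cells_to_pblock [get_pblocks pblock_fast_core] \
    [get_cells -hierarchical -filter {NAME =~ "*u_fast_core*"}]
resize_pblock [get_pblocks pblock_fast_core] -add {SLICE_X60Y90:SLICE_X80Y120}

Locking the routing as well means capturing FIXED_ROUTE properties on the critical nets, which is where the heavy labour comes in.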

Don't forget to close a thread when possible by accepting a post as a solution.
Adventurer
1,158 Views
Registered: ‎11-22-2016

Re: Timing Closure at 500 MHz

@jmcclusk, thank you for the welcome!

 

I think you are right: I'm going to have to rewrite to double up my processing channels.  Unfortunately this will do horrible things to the code, and I will grieve hugely at having to abandon so much code which (when the tool is in a good mood) just works.

 

I haven't given up quite yet (one reason being that the rewrite will be painful).  Alas, I can see that going for a mix-in strategy, where all the core functional blocks continue to work full speed and the communication is half speed, will cause more trouble than relief, as it seems that clock crossing is harder than fast logic!

 

I have to confess that I am growing an abiding and deepening hatred of the tools.  I'm trying not to, but there is so much in this field that could be better.  Languages with core concepts not stuck in the 80s would be nice...

Historian
1,139 Views
Registered: ‎01-23-2009

Re: Timing Closure at 500 MHz

looking at the datasheet for the AD9684 ADC I'm not sure that there is a signal I can use to measure variation...

 

I rarely give advice about dynamic calibration techniques - I am far more comfortable with static capture, and I avoid dynamic calibration mostly because it can never be "proven" to work... However, I will give you some ideas.

 

First, you can look at XAPP524; it gives some ideas about dynamic calibration.

 

The majority of the uncertainty in this system comes from the distribution of the clock inside the FPGA - even with the BUFIO (which has the shortest distribution path), this is what accounts for the difference between Tsamp_bufio = 350ps and Tpscs/Tphcs, which give a window of 1190ps. So we need to cancel this out.

 

The main idea is to sample the clock with itself; the clock from the pin can go directly to the data input of an ISERDES (or IDDR), and can also go through the IDELAY and the BUFIO back to the CLK of the same ISERDES (or IDDR). Using the dynamic taps of the IDELAY, you can then add delay to the BUFIO path. If you sample the clock with itself at a "stable" point (where your Tsamp is in the stable region of the clock), you will always get the same value - depending on some delays, you will always get X on the rising edge and !X on the falling edge (let's say you get 1 on the rising edge sample and 0 on the falling edge). As you move away from this (let's say you move forward), you will eventually get to a point where the rising edge either starts getting a 0, or spontaneously switches back and forth between 0 and 1. When you find this point, you have found the metastability point - the point where the clock is changing.

You can dynamically adjust your tap so that if you are consistently getting a 0 you know you are too far and can back off by a tap, and if you are consistently getting a 1 you are too early and can go forward a tap. If you have a state machine that does this, you can track the metastability point even as it changes over VT.

 

Now, if your data and clock relationship is "center aligned" - if the tap you are using to sample the clock is at the metastability point, then the same tap value used on a data bit will be in the middle of the data eye. If you do this correctly (track the metastability point on a center aligned interface and use the same tap value on the same clock from the BUFIO to capture the data), then assuming the clock and data are perfectly centered (and the device you are using is close - it specifies -150ps/+100ps), you can capture the data as long as the eye is larger than Tsamp. Assuming -150ps/+150ps (so we don't have to worry about the extra 50ps), your data eye is a minimum of 1000 - 300 = 700ps wide - larger than Tsamp.

 

If the clock is edge aligned, this is more difficult - you either need an MMCM to generate a 90-degree clock, or you can try to algorithmically add half a clock period of delay to the IDELAY setting of the data vs. the IDELAY setting of the clock. Neither is ideal; the MMCM adds extra phase error, and the tap delays of the IDELAY aren't precise (and you lose some window by using large data tap delays).

 

But as I understand it, you can program your ADC to be center aligned...

 

Avrum

Historian
1,137 Views
Registered: ‎01-23-2009

Re: Timing Closure at 500 MHz

I presume that with DDR DRAM the data stream can be regularly interrupted for rounds of timing calibration, obviously for a continuously operational ADC this is not an option!

 

No - DDR SDRAMs use a similar concept to what I described above. The 7 series devices have special resources in the IOB/Clock column (the PHASER) for dynamically tracking the phase of the DQS signal, which is the strobe for the DQ (the data). It uses this phase information (with a PLL) to ensure that the capture stays in the middle of the eye.

 

Avrum