09-23-2014 09:07 AM
09-23-2014 09:31 AM
A PLL removes the skew (deskews) introduced by the entire clock tree delay with respect to the input clock (or, better said, with respect to a specific point of the clock tree). It cannot deskew the individual end points of the clock tree as the clock "arrives" at the sequential elements. In other words, the PLL cannot remove skew that the clock tree introduces by travelling to different points across the chip, e.g. near vs. far. So perhaps floorplan your design such that critical components are placed closer together.
09-23-2014 09:43 AM
It might be more useful to describe the function of the PLL as removing clock insertion delay. While the process is often referred to as clock de-skew, it's only fixing the phase of the clock so that the phase of the clock at the endpoint is similar to the phase of the clock at the clock input pin.
Clock skew is different from clock insertion delay - and there is nothing the PLL can do to affect clock skew.
Think of it this way...
The clock arrives at a pin. From there it needs to be fanned out to hundreds of thousands of flip-flops. To do this a huge network comprising buffers, active interconnects and lots of routing is required. Xilinx has pre-designed this clock network (the global clock network) so that the clock reaches these hundreds of thousands of endpoints at "close to the same time". However, some endpoints will be reached sooner, and others later. In general, the time to all endpoints can be defined as (insertion_delay +/- clock_skew/2). The PLL can compensate for the insertion_delay (by adding an effective delay of (CLK_PERIOD - insertion_delay)), but it can't do anything about the skew - the range is still going to have a span of (clock_skew).
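The (insertion_delay +/- clock_skew/2) picture can be sketched in a few lines of Python. This is only a toy model: the period, insertion delay and skew values are invented for illustration and do not come from any data sheet, and the function names are hypothetical.

```python
# Toy model of clock arrival times; all numbers are made up for illustration.
CLK_PERIOD = 10.0       # ns
INSERTION_DELAY = 3.0   # ns, nominal delay from the clock pin to the endpoints
CLOCK_SKEW = 0.4        # ns, total spread of arrival times across endpoints


def arrival_window(insertion_delay, clock_skew, pll_delay=0.0):
    """Earliest and latest clock arrival (ns) over all endpoints."""
    center = insertion_delay + pll_delay
    return center - clock_skew / 2, center + clock_skew / 2


def phase_vs_input_edge(t, period):
    """Offset of time t from the nearest input clock edge, in ns."""
    return (t + period / 2) % period - period / 2


# Without a PLL: endpoints lag the pin by the insertion delay, +/- skew/2.
raw = arrival_window(INSERTION_DELAY, CLOCK_SKEW)

# With a PLL adding an effective delay of (CLK_PERIOD - insertion_delay),
# the window is centered on the next input edge instead.
comp = arrival_window(INSERTION_DELAY, CLOCK_SKEW,
                      pll_delay=CLK_PERIOD - INSERTION_DELAY)

raw_span = raw[1] - raw[0]
comp_span = comp[1] - comp[0]
comp_phase = tuple(phase_vs_input_edge(t, CLK_PERIOD) for t in comp)

print("uncompensated window:", raw, "span:", raw_span)
print("compensated phase vs input edge:", comp_phase, "span:", comp_span)
```

The compensated window is centered on an input edge, but its width is still the full clock_skew.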
09-23-2014 09:44 AM
09-23-2014 02:02 PM
Thanks for the prompt and comprehensive answers!
I understand all this, and that is why I asked the question. Many documents could avoid confusion by referring to insertion delay rather than skew as they do.
So, I'm back to square one, and now exploring somewhat unorthodox ideas. I will probably get shot down in flames here, but try this on for size:
1) Use a PLL and shift the phase slightly negative to balance between setup violations (which is all I get now) and hold time violations.
2) Take the feedback to the PLL from a point down the clock tree to do the same.
I just tried 2), and it appears to work. Synth/place/route comes out clean, and all violations are gone in that area. I did _not_ use the dedicated .CLKFBOUT (from what I gathered, the .CLKFBOUT is just another output port of the PLL), but rather the actual clock signal, after a BUFG placed near the gates in that region.
The question is, who's fooling who? Am I fooling the Xilinx timing analysis, or am I tricking the logic into doing the seemingly impossible?
09-23-2014 02:26 PM
What problem are you trying to solve?
If you are having skew issues from internal flip-flop to internal flip-flop there is nothing (at least nothing reasonable) that a PLL can do to help you. The PLL can move its output clock forward or backward, but that will move the clock arrival time at all flip-flops by exactly the same amount, so it will have no impact on any setup or hold analysis on a FF to FF path.
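A minimal sketch of why a common phase shift drops out of FF to FF analysis, using the usual textbook setup and hold slack expressions. The path numbers are invented for illustration, and the model ignores jitter, uncertainty and on-chip variation.

```python
def setup_slack(period, launch_clk, capture_clk, clk_to_q, data_delay, t_setup):
    """Setup slack (ns): data must arrive t_setup before the next capture edge."""
    return (period + capture_clk) - (launch_clk + clk_to_q + data_delay) - t_setup


def hold_slack(launch_clk, capture_clk, clk_to_q, data_delay, t_hold):
    """Hold slack (ns): data must stay stable t_hold after the same capture edge."""
    return (launch_clk + clk_to_q + data_delay) - capture_clk - t_hold


# Hypothetical path, in ns: clock reaches the two flip-flops 0.2 ns apart.
PERIOD, LAUNCH, CAPTURE = 5.0, 1.0, 1.2
CLK_TO_Q, DATA_DELAY, T_SETUP, T_HOLD = 0.3, 4.0, 0.1, 0.2

slacks = {}
for shift in (-0.5, 0.0, 0.5):
    # The PLL phase shift moves BOTH clock arrivals by the same amount.
    slacks[shift] = (
        setup_slack(PERIOD, LAUNCH + shift, CAPTURE + shift,
                    CLK_TO_Q, DATA_DELAY, T_SETUP),
        hold_slack(LAUNCH + shift, CAPTURE + shift,
                   CLK_TO_Q, DATA_DELAY, T_HOLD),
    )
    print(f"shift {shift:+.1f} ns -> setup {slacks[shift][0]:.3f}, "
          f"hold {slacks[shift][1]:.3f}")
```

Only the difference between the launch and capture arrival times (the skew) appears in either expression, which is why moving the PLL output changes nothing here.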
If you are having trouble on an input or output interface (i.e. capturing data at the input pins of an FPGA), then yes - it is normal to adjust the phase of the PLL or MMCM to place the clock in the middle of your data eye on input interfaces.
As for the CLKFBIN port, you have to connect it to the CLKFBOUT port. There are two legal configurations:
- directly (with no buffer between them)
  - do this if you don't care about the insertion delay of your clock - i.e. only when there are no input interfaces or system synchronous output interfaces associated with this clock
- with a clock buffer between them. When you use a clock buffer between them you have two choices for your input/output interfaces
  - clock the interface with the output of the clock buffer on the CLKFBOUT port (the wizards don't do this, but it is legal)
  - use an identical buffer on the CLKOUT0 (or other CLKOUT port of the MMCM/PLL) and use this to clock the interface
    - if you do not use identical buffers, your clock insertion will be improperly cancelled out, and you will have timing issues with your interface
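The reason for the "identical buffer" rule can be sketched with a very simplified model. It assumes the PLL locks so that CLKFBIN is phase-aligned with CLKIN, and that CLKFBOUT and CLKOUT0 leave the PLL with the same phase. The buffer delay is a hypothetical number, and the function name is made up.

```python
def load_phase_vs_clkin(clkout_path_delay, feedback_path_delay):
    """Phase (ns) of the clock at the loads relative to CLKIN.

    The loop aligns the END of the feedback path with CLKIN, so whatever
    delay sits in the feedback path is subtracted from the output path.
    """
    return clkout_path_delay - feedback_path_delay


BUFG_DELAY = 1.5  # ns, hypothetical buffer plus clock-network delay

# Identical buffers on CLKOUT0 and in the feedback path: delay cancels.
matched = load_phase_vs_clkin(BUFG_DELAY, BUFG_DELAY)

# Buffer on CLKOUT0 only, CLKFBOUT wired straight to CLKFBIN: the full
# insertion delay shows up at the loads.
unmatched = load_phase_vs_clkin(BUFG_DELAY, 0.0)

print("matched buffers   :", matched, "ns")
print("no feedback buffer:", unmatched, "ns")
```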
09-23-2014 03:28 PM
Actually, Mike has a point. The CLKFBOUT port/counter is fundamentally like any other clock output. So you can use another CLKOUT output for feedback. There are a few "gotchas" with that: The feedback frequency must match the CLKIN frequency at the PFD. Any positive phase shift of the feedback clock will negatively phase shift all other output clocks. Having said that, it's probably easier to simply use the BUFG in the feedback path to also drive logic. This is assuming you have a ZHOLD configuration (not external or internal or bufin) and therefore a BUFG (or BUFH or BUFR in some cases) is in the feedback path.
I don't get what you are trying to say by "get the feedback path from a point down the clock tree". Unless you managed to have a local route via interconnect as feedback, the feedback path is hard coded from the far end of the clock region (simplified explanation).
As Avrum said, phase shifting the clock won't do you any good if you just deal with skew once the data is on the chip.
09-25-2014 12:46 PM
Thanks for all the feedback, much appreciated!
@avrumw, the problem we're trying to solve is clock skew due to both wide and deep data paths in combination with a rather congested FPGA. We are using the SerDes and the clock paths get pretty long, hence my idea to adjust the clock phase slightly negative to increase the slack. We don't see any hold time violations so I assume there is room to do this. Adding to the problem are the wide paths used with BRAM, which spread the clock paths out over large areas of the chip.
I intend to adjust the clock phase _relative_ to the data from the SerDes to allow more slack (which is for timing closure). The idea to route the feedback from after the BUFG that actually drives all the FFs is based on the assumption that a high fan-out BUFG will delay the signal more than a lightly loaded one (as is suggested in Xilinx documentation), i.e. the slower rise time due to the higher capacitive load will affect the timing. Please correct me if I'm wrong in this assumption.
@ralfk, thanks, I realized that the in/out frequency must be the same. What I did (and think is actually working) was to place the BUFG near the logic that uses the clock, and then drive the PLL feedback from after the BUFG near said logic.
It looks something like this (I hope my poor explanation makes sense):
SerDes/CLKRX -> PLL/CLKOUT0 -> [wire to another clock region] -> BUFG -> (Many FF's on this path) -> [wire back to SerDes/PLL clock region] -> PLL/CLKFBIN.
I did not quite understand what you meant with ZHOLD configuration, could you enlighten me?
09-25-2014 05:21 PM
Mike, I think you will be best served by re-reading Avrum's post. You cannot use a PLL to adjust for skew between different regions of the FPGA.
Also note that the BUFG's are all in or near the center of the FPGA, so you cannot place a BUFG near certain logic, at best you can move some logic near the central location where all BUFGs are found.
ZHOLD is one of the compensation modes. See UG472 for a full description. Here is a quick take:
ZHOLD: Indicates the MMCM is configured to provide a negative hold time at the I/O registers.
EXTERNAL: Indicates a network external to the FPGA is being compensated.
INTERNAL: Indicates the MMCM is using its own internal feedback path so no delay is being compensated.
BUF_IN: Indicates that the configuration does not match with the other compensation modes and no delay will be compensated. This is the case if a clock input is driven by a BUFG/BUFH/BUFR or GTX/GTH/GTP.
There is no need to delay the clock after the SerDes: if you meet the setup and hold for the SerDes, then parallel bits clocked out of the SerDes are synchronous with the output clock of the SerDes, much as the Q output of a flip-flop is synchronous with the clock driving the flip-flop.
If the SerDes parallel data clock period is short--say less than 3 ns--and the SerDes is driving some logic that is close in and some logic that is far away on the FPGA, lagging the clock to fix setup problems for the logic that is far away will only create hold problems for the logic that is close in. In such a case you are better off to segment the data flow so no bits have to travel a long distance on the FPGA without being reclocked through an intermediate register.
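The setup-versus-hold trade-off described above can be illustrated with a small sketch. Every number is hypothetical and chosen only to make the effect visible. The launch clock (at the SerDes) is left alone and only the capture clock is lagged.

```python
# Hypothetical numbers (ns), chosen only to illustrate the trade-off.
PERIOD = 3.0
CLK_TO_Q = 0.3
T_SETUP = 0.1
T_HOLD = 0.2
NEAR_DATA_DELAY = 0.1   # short route to logic close to the SerDes
FAR_DATA_DELAY = 2.8    # long route to logic far away on the FPGA


def setup_slack(data_delay, capture_lag):
    """Setup slack when the capture clock is delayed by capture_lag."""
    return PERIOD + capture_lag - CLK_TO_Q - data_delay - T_SETUP


def hold_slack(data_delay, capture_lag):
    """Hold slack when the capture clock is delayed by capture_lag."""
    return CLK_TO_Q + data_delay - capture_lag - T_HOLD


results = {}
for lag in (0.0, 0.3):
    results[lag] = {
        "far_setup": setup_slack(FAR_DATA_DELAY, lag),
        "near_hold": hold_slack(NEAR_DATA_DELAY, lag),
    }
    print(f"lag {lag:.1f} ns: far setup {results[lag]['far_setup']:+.2f}, "
          f"near hold {results[lag]['near_hold']:+.2f}")
```

With these numbers the lag that repairs the far path's setup slack pushes the near path's hold slack negative, which is why an intermediate register on the long route is the safer fix.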
If you are having specific timing violations that you want to post, perhaps we can comment further.
09-26-2014 08:30 AM - edited 09-26-2014 08:33 AM
And I agree with Daniel!
We are having trouble understanding exactly what your issue is.
If you are having a specific timing violation (like a setup failure from the output data of a high speed transceiver) then let's talk about that problem and the conventional ways to fix it. Mucking with the clock is generally not how one fixes timing errors.
A couple of specific things... FPGA clocks are not ASIC clocks. They are not "load dependent" in the way they are for ASICs; the global clock network in an FPGA is already pre-routed, can span the entire FPGA, and is actively buffered at a whole bunch of connection points. So, the delay on the clock tree doesn't get larger with the number of endpoints; the skew may increase since there is some variation from load to load (so if you happen to use both the FF on the fastest route from the BUFG and the FF on the slowest route from the BUFG then you will get the largest skew report), but the overall delay won't change much. So you can't delay a clock by changing the load on the BUFG.
Also, you can't change how the PLL is connected to the global network. As I said, the global network is already pre-routed. This includes a pre-wired connection from the global domain to the CLKFBIN of the PLL/MMCM (which, depending on your design, may or may not be used). But, for a given clock network, there is only one such connection - you cannot control it.
If you do need to adjust the phase of clocks (and it doesn't look like this is an application where you should consider it), then you can use the phase shifting capabilities of the MMCM (instead of the PLL). As with the PLL, you can adjust the phase in increments of 1/8 of the VCO period. If you use the MMCM you can also use the fine phase shifter and go down to increments of 1/56 of the VCO period.
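To put the 1/8 and 1/56 step sizes in time units, here is a small calculation. The 1000 MHz VCO frequency is only an assumed example value; the real VCO frequency depends on your MMCM/PLL settings.

```python
def phase_steps_ps(vco_mhz):
    """Return (coarse, fine) phase step in ps for a given VCO frequency.

    coarse = 1/8 of the VCO period, fine = 1/56 of the VCO period.
    """
    vco_period_ps = 1e6 / vco_mhz
    return vco_period_ps / 8, vco_period_ps / 56


VCO_MHZ = 1000.0  # assumed example value, not taken from the thread
coarse_ps, fine_ps = phase_steps_ps(VCO_MHZ)
print(f"VCO {VCO_MHZ:.0f} MHz: coarse step {coarse_ps:.1f} ps, "
      f"fine step {fine_ps:.2f} ps")
```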
Next, the GTX/GTH data is clocked by your BUFG. When using a GTX/GTH you supply the GTX/GTH with an RXUSRCLK, which is supposed to be a buffered clock (via either a BUFH or BUFG) that clocks the receive (parallel) data of the GTX/GTH as well as the user logic that captures this data. Changing the timing of the output clock of the BUFG/BUFH won't make any difference unless you give them different clocks (which you really shouldn't do).
09-29-2014 09:39 AM
Daniel, Avrum & co,
many thanks for all clarifications, i learned a lot from this. We're back at floor planning now =)
If you are interested, i just stumbled on this, which is related to this thread.
09-29-2014 11:01 AM
Thanks for the paper, Mike.
The authors wrote and tested this using Virtex-5 and considering an Altera FPGA, but I am not sure they understand for Virtex-5 what at the time was called "Regional, I/O, and local clocks in addition to global clocks" (DS100) because they only describe the global clock distribution network. In fairness to them, Xilinx did not start stressing BUFH and BUFR until Virtex-6 and BUFMR's showed up in Virtex-7. The authors could have improved the skew they were calculating by having a BUFG feed a BUFH for each horizontal region.
But the amount of clock skew from using a single BUFG is not the problem most people have in meeting timing. The problem is signal congestion which leads to long signal routes. Consider one of two paths that are failing to meet timing in a design I am working on right now. The requirement is 3.103 ns; the implementation misses by 0.025 ns; the logic delay is 0.520 ns, the path delay 2.293 ns and the clock skew plus clock uncertainty from clock and system jitter is 0.315 ns. And I have system jitter set at 0.250 ns. So to solve my problem, what should I work on? Clock skew is at the very bottom of the list of things that would help. Not to mention all of the places you could create new problems while crossing between two clock domains, especially if you ever needed to cross back from the lagged domain to the original domain.
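The budget for that failing path can be tallied directly from the figures quoted above. This sketch simply treats the three quoted numbers as additive contributions, which is what makes the 0.025 ns miss come out.

```python
# Figures quoted in the post, in ns.
requirement = 3.103
contributions = {
    "logic delay": 0.520,
    "path delay": 2.293,
    "clock skew + uncertainty": 0.315,
}

total = sum(contributions.values())
slack = requirement - total
largest = max(contributions, key=contributions.get)

print(f"total {total:.3f} ns, slack {slack:+.3f} ns")
for name, ns in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  {name:26s} {ns:.3f} ns  ({100 * ns / total:.1f}% of total)")
```

The path delay is by far the largest term, so shortening routes is where the missing 0.025 ns is most easily found.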
09-29-2014 01:26 PM
Thanks for the heads-up. Interesting indeed to understand more about this. I'm puzzled about how a BUFH fed from a BUFG could be faster than just the BUFG alone. Would you care to elaborate?
Again, many thanks for taking the time to answer on this thread.
09-29-2014 02:00 PM
Good question. There are two effects. And it is not that it would be faster: it would add more delay, but the variation would be less. Let's compare these two paths using prop delays from the data sheet and guessing where we do not have numbers.
With just the BUFG, let's say the net prop delay ranges from 2 ps right next to a clock driver up to 150 ps as far away as you can get from a global clock driver.
If you add BUFHs after the BUFG, then you add 110 ps of BUFH prop delay, let's say 30 ps of BUFG-to-BUFH net prop delay, and since the BUFHs have less area to drive, maybe the net prop delay in that horizontal area is 1 ps to 75 ps.
Add up the prop delays, starting from the input to the BUFG, including the BUFG adding 100 ps of prop delay. In the BUFG-only circuit the delay is from 102 ps (100+2) to 250 ps (100+150) for a skew of 148 ps. In the BUFG plus BUFH circuit the delay is from 241 ps (100+110+30+1) to 315 ps (100+110+30+75) for a skew of 74 ps. So the total delay is more, but the skew is less. The difference is that BUFHs drive less area than BUFGs and so the skew should be less for a BUFH, but the delay will be more.
Does that make sense?
Now I could be wrong about this because Xilinx does not tell us what the maximum skew is for the various clock circuits, but as an area problem and if I have made reasonable estimates of what their clock trees probably look like, this is the way it might work.
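The same estimate, written out so the sums can be checked. All delays are the guessed figures from this post (in ps), not data sheet values.

```python
# Guessed prop delays from the post, in ps.
BUFG_DELAY = 100
BUFH_DELAY = 110
BUFG_TO_BUFH_NET = 30
BUFG_NET_RANGE = (2, 150)   # net delay: next to the driver .. far corner
BUFH_NET_RANGE = (1, 75)    # smaller area, so a smaller spread


def delay_window(fixed_delay, net_range):
    """(min, max) total delay for a fixed buffer delay plus a net delay range."""
    return fixed_delay + net_range[0], fixed_delay + net_range[1]


bufg_only = delay_window(BUFG_DELAY, BUFG_NET_RANGE)
bufg_bufh = delay_window(BUFG_DELAY + BUFG_TO_BUFH_NET + BUFH_DELAY,
                         BUFH_NET_RANGE)

bufg_only_skew = bufg_only[1] - bufg_only[0]
bufg_bufh_skew = bufg_bufh[1] - bufg_bufh[0]

print("BUFG only  :", bufg_only, "ps, skew", bufg_only_skew, "ps")
print("BUFG + BUFH:", bufg_bufh, "ps, skew", bufg_bufh_skew, "ps")
```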
09-29-2014 03:40 PM
No. BUFGs cannot directly drive into clock regions. There is always a BUFH driving into the clock region. In other words, a BUFH is always added after a BUFG whether you put it in your design or not. The two advantages of adding the BUFH explicitly in your design are: a) by LOCing, you can control the placement of the logic that sits on the BUFHs you added in series with the BUFG, and b) you can individually control the CE on the BUFHs that are driven by the BUFG in the design. On the other hand, if you drive a BUFH directly from a valid clock source then there is no BUFG inserted. So if you can do that you basically eliminate the vertical skew component and only deal with the horizontal skew from the BUFH source to the end point of the BUFH tree. If you go with the left/right horizontally adjacent BUFH combination, then the potential skew is, of course, larger than with a single BUFH used directly.
09-29-2014 03:58 PM
Thanks, Ralfk. There is lots of good information in your reply.
When does a BUFG get converted into a BUFG and BUFH? I just opened an implementation of a design and drew the schematic for the clock driving one register. It seems to be directly driven by a BUFG. But maybe something in the Bitstream Generation will replace it?
09-29-2014 04:53 PM
Look in the Device Viewer, or FPGA_EDITOR if you still use ISE. I assume the schematic view wants to retain the original representation and not the physical implementation.
09-29-2014 04:59 PM
09-29-2014 05:30 PM - edited 09-29-2014 05:35 PM
I did and it shows the same thing: the BUFG directly driving into clocking regions.
You can see eight BUFH in the last image. They were assigned by me elsewhere and all have names that I gave them.
The only thing I can guess is that the BUFH you are describing is part of the BUFG clock tree and is not accessible to a user--except as a BUFG--and so the tools don't show them.
Thanks again for your information.
09-29-2014 05:41 PM
09-29-2014 06:03 PM - edited 09-29-2014 06:04 PM
Interesting. I thought it was the other way around. I thought there were only 12 BUFH's per clock region because each clock region only has 12 horizontal clock lines, which are separate from the 12 BUFH's. UG472 puts it this way:
Each 7 series monolithic device has 32 global clock lines that can clock and provide control
signals to all sequential resources in the whole device. Global clock buffers (BUFGCTRL,
simplified as BUFG throughout this user guide) drive the global clock lines and must be
used to access global clock lines. Each clock region can support up to 12 of these global
clock lines using the 12 horizontal clock lines in the clock region.
Then they give us several diagrams that show this layout. The third in particular shows that a BUFG can drive through a BUFH, but the BUFG can also bypass the BUFH and directly access the 12 HROW lines and from there, the fabric.
09-29-2014 06:49 PM - edited 09-29-2014 06:52 PM
OK - Here's the deal re: BUFH and BUFG.
There are 32 BUFGs. These are located in the center of the die and drive only the 32 vertical spines of the global clock.
There are a number of clock regions on the die. These regions are fixed heights (a certain number of CLBs) and there is a left one and a right one - the boundary between them is where the vertical spines of the global clock are.
In the middle (horizontally) of each clock region are the horizontal spines of the global clock network. The number of them is different from family to family - in the V6 and 7 series there are 12, I think V5 only had 10 (but I am not sure). Each one of these horizontal spines is driven by a BUFH - there is one-to-one mapping of the BUFH to the horizontal spine.
The input to the BUFH can be any of the 32 vertical spines of the global clock. This is how/why the "no more than 12 global clocks in a region" rule exists.
The BUFHs have always been there - I think some primitive version of them exists as far back as Virtex-II. In earlier families, they had only simple functionality (they were simple buffers) and they could only take the vertical spines as inputs. As a result, they were uninteresting from the schematic point of view, so the BUFHs just weren't discussed/described/documented - they were merely the "un-named" driver from the vertical spines to the horizontal spines.
In later families (I think as early as the V5 - certainly in the V6 and 7 series), additional capability was added to the BUFH:
- it got a CE pin so it could become a BUFHCE
- it was able to be driven by inputs other than the 32 global spines - notably the outputs of CMTs, HSSIO and the clock capable I/O in the same region (actually the left/right pair of regions, depending on the technology)
Since these additional usage models were created, it was necessary to name the BUFH. So now the BUFH is used either
- by direct instantiation
  - useful for the BUFHCE and connecting to sources other than the global clock spine
- or silently whenever a global clock network drives load in a region
  - the tool chooses one of the BUFH's and its associated horizontal spine to bring the global clock into the region
  - the tool can do this in as many regions as it needs to - if the clock is needed in 5 clock regions, one BUFH in each region is "assigned" to that global clock (and hence connected to the particular vertical spine)
So, whether you instantiate the BUFH or not, it is always there.
The "silent" BUFHs are never shown in the schematic - they are added at placement time based on where the loads are placed in the device (and hence which clock regions need which clocks).
They are, however, visible in the device viewer with "Routing Resources" view enabled (the pictures posted earlier by @dwisehart were without the routing resources shown; the ratsnest view shows the connection from the driver to the load, the intervening "silent" BUFHs are just considered to be part of the route, and hence aren't shown in this view).
The attached picture shows the intersection of the vertical and horizontal spines. Here we see the 12 BUFHs on a side. One of these (the lowest one) is a manually instantiated BUFHCE - note that it shows as a "used cell" (it is colored blue). However, the 9th and 10th (from the bottom on the left) have clocks running through them, even though they are not shown as a "used cell" (they are not colored blue) - these are the automatically inserted BUFHs. They don't show up as cells in the design, but they are none-the-less part of the routing of the global clock into the horizontal spine for this region. So from a delay point of view, they exist...
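As a rough mental model of the "silent" BUFH assignment and the resulting 12-clocks-per-region limit, consider the sketch below. It is not how the placer is implemented: the region names, clock names and the first-free-slot policy are all hypothetical.

```python
BUFH_PER_REGION = 12  # horizontal spines (and BUFHs) per clock region


def assign_bufhs(clock_regions):
    """Assign one BUFH per (clock, region) pair.

    clock_regions: dict mapping clock name -> list of regions with loads.
    Returns dict mapping region -> {clock name: BUFH index}.
    Raises ValueError if a region needs more than BUFH_PER_REGION clocks.
    """
    assignment = {}
    for clock, regions in clock_regions.items():
        for region in regions:
            slots = assignment.setdefault(region, {})
            if len(slots) >= BUFH_PER_REGION:
                raise ValueError(f"region {region}: more than "
                                 f"{BUFH_PER_REGION} global clocks")
            slots[clock] = len(slots)  # hypothetical policy: first free BUFH
    return assignment


# One global clock with loads in 5 regions uses one BUFH in each of them.
five_regions = assign_bufhs(
    {"sys_clk": ["X0Y0", "X0Y1", "X0Y2", "X1Y0", "X1Y1"]})
bufhs_used = sum(1 for slots in five_regions.values() if "sys_clk" in slots)
print("BUFHs used by sys_clk:", bufhs_used)

# Thirteen different global clocks in one region cannot be routed.
try:
    assign_bufhs({f"clk{i}": ["X0Y0"] for i in range(13)})
    overflowed = False
except ValueError as err:
    overflowed = True
    print("error:", err)
```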
09-29-2014 07:01 PM
Just as you said. Here are the same BUFH's I showed an earlier picture of, this time with Routing Resources turned on. The eight I placed (for TXOUTCLK and RXOUTCLK from the GT) are blue, and two more are used for my oExtClk without color.
UG472 could probably be updated a bit to show that access to the HROW from the BUFG always goes through the BUFH, but I know that is not your department.
09-29-2014 07:05 PM
As for the paper, it is academically interesting. The problem is (presumably) real - that the on-chip variation of the clock distribution network adds a variable amount of skew (on top of the fixed amount of skew), and that the variable amount has to be guard banded.
In theory, the stuff proposed here could minimize it (at least somewhat). Using the CMT to cancel out skew would require dynamic calibration; the variable skew is (by definition) dynamic and chip dependent, and the phase error between two outputs of a CMT is large relative to the variation (on the order of +/-100ps), and is also dynamic. So, you would have to dynamically measure not only the clock tree skew but also the CMT phase error and then dynamically choose a tap setting that cancels them out. Again, academically interesting, but practically - well, it just isn't. The amount of effort required to do this is huge - you would be better off just re-pipelining your design!
As for choosing which BUFH to use - again, academically interesting. I am surprised that the difference was measurable, since the BUFHs for a particular region are right beside each other - but according to the paper, it can be measured. Again, academically interesting - but practically..... not. The only way to choose a BUFH is to manually instantiate it and LOC it to a particular site. The moment you do so, your clock becomes segmented. When the tool automatically instantiates BUFHs, it can do so without assigning a net to the BUFH output - logically, all the segments (on all horizontal spines) are part of the same net (the global clock net). Thus, the router takes care of connecting to all the loads of a global clock regardless of which clock region they end up in (and hence which automatically inserted BUFH is used).
To instantiate a BUFH, you create a new net on the output of each of the BUFHs. So, if you needed a particular clock in 5 clock regions and you wanted to choose the BUFHs you would have to
- determine which loads will end up in which clock regions (something normally done by the placer)
- before synthesis (or via an ECO) split your global clock into 5 different nets, each driven by a different BUFH
- drive all the loads in one clock region with the correct BUFH output
- LOC the 5 BUFHs
Doing so means that you cannot allow the placer to decide where to put the loads - you have to split your loads into different clock regions before placement. Really impractical...
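To make the bookkeeping concrete, here is a sketch of the manual partitioning those steps imply. The load names, region names and net/BUFH naming scheme are hypothetical. In a real flow the region of each load is only known after placement, which is exactly the problem.

```python
from collections import defaultdict


def split_clock_by_region(load_regions):
    """Group clock loads by clock region, one BUFH and one net per region.

    load_regions: dict mapping load (flip-flop) name -> clock region.
    Returns dict mapping region -> {"bufh": ..., "net": ..., "loads": [...]}.
    """
    by_region = defaultdict(list)
    for load, region in load_regions.items():
        by_region[region].append(load)
    return {
        region: {
            "bufh": f"bufh_{region}",   # instance you would instantiate and LOC
            "net": f"clk_{region}",     # new net on that BUFH's output
            "loads": sorted(loads),
        }
        for region, loads in sorted(by_region.items())
    }


# Hypothetical pre-placement decision about where each load will live.
plan = split_clock_by_region({
    "ff_a": "X0Y0", "ff_b": "X0Y0", "ff_c": "X0Y1",
    "ff_d": "X1Y0", "ff_e": "X1Y1", "ff_f": "X0Y2",
})
for region, entry in plan.items():
    print(region, entry["bufh"], entry["net"], entry["loads"])
```

Every load has to be tied to a region-specific net up front, so the placer can no longer move it to another clock region.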