u6113500@anu.edu.au
Adventurer
2,374 Views
Registered: ‎09-10-2018

UltraScale+ clocking problem (IBUFDS -> BUFG, BUFGCE_DIV -> SERDES)


Hi all,

I am using a clock-capable pin in a system like this:

[Image: Untitled Diagram.jpg — clocking structure diagram]

It is copied from the example in XAPP1324 (v1.1, August 23, 2018), Figure 6.

-- Assumes the usual UNISIM declarations: library UNISIM; use UNISIM.VComponents.all;
CLKIO : IBUFDS
  port map (I => XDR_P, IB => XDR_N, O => IOClk_i);

CLKBUF : BUFG
  port map (I => IOClk_i, O => IOClk);

CLKDIV : BUFGCE_DIV
  generic map (
    BUFGCE_DIVIDE   => 2,   -- Divide the input clock by 2
    -- Programmable inversion attributes: built-in inversion on specific pins
    IS_CE_INVERTED  => '0', -- Optional inversion for CE
    IS_CLR_INVERTED => '0', -- Optional inversion for CLR
    IS_I_INVERTED   => '0'  -- Optional inversion for I
  )
  port map (
    O   => adc_clk_2, -- 1-bit output: divided clock
    CE  => '1',       -- 1-bit input: buffer enable
    CLR => ResetDly,  -- 1-bit input: asynchronous clear
    I   => IOClk_i    -- 1-bit input: clock from the IBUFDS
  );

adc_clk_2 goes to the IDELAY and the SERDES; IOClk is the high-speed clock for the SERDES.

There is nothing else connected to XDR_P/XDR_N.

 

This is the error that I am getting:
[Screenshot: placement error message]

 

Any ideas?


12 Replies
avrumw
Guide
2,353 Views
Registered: ‎01-23-2009

There is nothing inherently wrong with this set of connections.

Are you sure the CLK_P and CLK_N are set to PACKAGE_PIN locations that correspond to a global clock capable pin - one that has "GC" in the pin name (not a DBC/QBC pin). The error message seems to imply that it is, but you often get the "Sub-optimal placement..." message if the input clock is not on a global clock capable pin.
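
For example, you can check this from the Vivado Tcl console (the pin name below is just a placeholder - use your actual clock pin):

# Returns 1 if the package pin is a global clock (GC) capable pin
get_property IS_GLOBAL_CLK [get_package_pins AH23]
# The pin function name of a GC pin also contains "GC"
get_property PIN_FUNC [get_package_pins AH23]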

If it isn't this, then there is some reason why the BUFG and BUFGCE_DIV are not placed on the sites that can be reached from this pin. Maybe the clocking resources in this bank are used by something else? Maybe the BUFG/BUFGCE_DIV are LOC'ed by a user or IP generated constraint? Maybe the output of this IBUFDS is being used somewhere else as well (maybe erroneously - like you meant to use the output of the BUFG or BUFGCE, but accidentally used the name of the net that is the output of the IBUFDS).

So

  • review your RTL code to make sure that all connections are as they should be,
  • take a look at all the XDC files associated with the project and the IP for a LOC constraint (see the Tcl sketch after this list),
    • I am not sure if the LOC constraints would show up in one of the GUI windows (try Window -> Physical Constraints).
  • take a look at what else is going on in this I/O bank - are there any other clocks coming in, and are clocking resources used?
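
A quick Tcl sketch for the LOC check (run in the Vivado Tcl console after synthesis; the exact cell names will differ in your design):

# List any clock buffers that have been LOC'ed by a constraint
foreach buf [get_cells -hierarchical -filter {REF_NAME =~ "BUFG*"}] {
    set loc [get_property LOC $buf]
    if {$loc ne ""} { puts "$buf is LOC'ed to $loc" }
}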

Those are some ideas... Good luck!

Avrum

u6113500@anu.edu.au
Adventurer
2,341 Views
Registered: ‎09-10-2018

I am confused. 

In my design CLK_P/CLK_N are used to clock out the data from the ADC. It is an 8-bit, 4-channel ADC.

So there are 4 clock buses and 4 data buses (all of them using the same setup).

[Image: Untitled Diagram.jpg]


Clocks 1 & 2 and data buses 1 & 2 are placed in bank 65.
Clocks 3 & 4 and data buses 3 & 4 are placed in bank 66.

However, there is only one GC/QBC pin per bank. 

From UG571 (v1.12), August 28, 2019:

"The two central byte groups (1 and 2) each contain clocks quad byte clock (QBC) and global clock (GC)-capable input pins or pin pairs. The QBC pins can be used as capture clock inputs for the nibble or byte group they are placed in, but they can also deliver a capture clock through a dedicated clock backbone to all other nibbles and byte groups in the I/O bank. The GC pins are clock inputs that can drive MMCM and/or PLL primitives. Some of these clock-capable inputs have dual function capabilities—QBC and GC. The upper and lower byte groups each contain dedicated byte clock (DBC) clock-capable input pins (pin pairs) that can be used for clocking inside the byte group but do not have the capability to drive a capture clock to other byte groups in the I/O bank or to drive MMCM or PLLs in the I/O bank."

Why can't I use a QBC (not GC/QBC) pin to clock out the data from the byte group associated with that clock?

avrumw
Guide
2,332 Views
Registered: ‎01-23-2009

So, it is confusing, but you are mixing terms associated with the two "styles" of interfaces in UltraScale.

What you are using is "Component mode" - this is the legacy mode that is "compatible" with the clocking structure of earlier Xilinx families (7 series, Virtex-6, and some earlier ones). In component mode, the concept of the nibble/byte group and the DBC/QBC are irrelevant; the only things that matter are the "GC" notations and the concept of the I/O bank (as a whole). Within each bank there are four differential GC pairs (for up to 4 single-ended or differential "global" clocks). Each of these has a connection to the global clocking structures (the BUFG/BUFGCE/BUFGCE_DIV).

The other mode is "Native mode". In Native mode, you don't use the global clocking, and hence the "GC" notations are meaningless. Here, the clocking is managed by the RX_BITSLICE, which uses the DBC/QBC clocks to capture data (directly - with no buffer).

Some pins (namely one of the 4 sets of GC pairs) also double as a QBC pair.
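
You can list the GC-capable pins of a bank from the Tcl console (the bank number here is just an example):

get_package_pins -filter {IS_GLOBAL_CLK == 1 && BANK == 66}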

(Ah, now I see your problem)...

Since you are using component mode, you must use the GC pins. For one of your pin pairs, you are using the pair that is both a GC and a QBC pair, so this one works. But it seems that for the other you are using a pair that is only a QBC pair. The QBC pair cannot reach the BUFGs, which is why you are getting this error.

So, this is a problem. Is your board already designed? Can you move the other clock to one of the other GC pairs (i.e. AH23/AH22)? If not, then things are going to get messy...
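
If you can move it, the fix is just the pin constraints - something like this (a sketch only; the P/N polarity of the AH23/AH22 pair and the port names need to be checked against your actual pinout):

set_property PACKAGE_PIN AH23 [get_ports ADC1_CLK_P]
set_property PACKAGE_PIN AH22 [get_ports ADC1_CLK_N]
set_property IOSTANDARD LVDS [get_ports {ADC1_CLK_P ADC1_CLK_N}]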

Avrum

(Accepted solution)

u6113500@anu.edu.au
Adventurer
2,319 Views
Registered: ‎09-10-2018

Thanks for the help!

So the solution would be either to change the layout or move to Native mode?

Anton

avrumw
Guide
2,302 Views
Registered: ‎01-23-2009

So the solution would be either to change the layout or move to Native mode?

Yes.

But don't minimize how different using Native mode will be - it's not a "simple replacement" for component mode...

Also, before you go any further, have you analyzed the timing of this interface? Is it going to be captured statically, or are you planning to use dynamic calibration? If it is done statically, have you mocked up the interface, written accurate constraints, and performed timing analysis?

Avrum

u6113500@anu.edu.au
Adventurer
2,299 Views
Registered: ‎09-10-2018

The data is captured statically at 1250 Mbps (625 MHz clock, DDR).

At this point I have constrained the clock (create_clock -period 1.600 -name IOClk0 -waveform {0.000 0.800} [get_ports ADC0_CLK_P])

and set this rule for the clocks connected to the QBC pins:

set_property CLOCK_DEDICATED_ROUTE FALSE [get_nets ADC0/CLKIO/O]
set_property CLOCK_DEDICATED_ROUTE FALSE [get_nets ADC1/CLKIO/O]

The design meets timing and works. I am not sure what to expect from the sub-optimal clock placement...

 

avrumw
Guide
2,213 Views
Registered: ‎01-23-2009

Whoa...

You did not mention the set_input_delay commands on these pins. Without these (and unless they are completely accurate), the fact that the "design meets timing" is meaningless. If you have not constrained the inputs, the tools will not time them.

But, before you go and start working on how to constrain them (or changing the board layout), I can tell you that 1250Mbps/pin is WAY too fast for static capture using component mode. Typically you need at least a 1.5ns (or possibly more - especially on UltraScale/UltraScale+, and significantly more so with the CLOCK_DEDICATED_ROUTE=FALSE) stable data valid window for static capture. At 1250Mbps your total bit period is only 800ps - way too small.
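
For reference, a minimal sketch of what properly constraining this DDR input would look like (the data port pattern and the +/-0.500 ns values are placeholders - the real values must be derived from the ADC datasheet and board delays):

# IOClk0 is the clock already created with create_clock above;
# the data toggles on both edges of the forwarded clock (DDR)
set_input_delay -clock IOClk0 -max 0.500 [get_ports ADC0_D*]
set_input_delay -clock IOClk0 -min -0.500 [get_ports ADC0_D*]
set_input_delay -clock IOClk0 -max 0.500 -clock_fall -add_delay [get_ports ADC0_D*]
set_input_delay -clock IOClk0 -min -0.500 -clock_fall -add_delay [get_ports ADC0_D*]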

So you need to use dynamic capture. The question now becomes how? Depending on two things, this may be a good candidate for using Native mode:

  • What is the clock/data phase relationship coming from your external device?
    • Native mode with Built-In Self Calibration (BISC) works best with perfectly edge-aligned or perfectly center-aligned interfaces.
  • Can you tolerate some unknown latency in the input capture mechanism?
    • Do your different interfaces need to remain tightly temporally aligned, or can you tolerate a 1 or possibly 2 sample difference in latency?

Avrum

alexis_jp
Explorer
2,174 Views
Registered: ‎09-10-2019

@avrumw Any sources?

From DS923 (v1.13):

[Image: DS923 Table 23 — maximum input data rates]

I understand 1250 Mbps for component mode might be too high, but why do you say at least 1.5 ns? That's <666 Mbps, way smaller than the max 1250 Mbps using IOSERDES DDR.

Also, if his clock isn't connected to GC pins, it's game over with component mode - unless he can get patterns from the ADC to calibrate an MMCM fabric clock to be used as the capture clock.

What do you mean by static/dynamic calibration (other than BISC for native mode)?

Thanks

avrumw
Guide
2,151 Views
Registered: ‎01-23-2009

One of the basic concepts we use in the design of synchronous digital circuits is static timing analysis (STA). For STA, the design is considered to be a set of timing paths, where each path starts at a clocked element, ends at a clocked element, and passes through any number of combinatorial cells and the nets between them. In STA we have two main checks - setup checks and hold checks.

In digital circuits, every delay is dependent on three main variables - together called PVT:

  • Process:
    • The manufacturing variability from device to device
    • The range of delays is bounded by what Xilinx allows for the different speed grades
  • Voltage:
    • The VCC voltage of the transistors
    • This is constrained to within a certain percentage variability from the nominal supply voltage
  • Temperature
    • The temperature of the transistors
    • This is constrained to be within a legal limit defined by the temperature grade of the device (commercial, industrial, automotive)

Together these form a cube of possible delays for each cell. For a design to "close static timing", the path must meet both its setup and hold check requirement for any delay within this cube (and "reasonable" combinations of them from cell to cell - I won't cover that here).

This is true for all internal paths, but also for paths that start or end outside the FPGA - including input interfaces.

If you properly constrain your inputs, providing the actual minimum and maximum arrival time of the input with respect to the clock (taking into account the source device - including its variation across PVT - board delays, and signal integrity issues), then the tool has what it needs to perform STA.

If the setup check and hold check on these input paths meet timing with a given fixed configuration of the FPGA (i.e. any programmable delays used are set statically by the bitstream and are not changed), then the interface is statically captured. With the statically chosen clocking structure and programmable delay values, this interface will work at any combination of PVT for both the FPGA and the external part. This is the best situation, because you have basically proven that the interface will be reliable on all boards as long as they are operating within the legal limits of voltage and temperature.

To determine if an interface can be statically captured, you use the tools (Vivado). You design the clocking structure, set any programmable delays to fixed values, apply proper clocking (including jitter) and input constraints, and perform STA. If it passes, you are done.

From experimentation and experience you can tell what is feasible in different devices with different speed grades (and with different clocking structures). There is even some guidance given in the Xilinx datasheets for the static setup and hold requirements for a given clocking style in each FPGA - you can take a look at this post on clocking structures for more information (although it hasn't been updated for UltraScale/UltraScale+, which I am hoping to do soon) - it is worth noting, though, that the results from Vivado supersede the numbers in the datasheet (and they don't match - PET PEEVE!). The net result is that if you have less than about 1.5ns of valid data window after the uncertainties of the source device and the board, you cannot statically capture an interface (and you need more or significantly more depending on your clocking structure and exactly what the clock/data phase relationship is).

If you can't capture an interface statically (i.e. you don't have enough of a stable static data window), then you need to use dynamic capture. In dynamic capture (also called dynamic phase adjustment or DPA), the programmable delays available in the FPGA (the phase shift of the MMCM or the tap delays of the IDELAY) are dynamically changed to "hunt" for the data window.

The external device's datasheet defines the guaranteed data valid window - the window common to every possible device at every possible PVT value. Any given device (at a given PVT) will provide at least this window, but it may also have valid data outside this window; for example, if it is at a fast PVT, the earliest arrival time may be earlier than the datasheet value, but the earliest removal time will be at the minimum value defined by the datasheet. If it is a slow device the opposite is true.

For an FPGA, the setup/hold time reported by the tool is also the worst case over PVT. However, the actual "sample window" for a given device is a smaller subset of this time. Which subset, though, varies across PVT, so for static capture you need to account for the whole setup/hold window.

In dynamic capture, a control circuit tunes the programmable delays using feedback from the interface (and there are many ways of doing this) so that the sample window of the FPGA is moved dynamically to overlap with (and, in a good dynamic capture mechanism, sit in the center of) the data valid window provided by the device. As the voltage and temperature of the system drift (changing these delays), the dynamic calibration mechanism adapts and keeps moving the sample window into the best position - if the calibration is "full-time dynamic calibration", which continually adjusts for PVT. You can also do power-on dynamic calibration, which only compensates for process and initial voltage/temperature; but then voltage and temperature variation will move the windows over time, which won't be compensated.

With dynamic calibration, you can sample MUCH faster interfaces - the actual sample window of a Xilinx flip-flop is very small (a few hundred picoseconds), so in theory you can capture very fast interfaces.

If your sending device is close to ideal, and you have the "best possible" dynamic calibration mechanism (and that will vary from design to design), then you can capture interfaces at the speeds shown in the Table 23 you posted. But, again, this is only with dynamic calibration - without it, the maximum frequency you can capture is MUCH slower...

But the major disadvantage of dynamic capture mechanisms is that they can't be "proven" - static timing analysis is useless here, since the actual requirements are based on the dynamic capture control circuit. So the only way to "validate" it is in the lab, and even in the lab you don't have control over the full range of PVT, so you can't actually guarantee that your dynamic capture will work over all combinations of external device and FPGA PVT. At some point you just have to take a leap of faith, and that is something that digital designers are often uncomfortable with - sending a device or product into the field without knowing that you have proven it will work at all legal PVT corners... (Although this may not be the case with the Built-In Self Calibration of the UltraScale I/O in native mode.)

Avrum

alexis_jp
Explorer
2,090 Views
Registered: ‎09-10-2019

That post will be useful for future viewers.

At that frequency, I believe it's quite impossible to do static analysis, or you need to control the PCB design of your board and basically be sure the board's traces are quasi-perfect.

Moreover, I've read in this post that dynamic calibration isn't necessary when used with Native mode. I don't understand why. The BISC is present in both modes.

Is it because Native mode (HSSIO only, or even if we instantiate primitives?) allows selecting a QBC pin for the clock/strobe to be used as the capture clock? I read somewhere HSSIO can do the dynamic calibration automatically by selecting Center/Align, but how can HSSIO do it if there is no specific pattern?

That feature isn't documented well enough to clearly understand what it's doing underneath.

Btw, however amazing the HSSIO can be, I gave it up since simulators give different results for the latency of the serialization.

avrumw
Guide
2,028 Views
Registered: ‎01-23-2009

At that frequency, I believe it's quite impossible to do static analysis, or you need to control the PCB design of your board and basically be sure the board's traces are quasi-perfect.

The traces aren't really the problem here - the interface is just too fast for the FPGA. And, to be clear, you can do static analysis, but you will fail timing! So you can't do static capture - you need dynamic capture.

Moreover, I've read in this post that dynamic calibration isn't necessary when used with Native mode. I don't understand why. The BISC is present in both modes.

To be clear, BISC is dynamic calibration - it is "Built-In Self-Calibration", which could just as easily have been described as "Built-In Dynamic Calibration" (but BISC is much easier to say than BIDC). So you are doing dynamic calibration, but the entire system is controlled by hard logic, rather than needing some fabric-based control logic that tunes the delay taps (the BISC logic controls the delay taps for you) - although it is a bit of a different kind of dynamic calibration (see below). While the documentation does seem to imply there is BISC in component mode, I have no idea how it works - the documentation (and answer records) talk mostly/only about BISC in native mode.

Is it because Native mode (HSSIO only, or even if we instantiate primitives?) allows selecting a QBC pin for the clock/strobe to be used as the capture clock?

So the documentation is a bit confusing. There are a number of differences between component mode and native mode. One of the more important ones is that the clocking of the capture flip-flops is not done using fabric (global) clocks - the DBC/QBC clocks remain entirely in the RX_BITSLICE and clock the capture flip-flops directly. The downside of that is that you cannot directly access the sampled data from the fabric (since the DBC/QBC clocks are not available in the fabric). To solve this, the sampled data is placed in the native mode FIFO, which is a (minimal) clock crossing FIFO; you can read the data out of this FIFO with a fabric clock that is mesochronous (or "system synchronous") to the divided-down DBC/QBC clock. But the penalty of this is that it adds latency and the latency is not fixed - it can vary by up to one parallel word clock (which is why I asked if latency or channel coherency were important - the FIFO causes a problem for both of these).

I read somewhere HSSIO can do the dynamic calibration automatically by selecting Center/Align, but how can HSSIO do it if there is no specific pattern?

First, don't conflate Native mode with BISC. BISC is an option that can be used in Native mode, but it isn't required (although there is little documentation and even less information on timing regarding using Native mode without BISC).

Also, you need to understand what BISC is doing. It isn't doing what "normal" dynamic calibration is doing (or DPA), which is dynamically trying to find the center of the valid data window (wherever it is). To do that you do need a known pattern (or at least a known transition density). BISC is doing something different.

One of the main sources of uncertainty that affect the speed of input interfaces is the internal delays. While a clock/data timing relationship may be known at the pins of the FPGA, there are some uncertainties on both the clock and data paths between the FPGA pins and the actual capture flip-flop itself. It is because of these uncertainties that the "pin-to-pin" setup and hold time requirements are so large (needing the >1.5ns of stable data).

In BISC, the device is cancelling out all these PVT-variable delays so that the relationship at the pins is exactly equal to the relationship at the sample flip-flops. Said another way, the BISC ensures that the sample flip-flops sample the data at the exact point that corresponds to the edges of the clock when viewed from the point of view of the FPGA pins. Looking at answer record 68618, the resulting timing is "known" - the (LVDS) data at the pins needs to be stable for 169.1 ps centered around the edges of the clock (assuming RX_CLK_PHASE_P=SHIFT_0) or exactly in the middle of the clock period (if RX_CLK_PHASE_P=SHIFT_90). So that works great for perfectly center-aligned or perfectly edge-aligned interfaces, but not necessarily for interfaces that are not perfectly edge/center aligned. That's why I asked about the clock/data phase relationship of your signals - are they close enough to perfectly edge or center aligned?

But, clearly, if the input timing is edge/center aligned and latency and channel coherency are not critical to you, then using native mode with BISC allows for substantially faster input interfaces than component mode.

Avrum

alexis_jp
Explorer
1,967 Views
Registered: ‎09-10-2019

@avrumw I would have been happy if I could have found everything you said in the documentation...

But the penalty of this is that it adds latency and the latency is not fixed

For a memory controller that is a huge problem. I believe the latency is fixed per power-up. The FFs aren't clocked by a fabric clock, which is the reason why we can't bypass the FIFO in Native mode. Understood!

Thanks for your explanations!
