10-08-2018 01:39 PM
I'm trying to implement a source synchronous edge-aligned input DDR interface with the clock rate at 300 MHz. I route the clock thru an IBUFDS -> BUFG -> MMCM -> BUFG with the MMCM feedback path also routed thru a BUFG. The MMCM is set to multiply up and divide down by three so the VCO runs at 900 MHz.
The data gets routed thru an IDELAY -> IDDR with the IDDR set to same edge pipelined.
In the timing analyzer I am seeing the clock time from the BUFG to the IDDR vary by over 2 ns across the Min at Slow Process Corner to the Max at Fast Process Corner. The Min / Max input delays are +/- 100 ps.
The 2 ns variance is larger than the 1.67 ns data eye width, and that is before accounting for setup / hold requirements, which make the situation worse. Based on the timing results, it appears that I can never get this to work without some sort of dynamic alignment that tracks temperature.
This seems wrong. I've done this sort of interface before in older, less capable parts and never seen this kind of clock delay variance. What gives?
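As a rough sanity check on those numbers, here is a minimal sketch of the budget arithmetic. The function names are made up for illustration. It naively subtracts the full reported clock variation from the eye, which is the pessimistic reading of the report described above and not how the timing analyzer itself computes slack.

```python
def ddr_bit_period_ns(clock_mhz: float) -> float:
    """Bit period of a DDR interface: two bits per clock cycle."""
    clock_period_ns = 1000.0 / clock_mhz
    return clock_period_ns / 2.0


def naive_margin_ns(bit_period_ns: float,
                    clock_variation_ns: float,
                    input_skew_ns: float) -> float:
    """What is left of the eye after subtracting the clock delay variation
    and the +/- data-to-clock skew at the pins (counted on both sides).
    Setup / hold requirements are NOT included, so the real margin is worse."""
    return bit_period_ns - clock_variation_ns - 2.0 * input_skew_ns


bit_period = ddr_bit_period_ns(300.0)           # 300 MHz DDR clock
margin = naive_margin_ns(bit_period, 2.0, 0.1)  # 2 ns variation, +/- 100 ps skew

print(f"bit period  : {bit_period:.3f} ns")
print(f"naive margin: {margin:.3f} ns")
```

A negative margin here only says that the raw variation exceeds the eye; it does not account for any delay that is common to the clock and data paths.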
10-08-2018 05:33 PM
10-08-2018 05:38 PM
First, you should not have the first BUFG in the clock path - the clock should go
IBUFDS -> MMCM/CLKIN -> MMCM/CLKOUTx -> BUFG -> IDDR/C
(the IBUFDS must be on a clock capable pin in the same region as the MMCM).
The extra BUFG is going to add LOTS of uncertainty.
Second, looking at the min to max variation of the clock alone is not enough. The MMCM will cancel out much of the clock path delay, and some of the remaining uncertainty will also be removed by the tool's "pessimism removal".
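A toy illustration of the cancellation point, assuming the MMCM is used in a deskew arrangement where the feedback path (through its own BUFG) roughly mirrors the forward clock path. Every delay value below is invented for illustration and does not come from any timing report; `CornerDelays` and `residual_insertion_ns` are hypothetical names. Pessimism removal is not modeled at all.

```python
from dataclasses import dataclass


@dataclass
class CornerDelays:
    """Hypothetical delays (ns) for one timing corner."""
    forward_ns: float   # MMCM clock output -> BUFG -> clock net -> IDDR/C
    feedback_ns: float  # MMCM feedback output -> BUFG -> clock net -> feedback input


def residual_insertion_ns(corner: CornerDelays) -> float:
    """Delay left after the MMCM nulls out the feedback path delay."""
    return corner.forward_ns - corner.feedback_ns


# Made-up example values: both paths are slow in the slow corner
# and fast in the fast corner, so they track each other.
slow = CornerDelays(forward_ns=3.9, feedback_ns=3.7)
fast = CornerDelays(forward_ns=1.8, feedback_ns=1.7)

raw_spread = slow.forward_ns - fast.forward_ns
compensated_spread = residual_insertion_ns(slow) - residual_insertion_ns(fast)

print(f"raw forward-path spread : {raw_spread:.2f} ns")
print(f"spread after MMCM deskew: {compensated_spread:.2f} ns")
```

With these invented numbers the forward path alone moves by about 2 ns between corners, but the part that matters for capture is only the mismatch between the forward and feedback paths, which is much smaller.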
All that being said, 300MHz DDR in a source synchronous interface is "difficult". The data valid window is 1.67ns (as you mentioned), which may be too small to capture statically - in the 7 series this was too fast for MMCM/BUFG capture, but was just on the edge of what could be done with "ChipSync" capture (BUFIO/BUFR). Take a look at this post on input capture clocking architectures.
But UltraScale/UltraScale+ has no direct analogue to the BUFIO. In theory the IBUFDS -> BUFGCE/BUFGCE_DIV can be used instead (without an MMCM), but I have no experience with this, so I don't know what the practical clock speed limit is.
If you can't get the interface to meet timing with these architectures, then you will need to consider dynamic capture...
10-08-2018 08:15 PM
Thanks for responding to my question. I appreciate the responses! Also, thanks for the link. So to answer some questions (from both responses):
1. The clock is edge aligned with the data.
2. The data arrives at the input pins within +/- 100 ps of the clock arrival at its input pins.
3. The clock is on a clock capable I/O pin.
4. In Figure 3-9 of UG572 (Ultrascale Clocking Resources), which shows how to deskew a clock, the clock input goes thru an "IBUFG", which I interpreted to be a BUFG. Good to know that an "IBUFG" is really an IBUF or, in my case, an IBUFDS.
5. I've actually tried to get this interface to work without the MMCM using just an IBUFDS -> BUFG and I get very similar results, where the clock delay from the output of the final BUFG to the IDDR clock input has a very large variation over the process corners.
I will remove the input BUFG but I'm going to guess that it isn't going to make much difference.
I guess what I need to know is whether or not this is a normal amount of variation. If so, then so be it. I just don't remember seeing nearly this much before.
10-09-2018 01:22 PM
I got rid of the BUFG in front of the MMCM and it made very little difference.
I guess that the timing analyzer is telling me the truth, and that there are large differences in clock delay from the BUFG output to the IDDR clock inputs over PVT.
So, with no local clocking buffers (BUFIO or BUFR), input DDR interfaces in Ultrascale perform worse than in older part families.
10-10-2018 05:10 PM
@mbarnard "So, with no local clocking buffers (BUFIO or BUFR), input DDR interfaces in Ultrascale perform worse than in older part families."
I've seen this as well - without the dedicated I/O clocking paths, you are at the mercy of how the "ASIC class" clock distribution network gets assembled, which tends to be much more variable build-to-build and instance-to-instance, as Xilinx's algorithms do not appear to take I/O clocking constraints into account when placing elements and building the resulting clock tree.
That said, I've had good success with moderate rate ( < 320 MHz SDR, 160 MHz DDR ) designs in Ultrascale by manually constraining the tools to create a clock distribution tree spanning only one clock region, with the clock root in the same region.
See the following thread for more info:
" The only way I've found to get predictable Ultrascale timing for the equivalent of
" BUFIO=>I/O DDR, BUFR=>FABRIC is to manually force all of the following:
" - LOC the input clock buffer to the clock region with the I/O DDR flops in question
" - Force the clock root into that same clock region with USER_CLOCK_ROOT
" - create a PBLOCK to force anything on that same clock net into that same clock region
The latest tools have added a CLOCK_LOW_FANOUT constraint, see UG949 page 101, that hopefully would make this process simpler than the USER_CLOCK_ROOT + pblock approach I describe above.
XAPP1324 is also a good reference for designs using Ultrascale 'component mode' I/O primitives.
"Native mode" has an internal direct strobe path, but I haven't done any designs using it yet.
10-10-2018 06:19 PM
Thank you so much for the info, brimdavis! I'm glad that I'm not imagining this. Much obliged.