
Adventurer
Posts: 57
Registered: ‎05-15-2014
Accepted Solution

SelectIO output bus width of ISERDES

I'm preparing to interface an AD9222 ADC to a Z010 FPGA.  This is my first experience using ISERDES, and it seems overly complex and somewhat confusing.  I've looked at XAPP524, which seems to have a lot more complexity than necessary, and UG471, which covers a lot of that same complexity while falling short of showing what is actually going on in ISERDES.

 

Since the ADC provides a bit clock and a frame clock, and isn't that insanely fast, I fail to see the need for dynamically aligning the clock and data.  Yes, the clock path is probably a bit longer than the data path, but that should be determinable from timing calculations and a small IDelay in each data path should solve that once and for all without training the clock with a state machine.

 

Likewise, BitSlip seems, at least for this application, to be a huge complexity with no benefit.  We have a frame clock.  What more do we need?

 

Finally, my biggest concern is the 10 and 14 bit limitation of cascaded ISERDES DDR.  The ADC is 12 bit.  It would seem to me that this should not be a problem, but everything I read makes a big deal about it.  Fundamentally, isn't ISERDES a shift register followed by a parallel load register?  So if it is set up for 14 bit deserialization and the frame clock is 1/6 the bit clock isn't it going to decode 12 bits with two extra Q outputs that aren't needed?  What am I missing here?  If it was part of a self-aligning scheme I could see the need for a match and for BitSlip, but how hard is it to latch 14 bits of data (two of which are garbage) every six frame clock cycles?

 

 


Accepted Solutions
Adventurer
Posts: 57
Registered: ‎05-15-2014

Re: SelectIO output bus width of ISERDES

I promised an update on this.  So now that I have a working design, I will share what I learned.  The task was to interface an AD9222 12 bit 8 channel ADC to the Zynq.  The input clock to the ADC is 40 MHz coming from FPGA CLK 0 from the PS.  It is not used by the Zynq in any other manner.  The ADC generates a 240 MHz bit clock from this and outputs DDR data at that rate.  It supplies a bit clock output with transitions at the center of each data period, and a frame clock output with a positive edge crossing aligned with the start of the first bit of each data word.

 

It was not necessary to use any dynamic timing to accurately deserialize this data.  ISERDES did not appear to be a useful primitive, partly because of its failure to support 12 bit data, and partly because it simply isn't designed to take advantage of the frame clock.  It is a valuable primitive for situations where it is needed, where fabric couldn't handle the data rate, and where dynamic timing adjustments are essential to successful data recovery.  But anyone looking at ISERDES as I was, as a useful primitive to avoid using fabric is likely to find that it takes more fabric to use it than it does to just do the job.

 

The solution starts with timing.  Vivado puts much more emphasis on timing than ISE did, because timing constraints drive place and route decisions--which is actually a welcome improvement.  There are two clocks that have to be considered here.  One is the bit clock the ADC provides.  The other is the internal one in the ADC that causes the data to be shifted out.  They are both of the same frequency, but the timing (phase) relationship is important.  The output clock has been 90 degree phase shifted to align with the middle of the data cell.  In addition, Vivado assumes that source clocks are what causes data changes, so the edge prior to the data change is compared to the destination clock edge that captures the change.  While there are multiple ways to represent this, I did not (as other posts have done) find it necessary to use the "multicycle" constraint to represent this.  I represented it by the relative timing values in the waveform parameter.  Note that there is a 1/2 cycle discrepancy in the following numbers that will be addressed later.  Here are the clock constraints:

 

#virtual clock driving data from ADC
create_clock -period 4.167 -name ADC_Data_Clk -waveform {-3.125 -1.042}
#Actual 90 degree shifted bit clock from ADC
create_clock -period 4.167 -name ADCBITCLK -waveform {-2.083 0} [get_ports ADCBITCLK_v_p]
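
The waveform values in the two constraints follow from the bit clock period. As a worked check (assuming an exact 240 MHz bit clock; the constraints round to three decimals):

```latex
\begin{aligned}
T &= \frac{1}{240\ \text{MHz}} \approx 4.167\ \text{ns}, \qquad
\tfrac{T}{2} \approx 2.083\ \text{ns}, \qquad
\tfrac{T}{4} \approx 1.042\ \text{ns} \\
\text{ADC\_Data\_Clk:}\quad t_{\text{rise}} &= -\tfrac{3T}{4} = -3.125\ \text{ns}, \qquad
t_{\text{fall}} = -\tfrac{T}{4} \approx -1.042\ \text{ns} \\
\text{ADCBITCLK:}\quad t_{\text{rise}} &= -\tfrac{T}{2} \approx -2.083\ \text{ns}, \qquad
t_{\text{fall}} = 0 \\
t_{\text{rise}}(\text{ADCBITCLK}) - t_{\text{rise}}(\text{ADC\_Data\_Clk}) &= \tfrac{T}{4} \approx 1.042\ \text{ns} \quad (90^\circ)
\end{aligned}
```

So the captured bit clock is described as lagging the virtual data clock by a quarter period, which is the 90 degree shift mentioned above.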

 

The first is a "virtual clock" that describes what causes the data to be generated.  The second is the actual clock that receives it.  This next line just assures Vivado that there is no point in looking for a relationship between the clock driving the ADC and the clocks being constrained:

 

set_clock_groups -asynchronous -group [get_clocks ADCBITCLK] -group [get_clocks clk_fpga_0]

 

Then the next four lines (because this is DDR) tell Vivado that there is no source - destination relationship between the positive edge of one clock and the negative edge of the other:

 

set_false_path -setup -fall_from [get_clocks ADC_Data_Clk] -rise_to [get_clocks ADCBITCLK]
set_false_path -setup -rise_from [get_clocks ADC_Data_Clk] -fall_to [get_clocks ADCBITCLK]
set_false_path -hold -fall_from [get_clocks ADC_Data_Clk] -rise_to [get_clocks ADCBITCLK]
set_false_path -hold -rise_from [get_clocks ADC_Data_Clk] -fall_to [get_clocks ADCBITCLK]

 

And finally, these lines document the variability possible in the data, as defined in the ADC data sheet.  Note that this is an 8 channel ADC, but I'm only showing a single channel to keep it readable:

 

set_input_delay -clock [get_clocks ADC_Data_Clk] -clock_fall -min -add_delay -0.300 [get_ports ADCA_v_n]
set_input_delay -clock [get_clocks ADC_Data_Clk] -clock_fall -max -add_delay 0.300 [get_ports ADCA_v_n]
set_input_delay -clock [get_clocks ADC_Data_Clk] -min -add_delay -0.300 [get_ports ADCA_v_n]
set_input_delay -clock [get_clocks ADC_Data_Clk] -max -add_delay 0.300 [get_ports ADCA_v_n]

 

The actual implementation consisted of bringing the clock in on an IBUFDS, then through IDELAYE2 and BUFR.  The data comes in on IBUFDS and IDELAYE2 to IDDR.  The frame clock is brought in like data, but using only one output of the IDDR.  There does not appear to be anything like an IFD, so IDDR was used.  Because IDELAY was used, IDELAYCTRL has to be instantiated and given a 200 MHz clock.  Since this is the only use of IDELAY, grouping was not needed.

 

   IDELAYCTRL IDELAYCTRL_inst (
       .RDY(),       // 1-bit output: Ready output
       .REFCLK(DlyClk), // 1-bit input: Reference clock input
       .RST(!S_AXI_ARESETN)        // 1-bit input: Active high reset input
    );

    IBUFDS IBUFDS_Bit (
       .O(bitclk),  // Buffer output
       .I(ADCbitClk_v_p),  // Diff_p buffer input (connect directly to top-level port)
       .IB(ADCbitClk_v_n) // Diff_n buffer input (connect directly to top-level port)
    );
   
   IDELAYE2 #(
      .CINVCTRL_SEL("FALSE"),          // Enable dynamic clock inversion (FALSE, TRUE)
      .DELAY_SRC("IDATAIN"),           // Delay input (IDATAIN, DATAIN)
      .HIGH_PERFORMANCE_MODE("FALSE"), // Reduced jitter ("TRUE"), Reduced power ("FALSE")
      .IDELAY_TYPE("FIXED"),           // FIXED, VARIABLE, VAR_LOAD, VAR_LOAD_PIPE
      .IDELAY_VALUE(DelayCount),                // Input delay tap setting (0-31)
      .PIPE_SEL("FALSE"),              // Select pipelined mode, FALSE, TRUE
      .REFCLK_FREQUENCY(200.0),        // IDELAYCTRL clock input frequency in MHz (190.0-210.0, 290.0-310.0).
      .SIGNAL_PATTERN("CLOCK")          // DATA, CLOCK input signal
   )
   IDELAYE2_C (
      .CNTVALUEOUT(), // 5-bit output: Counter value output
      .DATAOUT(bitclkd),         // 1-bit output: Delayed data output
      .C(1'b0),                     // 1-bit input: Clock input
      .CE(1'b0),                   // 1-bit input: Active high enable increment/decrement input
      .CINVCTRL(1'b0),       // 1-bit input: Dynamic clock inversion input
      .CNTVALUEIN(5'b0),   // 5-bit input: Counter value input
      .DATAIN(1'b0),           // 1-bit input: Internal delay data input
      .IDATAIN(bitclk),         // 1-bit input: Data input from the I/O
      .INC(1'b0),                 // 1-bit input: Increment / Decrement tap delay input
      .LD(1'b0),                   // 1-bit input: Load IDELAY_VALUE input
      .LDPIPEEN(1'b0),       // 1-bit input: Enable PIPELINE register to load data input
      .REGRST(1'b0)            // 1-bit input: Active-high reset tap-delay input
   );
  
   BUFR #(
       .BUFR_DIVIDE("BYPASS"),   // Values: "BYPASS, 1, 2, 3, 4, 5, 6, 7, 8"
       .SIM_DEVICE("7SERIES")  // Must be set to "7SERIES"
    )
    BUFR_BitClk (
       .O(bitclkb),     // 1-bit output: Clock output port
       .CE(1'b1),   // 1-bit input: Active high, clock enable (Divided modes only)
       .CLR(1'b0), // 1-bit input: Active high, asynchronous clear (Divided modes only)
       .I(bitclkd)      // 1-bit input: Clock buffer input driven by an IBUF, MMCM or local interconnect
    );

    IBUFDS IBUFDS_Frame (
       .O(frameclk),  // Buffer output
       .I(ADCframeClk_v_p),  // Diff_p buffer input (connect directly to top-level port)
       .IB(ADCframeClk_v_n) // Diff_n buffer input (connect directly to top-level port)
    );

   IDELAYE2 #(
      .CINVCTRL_SEL("FALSE"),          // Enable dynamic clock inversion (FALSE, TRUE)
      .DELAY_SRC("IDATAIN"),           // Delay input (IDATAIN, DATAIN)
      .HIGH_PERFORMANCE_MODE("FALSE"), // Reduced jitter ("TRUE"), Reduced power ("FALSE")
      .IDELAY_TYPE("FIXED"),           // FIXED, VARIABLE, VAR_LOAD, VAR_LOAD_PIPE
      .IDELAY_VALUE(0),                // Input delay tap setting (0-31)
      .PIPE_SEL("FALSE"),              // Select pipelined mode, FALSE, TRUE
      .REFCLK_FREQUENCY(200.0),        // IDELAYCTRL clock input frequency in MHz (190.0-210.0, 290.0-310.0).
      .SIGNAL_PATTERN("CLOCK")          // DATA, CLOCK input signal
   )
   IDELAYE2_F (
      .CNTVALUEOUT(), // 5-bit output: Counter value output
      .DATAOUT(frameclkd),         // 1-bit output: Delayed data output
      .C(1'b0),                     // 1-bit input: Clock input
      .CE(1'b0),                   // 1-bit input: Active high enable increment/decrement input
      .CINVCTRL(1'b0),       // 1-bit input: Dynamic clock inversion input
      .CNTVALUEIN(5'b0),   // 5-bit input: Counter value input
      .DATAIN(1'b0),           // 1-bit input: Internal delay data input
      .IDATAIN(frameclk),         // 1-bit input: Data input from the I/O
      .INC(1'b0),                 // 1-bit input: Increment / Decrement tap delay input
      .LD(1'b0),                   // 1-bit input: Load IDELAY_VALUE input
      .LDPIPEEN(1'b0),       // 1-bit input: Enable PIPELINE register to load data input
      .REGRST(1'b0)            // 1-bit input: Active-high reset tap-delay input
   );
  
    IDDR #(
        .DDR_CLK_EDGE("SAME_EDGE") // "OPPOSITE_EDGE", "SAME_EDGE"
                                             //    or "SAME_EDGE_PIPELINED"
    ) IDDR_Frame (
        .Q1(), // 1-bit output for positive edge of clock
        .Q2(frameclk1), // 1-bit output for negative edge of clock
        .C(bitclkb),   // 1-bit clock input
        .CE(1'b1), // 1-bit clock enable input
        .D(frameclkd),   // 1-bit DDR data input
        .R(1'b0),   // 1-bit reset
        .S(1'b0)    // 1-bit set
    );

    IBUFDS IBUFDS_ADC0 (
       .O(ADC0data),  // Buffer output
       .I(ADC0_v_p),  // Diff_p buffer input (connect directly to top-level port)
       .IB(ADC0_v_n) // Diff_n buffer input (connect directly to top-level port)
    );

   IDELAYE2 #(
      .CINVCTRL_SEL("FALSE"),          // Enable dynamic clock inversion (FALSE, TRUE)
      .DELAY_SRC("IDATAIN"),           // Delay input (IDATAIN, DATAIN)
      .HIGH_PERFORMANCE_MODE("FALSE"), // Reduced jitter ("TRUE"), Reduced power ("FALSE")
      .IDELAY_TYPE("FIXED"),           // FIXED, VARIABLE, VAR_LOAD, VAR_LOAD_PIPE
      .IDELAY_VALUE(0),                // Input delay tap setting (0-31)
      .PIPE_SEL("FALSE"),              // Select pipelined mode, FALSE, TRUE
      .REFCLK_FREQUENCY(200.0),        // IDELAYCTRL clock input frequency in MHz (190.0-210.0, 290.0-310.0).
      .SIGNAL_PATTERN("DATA")          // DATA, CLOCK input signal
   )
   IDELAYE2_0 (
      .CNTVALUEOUT(), // 5-bit output: Counter value output
      .DATAOUT(ADC0datad),         // 1-bit output: Delayed data output
      .C(1'b0),                     // 1-bit input: Clock input
      .CE(1'b0),                   // 1-bit input: Active high enable increment/decrement input
      .CINVCTRL(1'b0),       // 1-bit input: Dynamic clock inversion input
      .CNTVALUEIN(5'b0),   // 5-bit input: Counter value input
      .DATAIN(1'b0),           // 1-bit input: Internal delay data input
      .IDATAIN(ADC0data),         // 1-bit input: Data input from the I/O
      .INC(1'b0),                 // 1-bit input: Increment / Decrement tap delay input
      .LD(1'b0),                   // 1-bit input: Load IDELAY_VALUE input
      .LDPIPEEN(1'b0),       // 1-bit input: Enable PIPELINE register to load data input
      .REGRST(1'b0)            // 1-bit input: Active-high reset tap-delay input
   );
  
    IDDR #(
        .DDR_CLK_EDGE("SAME_EDGE") // "OPPOSITE_EDGE", "SAME_EDGE"
                                             //    or "SAME_EDGE_PIPELINED"
    ) IDDR_D0 (
        .Q1(ADC0dataE), // 1-bit output for positive edge of clock
        .Q2(ADC0dataO), // 1-bit output for negative edge of clock
        .C(bitclkb),   // 1-bit clock input
        .CE(1'b1), // 1-bit clock enable input
        .D(ADC0datad),   // 1-bit DDR data input
        .R(1'b0),   // 1-bit reset
        .S(1'b0)    // 1-bit set
    );

 

The outputs of the IDDRs go to shift registers:

 

    always @ (posedge bitclkb) begin           // Shift even and odd bits on the positive clock edge (both IDDR outputs)
        frameclk2 <= frameclk1;
        for (i = 5; i > 0; i = i - 1) begin    // advance the six-stage even and odd shift registers
            ADC0Odd[i] <= ADC0Odd[i-1];
            ADC0Even[i] <= ADC0Even[i-1];
        end
        ADC0Odd[0] <= ADC0dataO;                // shifting in the new bit
        ADC0Even[0] <= ADC0dataE;                // shifting in the new bit
        if (frameclk1 & !frameclk2) begin
            tempframe <= frameclki;
            for (i = 5; i >= 0; i = i - 1) begin    // demux odd and even bits
                ADC0Word[2 * i + 1] <= ADC0Odd[i];
                ADC0Word[2 * i] <= ADC0Even[i];
            end
        end
    end

 

The last part above is a second (latching) register.  frameclk2 is a copy of frameclk1, delayed one clock cycle.  During the clock cycle when frameclk1 is high and frameclk2 is low, the shift registers are transferred to the holding register, de-interleaved at the same time.
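
For readers who want the deserializer as a standalone unit, here is a minimal sketch of the same shift-and-latch technique with its declarations included. The module name, port names, and the word_valid strobe are illustrative assumptions and are not part of the original design; the bit ordering mirrors the snippet above.

```verilog
// Sketch only: module, port, and register names are hypothetical.
// Mirrors the shift-and-latch scheme described above for one ADC channel.
module ddr_deser12 (
    input  wire        clk,        // buffered bit clock (bitclkb in the post)
    input  wire        frame_in,   // frame clock as sampled by an IDDR (frameclk1)
    input  wire        d_even,     // IDDR Q1 output (ADC0dataE)
    input  wire        d_odd,      // IDDR Q2 output (ADC0dataO)
    output reg  [11:0] word,       // latched 12-bit sample
    output reg         word_valid  // one-cycle strobe when word is updated (added for illustration)
);
    reg [5:0] even_sr = 6'd0;      // six most recent even bits, newest in bit 0
    reg [5:0] odd_sr  = 6'd0;      // six most recent odd bits, newest in bit 0
    reg       frame_q = 1'b0;      // frame_in delayed by one clock (frameclk2)
    integer   i;

    always @(posedge clk) begin
        frame_q    <= frame_in;
        even_sr    <= {even_sr[4:0], d_even};   // shift in the new even bit
        odd_sr     <= {odd_sr[4:0],  d_odd};    // shift in the new odd bit
        word_valid <= 1'b0;
        // Rising edge of the sampled frame clock: latch and recombine.
        if (frame_in & ~frame_q) begin
            for (i = 0; i < 6; i = i + 1) begin
                word[2*i + 1] <= odd_sr[i];
                word[2*i]     <= even_sr[i];
            end
            word_valid <= 1'b1;
        end
    end
endmodule
```

Because the assignments are non-blocking, the word is built from the shift register contents before the current shift, exactly as in the original always block.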

 

IDELAY was put on all inputs.  The data inputs (including frame clock) were set to tap 0, and the clock was ultimately set to tap 9 to center it in the data window.  There were two reasons for putting IDELAY in the data path.  First, even at tap 0, it introduces some delay, and without it, the clock would probably have had to be set at tap 0, leaving no room for adjustment.  The initial version used a loadable delay that was controlled by RS-232 input to the program running on the PS.  Minimum and maximum values that produced correct output were determined.  The second reason is that it creates a balanced signal flow path, which cancels out variations in IDELAY time.  IDELAY could have been used with the data (taps other than 0) since the clock path is longer already.  However, delaying a data signal adds data-dependent jitter that doesn't occur in a clock path, because a clock is a repetitive pattern.

 

IDDR has three output options.  OPPOSITE_EDGE gives the least delay, but at some point the data is going to need to be accessible from a single clock (edge), and in my opinion, the sooner, the better.  SAME_EDGE solves this problem with the addition of a buffer on one of the paths to retime it, at a cost of half a clock cycle.  But the two data bits coming out can be from consecutive words, which isn't terribly useful at first glance.  SAME_EDGE_PIPELINED introduces additional buffering to resolve that problem at the expense of an extra clock cycle delay.  SAME_EDGE can be the best of all worlds, though, if the clock is simply inverted.  In this case it isn't being physically inverted, but because the clock path is longer, it is easy to create the inversion by simply accepting a half cycle of delay and specifying the clock timing 1/2 cycle off.

 

Tests were  run over the range of -40 to +85 C.  The data was stable in all cases over a clock tap range of 2 to 26 (a 1.8 ns range) and the minimum and maximum changed by only 2 taps over the entire range.  That is quite good, considering that the theoretical data window is 2 ns.  This was on a sample of 2, so actual device to device variations would undoubtedly reduce it some.  Also, while the two ICs were only about two inches apart, all lines were controlled impedance, differential pairs of equal length.
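
As a rough cross-check of the tap numbers above, the IDELAYE2 tap size can be estimated from the reference clock. This assumes the nominal 7 series relationship of 32 taps spanning half a reference clock period, and ignores the fixed insertion delay and tap-to-tap variation:

```latex
\begin{aligned}
t_{\text{tap}} &= \frac{1}{32 \cdot 2 \cdot f_{\text{REF}}} = \frac{1}{64 \cdot 200\ \text{MHz}} \approx 78\ \text{ps} \\
\text{stable range} &= (26 - 2)\, t_{\text{tap}} \approx 24 \times 78\ \text{ps} \approx 1.87\ \text{ns} \\
\text{theoretical bit window} &= \frac{1}{2 \cdot 240\ \text{MHz}} \approx 2.08\ \text{ns}
\end{aligned}
```

The result is in line with the roughly 1.8 ns range and 2 ns window quoted above.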

 

Here are some general things learned from this exercise.  It was a "trial by fire" of Zynq and Vivado, but resulted in a lot of learning.

 

ISERDES.  When you need it you need it.  When you don't, you don't.  Anyone who sees (as I initially did) it as some useful "hard logic" to save PL resources is likely to be disappointed.  If fabric can deserialize with static timing, it will probably take less fabric than what ISERDES requires to support it.

 

BUFG is not your friend.  It is a valuable primitive when a clock needs to be widely distributed across the fabric, particularly if it originates internally, but it isn't the optimum solution for something like this.  BUFGs are located in the middle of the die and routing delays in and out of them can introduce several ns of delay.  This can be compensated for by using a PLL with feedback, but BUFG should only be used when really needed.  BUFR and BUFIO are better choices for a more localized situation such as this.  BUFR includes a divider, but can be used in BYPASS mode.  It has less delay than BUFIO, and BUFIO can only clock IOLOGIC, whereas BUFR can source regional clock lines.

 

Some primitives, such as IBUFGDS, aren't really primitives.  IBUFGDS will get implemented as IBUFDS feeding BUFG.  This can be confusing, such as in timing reports where routing delays show up that one hadn't anticipated.  In fact, in one case, the implementation schematic took the input and output nets for the BUFG outside the Verilog module (as I/O) and placed the BUFG at a higher level of the hierarchy.  Just a quirk of the tools, but one that can be confusing.

 

There is also some overlap in primitives.  For instance, it would appear that IDDR is actually a degenerate case of ISERDES.  It is well worthwhile to look at "implemented design" and see how things are implemented.  There is so much on the die that it takes a lot of zooming to get the details, but it is worth the effort.  It is easier to design efficiently when one understands the hardware.  There appear to be four I/O related blocks:  IOB is the pin driver/receiver, including linking two together to form a differential input.  IODELAY is a separate block.  IOLOGIC is a separate block that implements IDDR and ISERDES.  In fact it does not appear that there is an equivalent of IFD, so IDDR fulfills that function.  For clock capable pins there is the block that contains BUFR and BUFIO.  Related to that is the dedicated clock routing fabric.  While it isn't zero delay, the delay is small and consistent.

 

Worst case timing isn't as bad as one might suspect.  Computing timing manually is tedious and fraught with error.  Maximum data delay and minimum clock delay will never occur simultaneously, and the program knows that.  It only uses limits that can occur together.  The variations in delay over temperature, voltage and process are higher than one might expect, but they mostly track.  My paper analysis said that timing closure was not possible.  The timing report said it had 1 ns of slack.  Our measurements showed 1.8 ns of slack (but only for a sample of 2).

 

Based on the results of this project and the testing done, I am of the opinion that any DDR serial interface with a clock that isn't too fast for the fabric to handle can be successfully implemented in fabric with static timing and no ISERDES.  Only if speeds are too high for the fabric to handle, or timing variations are so large that dynamic timing is required, is ISERDES needed; that is probably the only situation where it is of any use or value.

 



All Replies
Instructor
Posts: 3,702
Registered: ‎01-23-2009

Re: SelectIO output bus width of ISERDES

It would help if you told us what "not insanely fast" is.

 

So what you are saying is true - the ISERDES is nothing but a shift register with parallel load. The only reason it exists in the FPGA is to allow you to isolate the "insanely fast" part of a serial interface to the input/output block and associated clocking structures, and not have to deal with an "insanely fast" internal clock.

 

So, if your serial clock is less than what can be easily handled in the FPGA (and, depending on family and speed grade, this could be anywhere from 300MHz to 600MHz), then you can implement everything you need using the IDDR and some simple logic in the fabric.

 

The BITSLIP logic is also not "necessary" - the equivalent function can be done in the fabric using a barrel shifter and some flip-flops for storage.
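
A minimal sketch of such a fabric bitslip follows: it keeps the previous 12-bit word and selects a 12-bit window from the 24-bit history. The module name, ports, and the assumption that the oldest bit sits in the MSB are illustrative only.

```verilog
// Sketch only: names are hypothetical. Assumes the oldest received bit is the
// MSB of word_in and that slip stays in the range 0..11.
module fabric_bitslip12 (
    input  wire        clk,
    input  wire        word_valid,  // strobe: a new unaligned 12-bit word is present
    input  wire [11:0] word_in,     // unaligned word from the deserializer
    input  wire [3:0]  slip,        // number of bit positions to slip (0..11)
    output reg  [11:0] word_out     // re-aligned word
);
    reg  [11:0] prev = 12'd0;             // flip-flops storing the previous word
    wire [23:0] window = {prev, word_in}; // 24-bit history, oldest bit at [23]

    always @(posedge clk) begin
        if (word_valid) begin
            prev     <= word_in;
            // Barrel shift: pick 12 consecutive bits starting 'slip' bits
            // after the oldest bit in the history.
            word_out <= window[23 - slip -: 12];
        end
    end
endmodule
```

With slip = 0 the output is simply the previous word; each increment moves the word boundary one bit later in the stream.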

 

However, regardless of whether you use the ISERDES or the IDDR to capture the incoming data, you must ensure reliable capture. The complexity of this (again) depends on how "not insanely fast" your input data is. Xilinx FPGAs need a fairly sizeable stable data setup/hold window. If, after taking into account all the PVT variations of your system, you can ensure that the window is "big enough" for static capture, then training is not needed - fixed IDELAY values will suffice. This is by far the preferred situation. If, however, the window is too small for static capture, then you need to do training.

 

As for the ISERDES not doing 12bits - that's just the way it is. It's not that 12 is any more complex than 14 or 10, but the ISERDES just doesn't support it - and any attempt to fake it (like setting the deserialization to 14:1 and then using a 1/6 clock) is not supported... If you need 12:1, then

  - use the IDDR and do the 6:1 in a barrel shifter

  - use the ISERDES in 6:1 mode and then do the remaining 2:1 in the fabric

     - this requires an intermediate clock running at 1/3 the incoming frequency

 

Avrum

Adventurer
Posts: 57
Registered: ‎05-15-2014

Re: SelectIO output bus width of ISERDES

"insanely fast" The chip in question is z010-1, The data in question has a bit clock of 240 MHz.

 

To follow up on bitslip, I understand the barrel shifter, but what I don't understand is why in situations (such as mine) where a frame clock is available one would ever need bitslip.  If the frame clock is doing its job, the bits should already be in the correct position.

 

The other half I guess I'll have to take on faith.  If ISERDES is basically a shift register and a parallel load register, I haven't a clue why it even cares how many bits wide the data is as long as it isn't wider than the registers.  If I tell it that it is 14 bits and the frame clock transfers after 12, the upper 2 bits will be whatever the hardware makes them (don't care).

 

If the above isn't true for some reason that defies the rules of Boolean algebra, then I guess XAPP 524 with its two SDR SERDES on opposite clocks is the best approach for my needs.

Instructor
Posts: 3,702
Registered: ‎01-23-2009

Re: SelectIO output bus width of ISERDES

"insanely fast" The chip in question is z010-1, The data in question has a bit clock of 240 MHz.

 

Is that 240Mbps (so a bit interval of 4.166ns) or 240MHz DDR (for a bit interval of 2.0833ns)?

 

At 4.166ns, it should be relatively easy to statically capture data in a Zynq-7010-1 (which is an Artix-7 fabric). At 2.0833ns, it's much less easy, and depending on the characteristics of the sending device, may not even be possible statically...

 

To follow up on bitslip, I understand the barrel shifter, but what I don't understand is why in situations (such as mine) where a frame clock is available one would ever need bitslip.  If the frame clock is doing its job, the bits should already be in the correct position.

 

This question leads me to believe that there is something you are missing... The frame "clock" is rarely used as a clock in the FPGA. The normal way of capturing the data is to use the bit clock driving a BUFIO and a BUFR. The BUFR is used to generate the word clock, which is the bit clock divided by 12 (SDR) or by 6 (DDR). The BUFIO and BUFR clocks are used to drive CLK and CLKDIV of the ISERDES. An ISERDES is used to capture both the incoming data and to sample the frame "clock".

 

You then use a single BITSLIP that drives both (or all if the data is more than one bit wide) ISERDES. You assert BITSLIP one at a time until the 0->1 transition on the sampled frame "clock" is between words from the ISERDES. When this is done, your data is now framed.
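
The alignment loop described above could be sketched roughly as follows. The expected frame pattern (frame clock high for the first six bits of a word) and the number of wait cycles after each slip are assumptions that should be checked against the ADC data sheet and the SelectIO user guide; the module and port names are hypothetical.

```verilog
// Sketch only: names, FRAME_PATTERN, and WAIT_CYCLES are assumptions.
module frame_align #(
    parameter [11:0] FRAME_PATTERN = 12'b111111000000, // assumed aligned frame word
    parameter integer WAIT_CYCLES  = 4                 // assumed settle time after a slip
)(
    input  wire        clk,         // word clock (CLKDIV domain)
    input  wire        rst,         // synchronous, active-high reset
    input  wire [11:0] frame_word,  // deserialized samples of the frame "clock"
    output reg         bitslip,     // one-cycle pulse to all deserializers
    output reg         aligned      // high once the frame pattern is seen
);
    reg [2:0] wait_cnt;

    always @(posedge clk) begin
        if (rst) begin
            bitslip  <= 1'b0;
            aligned  <= 1'b0;
            wait_cnt <= 3'd0;
        end else begin
            bitslip <= 1'b0;                       // default: no slip this cycle
            if (wait_cnt != 3'd0) begin
                wait_cnt <= wait_cnt - 1'b1;       // let the previous slip take effect
            end else if (frame_word == FRAME_PATTERN) begin
                aligned <= 1'b1;                   // word boundary found
            end else begin
                aligned  <= 1'b0;
                bitslip  <= 1'b1;                  // slip by one bit and try again
                wait_cnt <= WAIT_CYCLES;
            end
        end
    end
endmodule
```

The same bitslip output would drive every data channel and the frame channel, so all of them move together.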

 

You cannot use the frame "clock" as the CLKDIV of the ISERDES - it is not guaranteed to have the required phase relationship between the CLK and CLKDIV for the ISERDES to operate.

 

Look at the 7-series Select I/O User Guide (v1.6), page 153 (below figure 3-6) - this describes the legal clocking structures for the ISERDES... Using two externally sourced clocks (one for CLK and one for CLKDIV) is not one of the legal combinations...

 

As for the ISERDES not doing 12 - it's not just a simple shift register. The interaction of the BITSLIP, the memory mode (which you are not using) and (probably most importantly) the cascading between the master and slave ISERDES make this block more complex than it seems on the surface. I suspect (but have absolutely no data to back this up) that the "I don't do 12x" comes from some timing issue involved in the master/slave cascade...

 

Avrum

Adventurer
Posts: 57
Registered: ‎05-15-2014

Re: SelectIO output bus width of ISERDES

OK, further clarification.  4.166 ns is the correct number.  This is an 8 channel 12 bit ADC running at 40 Ms/s with one DDR stream for each of the 8 channels.  So the bit clock is 240 MHz, 6 x the frame clock, and one of the 12 bits is transferred on each edge of it.  As such it falls into your "relatively easy" category, which is what I suspected.

 

I think part of the disconnect is that ISERDES is really designed for situations where the data rate is too high for the fabric to handle, so this highly optimized, hard logic subsystem is there to parallelize it into something the fabric can handle.  In my case, I probably don't NEED ISERDES, but since it is there, I thought I would use it.  As you mention, its exact internals are a Xilinx secret, and you are probably correct that the way it is implemented in order to meet the speed requirements is not straightforward.

 

I am aware of the "sampling the frame clock" approach shown in some of the app notes and understand why it may be useful in some situations.  I fail to see its value when the frame clock is 40 MHz.  If we can't achieve reasonable timing closure to within a fraction of a bit time at 40 MHz, the whole chip is a hopeless disaster.

 

This plays into the reasons for automatically adjusting delay and bit slip--probably essential at high enough data rates, but hugely overkill for what I need.  Related to this is the fact that many serial protocols have a "training header" and self-clocking data that eliminate the clock lanes entirely.  Clearly both clock phase and bit slip are critical in such cases.  But in the comparatively slow, explicitly qualified case I'm dealing with, none of this makes any sense.  I have all the information necessary to get it right the first time coming into the chip, and any dynamic "playing" can only reduce the reliability of the capture.  Appropriate fixed delays should be the worst case need.

 

I need to explore more the part about external frame clock not being a legal clocking option.  Both that and the concept that the path delay of the frame clock can't be adequately characterized and controlled with a fixed delay don't sound reasonable to me, but there may be something I'm missing.  Unfortunately this is my first encounter with Artix-7 and the hardware that I need to test with is at least two weeks away from being available.  I'm using a Zybo for initial development, but I can't appropriately bring this signal into the Zynq on the Zybo with any signal integrity, so any testing I did with that lashup would probably generate more spurious information than helpful data.

 

Both the clocks are connected to clock capable inputs--therefore can drive BUFGs, so it would seem that this complies with the third specified valid clocking arrangement.  That was my intent.  There is a whole section of logic around this ADC.  My plan was to drive the ADC with a 40 MHz clock (no particular phase relationship to anything) and then receive the 40 MHz frame clock and 240 MHz bit clock using them to drive the 8 pairs of ISERDES, and also using the received frame clock as the processing clock for all subsequent logic, which ends in one port of a BRAM.  The CPU has an AXI interface to the other port of the BRAM, thus creating a clock boundary there.  If I need a faster clock for intermediate logic states it would be derived from the received bit clock.

 

Thanks

Wilton

Instructor
Posts: 3,702
Registered: ‎01-23-2009

Re: SelectIO output bus width of ISERDES

OK, further clarification.  4.166 ns is the correct number.  This is an 8 channel 12 bit ADC running at 40 Ms/s with one DDR stream for each of the 8 channels.  So the bit clock is 240 MHz, 6 x the frame clock, and one of the 12 bits is transferred on each edge of it. 

 

40MS/s at 12 bits/sample is 480Mbps, thus 240MHz DDR. Thus the bit time is not 4.166ns, but 2.083ns - which is not relatively easy... And regardless of the ease, you must have proper constraints on the interface and ensure that it meets timing after implementation. You can take a look at this post on constraining edge aligned source synchronous interfaces.
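
Spelled out, the arithmetic behind those numbers (per channel, assuming exactly 12 bits per sample and no padding bits) is:

```latex
\begin{aligned}
R_{\text{bit}} &= 40\ \text{MS/s} \times 12\ \text{bits/sample} = 480\ \text{Mb/s} \\
t_{\text{bit}} &= \frac{1}{480\ \text{Mb/s}} \approx 2.083\ \text{ns} \\
f_{\text{clk,DDR}} &= \frac{R_{\text{bit}}}{2} = 240\ \text{MHz}, \qquad
T_{\text{clk}} = \frac{1}{240\ \text{MHz}} \approx 4.167\ \text{ns} = 2\, t_{\text{bit}}
\end{aligned}
```

The 4.166 ns figure is therefore the clock period, while each bit occupies only half of it.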

 

Both the clocks are connected to clock capable inputs--therefore can drive BUFGs, so it would seem that this complies with the third specified valid clocking arrangement.  That was my intent.

 

I am pretty sure that what you are describing isn't legal. And even if it was, this still doesn't guarantee that the resulting parallel data out of the ISERDES is framed; there is nothing in the specification that guarantees any relationship between "the bit that arrives just after the rising edge of CLKDIV" and any particular output of the ISERDES Q1-Q8. What I described in my previous post is known to work...

 

My plan was to drive the ADC with a 40 MHz clock (no particular phase relationship to anything) and then receive the 40 MHz frame clock and 240 MHz bit clock using them to drive the 8 pairs of ISERDES, and also using the received frame clock as the processing clock for all subsequent logic, which ends in one port of a BRAM. 

 

There is nothing wrong with the basic idea, but the 40MHz clock used internally should be the 240MHz clock divided by 6 using the BUFR - not the frame clock from the ADC...
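
A minimal sketch of that divide-by-6 clock, using the same BUFR primitive and port list that appears in the accepted solution but with the divider enabled. The wrapper module and net names are hypothetical, and simulation requires the Xilinx unisim library.

```verilog
// Sketch only: wrapper module and net names are hypothetical.
module word_clock_div6 (
    input  wire bitclk_in,  // 240 MHz bit clock from the input buffer
    input  wire clr,        // active-high asynchronous clear for the divider
    output wire wordclk     // bit clock divided by 6 (40 MHz)
);
    BUFR #(
        .BUFR_DIVIDE("6"),       // divide the incoming clock by 6
        .SIM_DEVICE("7SERIES")
    ) BUFR_WordClk (
        .O(wordclk),             // divided regional clock output
        .CE(1'b1),               // clock enable (divided modes only)
        .CLR(clr),               // asynchronous clear (divided modes only)
        .I(bitclk_in)            // clock input
    );
endmodule
```

Which word boundary the divided clock happens to start on is not controlled here, which is why the sampled frame clock and BITSLIP (or an equivalent fabric shifter) are still needed for framing.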

 

Avrum

Adventurer
Posts: 57
Registered: ‎05-15-2014

Re: SelectIO output bus width of ISERDES

OK.  First my mistake.  The clock period is 4 ns, but you are correct that the data is changing on each clock edge or 2 ns.  No FF has to clock at 2 ns, but the setup and hold times of a properly centered clock would have to be 1 ns ea.  The theoretical maximum eye width is 2 ns.

 

It appears that there are four possible ways to do this (with my ADC):

  1.  Ignore ISERDES and use fabric.  Not sure that's reasonable at this data rate.

  2. Use two unlinked SDR ISERDES one on each clock edge.  There is a sample of that available.  It would seem that this might be more complex because it could lead to backwards interleaving, requiring more complex bit swapping logic.

  3. Use one DDR ISERDES and do a 2:1 demux in fabric.  That seems like the least complex solution.

  4. Tell the ADC to stuff a couple of extra zeros in the word (which it can do) and do a linked DDR ISERDES of 14 bits.  This, of course, reduces timing tolerances a bit more, and increases power consumption a bit, due to the bit clock now being 280 MHz., but the SERDES and bitslip become straightforward.

 

Any comments on the relative merits of these options?

 

Wilton

Instructor
Posts: 3,702
Registered: ‎01-23-2009

Re: SelectIO output bus width of ISERDES

At 240 DDR, all of these are possible - you don't need the ISERDES, since the BUFIO and BUFR can both run at 240MHz internally. Or you use the ISERDES to reduce the internal frequency if you want. You probably don't need the dual SDR ISERDES, but you can use it if you want...

 

The real question is "Can you capture this interface statically?" If yes, then any of these mechanisms will do. If no, then you need dynamic calibration. Pushing this to 280MHz only makes this worse, so I would not pursue that last option (artificially increasing the serialization to 14).
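To put a rough number on "only makes this worse", the sketch below compares the bit time for 12-bit and 14-bit framing at the same 40 MS/s sample rate. The helper name is hypothetical, and this is only the ideal bit time; it says nothing about jitter, skew, or setup/hold requirements.

```python
def bit_time_ns(sample_rate_msps, bits_per_frame):
    """Ideal duration of one serial bit, in ns (hypothetical helper)."""
    return 1000.0 / (sample_rate_msps * bits_per_frame)


t12 = bit_time_ns(40, 12)   # native 12-bit framing, 240 MHz DDR
t14 = bit_time_ns(40, 14)   # padded 14-bit framing, 280 MHz DDR
lost = t12 - t14            # eye width given up before any other uncertainty

print(f"12-bit: {t12:.3f} ns, 14-bit: {t14:.3f} ns, lost: {lost:.3f} ns "
      f"({100.0 * lost / t12:.1f}% of the 12-bit bit time)")
```

Whatever fixed uncertainty the interface has (source skew, board mismatch, capture window) comes out of the smaller number, so the padded format leaves proportionally less margin.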

 

You need to write the constraints and set up the clock structure for capture and see if you can make it pass. If it passes with the IDDR, then it will pass with the ISERDES (they are the same capture resource). At these frequencies, the constraints have to be perfect - you need to take into account duty cycle imbalance, jitter, board delay imbalance, as well as the worst case properties of your source device...

 

Avrum

Adventurer
Posts: 57
Registered: ‎05-15-2014

Re: SelectIO output bus width of ISERDES

OK.  I figured the 14 bit solution would only make matters worse.  I'm also coming to the conclusion that ISERDES is only valuable for cases where it is badly needed.  I could receive this data cleanly with a couple of fast CMOS ICs with no issues, so I should be able to capture it in a straightforward manner with fabric logic.  In my estimation at present, it will take more fabric to make ISERDES work for this application than it will to deserialize directly in fabric.  It seems to me that in order for ISERDES to work its magic in cases where it is needed that it has significantly more restrictions than the straightforward implementation needed for this particular case would.  Even assuming that my hunch about directly using the frame clock worked, I can't do 12 bit, and any other variation is going to require modification of the frame clock, which opens a whole can of worms for timing because it introduces delays that don't match the bit clock path.  And if I can't directly use frame clock, it appears that I have to implement a state machine for bitslip.

 

I will follow up once I have hardware and can test.

Adventurer
Posts: 57
Registered: ‎05-15-2014

Re: SelectIO output bus width of ISERDES

I promised an update on this.  So now that I have a working design, I will share what I learned.  The task was to interface an AD9222 12 bit 8 channel ADC to the Zynq.  The input clock to the ADC is 40 MHz coming from FPGA CLK 0 from the PS.  It is not used by the Zynq in any other manner.  The ADC generates a 240 MHz bit clock from this and outputs DDR data at that rate.  It supplies a bit clock output with transitions at the center of each data period, and a frame clock output with a positive edge crossing aligned with the start of the first bit of each data word.

 

It was not necessary to use any dynamic timing to accurately deserialize this data.  ISERDES did not appear to be a useful primitive, partly because of its failure to support 12 bit data, and partly because it simply isn't designed to take advantage of the frame clock.  It is a valuable primitive for situations where it is needed: where fabric couldn't handle the data rate, or where dynamic timing adjustments are essential to successful data recovery.  But anyone looking at ISERDES as I was, as a useful primitive to avoid using fabric, is likely to find that it takes more fabric to use it than it does to just do the job.

 

The solution starts with timing.  Vivado puts much more emphasis on timing than ISE did, because timing constraints drive place and route decisions--which is actually a welcome improvement.  There are two clocks that have to be considered here.  One is the bit clock the ADC provides.  The other is the internal one in the ADC that causes the data to be shifted out.  They are both of the same frequency, but the timing (phase) relationship is important.  The output clock has been 90 degree phase shifted to align with the middle of the data cell.  In addition, Vivado assumes that source clocks are what causes data changes, so the edge prior to the data change is compared to the destination clock edge that captures the change.  While there are multiple ways to represent this, I did not (as other posts have done) find it necessary to use the "multicycle" constraint to represent this.  I represented it by the relative timing values in the waveform parameter.  Note that there is a 1/2 cycle discrepancy in the following numbers that will be addressed later.  Here are the clock constraints:

 

#virtual clock driving data from ADC
create_clock -period 4.167 -name ADC_Data_Clk -waveform {-3.125 -1.042}
#Actual 90 degree shifted bit clock from ADC
create_clock -period 4.167 -name ADCBITCLK -waveform {-2.083 0} [get_ports ADCBITCLK_v_p]
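As a cross-check on the waveform numbers in the two create_clock lines above, here is a small sketch that derives them from the 240 MHz bit clock. The function name is made up; the assumptions are the ones stated in the post: the data-launch clock leads the bit clock by a quarter period (90 degrees), and the bit clock waveform is written with the half-cycle offset the post mentions.

```python
def center_aligned_ddr_waveforms(bit_clock_mhz):
    """Return (period, data clock edges, bit clock edges), all in ns, rounded to 3 places.

    Hypothetical helper reproducing the numbers used in the constraints:
    the bit clock is described as rising at -T/2 and falling at 0, and the
    virtual data clock has the same shape a quarter period earlier.
    """
    period = 1000.0 / bit_clock_mhz
    quarter = period / 4.0
    bit_clk = (-period / 2.0, 0.0)
    data_clk = (bit_clk[0] - quarter, bit_clk[1] - quarter)

    def fmt(edges):
        return tuple(round(e, 3) for e in edges)

    return round(period, 3), fmt(data_clk), fmt(bit_clk)


period, data_edges, bit_edges = center_aligned_ddr_waveforms(240.0)
print(period, data_edges, bit_edges)
```

For 240 MHz this gives a 4.167 ns period with edges at -3.125/-1.042 ns for the virtual clock and -2.083/0 ns for the bit clock, matching the values in the constraints.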

 

The first is a "virtual clock" that describes what causes the data to be generated.  The second is the actual clock that receives it.  This next line just assures Vivado that there is no point in looking for a relationship between the clock driving the ADC and the clocks being constrained:

 

set_clock_groups -asynchronous -group [get_clocks ADCBITCLK] -group [get_clocks clk_fpga_0]

 

Then the next four lines (because this is DDR) tell Vivado that there is no source - destination relationship between the positive edge of one clock and the negative edge of the other:

 

set_false_path -setup -fall_from [get_clocks ADC_Data_Clk] -rise_to [get_clocks ADCBITCLK]
set_false_path -setup -rise_from [get_clocks ADC_Data_Clk] -fall_to [get_clocks ADCBITCLK]
set_false_path -hold -fall_from [get_clocks ADC_Data_Clk] -rise_to [get_clocks ADCBITCLK]
set_false_path -hold -rise_from [get_clocks ADC_Data_Clk] -fall_to [get_clocks ADCBITCLK]

 

And finally, these lines document the variability possible in the data, as defined in the ADC data sheet.  Note that this is an 8 channel ADC, but I'm only showing a single channel to keep it readable:

 

set_input_delay -clock [get_clocks ADC_Data_Clk] -clock_fall -min -add_delay -0.300 [get_ports ADCA_v_n]
set_input_delay -clock [get_clocks ADC_Data_Clk] -clock_fall -max -add_delay 0.300 [get_ports ADCA_v_n]
set_input_delay -clock [get_clocks ADC_Data_Clk] -min -add_delay -0.300 [get_ports ADCA_v_n]
set_input_delay -clock [get_clocks ADC_Data_Clk] -max -add_delay 0.300 [get_ports ADCA_v_n]

 

The actual implementation consisted of bringing the clock in on an IBUFDS, then through IDELAYE2 and BUFR.  The data comes in on IBUFDS and IDELAYE2 to IDDR.  The frame clock is brought in like data, but using only one output of the IDDR.  There does not appear to be anything like an IFD, so IDDR was used.  Because IDELAY was used, IDELAYCTRL has to be instantiated and given a 200 MHz clock.  Since this is the only use of IDELAY, grouping was not needed.

 

   IDELAYCTRL IDELAYCTRL_inst (
       .RDY(),       // 1-bit output: Ready output
       .REFCLK(DlyClk), // 1-bit input: Reference clock input
       .RST(!S_AXI_ARESETN)        // 1-bit input: Active high reset input
    );

    IBUFDS IBUFDS_Bit (
       .O(bitclk),  // Buffer output
       .I(ADCbitClk_v_p),  // Diff_p buffer input (connect directly to top-level port)
       .IB(ADCbitClk_v_n) // Diff_n buffer input (connect directly to top-level port)
    );
   
   IDELAYE2 #(
      .CINVCTRL_SEL("FALSE"),          // Enable dynamic clock inversion (FALSE, TRUE)
      .DELAY_SRC("IDATAIN"),           // Delay input (IDATAIN, DATAIN)
      .HIGH_PERFORMANCE_MODE("FALSE"), // Reduced jitter ("TRUE"), Reduced power ("FALSE")
      .IDELAY_TYPE("FIXED"),           // FIXED, VARIABLE, VAR_LOAD, VAR_LOAD_PIPE
      .IDELAY_VALUE(DelayCount),                // Input delay tap setting (0-31)
      .PIPE_SEL("FALSE"),              // Select pipelined mode, FALSE, TRUE
      .REFCLK_FREQUENCY(200.0),        // IDELAYCTRL clock input frequency in MHz (190.0-210.0, 290.0-310.0).
      .SIGNAL_PATTERN("CLOCK")          // DATA, CLOCK input signal
   )
   IDELAYE2_C (
      .CNTVALUEOUT(), // 5-bit output: Counter value output
      .DATAOUT(bitclkd),         // 1-bit output: Delayed data output
      .C(1'b0),                     // 1-bit input: Clock input
      .CE(1'b0),                   // 1-bit input: Active high enable increment/decrement input
      .CINVCTRL(1'b0),       // 1-bit input: Dynamic clock inversion input
      .CNTVALUEIN(5'b0),   // 5-bit input: Counter value input
      .DATAIN(1'b0),           // 1-bit input: Internal delay data input
      .IDATAIN(bitclk),         // 1-bit input: Data input from the I/O
      .INC(1'b0),                 // 1-bit input: Increment / Decrement tap delay input
      .LD(1'b0),                   // 1-bit input: Load IDELAY_VALUE input
      .LDPIPEEN(1'b0),       // 1-bit input: Enable PIPELINE register to load data input
      .REGRST(1'b0)            // 1-bit input: Active-high reset tap-delay input
   );
  
   BUFR #(
       .BUFR_DIVIDE("BYPASS"),   // Values: "BYPASS, 1, 2, 3, 4, 5, 6, 7, 8"
       .SIM_DEVICE("7SERIES")  // Must be set to "7SERIES"
    )
    BUFR_BitClk (
       .O(bitclkb),     // 1-bit output: Clock output port
       .CE(1'b1),   // 1-bit input: Active high, clock enable (Divided modes only)
       .CLR(1'b0), // 1-bit input: Active high, asynchronous clear (Divided modes only)
       .I(bitclkd)      // 1-bit input: Clock buffer input driven by an IBUF, MMCM or local interconnect
    );

    IBUFDS IBUFDS_Frame (
       .O(frameclk),  // Buffer output
       .I(ADCframeClk_v_p),  // Diff_p buffer input (connect directly to top-level port)
       .IB(ADCframeClk_v_n) // Diff_n buffer input (connect directly to top-level port)
    );

   IDELAYE2 #(
      .CINVCTRL_SEL("FALSE"),          // Enable dynamic clock inversion (FALSE, TRUE)
      .DELAY_SRC("IDATAIN"),           // Delay input (IDATAIN, DATAIN)
      .HIGH_PERFORMANCE_MODE("FALSE"), // Reduced jitter ("TRUE"), Reduced power ("FALSE")
      .IDELAY_TYPE("FIXED"),           // FIXED, VARIABLE, VAR_LOAD, VAR_LOAD_PIPE
      .IDELAY_VALUE(0),                // Input delay tap setting (0-31)
      .PIPE_SEL("FALSE"),              // Select pipelined mode, FALSE, TRUE
      .REFCLK_FREQUENCY(200.0),        // IDELAYCTRL clock input frequency in MHz (190.0-210.0, 290.0-310.0).
      .SIGNAL_PATTERN("CLOCK")          // DATA, CLOCK input signal
   )
   IDELAYE2_F (
      .CNTVALUEOUT(), // 5-bit output: Counter value output
      .DATAOUT(frameclkd),         // 1-bit output: Delayed data output
      .C(1'b0),                     // 1-bit input: Clock input
      .CE(1'b0),                   // 1-bit input: Active high enable increment/decrement input
      .CINVCTRL(1'b0),       // 1-bit input: Dynamic clock inversion input
      .CNTVALUEIN(5'b0),   // 5-bit input: Counter value input
      .DATAIN(1'b0),           // 1-bit input: Internal delay data input
      .IDATAIN(frameclk),         // 1-bit input: Data input from the I/O
      .INC(1'b0),                 // 1-bit input: Increment / Decrement tap delay input
      .LD(1'b0),                   // 1-bit input: Load IDELAY_VALUE input
      .LDPIPEEN(1'b0),       // 1-bit input: Enable PIPELINE register to load data input
      .REGRST(1'b0)            // 1-bit input: Active-high reset tap-delay input
   );
  
    IDDR #(
        .DDR_CLK_EDGE("SAME_EDGE") // "OPPOSITE_EDGE", "SAME_EDGE"
                                             //    or "SAME_EDGE_PIPELINED"
    ) IDDR_Frame (
        .Q1(), // 1-bit output for positive edge of clock
        .Q2(frameclk1), // 1-bit output for negative edge of clock
        .C(bitclkb),   // 1-bit clock input
        .CE(1'b1), // 1-bit clock enable input
        .D(frameclkd),   // 1-bit DDR data input
        .R(1'b0),   // 1-bit reset
        .S(1'b0)    // 1-bit set
    );

    IBUFDS IBUFDS_ADC0 (
       .O(ADC0data),  // Buffer output
       .I(ADC0_v_p),  // Diff_p buffer input (connect directly to top-level port)
       .IB(ADC0_v_n) // Diff_n buffer input (connect directly to top-level port)
    );

   IDELAYE2 #(
      .CINVCTRL_SEL("FALSE"),          // Enable dynamic clock inversion (FALSE, TRUE)
      .DELAY_SRC("IDATAIN"),           // Delay input (IDATAIN, DATAIN)
      .HIGH_PERFORMANCE_MODE("FALSE"), // Reduced jitter ("TRUE"), Reduced power ("FALSE")
      .IDELAY_TYPE("FIXED"),           // FIXED, VARIABLE, VAR_LOAD, VAR_LOAD_PIPE
      .IDELAY_VALUE(0),                // Input delay tap setting (0-31)
      .PIPE_SEL("FALSE"),              // Select pipelined mode, FALSE, TRUE
      .REFCLK_FREQUENCY(200.0),        // IDELAYCTRL clock input frequency in MHz (190.0-210.0, 290.0-310.0).
      .SIGNAL_PATTERN("DATA")          // DATA, CLOCK input signal
   )
   IDELAYE2_0 (
      .CNTVALUEOUT(), // 5-bit output: Counter value output
      .DATAOUT(ADC0datad),         // 1-bit output: Delayed data output
      .C(1'b0),                     // 1-bit input: Clock input
      .CE(1'b0),                   // 1-bit input: Active high enable increment/decrement input
      .CINVCTRL(1'b0),       // 1-bit input: Dynamic clock inversion input
      .CNTVALUEIN(5'b0),   // 5-bit input: Counter value input
      .DATAIN(1'b0),           // 1-bit input: Internal delay data input
      .IDATAIN(ADC0data),         // 1-bit input: Data input from the I/O
      .INC(1'b0),                 // 1-bit input: Increment / Decrement tap delay input
      .LD(1'b0),                   // 1-bit input: Load IDELAY_VALUE input
      .LDPIPEEN(1'b0),       // 1-bit input: Enable PIPELINE register to load data input
      .REGRST(1'b0)            // 1-bit input: Active-high reset tap-delay input
   );
  
    IDDR #(
        .DDR_CLK_EDGE("SAME_EDGE") // "OPPOSITE_EDGE", "SAME_EDGE"
                                             //    or "SAME_EDGE_PIPELINED"
    ) IDDR_D0 (
        .Q1(ADC0dataE), // 1-bit output for positive edge of clock
        .Q2(ADC0dataO), // 1-bit output for negative edge of clock
        .C(bitclkb),   // 1-bit clock input
        .CE(1'b1), // 1-bit clock enable input
        .D(ADC0datad),   // 1-bit DDR data input
        .R(1'b0),   // 1-bit reset
        .S(1'b0)    // 1-bit set
    );

 

The outputs of the IDDRs go to shift registers:

 

    always @ (posedge bitclkb) begin           // Shift odd and even bits on the positive clock edge (IDDR SAME_EDGE)
        frameclk2 <= frameclk1;
        for (i = 5; i > 0; i = i - 1) begin    // advance the shift registers (one odd/even pair per channel; only channel 0 shown)
            ADC0Odd[i] <= ADC0Odd[i-1];
            ADC0Even[i] <= ADC0Even[i-1];
        end
        ADC0Odd[0] <= ADC0dataO;                // shifting in the new bit
        ADC0Even[0] <= ADC0dataE;                // shifting in the new bit
        if (frameclk1 & !frameclk2) begin
            tempframe <= frameclki;
            for (i = 5; i >= 0; i = i - 1) begin    // demux odd and even bits
                ADC0Word[2 * i + 1] <= ADC0Odd[i];
                ADC0Word[2 * i] <= ADC0Even[i];
            end
        end
    end

 

The last part above is a second (latching) register.  frameclk2 is a copy of frameclk1, delayed one clock cycle.  During the clock cycle when frameclk1 is high and frameclk2 is low, the contents of the shift registers are transferred to the holding register and de-interleaved at the same time.
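For readers who want to convince themselves that the shift, edge-detect and de-interleave steps reassemble a word correctly, here is a rough software model of one channel. It is a sketch under assumptions that are not all stated in the post: the ADC sends MSB first, the older bit of each pair appears on the IDDR Q2 ("odd") output, the sampled frame signal has the same latency as the data, and the frame is high for the first three bit-clock cycles of each word. The function names are invented for illustration.

```python
def bit_pairs(word):
    """Split a 12-bit word, MSB first, into six (odd, even) pairs (assumed ordering)."""
    bits = [(word >> (11 - k)) & 1 for k in range(12)]
    return [(bits[2 * t], bits[2 * t + 1]) for t in range(6)]


def receive(words):
    """Model the shift registers and frame edge detect, one loop pass per bit-clock cycle."""
    odd = [0] * 6       # mirrors ADC0Odd: index 0 holds the newest bit
    even = [0] * 6      # mirrors ADC0Even
    frame2 = 0          # frame sample delayed by one cycle
    captured = []
    for word in words:
        for t, (o, e) in enumerate(bit_pairs(word)):
            frame1 = 1 if t < 3 else 0
            # Rising edge of the sampled frame: latch the previous word.
            # The registers are read before shifting, as nonblocking
            # assignments in the Verilog would do.
            if frame1 and not frame2:
                value = 0
                for i in range(6):
                    value |= odd[i] << (2 * i + 1)
                    value |= even[i] << (2 * i)
                captured.append(value)
            odd = [o] + odd[:5]
            even = [e] + even[:5]
            frame2 = frame1
    return captured


print([hex(w) for w in receive([0xABC, 0x123, 0x000])])
```

The first captured value is whatever was in the empty registers (zero here); each later value is the word sent one frame earlier. If the bit order or the Q1/Q2 pairing differs on real hardware, the mapping in the inner loop is the part to change.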

 

IDELAY was put on all inputs.  The data inputs (including frame clock) were set to tap 0, and the clock was ultimately set to tap 9 to center it in the data window.  There were two reasons for putting IDELAY in the data path.  First, even at tap 0, it introduces some delay, and without it, the clock would probably have had to be set at tap 0, leaving no room for adjustment.  The initial version used a loadable delay that was controlled by RS-232 input to the program running on the PS.  Minimum and maximum values that produced correct output were determined.  The second reason is that it creates a balanced signal flow path, which cancels out variations in IDELAY time.  IDELAY could have been used on the data instead (taps other than 0), since the clock path is already longer.  However, a data path sees data-dependent jitter that a clock path does not, because a clock is repetitive.

 

IDDR has three output options.  OPPOSITE_EDGE gives the least delay, but at some point the data is going to need to be accessible from a single clock (edge), and in my opinion, the sooner, the better.  SAME_EDGE solves this problem with the addition of a buffer on one of the paths to retime it, at a cost of half a clock cycle.  But the two data bits coming out can be from consecutive words, which isn't terribly useful at first glance.  SAME_EDGE_PIPELINED introduces additional buffering to resolve that problem at the expense of an extra clock cycle of delay.  SAME_EDGE can be the best of all worlds, though, if the clock is simply inverted.  In this case it isn't being physically inverted, but because the clock path is longer, it is easy to create the inversion by simply accepting a half cycle of delay and specifying the clock timing 1/2 cycle off.

 

Tests were run over the range of -40 to +85 C.  The data was stable in all cases over a clock tap range of 2 to 26 (a 1.8 ns range) and the minimum and maximum changed by only 2 taps over the entire range.  That is quite good, considering that the theoretical data window is 2 ns.  This was on a sample of 2, so actual device to device variations would undoubtedly reduce it some.  Also, while the two ICs were only about two inches apart, all lines were controlled impedance, differential pairs of equal length.
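The tap numbers convert to time as follows. This is a sketch using the 7 series IDELAY tap resolution of 1/(32 x 2 x F_REF) with the 200 MHz reference clock used here; the helper names are invented, and the result is the nominal calibrated value rather than a measurement.

```python
def idelay_tap_ps(refclk_mhz):
    """Nominal IDELAYE2 tap size in ps for a given IDELAYCTRL reference clock."""
    return 1e6 / (32 * 2 * refclk_mhz)


def tap_window_ns(first_tap, last_tap, refclk_mhz=200.0):
    """Width in ns of the span between two tap settings (hypothetical helper)."""
    return (last_tap - first_tap) * idelay_tap_ps(refclk_mhz) / 1000.0


tap = idelay_tap_ps(200.0)          # 78.125 ps per tap
window = tap_window_ns(2, 26)       # taps 2..26 measured as working
bit_time = 1000.0 / 480.0           # ideal bit time at 480 Mb/s, in ns

print(f"tap = {tap} ps, window = {window} ns, "
      f"{100.0 * window / bit_time:.0f}% of the {bit_time:.3f} ns bit time")
```

With 24 taps of about 78 ps each, the working range comes to roughly 1.9 ns against an ideal bit time of 2.083 ns.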

 

Here are some general things learned from this exercise.  It was a "trial by fire" of Zynq and Vivado, but resulted in a lot of learning.

 

ISERDES.  When you need it you need it.  When you don't, you don't.  Anyone who sees (as I initially did) it as some useful "hard logic" to save PL resources is likely to be disappointed.  If fabric can deserialize with static timing, it will probably take less fabric than what ISERDES requires to support it.

 

BUFG is not your friend.  It is a valuable primitive when a clock needs to be widely distributed across the fabric, particularly if it originates internally, but it isn't the optimum solution for something like this.  BUFGs are located in the middle of the die and routing delays in and out of them can introduce several ns. of delay.  This can be compensated for by using a PLL with feedback, but BUFG should only be used when really needed.  BUFR and BUFIO are better choices for a more localized situation such as this.  BUFR includes a divider, but can be used in BYPASS mode.  It has less delay than BUFIO, and BUFIO can only clock IOLOGIC, whereas BUFR can source regional clock lines.

 

Some primitives, such as IBUFGDS, aren't really primitives.  IBUFGDS will get implemented as IBUFDS feeding BUFG.  This can be confusing, such as in timing reports where routing delays show up that one hadn't anticipated.  In fact, in one case, the implementation schematic took the input and output nets for the BUFG outside the Verilog module (as I/O) and placed the BUFG at a higher level of the hierarchy.  Just a quirk of the tools, but one that can be confusing.

 

There is also some overlap in primitives.  For instance, it would appear that IDDR is actually a degenerate case of ISERDES.  It is well worthwhile to look at "implemented design" and see how things are implemented.  There is so much on the die that it takes a lot of zooming to get the details, but it is worth the effort.  It is easier to design efficiently when one understands the hardware.  There appear to be four I/O related blocks:  IOB is the pin driver/receiver, including linking two together to form a differential input.  IODELAY is a separate block.  IOLOGIC is a separate block that implements IDDR and ISERDES.  In fact it does not appear that there is an equivalent of IFD, so IDDR fulfills that function.  For clock capable pins there is the block that contains BUFR and BUFIO.  Related to that is the dedicated clock routing fabric.  While it isn't zero delay, the delay is small and consistent.

 

Worst case timing isn't as bad as one might suspect.  Computing timing manually is tedious and fraught with error.  Maximum data delay and minimum clock delay will never occur simultaneously, and the program knows that.  It only uses limits that can occur together.  The variations in delay over temperature, voltage and process are higher than one might expect, but they mostly track.  My paper analysis said that timing closure was not possible.  The timing report said it had 1 ns of slack.  Our measurements showed 1.8 ns of slack (but only for a sample of 2).

 

Based on the results of this project and the testing done, I am of the opinion that any DDR serial interface with a clock that isn't too fast for the fabric to handle can be successfully implemented in fabric with static timing and no ISERDES.  Only if speeds are too high for the fabric to handle, or timing variations are so large that dynamic timing is required, is ISERDES needed, and that is probably the only situation where it is of any use or value.