10-08-2020 09:00 AM
Hi everybody, I was hoping to get some help with an issue we are facing. I'll try to explain as clearly as I can and please let me know if there are any further questions that could help us try to get to the bottom of what I am experiencing.
We implemented our own version of XAPP523 running on a Zynq7020 and an ArtixA100T to communicate between two boards using plastic optical fiber at 250Mbaud. Here is a diagram of the overall architecture:
The design has been tested and validated, and it works fine with some implementations. Sometimes when we make a change and get a new implementation, the design stops working even if the logic that was changed had nothing to do with the communication IP itself, for example just adding some ILAs or other small "benign" changes.
We also noticed that we could generate two implementations from the exact same source code and exact same block diagram on two different computers, and one implementation operated properly while the other didn't. This triggered the investigation explained below.
While looking at one implemented design that worked and one that didn't I noticed something interesting in the placement of the logic. Here is another diagram to explain the inner part of the FPGA:
We ended up writing our own DRU logic and really only used the concepts from XAPP523 to use the ISERDES block in OVERSAMPLE mode and the IDELAY blocks to do 4X Oversampling of the asynchronous data stream.
Here is a picture of an implemented design where the IP does not operate as expected. The leaf cells of the IP are highlighted, and you can see that the TX and RX differential pairs and the IDELAY and ISERDES blocks are at the top left of the Artix chip (X0Y3 region). We were able to determine that the issue was not in the Zynq7020 device but in the Artix device:
Compare that with an implemented design that operates correctly:
Note that timing closed in both implemented designs.
I noticed that in the design that worked the logic appeared to be closer to the ISERDES and IDELAY blocks, so I decided to experiment with PBLOCK constraints and force the placer to get our IP close to the differential input pairs, ISERDES and IDELAY blocks:
Once I added these constraints, the design appears to work every time we implement, so I decided to do another experiment.
I created a PBLOCK constraint where the placement is very far away from the differential input pairs, ISERDES and IDELAY blocks:
This implemented design always fails in hardware, so this led me to conclude that the placement of the IP in the chip really makes a difference.
Now for us the real problem here appears to be some missed path in the timing analysis, since all of the examples shown above close timing according to the tool. I should note that we do not have any input or output constraints on this design.
I am struggling to see why we would need input or output delay constraints since the input and output pins for this communication scheme are completely asynchronous to any clock in the FPGA.
We are a bit nervous about using the PBLOCK constraints without understanding why they matter. Is there a timing constraint that will force the tool to place the logic closer to the IO without the need for floor planning?
It is my understanding that floor planning is only necessary when you are not closing timing, but in our case the tool seems to think that timing always closes.
Hopefully someone can shed some light here. Please let me know if I can clarify anything or if you have any questions. The actual IP design is proprietary, so I can't really share the code in the forums.
10-11-2020 10:21 AM
Designs that work after some implementation runs and don't after others are almost always related to incorrect or incomplete timing constraints; the design will work when it happens to have timing close to what the real constraints should be, but will not work if the timing is between what the real constraints should be and what the current constraints are. This often has to do with placement - a placement that would otherwise be "illegal" (would fail timing) with the correct constraints, but is not discarded by the tools since it doesn't violate the current constraints.
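To make that failure window concrete, here is a minimal Python sketch. The function name and all delay numbers are hypothetical, chosen only for illustration; none of them come from the design in this thread.

```python
def classify_path(delay_ns, applied_req_ns, real_req_ns):
    """Classify a path delay against the constraint the tool checks
    and the requirement the hardware actually has.

    Assumes applied_req_ns >= real_req_ns, i.e. the path is under-constrained.
    """
    if delay_ns <= real_req_ns:
        return "passes timing, works in hardware"
    if delay_ns <= applied_req_ns:
        # The tool accepts this placement, but the real requirement is violated.
        return "passes timing, fails in hardware"
    return "fails timing"


# Hypothetical numbers: the tool checks against 4.0 ns, the hardware needs 1.0 ns.
for delay in (0.8, 2.5, 4.6):
    print(delay, "->", classify_path(delay, applied_req_ns=4.0, real_req_ns=1.0))
```

Any placement whose delay lands in the middle band is reported as clean by the tool, which would match a design that works on some implementation runs and not on others.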
Since the timing of the inputs is effectively asynchronous (since you are oversampling) and all the clocks come from the same MMCM, there should be few opportunities for incorrect constraints (I presume you don't have any timing exceptions on paths between the clocks).
However, one thing that could be an issue is the PHASESHIFT_MODE of the MMCM. For this design the PHASESHIFT_MODE MUST be set to WAVEFORM - if it is set to LATENCY, then the requirement on the paths between the 0 and 90 degree shifted clocks will be incorrect (the paths will be significantly underconstrained, which can cause the problems you are seeing). Take a look at this post on the PHASESHIFT_MODE and the dangers when using a shifted clock.
Now, you say this is a 7 series device, and in the 7 series, the default mode for the MMCM is WAVEFORM (which is the mode you want), but it can be overridden either in the RTL code, the XDC files, or, if the MMCM is generated by the clocking wizard, then there too. Make sure you check...
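As a rough illustration of how tight the relationship between the 0 and 90 degree clocks is when the shift is modeled in the waveform, here is a small Python sketch. It assumes both sampling clocks run at 250 MHz (4 ns period), which is not stated in this thread, and the function name is made up for the example. It only computes the edge-to-edge spacing of the shifted waveforms; it does not model what the tool does in LATENCY mode.

```python
def setup_requirement_ns(period_ns, launch_phase_deg, capture_phase_deg):
    """Time from a launch rising edge to the next capture rising edge.

    Both clocks are assumed to have the same period and differ only in phase.
    """
    launch_edge = period_ns * launch_phase_deg / 360.0
    capture_edge = period_ns * capture_phase_deg / 360.0
    delta = (capture_edge - launch_edge) % period_ns
    # Coincident edges: the capture happens one full period later.
    return delta if delta > 0 else period_ns


PERIOD_NS = 4.0  # assumption: 250 MHz sampling clocks

print(setup_requirement_ns(PERIOD_NS, 0, 90))   # 0 degree clock to 90 degree clock
print(setup_requirement_ns(PERIOD_NS, 90, 0))   # 90 degree clock to 0 degree clock
```

Under that assumption, the path from the 0 degree clock to the 90 degree clock has only a quarter of a period available, so analyzing it against anything looser leaves a lot of room for a placement that passes timing but fails in hardware.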
That being said, there are simpler ways of oversampling at these low frequencies. If you have a 250Mbps signal and you want to oversample it by 4x, then just use the MMCM to generate a 500MHz clock (on CLKOUT0-CLKOUT3), route that directly to a BUFIO and a BUFR and have those drive an ISERDES in DDR mode. This will sample the incoming signal at 1Gsps. If you want the result to be 8 bits at 125MHz (which is what it looks like your current design is doing), then set the ISERDES deserialization to 8 and the BUFR divide to 4 and you have 8 samples at 125MHz which you can use for your DRU. This is a pretty foolproof implementation - far less complicated than the oversample mode, and especially with your "double" oversampling mode with 45 degree separation (which is clever, but not necessary). It also ensures that the 8 samples are evenly distributed; the mechanism you are using relies on the precision of the IDELAY to generate the 45 degree phase (which isn't guaranteed to be exactly 1/8th of your clock period).
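The rates in that suggestion can be checked with a few lines of Python. This is only an arithmetic sketch; the function and parameter names are invented for the example and do not correspond to any Vivado or primitive attribute names.

```python
def ddr_oversampling(bit_rate_mbps, clock_mhz, deser_width, bufr_divide):
    """Return the basic rates for an ISERDES clocked in DDR mode."""
    sample_rate_msps = clock_mhz * 2                    # DDR: one sample per clock edge
    samples_per_bit = sample_rate_msps / bit_rate_mbps  # oversampling ratio
    word_clock_mhz = clock_mhz / bufr_divide            # divided clock on the parallel side
    bits_per_word = deser_width / samples_per_bit       # data bits covered by each word
    return sample_rate_msps, samples_per_bit, word_clock_mhz, bits_per_word


rates = ddr_oversampling(bit_rate_mbps=250, clock_mhz=500, deser_width=8, bufr_divide=4)
print(rates)
```

With the numbers from the post, this gives 1000 Msps, 4 samples per bit, a 125MHz word clock, and 2 data bits per 8-sample word, so the parallel side consumes exactly what the serial side produces (125MHz x 8 samples = 1Gsps).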
You can even take this further - this solution will work up to around 1.6Gsps (depending on speedgrade) so you can do more than 4x oversampling with this solution.
The only disadvantage is that the DRU and the resulting data are on a BUFR clock (not a global clock). At these frequencies you can use BUFG clocking for the ISERDES (but the maximum frequency is lower) - generate both the 500MHz and 125MHz clocks from two different outputs of your MMCM and use BUFGs for both (and still use the ISERDES in 8x deserialization and DDR mode). As I mentioned, this is limited to the speed of the BUFG which is lower than that of the BUFIO (but is still OK at 500MHz).
The other option is to use the BUFIO and BUFR, run your DRU on the BUFR clock and then use a clock crossing FIFO to bring the resulting data into a 125MHz domain clocked by a BUFG; the same MMCM that generates the 500MHz clock to the BUFIO/BUFR would also generate a second 125MHz output and use a BUFG. The two 125MHz clocks (the BUFR and the BUFG) should be considered to be "mesochronous" - a simple clock crossing FIFO is sufficient to cross between these two domains.
10-12-2020 08:58 AM
Thanks for your reply. I did check the MMCM and it appears to be set to WAVEFORM mode. Also thanks for the recommendation on the oversampling of the data, we did spend quite a bit of time to get this working so a re-development of the IP is unlikely unless we really have to.
Actually, I have run a couple more experiments and I think I have convinced myself that the issue is not in the receive path. I fully agree with you that this looks like a path that is not properly constrained, but I have some evidence indicating that the responsible path may be in the transmitter rather than the receiver.
Here is the experiment I ran: I created one PBLOCK constraint at the bottom of the chip and one at the top of the chip. I asked the placer to put the receiver circuit at the bottom of the chip and the transmitter at the top. When I implemented that, the design operated properly. Then I did the opposite: I placed the transmitter circuit at the bottom of the chip and the receiver at the top, and once again the circuit didn't operate as expected.
I do not have any timing exceptions between paths, and I fully agree that there aren't many opportunities for the timing analyzer to fail to explore a path since everything comes from the same MMCM (configured with the clk_wiz IP). We do have a clock domain crossing in the transmitter, where data is loaded in the 125MHz domain as a ten bit word and the 250MHz domain shifts the data out of the chip. I did run the report_cdc Tcl command, and the tool thinks all the CDC paths have been safely handled.
If the issue is indeed a clock domain crossing, I just don't understand why placement would matter at that point. If the issue is a signal not being properly transferred between domains, it shouldn't matter where the circuit is placed; it should always be a problem, shouldn't it? Maybe I'm missing something.
I appreciate the help.