05-07-2019 12:42 PM - edited 05-23-2019 12:59 PM
We are attempting to get a Zynq XC7Z045 to talk to an NVMe card, specifically a Samsung 960 Pro SSD. The '045 is part of a Knowres KRM-3Z7045 module. Both that module and the NVMe card are plugged into a custom carrier board.
Our current setup instantiates just the processor core and the AXI PCIe bridge as a root complex, as shown in this FPGA Developer Blog post. We're running single lane, and we've tried both Gen2 and Gen1 speeds. We are using Vivado 2018.1.
We are using ChipScope to look at the LTSSM in the AXI PCIe bridge. We see that the LTSSM cycles between states 0, 2, 4, 5, 2D (timeout), and back to 0.
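For anyone decoding a similar capture, here is a minimal sketch of turning raw LTSSM samples into a readable state sequence. The name table is an assumption: it covers only the five codes seen in this capture, with names taken from our reading of the 7 Series PCIe LTSSM encoding, so check it against the product guide for your core. `LTSSM_NAMES` and `decode_ltssm` are hypothetical names, not part of any Xilinx tool.

```python
# Assumed subset of LTSSM state codes (hex) and their names.
# Verify against the product guide for the core you are using.
LTSSM_NAMES = {
    0x00: "Detect.Quiet",
    0x02: "Detect.Active",
    0x04: "Polling.Active",         # link partners exchange TS1
    0x05: "Polling.Configuration",  # link partners exchange TS2
    0x2D: "Timeout to Detect",
}


def decode_ltssm(trace):
    """Collapse runs of identical samples and return the named state sequence."""
    names = []
    previous = None
    for code in trace:
        if code != previous:
            names.append(LTSSM_NAMES.get(code, f"unknown(0x{code:02X})"))
            previous = code
    return names


if __name__ == "__main__":
    # Hypothetical samples shaped like the cycle described above.
    samples = [0x00, 0x00, 0x02, 0x04, 0x04, 0x05, 0x05, 0x2D, 0x00]
    print(" -> ".join(decode_ltssm(samples)))
```

A link that trains successfully would leave this loop and continue into the Configuration states instead of returning to Detect.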
When in state 4, we see that the RX data appears to be a string of TS1 ordered sets. The LTSSM transitions to state 5 and the TX data then switches from TS1 to TS2. We never see TS2 from the NVMe card, so we eventually time out and cycle.
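A captured ordered set can be classified by its identifier symbols. The sketch below assumes Gen1/Gen2 8b/10b encoding and that the capture has already been aligned so that each set is 16 decoded bytes plus a per-byte K-character flag; `classify_ordered_set` is a hypothetical helper. The constants are the standard 8b/10b values: COM is K28.5 (0xBC), TS1 carries D10.2 (0x4A) in symbols 6 to 15, and TS2 carries D5.2 (0x45).

```python
COM = 0xBC     # K28.5, first symbol of every training set
TS1_ID = 0x4A  # D10.2, TS1 identifier in symbols 6..15
TS2_ID = 0x45  # D5.2, TS2 identifier in symbols 6..15


def classify_ordered_set(symbols, is_k):
    """Return "TS1", "TS2", or "unknown" for one aligned 16-symbol set.

    symbols: 16 decoded byte values.
    is_k:    16 booleans, True where the symbol is a K character.
    """
    if len(symbols) != 16 or len(is_k) != 16:
        return "unknown"
    # Symbol 0 must be the COM control character.
    if symbols[0] != COM or not is_k[0]:
        return "unknown"
    # Identifier symbols are data characters, never K characters.
    if any(is_k[6:16]):
        return "unknown"
    ident = symbols[6:16]
    if all(s == TS1_ID for s in ident):
        return "TS1"
    if all(s == TS2_ID for s in ident):
        return "TS2"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical TS1 with PAD (K23.7, 0xF7) in the link and lane fields.
    header = [COM, 0xF7, 0xF7, 0x20, 0x02, 0x00]
    k_flags = [True, True, True] + [False] * 13
    print(classify_ordered_set(header + [TS1_ID] * 10, k_flags))
    print(classify_ordered_set(header + [TS2_ID] * 10, k_flags))
```

Running this over the RX capture in states 4 and 5 would show whether the far end ever switches from TS1 to TS2.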
Here is a picture of state 4, demonstrating TS1 on the RX channel, and the transition to sending TS1 on the TX channel.
Here is a picture of transitioning to state 5, and showing the TS1 to TS2 transition on the TX channel. Note that RX is still TS1.
Here is a picture of the timeout from state 5 to 2D, then back to 0. Note that the SSD is still sending TS1.
We suspect that the NVMe card is not achieving bit/symbol lock, and so its LTSSM is stuck in state 4. Our root complex is achieving lock, transitioning to state 5, and timing out while waiting for TS2 from the NVMe.
Our question is: Why?
One possibility is a signal integrity issue. It is, after all, our custom carrier board.
We've looked into using IBERT, but the AXI PCIe bridge (v2.8) lists only the JTAG debugger among its debug options. Also, XAPP1198 says that "If the link does not train to any speed, including gen 1 speeds (2.5 Gb/s), then using Eye Scan is not recommended, and using the ltssm signal from the core is a better option."
This picture shows another, infrequent issue we've seen, and we are unsure whether it's related to the larger problem. Here, we've lost rxcdrlock, and the resulting RX data is corrupted. We then get 8B/10B and receive errors on rx_status. The error is intermittent, and we're not certain what causes it or what it indicates. What does this clue mean?
We would appreciate any suggestions.
05-19-2019 02:19 AM
Update on this problem (I'm also working on it).
We put a 4 GHz scope on the PCIe signals on the board, in both directions (FPGA->NVMe, and NVMe->FPGA), and the signal integrity is fine. So that's not the problem. We think we have some kind of configuration problem and are mystified about what it might be.
06-14-2019 11:26 AM
The fix turned out to be simple.
The Samsung NVMe will not come out of reset with a x1 link, but it comes up just fine with a x4 link.
It's possible this is because we were talking to the wrong lane when we tried x1. We don't know; we haven't tried different lanes.