11-20-2019 03:40 PM - edited 11-20-2019 04:07 PM
I am working on a PCIe design on UltraScale and UltraScale+. I'm using the bare PCIe gen 3 core, with a custom DMA engine. I am running into an interference issue between incoming and outgoing transfers. This appears to be the result of the core deasserting s_axis_rq_tready due to running out of buffer space for read completions. So I am attempting to use the transmit flow control interface to stall issuing read requests based on pcie_tfc_nph_av and pcie_tfc_npd_av. My initial attempt is to stall read request traffic when pcie_tfc_nph_av is less than 8. However, these signals are poorly documented and moreover do not seem to be operating correctly. Namely, the ILA reports that pcie_tfc_npd_av is stuck at 2, and pcie_tfc_nph_av gets stuck at 0 even with no activity on the interface, causing the design to hang.
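A minimal sketch of the gating described above, written as a small Python model rather than the actual RTL. The function name and the `threshold` parameter are illustrative only; the default of 8 is the value from the first attempt.

```python
def allow_read_request(pcie_tfc_nph_av: int, threshold: int = 8) -> bool:
    """Return True if the DMA engine may issue another read request.

    Models the stall condition from the post: read requests are held
    off while pcie_tfc_nph_av is less than the threshold.
    """
    return pcie_tfc_nph_av >= threshold


# Saturated at 0xf: requests may flow. Below the threshold: stall.
print(allow_read_request(0xF))  # True
print(allow_read_request(7))    # False
```

With this condition, a signal that reads 0 while the interface is idle stalls the engine permanently, which matches the hang described.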
Does anyone from Xilinx have any insight into pcie_tfc_nph_av and pcie_tfc_npd_av and why the core doesn't seem to be generating these signals correctly? Presumably I'm not doing something correctly, but pcie_tfc_npd_av only ever reading 2 makes no sense and seems to disagree with the documentation, and pcie_tfc_nph_av reading 0 when the core seems to be idle (both s_axis_rq_tvalid and m_axis_rc_tvalid low for an extended time) also makes no sense.
11-20-2019 05:59 PM
Alright, maybe scratch that about pcie_tfc_npd_av. According to the PCIe spec, it seems that it only applies to the data in a non-posted request itself, which means IO and config writes and atomic op requests. So it seems pcie_tfc_nph_av is the only thing I should be looking at for read requests. I may be mistaken in assuming pcie_tfc_npd_av was for the completion data associated with non-posted read requests.
Some additional observations: with cfg_fc_sel tied to 3'b100 to select available transmit credits, cfg_fc_npd is also always 2 and cfg_fc_nph seems to sit at 0x3d when idle. So pcie_tfc_npd_av being 2 seems to follow from cfg_fc_npd being 2. However, the behavior of pcie_tfc_nph_av does not make a great deal of sense. When not saturated at 0xf, it does not seem to consistently decrement when new read requests enter the core.
This is an ILA capture of the PCIe RQ interface, along with some flow control signals. Trigger point was pcie_tfc_nph_av == 0, which should never happen presuming pcie_tfc_nph_av is working "correctly" as no new read requests are sent to the core if pcie_tfc_nph_av is less than or equal to 8.
The following is what I think is going on here. Starting from about sample 1,820, the interface is idle and cfg_fc_nph is 0x3d. Then three back-to-back DMA reads are started, each producing a number of read request TLPs. As the read requests enter the core, the core sends them out over the link, and cfg_fc_nph drops accordingly. After sending 32 TLPs, cfg_fc_nph stops decrementing at 0x1d, which is exactly 32 less than 0x3d, so presumably the PCIe core has 32 slots for completion data, and they're all accounted for at that point. Then transmit buffer space starts being consumed. Once that gets below 16 slots, pcie_tfc_nph_av starts to decrement. Again, so far, so good. Once pcie_tfc_nph_av is no longer greater than 8, the DMA engine suspends issuing read requests, with pcie_tfc_nph_av holding at 7. Some time later, the link partner releases some flow control credits, and cfg_fc_nph rises. This makes sense. However, pcie_tfc_nph_av also rises, which does not make sense, as it seems like it should be limited by buffer space and not by header credits. As pcie_tfc_nph_av rises, the DMA engine again starts sending read requests. It sends exactly 7 read requests, then the core lowers s_axis_rq_tready. This would make perfect sense, presuming pcie_tfc_nph_av did not change. In my understanding, this should not be possible: s_axis_rq_tready should be high if pcie_tfc_nph_av is nonzero and the core has not received any write requests (all requests so far are read requests, indicated by tkeep = 0x000f). At any rate, s_axis_rq_tready rises again, then pcie_tfc_nph_av drops from 0xf to 0, even though far fewer than 15 read requests were passed to the core in that time.
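The credit arithmetic in that walkthrough can be checked directly. This is only a sketch of the bookkeeping, using the two cfg_fc_nph values read off the ILA capture and the fact that each read request TLP consumes one non-posted header credit; the constant names are made up for illustration.

```python
# Values read from the ILA capture described above.
CFG_FC_NPH_IDLE = 0x3D   # cfg_fc_nph with the interface idle
CFG_FC_NPH_FLOOR = 0x1D  # value at which cfg_fc_nph stopped decrementing

# One non-posted header credit is consumed per read request TLP sent.
requests_sent = CFG_FC_NPH_IDLE - CFG_FC_NPH_FLOOR
print(requests_sent)  # 32
```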
Eventually pcie_tfc_nph_av somehow gets stuck at 0 or 8, causing the design to hang as the DMA engine thinks there is no buffer space available.
I also tried another test: with ext tag turned off, limiting the DMA engine to 32 in-flight tags and 32 in-flight read requests, presumably the completion buffer will always have room. However, pcie_tfc_nph_av still changes in this case, getting stuck at 8 and causing the design to hang.
I will try this same test again (ext tags disabled) with my flow control connection disabled as well. If that works, it does raise the question of whether it makes any sense to use extended tags at all if the core only has buffer space for 32 in-flight read requests.
11-22-2019 01:28 PM
So I ran another test with ext tags disabled via PCIe config space and no flow control. No hangs, as expected, and similar throughput to the run with ext tags enabled. However, pcie_tfc_npd_av is all over the place. I would assume that since I never have more than 32 requests in flight, and the PCIe core seems to have buffer space for 32 completions, no transmit buffer space for non-posted operations should ever be used, and as such pcie_tfc_npd_av should always read as 0xf. However, running the ILA shows it getting stuck at 0 occasionally. I also took a look at the Verilog files for the PCIe IP core, and it looks like pcie_tfc_npd_av is directly driven by the PCIE40E4 primitive.
Is this behavior a bug in the PCIe hard IP core itself? If so, is there any way to get a reliable indication of transmit buffer availability, or am I going to have to limit this manually in the DMA engine? And if so, what should this limit be: the total number of in-flight requests? It seems like the core can handle completions for 32 in-flight requests of 512 bytes each; would that number change with a different max_read_request setting? Does the limit need to be computed somehow based on the size of the requests?
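To make the last question concrete, here is the arithmetic under one possible hypothesis: that the completion space is a byte budget equal to the observed 32 requests of 512 bytes each. That hypothesis is an assumption, not something confirmed by the documentation or by the tests above, and the function name is made up for illustration. If the limit is instead a fixed count of slots, the answer stays at 32 regardless of request size.

```python
def max_in_flight_if_byte_limited(request_bytes: int,
                                  observed_requests: int = 32,
                                  observed_request_bytes: int = 512) -> int:
    """In-flight read requests that fit in a byte-based completion budget.

    ASSUMPTION: the budget is bytes, sized from the observed
    32 requests x 512 bytes. This is a hypothesis, not a documented fact.
    """
    budget_bytes = observed_requests * observed_request_bytes  # 16384
    return budget_bytes // request_bytes


for size in (128, 256, 512, 1024, 4096):
    print(size, max_in_flight_if_byte_limited(size))
```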
11-26-2019 04:20 PM
As a workaround, I have implemented a scheme to limit the number of read requests in the core transmit queue using the RQ transmit sequence number interface. If the number of non-posted requests in the non-posted TX queue is limited to less than the TX queue size, which seems to be around 25 requests, then the tready signal never falls due to the non-posted TX queue filling up.
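A minimal behavioral sketch of that limiter, as a Python model rather than the actual RTL. The class and method names are hypothetical, and the default limit of 24 is only an illustrative value just below the roughly 25-entry queue size estimated above. The model assumes one count is added when a read request is handed to the core and one is removed when the core reports that request's sequence number back.

```python
class NonPostedTxLimiter:
    """Toy model of a cap on read requests held in the core's TX queue."""

    def __init__(self, max_in_queue: int = 24):
        # 24 is illustrative (just under the ~25 entries estimated in the
        # post); it is not a documented queue depth.
        self.max_in_queue = max_in_queue
        self.in_queue = 0

    def can_issue(self) -> bool:
        """True if another read request may be handed to the core."""
        return self.in_queue < self.max_in_queue

    def on_request_issued(self) -> None:
        """Call when a read request is accepted on the RQ interface."""
        if not self.can_issue():
            raise RuntimeError("read request issued while limiter is full")
        self.in_queue += 1

    def on_seq_num_returned(self) -> None:
        """Call when the core reports a request's sequence number back."""
        if self.in_queue == 0:
            raise RuntimeError("sequence number returned with empty queue")
        self.in_queue -= 1


limiter = NonPostedTxLimiter(max_in_queue=24)
issued = 0
while limiter.can_issue():
    limiter.on_request_issued()
    issued += 1
print(issued)  # 24: the engine stalls here until a sequence number returns
limiter.on_seq_num_returned()
print(limiter.can_issue())  # True
```

Because the cap depends only on requests issued and sequence numbers returned, it does not rely on pcie_tfc_nph_av at all.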
It would still be nice to hear from someone at Xilinx about the strange behavior of pcie_tfc_nph_av, as the signal does not seem to be correctly indicating how many requests can be sent. Even with this scheme implemented and the tready signal never falling, the ILA still reports pcie_tfc_nph_av regularly stuck at 0.