chili.chips
Participant
1,465 Views
Registered: 09-01-2014

PCIE RQ TREADY dynamics?!


This is for 'pcie3u' core in EndPoint config in the X0Y4 position with Quad231+Quad232 (Gen3x8), on xcku085-flva1517-2-i.

We are observing that the PCIE HM (hard macro) accepts more commands on its RQ interface when the PF0_DEV_CAP_MAX_PAYLOAD_SIZE parameter is set lower:

   - For MAX_PAYLOAD_SIZE=128 bytes it starts pushing back after 15 enqueued commands

   - For MAX_PAYLOAD_SIZE=256 bytes it starts pushing back after 12 enqueued commands

   - For MAX_PAYLOAD_SIZE=512 bytes it starts pushing back after 10 enqueued commands

   - For MAX_PAYLOAD_SIZE=1024 bytes it starts pushing back after 8 enqueued commands

That has direct implications for MEM_RD throughput, which is highest for MAX_PAYLOAD_SIZE=128. But such a minimal payload size lowers MEM_WR throughput.

Any suggestions on how to make the PCIE core accept more requests for a larger MAX_PAYLOAD_SIZE (512 is our ideal target), or insights into the interactions between the motherboard and the PCIE core that are at play in this case?

 

[Attachment: PCIE-6.PNG]

1 Solution

Accepted Solutions
aforencich
Voyager
1,349 Views
Registered: 08-14-2013
The PCIe core has an internal completion buffer. It reserves space in this buffer to store the completions associated with outgoing read requests. If there is not enough space in the buffer, then the core does not release the read request onto the bus. The size of this buffer is fixed, so the larger the requests you issue, the fewer can be active on the bus at any point in time. I'm not privy to the details of exactly how this buffer is managed, but that's the gist of it. It has nothing to do with the host in this case.

I would also recommend monitoring the flow control credit counts in the ILA, specifically the non-posted header credits available and the posted data credits available counters. That should provide some more insight into where the stall is coming from - PCIe core vs. upstream switch - and when the PCIe core is actually sending packets over the link.


9 Replies

chili.chips
Participant
1,343 Views
Registered: 09-01-2014

Thank you.

  - That stall must have been coming from the upstream switch, as we've managed to "fix" it by upgrading the mobo.

 

aforencich
Voyager
1,316 Views
Registered: 08-14-2013
Hah, I see. Well, it might be a good idea to pull up those credit counts on the ILA and take a peek anyway. I have noticed some odd behavior here and there with flow control credits not being released right away by the upstream device. In my case, I am building a NIC, so this just ended up causing a bit of a traffic jam and a few dropped packets, but it's definitely something worth looking at.

Also, I recommend checking the credit counts before issuing requests to the core, the overall goal being to prevent the core from ever deasserting tready. Otherwise a bunch of read requests can get blocked behind a write request if there aren't enough flow control credits available to send the write request, or, similarly, a bunch of writes can get blocked by read requests stacking up in the core.
chili.chips
Participant
1,303 Views
Registered: 09-01-2014

There is no mixing of read and write requests in this particular test -- it is exclusively MEM_RD DMA requests from the Xilinx EP to the x86_64 mobo RC.

  - What would be the advantage of driving TREADY=0 from my logic, based on credits returned by the PCIE core, as opposed to letting the PCIE core drive it directly?

Is the core's 'recovery' from TREADY=0 slow, such that the designer should strive to avoid driving it into that state?

 

aforencich
Voyager
1,298 Views
Registered: 08-14-2013
If you're only sending read requests and no write requests, then there is no problem. The issue is head-of-line blocking: there is only one RQ interface to handle both reads and writes, while the PCIe flow control for reads and writes is separate. It's a choke point, a bottleneck.

You might have two different components generating read and write requests; these need to be merged, sent to the PCIe core, and sent over the PCIe bus. Now, the core can only send write requests if it has sufficient flow control credits, and it can only send read requests if it has enough flow control credits and enough buffer space for the completions. As a result, it is possible to get into a situation where the core can send one type of request but not the other. If you're in that situation and you try to send the blocked type of request, the core will deassert tready, and then you can't send any requests at all until flow control credits are released over the PCIe link, even if you have the other type of request to send and there are sufficient flow control credits to send it. So what you should do instead is not issue requests when there are insufficient credits, preventing the other type of request from being blocked unnecessarily. If you're only issuing one type of request, then it doesn't matter.
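A rough sketch of that kind of credit gating (cfg_fc_* and s_axis_rq_tready are ports of the UltraScale pcie3 core; the thresholds and the rd_req_*/wr_req_* arbiter-side signals are assumptions for illustration):

```systemverilog
// Gate read/write request issue on transmit flow-control credits so the
// RQ interface is never backpressured by a request type that cannot be sent.
localparam NPH_MIN = 8;   // min non-posted header credits before issuing a read
localparam PH_MIN  = 8;   // min posted header credits before issuing a write
localparam PD_MIN  = 64;  // min posted data credits (1 credit = 16 bytes)

assign cfg_fc_sel = 3'b100;  // select "transmit credits available"

wire rd_credits_ok = (cfg_fc_nph > NPH_MIN);
wire wr_credits_ok = (cfg_fc_ph  > PH_MIN) && (cfg_fc_pd > PD_MIN);

// Only present a request to the RQ interface when it can actually be sent,
// so a credit-starved write can never block queued reads (and vice versa).
assign rd_req_grant = rd_req_valid && rd_credits_ok && s_axis_rq_tready;
assign wr_req_grant = wr_req_valid && wr_credits_ok && s_axis_rq_tready;
```

The thresholds would need tuning per design; the point is that the arbiter, not the core's tready, decides which request type to hold back.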
chili.chips
Participant
1,292 Views
Registered: 09-01-2014

Got it, thanks -- it's good to know this subtlety.

So, with my completion buffers set in the Xilinx IP Catalog to 'Extreme Performance', i.e. the maximum the PCIE core offers, is there anything I can do within the PCIE core config to also maximize the flow control credits, primarily for MEM_RD RQ?

Or, is that a system issue? (which my empirical success by swapping in the newer mobo seems to indicate)

 

 

aforencich
Voyager
1,288 Views
Registered: ‎08-14-2013
No. The credits are determined by the link partner. I suggest setting cfg_fc_sel = 3'b100, and monitoring cfg_fc_nph, cfg_fc_ph, and cfg_fc_pd on the ILA to see how many credits are available.
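For reference, a minimal hookup might look like the following (the ILA instance name is an assumption -- generate one from the IP catalog; the cfg_fc_* ports and the 3'b100 encoding are from the UltraScale pcie3 core):

```systemverilog
// Select "transmit credits available" and tap the credit counters with an ILA.
assign cfg_fc_sel = 3'b100;  // 100 = credits currently available for transmit

// Hypothetical ILA instance; create it with matching probe widths
// (nph/ph are 8 bits, pd is 12 bits on the pcie3 core).
ila_fc ila_fc_inst (
    .clk    (user_clk),
    .probe0 (cfg_fc_nph),  // non-posted header credits (read requests)
    .probe1 (cfg_fc_ph),   // posted header credits (write requests)
    .probe2 (cfg_fc_pd)    // posted data credits (writes, 1 credit = 16 bytes)
);
```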
mdkramer37
Visitor
698 Views
Registered: 10-18-2019

Do you know if a lack of credits on the RQ interface could cause a lockup on CC? I am running into an issue with the PCIe core where the root is throwing a completion timeout error, presumably from a non-posted request sent to the FPGA. During this time there was lots of data being mastered by the FPGA on the RQ interface, and I might not have been respecting the flow control credits correctly. The documentation mentions head-of-line blocking, where a lack of non-posted credits can block posted requests behind them. But does the same apply to completions? Can a non-posted request on RQ that is blocked by credits prevent a completion from being sent on the CC interface?

aforencich
Voyager
463 Views
Registered: 08-14-2013

IIRC completions cannot be blocked by anything but other completions. There are six types of credits, with separate header and data credits for posted requests, non-posted requests, and completions. So as long as you have completion credits, you can send completions. The main head-of-line blocking that you need to worry about is on the RQ interface only, and this shouldn't cause any issues other than blocking that interface. I suppose that is one of the advantages of the RQ+RC/CQ+CC split in the UltraScale/UltraScale+ core: the core can exert backpressure on RQ without blocking CC. If you were to do this muxing outside of the core (as in older Xilinx parts, and all Intel parts I am familiar with, which have a single TX interface and a single RX interface), then this could be an issue.

Edit: I almost forgot about the PCIe ordering model. Completions cannot pass posted requests. However, I'm not sure whether this applies in the Xilinx core before the posted request is actually sent on the wire. I'm leaning towards: you may need to enforce this yourself in some way by looking at TX sequence numbers, if it's important.

At any rate, what you can do is set cfg_fc_sel = 3'b100 and then monitor cfg_fc_nph, cfg_fc_ph, and cfg_fc_pd. If cfg_fc_nph falls below some small number - say, 4 or 8 - then don't issue any read requests. Similarly, if cfg_fc_ph or cfg_fc_pd falls too low, don't send any write requests. One unit of cfg_fc_pd is 4 DWORDs (16 aligned bytes). If you do that right, then RQ should never block.

Ah, that may not be sufficient, actually - the core does some management of read requests internally, and if there isn't sufficient completion buffer space, then read requests will not be released from the core; and if they aren't released, then no non-posted credits are consumed. Ostensibly that's what pcie_tfc_nph_av/pcie_tfc_npd_av are for, but these don't work, so the only option left is transmit sequence numbers.

So what you need to do is mark read requests and limit the number of outstanding read requests. What I do is set the MSB of the transmit sequence number, and then use a counter to keep track of in-flight operations: incremented when issuing a request, and decremented based on pcie_rq_seq_num/pcie_rq_seq_num_vld. I have found that a limit of 16 in-flight operations is reasonable. Once there are 16 requests sitting in the hard IP core, no new requests are issued until the core releases some of them towards the host.
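The counter scheme above can be sketched roughly as follows (pcie_rq_seq_num/pcie_rq_seq_num_vld are core ports; the sequence number width varies between core versions, so the bit index, READ_LIMIT, and the rd_req_fire handshake are assumptions):

```systemverilog
// Track in-flight read requests via the RQ transmit sequence number.
// Reads are tagged by setting the MSB of the sequence number on issue;
// the core echoes the number back on pcie_rq_seq_num when the request
// is released towards the host.
localparam READ_LIMIT = 16;

logic [4:0] reads_in_flight;

wire issue_read  = rd_req_fire;                 // a tagged read accepted on RQ
wire retire_read = pcie_rq_seq_num_vld &&
                   pcie_rq_seq_num[3];          // MSB tag set: it was a read

always_ff @(posedge user_clk) begin
    if (user_reset)
        reads_in_flight <= '0;
    else
        reads_in_flight <= reads_in_flight + issue_read - retire_read;
end

// Hold back new reads once the core is holding READ_LIMIT of them.
wire rd_window_ok = (reads_in_flight < READ_LIMIT);
```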