06-26-2017 09:04 AM
I am using the DMA Subsystem (v2.0) with Vivado 2016.2 and PCIe 3.0 x8, a 256-bit AXI-Stream at 250 MHz, and a host PC running CentOS 6.9, kernel 2.6.32. My design uses the descriptor bypass interface to send data straight to the host PC's DDR4 memory. I reserved contiguous memory for DMA using the mem="XX" kernel boot parameter at a known physical address, so I can set that address as the starting destination address on the FPGA (VCU108 eval board). The data is continuous ADC data at 2 GB/s. I need about 200 GB of data per transfer, which is why I use the mem boot parameter.
What I am seeing is that for the first 16 us, the transmit goes fine. But after that, s_axis_rq_tready starts going low for long periods (~300 ns) before going high again. Throughput drops way down, to less than 100 MB/s if I calculated it correctly.
Now I have searched the forums and found the following thread: https://forums.xilinx.com/t5/PCI-Express/Throughput-issues-with-DMA-and-tx-buf-av-in-7-Series-Integrated/m-p/383997/highlight/true#M4854
@markzak stated that this is likely the link partner's poor response time causing transmission stalls. Basically, I run out of transmit credits? The only solution he suggested was using a huge buffer to ride through the stall.
Is this the only way to work around this issue? The DMA Subsystem does not expose the flow control interface, so I cannot check whether that is actually the problem.
06-27-2017 02:55 PM
Over time, I have realized that the motherboard itself is mostly to blame for this type of behavior. Some chipsets and manufacturers show these issues, and worse, yet placing the exact same card into another system gets you flawless performance. It's very strange, and it has forced us to try numerous motherboard/processor combinations until we found one that worked consistently. ASUS seems to be one of the better ones.
Also, I would highly recommend that you populate all DDR memory slots on the motherboard evenly. Memory controller throughput suffers greatly if not all ranks and channels are populated.
06-28-2017 06:23 AM
I'm using the ASUS X99-E-10G WS motherboard. It has 2 PEX 8747 PCIe switches. I make sure to connect the FPGA to a slot on a different switch from the video card, but it could still be part of the problem. Which motherboard ended up working best for you?
As for filling the DDR memory slots equally, what do you mean by that? I have all 8 slots in the motherboard filled, 32 GB each. Do you mean I should split my DMA target address so that it fills different blocks of RAM?
06-30-2017 12:01 AM
Could you tell us which motherboard you were using before that you think affected the throughput so severely?
Did changing to a host machine with a different motherboard alone solve your low-throughput issue?
07-10-2017 10:50 AM
The only server motherboard that I found that works at extremely high PCIe bandwidth (>4 GB/s) was the ASUS Z9PA-U8. Unfortunately it's DDR3, as this was obviously a few years back.
Last year, we attempted to build up a new DDR4 system using a SuperMicro MBD-X10SRA-O ATX server motherboard. That FAILED. I then attempted to do the same with the next generation of the exact same ASUS motherboard, the Z10PA-U8. Surprisingly, that also FAILED.
So I went back to Amazon, ordered another ASUS Z9PA-U8, built up another DDR3 system, and that worked flawlessly. All three motherboards were tried with exactly the same Xilinx PCIe card, same FPGA firmware, etc. There's got to be a deeper root cause somewhere, but I don't have the $200K to purchase a PCIe bus analyzer.
As far as memory, I was specifically referring to physically populating all DDR slots on the MB. It sounds like you are doing that correctly.
03-07-2018 06:39 AM - edited 03-07-2018 06:41 AM
I have a quite similar setup: DMA Subsystem for PCIe (v4.0), 256-bit AXI-Stream with the descriptor bypass interface. Unlike you, I need to send data to another FPGA board (using the Bridge Subsystem for PCIe and DDR4 memory for data storage).
Since I'm a newbie in the PCI Express world, I'm having difficulty getting started with the design. The Xilinx design example (dma_stream0 test, page 97 of PG195) gives me an error (---***ERROR*** C2H Transfer Data MISMATCH ---) after the simulation finishes, which only adds to my confusion.
Would you be so kind as to share your project design with me, or at least give me some hints on how to start?
Thanks in advance for your time and effort. Really appreciate it.