04-27-2018 12:47 AM
We have built a DAQ system around the PCIe DMA Subsystem (XDMA) IP and the XDMA drivers. The design basically connects two dual-port BRAMs to the AXI-MM interface via the AXI BRAM Controller core. The system has been implemented on UltraScale, UltraScale+, and 7-series devices; it works perfectly fine on UltraScale and UltraScale+, while crashing frequently on 7-series (Kintex-7 325T, Artix-7 35T) devices. Upon looking into the driver log, I noticed that the c2h and h2c engines actually crash when DMA transfer sizes are changed dynamically while looping over multiple transfers. For example, if I do 10M transfers of the same size, it works perfectly fine, but if I change the transfer size, it fails within just a couple of transfers. Single-word (1 DW) DMA transfers also fail.
Upon looking closer at the driver log, it seems that when the engine stops responding, a DMA transfer completion interrupt has been received from the wrong channel. I am attaching an annotated driver log.
I have tried the following:
1) Looked at the driver log in debug mode (dmesg), as well as the data log, to find whether failures follow any particular pattern. So far I have not found one.
2) Applied the available tactical patches (I am using 2017.3); the behavior is similar.
3) The user-level firmware (everything except the Xilinx IPs) has been tested and verified independently.
Any suggestions would be of great help.
04-30-2018 12:06 PM
The 7-series XDMA core has had several fixes added in the last few builds. Please try the 2018.1 IP with Vivado 2018.1 and see if that resolves your issues.
05-02-2018 03:05 AM
Thank you for your reply. I had thought that the latest XDMA patch from Xilinx would be on par with 2018.1. In any case, I will try that and get back to you.
05-23-2018 05:15 AM
I have finally tested the PCIe firmware built with the newer Vivado 2018.1. As you correctly pointed out, many bugs have been fixed. Many thanks for the suggestion. However, some bugs remain, and a couple of new issues have been observed. I have been able to circumvent them with software fixes that ensure the bug-triggering conditions never occur, but it would be much better to have a firmware fix. My observations are as follows:
1) 1-DW DMA (c2h) transfers frequently fail. In this case (2018.1) the DMA engines do not crash but return wrong data.
2) DMA (c2h) transfers with a length of 129 DWs, or a multiple of it, also fail, returning garbage data. Generally only one DW is garbage. The general observation is that the issue persists for lengths such as 129 that break down into n full TLPs plus 1 DW, probably because of the 1-DW transfer issue.
3) Strangely, DMA (c2h) transfers also return garbage data when the transfer size is n TLPs minus 3 DWs, e.g. transfer sizes of 509, 1021, etc. Put another way, if (transfer_size - 1) % 32 == 28, erroneous data is returned. It is not obvious to me why this happens.
I have been able to mitigate these issues in software by padding transfers by a few DWs whenever one of the above transfer lengths occurs. Kindly let me know if this can be fixed in the firmware.
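For reference, my software mitigation boils down to a length check like the one below. This is a hedged sketch of the idea only: the helper name pad_c2h_len_dw and the pad-by-one-and-recheck strategy are mine for illustration, not the actual production code, and the three conditions are simply the failing cases listed above (1 DW, multiples of 129 DWs, and lengths where (len - 1) % 32 == 28).

```c
#include <stddef.h>

/* Hypothetical helper: given a requested c2h transfer length in DWs,
   return a slightly padded length that avoids the sizes observed to
   return garbage on 7-series with XDMA 2018.1:
   - single-DW transfers,
   - lengths that are multiples of 129 DWs (n TLPs + 1 DW),
   - lengths where (len - 1) % 32 == 28 (e.g. 509, 1021). */
static size_t pad_c2h_len_dw(size_t len_dw)
{
    if (len_dw == 0)
        return 0;               /* nothing to transfer */

    size_t padded = len_dw;
    for (;;) {
        if (padded == 1 ||              /* 1-DW transfer */
            padded % 129 == 0 ||        /* 129-DW multiples */
            (padded - 1) % 32 == 28)    /* "n TLPs - 3 DW" case */
            padded += 1;                /* pad by one DW, re-check */
        else
            return padded;
    }
}
```

The caller then requests pad_c2h_len_dw(len) DWs from the engine and discards the extra DWs at the end of the buffer.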
Thanks and Regards,
05-23-2018 10:02 AM
I would like to do some in-house testing on this as you have described. Would it be possible for you to provide the XCI file for your core, so I can make sure I run it the same way?
05-24-2018 11:46 PM