11-15-2019 10:39 AM
We are encountering a strange problem and are having a very hard time working out why two seemingly unrelated things are interacting, as I will explain.
We have an application that uses both RPUs, the APU and the PL together. The APU acts pretty much as a gateway for Ethernet traffic and high-level decisions; it retrieves images from a Basler USB3 camera using their Pylon 5 library and sends them over Ethernet. The RPUs interface with custom PL AXI peripherals.
When our system is fully running, we get a lot of these messages in the Linux console:
[ 858.577889] xhci-hcd xhci-hcd.0.auto: WARN Cannot submit Set TR Deq Ptr
[ 858.584495] xhci-hcd xhci-hcd.0.auto: A Set TR Deq Ptr command is pending.
and, less often, this one interleaved with them:
[ 858.594917] xhci-hcd xhci-hcd.0.auto: bad transfer trb length 16384 in event trb
They are related to the USB3 controller in Linux, and more specifically to the xhci driver. At the same time, our code running on the APU reports image grab failures.
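To compare test scenarios, it can help to count how often each of these messages appears in the console output. The following is a minimal sketch of such a tally, not part of the kernel or of Pylon. The names `classify_xhci_line`, `tally_xhci_line` and the `xhci_msg` enum are hypothetical, and the matching relies only on the message substrings quoted above, assuming one message per line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical categories for the three xhci messages quoted above. */
enum xhci_msg {
    XHCI_OTHER = 0,        /* any line that is not one of the three messages */
    XHCI_CANNOT_SUBMIT,    /* "WARN Cannot submit Set TR Deq Ptr" */
    XHCI_CMD_PENDING,      /* "A Set TR Deq Ptr command is pending." */
    XHCI_BAD_TRB_LEN,      /* "bad transfer trb length ... in event trb" */
    XHCI_MSG_KINDS         /* number of categories, used to size the tally */
};

/* Classify one console line by substring match (first match wins). */
enum xhci_msg classify_xhci_line(const char *line)
{
    if (line == NULL || strstr(line, "xhci-hcd") == NULL) {
        return XHCI_OTHER;
    }
    if (strstr(line, "Cannot submit Set TR Deq Ptr") != NULL) {
        return XHCI_CANNOT_SUBMIT;
    }
    if (strstr(line, "A Set TR Deq Ptr command is pending") != NULL) {
        return XHCI_CMD_PENDING;
    }
    if (strstr(line, "bad transfer trb length") != NULL) {
        return XHCI_BAD_TRB_LEN;
    }
    return XHCI_OTHER;
}

/* Increment the counter that corresponds to the given line. */
void tally_xhci_line(unsigned counts[XHCI_MSG_KINDS], const char *line)
{
    assert(counts != NULL);
    counts[classify_xhci_line(line)]++;
}
```

Feeding each console line to `tally_xhci_line` during a run gives per-message counts that can be compared between scenarios, for example with and without the RPU code running.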
It is hard to believe the linux xhci driver code is at fault given how mature it is.
Here are several things we noted while trying different scenarios :
- With no RPUs running, our APU code running, and the PL programmed with our bitstream, our client software (running on a different computer connected over Ethernet) receives all its images from the APU at the requested frame rate. No warnings, no grab failures. Everything works fine together.
- The RPUs have their Memory Protection Units enabled and carefully configured to allow access only to the memory holding their code and data and to the peripherals they need. All memory regions belonging to the USB controllers are explicitly blocked. We can therefore pretty much rule out rogue RPU code accessing unwanted regions and interfering with the APU. This has been tested several times.
- We established a direct correlation between the RPUs running our code and the appearance of the messages and grab failures.
- When running simple 'hello world' type code on both RPUs, the image acquisition works fine. So the problem is not caused merely by the RPUs being up and running.
- Since USB0 with its DMA is in the LPD domain, as are the RPUs, and they all go through the LPD Main Switch, we tried saturating the memory bandwidth in our test code to see whether the USB controller chokes in some way. No matter how much memory congestion we caused (with I and D caches disabled, looping indefinitely across megabytes of DDR doing memcpy calls), the image acquisition running alongside kept working perfectly.
- We stripped down our RPU code incrementally to find the point at which image acquisition starts working again. We found that the buggy behavior appears as soon as the RPU code accesses the PL, through any master interface (HPM0_LPD, HPM0_FPD or HPM1_FPD; we tested them all).
- The piece of code causing the bug (while idle, waiting for events) was simple: once per second, it fills some memory-mapped BRAM in the FPGA with status values. Nothing fancier.
- It turned out that with this 'feature' turned off, we had no problem whatsoever. BUT as soon as we started to fully use the system, the bug reappeared (because our peripherals are accessed through PL AXI registers).
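For illustration only, the two RPU access patterns described in the list above might look roughly like the sketch below. It is not the actual project code: the function names, `STATUS_WORDS`, and the buffer arguments are hypothetical. On the real target the BRAM pointer would be the AXI address of the block as assigned in the design, and the once-per-second timing would come from a timer, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: number of 32-bit status words mirrored into the PL BRAM. */
#define STATUS_WORDS 8u

/*
 * Pattern that did NOT disturb the USB traffic: repeated large copies in DDR.
 * dst and src are assumed to be non-overlapping buffers of len bytes.
 */
void stress_ddr(uint8_t *dst, const uint8_t *src, size_t len, unsigned passes)
{
    assert(dst != NULL && src != NULL);
    for (unsigned i = 0; i < passes; i++) {
        memcpy(dst, src, len);
    }
}

/*
 * Pattern that DID coincide with the xhci warnings: word-wise stores to a
 * memory-mapped BRAM in the PL. The volatile qualifier keeps the compiler
 * from merging or dropping the stores; whether each store becomes its own
 * AXI transaction also depends on the MPU attributes of the region, which
 * is an assumption here.
 */
void write_status_to_bram(volatile uint32_t *bram,
                          const uint32_t *status, size_t words)
{
    assert(bram != NULL && status != NULL);
    for (size_t i = 0; i < words; i++) {
        bram[i] = status[i];
    }
}
```

Calling only `write_status_to_bram` once per second from an otherwise idle RPU loop, with nothing else touching the PL, would be the smallest reproduction of the behavior described above.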
We cannot think of any reason why accessing the PL from the RPUs would have any effect on the USB subsystem and the APU.
Thank you for any suggestions!
10-29-2020 03:31 AM
Hi Simon, we get the same error signature, also with a Basler USB3 camera (Basler daA1600-60uc). Did you find the root cause? We would be very interested in anything you found!