07-17-2019 04:53 PM - edited 07-17-2019 04:53 PM
We are having an issue in which the MM2S DMA transfer in SG mode on the AXI DMA IP core hangs. We are running Linux on a Zynq, and what we see is that the call to the transmit operation randomly hangs after a large number of successful operations. When we look at the DMA registers they indicate that the transfer is complete and no error occurred, but it seems that no completion interrupt is generated. Our IRQThreshold is set to 1 and we are sending 1024 bytes of data at a time in a single SG buffer. (All of our transfers use a single SG buffer, so it is a degenerate case.)
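For anyone wanting to repeat the register check described above, here is a minimal sketch that decodes a raw MM2S_DMASR value (offset 0x04) into named flags. The bit positions are my reading of the AXI DMA product guide and should be checked against the IP version in use; the function names and the sample value are made up for illustration.

```python
# Bit positions of MM2S_DMASR as assumed from the AXI DMA product guide.
# Verify them against the documentation for your IP version.
MM2S_DMASR_BITS = {
    "Halted": 0,
    "Idle": 1,
    "SGIncld": 3,
    "DMAIntErr": 4,
    "DMASlvErr": 5,
    "DMADecErr": 6,
    "SGIntErr": 8,
    "SGSlvErr": 9,
    "SGDecErr": 10,
    "IOC_Irq": 12,
    "Dly_Irq": 13,
    "Err_Irq": 14,
}

ERROR_NAMES = ("DMAIntErr", "DMASlvErr", "DMADecErr",
               "SGIntErr", "SGSlvErr", "SGDecErr")


def decode_mm2s_dmasr(value):
    """Split a 32-bit status word into named flags and counter fields."""
    flags = {name: bool((value >> bit) & 1)
             for name, bit in MM2S_DMASR_BITS.items()}
    flags["IRQThresholdSts"] = (value >> 16) & 0xFF
    flags["IRQDelaySts"] = (value >> 24) & 0xFF
    return flags


def error_flags(value):
    """Return the names of any error bits that are set."""
    flags = decode_mm2s_dmasr(value)
    return [name for name in ERROR_NAMES if flags[name]]


if __name__ == "__main__":
    sample = 0x0001100A  # hypothetical value read while the call is hung
    for name, state in decode_mm2s_dmasr(sample).items():
        print(f"{name:16s} {state}")
    print("errors:", error_flags(sample))
```

If IOC_Irq reads as set while the transmit call is hung, that would suggest the completion was latched in the IP but never acknowledged by software. If it reads as clear, something acknowledged it. Either way this only narrows the search; it does not prove where the fault is.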
We started down the path of looking at Tlast and the interrupt signal. We assumed (perhaps incorrectly) that for every Tlast there should be a matching interrupt and at first it looked like when the system hung there was a Tlast but no interrupt following. Then we did some more investigation and we thought we saw that several Tlasts may precede an interrupt and we maybe even saw in some cases the interrupt precedes Tlast.
So now we are quite confused. Do interrupts and Tlasts map one-to-one? Is it the case that only Tlast can trigger a completion interrupt (or more simply should an interrupt ALWAYS follow Tlast - or I suppose Tlasts)?
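One possible explanation for seeing several Tlasts before one interrupt can be shown with a toy model. It assumes (my assumption, not something confirmed in this thread) that the interrupt line is a level driven by a sticky, write-1-to-clear completion bit. Under that assumption, a second completion that arrives before software acknowledges the first produces no new rising edge. The class and event names are invented for the sketch.

```python
class StickyIrqModel:
    """Toy model of a sticky completion bit driving a level interrupt line.

    This is NOT the AXI DMA core, only an illustration of how a sticky
    status bit behaves when completions arrive faster than acknowledges.
    """

    def __init__(self):
        self.ioc_irq = False    # sticky completion bit
        self.rising_edges = 0   # 0 -> 1 transitions seen on the line

    def completion(self):
        # A completion sets the bit; the line only rises if it was low.
        if not self.ioc_irq:
            self.rising_edges += 1
        self.ioc_irq = True

    def acknowledge(self):
        # Software clears the bit, which lowers the line.
        self.ioc_irq = False


def count_edges(events):
    """Replay 'done' and 'ack' events and count interrupt rising edges."""
    model = StickyIrqModel()
    for event in events:
        if event == "done":
            model.completion()
        elif event == "ack":
            model.acknowledge()
        else:
            raise ValueError(f"unknown event: {event!r}")
    return model.rising_edges


if __name__ == "__main__":
    print(count_edges(["done", "ack", "done", "ack"]))  # one edge per packet
    print(count_edges(["done", "done", "ack"]))         # two packets, one edge
```

If the real hardware behaves like this, a one-to-one match between Tlast and interrupt edges on the ILA would only be expected when every interrupt is acknowledged before the next packet completes.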
Thanks for any clarifications
07-24-2019 02:42 AM
I ran some tests on Tlast vs. the idle bit in AXI DMA SG mode a while ago. It turns out that the idle bit of the MM2S status register asserts once the memory-mapped interface is done, even if the stream interface is still running.
So back to your case, I think interrupt is similar to idle bit. That's why you can see the interrupt precedes Tlast sometimes.
07-29-2019 08:00 AM
That is a little helpful. Basically what we are seeing is that the transmit side (MM2S) hangs after running "for a while". It looks like the transaction completes (judged by examining the registers) but the interrupt is not delivered. Or maybe there is some race condition and the interrupt is missed or blocked because the software driver is hanging on some other semaphore or spin lock. It would be helpful to understand, or get a pointer to some document that describes, how we need to manage the AXI signals to make the IP work. The AXI DMA datasheet indicates that we need to manage the signals but doesn't state how or which signals.
07-29-2019 08:16 AM
My assumption is there must be one and only one interrupt for Tlast. Otherwise, they signal different things.
You mention semaphores, so I assume you have an RTOS and you play around with interrupts, disabling, enabling and clearing them. Well, that's a good environment for getting lost. I'd suggest you review your interrupt handling: things like having a slow process running with interrupts disabled, or clearing all (or more than one) of them somewhere.
Neither an extra interrupt nor a missing one should hang your code, I think; it would produce some anomalous behavior and the code would keep running. Software usually hangs because of bad jumps caused by corrupted pointers. Stack and heap overflows are often the last thing one thinks about when running into problems, so check that!
07-29-2019 10:47 AM
We are running Linux. What happens on the software side is that the transfer call hangs. When we examine the IP registers they say the transfer is complete. We believe that the transfer call is waiting for the completion interrupt or being held in some spin lock owing to some race condition between the AXI signals and the driver. The code does not crash, the stack and the heap are not corrupt, and there is no stack trace or kernel panic. It just hangs. The AXI signal management is left as an exercise to the end user (according to the datasheet), so I am trying to get some clarity on the exact nature of this management. Which signals should be managed: Tidle? Tlast? And how should they be managed?
07-30-2019 01:19 AM
Man... interrupts exist, among other reasons, so that you don't have to wait for events. What do you mean by 'waiting for the completion interrupt'? Checking the flag? That's not how you (best) use interrupts. An interrupt jumps to its ISR and there you do what has to be done. Is there an ISR, and are you also checking the interrupt flag? In that case bear in mind that the flag is cleared after the ISR runs, so if the ISR runs before you check, you miss it. Could that be the case?
07-30-2019 07:27 AM
We are using the Linux-supplied driver. We did not write anything new. I know the driver gets an interrupt on transmit completion, I know there is some code in the kernel that traps it, and I know there are some spin locks that synchronize activity on the DMA. I know our call to transmit hangs. So I am speculating that it is hung on a spin lock and maybe that spin lock is waiting for an interrupt. This is all kernel magic that I am loath to wade into, especially since my feeling is that the problem is on the FPGA implementation side and specifically related to AXI signal management. That is the motivation for my question.
07-31-2019 12:59 AM
It would be good if you could share specific details. What Linux (source and version), what platform, what application, etc. I used the DMA example code from Xilinx without a problem, but it was a bare metal app.
07-31-2019 04:56 PM
08-01-2019 12:40 AM
My suspicion is always on the software. As a general rule. Software is terribly more complex than hardware so chances of a bug there are greater. Simple statistics.
Philosophy apart, what you want is to sort that out. Ways that come to my mind:
- Try the DMA with bare-metal software. Does it work? That could imply something in the Linux OS is messing up.
- Use the ILA to watch the interrupt and maybe other data to check if it's missing
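As a starting point for the ILA suggestion, here is a minimal sketch for post-processing a capture. It assumes the capture has already been exported into per-sample lists of 0/1 values (how you export it is up to you), and the function names are made up. The one protocol fact it relies on is that an AXI4-Stream beat is only transferred when TVALID and TREADY are both high, so a Tlast only counts as the end of a packet on such a beat.

```python
def rising_edges(samples):
    """Return the sample indices where a signal goes from 0 to 1."""
    return [i for i in range(1, len(samples))
            if samples[i] and not samples[i - 1]]


def tlast_beats(tvalid, tready, tlast):
    """Return indices where a TLAST beat is actually transferred.

    A beat is transferred only when TVALID and TREADY are both high.
    """
    return [i for i, (v, r, l) in enumerate(zip(tvalid, tready, tlast))
            if v and r and l]


def packets_per_interrupt(packet_ends, irq_edges):
    """Count the packet ends that fall before or at each interrupt edge,
    and after the previous interrupt edge."""
    counts = []
    prev = -1
    for edge in irq_edges:
        counts.append(sum(1 for p in packet_ends if prev < p <= edge))
        prev = edge
    return counts


if __name__ == "__main__":
    # Hypothetical capture, one value per ILA sample clock.
    tvalid = [0, 1, 1, 1, 0, 1, 1, 0, 0]
    tready = [1, 1, 0, 1, 1, 1, 1, 1, 1]
    tlast = [0, 0, 1, 1, 0, 0, 1, 0, 0]
    irq = [0, 0, 0, 0, 1, 1, 0, 0, 1]

    ends = tlast_beats(tvalid, tready, tlast)
    edges = rising_edges(irq)
    print("packet ends at samples:", ends)
    print("interrupt edges at samples:", edges)
    print("packets per interrupt:", packets_per_interrupt(ends, edges))
```

This only tallies events; it does not say what the correct relationship is. If the interrupt can legitimately precede Tlast, as suggested earlier in the thread, a count of 0 for one interrupt followed by a higher count for the next is not necessarily an error.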
08-01-2019 03:09 PM
The problem is that we don't know what we are looking for. We have used ILAs, and sometimes we see Tlast and the interrupt match up and sometimes we don't. But there are also Tready and Tidle. They play a role but we don't know what role. And we don't know the expected sequencing. The only thing we know for sure is that MM2S hangs; we can see all the signals but have no idea what constitutes correct signalling.