05-13-2020 12:04 PM
Using Vivado 2018.2 with XDMA 4.1(Rev 71634). I'm doing DMA transfers from a DIMM to the PCIe (Card to Host). All looks good for a bunch of transfers (one descriptor at a time) until a cycle where the DIMM takes a bit longer to provide the data on the AXI bus. Typically it all works when ARVALID falling to RVALID rising is ~148 ns. For the transfer that stops the process, this time is ~452 ns. Is there a timeout happening? Also, I've tried doing the dma_bridge_reset, making sure to wait until after bus activity is complete, and the DMA remains locked up.
05-13-2020 05:53 PM - edited 05-13-2020 05:53 PM
That really sounds like the slave is somehow misbehaving. Resetting a master is problematic, so I wouldn't trust doing so in general. Perhaps you might want to try including a firewall between the slave and the interconnect? A good firewall should give you a signal you can then trigger your logic analyzer on in order to determine which IP is at fault--the slave or the master.
05-14-2020 04:17 AM
I don't believe the slave is "misbehaving." The handshakes are all there. It's just that sometimes the slave takes longer to respond, and that's when the XDMA seems to lock up. I want to know if the XDMA is timing out on the slave and locking up. The slave is only taking 452 ns in the "slow" case. Is there a way to adjust the timeout of the XDMA to be sure that's not what's happening?
05-14-2020 05:52 AM
"Timing out" would be a violation of the AXI protocol. AXI has no way of recovering from a slave that fails to respond within a given time window, so any good master shouldn't be depending upon such. I'm assuming here that the XDMA is a "good master" that follows protocol--although I've been wrong before ... At any rate, that's why I'm looking for and asking about any other possible causes.
I know of only two cores that implement AXI timeouts. One is Xilinx's firewall core. If a slave fails to respond properly within a given time window, that core will take the slave off line and return bus errors for any pending or future requests. The other is my own bus fault isolator. As far as I can tell, that core is similar if not identical to Xilinx's firewall, with one exceptional feature: on a protocol error, such as a timeout, it can be configured to reset the downstream slave and so return it to operation. Both cores should be able to provide a signal that can be used by some form of logic scope to see what was going on when the bug took place. Unfortunately, using such a firewall might also make the bug disappear ....
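To illustrate the timeout behavior both cores share, here's a toy behavioral model in Python. This is not the RTL of either core, and the class and method names are invented for illustration; it only sketches the idea of "start a counter on a request, and if it expires before the response arrives, fault the slave and error out future requests":

```python
class BusFaultWatchdog:
    """Toy model of a firewall-style AXI timeout (hypothetical, not the
    actual Xilinx firewall or bus fault isolator implementation)."""

    SLVERR = "SLVERR"  # AXI slave-error response

    def __init__(self, timeout_cycles):
        self.timeout = timeout_cycles
        self.counter = 0
        self.waiting = False   # a request is outstanding
        self.faulted = False   # slave has been taken off line

    def request(self):
        """Issue a request. A faulted slave errors out immediately."""
        if self.faulted:
            return self.SLVERR
        self.waiting = True
        self.counter = 0
        return None

    def clock(self, response_arrived):
        """Advance one clock cycle while a request may be outstanding."""
        if not self.waiting:
            return
        if response_arrived:
            self.waiting = False       # slave answered in time
        else:
            self.counter += 1
            if self.counter >= self.timeout:
                self.waiting = False
                self.faulted = True    # take the slave off line
```

The key point is the last branch: once the slave is declared faulted, the downstream bus never hangs again, because the firewall answers on the slave's behalf.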
As another alternative, I've been told that you can also connect Xilinx's VIP to the channel in a monitor mode to look for problems--but I have no experience with doing this myself.
I am also aware that several of Xilinx's cores have had problems with their bus interface logic which could cause the bus to lock up as you have described. In particular, I've seen several examples where a Xilinx core would accept one or more requests, but not return responses for all of those requests. At one time I was assured the problem was limited to their example/demonstration AXI cores. Since then, I've come to find problems in 2-3 other cores as well. Fixes have been promised in 2020.1 which ... hasn't happened yet. While I am not aware of any such problem in the XDMA, the conditions generating this sort of bus error seem to be holes in Xilinx's IP testing framework, so that leaves me wondering.
Either way, I'd recommend counting requests and responses under the assumption that the slave isn't responding to one (or more) requests at all--leaving the master (and possibly the interconnect) frozen in a situation where no further transactions can take place.
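As a behavioral sketch of that counting (in Python rather than RTL; the class is hypothetical, only the AXI read-channel signal names are real): count a burst as outstanding when the address handshake completes, and retire it on the final data beat. If traffic stops with the count nonzero, the slave swallowed a request.

```python
class AxiReadTracker:
    """Behavioral sketch of request/response counting on an AXI read
    channel. Not RTL; intended only to show the bookkeeping."""

    def __init__(self):
        self.outstanding = 0  # bursts accepted but not yet completed

    def clock(self, arvalid, arready, rvalid, rready, rlast):
        """Call once per clock edge with the sampled handshake signals."""
        if arvalid and arready:          # read address accepted
            self.outstanding += 1
        if rvalid and rready and rlast:  # final beat of a burst returned
            if self.outstanding == 0:
                raise RuntimeError("response with no outstanding request")
            self.outstanding -= 1
        return self.outstanding
```

In hardware this would be a small up/down counter on the two handshakes, with the nonzero-at-idle condition wired to an ILA trigger.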
05-15-2020 08:29 AM
In simulation, the AXI bus is showing that the transfer to the DMA is completing (see attached picture), yet the DMA never goes non-busy and the descriptor count remains at zero.
05-16-2020 01:21 PM
Hmm ... looks pretty fishy from here. It sure looks like the XDMA has a bug in it. Still ... I wonder if the RLAST signal was set correctly? I can't read ARLEN from here, nor can I count the RVALID beats to know if RLAST was appropriately set. It might be that the channel was taken down due to RVALID && RREADY && RLAST, but also that RLAST was sent too early. A bus fault isolator would detect that. If the bus fault isolator doesn't detect any bus problems, then you know the XDMA core is at fault, and that the bug lies somewhere within it.
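In code form, the check I'm describing amounts to something like this (a Python sketch of the rule, not any shipped checker; `check_rlast` is a made-up helper). An AXI read burst with ARLEN = N must return exactly N+1 data beats, with RLAST asserted on the last beat and only the last beat:

```python
def check_rlast(arlen, beats):
    """Check RLAST placement for one AXI read burst (illustrative only).

    arlen: the ARLEN value of the burst (beats = ARLEN + 1)
    beats: list of RLAST values, one per RVALID && RREADY beat
    Returns None if the burst is well formed, else a description
    of the protocol violation."""
    expected = arlen + 1
    for i, rlast in enumerate(beats):
        if rlast and i != expected - 1:
            return f"RLAST asserted early, on beat {i + 1} of {expected}"
        if not rlast and i == expected - 1:
            return f"RLAST missing on final beat {expected}"
    if len(beats) < expected:
        return f"burst ended after {len(beats)} of {expected} beats"
    return None  # burst well formed
```

An early RLAST would let the master close the burst before all the data arrived, which could look exactly like a lockup on the next transfer.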
05-17-2020 06:29 AM
ARLEN is fixed at 3; there are 4 RVALIDs with RLAST set only for the last one. This is also the case for the previous 13 burst transfers that work before the XDMA locks up on this one. The slave is Xilinx's DIMM IP. I agree that it looks like an XDMA bug.
05-17-2020 07:27 AM
05-17-2020 06:09 PM
So ... I took a look at the XDMA spec this evening. I was surprised to find that there is a configurable timeout on card-to-host write transactions. Have you tried setting this timeout value to zero in order to disable it?
05-18-2020 04:32 AM
Where did you see this timeout option in the spec? The only timeouts I found are:
1) The 50 ms / 50 us PCIe timeout (my response is less than a us)
2) The Config Flush Timeout, which applies to AXI Stream only; I'm using memory-mapped. Also, the spec says this timeout is supposed to close the descriptor (that's not happening). I suppose I can try it and see what happens.
05-18-2020 07:24 AM
Just to be sure, I set the timeout to 0 (though it shouldn't apply, since I'm using memory-mapped) and put in the AXI protocol checker. This made no difference, and the protocol checker reported nothing wrong.
05-18-2020 06:44 PM
@garywkowalski I think you're looking in all the right places. First off, there is no timeout in the XDMA; it should be able to hold forever. Also, as mentioned elsewhere, you shouldn't ever have to reset the IP, but maybe it's a good data point to have.
There are timeouts built into the PCIe protocol. This may have some relationship to the failure, but certainly shouldn't have resulted in a lockup.
The XDMA went through a pretty major overhaul right around that time. I know there were a lot of bug fixes. I would like to recommend that you try a newer version of the IP. I suspect that this behavior has been resolved.
05-19-2020 04:16 AM
@crohrer I am currently using version 4.1(Rev 71634) which is shown as being up-to-date and the recommended version for Vivado 2018.2. Are you saying that I will need to move to a more recent version of Vivado?
06-11-2020 06:27 AM
I upgraded to Vivado 2019.2 and Questa 2019.3, and updated all the IP (XDMA now at version 4.1, Rev 4). The behavior has not changed: the DMA works a number of times, then goes busy; the AXI transfer completes, but the DMA never goes non-busy and the descriptor count never reaches 1. The AXI protocol checker reports no problems.