garywkowalski
Visitor
1,515 Views
Registered: ‎07-15-2016

XDMA Simulation works for many transfers, then descriptor count = 0 & dma always busy

Using Vivado 2018.2 with XDMA 4.1 (Rev 71634). I'm doing DMA transfers from a DIMM to the PCIe (Card to Host). Everything looks good for a number of transfers (one descriptor at a time) until a cycle where the DIMM takes a bit longer to provide the data on the AXI bus. Typically it all works when the time from ARVALID falling to RVALID rising is ~148 ns; for the transfer that stops the process, this time is ~452 ns. Is a timeout happening? Also, I've tried doing the dma_bridge_reset, making sure to wait until bus activity is complete, and the DMA remains locked up.

Suggestions?

14 Replies
dgisselq
Scholar
1,487 Views
Registered: ‎05-21-2015

@garywkowalski,

That really sounds like the slave is somehow misbehaving.  Resetting a master is problematic, so I wouldn't trust doing so in general.  Perhaps you might want to try including a firewall between the slave and the interconnect?  A good firewall should give you a signal you can then trigger your logic analyzer on in order to determine which IP is at fault--the slave or the master.

Dan

garywkowalski
Visitor
1,452 Views
Registered: ‎07-15-2016

I don't believe the slave is "misbehaving." The handshakes are all there. It's just that sometimes the slave takes longer to respond, and that's when the XDMA seems to lock up. I want to know if the XDMA is timing out on the slave and locking up. The slave is only taking 452 ns in the "slow" case. Is there a way to adjust the timeout of the XDMA to be sure that's not what's happening?

dgisselq
Scholar
1,438 Views
Registered: ‎05-21-2015

@garywkowalski,

"Timing out" would be a violation of the AXI protocol.  AXI has no way of recovering from a slave that fails to respond within a given time window, so any good master shouldn't be depending upon such.  I'm assuming here that the XDMA is a "good master" that follows protocol--although I've been wrong before ...  At any rate, that's why I'm looking for and asking about any other possible causes.

I know of only two cores that implement AXI timeouts.  One is Xilinx's firewall core.  If a slave fails to respond properly within a given time window, that core will take the slave offline and return bus errors for any pending or future requests.  The other is my own bus fault isolator.  As far as I can tell, that core is similar if not identical to Xilinx's firewall with one exceptional feature: on a protocol error, such as a timeout, it can be configured to reset the downstream slave and so return it to operation.  Both cores should be able to provide a signal that can be used by some form of logic scope to see what was going on when the bug took place.  Unfortunately, using such a firewall might also make the bug disappear ....
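To make the timeout idea concrete, here is a behavioral sketch in Python (a real firewall such as Xilinx's AXI Firewall core is RTL, not software): once a read burst has been accepted, the slave must make progress within some cycle budget, or the watchdog trips and the firewall takes the slave offline. The `TIMEOUT_CYCLES` value is an assumption for the sketch, not a documented core default.

```python
# Illustrative watchdog sketch of the firewall timeout described above.
# Assumed budget for this sketch only -- not a Xilinx core default.
TIMEOUT_CYCLES = 256

def run_watchdog(trace, timeout=TIMEOUT_CYCLES):
    """trace: per-clock tuples (ar_handshake, burst_done), where
    ar_handshake = ARVALID && ARREADY (read burst accepted) and
    burst_done   = RVALID && RREADY && RLAST (burst completed)."""
    outstanding = 0  # accepted bursts the slave still owes responses for
    idle = 0         # cycles since the slave last completed a burst
    for ar_handshake, burst_done in trace:
        if ar_handshake:
            outstanding += 1
        if burst_done:
            outstanding -= 1
            idle = 0
        elif outstanding:
            idle += 1
            if idle > timeout:
                return "fault"  # a firewall would now fence the slave off
                                # and return bus errors for pending requests
    return "ok"
```

A slave that answers within the budget passes; one that accepts a request and then goes silent trips the watchdog, which is exactly the signal you would trigger a logic analyzer on.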

As another alternative, I've been told that you can also connect Xilinx's VIP to the channel in a monitor mode to look for problems--but I have no experience with doing this myself.

I am also aware that several of Xilinx's cores have had problems with their bus interface logic which could cause the bus to lock up as you have described.  In particular, I've seen several examples where a Xilinx core would accept one or more requests, but not return responses for all of those requests.  At one time I was assured the problem was limited to their example/demonstration AXI cores.  Since then, I've come to find problems in 2-3 other cores as well.  Fixes have been promised in 2020.1, which ... hasn't happened yet.  While I am not aware of any such problem in the XDMA, the conditions generating this sort of bus error seem to fall through holes in Xilinx's IP testing framework, so that leaves me wondering.

Either way, I'd recommend counting requests and responses under the assumption that the slave isn't responding to one (or more) requests at all--leaving the master (and possibly the interconnect) frozen in a situation where no further transactions can take place.
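The counting suggested above can be sketched behaviorally (in Python, purely illustrative; in practice this would be an RTL monitor or a simulation checker, and the sketch assumes in-order responses on a single ID): track accepted read bursts against completed ones, so a burst the slave never finishes shows up as a non-zero outstanding count.

```python
# Behavioral sketch of counting AXI read requests vs. responses.
class AxiReadTracker:
    def __init__(self):
        self.pending = []     # ARLEN of each accepted, not-yet-completed burst
        self.beats_seen = 0   # data beats received for the oldest burst

    def on_cycle(self, arvalid, arready, arlen, rvalid, rready, rlast):
        """Call once per clock with the sampled AXI handshake signals."""
        if arvalid and arready:
            self.pending.append(arlen)              # request accepted
        if rvalid and rready:
            self.beats_seen += 1
            if rlast:
                expected = self.pending.pop(0) + 1  # beats = ARLEN + 1
                if self.beats_seen != expected:
                    raise RuntimeError(
                        f"RLAST after {self.beats_seen} beats, "
                        f"expected {expected}")
                self.beats_seen = 0

    @property
    def outstanding(self):
        return len(self.pending)  # bursts the slave still owes
```

At the end of the simulation, a non-zero `outstanding` means the slave swallowed a request, which would leave the master (and possibly the interconnect) frozen exactly as described.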

Dan

garywkowalski
Visitor
1,385 Views
Registered: ‎07-15-2016

In simulation, the AXI bus is showing that the transfer to the DMA is completing (see attached picture), yet the DMA never goes non-busy and the descriptor count remains at zero.

 

dma_axi.PNG
dgisselq
Scholar
1,331 Views
Registered: ‎05-21-2015

@garywkowalski,

Hmm ... it looks pretty fishy from here; it sure looks like the XDMA has a bug in it.  Still ... I wonder if the RLAST signal was set correctly?  I can't read ARLEN from here, nor can I count the RVALID beats to know if RLAST was appropriately set.  It might be that the channel was taken down on RVALID && RREADY && RLAST, but that RLAST was sent too early.  A bus fault isolator would detect that.  If the bus fault isolator doesn't detect any bus problems, then you know the XDMA core is at fault, and that the bug lies somewhere within it.

Dan

garywkowalski
Visitor
1,280 Views
Registered: ‎07-15-2016

ARLEN is fixed at 3; there are 4 RVALIDs with RLAST set only for the last one. This is also the case for the previous 13 burst transfers that work before the XDMA locks up on this one. The slave is Xilinx's DIMM IP. I agree that it looks like an XDMA bug.
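The burst shape described above can be captured as a quick sanity check (illustrative Python, not part of any Xilinx flow): ARLEN = 3 encodes exactly 4 data beats, with RLAST asserted only on the final beat.

```python
# Sanity check of an AXI read burst: beats = ARLEN + 1, RLAST on last beat only.
def check_burst(arlen, rlast_per_beat):
    """rlast_per_beat: RLAST value sampled on each RVALID && RREADY beat."""
    beats = arlen + 1  # AXI encodes burst length as ARLEN + 1 transfers
    assert len(rlast_per_beat) == beats, "wrong number of data beats"
    assert rlast_per_beat[-1] and not any(rlast_per_beat[:-1]), \
        "RLAST must be asserted only on the final beat"
    return True

# The waveform described: 4 beats, RLAST on the last one only.
check_burst(3, [False, False, False, True])
```

Since the protocol checker later confirms this, the check would pass for all 14 bursts, which is consistent with the fault lying inside the XDMA rather than on the bus.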

dgisselq
Scholar
1,274 Views
Registered: ‎05-21-2015

@garywkowalski,

This might give you some thoughts.

I'd offer you one of my own DMAs to use and try instead, but none of them handle PCIe, so I'm afraid I'm at a loss as to where to suggest you go next.

Dan

dgisselq
Scholar
1,207 Views
Registered: ‎05-21-2015

@garywkowalski,

So ... I took a look at the XDMA spec this evening.  I was surprised to find that there is a configurable timeout on Card-to-Host write transactions.  Have you tried setting this timeout value to zero in order to disable it?

Dan

garywkowalski
Visitor
1,180 Views
Registered: ‎07-15-2016

Where did you see this timeout option in the spec? The only timeouts I found are:

1) The 50 ms / 50 µs PCIe timeout (my response is less than a µs)

2) The Config Flush Timeout, which applies to AXI-Stream only; I'm using Memory Mapped. Also, the spec says this timeout is supposed to close the descriptor, which isn't happening. I suppose I can try it and see what happens.

garywkowalski
Visitor
1,173 Views
Registered: ‎07-15-2016

The spec says the config flush timeout (which probably doesn't apply, since I'm using memory-mapped) defaults to zero (disabled) anyway.

garywkowalski
Visitor
1,159 Views
Registered: ‎07-15-2016

Just to be sure, I set the timeout to 0 (though it shouldn't apply, since I'm using memory-mapped) and inserted the AXI protocol checker. This made no difference, and the protocol checker reported nothing wrong.

I'm stuck.

crohrer
Xilinx Employee
1,112 Views
Registered: ‎01-04-2018

@garywkowalski I think you're looking in all the right places. First off, there is no timeout in the XDMA; it should be able to hold forever. Also, as mentioned elsewhere, you shouldn't ever have to reset the IP, but maybe it's a good data point to have.

There are timeouts built into the PCIe protocol. This may have some relationship to the failure, but certainly shouldn't have resulted in a lockup.

The XDMA went through a pretty major overhaul right around that time. I know there were a lot of bug fixes. I would like to recommend that you try a newer version of the IP. I suspect that this behavior has been resolved.

garywkowalski
Visitor
1,092 Views
Registered: ‎07-15-2016

@crohrer I am currently using version 4.1 (Rev 71634), which is shown as being up-to-date and the recommended version for Vivado 2018.2. Are you saying that I will need to move to a more recent version of Vivado?

garywkowalski
Visitor
953 Views
Registered: ‎07-15-2016

I upgraded to Vivado 2019.2 and Questa 2019.3, and updated all the IP (DMA now at version 4.1, Rev 4). The behavior has not changed: the DMA works a bunch of times, then goes busy; the AXI transfer completes, but the DMA never goes non-busy and the descriptor count never goes to 1. The AXI protocol checker reports no problems.
