AXI DMA Cyclic BD not working

Scholar
Registered: 04-04-2014

Hi,

 

I am trying to get the cyclic descriptor functionality working with the axi_dma core in scatter gather mode.

I have tried this in simulation and so far it just stops at the last descriptor in the chain.

 

My simulation does the following in this order:

 

- Write descriptor table to memory

- Write first descriptor address to CURR_PTR

- Set Cyclic BD bit in control register

- Set the interrupt enable bits and the DMA engine start bit in the control register (in a separate write from the previous step)

- Read back the control register and check that the writes have taken effect.

- Write address of last descriptor to TAIL_PTR

 

I have also tried setting the tail pointer to an address not in the chain (the example gives 0x50 as a suggestion), and this doesn't seem to work either.

 

I am using Vivado 2014.2 and have checked the changelog up to 2016.2 for anything related to this (post 2014.2) and see nothing.

 

Help?

 

Thanks

 

 

Accepted Solutions
Scholar
Registered: 04-04-2014

Ok, I have managed to make the DMA engine do what I want, but I have cheated slightly.

 

In my implementation:

- The DMA engine wraps around to the first descriptor once the last has been processed.

- At any point if the DMA engine reaches a stale descriptor it stops.

 

To do this you need to do the following:

- Disable Cyclic BD mode

- Make the last descriptor point back to the first.

- Program the Tail Descriptor with an address outside of the range used by the descriptors. In my case the last descriptor is at address 0x8000_0100 so I make my Tail Pointer 0x8000_0140.

 

Basically in non-cyclic BD mode the engine stops once the current pointer is >= the tail pointer. If the last descriptor points back to the first and the tail is greater than the last descriptor address then this never happens....
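
The behaviour described above can be illustrated with a toy model. This is only a sketch of the rule as stated in this post (halt on a stale descriptor, or when the current pointer is >= the tail pointer); it is not the core's actual logic, and `Descriptor`, `make_ring` and `run_engine` are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    addr: int
    next_addr: int
    complete: bool = False  # models the completed flag in the status word


def make_ring(base=0x8000_0000, count=5, stride=0x40):
    """Build a ring whose last descriptor points back to the first."""
    addrs = [base + i * stride for i in range(count)]
    return {a: Descriptor(a, addrs[(i + 1) % count]) for i, a in enumerate(addrs)}


def run_engine(ring, first, tail, max_steps, refresh=False):
    """Toy non-cyclic engine. Returns (addresses processed, stop reason)."""
    processed = []
    cur = first
    for _ in range(max_steps):
        desc = ring[cur]
        if desc.complete:
            return processed, "stale"   # software has not refreshed this one
        desc.complete = True            # engine marks the descriptor as done
        processed.append(cur)
        if cur >= tail:
            return processed, "tail"    # stop rule as described in the post
        if refresh:
            desc.complete = False       # software consumed the data in time
        cur = desc.next_addr
    return processed, "step-limit"


# Tail on the last descriptor: one pass, then the engine idles.
one_pass, why_one = run_engine(make_ring(), 0x8000_0000, 0x8000_0100, 20)

# Tail beyond the ring, no refresh: wraps, then halts on the stale first descriptor.
wrapped, why_wrapped = run_engine(make_ring(), 0x8000_0000, 0x8000_0140, 20)

# Tail beyond the ring, software keeps up: runs until the model's step limit.
refreshed, why_refreshed = run_engine(make_ring(), 0x8000_0000, 0x8000_0140, 20, refresh=True)

print(len(one_pass), why_one)
print(len(wrapped), why_wrapped)
print(len(refreshed), why_refreshed)
```

The three runs correspond to the tail pointer on the last descriptor, the tail pointer beyond the ring with no software refresh, and the tail pointer beyond the ring with software clearing each descriptor before the engine returns to it.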

 

Worth knowing. 

 

Thanks

18 Replies
Scholar
Registered: 04-04-2014

Could I also clarify what I understand to be the two modes of operation in Scatter Gather mode.

 

1. No cyclic BD

In this mode I program the start and end pointers of the descriptor chain. When the end of the chain is reached the DMA engine halts, and to restart it (by writing to the tail pointer?) I must make the descriptors not stale and then write a one to DMACR.RS. This is true even if my last descriptor points back to the first descriptor and I have cleared that first descriptor's completed bit.

 

2. Cyclic BD

Here I set up the chain so that the last descriptor points back to the first. To start the DMA engine, I write an address to the tail pointer that is not part of the descriptor table. The engine wraps round back to the first pointer after it has processed the last and ignores whether any of the descriptors are stale.

 

So, assuming all this is true, I would prefer a third method that doesn't seem to be available.

 

- The DMA engine should not halt when it reaches the tail pointer descriptor, provided I have set it up to cyclically wrap back round to the first descriptor.

- The completed bits should not be ignored. The engine can halt if it reaches a stale descriptor.

- My software will clear these bits as the application reads the data already written.

- My software will do this often enough to ensure that the engine does not encounter any stale descriptors.

 

This ensures that the engine carries on without needing the software to restart it by writing to the tail pointer once the end of the descriptor chain has been reached, provided the descriptors are refreshed often enough.

 

Can someone confirm that this is not possible with the current version of the core and if so can I request it as a new feature?

 

Thanks

Xilinx Employee
Registered: 08-02-2011

Hello,

 

Your descriptions in 1 and 2 are correct. The only caveat may be your last statement in 1.

This is true even if my last pointer points back to the first descriptor and I have cleared that
first descriptors completed bit.

I have never tried this. If you clear the completed bits via SW before the DMA gets back to that descriptor, it may still keep going.

 

 

Essentially, cyclic mode is implemented (to my knowledge) by the DMA simply ignoring the IOC bits in any given BD so if you point the tail descriptor back to the beginning, it'll keep going in cyclic mode.

 

 

So, assuming all this is true, I would prefer a third method that doesn't seem to be available.
...

The completed bit's should not be ignored
...

 

You are correct that this method you described is not possible based on the way I described how cyclic mode works above (with the possible caveat above that might work).

 

Maybe I'm missing it, but what would be the need for this third method? In either case, your software needs to react fast enough to consume the data before the DMA starts overwriting itself (or aborts).

 

Anyway, can you post some screenshots of your simulation showing BD setup and the DMA stopping on the last one?

 

Have you checked status registers for any errors when it halts?

Scholar
Registered: 04-04-2014

Thank you bwiec for looking into this for me. I'll give as much detail as I can to try and help chase it down.

 

Here is my BD layout. Ignore the translational BRAM. This is a leftover from the CDMA approach I started with. It is not used with the standard DMA engine. The descriptors for this don't allow source/dest address pairing to be specified, so I can't insert address descriptors in the middle of my chain to update the translation registers. I don't think I need to do this anyway.

 

BD PCIE.png

Here is my descriptor chain. The chain is stored on the host PC side. The base address for this memory region, translated to the axi side, is 0x8000_0000. The base address for the memory region that will receive the data on the host PC side is 0x6000_0000.

Each descriptor is set up to transfer the same number of bytes, 0x0160, and each packet I attempt to transfer is smaller than this. The last descriptor points back to the first.

 

Descriptor 1

Addr: 0x8000_0000

Next Descriptor Ptr: 0x8000_0040

Buffer Addr: 0x6000_0000

Control: 0x0000_0160

 

Descriptor 2

Addr: 0x8000_0040

Next Descriptor Ptr: 0x8000_0080

Buffer Addr: 0x6000_0168

Control: 0x0000_0160

 

Descriptor 3

Addr: 0x8000_0080

Next Descriptor Ptr: 0x8000_00C0

Buffer Addr: 0x6000_02D0

Control: 0x0000_0160

 

Descriptor 4

Addr: 0x8000_00C0

Next Descriptor Ptr: 0x8000_0100

Buffer Addr: 0x6000_0438

Control: 0x0000_0160

 

Descriptor 5

Addr: 0x8000_0100

Next Descriptor Ptr: 0x8000_0000

Buffer Addr: 0x6000_05A0

Control: 0x0000_0160
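
For anyone wanting to generate a chain like this rather than type it in, here is a minimal sketch. The addresses, strides and control value are taken from the listing above. The 16-word descriptor image and its word order (NXTDESC, NXTDESC_MSB, BUFFER_ADDRESS, BUFFER_ADDRESS_MSB, two reserved words, CONTROL, STATUS, then application words and padding) are an assumption based on the AXI DMA product guide and should be checked against the PG for your core version. `build_chain` is a hypothetical helper name.

```python
import struct

DESC_STRIDE = 0x40       # spacing between descriptors in the listing
BUF_STRIDE = 0x168       # spacing between receive buffers in the listing
CONTROL = 0x0000_0160    # buffer length field, same for every descriptor


def build_chain(desc_base=0x8000_0000, buf_base=0x6000_0000, count=5):
    """Return a list of (descriptor address, 64-byte image) for a circular chain."""
    chain = []
    for i in range(count):
        addr = desc_base + i * DESC_STRIDE
        # The modulo makes the last descriptor point back to the first.
        next_addr = desc_base + ((i + 1) % count) * DESC_STRIDE
        buf = buf_base + i * BUF_STRIDE
        # Assumed word order, see the note above; unused words are zero.
        words = [next_addr, 0, buf, 0, 0, 0, CONTROL, 0] + [0] * 8
        chain.append((addr, struct.pack("<16I", *words)))
    return chain


for addr, image in build_chain():
    nxt, _, buf, _, _, _, ctrl, _status = struct.unpack_from("<8I", image)
    print(f"{addr:#010x}: next={nxt:#010x} buf={buf:#010x} ctrl={ctrl:#010x}")
```

Each image would then be written to the scatter gather memory region at the listed descriptor address by whatever memory model the testbench uses.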

 

Here is my simulation setup with regards to the DMA:

  1. Write to the PCIE address translation registers to map the Host PC Scatter Gather and Data memory regions.
  2. Write descriptors to Host PC SG region.
  3. Write to S2MM_CURDESC register with 0x8000_0000
  4. If I want Cyclic BD mode, set the Cyclic BD bit in the S2MM Control Reg now.
  5. In a separate write from step 4, set S2MM_DMACR.RS and S2MM_DMACR.Err_IrqEn
  6. Write to S2MM_TAILDESC register, value depends on test, see below...
  7. Scatter Gather begins. DMA engine proceeds through chain correctly.
  8. In a separate process, I poll the descriptor complete bits of each descriptor in numerical order (in Host PC mem region) every 10 clock cycles. If complete, I set to zero and poll the next descriptor.
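
Steps 3 to 6 above can be sketched as a register write sequence. The bit positions for RS, Cyclic BD and Err_IrqEn are consistent with the control register values reported elsewhere in this thread (0x0001_4003 without cyclic mode, 0x0001_4013 with it). The register offsets and the control register reset value of 0x0001_0002 are assumptions taken from the AXI DMA product guide, and `FakeBus` and `start_s2mm` are hypothetical stand-ins for a real AXI4-Lite master.

```python
S2MM_DMACR = 0x30      # assumed register offsets, check the product guide
S2MM_CURDESC = 0x38
S2MM_TAILDESC = 0x40

DMACR_RS = 1 << 0
DMACR_CYCLIC_BD = 1 << 4
DMACR_ERR_IRQEN = 1 << 14


class FakeBus:
    """Stand-in for an AXI4-Lite master: stores registers and logs writes."""

    def __init__(self):
        self.regs = {S2MM_DMACR: 0x0001_0002}  # assumed reset value
        self.log = []

    def write(self, offset, value):
        self.regs[offset] = value
        self.log.append((offset, value))

    def read(self, offset):
        return self.regs.get(offset, 0)


def start_s2mm(bus, first_desc, tail_desc, cyclic):
    bus.write(S2MM_CURDESC, first_desc)                                  # step 3
    if cyclic:
        bus.write(S2MM_DMACR, bus.read(S2MM_DMACR) | DMACR_CYCLIC_BD)    # step 4
    ctrl = bus.read(S2MM_DMACR) | DMACR_RS | DMACR_ERR_IRQEN
    bus.write(S2MM_DMACR, ctrl)                                          # step 5
    if not bus.read(S2MM_DMACR) & DMACR_RS:                              # read back
        raise RuntimeError("RS bit did not stick")
    bus.write(S2MM_TAILDESC, tail_desc)                                  # step 6


bus = FakeBus()
start_s2mm(bus, 0x8000_0000, 0x8000_0250, cyclic=True)
print(f"DMACR = {bus.read(S2MM_DMACR):#010x}")
for offset, value in bus.log:
    print(f"write {offset:#04x} <= {value:#010x}")
```

The tail descriptor write comes last because that is the write that triggers descriptor fetching in this sequence.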

I will describe the results of my tests below...

Scholar
Registered: 04-04-2014

@bwiec wrote:

Hello,

 

Your descriptions in 1 and 2 are correct. The only caveat may be your last statement in 1.

This is true even if my last pointer points back to the first descriptor and I have cleared that
first descriptors completed bit.

I have never tried this. If you clear the completed bits via SW before the DMA gets back to that descriptor, it may still keep going.

 

 


This did not work. As per the PG, the engine goes IDLE once it processes the last descriptor, even though it points back to the first and I have cleared the complete bits before it gets there.

Upon idling the CURRDESC and TAILDESC regs are both 0x8000_0100, the Control Reg is 0x0001_4003 and the Status Reg is 0x0001_100A
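
The register values quoted above can be decoded with a small helper. This is a sketch: the bit names and positions are assumptions taken from the AXI DMA product guide register map, and `decode` and `has_error` are hypothetical helper names.

```python
DMACR_FIELDS = {
    0: "RS", 2: "Reset", 3: "Keyhole", 4: "Cyclic_BD",
    12: "IOC_IrqEn", 13: "Dly_IrqEn", 14: "Err_IrqEn",
}
DMASR_FIELDS = {
    0: "Halted", 1: "Idle", 3: "SGIncld",
    4: "DMAIntErr", 5: "DMASlvErr", 6: "DMADecErr",
    8: "SGIntErr", 9: "SGSlvErr", 10: "SGDecErr",
    12: "IOC_Irq", 13: "Dly_Irq", 14: "Err_Irq",
}
DMASR_ERROR_MASK = 0x770  # bits 4, 5, 6, 8, 9 and 10


def decode(value, fields):
    """Return the names of the single-bit fields that are set in value."""
    return [name for bit, name in sorted(fields.items()) if value & (1 << bit)]


def has_error(status):
    return bool(status & DMASR_ERROR_MASK)


control, status = 0x0001_4003, 0x0001_100A
print("DMACR:", decode(control, DMACR_FIELDS), "threshold", (control >> 16) & 0xFF)
print("DMASR:", decode(status, DMASR_FIELDS), "error" if has_error(status) else "no error")
```

Under these assumptions the status value reads as idle with no error bits set, which fits an engine that has simply run out of descriptors rather than one that has faulted.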

 

Here is the simulation snapshot. You can see the 5 AXI-stream transactions and their associated AXI4 transactions into the axi-pcie core. No interrupt is raised.

 

This is as I would expect from the description in the Product Guide.

 

non-cyclic.png

Scholar
Registered: 04-04-2014

bwiec wrote:

Anyway, can you post some screenshots of your simulation showing BD setup and the DMA stopping on the last one?

 

Have you checked status registers for any errors when it halts?



Ok, so I have actually got the cyclic BD mode working in my simulation now.

 

I will now describe what happened in a couple of different scenarios and how I fixed it. Again, I have the last descriptor pointing back to the first, and I am clearing the complete bits as they get set as before.

 

Scenario 1.

TAIL_DESC set to 0x8000_0100 (the last descriptor).

The outcome is actually exactly the same as with the non-cyclic BD example I gave above. The engine IDLEs after the 5 descriptors have been processed. After it idles CURRDESC = TAIL_DESCR = 0x8000_0100, Control = 0x0001_4013, Status = 0x0001_100A.

The screenshot is the same but here it is for reference.

 

cyclic.png

 

Scenario 2.

TAIL_DESC set to 0x8000_0050 (as suggested in the PG). The engine IDLEs after the first 2 descriptors have been processed. I am assuming this is because the tail pointer is between the address for the second and third descriptors. 

 

After it idles CURRDESC = TAIL_DESCR = 0x8000_0040 (the second descriptor), Control = 0x0001_4013, Status = 0x0001_100A.

 

cyclic50.png

 

Scenario 3

TAIL_DESC set to 0x8000_0250 (an address completely outside the descriptor chain). The engine IDLEs after 6 descriptors have been processed. This is all 5 descriptors and then the first descriptor again. The RP seems to have paused waiting for a completion from a read to one of the DMA registers....

 

After it idles CURRDESC = TAIL_DESCR = 0x8000_0040 (the second descriptor), Control = 0x0001_4013, Status = 0x0001_100A.

 

 Scenario 4 (the fix)

TAIL_DESC set to 0x8000_0250 (an address completely outside the descriptor chain). 

I then discovered why the RP halted. By this time the RP was receiving fairly large data packets (0x160 size), read requests for the descriptor table and also completions with data in response to read requests for the DMA registers.

 

This resulted in two back to back TLPs with no gap in between. The problem is that the SOF of the second packet occurred in the same clock cycle as the EOF of the first packet, so the 128-bit data window actually contains data from two adjacent packets. This is allowed but not dealt with by the testbench. In pci_exp_usrapp_rx.vhd you will find:

 

case trn_sofn_eof_n is
  when "11" => ...  -- some code
  when "01" => ...  -- some code
  when "10" => ...  -- some code
  when "00" => null;  -- SOF and EOF in the same clock cycle: ignored
end case;

 

That last option is the problem case.
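
To make the missing case concrete, here is a toy sketch of the bookkeeping a fix would need. It assumes a simplified receiver that sees only the two active-low flags and tracks whether a packet is currently open; splitting the 128-bit data window between the two packets is not modelled. `classify_beat` and `walk` are hypothetical names and are not part of the Xilinx testbench.

```python
def classify_beat(sof_n, eof_n, in_packet):
    """Return (events, in_packet_after) for one beat with active-low flags."""
    sof, eof = sof_n == 0, eof_n == 0
    if sof and eof:
        if in_packet:
            # EOF closes the open packet and SOF opens the next one
            # in the same beat (the back-to-back case described above).
            return ["end", "start"], True
        return ["start", "end"], False   # a packet that fits in one beat
    if sof:
        return ["start"], True
    if eof:
        return ["end"], False
    return [], in_packet                 # idle cycle or mid-packet data


def walk(beats):
    """Run a list of (sof_n, eof_n) beats and collect the packet events."""
    in_packet, events = False, []
    for sof_n, eof_n in beats:
        beat_events, in_packet = classify_beat(sof_n, eof_n, in_packet)
        events.extend(beat_events)
    return events


# Two packets back to back: the third beat carries both EOF and SOF.
print(walk([(0, 1), (1, 1), (0, 0), (1, 0)]))
```

A handler that treats the both-asserted beat as a no-op never closes the first packet or opens the second, which matches the stall described above.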

 

I tried to fix the testbench but there are actually a lot of changes needed to make it work. So, for now, I have made the data transfers much smaller. This ensures that the data transfer has finished before another packet comes along to be processed and this situation doesn't occur. Having done that the cyclic mode continues indefinitely as expected....

 

On another note, the PG might need to be a bit clearer when describing what to do with the TAILDESC in cyclic BD mode. It suggests using an address not in the chain, and that it is only used to start the engine (the example is to use 0x50 when the tail is at 0x1C0). However this did not work for me, presumably because 0x50 is in the address range used by my chain, although it isn't one of the start addresses. It actually has to be completely outside of the address range used (0x250 for me), which is not what is suggested. It is not true that "The Tail Descriptor register does not serve any purpose and is used only to trigger the DMA engine" as it says in the PG....I would change this wording.

Scholar
Registered: 04-04-2014

@bwiec wrote:

 

 

Maybe I'm missing it, but what would be the need for this third method? In either case, your software needs to react fast enough to consume the data before the DMA starts overwriting itself (or aborts).

 

 


I see your point, but I'll try and explain. 

 

At the moment in cyclic mode the chain wraps around but does not stop if a stale descriptor is encountered, which is a situation you might want to avoid.

 

You can only avoid this in non-cyclic mode. But here, once the end of the chain is encountered, the software has to react quickly to reset the engine and allow it to start from the beginning. This is highly time critical: if it isn't done quickly, data will be lost at the input to the axi4-stream side of the dma core. In our application the data is real time, so a long delay here will lose data (a FIFO will fill up somewhere). 

 

You say that the PC software has to work fast enough to catch up with the descriptors anyway. This is true on average, but it can be bursty. I'm not sure our software would normally guarantee a response time, but with a suitably large chain it doesn't need to (in our case it would take 100s of ms to complete the chain). It only needs to have cleared the complete bits on the first part of the chain, so that the engine is able to wrap round until the end. Effectively it treats the wrap around as if it were in the middle of the chain in a non-cyclic mode.

 

Does this make sense? 

Voyager
Registered: 02-17-2009

Hi @mistercoffee,

 

I am having a similar problem and trying to set up a simulation to be able to debug it. I was hoping to be able to see what's going on inside of the AXI_DMA core but it seems that the core sources are somehow treated differently and I am not able to add any of the internal signals to the waveform. Looking at your screenshot, I can see that you were able to view some state machine's state. Can you share how you did it? Also, which state machine is that? I searched through the sources for the states in the diagram (WAIT_FOR_READY, ASSERT_LAST) but couldn't find them anywhere. Or, is this your state machine? 

 

Thanks,

/Mikhail

Scholar
Registered: 04-04-2014

@mmatusov wrote:

Hi @mistercoffee,

 

I am having a similar problem and trying to set up a simulation to be able to debug it. I was hoping to be able to see what's going on inside of the AXI_DMA core but it seems that the core sources are somehow treated differently and I am not able to add any of the internal signals to the waveform. Looking at your screenshot, I can see that you were able to view some state machine's state. Can you share how you did it? Also, which state machine is that? I searched through the sources for the states in the diagram (WAIT_FOR_READY, ASSERT_LAST) but couldn't find them anywhere. Or, is this your state machine? 

 

Thanks,

/Mikhail


Ooh, this is going back a little and I haven't touched the project much since, but I will try and help.

 

I definitely used the AXI to memory mapped PCIE core. But I think I had to use a testbench for the standard 7series Integrated Block, as I don't think there was one available for the axi version (at least for PIPE mode, which is a must).

 

As far as the core goes I am using it with "Include Shared Logic (Clocking) in the example design" ticked. That may make a difference.

 

The state machine (under DMA AXI Stream) is my own state machine that I use to control data flow into and out of the AXI interface.

 

The other signals in my screenshot look like either external AXI signals, within my block design but outside the PCIE core, or test bench signals (from the example design).

 

I hope that helps.

Voyager
Registered: 02-17-2009

Hi @mistercoffee,

 

Thank you for your response. At the moment I am not looking at the PCIe side of the things at all. The test bench I am using is basically the Xilinx test bench for the AXI_DMA and the memory attached to it is in BRAM. I am just trying to make sense out of the S2MM DMA behaviour in the simplest case where I have MM2S looped back to S2MM and a single descriptor in each direction without cycling. 

 

I was actually able to add internal AXI_DMA signals to the waveform, so there is hope that I can figure it out.

 

Thanks,

/Mikhail

Observer
Registered: 09-02-2015

Hi Fellow FPGAers,

 

I am having the same problem. I am using Vivado 2016.2 with a MicroZed 7Z010 board and got the AXI DMA loopback "more or less" working. I modified the example app from the SDK directory to send packets endlessly. The trouble is, the loop gets stuck once the shorter of the Tx or Rx BD rings has run through all its buffers and reached the end.

 

The last buffer's next pointer does point to the first buffer, but it seems the used ones are not returned to the ring.

 

Please help by examining the attached C file and advise what I missed.

 

Thank you very much.  Best regards,

 

Scholar
Registered: 04-04-2014

@fpgaioc wrote:

The trouble is, the loop gets stuck once the lesser of the Tx or Rx Bd

ring runs through all the buffers and reaches the end.

 

The last buffer's next pointer does point to the first buffer.  But it seems the used ones are not

returned to the ring.

 


Are you in non-cyclic mode? If so, are you refreshing the descriptors after you've processed them? I found that in cyclic mode you don't have to do this and the ring never halted.
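
At the descriptor level, "refreshing" here means clearing the completed flag so that the engine no longer sees the descriptor as stale. A minimal sketch follows, assuming the status word is word 7 of the descriptor and the completed flag is bit 31 (both taken from the AXI DMA product guide and worth verifying for your version). `reclaim` is a hypothetical helper name. In the Xilinx bare-metal driver the ring functions XAxiDma_BdRingFromHw, XAxiDma_BdRingFree, XAxiDma_BdRingAlloc and XAxiDma_BdRingToHw appear to play the same role, though that mapping is an assumption here.

```python
STATUS_WORD = 7      # assumed index of the STATUS word within a descriptor
CMPLT = 1 << 31      # assumed position of the completed flag


def reclaim(ring, start):
    """Clear completed descriptors in ring order, starting at index start.

    Stops at the first descriptor the engine has not finished yet and
    returns (number refreshed, index to poll next).
    """
    idx, count = start, 0
    while count < len(ring) and ring[idx][STATUS_WORD] & CMPLT:
        ring[idx][STATUS_WORD] = 0       # descriptor is no longer stale
        idx = (idx + 1) % len(ring)
        count += 1
    return count, idx


# Five descriptors of 16 words each; the engine has completed the first two.
ring = [[0] * 16 for _ in range(5)]
ring[0][STATUS_WORD] = CMPLT | 0x100
ring[1][STATUS_WORD] = CMPLT | 0x0C0

refreshed_count, next_index = reclaim(ring, 0)
print(refreshed_count, next_index)
```

Calling this repeatedly with the returned index gives a polling loop that follows the engine around the ring.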

Observer
Registered: 09-02-2015

Hi MisterCoffee,

 

Thank you very much for your response.

 

I tried to call, or not call, 

XAxiDma_SelectCyclicMode(TxRingPtr, 1, 1);

XAxiDma_SelectCyclicMode(RxRingPtr, 1, 1);

 

but got the same behavior: the DMA stops once it reaches and uses the last BD in the shorter ring (Tx or Rx). It is strange that the longer ring seems to be truncated to the same length; I forgot to mention this in the first post.

 

So, it seems I am using the cyclic mode, but I am not sure.

 

How do I refresh the descriptors?  Could you please browse my C file and advise?

 

Thank you again, best regards,

Scholar
Registered: 04-04-2014

@fpgaioc wrote:

Hi MisterCoffee,

 

Thank you very much for your response.

 

I tried to call, or not call, 

XAxiDma_SelectCyclicMode(TxRingPtr, 1, 1);

XAxiDma_SelectCyclicMode(RxRingPtr, 1, 1);

 

but got the same behavior: the DMA stops once it reaches and uses the last BD in the shorter

ring (Tx or Rx).  It is strange that the longer ring seems to be truncated to the same length

-- I missed to mention this in the first post.

 

So, it seems I am using the cyclic mode, but I am not sure.

 

How do I refresh the descriptors?  Could you please browse my C file and advice?

 

Thank you again, best regards,


Hi,

 

I would but I'm afraid I probably wouldn't be much use. I have only been using the core in simulation with VHDL and verilog testbenches at a low level. Sorry!

Voyager
Registered: 10-31-2016

Hi, 

 

I am trying to build a PCIe project and could use some help from yours.

Does your PCIe project work?
Can you forward me the project?

 

thanks 

Advisor
Registered: 10-10-2014

Hello @mistercoffee , @bwiec 

We're currently trying to get cyclic dma working, and by searching I came across this very interesting forum post.

I'm not a (V)HDL expert, I can do some basic simulation. For AXI4-Lite I use a BFM that I copied from a Silica tutorial, but I've always wondered how other people use AXI busses, or access / init memory from their testbenches...

I hope you could explain to me how you can simulate that whole BD, set up the DMA descriptors in memory, and how you write to all these registers, as you described in that post: 

  1. Write to the PCIE address translation registers to map the Host PC Scatter Gather and Data memory regions.
  2. Write descriptors to Host PC SG region.
  3. Write to S2MM_CURDESC register with 0x8000_0000
  4. If I want Cyclic BD mode, set this bit in the S2MM Control Reg now.
  5. In a separate write to 4, set S2MM_DMACR.RS and S2MM_DMACR.Err_IrqEn
  6. Write to S2MM_TAILDESC register, value depends on test, see below...
  7. Scatter Gather begins. DMA engine proceeds through chain correctly.
  8. In a separate process, I poll the descriptor complete bits of each descriptor in numerical order (in Host PC mem region) every 10 clock cycles. If complete, I set to zero and poll the next descriptor.

I don't want to ask too much of your time, but if you could give me some hints/clues, I can probably find my way out. Like:

Q1) I expect you first create a hdl wrapper around the bd, and then use that wrapper file as the UUT in your testbench?

Q2) how do you write to all the registers? Like PCIE address translation registers or S2MM_CURDESC ?

Q3) how do you write / init the descriptor list in memory? from a memory init file or also some 'write' function?

Xilinx Employee
Registered: 08-02-2011

Hi Ronny,

 

It sounds like you're ultimately trying to figure out how to simulate a moderately complex set of AXI Lite transactions which would normally come from a processor/software.

 

There are a number of approaches to do this like:

- AXI BFM/VIP IP from Xilinx

- AXI Traffic generator IP from Xilinx

- Some custom AXI Lite traffic generator with behavioral HDL (honestly, for quick and dirty register reads/writes, I usually just do this. AXI Lite is not that complex)

- Simulate with microblaze

 

In this case, though, I'd recommend just using the AXI DMA example design that comes with the core :). It is simulatable and shows driving the SG interface. From what I recall, it uses an AXI BRAM on the SG interface. The BRAM contents are loaded with a .coe file that lays out the BD in memory at the correct address offsets that the AXI DMA SG engine is going to look for. Then there's a little state machine to configure/monitor axi lite registers.
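
As an illustration of the .coe approach, here is a sketch that lays out a two-descriptor ring as 32-bit BRAM words. It assumes the BRAM is mapped at AXI address 0 so that byte offsets equal descriptor addresses, a 16-word descriptor with the next pointer in word 0, the buffer address in word 2 and the control word in word 6 (check the product guide), and made-up buffer addresses 0x1000 and 0x2000. `descriptor_words` and `to_coe` are hypothetical helpers and are not taken from the example design.

```python
def descriptor_words(next_addr, buf_addr, length):
    """One descriptor as sixteen 32-bit words (assumed layout, see above)."""
    return [next_addr, 0, buf_addr, 0, 0, 0, length, 0] + [0] * 8


def to_coe(words):
    """Render a list of 32-bit words as COE text, one word per line."""
    header = "memory_initialization_radix=16;\nmemory_initialization_vector=\n"
    return header + ",\n".join(f"{w:08x}" for w in words) + ";\n"


# Descriptor 0 at offset 0x00 points to 0x40; descriptor 1 points back to 0x00.
words = descriptor_words(0x40, 0x1000, 0x160) + descriptor_words(0x00, 0x2000, 0x160)
coe_text = to_coe(words)
print(coe_text)
```

Because the file is a flat word list starting at BRAM address 0, a descriptor at byte offset 0x40 begins at word index 16.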

Voyager
Registered: 02-17-2009

@bwiec wrote:

 

There are a number of approaches to do this like:

- AXI BFM/VIP IP from Xilinx

- AXI Traffic generator IP from Xilinx

- Some custom AXI Lite traffic generator with behavioral HDL (honestly, for quick and dirty register reads/writes, I usually just do this. AXI Lite is not that complex)

- Simulate with microblaze

 

In this case, though, I'd recommend just using the AXI DMA example design that comes with the core :). It is simulatable and shows driving the SG interface. From what I recall, it uses an AXI BRAM on the SG interface. The BRAM contents are loaded with a .coe file that lays out the BD in memory at the correct address offsets that the AXI DMA SG engine is going to look for. Then there's a little state machine to configure/monitor axi lite registers.

 


If I remember correctly the AXI DMA example design uses AXI Traffic generator to simulate all AXI transactions. I found that it was very difficult to modify this test bench, which I needed to do when I was debugging cyclic operation. AXI Traffic generator is only good for sending a stream of AXI writes or reads but it can't for example wait for a result of a read operation and do something different based on the read data. Instead I used a free AXI4-Lite BFM from bitvis. I think it is a much simpler alternative to the Xilinx's own BFM and it saves you from writing your own custom AXI4-Lite models. 
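
The kind of test flow that needs read-dependent decisions looks roughly like the sketch below: poll a status register until a condition holds, then branch on the value read. It is written against a plain `read(addr)` callable, not the bitvis or Xilinx BFM API. The status offset and bit positions are assumptions from the AXI DMA product guide, the value 0x0001_0008 is an invented "still running" sample, and `poll_until` and `next_action` are hypothetical names.

```python
S2MM_DMASR = 0x34         # assumed status register offset
DMASR_IDLE = 1 << 1       # assumed Idle bit
DMASR_ERROR_MASK = 0x770  # assumed error bits 4, 5, 6, 8, 9 and 10


def poll_until(read, addr, mask, expected, max_polls):
    """Read addr until (value & mask) == expected; return (attempts, value)."""
    for attempt in range(1, max_polls + 1):
        value = read(addr)
        if (value & mask) == expected:
            return attempt, value
    raise TimeoutError(f"register {addr:#x} never matched after {max_polls} polls")


def next_action(status):
    """Decide what the testbench does next, based on the data that was read."""
    return "reset-engine" if status & DMASR_ERROR_MASK else "rewrite-tail"


# Fake bus: two reads while the engine is still running, then the idle value.
samples = iter([0x0001_0008, 0x0001_0008, 0x0001_100A])
attempts, status = poll_until(lambda addr: next(samples), S2MM_DMASR,
                              DMASR_IDLE, DMASR_IDLE, max_polls=10)
action = next_action(status)
print(attempts, f"{status:#010x}", action)
```

A pure traffic generator can replay the reads, but it cannot make the `next_action` decision, which is the limitation described above.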

 

/Mikhail

 
