Explorer
531 Views
Registered: 07-03-2014

[2017.4] DMA suddenly stops triggering IRQs

Hi!

This problem is really weird. Yesterday, my design was working flawlessly: it modulates an input signal and sends the data to a DAC. The design transmitted for over 16 hours in a row without any problem.

I ran a new implementation after removing the ILA I had added to monitor input signals for that 16-hour test, so I didn't change anything in the design itself. Now the design doesn't work at all... I added the ILA again, recompiled, and it still doesn't work, but I found the problem: the DMA core stops generating IRQs at some point.

At first, each MM2S IRQ (DDR3 to T2 Framer, the yellow signal) is triggered correctly; you can see that when "Frame_Data_tlast" is '1' (light purple signal), an MM2S IRQ is asserted (yellow signal). It takes approximately 1.1 ms from tlast being asserted to the IRQ being asserted, but it works:

DMA_IRQ_OK.jpg: DMA triggers IRQ correctly

But suddenly, the DMA core stops driving the MM2S IRQ to '1', although there have been a couple of transfers and tlast is correctly asserted at the end of them:

DMA_Stops_IRQ.jpg: DMA never triggers IRQ again

The DMA is configured in Scatter-Gather mode with a single channel and both MM2S and S2MM interfaces; S2MM works fine and always triggers the IRQ after a packet is received. Coalesce is set to 1, and the timeout is at its default.

Why is this happening? Why doesn't the core work as expected after re-implementing, when it worked before?

Thanks in advance.

DMA_No_IRQ.jpg
12 Replies
Xilinx Employee
479 Views
Registered: 10-04-2016

Re: [2017.4] DMA suddenly stops triggering IRQs

Hi @alexmoya,

This is peculiar. 

I'm wondering if something is going on with one of the memory mapped ports that is preventing the AXI DMA from completing work on either the previous or the current buffer descriptor. I'd probably start by looking at the M_AXI_SG port to make sure that all of the writes that update BD status are going through as expected. 

Before inserting more ILAs in the system, you might try looking at the BD chain first. If you trace your MM2S buffer descriptor chain do you see that all of the Complete bits are set up to the Current BD pointed to by registers 0x8/0xC? Is the AXI DMA signaling any errors in either the MM2S Status Register (0x4) or the completed BDs?
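The check described above can be sketched in C. This is a minimal sketch under stated assumptions, not driver code: the bit positions follow a reading of the AXI DMA product guide (PG021) and should be verified against the IP version in use, and `dmasr_has_error`, `bd_status_has_error`, and `first_incomplete_bd` are hypothetical helper names. It assumes the status words have already been read from the MM2S Status Register (0x4) and from each BD in ring order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: MM2S status register error bits as read from PG021. */
#define DMASR_DMA_INT_ERR (1u << 4)
#define DMASR_DMA_SLV_ERR (1u << 5)
#define DMASR_DMA_DEC_ERR (1u << 6)
#define DMASR_SG_INT_ERR  (1u << 8)
#define DMASR_SG_SLV_ERR  (1u << 9)
#define DMASR_SG_DEC_ERR  (1u << 10)
#define DMASR_ERR_MASK    (DMASR_DMA_INT_ERR | DMASR_DMA_SLV_ERR | \
                           DMASR_DMA_DEC_ERR | DMASR_SG_INT_ERR |  \
                           DMASR_SG_SLV_ERR | DMASR_SG_DEC_ERR)

/* ASSUMPTION: BD status word layout (Complete in bit 31, errors in 30:28). */
#define BD_STS_COMPLETE   (1u << 31)
#define BD_STS_ERR_MASK   (7u << 28)

/* True if any DMA or SG error flag is set in the channel status register. */
bool dmasr_has_error(uint32_t dmasr)
{
    return (dmasr & DMASR_ERR_MASK) != 0u;
}

/* True if a completed BD reports an error in its status word. */
bool bd_status_has_error(uint32_t bd_status)
{
    return (bd_status & BD_STS_ERR_MASK) != 0u;
}

/* Walk BD status words in ring order and return the index of the first BD
 * whose Complete bit is clear, or n if every BD is marked complete. */
size_t first_incomplete_bd(const uint32_t *status, size_t n)
{
    assert(status != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if ((status[i] & BD_STS_COMPLETE) == 0u) {
            return i;
        }
    }
    return n;
}
```

The index returned by `first_incomplete_bd` can then be compared with the BD that the Current Descriptor register (0x8/0xC) points to.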

Are there other masters in the system that share access to the DRAM? Are those masters still behaving correctly?

Regards,

Deanna

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------
Explorer
446 Views
Registered: 07-03-2014

Re: [2017.4] DMA suddenly stops triggering IRQs

@demarco

I checked the DMA core status and it looked like everything was ok; I also checked other masters to DDR3 and they were working well: Microblaze was writing and reading from memory as expected.

I couldn't check whether the AXI transactions were working correctly because, after adding new probes to the ILA, the problem went away. I didn't change anything in the design; I just added AXI bus signals to the ILA and, after implementing, everything started working well.

I can't understand why the implemented design sometimes fails and sometimes works, with no change in the sources and with timing closure achieved. Why is implementation so random?

 

Thanks for your answer.

Xilinx Employee
427 Views
Registered: 05-08-2012

Re: [2017.4] DMA suddenly stops triggering IRQs

Hi @alexmoya

The symptom of the design failing when an ILA is removed, and working when it is added, sounds like there could be a timing/constraints issue with this design. The ILA will definitely affect the placement and routing of the design, so you will get slightly different delays. I think there are a couple of things that could be tried.

Try setting up another ILA on unrelated logic, reproduce the failure in hardware (hopefully), and replace the probes post-implementation. This gives the best chance of reproducing the failure with the ILA in place, so that you can see what is happening. It would be best to choose initial probe logic that you think will end up in the same physical area, so there is as little change as possible. Also, reducing the number of probes and options can help to reproduce it. The flow is described below.

 http://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_3/ug908-vivado-programming-debugging.pdf#page=264

Also try adding constraints that increase the estimated jitter (set_input_jitter) or uncertainty (set_clock_uncertainty). The idea would be to see if doing so allows the design to function correctly. This would indicate that there could be a constraints or jitter issue. I would think 30 to 50 ps would work.

For logical issues, I would confirm that no DRC or methodology violations are being suppressed.


Explorer
357 Views
Registered: 07-03-2014

Re: [2017.4] DMA suddenly stops triggering IRQs

Hi @marcb

If there were any timing constraints issue with the design, Vivado should warn about it, right? The design gets timing closure both with the ILA and without it, so I don't think it is anything related to timing. I also reduced the clock frequency by 25%, and now the problem appears more often; in fact, it happens even if no ILA is present.

It looks like this is absolutely random, which drives me crazy: how on earth am I supposed to sell a customer a design that fails randomly?

In the past week, I implemented the design about 30 times, and in a third of them the DMA stops triggering IRQs... Maybe there is some BD ring corruption that I'm not aware of, but that should always happen (since it is software related), not only after a random implementation...

 

Xilinx Employee
340 Views
Registered: 05-08-2012

Re: [2017.4] DMA suddenly stops triggering IRQs

Hi @alexmoya.

Those are interesting results. An increased failure rate after reducing the clock frequency is the opposite of what would be expected for a timing/constraints issue. There are methodology checks that would catch many constraints issues, but it is still possible to underconstrain or override constraints. The report_exceptions command can report overridden constraints, but this does not sound like it would be the issue.

Do you have all of the journal files (vivado.jou) related to the changes made that cause the design to go from passing to failing? Also, the before and after ILA XDC might help point out what the difference is.

Explorer
281 Views
Registered: 07-03-2014

Re: [2017.4] DMA suddenly stops triggering IRQs

@marcb, @demarco

I have some news about my problem. As I told @demarco, I had checked the DMA status and it was OK, but I didn't pay attention to the descriptors (as the design worked well when no ILA was present). Now I have just looked at the descriptors and ... some of them were missing!!

This is the DMA0 status just after I send descriptors to hardware for the first time:

BD_OK.jpg

As you can see, the sum of the descriptors currently on hardware and the free descriptors is equal to "AllCnt", i.e., 64 descriptors.

After the DMA core stops triggering IRQs, this is the descriptor count:

BD_Lost_DMA0.jpg

Where on earth are those 10 missing descriptors?? As far as I know, descriptors can only be on hardware or, if they are under software control, in the "Free", "Pre" or "Post" states. But there are 10 descriptors which aren't in any of those states.

The amazing thing about this is that the other DMA core in the design (called DMA1) behaves exactly the same after DMA0 stalls:

BD_Lost_DMA1.jpg

This time, there is just 1 descriptor missing, and I don't know if the core would trigger IRQs beyond this point because the thread that sets up its ring is asleep due to DMA1 malfunctioning.

Is there any situation where BDs are removed from the ring?

Thanks!

 

PS: Somebody moved this thread from its original sub-forum to the "Timing Analysis" sub-forum, but this definitely has nothing to do with timing.

Xilinx Employee
261 Views
Registered: 10-04-2016

Re: [2017.4] DMA suddenly stops triggering IRQs

Hi @alexmoya,

This question was moved from the AXI queue to the Timing Analysis queue when it looked like there were build-to-build variations. Different Xilinx employees watch the various forum queues based on their area of expertise.

Are you running your own code or one of the AXI DMA bare metal example designs? 

The thing that is interesting about the interrupts-have-stopped state is that the HwHead and HwTail pointers suggest that there might be work in progress. The HwCnt, however, is 0.

Do these pointers in HwHead and HwTail match the MM2S Current Descriptor and MM2S Tail Descriptor register settings? 

The bare metal driver assigns descriptors that are contiguous in address space. If I subtract the Tail and Head addresses and divide by 0x40 (the size of a BD), I get this:

(0xDAC5_DE80 - 0xDAC5_CEC0)/0x40 = 0x3F -> 63

Given this, I'd expect the HwCnt to be 63. Something has gone awry in the software management of the BDs.

Regards,

Deanna

Explorer
235 Views
Registered: 07-03-2014

Re: [2017.4] DMA suddenly stops triggering IRQs

> Are you running your own code or one of the AXI DMA bare metal example designs?

I wrapped the bare-metal examples in my own routines, but they are basically the same code.

 


> The bare metal driver assigns descriptors that are contiguous in address space. If I subtract the Tail and Head addresses and divide by 0x40 (the size of a BD), I get this:
>
> (0xDAC5_DE80 - 0xDAC5_CEC0)/0x40 = 0x3F -> 63
>
> Given this, I'd expect the HwCnt to be 63. Something has gone awry in the software management of the BDs.

Sorry to say your calculation is off by one: you are subtracting the head address from the tail address, but the tail descriptor IS a BD too, so the calculation is:

(0xDAC5_DE80 + 0x40 - 0xDAC5_CEC0)/0x40 = 0x40 -> 64
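The inclusive count above can be written as a small helper. This is a minimal sketch, assuming BDs are contiguous and 0x40 bytes apart as stated in the thread; the function name `bd_count_inclusive` and the wrap-around handling (for a tail that has wrapped past the end of the ring) are illustrative additions, not driver code.

```c
#include <assert.h>
#include <stdint.h>

/* Number of BDs from head to tail, counting both ends.
 * head, tail : BD addresses inside the same ring
 * bd_size    : spacing between consecutive BDs (0x40 in this thread)
 * ring_bds   : total number of BDs in the ring, used only when the
 *              tail address is below the head address (wrapped)     */
uint32_t bd_count_inclusive(uint32_t head, uint32_t tail,
                            uint32_t bd_size, uint32_t ring_bds)
{
    assert(bd_size != 0u);

    if (tail >= head) {
        assert((tail - head) % bd_size == 0u);
        return (tail - head) / bd_size + 1u;   /* +1: the tail is a BD too */
    }

    /* Tail has wrapped around the end of the ring. */
    assert((head - tail) % bd_size == 0u);
    uint32_t gap = (head - tail) / bd_size;
    assert(gap < ring_bds);
    return ring_bds - gap + 1u;
}
```

With the addresses from the screenshots, `bd_count_inclusive(0xDAC5CEC0, 0xDAC5DE80, 0x40, 64)` evaluates to 64. Note that the pointers alone cannot distinguish an empty ring from one with a single BD in flight, which is why the result has to be compared with HwCnt.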

There shouldn't be anything wrong in the software management because this same code has been successfully used in other projects. Besides, I just moved the XAxiDma driver from DDR into BRAM, and now the design works perfectly. In the screenshots I posted yesterday, the XAxiDma driver was allocated in DDR3; after I moved it to BRAM, the DMA core works and BDs are no longer corrupted or lost.

I checked whether the DDR memory segment where the driver was allocated was being overwritten by some stray pointer, but it wasn't: I filled the address space where the XAxiDma driver was allocated with 0xFF, and that segment remained untouched till the end of time.

> Do these pointers in HwHead and HwTail match the MM2S Current Descriptor and MM2S Tail Descriptor register settings?

I didn't check this time, but the last time I did, they matched.

 

 

Xilinx Employee
105 Views
Registered: 10-04-2016

Re: [2017.4] DMA suddenly stops triggering IRQs

Hi @alexmoya,

That 0-based counting is bound to catch me sometimes. 

Where are you with this debug? The last update you posted said you were able to get the DMA transfers to work when you targeted BRAM rather than DRAM. Does that mean you moved all of the buffer descriptor chains and data buffers to BRAM? Or did you just move the buffer descriptor chains?

Also, is this a MicroBlaze or a Zynq system?

Do you get any more information about what is going on if you enable debug messages in the driver?

I just find it really strange that when you hit the failing condition, the HwHead and HwTail pointers aren't the same but HwCnt is 0. The mismatched pointers would suggest that the AXI DMA has BDs to work on. 

Regards,

Deanna

Explorer
85 Views
Registered: 07-03-2014

Re: [2017.4] DMA suddenly stops triggering IRQs


> That 0-based counting is bound to catch me sometimes.

No problem XD

 


> Where are you with this debug? The last update you posted said you were able to get the DMA transfers to work when you targeted BRAM rather than DRAM. Does that mean you moved all of the buffer descriptor chains and data buffers to BRAM? Or did you just move the buffer descriptor chains?
>
> Also, is this a MicroBlaze or a Zynq system?


I moved the XAxiDma object (the struct containing the driver config) from DDR3 to BRAM: I created the driver as a static variable in C, instead of allocating it on the heap using malloc. The heap is in DDR3 and static variables (BSS section) are in BRAM. All descriptors remain in DDR3, as does the transferred data (otherwise, the data would be unreachable from the DMA's SG_AXI interface and DMA transfers wouldn't even start).
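The two placements can be contrasted in a minimal sketch. `dma_driver_t` is a hypothetical stand-in for the XAxiDma instance, and the function names are made up for illustration; whether the heap lands in DDR3 and the BSS section in BRAM is decided by the linker script, which is assumed here to match the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver instance (XAxiDma in the real design). */
typedef struct {
    uint32_t regs_base;
    int      initialized;
} dma_driver_t;

/* Variant A: instance on the heap (DDR3 in the design described above).
 * calloc zero-fills the object; the caller owns it and must free it. */
dma_driver_t *dma_alloc_on_heap(void)
{
    return calloc(1, sizeof(dma_driver_t));
}

/* Variant B: instance with static storage duration. It is zero-initialized
 * and placed in .bss (BRAM in the design described above); every call
 * returns the same object. */
dma_driver_t *dma_get_static(void)
{
    static dma_driver_t instance;
    return &instance;
}

/* Mark an instance as ready once the (omitted) driver setup has run. */
void dma_mark_ready(dma_driver_t *d, uint32_t regs_base)
{
    assert(d != NULL);
    d->regs_base = regs_base;
    d->initialized = 1;
}
```

Only the location of the driver bookkeeping changes between the two variants; the descriptors and data buffers stay where they were.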

It's a microblaze based system.

 


> Do you get any more information about what is going on if you enable debug messages in the driver?
>
> I just find it really strange that when you hit the failing condition, the HwHead and HwTail pointers aren't the same but HwCnt is 0. The mismatched pointers would suggest that the AXI DMA has BDs to work on.

I will come back to this problem in the next few weeks, and I'll try then. So far, moving the DMA driver to BRAM is working, so I'm putting my effort into new features for my design instead of spending time tracking down the source of the problem.

 

Regards!

 

 

Explorer
40 Views
Registered: 07-03-2014

Re: [2017.4] DMA suddenly stops triggering IRQs

@demarco

I checked the whole design again yesterday and ran some tests with different numbers of descriptors. I found out that the driver also loses descriptors when it is allocated in BRAM.

BD_Lost_DMA1_BRAM.jpg

As you can see, HWTail is pointing 44 descriptors beyond HWHead, which doesn't match the number of hardware descriptors (41). The missing 3 descriptors should be in the free state, as FreeHead suggests: it is 3 descriptors before LastBdAddr (which matches HWTail).

So it looks like 44 descriptors are on hardware, but the driver reports that 41 are actually on hardware and 3 should be free.

IRQ_Tx_DMA1_BRAM.jpg

Looking at the ILA probes, those 3 IRQs are the last ones before the DMA core stops triggering any new IRQ and before the design stalls, so those BDs are returned from hardware in the Interrupt Service Routine like any other, but they are not moved to the free state.

Maybe I am doing something wrong when managing descriptors? I'm using FreeRTOS, and maybe there is some issue when pre-empting tasks in the OS, although there shouldn't be, as any ISR is atomic on MicroBlaze (no other task can preempt the ISR) and I use critical sections when sending descriptors to hardware.

Regards.

 

PS: Could you please move this thread to a more suitable subforum so that more people can help? Thanks!

Xilinx Employee
26 Views
Registered: 10-04-2016

Re: [2017.4] DMA suddenly stops triggering IRQs

Hi @alexmoya,

Xilinx doesn't support debug of FreeRTOS issues, so there isn't a Xilinx Forum that's a good fit to move this thread to.

Customers have had success getting support from the FreeRTOS forums. Please consider opening a thread on their site at https://sourceforge.net/p/freertos/discussion/create_topic/382005/.

If you run in bare metal, do you see the problem?

Regards,

Deanna
