Registered: ‎12-21-2018

Zynq PS Completely Freezes, suspecting cache problem


We have been working with Zynq for years and have always seemed to struggle with firmware stability issues, which I have always suspected are related to problems with the cache management. I now have a case where I think it is pretty clear that something is wrong with the cache or compiler, where the Zynq ARM completely locks up, both cores, and JTAG also will not respond (but the PL seems to be running still). I am able to get the problem to go away by either adding:

 

Xil_ICacheDisable();

 

at the start of my program, or by adding a few irrelevant lines of code in a recently changed area of my program, which should have no impact on its stability. For example, if I have two function calls in my event loop like this:

 

if (myClass.myFunction()) {
    myClass.doThis();
}

 

And I add a few meaningless lines of code to make it something like:

 

if (myClass.myFunction()) {
    int x = 1;
    x++;
    int y = x;
    y++;
    x = y;
    myClass.doThis();
}

 

Then the crash no longer occurs. myClass.myFunction() is doing almost nothing, it is just checking if 2 counters are equal and returns true if they are not, and in fact in my test case where the problem occurs myClass.myFunction() will always return false, so the inside of the if statement and myClass.doThis() never actually gets executed, so it is very strange that adding this irrelevant code has any effect on the program.

I have been struggling with stability issues like this in the Zynq ARM cores for years, where I change one or two simple lines of code in my firmware and all of a sudden the firmware becomes unstable. Often it takes many runs of my test before the error occurs, which is usually a complete lockup of the PS where even JTAG won’t respond. In the current case, I am finally able to make the PS lockup occur fairly quickly with a reproducible test, and am able to make it go away by adding a few irrelevant lines of code, so it seems to be a good opportunity to ask for help on things that could be tried to fix this properly. (Xil_ICacheDisable() is not a fix since it lowers performance, and adding irrelevant lines of code to make it go away doesn’t work since the lock up issues just keep resurfacing over and over.)

Our system has Petalinux (2015.2.1) running on CPU0, and bare metal with FreeRTOS running on CPU1 (with Xilinx SDK 2017.4). Petalinux isn’t actually doing anything when the problem occurs, though, and I can even pause CPU0 with JTAG and still get the lockup.

My application has several interrupts happening, but each one returns very quickly. I rewrote all of my interrupt handling a while ago to deal with stability issues, which I think were also cache related, so now the only thing each ISR does is give a semaphore (using FreeRTOS) and return immediately, so that a FreeRTOS task outside of the ISR wakes up and handles the event. This helped a lot, but stability issues still creep in, and I really believe they must be cache related since I am very certain my code is logically correct. I also see these stability issues on multiple boards: we have designed several PCBs which use the Zynq 7010/7020 and see similar issues on all of them, and we see them on off-the-shelf MicroZed boards too, so it is not a hardware design issue.

Any help to debug this is greatly appreciated. Thanks.

1 Solution

Accepted Solutions
Xilinx Employee
534 Views
Registered: ‎10-06-2016

Hi jgribben@ajile.ca 

Your investigation is on the right track. The error message you are seeing indicates that the processor hung on a memory access, due to either accessing an invalid address space or a peripheral/memory that is held in reset. When this happens the debugger cannot really control the processor and the debug capability is reduced significantly; however, you can still perform some debugging through the XSCT console.

image.png

For example, as shown in the image above, you can still read the program counter (PC) and check which instruction was executing when the processor hung. The PC will probably not give you any clue, though, as the error seems to be related to speculative accesses performed by the processor, which would make sense given that they do not happen when the instruction cache is disabled.

As a further debugging step, you could try disabling branch prediction in the processor and see whether that makes the issue disappear.

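        // needs xpseudo_asm.h from the standalone BSP (mfcp/mtcp, dsb/isb, CP15 names)

        // read current SCTLR (CP15 c1 System Control Register)
        unsigned int sctlr = mfcp(XREG_CP15_SYS_CONTROL);

        // clear branch prediction enable bit (Z)
        sctlr &= ~XREG_CP15_CONTROL_Z_BIT;

        // write the modified value back
        mtcp(XREG_CP15_SYS_CONTROL, sctlr);

        // make sure the change takes effect before the next instructions run
        dsb();
        isb();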

 

Even if the issue can be avoided by disabling branch prediction, the right way to handle this is to ensure that your MMU configuration is accurate and that you are not allowing the processor to access memory areas it should not be accessing. You will find the default MMU configuration in the translation_table.S file within the standalone BSP code. Can you spot anything wrong there? In theory for CPU0 you should restrict the DDR memory to just the area assigned to CPU0, but look into anything else you think might be problematic.

You can configure the MMU either statically in translation_table.S or by using the MMU API to modify properties from the application. I would suggest using the API and making selective changes.
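When reviewing the entries in translation_table.S, it can help to decode an attribute word back into its fields. The following is a minimal host-side sketch, assuming the ARMv7-A short-descriptor section format (1 MB sections) with TEX remap disabled; the helper names are hypothetical and are not part of the standalone BSP.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field extractors for an ARMv7-A short-descriptor section entry.
 * Assumption: TEX remap is off (SCTLR.TRE = 0). */
static uint32_t sec_type(uint32_t d) { return d & 0x3u; }         /* bits [1:0] */
static uint32_t sec_b(uint32_t d)    { return (d >> 2) & 0x1u; }  /* bit 2 */
static uint32_t sec_c(uint32_t d)    { return (d >> 3) & 0x1u; }  /* bit 3 */
static uint32_t sec_tex(uint32_t d)  { return (d >> 12) & 0x7u; } /* bits [14:12] */

/* Type 0b10 is a section; 0b00 is a fault entry (nothing mapped). */
static bool section_is_mapped(uint32_t d)
{
    return sec_type(d) == 0x2u;
}

/* True if the entry describes cacheable Normal memory. Reserved and
 * implementation-defined encodings are not distinguished here. */
static bool section_is_cacheable(uint32_t d)
{
    if (!section_is_mapped(d)) {
        return false;
    }
    if (sec_tex(d) & 0x4u) {
        /* TEX[2] = 1: TEX[1:0] is the outer policy and C:B the inner
         * policy; 0b00 means non-cacheable at that level. */
        return (sec_tex(d) & 0x3u) != 0u || sec_c(d) != 0u || sec_b(d) != 0u;
    }
    /* TEX[2] = 0: every architecturally defined cacheable encoding has C = 1. */
    return sec_c(d) != 0u;
}
```

For instance, an entry with S=1, TEX=b101, AP=b11, Domain=b1111, C=0, B=1 works out to 0x15de6 and decodes as cacheable, while an attribute word of 0 decodes as unmapped.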

Regards


Ibai
Don’t forget to reply, kudo, and accept as solution.


7 Replies
Xilinx Employee
643 Views
Registered: ‎10-06-2016

Hi jgribben@ajile.ca 

Could you provide a bit more detail about the issue? What exactly do you mean when you say that the ARM completely locks up and JTAG does not respond? Could you please connect the debugger and run the targets command?

Regards


Ibai
600 Views
Registered: ‎12-21-2018

Hi @ibaie , thanks for getting back to me.

Whether I run my firmware through JTAG or from flash memory it seems to lock up, as in it no longer responds and it does not seem to be executing code anymore. At that point I cannot attach JTAG or do anything without power cycling the board.

When I am running the firmware through JTAG, I cannot get any further debug information when the lock up occurs because when I try to suspend the program in Xilinx SDK, the following message dialog comes up:

Screenshot from 2020-11-30 16-11-53.png

I get the same error message when I try to suspend either core. Since JTAG is not responding I am not sure what else I can do to further debug the issue. If you have any additional debugging tips for when I encounter such situations let me know.

In my experience, when I have seen ARM lockup problems like this in the past where JTAG will not respond, it has typically been from reading or writing a memory address which is out of the addressable range (not in DDR or any of the mapped AXI peripherals), or reading/writing an AXI Lite peripheral that is not running (e.g. when a PL module is held in reset and I try to read a register from it).
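One way to catch explicit out-of-range accesses of this kind during debugging is to check each address against a table of known-good windows before dereferencing it. This is a minimal sketch: the helper name and the table contents are hypothetical (the DDR window is CPU1's share in this system, and the PL peripheral entry is a made-up placeholder). It only guards accesses made by software on purpose; it cannot detect a PL module held in reset, and it does nothing about speculative accesses issued by the core itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t base; /* first valid byte address */
    uint32_t size; /* window length in bytes */
} addr_window_t;

/* Returns true if [addr, addr + len) lies entirely inside one window. */
static bool addr_in_windows(uint32_t addr, uint32_t len,
                            const addr_window_t *w, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t lo = w[i].base;
        uint64_t hi = lo + w[i].size; /* 64-bit to avoid wrap-around */
        if (addr >= lo && (uint64_t)addr + len <= hi) {
            return true;
        }
    }
    return false;
}

/* Example table (assumption, adjust to the real design). */
static const addr_window_t k_example_windows[] = {
    { 0x10000000u, 0x30000000u }, /* CPU1 DDR: 0x10000000..0x3FFFFFFF */
    { 0x43C00000u, 0x00010000u }, /* hypothetical AXI-Lite peripheral */
};
```

A debug build could wrap register reads and writes in a check such as `addr_in_windows(addr, 4, k_example_windows, 2)` and log or halt on failure instead of touching the bus.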

From what I can tell, the following forum post seems to be related to what I am seeing: https://www.reddit.com/r/embedded/comments/bpp8kt/how_to_debug_random_crashes/eo2or1c?utm_source=share&utm_medium=web2x&context=3 . The solution in that post states:

"Full fix: ensure mpu is configured correctly for all peripherals installed and used to ensure it can't try to cache anything it's not allowed to."

If this is indeed the fix, I am not really sure what they mean and how to do this. I assume it has something to do with Xil_SetTlbAttributes (in xil_mmu.h/c), but if so I could use some guidance on what I am supposed to do to ensure that the application doesn't try to cache and use invalid addresses.

Thanks,

Adventurer
597 Views
Registered: ‎09-05-2020

I can think of 2 things.

The L2 cache should only be enabled on cpu0.

Make sure the 2 ps7_ddr_0 memory regions don't overlap.

580 Views
Registered: ‎12-21-2018

Hi @Rmccarty , thanks for the thoughts.

The 2 memory regions don't overlap, CPU0 (Petalinux) is using address 0 to 0x10000000, and CPU1 is using address 0x10000000 to 0x40000000.
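As a quick sanity check of the split, here is a minimal sketch that treats each region as a half-open range [base, end) and tests for overlap. The type and function names are hypothetical; the two ranges are the ones quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t base; /* inclusive */
    uint32_t end;  /* exclusive */
} mem_region_t;

static mem_region_t make_region(uint32_t base, uint32_t end)
{
    mem_region_t r = { base, end };
    return r;
}

/* Two half-open ranges overlap when each starts before the other ends. */
static bool regions_overlap(mem_region_t a, mem_region_t b)
{
    assert(a.base <= a.end && b.base <= b.end);
    return a.base < b.end && b.base < a.end;
}

/* Ranges from this system. */
static const mem_region_t k_cpu0_ddr = { 0x00000000u, 0x10000000u };
static const mem_region_t k_cpu1_ddr = { 0x10000000u, 0x40000000u };
```

With half-open ranges, a region ending at 0x10000000 and one starting at 0x10000000 share no bytes, so `regions_overlap(k_cpu0_ddr, k_cpu1_ddr)` is false.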

I tried disabling L2 cache on CPU1 by passing -DUSE_AMP=1 to the BSP extra_compiler_flags for ps7_cortexa9_1 (CPU1). This seems to disable L2 cache in xil_cache.c from what I can tell. I also tried calling Xil_L2CacheDisable() at the start of my program on CPU1. Is this what you are suggesting, or if not is there another way to only enable L2 cache on CPU0?

None of the above seems to have any effect: the PS still locks up unless I pad in some irrelevant lines of code in the new part of my program, which should have no effect on things.

511 Views
Registered: ‎12-21-2018

Hi @ibaie,

Thank you very much for your response. I tried disabling branch prediction with the code that you provided and it was successful: my application now runs perfectly without hanging! Very useful bit of code. I was actually reading about branch prediction based on posts here and here (the second one you actually commented on), and reading the Zynq TRM, but could not figure out how to set the "Z bit in the CP15 c1 Control register to 1", and this does it. It might be useful for Xilinx to add such a function to the standalone BSP in the future, something like Xil_BranchPredictionDisable(), similar to Xil_ICacheDisable() etc., since it may be helpful for others and it is not straightforward to do. Just a thought.
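A minimal sketch of the bit manipulation that such a helper could wrap is shown below. It assumes the Z bit is bit 11 of the SCTLR, as in the ARMv7-A architecture, and that this is the value behind XREG_CP15_CONTROL_Z_BIT; the function names are hypothetical. It is pure arithmetic so it can be checked on a host. On the target, the input would come from mfcp(XREG_CP15_SYS_CONTROL) and the result would be written back with mtcp(), followed by dsb() and isb(), as in the snippet provided.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: SCTLR.Z (branch prediction enable) is bit 11. */
#define SCTLR_Z_BIT (1u << 11)

/* Returns the SCTLR value with branch prediction disabled. */
static uint32_t sctlr_disable_branch_prediction(uint32_t sctlr)
{
    return sctlr & ~SCTLR_Z_BIT;
}

/* Returns the SCTLR value with branch prediction enabled. */
static uint32_t sctlr_enable_branch_prediction(uint32_t sctlr)
{
    return sctlr | SCTLR_Z_BIT;
}

/* Reports whether the Z bit is set in the given SCTLR value. */
static bool sctlr_branch_prediction_enabled(uint32_t sctlr)
{
    return (sctlr & SCTLR_Z_BIT) != 0u;
}
```

Keeping the read-modify-write value separate from the register access makes it easy to confirm that only the Z bit changes and that the other control bits (MMU, caches) are left alone.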

I think you are right that it is something to do with memory access for both CPU0 and CPU1. Yesterday I tried running only my bare-metal program on CPU1, without Petalinux running at all on CPU0, and the application ran fine as well. I was surprised, since I would have the same problem even when I suspended CPU0, so maybe it is something to do with how Petalinux is configured. In my system Petalinux on CPU0 is actually able to access memory for the bare-metal system on CPU1 by using the generic-UIO driver and mmap() - I am doing this for DMA reasons, but maybe this is not a good way to do it and is causing problems.

Looking at translation_table.S it does seem that CPU0 memory may be set as DDR cacheable (/* S=b1 TEX=b101 AP=b11, Domain=b1111, C=b0, B=b1 */), even though my lscript.ld does not use that memory space. I assume that the MMU API to modify properties in the application that you refer to is the Xil_SetTlbAttributes() function. I tried doing this on CPU1:

 

for (u32 addr=XPAR_PS7_DDR_0_S_AXI_BASEADDR; addr<0x10000000; addr+=0x00100000) {
  Xil_SetTlbAttributes(addr, 0); // S=b0 TEX=b000 AP=b00, Domain=b0, C=b0, B=b0
}

 

To not allow memory access from 0x00100000 to 0x10000000, which are used by Petalinux, but the lock up problem still persists, so if the above code is correct then I don't think it was that.
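To make the attribute value explicit, here is a minimal sketch of how the fields in that comment pack into a section attribute word, assuming the ARMv7-A short-descriptor section layout. The helper names are hypothetical, and the second function only models the assumption that the BSP combines the 1 MB-aligned address with the attribute word to form the table entry.

```c
#include <assert.h>
#include <stdint.h>

/* Packs section attributes: S = bit 16, TEX = bits [14:12],
 * AP[1:0] = bits [11:10], Domain = bits [8:5], C = bit 3, B = bit 2,
 * and type 0b10 (section) in bits [1:0]. AP[2], XN, nG, NS stay 0. */
static uint32_t section_attr(uint32_t s, uint32_t tex, uint32_t ap,
                             uint32_t domain, uint32_t c, uint32_t b)
{
    assert(s <= 1u && tex <= 7u && ap <= 3u);
    assert(domain <= 15u && c <= 1u && b <= 1u);
    return (s << 16) | (tex << 12) | (ap << 10) | (domain << 5) |
           (c << 3) | (b << 2) | 0x2u;
}

/* Models the resulting table entry for the section containing addr
 * (assumption: section base in bits [31:20], attributes below). */
static uint32_t section_entry(uint32_t addr, uint32_t attr)
{
    return (addr & 0xFFF00000u) | (attr & 0x000FFFFFu);
}
```

Under these assumptions, the default DDR fields (S=1, TEX=b101, AP=b11, Domain=b1111, C=0, B=1) pack to 0x15de6, while an attribute word of 0 leaves the type bits at b00, which is a fault entry, so the section is unmapped rather than merely uncached.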

Now that I have a better idea of where to look I will debug further by carefully checking all memory accesses by both cores.

Thanks also for your tip on additional debugging possible through XSCT console, very helpful for these cases where the JTAG debugger stops responding.

 

 

 

Xilinx Employee
504 Views
Registered: ‎10-06-2016

Hi jgribben@ajile.ca 

Great to hear that my previous post was useful for you.

To be honest, I'm not really surprised that configuring that DDR range as unspecified in the CPU1 MMU configuration does not solve your issue. CPU1 should not access the memory area reserved for CPU0, but the DDR is working, so I don't see any reason why the transaction would not complete.

As you mentioned, I would suggest disabling caching on the different areas to see whether it makes any difference.

Regards

 


Ibai