Registered: ‎12-21-2018

Zynq PS Completely Freezes, suspecting cache problem

We have been working with Zynq for years and have always struggled with firmware stability issues, which I have always suspected are related to cache management. I now have a case where it seems fairly clear that something is wrong with the cache or the compiler: the Zynq ARM locks up completely (both cores) and JTAG stops responding, although the PL still appears to be running. I can make the problem go away by either adding:

 

Xil_ICacheDisable();

 

at the start of my program, or by adding a few irrelevant lines of code to the area of my program that changed recently, lines that should have no impact on the stability of the program. For example, if I have two function calls in my event loop like this:

 

if (myClass.myFunction()) {
    myClass.doThis();
}

 

And I add a few meaningless lines of code to make it something like:

 

if (myClass.myFunction()) {
    int x = 1;
    x++;
    int y = x;
    y++;
    x = y;
    myClass.doThis();
}

 

Then the crash no longer occurs. myClass.myFunction() does almost nothing: it checks whether two counters are equal and returns true if they are not. In my test case where the problem occurs, myClass.myFunction() always returns false, so the body of the if statement, including myClass.doThis(), is never executed. It is therefore very strange that adding this irrelevant code has any effect on the program.
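
For reference, the guard described above can be modelled as a small self-contained sketch. The real implementation is not shown in this post, so the class name MyClass, the counter names produced and consumed, the doThisCalls tally, and the eventLoopPass() helper are all hypothetical stand-ins.

```cpp
#include <cstdint>

// Hypothetical stand-in for the class described in the post; the real code is not shown.
class MyClass {
public:
    // Returns true only when the two counters differ, i.e. there is pending work.
    bool myFunction() const { return produced != consumed; }

    // Placeholder body: records that it ran and marks the pending work as consumed.
    void doThis() {
        ++doThisCalls;
        consumed = produced;
    }

    std::uint32_t produced = 0;     // assumed: incremented by some event source
    std::uint32_t consumed = 0;     // assumed: advanced by the handler
    std::uint32_t doThisCalls = 0;  // instrumentation for this sketch only
};

// One pass of the event loop, matching the first snippet in the post.
inline void eventLoopPass(MyClass& myClass) {
    if (myClass.myFunction()) {
        myClass.doThis();
    }
}
```

With both counters equal, as in the failing test case, eventLoopPass() never reaches doThis(), so at the source level the extra lines inside the if body are dead code. They can still change the size and placement of the generated instructions.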

I have been struggling with stability issues like this in the Zynq ARM cores for years: I change one or two simple lines of code in my firmware and all of a sudden the firmware becomes unstable. Often it takes many runs of my test before the error occurs, and the error is usually a complete lockup of the PS where even JTAG won’t respond. In the current case, I can finally make the PS lockup occur fairly quickly with a reproducible test, and I can make it go away by adding a few irrelevant lines of code, so it seems like a good opportunity to ask for help on what could be tried to fix this properly. (Xil_ICacheDisable() is not a fix since it lowers performance, and adding irrelevant lines of code doesn’t work either, since the lockup issues just keep resurfacing.)

Our system has PetaLinux (2015.2.1) running on CPU0, and bare-metal code with FreeRTOS running on CPU1 (with Xilinx SDK 2017.4). PetaLinux isn’t actually doing anything when the problem occurs, though, and I can even pause CPU0 with JTAG and still get the lockup.

My application has several interrupts happening, but each one returns very quickly. I rewrote all of my interrupt handling a while ago to deal with stability issues, which I think were also cache related. Now the only thing each ISR does is give a semaphore (using FreeRTOS) and return immediately, so that a FreeRTOS task outside of the ISR wakes up and handles the event. This helped a lot, but stability issues still creep in, and I really believe they must be cache related since I am very certain my code is logically correct. I also see these stability issues on multiple boards: we have designed several PCBs which use the Zynq 7010/7020 and we see similar issues on all of them, and we see them on off-the-shelf MicroZed boards too, so it is not a hardware design issue.
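
The split described above, a minimal ISR plus a deferred handler, can be sketched as a host-side model. This is not FreeRTOS code and not the code from this project: it is a single-threaded illustration using only the C++ standard library, and the names DeferredEvent, signalFromIsr(), tryTake(), and drain() are hypothetical. It also assumes counting-semaphore behaviour, which the post does not specify. On the target, the two sides would typically map to xSemaphoreGiveFromISR() in the ISR and xSemaphoreTake() in the task.

```cpp
#include <atomic>
#include <cstdint>

// Host-side, single-threaded model of deferred interrupt handling (NOT FreeRTOS code).
struct DeferredEvent {
    std::atomic<std::uint32_t> pending{0};  // stands in for a counting semaphore (assumption)
    std::uint32_t handled = 0;              // touched only by the handler side

    // "ISR" side: record that the event happened and return immediately.
    void signalFromIsr() { pending.fetch_add(1, std::memory_order_release); }

    // "Task" side: try to take one pending event; returns false if none is waiting.
    bool tryTake() {
        std::uint32_t n = pending.load(std::memory_order_acquire);
        while (n != 0) {
            // On failure, n is reloaded with the current value and the loop retries.
            if (pending.compare_exchange_weak(n, n - 1, std::memory_order_acq_rel)) {
                return true;
            }
        }
        return false;
    }

    // "Task" side: handle everything currently pending and report how many were handled.
    std::uint32_t drain() {
        std::uint32_t count = 0;
        while (tryTake()) {
            ++handled;  // placeholder for the real event handling
            ++count;
        }
        return count;
    }
};
```

The point of the pattern is that signalFromIsr() does a single atomic update and nothing else, while all real work happens later in drain(). A real FreeRTOS task would block in xSemaphoreTake() rather than poll.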

Any help to debug this is greatly appreciated. Thanks.

1 Reply
Xilinx Employee
Registered: ‎10-06-2016

Hi jgribben@ajile.ca 

Could you provide a bit more detail about the issue? I mean, what exactly do you mean when you say that the ARM completely locks up and JTAG does not respond? Could you please connect the debugger and use the targets command?

Regards


Ibai