Zynq: Exception and Interrupt Handling in Safety-Critical Systems
In safety-critical systems [1], nothing is allowed to go unknown or unacknowledged: every possible condition (behavior) must be accounted for, and the hardware and software must deal with it. The Zynq-7000 All Programmable SoC (AP SoC), with its dual ARM Cortex-A9 processors and dual NEON floating-point units, has a table of addresses to which execution is directed when something happens. The event interrupts the process currently running on one or the other CPU, and that CPU then goes to its interrupt-handling code segment [2]. Once the condition has been serviced, the CPU returns to where it was when it was interrupted and continues from there.
For exceptions, one needs to deal with all seven possible exceptions in a manner that allows the system to continue or recover, or take specific actions to behave properly.
All possible interrupts require a proper interrupt handler.
There are seven locations to which a processor may be directed, one for each of the following events.
Reset: This exception tells you that execution has started back at the beginning, where the processor starts from a reset. If you were running, this is a very bad sign: somehow the software has become lost or confused, and you are back at address 0, starting as if for the first time. An exception handler here is required to start over again as gracefully as possible. You may have to re-initialize everything, or check that whatever should be initialized is initialized and valid. If you were running and have now landed here, restarting as quickly and safely as possible is the goal. How did you get here? If it is because of a software bug, hopefully you caught it in testing and fixed it (you need this exception handler so you can find the bug!). If it is the rare result of a soft error causing the processor to start over (perhaps a one in 100,000 year possibility), you still need to deal with the possibility.
Undefined Instruction: The processor knows which instructions are valid and which are not, so if it fetches garbage and tries to execute it, it will vector to this exception address. As with a reset, you need the handler to do whatever is required to deal with the event safely, and then get back to work. In testing, software bugs may cause any or all of the exceptions, so having the handlers in place is how you find those bugs and confirm that they are gone.
Undefined Software Interrupt: The processor executed a software interrupt instruction (SWI, called SVC in current ARM documentation). The case of concern is a software interrupt for which there is no associated handler. As all such interrupts should be defined, this is a coding bug and needs to be fixed. If there are no coding bugs, then perhaps it is due to a soft error. As above, you need to be able to recover gracefully, or restart, or fix the bug.
Execution from an Undefined Address (prefetch abort): The processor was directed to fetch an instruction from a location that is invalid. As above, recover, restart, or fix the bug.
Operating on Data from an Undefined Address (data abort): The same as the instruction case above, but for a data access rather than an instruction fetch.
IRQ Interrupt: The processor received an interrupt request (IRQ). The case of concern is an IRQ for which there is no associated interrupt handler. This could be a hardware problem, a software problem, or a soft error. If it is not a transient problem, fix the hardware or software. If it is a soft-error transient, recover and restart.
Fast Interrupt Exception: The same as the IRQ case above, but for a fast interrupt (FIQ).
Each exception should be logged, with all of its details, in a system maintenance log. The system should then verify its sanity and state, and continue or restart in a safe, reliable, and repeatable fashion.
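The log-then-decide policy described above can be sketched in portable C. This is a minimal illustration, not the UG647 API: the type names, the record fields, the log depth, and the functions exc_log_init, exc_log_record, and exc_handle are all hypothetical, and the "restart on reset, unknown type, or failed sanity check" rule is only one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The seven exception types listed above (names are illustrative). */
typedef enum {
    EXC_RESET = 0,
    EXC_UNDEFINED_INSTRUCTION,
    EXC_SOFTWARE_INTERRUPT,
    EXC_PREFETCH_ABORT,   /* execution from an undefined address */
    EXC_DATA_ABORT,       /* data access to an undefined address */
    EXC_IRQ,
    EXC_FIQ,
    EXC_COUNT
} exc_type_t;

/* One maintenance-log entry; the fields are assumptions for this sketch. */
typedef struct {
    uint32_t   sequence;    /* record number, counted from 0 */
    exc_type_t type;
    uint32_t   fault_addr;  /* address of the offending fetch or access */
    uint32_t   status;      /* saved status word, e.g. a copy of the CPSR */
} exc_record_t;

#define EXC_LOG_DEPTH 8u

typedef struct {
    exc_record_t records[EXC_LOG_DEPTH];
    uint32_t     total;     /* exceptions recorded since init */
} exc_log_t;

typedef enum { ACTION_CONTINUE, ACTION_RESTART } exc_action_t;

void exc_log_init(exc_log_t *log)
{
    assert(log != NULL);
    memset(log, 0, sizeof *log);
}

/* Store a record; once the buffer is full the oldest entry is overwritten. */
void exc_log_record(exc_log_t *log, exc_type_t type,
                    uint32_t fault_addr, uint32_t status)
{
    assert(log != NULL);
    exc_record_t *r = &log->records[log->total % EXC_LOG_DEPTH];
    r->sequence   = log->total;
    r->type       = type;
    r->fault_addr = fault_addr;
    r->status     = status;
    log->total++;
}

/* Log first, then decide: restart after a reset, an unknown exception type,
 * or a failed sanity check; otherwise continue from where we were. */
exc_action_t exc_handle(exc_log_t *log, exc_type_t type, uint32_t fault_addr,
                        uint32_t status, int state_is_sane)
{
    exc_log_record(log, type, fault_addr, status);
    if ((unsigned)type >= (unsigned)EXC_COUNT) {
        return ACTION_RESTART;
    }
    if (type == EXC_RESET || !state_is_sane) {
        return ACTION_RESTART;
    }
    return ACTION_CONTINUE;
}
```

On real hardware each of the seven vectors would call something like exc_handle with the saved register state, and the log would live in memory that survives a restart. Both of those details are outside this sketch.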
The exception handler function calls are listed in UG647, starting on page 54 [3].
The processor sub-system (PSS) in Zynq has 96 interrupt vectors.
0-15: These are 16 software generated interrupts (SGI)
16-26: reserved (not used; execution should never arrive here, and an exception will be generated if it does)
27: Global Timer
28: Fast Interrupt Signal from the programmable logic
29: CPU Private Timer
30: Private Watchdog Timer
31: Interrupt from the programmable logic
32: CPU 0 LT, TLB, BTAC
33: CPU 1 LT, TLB, BTAC
34: L2 cache
37: PMU 0
38: PMU 1
39: XADC (system monitor analog to digital converter)
42-44: TTC 0 timer
45: DMAC abort
46-49: DMAC[3:0] (46=0, 49=3)
50: SMC (memory)
51: Quad SPI
53: USB 0
54: Ethernet 0
55: Ethernet 0 wakeup
56: SDIO 0
57: I2C 0
58: SPI 0
59: UART 0
60: CAN 0
61-63: Programmable logic FPGA [2:0]
64-68: Programmable logic FPGA [7:3]
69-71: TTC 1 timer
72-75: DMAC [7:4]
76: USB 1
77: Ethernet 1
78: Ethernet 1 wakeup
79: SDIO 1
80: I2C 1
81: SPI 1
82: UART 1
83: CAN 1
84-91: Programmable logic FPGA [15:8]
92: SCU parity error
All of the above are documented in UG585 [4].
Every interrupt that you intend to use must have a valid service routine. Every interrupt that you do not intend to use should vector to a routine that behaves like an exception handler: log the event, log whatever data can be gathered, and move on to recover or restart. Do not rely on the exception handler alone. The rule is, “if it can happen, it will happen,” so deal with all possible situations in an intelligent fashion. I know this philosophy is foreign to many programmers (why should I care?), but in safety-critical systems you must be able to identify what is going on and safely recover from any situation.
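One way to guarantee that no interrupt ID is left without a routine is a dispatch table in which every entry that has not been claimed falls through to a logging default. The sketch below is portable C and does not use the Xilinx driver API. The names irq_table_t, irq_register, irq_dispatch, and uart0_handler are hypothetical, and the choice to refuse registration on the reserved IDs 16-26 is an assumption based on the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IRQ_ID_COUNT        96u
#define IRQ_RESERVED_FIRST  16u
#define IRQ_RESERVED_LAST   26u

typedef void (*irq_handler_t)(uint32_t id);

typedef struct {
    irq_handler_t handlers[IRQ_ID_COUNT]; /* NULL means "not in use" */
    uint32_t      unexpected_count;       /* interrupts with no handler */
    uint32_t      last_unexpected_id;
} irq_table_t;

void irq_table_init(irq_table_t *t)
{
    assert(t != NULL);
    for (uint32_t i = 0u; i < IRQ_ID_COUNT; i++) {
        t->handlers[i] = NULL;
    }
    t->unexpected_count   = 0u;
    t->last_unexpected_id = 0u;
}

/* Returns 0 on success, -1 for a NULL handler, an out-of-range ID,
 * or an ID in the reserved range. */
int irq_register(irq_table_t *t, uint32_t id, irq_handler_t handler)
{
    assert(t != NULL);
    if (handler == NULL || id >= IRQ_ID_COUNT) {
        return -1;
    }
    if (id >= IRQ_RESERVED_FIRST && id <= IRQ_RESERVED_LAST) {
        return -1;
    }
    t->handlers[id] = handler;
    return 0;
}

/* Returns 1 if a registered handler ran, 0 if the interrupt was unexpected.
 * Unexpected interrupts are counted here; a real system would also write a
 * maintenance-log record and start its recover-or-restart sequence. */
int irq_dispatch(irq_table_t *t, uint32_t id)
{
    assert(t != NULL);
    if (id < IRQ_ID_COUNT && t->handlers[id] != NULL) {
        t->handlers[id](id);
        return 1;
    }
    t->unexpected_count++;
    t->last_unexpected_id = id;
    return 0;
}

/* Example service routine for interrupt 59 (UART 0 in the list above). */
uint32_t uart0_events = 0u;

void uart0_handler(uint32_t id)
{
    (void)id;
    uart0_events++;
}
```

With this arrangement, registering uart0_handler on ID 59 and then dispatching ID 60 leaves unexpected_count at 1 and last_unexpected_id at 60, which is the information the maintenance log needs.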
Bare-metal or OS?
Is your system one for which you have written all of the software, including its operating system (if it has one)? This is called a “bare-metal” system: what you wrote is what you get.
If you have an operating system, it could be open source (a Linux variant) or a commercially available operating system. In either case, these rules still apply: all exceptions and interrupts must be dealt with properly, and the system response must meet the safety requirements of its use. A commercial OS may deal with all of this as a feature, so that you do not have to write all of the drivers and handlers yourself. Either way, it all still needs to be verified.
Is the System Safe?
In any testing of a safety-critical system [5], you should exhaustively check and test every exception and every interrupt. The behavior upon encountering any of them should be completely deterministic, known, and correct. A good example is a system that controls a large motor: the immediate reaction might be to remove power and let the motor idle. The system would then determine its state and continue from a safe state. That safe state may be to let the motor come to a complete stop and require an operator to restart it, or the system may be able to restart safely before the motor has stopped. If the failure was the result of a hardware problem (a hard failure), the system should be designed to detect that condition and shut down in a safe fashion.
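The motor example can be written as a small state machine in which every combination of state and event has a defined outcome, which is what makes the behavior testable. The following is a sketch only: the state names, the event names, and the rule that an unrecognized event is treated as a fault are assumptions made for illustration, not part of any Zynq library.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    MOTOR_RUNNING,
    MOTOR_COASTING,       /* power removed, motor still turning */
    MOTOR_WAIT_OPERATOR,  /* fully stopped, operator must restart */
    MOTOR_SHUT_DOWN       /* hard failure detected, no restart allowed */
} motor_state_t;

typedef enum {
    EV_FAULT,             /* any exception or unexpected interrupt */
    EV_STATE_VERIFIED,    /* system state checked and found sane */
    EV_MOTOR_STOPPED,
    EV_OPERATOR_RESTART,
    EV_HARD_FAILURE
} motor_event_t;

typedef struct {
    motor_state_t state;
    int           power_on;
} motor_t;

void motor_init(motor_t *m)
{
    assert(m != NULL);
    m->state    = MOTOR_RUNNING;
    m->power_on = 1;
}

/* Apply one event and return the new state. */
motor_state_t motor_step(motor_t *m, motor_event_t ev)
{
    assert(m != NULL);
    if (m->state == MOTOR_SHUT_DOWN) {
        return m->state;  /* terminal state: ignore everything */
    }
    switch (ev) {
    case EV_HARD_FAILURE:
        m->power_on = 0;
        m->state    = MOTOR_SHUT_DOWN;
        break;
    case EV_STATE_VERIFIED:
        /* Safe to restart before the motor has stopped. */
        if (m->state == MOTOR_COASTING) {
            m->power_on = 1;
            m->state    = MOTOR_RUNNING;
        }
        break;
    case EV_MOTOR_STOPPED:
        if (m->state == MOTOR_COASTING) {
            m->state = MOTOR_WAIT_OPERATOR;
        }
        break;
    case EV_OPERATOR_RESTART:
        if (m->state == MOTOR_WAIT_OPERATOR) {
            m->power_on = 1;
            m->state    = MOTOR_RUNNING;
        }
        break;
    case EV_FAULT:
    default:
        /* Immediate reaction: remove power. Unknown events count as faults. */
        m->power_on = 0;
        if (m->state == MOTOR_RUNNING) {
            m->state = MOTOR_COASTING;
        }
        break;
    }
    return m->state;
}
```

Because the table of transitions is this small, a test can walk every state and event pair and confirm that power is never applied in the shut-down state.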