Zynq: Exception and Interrupt Handling in Safety-Critical Systems
In safety-critical systems,1 nothing is allowed to go unknown or unacknowledged: every possible condition (behavior) must be accounted for, and the hardware and software must deal with it. The Zynq-7000 All Programmable SoC (AP SoC), with its dual ARM Cortex-A9 processors and dual NEON floating-point units, has a table of addresses to which execution is directed when something happens. The event interrupts the currently running process on one CPU or the other, and execution goes to the interrupt handling code segment on that CPU.2 Once the condition has been serviced, the CPU returns to where it was when it was interrupted and continues from there.
For exceptions, one needs to deal with all seven possible exceptions in a manner that allows the system to continue or recover, or take specific actions to behave properly.
All possible interrupts require a proper interrupt handler.
Exceptions
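There are seven locations in the vector table to which a processor may be directed, one for each of the following exceptions: reset, undefined instruction, software interrupt (SVC), prefetch abort, data abort, IRQ, and FIQ.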
Each exception should be logged, with all available details, in a system maintenance log. The system should then verify its sanity and state, and continue or restart in a safe, reliable, and repeatable fashion.
There are exception handler function calls listed in UG647, starting on page 54.3
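The logging step described above can be sketched in portable C. This is a minimal illustration, not the UG647 handler API: the record fields, the `exc_log_*` names, and the log depth are hypothetical, and a real handler would capture the link register and fault status registers on entry and write the record to storage that survives a restart. The vector offsets are the standard ARMv7-A ones, relative to the vector base address.

```c
#include <stddef.h>
#include <stdint.h>

/* The seven ARMv7-A exception types, in vector-table order. */
typedef enum {
    EXC_RESET = 0,
    EXC_UNDEFINED,
    EXC_SVC,
    EXC_PREFETCH_ABORT,
    EXC_DATA_ABORT,
    EXC_IRQ,
    EXC_FIQ,
    EXC_COUNT
} exc_type_t;

/* Byte offset of each vector from the vector base (offset 0x14 is unused). */
static const uint32_t exc_vector_offsets[EXC_COUNT] = {
    0x00u, 0x04u, 0x08u, 0x0Cu, 0x10u, 0x18u, 0x1Cu
};

/* Returns the vector offset, or UINT32_MAX for an invalid type. */
uint32_t exc_vector_offset(exc_type_t type)
{
    return ((unsigned)type < (unsigned)EXC_COUNT) ? exc_vector_offsets[type]
                                                  : UINT32_MAX;
}

/* Hypothetical log record: what happened, where, and a status word. */
typedef struct {
    exc_type_t type;
    uint32_t   return_addr; /* assumed: address captured at exception entry */
    uint32_t   status;      /* assumed: e.g. a fault status register value */
} exc_record_t;

#define EXC_LOG_DEPTH 16u   /* hypothetical depth; oldest entries are overwritten */

typedef struct {
    exc_record_t entries[EXC_LOG_DEPTH];
    uint32_t     total;     /* number of exceptions recorded since init */
} exc_log_t;

void exc_log_init(exc_log_t *log)
{
    log->total = 0u;
}

/* Store one record in the ring buffer, overwriting the oldest when full. */
void exc_log_record(exc_log_t *log, exc_type_t type,
                    uint32_t return_addr, uint32_t status)
{
    exc_record_t *slot = &log->entries[log->total % EXC_LOG_DEPTH];
    slot->type = type;
    slot->return_addr = return_addr;
    slot->status = status;
    log->total++;
}

/* Most recent record, or NULL if nothing has been logged. */
const exc_record_t *exc_log_latest(const exc_log_t *log)
{
    if (log->total == 0u) {
        return NULL;
    }
    return &log->entries[(log->total - 1u) % EXC_LOG_DEPTH];
}
```

Each of the seven handlers would call `exc_log_record` before the sanity check and the continue-or-restart decision, so the maintenance log always holds the most recent events even if the recovery itself fails.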
Interrupts
The processor sub-system (PSS) in Zynq has 96 interrupt vectors.
0-15: These are 16 software generated interrupts (SGI)
16-26: reserved (not used; execution should never arrive here, and an exception will be generated if it does)
27: Global Timer
28: Fast Interrupt Signal from the programmable logic
29: CPU Private Timer
30: Private Watchdog Timer
31: Interrupt from the programmable logic
32: CPU 0 LT, TLB, BTAC
33: CPU 1 LT, TLB, BTAC
34: L2 cache
35: OCM
36: reserved
37: PMU 0
38: PMU 1
39: XADC (system monitor analog to digital converter)
40: DevC (device configuration interface)
41: SWDT
42-44: TTC 0 timer
45: DMAC abort
46-49: DMAC[3:0] (46=0, 49=3)
50: SMC (memory)
51: Quad SPI
52: GPIO
53: USB 0
54: Ethernet 0
55: Ethernet 0 wakeup
56: SDIO 0
57: I2C 0
58: SPI 0
59: UART 0
60: CAN 0
61-63: Programmable logic FPGA [2:0]
64-68: Programmable logic FPGA [7:3]
69-71: TTC 1 timer
72-75: DMAC [7:4]
76: USB 1
77: Ethernet 1
78: Ethernet 1 wakeup
79: SDIO 1
80: I2C 1
81: SPI 1
82: UART 1
83: CAN 1
84-91: Programmable logic FPGA [15:8]
92: SCU parity error
93-95: reserved
All of the above are documented in UG585 (footnote 4).
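As a rough sketch, the numbering above can be captured in two small helpers. The first uses the standard interrupt controller terms for the three ID ranges (software generated, private per-CPU, and shared peripheral), which the list itself does not name. The second maps a programmable logic interrupt index to its ID, assuming that index 0 sits at ID 61 because ID 60 is CAN 0. The function and type names are hypothetical, and the numbers should be checked against UG585 before use.

```c
#include <stdint.h>

#define ZYNQ_NUM_IRQ_IDS 96u

typedef enum {
    IRQ_CLASS_SGI,     /* 0-15: software generated interrupts */
    IRQ_CLASS_PPI,     /* 16-31: private per-CPU range (16-26 are reserved) */
    IRQ_CLASS_SPI,     /* 32-95: shared peripheral interrupts */
    IRQ_CLASS_INVALID  /* 96 and above: not a valid ID */
} irq_class_t;

/* Classify an interrupt ID by the range it falls in. */
irq_class_t irq_classify(uint32_t id)
{
    if (id < 16u) {
        return IRQ_CLASS_SGI;
    }
    if (id < 32u) {
        return IRQ_CLASS_PPI;
    }
    if (id < ZYNQ_NUM_IRQ_IDS) {
        return IRQ_CLASS_SPI;
    }
    return IRQ_CLASS_INVALID;
}

/*
 * Map a programmable logic interrupt index (0-15) to its interrupt ID.
 * Assumed layout: [7:0] -> 61-68, [15:8] -> 84-91. Returns -1 if the
 * index is out of range.
 */
int32_t pl_irq_to_id(uint32_t pl_index)
{
    if (pl_index < 8u) {
        return (int32_t)(61u + pl_index);
    }
    if (pl_index < 16u) {
        return (int32_t)(84u + (pl_index - 8u));
    }
    return -1;
}
```

Keeping the mapping in one place means a handler table can be populated from the programmable logic index used in the hardware design instead of from hand-copied ID numbers.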
Every interrupt that you intend to use should have a valid service routine. Every interrupt you do not intend to use should have a vector to a routine that behaves like an exception handler: log the event, log whatever data can be obtained, and move on to recover or restart. Do not rely on the exception handler alone. The rule is, “if it can happen, it will happen,” so deal with all possible situations in an intelligent fashion. I know this philosophy is totally foreign to most programmers (why should I care?), but in safety-critical systems you must be able to identify what is going on and safely recover from any situation.
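One way to follow this rule is to fill the whole handler table with a default routine first, and only then register the service routines that are actually needed. The sketch below assumes a software dispatch table indexed by interrupt ID. All names are hypothetical, the "log" is reduced to a counter and the last offending ID, and `example_uart0_isr` stands in for a real service routine.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_IRQ_IDS 96u

typedef void (*irq_handler_t)(uint32_t id);

static irq_handler_t irq_table[NUM_IRQ_IDS];
static uint32_t unexpected_total;
static uint32_t last_unexpected_id = UINT32_MAX;

/* Default routine: record the event so recovery code can act on it. */
static void irq_unexpected(uint32_t id)
{
    unexpected_total++;
    last_unexpected_id = id;
}

/* Point every entry at the default routine before anything is registered. */
void irq_table_init(void)
{
    for (uint32_t i = 0u; i < NUM_IRQ_IDS; i++) {
        irq_table[i] = irq_unexpected;
    }
    unexpected_total = 0u;
    last_unexpected_id = UINT32_MAX;
}

/* Register a service routine; returns 0 on success, -1 on bad arguments. */
int irq_register(uint32_t id, irq_handler_t handler)
{
    if (id >= NUM_IRQ_IDS || handler == NULL) {
        return -1;
    }
    irq_table[id] = handler;
    return 0;
}

/* Dispatch by ID; out-of-range IDs and empty slots go to the default. */
void irq_dispatch(uint32_t id)
{
    if (id >= NUM_IRQ_IDS || irq_table[id] == NULL) {
        irq_unexpected(id);
        return;
    }
    irq_table[id](id);
}

uint32_t irq_unexpected_count(void) { return unexpected_total; }
uint32_t irq_last_unexpected(void)  { return last_unexpected_id; }

/* Hypothetical service routine for UART 0 (ID 59 in the list above). */
static uint32_t uart0_events;
void example_uart0_isr(uint32_t id)
{
    (void)id;
    uart0_events++;
}
uint32_t example_uart0_event_count(void) { return uart0_events; }
```

With this layout, an interrupt on a reserved or unused ID is recorded rather than silently ignored. The dispatcher also treats an empty slot as unexpected, so a call made before `irq_table_init` does not jump through a null pointer.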
Bare-metal or OS?
Is your system one where you have written all of the software, including its operating system (if it has one)? This is called a “bare-metal” system: what you wrote is what you get.
If you have an operating system, it could be open source (a Linux variant) or a commercially available operating system. In either case, these rules still apply: all exceptions and interrupts must be dealt with properly, and the system response must meet the requirements for the safety of its use. A commercial OS may deal with all of this as a feature, so you do not have to write all the drivers and handlers yourself. Either way, it all still needs to be verified.
Is the System Safe?
In any testing of a safety-critical system,5 you should exhaustively check and test for every exception and every interrupt. The behavior upon encountering any of them should be completely deterministic and known to be proper. A good example is a system controlling a large motor: the immediate reaction might be to remove power and let the motor idle. Then, you would determine the state of the system and continue from a safe state. That safe state may be to let the motor come to a complete stop and require an operator to restart, or perhaps the system may safely restart before the motor has stopped. If the failure was the result of a hardware problem (hard failure), the system should be designed to detect this condition and shut down in a safe fashion.
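The motor example can be written as a small state machine, which makes the reaction to every event in every state explicit and therefore testable. This is only a sketch under assumed names: the states, events, and the policy of allowing a restart while the motor is still turning are illustrative choices, not requirements taken from any standard.

```c
typedef enum {
    MOTOR_RUNNING,   /* power applied */
    MOTOR_COASTING,  /* power removed, motor still turning */
    MOTOR_STOPPED,   /* motor at rest, waiting for an operator */
    MOTOR_SHUTDOWN   /* hard failure: stays here until serviced */
} motor_state_t;

typedef enum {
    EV_FAULT,             /* exception or unexpected interrupt */
    EV_HARD_FAILURE,      /* hardware problem detected */
    EV_STATE_VERIFIED,    /* sanity check passed after a fault */
    EV_ROTATION_STOPPED,  /* motor has come to a complete stop */
    EV_OPERATOR_RESTART   /* operator asked for a restart */
} motor_event_t;

/* Pure transition function: every (state, event) pair has a defined result. */
motor_state_t motor_next_state(motor_state_t state, motor_event_t event)
{
    if (state == MOTOR_SHUTDOWN || event == EV_HARD_FAILURE) {
        return MOTOR_SHUTDOWN;
    }
    switch (state) {
    case MOTOR_RUNNING:
        /* Immediate reaction to a fault: remove power. */
        return (event == EV_FAULT) ? MOTOR_COASTING : MOTOR_RUNNING;
    case MOTOR_COASTING:
        if (event == EV_STATE_VERIFIED) {
            return MOTOR_RUNNING;   /* assumed policy: restart before stop */
        }
        if (event == EV_ROTATION_STOPPED) {
            return MOTOR_STOPPED;
        }
        return MOTOR_COASTING;
    case MOTOR_STOPPED:
        return (event == EV_OPERATOR_RESTART) ? MOTOR_RUNNING : MOTOR_STOPPED;
    default:
        /* Unknown state value: fail safe. */
        return MOTOR_SHUTDOWN;
    }
}

/* Power is applied only in the running state. */
int motor_power_enabled(motor_state_t state)
{
    return state == MOTOR_RUNNING;
}
```

Because the transition function has no side effects, the full table of states and events is small enough to test exhaustively, which is the kind of verification the paragraph above asks for.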
Footnotes: