10-28-2013 09:36 AM
I'm running Linux kernel 2.6.28 on a Virtex5 ppc440 core. I realize that both the kernel and device are quite old now, but I'm hoping someone on this forum might have an explanation for the behaviour I'm seeing.
Recently I triggered a kernel panic from a "Data Read PLB Error" machine check exception. The machine check happened while the kernel was handling a page fault. Our kernel is configured to auto-reset 5 seconds after a panic, but on this particular panic the core locked up completely after attempting the soft reset, and we can only recover from that state with a hard reset. I fixed the bug that triggered the panic, but what I'd like to know is: what could cause the ppc440 core to lock up completely on a soft reset (DBCR0[RST])? The soft reset works fine at any other time, even after kernel panics triggered for other reasons (NULL pointer dereference, etc.). Is there something about this sequence of exceptions (a page fault followed closely by a machine check) that puts the ppc440 core in an unresettable state?
The program that triggers the panic calls mmap() to map a device into memory (I've tested with several different devices; it doesn't matter which one is mapped). The mmap call incorrectly uses the MAP_PRIVATE flag instead of MAP_SHARED, so the mapping is copy-on-write. It's during the copy-on-write in the page fault handler that the kernel attempts to read a non-existent address, panics, and then hangs on reset. The crash data is copied below.
Oops: Machine check, sig: 7 [#1]
PREEMPT Xilinx Virtex440
Modules linked in: lldma current_image [last unloaded: lldma]
NIP: c0011370 LR: c0072f64 CTR: 0000007b
REGS: ef911f10 TRAP: 0214 Not tainted (2.6.28)
MSR: 00029000 <EE,ME,CE> CR: 42000022 XER: 0000005f
TASK = eef747c0 'panic_kernel' THREAD: eee86000
GPR00: 00000004 eee87e40 eef747c0 00000084 4801f02c 00000000 ffffb01c a5a5a5a5
GPR08: 00000000 a5a5a5a5 a5a5a5a5 00000004 22000022 1004b108 00000000 bf99eb98
GPR16: 100029a0 4801f00c 10002bac 00000004 00000013 4801f00c 00000000 84021569
GPR24: eedb4900 eeb383c8 eb1930f8 c0f07340 00000000 ffffb000 eed11a00 eee86000
NIP [c0011370] __copy_tofrom_user+0xbc/0x23c
LR [c0072f64] do_wp_page+0x514/0xa84
[eee87e40] [c0072f54] do_wp_page+0x504/0xa84 (unreliable)
[eee87e80] [c000f660] do_page_fault+0x3f8/0x574
[eee87f40] [c000da94] handle_page_fault+0xc/0x80
7c03222c 38630020 4200fff8 7d070050 7ce03b78 7d0903a6 7c03222c 7c0b37ec
80e40004 81040008 8124000c 85440010 <90e60004> 91060008 9126000c 95460010
Kernel panic - not syncing: Fatal exception
Rebooting in 5 seconds..
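For anyone hitting the same thing, here is a minimal sketch of the corrected mapping call. It is not the actual program from this post: the helper name map_shared and its parameters are made up for illustration, and the path would be whatever device node your driver exposes. The only point is the MAP_SHARED flag, which makes stores go to the underlying object rather than triggering a copy-on-write in do_wp_page.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of `path` read/write with MAP_SHARED.
 * With MAP_PRIVATE the first store to the mapping would fault into the
 * kernel's copy-on-write path, which tries to copy the original page.
 * Returns MAP_FAILED on any error. */
static void *map_shared(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* The mapping remains valid after the descriptor is closed. */
    close(fd);
    return p;
}
```

The caller checks the result against MAP_FAILED and releases the mapping with munmap() when done.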
10-28-2013 11:23 AM
What you are seeing is most likely caused by an insufficient reset sequence in the FPGA design.
The machine check exception is caused either by a CPU access to an address at which no peripheral exists or by a CPU access to a peripheral that does not respond. In both cases the bus arbiter times out and responds with a bus error, which results in a machine check exception.
Setting DBCR0[RST] does not have an immediate effect. It only asserts the respective reset outputs (core, chip, system) on the PPC440 primitive. The fabric design captures those signals and then asserts the reset inputs (core, chip, system) to the PPC440 primitive. At the same time the reset logic in the fabric should reset the fabric bus infrastructure and the peripherals (and possibly components on the board).
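To make the request side concrete, here is a small sketch of how the RST field in DBCR0 encodes which reset output gets asserted. The three DBCR0_RST_* values are the ones I recall from the kernel's asm/reg_booke.h, so please check them against your tree and the PPC440 user's manual. DBCR0_RST_MASK, the enum and both function names are my own, purely for illustration. On the real core the value would be written with mtspr, which is left out so the sketch can be compiled anywhere.

```c
#include <stdint.h>

/* DBCR0[RST] is bits 2:3 in IBM bit numbering (bit 0 is the MSB).
 * Values as recalled from asm/reg_booke.h; verify before relying on them. */
#define DBCR0_RST_CORE   0x10000000u  /* RST = 01: core reset request   */
#define DBCR0_RST_CHIP   0x20000000u  /* RST = 10: chip reset request   */
#define DBCR0_RST_SYSTEM 0x30000000u  /* RST = 11: system reset request */
#define DBCR0_RST_MASK   0x30000000u  /* hypothetical name for the field mask */

enum reset_req { RESET_NONE, RESET_CORE, RESET_CHIP, RESET_SYSTEM };

/* Which reset output the core would assert for a given DBCR0 value. */
static enum reset_req decode_reset_request(uint32_t dbcr0)
{
    switch (dbcr0 & DBCR0_RST_MASK) {
    case DBCR0_RST_CORE:   return RESET_CORE;
    case DBCR0_RST_CHIP:   return RESET_CHIP;
    case DBCR0_RST_SYSTEM: return RESET_SYSTEM;
    default:               return RESET_NONE;
    }
}

/* Value software would write to request a reset, leaving other bits alone. */
static uint32_t request_reset(uint32_t dbcr0, uint32_t rst_bits)
{
    return (dbcr0 & ~DBCR0_RST_MASK) | (rst_bits & DBCR0_RST_MASK);
}
```

Whichever value is written, the write is only a request: the reset actually happens when the fabric logic drives the matching reset input back into the PPC440 block.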
If the reset logic does not reset a peripheral, and that peripheral is left in an unresponsive state after a machine check exception, you would see the behavior you are describing.