06-23-2010 09:00 AM
Gidday there,
I don't know how many people have a chance to use fsck at boot using a microblaze, but I have a problem that only occurs when I have to perform a full fsck on a custom block device (custom logic and driver for SD card). During simple fsck operations and nominal file operations (mounting, reading, writing etc.) everything appears to be fine. I have a silly test where I reboot 60 seconds after booting up, and eventually the maximum number of mounts is exceeded and a full fsck is performed.
This causes the following output and stack trace:
/dev/q5sda2 was not cleanly unmounted, check forced. [ 3.580000] BUG: failure at mm/mmap.c:2187/exit_mmap()! [ 3.580000] Kernel panic - not syncing: BUG! [ 3.580000] About to dump_stack() [ 3.580000] About to show_stack [ 3.580000] [ 3.580000] Stack: [ 3.580000] c6c31d0c 00000001 000017ef c04b3fcc 000017ef c006a27c c02bd0d4 c02e16f0 [ 3.580000] 00000001 000017c7 c04b3fcc 000017c7 c6c31d54 c6c33d4c 00000000 c6c30000 [ 3.580000] c001578c c02c8fc8 c02e16f0 00000001 00001794 c04b3fcc 00001794 c037de38 [ 3.580000] Call Trace: [ 3.580000] [ 3.580000] [<c006a27c>] exit_mmap+0x230/0x248 [ 3.580000] [<c001578c>] mmput+0x64/0x120 [ 3.580000] [<c001a920>] exit_mm+0x134/0x1a4 [ 3.580000] [<c001a914>] exit_mm+0x128/0x1a4 [ 3.580000] [<c0037128>] hrtimer_cancel+0x14/0x30 [ 3.580000] [<c001c298>] do_exit+0x138/0x778 [ 3.580000] [<c001c240>] do_exit+0xe0/0x778 [ 3.580000] [<c001c910>] do_group_exit+0x38/0xe8 [ 3.580000] [<c0028c20>] dequeue_signal+0x28/0x1ec [ 3.580000] [<c000a904>] do_page_fault+0x2e0/0x4b8 [ 3.580000] [<c0028cc0>] dequeue_signal+0xc8/0x1ec [ 3.580000] [<c0065f54>] handle_mm_fault+0x5b4/0x674 [ 3.580000] [<c00298b8>] get_signal_to_deliver+0x244/0x408 [ 3.580000] [<c0005fc4>] do_signal+0x64/0x27c [ 3.580000] [<c0038274>] up_read+0xc/0x20 [ 3.580000] [<c00091b8>] ret_from_exc+0x48/0x1b8 [ 3.580000] [<c0009168>] page_fault_instr_trap+0x1f8/0x200 [ 3.580000] [<c000f414>] __enqueue_entity+0xc0/0xe8 [ 3.580000] [<c001ed04>] do_softirq+0x4c/0x58 [ 3.580000] [<c02b1e28>] _raw_spin_lock+0x1c/0x30 [ 3.580000] [<c02b1d1c>] _raw_spin_unlock_irq+0xc/0x54 [ 3.580000] [<c02af474>] schedule+0x148/0x4dc [ 3.580000] [<c02af53c>] schedule+0x210/0x4dc [ 3.580000] [<c02af448>] schedule+0x11c/0x4dc [ 3.580000] [<c02af4d4>] schedule+0x1a8/0x4dc [ 3.580000] [<c0009514>] ret_from_irq+0x1c/0x48 [ 3.580000] [<c00094f0>] irq_call+0x0/0x8 [ 3.580000] [ 3.580000] Panic Timeout: 60 [ 3.580000] Rebooting in 60 seconds.. [ 3.580000] Machine restart...
Please note, in order to get the stack trace and the automatic reboot, I needed to modify the default code from the git repository. I can post the diffs if anyone is interested.
The driver for our logic (and SD cards) had been running just dandy on PPC405, under 2.6.28 for several years. Unfortunately, the port to microblaze was not a simple re-compile because the block driver API changed before 2.6.33 was released, so I'm dealing with two simultaneous environment changes. All this to say, the problem may well be in my poorly coded block driver.
So, has anyone seen this before?
Thanks
Joshua
06-24-2010 12:45 PM
Hi Joshua,
I don't run a lot of debug all the time, just turning it on when I see something wrong.
I don't have any great advice here.
Sorry.
06-23-2010 09:55 AM
Hi Joshua,
I realize this is yet another permutation, but I pushed out 2.6.34 to the public git server yesterday.
You could give it a try as arch/microblaze continues to change as stuff matures.
Thanks,
John
06-23-2010 01:13 PM
Gidday John,
I updated the kernel to 2.6.34 (which meant that I had to deal with other internal changes such as a change to blk_queue_max_sectors (http://lwn.net/Articles/376508/) as well as a change in the behaviour of the of_compatible formatting or parsing.
Anyway, this didn't fix my problem.
/dev/q5sda2 was not cleanly unmounted, check forced. [ 3.370000] BUG: failure at mm/mmap.c:2265/exit_mmap()! [ 3.370000] Kernel panic - not syncing: BUG! [ 3.370000] [ 3.370000] Stack: [ 3.370000] c70d1d14 c70b9a00 c70d0000 00000001 c70d1d30 c006a5c8 c02c59cc c04c0af0 [ 3.370000] 00000000 c70d1d54 42554721 00000004 c70d1d54 c70d0000 c0014dc0 c02d1b78 [ 3.370000] c02ea87c 00000001 0000169f c04c0fb4 0000169f c0389e38 00000089 00000000 [ 3.370000] Call Trace: [ 3.370000] [ 3.370000] [<c006a5c8>] exit_mmap+0x230/0x248 [ 3.370000] [<c0014dc0>] mmput+0x64/0x120 [ 3.370000] [<c0019e30>] exit_mm+0x134/0x1a4 [ 3.370000] [<c0019e24>] exit_mm+0x128/0x1a4 [ 3.370000] [<c0036828>] hrtimer_cancel+0x14/0x30 [ 3.370000] [<c001b7b8>] do_exit+0x138/0x778 [ 3.370000] [<c001b760>] do_exit+0xe0/0x778 [ 3.370000] [<c001be30>] do_group_exit+0x38/0xe8 [ 3.370000] [<c00282d4>] dequeue_signal+0x28/0x1ec [ 3.370000] [<c0009e18>] do_page_fault+0x2e4/0x478 [ 3.370000] [<c0028374>] dequeue_signal+0xc8/0x1ec [ 3.370000] [<c0066164>] handle_mm_fault+0x5b4/0x674 [ 3.370000] [<c0028f74>] get_signal_to_deliver+0x244/0x408 [ 3.370000] [<c0005664>] do_signal+0x64/0x27c [ 3.370000] [<c0037974>] up_read+0xc/0x20 [ 3.370000] [<c00085b8>] ret_from_exc+0x48/0x1b8 [ 3.370000] [<c0008568>] page_fault_instr_trap+0x1f8/0x200 [ 3.370000] [<c000eaf4>] __enqueue_entity+0xc0/0xe8 [ 3.370000] [<c001e1fc>] do_softirq+0x4c/0x58 [ 3.370000] [<c02bc808>] _raw_spin_lock+0x1c/0x30 [ 3.370000] [<c02bc6fc>] _raw_spin_unlock_irq+0xc/0x54 [ 3.370000] [<c02b9e44>] schedule+0x148/0x4dc [ 3.370000] [<c02b9f0c>] schedule+0x210/0x4dc [ 3.370000] [<c02b9e18>] schedule+0x11c/0x4dc [ 3.370000] [<c02b9ea4>] schedule+0x1a8/0x4dc [ 3.370000] [<c0008914>] ret_from_irq+0x1c/0x48 [ 3.370000] [<c00088f0>] irq_call+0x0/0x8 [ 3.370000] [ 3.370000] Rebooting in 60 seconds.. � 3.370000] Machine restart... Autobooting ELF Kernel in 4 seconds, press SPACE to stop 4 Initializing SD Card... Loading ELF Image... Starting ELF Image...
Normally (with 2.6.33) it would continue past 'Starting ELF Image...' e.g.
Starting ELF Image... [ 0.000000] Linux version 2.6.34 (jpl@megatron.xiphos.ca) (gcc version 4.1.2) #216 PREEMPT Wed Jun 23 15:58:12 EDT 2010
However, something appears to be left over in memory and the system hangs at this point. If I re-init the FPGA and look at the memory from my bootloader, I see the following.
[jpl@megatron linux-2.6-xlnx]$ /home/jpl/dev/exnet/Q6/xilinx/microblaze_v1.0/microblaze-unknown-linux-gnu/bin/mb-linux-objdump -t vmlinux |grep __log_buf c04c0fb4 l O .bss 00020000 __log_buf [Q6]> md 0x804c0fb0 804C0FB0: 00000000 3C353E5B 20202020 302E3030 ....<5>[ 0.00 804C0FC0: 30303030 5D204C69 6E757820 76657273 0000] Linux vers 804C0FD0: 696F6E20 322E362E 33342028 6A706C40 ion 2.6.34 (jpl@ 804C0FE0: 6D656761 74726F6E 2E786970 686F732E megatron.xiphos. 804C0FF0: 63612920 28676363 20766572 73696F6E ca) (gcc version 804C1000: 20342E31 2E322920 23323136 20505245 4.1.2) #216 PRE 804C1010: 454D5054 20576564 204A756E 20323320 EMPT Wed Jun 23 804C1020: 31353A35 383A3132 20454454 20323031 15:58:12 EDT 201 804C1030: 300A3C36 3E5B2020 2020302E 30303030 0.<6>[ 0.0000 804C1040: 30305D20 73657475 705F6370 75696E66 00] setup_cpuinf 804C1050: 6F3A2069 6E697469 616C6973 696E670A o: initialising. 804C1060: 3C343E5B 20202020 302E3030 30303030 <4>[ 0.000000 804C1070: 5D207365 7475705F 63707569 6E666F3A ] setup_cpuinfo: 804C1080: 204E6F20 50565220 73757070 6F72742E No PVR support. 804C1090: 20557369 6E672073 74617469 63204350 Using static CP 804C10A0: 5520696E 666F2066 726F6D20 4644540A U info from FDT. [Q6]> 804C10B0: 3C363E5B 20202020 302E3030 30303030 <6>[ 0.000000 804C10C0: 5D206361 6368653A 2077745F 6D73725F ] cache: wt_msr_ 804C10D0: 6E6F6972 710A0000 00000000 00000000 noirq
John, what debugging options do you use when you are running/testing your kernels? Sometimes I run with a few slab debug and memory poisoning options set. Could this be having an effect? The kernel output after this should be....
[ 0.000000] setup_memory: max_mapnr: 0x8000 [ 0.000000] setup_memory: min_low_pfn: 0x80000 [ 0.000000] setup_memory: max_low_pfn: 0x88000
Thanks in advance for any comments.
Joshua
06-24-2010 12:45 PM
Hi Joshua,
I don't run a lot of debug all the time, just turning it on when I see something wrong.
I don't have any great advice here.
Sorry.
07-06-2010 10:48 AM
Gidday John,
Thank you, yes, it was the kernel debugging options that were causing the havoc. I eventually realised this when looking through the source and seeing a note about the Page Table Entries overflowing if the spinlock debugging was enabled without some sort of PTE locking being configured.
Anyway, I've turned off the kernel debugging and am able to do my full fsck.ext2. I still have a reboot problem between 2.6.33 and 2.6.34, but I'll start a new thread about that once I have some useful information to share.
Thanks again.
Joshua
07-06-2010 12:09 PM
Great! Thanks for the update.