Linux Red Hat / SUSE kernel panic analysis with kdump and crash

Article:TECH144100  |  Created: 2010-11-12  |  Updated: 2012-08-26  |  Article URL http://www.symantec.com/docs/TECH144100
Article Type
Technical Solution



Issue



Some VxVM commands can be very I/O intensive and occasionally panic the system.  The root cause of the panic can be determined by examining the system core file.  However, the system must be configured in advance to generate and save the system core file.


Error



If the root cause is an I/O-related hang, the messages file may contain no indication of the hang or panic.  A core analysis may be needed.


Environment



This document uses the following configuration:

Red Hat Enterprise Linux 5.5, SF 5.1, EMC CLARiiON disk with multipath.


Solution



If the problem is repeatable, enable kdump and install the crash and kernel debug RPMs on the machine.  This example uses Red Hat Enterprise Linux Server release 5.5 (Tikanga).  Verify that the kernel headers, kernel-debuginfo-common, kernel-debuginfo, kdump and crash packages are installed for the running kernel:


kernel-2.6.18-194.el5
kernel-headers-2.6.18-194.el5
kernel-devel-2.6.18-194.el5
crash-4.1.2-4.el5
system-config-kdump-1.0.14-4.el5
kernel-debuginfo-common-2.6.18-194.el5
kernel-debuginfo-2.6.18-194.el5
 

Some of these RPMs are on the install disk.  Others, such as the kernel-debuginfo and kernel-debuginfo-common RPMs, must be downloaded from Red Hat at: ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/x86_64/Debuginfo/
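
A quick way to confirm that the packages listed above are present is to query the RPM database for each one.  The following is a minimal sketch and not part of the original procedure: the function name missing_pkgs and the QUERY override are hypothetical, and it assumes each package is queried by name-version-release as shown in the list.

```shell
#!/bin/sh
# Hypothetical helper: print each package that the RPM database does
# not report as installed. QUERY defaults to "rpm -q"; it can be
# overridden (for example with "false") to exercise the function on a
# machine without rpm.
missing_pkgs() {
    query=${QUERY:-"rpm -q"}
    for pkg in "$@"; do
        # rpm -q exits non-zero when the package is not installed
        $query "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}

# Example call for the kernel used in this article (uncomment to run):
# missing_pkgs kernel-2.6.18-194.el5 kernel-debuginfo-2.6.18-194.el5 \
#     kernel-debuginfo-common-2.6.18-194.el5 crash
```

No output means every queried package was found.  Any name that is printed still needs to be installed before the dump can be analyzed.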

First, configure kdump.  Few administrators seem to do this on Linux systems, but it can be done with the graphical tool /usr/bin/system-config-kdump.  Using this utility reserves 128 MB of system memory for the "crash kernel" that writes the dump.
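
The memory reservation takes effect through a crashkernel= parameter on the kernel command line, so it can be checked after the next reboot.  A minimal sketch of that check follows.  The function name crashkernel_setting is hypothetical, and the 128M@16M value in the comment is an assumption about what the tool typically writes on this release.

```shell
#!/bin/sh
# Hypothetical helper: print the crashkernel= setting found in a kernel
# command line file. The file defaults to /proc/cmdline; another path
# can be passed for testing. Exits non-zero if no setting is present.
crashkernel_setting() {
    file=${1:-/proc/cmdline}
    # Put each boot parameter on its own line, then pick the one we want
    tr ' ' '\n' < "$file" | grep '^crashkernel='
}

# Example (uncomment to run); might print something like
# crashkernel=128M@16M, depending on how kdump was configured:
# crashkernel_setting || echo "no crashkernel reservation found"
```

On Red Hat Enterprise Linux 5 the dump service is normally enabled with "chkconfig kdump on" and started with "service kdump start".  Verify the service name on other distributions.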

Edit /etc/sysctl.conf to add or modify these parameters:

...

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 1

# Enable auto system reboot after system crash
kernel.panic = 60
...

Set these parameters interactively on the OS command line if desired:

# sysctl -w kernel.sysrq=1
# sysctl -w kernel.panic=60
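
To confirm that the persistent settings match what was intended, the values can be read back from the configuration file.  This is a minimal sketch; the function name conf_value is hypothetical, and it assumes simple "key = value" lines with no trailing comments.

```shell
#!/bin/sh
# Hypothetical helper: print the value that a sysctl.conf-style file
# assigns to a key. If the key appears more than once, the last
# assignment is printed. The file defaults to /etc/sysctl.conf.
conf_value() {
    key=$1
    file=${2:-/etc/sysctl.conf}
    # Escape the dots in the key so they match literally in the regex
    pat=$(printf '%s' "$key" | sed 's/\./\\./g')
    sed -n "s/^[[:space:]]*${pat}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# Example (uncomment to run):
# conf_value kernel.sysrq
# conf_value kernel.panic
```

Running "sysctl -p" loads the values from /etc/sysctl.conf into the running kernel, and "sysctl kernel.sysrq kernel.panic" shows the values currently in effect.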
 

-----  After the Crash ----

In this test case, the panic problem was recreated and the system core was written to /var/crash/<date:time>/vmcore.  Crash analysis can begin with the following command:

# crash /boot/System.map-2.6.18-194.el5 /usr/lib/debug/lib/modules/2.6.18-194.el5/vmlinux ./vmcore 
 

The summary that crash prints by default at startup shows that the task running at the time of the panic was the OS "vol_id" command (PID 7957):

 SYSTEM MAP: /boot/System.map-2.6.18-194.el5
DEBUG KERNEL: /usr/lib/debug/lib/modules/2.6.18-194.el5/vmlinux (2.6.18-194.el5)
    DUMPFILE: ./vmcore
        CPUS: 4
        DATE: Thu Nov 11 13:09:34 2010
      UPTIME: 00:05:03
LOAD AVERAGE: 0.05, 0.23, 0.12
       TASKS: 313
    NODENAME: rover.spr.spt.symantec.com
     RELEASE: 2.6.18-194.el5
     VERSION: #1 SMP Tue Mar 16 21:52:39 EDT 2010
     MACHINE: x86_64  (1596 Mhz)
      MEMORY: 2 GB
       PANIC: "Oops: 0000 [1] SMP " (check log for details)
         PID: 7957
     COMMAND: "vol_id"
        TASK: ffff81005214b820  [THREAD_INFO: ffff810051fe8000]
         CPU: 3
       STATE: TASK_RUNNING (PANIC) 
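
When several dumps have accumulated, the vmcore path passed to crash can be located by modification time.  The sketch below is illustrative only: the function name latest_vmcore is hypothetical, and it assumes one subdirectory per dump under /var/crash and that the system has rebooted into the same kernel that panicked.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified vmcore found
# one level below a crash directory (default /var/crash). Prints
# nothing if no dump is found.
latest_vmcore() {
    dir=${1:-/var/crash}
    # ls -t sorts newest first; errors from an unmatched glob are hidden
    ls -t "$dir"/*/vmcore 2>/dev/null | head -n 1
}

# Example (uncomment to run on a system with the debuginfo RPMs):
# crash /boot/System.map-$(uname -r) \
#     /usr/lib/debug/lib/modules/$(uname -r)/vmlinux "$(latest_vmcore)"
```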
 

  The most useful piece of information is the stack trace, or "backtrace."  Typing "bt" at the crash prompt prints one for the panicking task:

crash> bt
PID: 7957   TASK: ffff81005214b820  CPU: 3   COMMAND: "vol_id"
 #0 [ffff810051fe9730] crash_kexec at ffffffff800aeb6b
 #1 [ffff810051fe97f0] __die at ffffffff80066157
 #2 [ffff810051fe9830] do_page_fault at ffffffff80067dd7
 #3 [ffff810051fe9920] error_exit at ffffffff8005ede9
    [exception RIP: part_round_stats+19]
    RIP: ffffffff801447a1  RSP: ffff810051fe99d8  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: ffff81007ab57ac0  RCX: d600000000000000
    RDX: 0000000000000000  RSI: 8000000000000000  RDI: ffff81007ab57ac0
    RBP: 0000000100000d7e   R8: 000000000000000f   R9: 0000000000000000
    R10: ffff810009930388  R11: ffffffff8014c80a  R12: 0000000000000000
    R13: 0000000000000001  R14: 00000000013efd00  R15: 0000000000800032
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff810051fe99f0] drive_stat_acct at ffffffff80144969
...

 

  The backtrace shows that the exception occurred in part_round_stats.  Searching the Red Hat Bugzilla for this function name gives a possible match for a known bug:


https://bugzilla.redhat.com/show_bug.cgi?id=493517
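
If the backtrace has been saved to a text file, the function name to search for can be extracted from the "exception RIP" line.  This is a minimal sketch assuming the bt output format shown in this article; the function name exception_symbol is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the symbol named on each
# "[exception RIP: symbol+offset]" line of saved crash "bt" output.
# Reads the named files, or standard input when none are given.
exception_symbol() {
    # Capture everything after "exception RIP: " up to the "+" or "]"
    sed -n 's/.*\[exception RIP: \([^]+]*\).*/\1/p' "$@"
}

# Example (uncomment to run against a saved backtrace):
# exception_symbol bt.txt
```

For the backtrace in this article, the line "[exception RIP: part_round_stats+19]" yields part_round_stats.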

 This concludes the research necessary to find the cause of the crash.  In this case, Red Hat will provide a fix or workaround for this problem.

 

 

 

 


Supplemental Materials

Description

Additional Useful Commands

 

----------------------------------------------------------------------
crash> ps
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      0      0   0  ffffffff80308b60  RU   0.0       0      0  [swapper]
>     0      1   1  ffff81019ff12100  RU   0.0       0      0  [swapper]
>     0      1   2  ffff81019ff21080  RU   0.0       0      0  [swapper]
      0      1   3  ffff810105b95100  RU   0.0       0      0  [swapper]
      1      0   2  ffff81019ffad7a0  IN   0.0   10352    696  init
      2      1   0  ffff81019ffad040  IN   0.0       0      0  [migration/0]

truncated..........

    668   8231   2  ffff8101311520c0  UN   0.0   79928   2956  smbd
    670      1   1  ffff81019fae7080  IN   0.0   12684    856  udevd
    699   8231   2  ffff81016a473860  IN   0.0   80212   3256  smbd
    949   8231   1  ffff810080513040  IN   0.0   80448   3372  smbd
   1013   8231   2  ffff81016a48d7e0  UN   0.0   79944   2492  smbd
   1352   8231   2  ffff81010a4ed7a0  UN   0.1   80316   3564  smbd
   1629      1   2  ffff81016f28c080  UN   0.3   34076  19624  bpbkar
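
Tasks in the UN (uninterruptible) state are often the ones waiting on I/O or on a lock, so a per-command count of them can help when investigating a hang.  The sketch below works on ps output saved to a text file.  The function name blocked_by_command is hypothetical, and it assumes the column layout shown above with single-word command names.

```shell
#!/bin/sh
# Hypothetical helper: count tasks in the UN state per command name
# from saved crash "ps" output. Reads the named files, or standard
# input when none are given. Output is "count command", largest first.
blocked_by_command() {
    awk '{ sub(/^>/, "") }            # drop the marker on active tasks
         $5 == "UN" { n[$9]++ }       # column 5 is ST, column 9 is COMM
         END { for (c in n) print n[c], c }' "$@" | sort -rn
}

# Example (uncomment to run against a saved listing):
# blocked_by_command ps.txt
```

For the excerpt shown above, this would report three smbd tasks and one bpbkar task in the UN state.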
-----------------------------------------------------------------------------
crash> log (similar to msgbuf -t in Solaris ScAT)
0x19/0x31
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

smbd          D ffff81000901d7a0     0 19766   8231         23626 19409 (NOTLB)
 ffff8101571f9c78 0000000000000082 0000000000000066 ffff8101712298c0
 00000000000029e0 0000000000000009 ffff810115f7c860 ffff810105b95100
 0000ca2ad0249d24 0000000000007ff1 ffff810115f7ca48 00000003571f9da0
Call Trace:
 [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
 [<ffffffff8000cf82>] do_lookup+0x90/0x1e6
 [<ffffffff8000a29c>] __link_path_walk+0xa01/0xf5b
 [<ffffffff8000ea4b>] link_path_walk+0x42/0xb2
 [<ffffffff8000cd72>] do_path_lookup+0x275/0x2f1
 [<ffffffff80012877>] getname+0x15b/0x1c2
 [<ffffffff800239a1>] __user_walk_fd+0x37/0x4c
 [<ffffffff800288d5>] vfs_stat_fd+0x1b/0x4a
 [<ffffffff800236d3>] sys_newstat+0x19/0x31
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

smbd          D ffff81000900caa0     0 23626   8231         31407 19766 (NOTLB)
 ffff8101319dfc78 0000000000000082 0000000000000100 ffff81012ba75280
 0000000000000286 000000000000000a ffff81014ffbb0c0 ffff81019ff12100
 0000cbea05fd5a09 00000000000195f9 ffff81014ffbb2a8 00000001319dfda0
Call Trace:
 [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
 [<ffffffff8000cf82>] do_lookup+0x90/0x1e6
 [<ffffffff8000a29c>] __link_path_walk+0xa01/0xf5b
 [<ffffffff8000ea4b>] link_path_walk+0x42/0xb2
 [<ffffffff8000cd72>] do_path_lookup+0x275/0x2f1
 [<ffffffff80012877>] getname+0x15b/0x1c2
 [<ffffffff800239a1>] __user_walk_fd+0x37/0x4c
 [<ffffffff800288d5>] vfs_stat_fd+0x1b/0x4a
 [<ffffffff800236d3>] sys_newstat+0x19/0x31
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

smbd          D ffff81000901d7a0     0 31407   8231         31749 23626 (NOTLB)
 ffff8101101a5c78 0000000000000082 0000000000000100 ffff8101712292c0
 0000000000000286 0000000000000009 ffff810115f7c100 ffff810105b95100
 0000cbfe634ff507 0000000000004cc9 ffff810115f7c2e8 00000003101a5da0
Call Trace:
 [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
 [<ffffffff8000cf82>] do_lookup+0x90/0x1e6
 [<ffffffff8000a29c>] __link_path_walk+0xa01/0xf5b
 [<ffffffff8000ea4b>] link_path_walk+0x42/0xb2
 [<ffffffff8000cd72>] do_path_lookup+0x275/0x2f1
 [<ffffffff80012877>] getname+0x15b/0x1c2
 [<ffffffff800239a1>] __user_walk_fd+0x37/0x4c
 [<ffffffff800288d5>] vfs_stat_fd+0x1b/0x4a
 [<ffffffff80039f3e>] fcntl_setlk+0x243/0x273
 [<ffffffff800236d3>] sys_newstat+0x19/0x31
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

smbd          D ffff81000900caa0     0 31749   8231         32390 31407 (NOTLB)
 ffff810129381e88 0000000000000082 ffff8101467e3e40 0000000000000246
 ffff8101041c5770 000000000000000a ffff81018eb04860 ffff81019ff12100
 0000cc0e66f92d7f 000000000001d2eb ffff81018eb04a48 0000000100000000
Call Trace:
 [<ffffffff8002588d>] filldir+0x0/0xb7
 [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
 [<ffffffff80035212>] vfs_readdir+0x5c/0xa9
 [<ffffffff80038ae2>] sys_getdents+0x75/0xbd
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

smbd          D ffff81000900caa0     0 32390   8231         32752 31749 (NOTLB)
 ffff8101218bfe88 0000000000000086 ffff81012a080000 0000000000000246
 ffff81012a080000 0000000000000009 ffff81019b716100 ffff81019ff12100
 0000cc1e16118456 00000000000435cf ffff81019b7162e8 0000000100000000
Call Trace:
 [<ffffffff8002ca5a>] mntput_no_expire+0x19/0x89
 [<ffffffff8002588d>] filldir+0x0/0xb7
 [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
 [<ffffffff80035212>] vfs_readdir+0x5c/0xa9
 [<ffffffff80038ae2>] sys_getdents+0x75/0xbd
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
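
When the log contains many call traces like those above, counting how often each function appears can show where tasks are piling up.  This is a rough sketch only: the function name trace_symbols is hypothetical, it assumes the "[<address>] symbol+offset/size" frame format shown above, and the counts are only a guide because traces in this format may include stale stack entries.

```shell
#!/bin/sh
# Hypothetical helper: count how often each symbol appears in the call
# trace frames of saved crash "log" output. Reads the named files, or
# standard input when none are given. Output is "count symbol".
trace_symbols() {
    awk '/^ *\[<[0-9a-f]+>\] / {
             s = $2
             sub(/\+.*/, "", s)       # strip the +offset/size suffix
             n[s]++
         }
         END { for (s in n) print n[s], s }' "$@" | sort -rn
}

# Example (uncomment to run against a saved log, showing the top five):
# trace_symbols log.txt | head -n 5
```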



