RAC cluster panic, duplicate "init.cssd" processes running

Article:TECH195616  |  Created: 2012-08-24  |  Updated: 2012-09-01  |  Article URL http://www.symantec.com/docs/TECH195616
Article Type
Technical Solution


Issue



When the cluster engine was started, the node rebooted.

Analysis of the crash dump showed duplicate "init.cssd" processes running.


Error



 DIAGNOSTIC STEPS:

core file: /evidence/mtv/51/419-015-751/2012-08-15/vmcore.0
user: Super-User (root:0)
release: 5.10 (64-bit)
version: Generic_142909-17
machine: sun4v
node name: usbdc3me003
hw_provider: Sun_Microsystems
system type: SUNW,T5240 (UltraSPARC-T2+)
hostid: 8534c92e
dump_conflags: 0x10000 (DUMP_KERNEL) on /dev/dsk/c1t0d0s1(16G)
time of crash: Wed Aug 15 16:28:13 GMT 2012
age of system: 653 days 13 hours 54 minutes 1.78 seconds
panic CPU: 23 (64 CPUs, 31.7G memory, 1 nodes)
panic string: forced crash dump initiated at user request

------------------------------

panic string: forced crash dump initiated at user request
==== panic user (LWP_SYS) thread: 0x30078b48860 PID: 2047 on CPU: 23 affinity CPU: 23 ====
cmd: /sbin/uadmin 5 1
t_procp: 0x3005c7db2f0
p_as: 0x300a56cd6b0 size: 2686976 RSS: 1662976
hat: 0x30071636480
cnum: CPU16:19475/5986
cpusran: 23
zone: global
t_stk: 0x2a104c99ae0 sp: 0x2a104c990b1 t_stkbase: 0x2a104c94000
t_pri: 59(TS) t_tid: 1 pctcpu: 0.058453
t_lwp: 0x3007e761650 machpcb: 0x2a104c99ae0
mstate: LMS_SYSTEM ms_prev: LMS_USER
ms_state_start: 0.570323688 seconds later
ms_start: 0.549774969 seconds later
psrset: 0 last CPU: 23
idle: 2 ticks (0.02 seconds)
start: Wed Aug 15 16:28:13 2012
age: 0 seconds (0 seconds)
syscall: #55 uadmin(, 0xffbffb18) (sysent: genunix:uadmin+0x0)
tstate: TS_ONPROC - thread is being run on a processor
tflg: T_PANIC - thread initiated a system panic
T_DFLTSTK - stack is default size
tpflg: TP_TWAIT - wait to be freed by lwp_wait
TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
pflag: SMSACCT - process is keeping micro-state accounting
SMSFORK - child inherits micro-state accounting

pc: unix:panic+0x1c: call unix:vpanic

unix:panic+0x1c(0x1299240, 0x1202400, 0x1, 0x183f800, 0x183f800, 0x0)
genunix:kadmin+0x544(, 0x1, 0x0, 0x60031c17d98)
genunix:uadmin+0x11c(, 0x1)
unix:syscall_trap32+0xcc()
-- switch to user thread's user stack --

------------------------------

Walking the parent process ID (PPID) tree:

CAT(vmcore.0/10V)> proc -t 2047
addr PID PPID RUID/UID size RSS swresv time command
============= ====== ====== ========== ========== ======== ======== ====== =========
0x3005c7db2f0 2047 1636 0 2686976 1662976 311296 1 /sbin/uadmin 5 1
thread: 0x30078b48860 state: onpr wchan: 0x0 sobj: undefined
idle: 2 ticks (0.02 seconds)


CAT(vmcore.0/10V)> proc 1636
addr PID PPID RUID/UID size RSS swresv time command
============= ====== ====== ========== ========== ======== ======== ====== =========
0x300fcc870b0 1636 6442 0 1900544 1630208 286720 13 /bin/sh /etc/init.d/init.cssd daemon
thread: 0x30053e055e0 state: slp wchan: 0x300fcc87170 sobj: condition var (from genunix:waitid+0x484)

CAT(vmcore.0/10V)> proc -t 6442
addr PID PPID RUID/UID size RSS swresv time command
============= ====== ====== ========== ========== ======== ======== ====== =========
0x300743faf18 6442 1 0 1900544 1343488 286720 9 /bin/sh /etc/init.d/init.cssd fatal
thread: 0x3005483aee0 state: slp wchan: 0x300743fafd8 sobj: condition var (from genunix:waitid+0x484)
idle: 22 ticks (0.22 seconds)
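
The same walk can be sketched offline once the PID/PPID pairs have been copied out of the dump. The following is a minimal illustration only: the PROCS table is transcribed by hand from the "proc" output above, and the function name ancestry is hypothetical, not part of CAT.

```python
# PID -> (PPID, command), transcribed by hand from the CAT "proc" output above.
PROCS = {
    2047: (1636, "/sbin/uadmin 5 1"),
    1636: (6442, "/bin/sh /etc/init.d/init.cssd daemon"),
    6442: (1, "/bin/sh /etc/init.d/init.cssd fatal"),
}


def ancestry(pid, procs):
    """Return [(pid, command), ...] from pid upward, stopping at the first
    parent that is not present in procs (here: init, PID 1)."""
    chain = []
    seen = set()  # guards against a malformed table that contains a loop
    while pid in procs and pid not in seen:
        seen.add(pid)
        ppid, cmd = procs[pid]
        chain.append((pid, cmd))
        pid = ppid
    return chain


if __name__ == "__main__":
    for pid, cmd in ancestry(2047, PROCS):
        print(pid, cmd)
```

Starting from the panic thread's process (PID 2047), the chain passes through both init.cssd instances before reaching init.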

------------------------------

No busy devices:

CAT(vmcore.0/10V)> dev busy

Scanning for busy devices:
No busy/hanging devices found
Scanning for threads in biowait:

no threads in biowait() found.

Scanning for procs with aio:
CAT(vmcore.0/10V)>

------------------------------
CAT(vmcore.0/10V)> tlist pinned
==== user (LWP_USER) thread: 0x300863c74e0 PID: 14670 on CPU: 8 ====
cmd: ./SunOS/device_config.SunOS
t_procp: 0x300bfd0fa48
p_as: 0x300ae441de8 size: 4259840 RSS: 1875968
hat: 0x3005171e940
cnum: CPU0:88420/97 CPU8:27872/55 CPU16:19475/35 CPU24:20629/53 CPU64:88584/312 CPU72:34825/83 CPU80:30403/110 CPU88:29733/135
cpusran: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95
zone: global
t_stk: 0x2a100137ae0 sp: 0x2a1001372e1 t_stkbase: 0x2a100132000
t_pri: 0(TS) t_tid: 1 pctcpu: 99.992119
t_lwp: 0x3005cd51898 machpcb: 0x2a100137ae0
mstate: LMS_USER ms_prev: LMS_SYSTEM
ms_state_start: 0.570387612 seconds later
ms_start: 390 days 15 hours 14 minutes 16.315036382 seconds earlier
psrset: 0 last CPU: 8
idle: 35 ticks (0.35 seconds)
start: Fri Jul 22 01:08:06 2011
age: 33751207 seconds (390 days 15 hours 20 minutes 7 seconds)
tstate: TS_ONPROC - thread is being run on a processor
tflg: T_DFLTSTK - stack is default size
tpflg: TP_TWAIT - wait to be freed by lwp_wait
TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
TS_SIGNALLED - thread was awakened by cv_signal()
pflag: SMSACCT - process is keeping micro-state accounting
SMSFORK - child inherits micro-state accounting

pc: unix:utl0+0x4c: jmpl %l3, %o7 ( call %l3 )

unix:user_rtt+0x0()
-- switch to user thread's user stack --


1 pinned thread found.

------------------------------


Cause



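Two instances of /etc/init.d/init.cssd were running when the node panicked. The process that requested the forced crash dump, /sbin/uadmin 5 1 (PID 2047), is a child of "init.cssd daemon" (PID 1636), which in turn is a child of "init.cssd fatal" (PID 6442):

CAT(vmcore.0/10V)> proc 1636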

addr PID PPID RUID/UID size RSS swresv time command
============= ====== ====== ========== ========== ======== ======== ====== =========
0x300fcc870b0 1636 6442 0 1900544 1630208 286720 13 /bin/sh /etc/init.d/init.cssd daemon
thread: 0x30053e055e0 state: slp wchan: 0x300fcc87170 sobj: condition var (from genunix:waitid+0x484)

CAT(vmcore.0/10V)> proc -t 6442
addr PID PPID RUID/UID size RSS swresv time command
============= ====== ====== ========== ========== ======== ======== ====== =========
0x300743faf18 6442 1 0 1900544 1343488 286720 9 /bin/sh /etc/init.d/init.cssd fatal
thread: 0x3005483aee0 state: slp wchan: 0x300743fafd8 sobj: condition var (from genunix:waitid+0x484)
idle: 22 ticks (0.22 seconds)


Solution



The customer confirmed that init.cssd was indeed running twice when the server crashed:
1) Started as part of VCS startup (hastart) on Aug 15th,
2) Started manually by someone at an earlier time, outside VCS control.
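
A check for this condition on a live node can be sketched as follows. This is an illustration under stated assumptions, not a Symantec or Oracle tool: it assumes the process list was captured as text with a command such as "ps -eo pid,ppid,args", the function name find_init_cssd is hypothetical, and the SAMPLE text is built from the PIDs in the crash dump rather than from real ps output.

```python
def find_init_cssd(ps_text, needle="init.cssd"):
    """Return [(pid, ppid, args), ...] for every process whose command line
    contains needle. ps_text is expected to hold PID, PPID and the command
    line in that order, one process per line."""
    matches = []
    for line in ps_text.splitlines():
        fields = line.split(None, 2)
        # Skip the header line and anything that is not "PID PPID ARGS".
        if len(fields) < 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        pid, ppid, args = int(fields[0]), int(fields[1]), fields[2]
        if needle in args:
            matches.append((pid, ppid, args))
    return matches


# Illustrative input only, assembled from the PIDs seen in the crash dump.
SAMPLE = """\
  PID  PPID COMMAND
 6442     1 /bin/sh /etc/init.d/init.cssd fatal
 1636  6442 /bin/sh /etc/init.d/init.cssd daemon
 2047  1636 /sbin/uadmin 5 1
"""

if __name__ == "__main__":
    for pid, ppid, args in find_init_cssd(SAMPLE):
        print(pid, ppid, args)
```

The sketch only lists the instances and their parents. Whether a given set of entries is expected depends on how init.cssd is configured to start on the node, so any entry that cannot be traced back to the VCS startup is the one to investigate.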



