DB2 process being falsely detected as offline - reaching ToleranceLimit

Article:TECH194423  |  Created: 2012-08-03  |  Updated: 2012-08-08  |  Article URL http://www.symantec.com/docs/TECH194423
Article Type
Technical Solution


Issue



The Db2udb resource reports an unexpected OFFLINE state because the agent cannot find the db2 process, although ps output collected at about the same time shows that the db2 process exists. Increasing ToleranceLimit does not correct the symptom.

Main error:
Application resources for DB2 reach the ToleranceLimit of 2 and go offline unexpectedly after the upgrade to SFHA 5.1SP1RP2.


Error



 Engine_A log

2012/07/20 07:42:35 VCS INFO V-16-2-13075 (ua1425) Resource(d2iramp2_db) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
*
2012/07/20 07:48:35 VCS INFO V-16-1-10307 Resource d2iramp2_db (Owner: Unspecified, Group: vcs-ram_css_sg) is offline on ua1425 (Not initiated by VCS)
*
2012/07/20 06:21:32 VCS INFO V-16-2-13075 (ua1425) Resource(d2iramp2_db) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
2012/07/20 06:51:34 VCS INFO V-16-2-13075 (ua1425) Resource(d2iramp2_db) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
2012/07/20 06:54:33 VCS INFO V-16-1-10307 Resource d2iramp2_db (Owner: Unspecified, Group: vcs-ram_css_sg) is offline on ua1425 (Not initiated by VCS)

 

With the debug level increased, the agent log clearly shows one of the "false positive" events:
> 2012/07/26 06:19:52 VCS DBG_AGDEBUG V-16-50-0 Thread(1800) Canceling timer for (d2ivedm1_db) op(1608)
> 2012/07/26 06:19:52 VCS DBG_AGINFO V-16-50-0 Thread(1800) Resource d2ivedm1_db transitioning from Monitoring to Online
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(258) name(d2ivedm1_db) op(1607)
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(258) Resetting periodic timer for resource d2ivedm1_db op 1607 to expire at 7853738
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(258) Adding timer for d2ivedm1_db with tmo 7853738
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(258) Appending command minor code 1607 for resource d2ivedm1_db
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(258) Scheduled resource d2ivedm1_db
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(1029) Picked Res(d2ivedm1_db) from Scheduler
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(1029) Resource (d2ivedm1_db) received cmd minor code (MSG_AGI_MONITOR_TIMER)
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(1029) Res(d2ivedm1_db) - _imf_on : (0), _imf_monitor_freq : (1),_leveltwo_monitor_freq : (0), _probe_requested : (0),_monitor_levels : (1), _monitor_count : (1), _monitor_count_overflow : (1)
> 2012/07/26 06:20:51 VCS DBG_AGINFO V-16-50-0 Thread(1029) Resource d2ivedm1_db transitioning from Online to Monitoring
> 2012/07/26 06:20:51 VCS DBG_AGINFO V-16-50-0 Thread(1029) arg[2] is (d2ivedm1)
> 2012/07/26 06:20:51 VCS DBG_AGINFO V-16-50-0 Thread(1029) arg[5] is (/db2/d2ivedm1)
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(1029) Setting TLS key vcs_key_resepstruct with values {ResName=d2ivedm1_db, EpName=monitor, EpEnum=5, ConfLevel=100, MonitorLevel=1}
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(1029) Enabling non-periodic timer for resource(d2ivedm1_db) op(1608) with period(240)
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(1029) Adding timer for d2ivedm1_db with tmo 7853918
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(1029) Calling monitor for resource d2ivedm1_db
> 2012/07/26 06:20:51 VCS DBG_AGDEBUG V-16-50-0 Thread(1029) Value of VCSAgResEPStruct is {ResName=d2ivedm1_db, EpName=monitor, EpEnum=5, ConfLevel=100, MonitorLevel=1}
> 2012/07/26 06:20:52 VCS DBG_AGINFO V-16-50-0 Thread(1029) Resource(d2ivedm1_db) - monitor entry point exited with a confidence value 0.
> 2012/07/26 06:20:52 VCS DBG_AGINFO V-16-50-0 Thread(1029) d2ivedm1_db reported state (Offline) & conf_level (0) <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
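To line these events up with ps output captured at the same time, it can help to pull out only the timestamps at which the monitor reported Offline. The following is a minimal sketch, not a Symantec-supplied tool: the function name offline_reports is hypothetical, and it assumes the agent debug log lines have the layout shown in the excerpt above (date and time in the first two fields).

```shell
#!/bin/sh
# offline_reports is a hypothetical helper, not part of VCS.
# Reads agent debug log lines on stdin and prints the date and time of every
# "reported state (Offline)" line for the resource named in $1.
offline_reports() {
    # Drop a leading "> " quote marker in case the lines were pasted from mail.
    sed 's/^> *//' |
        grep -F "$1 reported state (Offline)" |
        awk '{ print $1, $2 }'
}
```

For example, feeding the agent debug log to `offline_reports d2ivedm1_db` lists one timestamp per Offline report, which can then be compared with the collected ps output.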


Environment



 VCS=5.1.112.0-5.1SP1RP2-2011-09-14

AIX=6100-06


Cause



This issue is similar to a previously reported problem, incident 2699800, which is scheduled to be fixed in 5.1SP1RP3.

The Db2udb agent monitor runs the following command to check for the running db2sysc process:
"/usr/sysv/bin/ps -eo user,pid,args"
For a reason that was not determined, this ps command returned garbled output, which caused the monitor to miss the running instance.
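One way to confirm this on an affected node is to search the output of each ps binary for db2sysc at the same moment. The following is a minimal sketch under stated assumptions: count_db2sysc is a hypothetical helper, the instance owner is passed as its argument, and the matching shown here only approximates what the agent does, since the agent's own parsing is not reproduced in this article.

```shell
#!/bin/sh
# count_db2sysc is a hypothetical helper, not part of the Db2udb agent.
# Reads "ps -eo user,pid,args" style output on stdin and prints how many
# lines are owned by the user in $1 and have db2sysc as the command.
count_db2sysc() {
    awk -v owner="$1" '
        $1 == owner && $3 ~ /db2sysc/ { n++ }
        END { print n + 0 }   # n + 0 forces "0" when nothing matched
    '
}
```

Usage, with the instance owner substituted for `<owner>`: run `/usr/sysv/bin/ps -eo user,pid,args | count_db2sysc <owner>` and `/usr/bin/ps -eo user,pid,args | count_db2sysc <owner>` back to back. A count of 0 from the first command while the second reports a running db2sysc would match the behavior described above.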


Solution



 This issue only applies to AIX systems.

We discussed this with IBM in the past, and they recommended using the /usr/bin/ps command instead.
Example:
bash-3.00# pwd
/opt/VRTSagents/ha/bin/Db2udb
bash-3.00# cp db2config.pm db2config.pm.orig
bash-3.00# vi db2config.pm

Change line 41 from:
PS => "/usr/sysv/bin/ps -eo user,pid,args",
To:
PS => "/usr/bin/ps -eo user,pid,args",

We provided the above workaround to another customer and it resolved the issue.
A permanent fix is scheduled for 5.1SP1RP3.
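Where the same change has to be made on several AIX nodes, the manual edit can be scripted. The following is a minimal sketch, not a supported Symantec procedure: switch_ps_path is a hypothetical helper, it assumes the PS entry in db2config.pm reads exactly as shown above, and it writes through a temporary file because in-place editing with sed is not assumed to be available.

```shell
#!/bin/sh
# switch_ps_path is a hypothetical helper, not shipped with the agent.
# $1 is the path to db2config.pm. A backup is kept as <file>.orig.
switch_ps_path() {
    cfg="$1"
    [ -f "$cfg" ] || { echo "not found: $cfg" >&2; return 1; }
    # Keep the first backup; do not overwrite it if the script is run again.
    [ -f "$cfg.orig" ] || cp -p "$cfg" "$cfg.orig" || return 1
    # Replace only the exact ps command string used by the agent.
    sed 's|/usr/sysv/bin/ps -eo user,pid,args|/usr/bin/ps -eo user,pid,args|' \
        "$cfg" > "$cfg.new" || return 1
    # Copy over the original so its ownership and mode are unchanged.
    cp "$cfg.new" "$cfg" && rm -f "$cfg.new"
}
```

For example, `switch_ps_path /opt/VRTSagents/ha/bin/Db2udb/db2config.pm`, followed by a check that the PS line now reads /usr/bin/ps. To back the change out, copy db2config.pm.orig over db2config.pm.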




