Oracle Agent may hang in rare cases when NIS or LDAP is used for user authentication

Article:TECH170998  |  Created: 2011-10-03  |  Updated: 2013-01-01  |  Article URL http://www.symantec.com/docs/TECH170998
Article Type
Technical Solution


Issue



When NIS or LDAP is used for Oracle user authentication, under rare conditions the agent may hang in the getpwnam_r() library call and stop managing resources. The agent heartbeat continues to work, so HAD does not restart the agent.


Error



The agent's monitor or online entry points neither finish nor time out.

2011/09/09 01:36:17 VCS NOTICE V-16-1-10301 Initiating Online of Resource Ora_Oracle (Owner: Unspecified, Group: OraGrp) on System node01

Environment



VCS with LDAP or NIS used for Oracle user authentication


Cause



The suspected cause is an intermittent LDAP or NIS problem that makes the getpwnam_r() function hang when the agent looks up the Oracle user. Because the thread is blocked inside an operating system library call, the agent cannot cancel it.

In one customer case, the agent was calling getpwnam_r() within the monitor entry point on Solaris, and the implementation of getpwnam_r() disabled thread cancellation before blocking. Because the NIS server was having problems, the getpwnam_r() call was stuck and all service threads became dysfunctional. The Agent Framework could not cancel them because getpwnam_r() had disabled cancellation internally. The agent was effectively hung, but because the timer thread was still heartbeating successfully with the engine, no problem was detected.
Normally, blocking calls should not disable cancellation.
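
To check whether the name service is currently answering, the same kind of user lookup can be run from a shell under a time limit. The following is a minimal sketch and not part of the original case analysis: it assumes that the getent utility and the GNU coreutils timeout command are available (timeout is not installed on every platform, for example older Solaris releases), and the function name check_user_lookup and the 10-second default are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: look up a user through the configured name services
# (files, NIS, LDAP, ...) and give up after a time limit instead of blocking.
# Usage: check_user_lookup USER [SECONDS]
check_user_lookup() {
    user=$1
    limit=${2:-10}
    rc=0
    # Assumes GNU timeout, which exits with status 124 when the limit is hit.
    timeout "$limit" getent passwd "$user" >/dev/null 2>&1 || rc=$?
    case $rc in
        0)   echo "OK: lookup of $user returned" ;;
        124) echo "HANG: lookup of $user did not return within ${limit}s" ;;
        *)   echo "FAIL: lookup of $user returned status $rc" ;;
    esac
    return $rc
}
```

For example, check_user_lookup oracle 10 reports HANG if the lookup does not return within 10 seconds, which would be consistent with the name service problem described above. A successful lookup at the time of the check does not rule out an earlier intermittent failure.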


Solution



Workaround:

To recover from this condition, find the Oracle Agent process ID (PID), kill the process, and restart the agent using haagent -start.

The hastop command will not work because the agent is not responsive.

1. Find the Oracle Agent process

# ps -ef|grep OracleAgent

4 S root     18708     1  0  75   0 -  4209 stext  Mar31 ?        00:01:04 /opt/VRTSagents/ha/bin/Oracle/OracleAgent -type Oracle -agdir /opt/VRTSagents/ha/bin/Oracle

2. Kill the OracleAgent process

# kill -9 18708

3. Restart the agent

# haagent -start Oracle -sys node01
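
The three steps above can be combined into a small script. This is a minimal sketch under stated assumptions, not a supported tool: the function names find_oracle_agent_pids and restart_oracle_agent are hypothetical, the PID is read from the second column of ps -ef output, and node01 is the example system name used in this article. Confirm the PID before killing anything on a production cluster.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -ef` output on stdin and print the PID
# (second column) of every process whose command path ends in /OracleAgent.
# Scanning from field 8 onward tolerates a start time printed as "Mar 31".
# The "grep OracleAgent" process itself is not matched (no leading slash).
find_oracle_agent_pids() {
    awk '{ for (i = 8; i <= NF; i++) if ($i ~ /\/OracleAgent$/) { print $2; next } }'
}

# Hypothetical helper: kill the hung agent, then restart it on the given system.
# Usage: restart_oracle_agent SYSTEM
restart_oracle_agent() {
    sys=$1
    pids=$(ps -ef | find_oracle_agent_pids)
    if [ -z "$pids" ]; then
        echo "No OracleAgent process found" >&2
        return 1
    fi
    for pid in $pids; do
        echo "Killing OracleAgent PID $pid"
        kill -9 "$pid"
    done
    # Same restart command as step 3 above.
    haagent -start Oracle -sys "$sys"
}

# Example (run as root on the affected node):
#   restart_oracle_agent node01
```

The block only defines the functions; nothing is killed or restarted until restart_oracle_agent is called explicitly.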

 


Supplemental Materials

Source: ETrack
Value: 1442255
Description: AGFW should detect if ALL service threads hang inside a C entry point and cannot be canceled successfully
 



