Oracle RMAN application backup job hangs and the automatic job fails with status 6 at 15 minutes after the data transfer completes.

Article:TECH73130  |  Created: 2009-01-04  |  Updated: 2013-03-27  |  Article URL http://www.symantec.com/docs/TECH73130
Article Type
Technical Solution


Issue



Oracle RMAN application backup job hangs and the automatic job fails with status 6 at 15 minutes after the data transfer completes.


Error



<16> readCommFile: ERR - timed out after 900 seconds while reading from


Solution



Overview:
Some Oracle 10g R1 backup operations create unexpected child processes during the backup.  Depending on the timing, this can result in the child inheriting a copy of the open file descriptor for the network sockets connecting the NetBackup dbclient to the media server.  If the child process does not exit promptly, and thereby close it's copy of the file descriptors, those sockets will remain up after the dbclient has closed it's copy of the same file descriptors.  As a result, the NetBackup media server will be unaware that the data transfer has completed.  This will delay server status being written to the comm file and the dbclient will timeout in 900 seconds.

Troubleshooting:
The RMAN output will show that the sbtclose2 API call failed following the successful transfer of the backupset piece to storage.

ORA-27192: skgfcls: sbtclose2 returned error - failed to close file
ORA-19511: Error received from media manager layer, error text:
  VxBSAEndTxn: Failed with error:
  The transaction was aborted.

The dbclient debug log shows the setup of the network sockets using vnetd, followed by an Oracle child process exiting during the data transfer.  

01:40:23.986 [22394] <4> serverResponse: read comm file:<01:40:15 INF - Name socket = mymm.net.IPC:/tmp/vnet-23462245717614913764000000000-jF2wv0;c97275c40f5d40391409b206858309cc;4;3000>
01:40:23.987 [22394] <4> serverResponse: INF - NAME sockfd: 28
...snip...
01:40:23.987 [22394] <4> serverResponse: read comm file:<01:40:16 INF - New data socket = mymm.net.IPC:/tmp/vnet-23460245717614796628000000000-DEwD2T;ed232e16ffa825ddd5c107c10015d23d;4;3000>
01:40:23.988 [22394] <4> serverResponse: INF - NEW DATA sockfd: 29
...snip...
01:40:24.081 [22394] <2> sbtbackup: INF - leaving
01:40:24.088 [22394] <2> int_signal_hdlr: INF - ** SIGCLD received **
01:40:27.202 [22394] <2> int_WriteData: INF - writing buffer # 1 of size 262144

After the data transfer is complete, Oracle returns control to dbclient which finishes the tar header, closes the sockets, and then times out waiting for server status.

01:40:29.355 [22394] <2> sbtclose2: INF - entering
...snip...
01:40:29.356 [22394] <4> finishTarImage: INF - dumped size = 13369344 bytes
01:40:29.356 [22394] <4> finishTarImage: INF - sending image info on nameSock: FIL - 13369344 1 0 13 33200 oracle oinstall 13369344 1245717603 1245717603 1245717603 1 0 0 15 13 0 0 3 1245717603 1 4   Oracle Database Oracle Backup   /1uki8tni_1_1
...snip...
01:40:29.363 [22394] <4> closeApi: INF - EXIT STATUS 0: the requested operation was successfully completed
...snip...
01:40:29.363 [22394] <4> closeApi: INF - closing dataSock 29
01:40:29.363 [22394] <4> closeApi: INF - close of dataSock returned <0> errno is <0>
01:40:29.363 [22394] <4> closeApi: INF - setting linger on nameSock 28
01:40:29.363 [22394] <4> closeApi: INF - closing nameSock 28
01:40:29.363 [22394] <4> closeApi: INF - close of nameSock returned <0> errno is <0>
01:40:29.363 [22394] <4> serverExitStatus: entering serverExitStatus.
01:40:29.363 [22394] <4> readCommMessages: Entering readCommMessages
01:55:35.842 [22394] <16> readCommFile: ERR - timed out after 900 seconds while reading from /usr/openv/netbackup/logs/user_ops/dbext/logs/22394.0.1245717603
01:55:35.842 [22394] <4> closeApi: INF - server EXIT STATUS = -1: invalid error number
...snip...
01:55:35.842 [22394] <2> sbtclose2: INF - leaving

The bpbrm debug log shows that it received the exit status from the client via the NAME socket, but did not receive an exit status from bptm.  It continued to run until the job was canceled, at which point it terminated the bpbrm.child and bptm.

01:40:29.277 [932] <2> bpbrm read_media_msg: read from media manager: WROTE myclient_1245717608 40192 0 2939.515 0
01:40:29.277 [932] <2> bpbrm send_parent_msg: WROTE myclient_1245717608 40192 0 2939.515 0
01:40:29.413 [942] <2> bpbrm handle_backup: ADDED FILES TO DB FOR myclient_1245717611 1 /1uki8tni_1_1
01:40:29.414 [942] <2> bpbrm handle_backup: client myclient EXIT STATUS = 0: the requested operation was successfully completed
01:40:29.506 [942] <2> logconnections: BPDBM CONNECT FROM 2.2.2.2.52841 TO 1.1.1.1.13724
01:40:33.357 [932] <2> bpbrm read_media_msg: read from media manager: WROTE myclient_1245717608 40192 0 4533.017 0
01:40:33.357 [932] <2> bpbrm send_parent_msg: WROTE myclient_1245717608 40192 0 4533.017 0
01:40:34.377 [932] <2> bpbrm read_media_msg: read from media manager: WROTE myclient_1245717608 10706 1 4862.022 0
01:40:34.377 [932] <2> bpbrm send_parent_msg: WROTE myclient_1245717608 10706 1 4862.022 0
01:40:34.377 [932] <2> bpbrm read_media_msg: read from media manager: WROTE myclient_1245717611 13056 0 348.438 0
...snip...
07:10:09.038 [932] <2> bpbrm read_parent_msg: read from parent
07:20:08.937 [932] <2> bpbrm read_parent_msg: read from parent
07:30:08.914 [932] <2> bpbrm read_parent_msg: read from parent
07:40:09.343 [932] <2> bpbrm read_parent_msg: read from parent
07:50:09.251 [932] <2> bpbrm read_parent_msg: read from parent
07:59:30.430 [932] <2> bpbrm read_parent_msg: read from parent STOP BACKUP myclient_1245717611
07:59:30.431 [932] <2> bpbrm kill_bpbrm_child: killing bpbrm child 942.
07:59:30.431 [942] <2> Bpbrm_siginfo_print: 1: delay 0 signo SIGHUP:1 code 0 pid 932
07:59:30.431 [942] <2> bpbrm check_for_terminate: process killed by signal 1
07:59:30.459 [932] <2> Bpbrm_siginfo_print: 4: delay 0 signo SIGCHLD:18 code 1 pid 942
07:59:30.460 [932] <2> bpbrm brm_sigcld: SIGCLD caught by bpbrm
07:59:30.460 [932] <2> bpbrm brm_sigcld: bpbrm child 942 exit_status = 150, signal_status = 0
07:59:30.460 [932] <2> bpbrm brm_sigcld: child 942 exited with status 150: termination requested by administrator
07:59:30.460 [932] <2> bpbrm send_status_to_parent: bpbrm child is done, but the media manager child is not.
07:59:30.460 [932] <2> bpbrm tell_mm: sending media manager msg: STOP BACKUP myclient_1245717611
07:59:30.960 [932] <2> bpbrm read_media_msg: read from media manager: WROTE myclient_1245717611 0 0 0.498 0
07:59:30.960 [932] <2> bpbrm send_parent_msg: WROTE myclient_1245717611 0 0 0.498 0
07:59:31.470 [932] <2> bpbrm read_media_msg: read from media manager: EXIT myclient_1245717611 150

The bptm debug log confirms that the NEW_DATA socket did not close and that it ran until terminated by bpbrm.

01:40:24.188 [937] <2> write_data: received first buffer (262144 bytes), begin writing data
...snip...
07:59:31.088 [937] <2> send_brm_msg: EXIT myclient_1245717611 150
...snip...
07:59:31.129 [937] <2> Media_siginfo_print: 10: delay 0 signo SIGCHLD:18 code 1 pid 940
07:59:31.129 [937] <2> child_wait: SIGCHLD: exit=0, signo=0 core=no, pid=940 (tmcommon.c:5336)

The bptm debug log should have show a log entry similar to this one when the NEW_DATA socket was closed by dbclient.  This example is from another Oracle backup job on that same client and is part of the same MPX group and that did not fork an Oracle child process.

01:40:33.998 [938] <2> fill_buffer: [937] socket is closed, waited for empty buffer 0 times, delayed 0 times, read 93093888 bytes

The vnetd debug log on the client shows the IP addresses and port numbers for the NEW_DATA and NAME sockets when created by the media server and connected to a few seconds later by dbclient.

01:40:14.756 [23460] <2> ProcessRequests: vnetd.c.283: msg: VNETD ACCEPT FROM 2.2.2.2.52829 TO 3.3.3.3.13724 fd = 4
01:40:14.796 [23460] <2> ProcessRequests: vnetd.c.364: msg: Request VN_REQUEST_CONNECT_FORWARD_SOCKET(10)
01:40:14.796 [23460] <2> vnet_generate_hashes: vnet_vnetd.c.1739: hash_str1: ed232e16ffa825ddd5c107c10015d23d
01:40:14.796 [23460] <2> vnet_generate_hashes: vnet_vnetd.c.1740: hash_str2: 8d9ec73ed6aa1f28b0b80bf4d2a9d7e6
01:40:14.796 [23460] <2> process_connect_forward_socket: vnetd.c.1911: ipc_string: /tmp/vnet-23460245717614796628000000000-DEwD2T
#
01:40:14.872 [23462] <2> ProcessRequests: vnetd.c.283: msg: VNETD ACCEPT FROM 2.2.2.2.52831 TO 3.3.3.3.13724 fd = 4
01:40:14.913 [23462] <2> ProcessRequests: vnetd.c.364: msg: Request VN_REQUEST_CONNECT_FORWARD_SOCKET(10)
01:40:14.913 [23462] <2> vnet_generate_hashes: vnet_vnetd.c.1739: hash_str1: c97275c40f5d40391409b206858309cc
01:40:14.913 [23462] <2> vnet_generate_hashes: vnet_vnetd.c.1740: hash_str2: daf1e6464bf2beee95d20ca275bc433d
01:40:14.913 [23462] <2> process_connect_forward_socket: vnetd.c.1911: ipc_string: /tmp/vnet-23462245717614913764000000000-jF2wv0
...snip...
01:40:23.986 [23462] <2> vnet_send_network_socket: vnet_vnetd.c.1535: hash_str2: daf1e6464bf2beee95d20ca275bc433d
01:40:23.987 [23462] <4> main: Terminating with status 0
#
01:40:23.987 [23460] <2> vnet_send_network_socket: vnet_vnetd.c.1535: hash_str2: 8d9ec73ed6aa1f28b0b80bf4d2a9d7e6
01:40:23.988 [23460] <4> main: Terminating with status 0

The netstat output from the client host confirms that some other client side process was holding the sockets open after dbclient exited.  

Tue Jun 23 01:39:05 BST 2009
Tue Jun 23 01:39:35 BST 2009
Tue Jun 23 01:40:05 BST 2009
Tue Jun 23 01:40:35 BST 2009
tcp        0      0 3.3.3.3:13724           2.2.2.2:52831          ESTABLISHED
tcp        0      0 3.3.3.3:13724           2.2.2.2:52829          ESTABLISHED
Tue Jun 23 01:41:05 BST 2009
tcp        0      0 3.3.3.3:13724           2.2.2.2:52831          ESTABLISHED
tcp        0      0 3.3.3.3:13724           2.2.2.2:52829          ESTABLISHED
Tue Jun 23 01:41:35 BST 2009
tcp        0      0 3.3.3.3:13724           2.2.2.2:52831          ESTABLISHED
tcp        0      0 3.3.3.3:13724           2.2.2.2:52829          ESTABLISHED
...snip, dbclient closed the sockets here but some other process is holding them open...
Tue Jun 23 01:51:08 BST 2009
tcp        0      0 3.3.3.3:13724           2.2.2.2:52831          ESTABLISHED
tcp        0      0 3.3.3.3:13724           2.2.2.2:52829          ESTABLISHED
Tue Jun 23 01:51:38 BST 2009
tcp        0      0 3.3.3.3:13724           2.2.2.2:52831          ESTABLISHED
tcp        0      0 3.3.3.3:13724           2.2.2.2:52829          ESTABLISHED
Tue Jun 23 01:52:08 BST 2009
tcp        0      0 3.3.3.3:13724           2.2.2.2:52831          ESTABLISHED
tcp        0      0 3.3.3.3:13724           2.2.2.2:52829          ESTABLISHED
...snip, many hours later the sockets finally close when the Oracle child process finally exits.
Tue Jun 23 07:05:50 BST 2009
tcp        0      0 3.3.3.3:13724           2.2.2.2:52831          ESTABLISHED
tcp        0      0 3.3.3.3:13724           2.2.2.2:52829          ESTABLISHED
Tue Jun 23 07:06:20 BST 2009
tcp        0      0 3.3.3.3:13724           2.2.2.2:52831          FIN_WAIT2  
tcp        0      1 3.3.3.3:13724           2.2.2.2:52829          FIN_WAIT1  
Tue Jun 23 07:06:50 BST 2009
tcp        0      0 3.3.3.3:13724           2.2.2.2:52831          FIN_WAIT2  
tcp        0      1 3.3.3.3:13724           2.2.2.2:52829          FIN_WAIT1  
Tue Jun 23 07:07:20 BST 2009
tcp        0      1 3.3.3.3:13724           2.2.2.2:52829          FIN_WAIT1  
Tue Jun 23 07:07:50 BST 2009
Tue Jun 23 07:08:20 BST 2009
Tue Jun 23 07:08:50 BST 2009

The netstat output from the media server confirms the same behavior.

Tue Jun 23 01:39:37 BST 2009
Tue Jun 23 01:40:07 BST 2009
Tue Jun 23 01:40:37 BST 2009
2.2.2.2.52829   3.3.3.3.13724     5840      0 64860      0 ESTABLISHED
2.2.2.2.52831   3.3.3.3.13724     5840      0 49680      0 ESTABLISHED
Tue Jun 23 01:41:08 BST 2009
2.2.2.2.52829   3.3.3.3.13724     5840      0 64860      0 ESTABLISHED
2.2.2.2.52831   3.3.3.3.13724     5840      0 49680      0 ESTABLISHED
Tue Jun 23 01:41:38 BST 2009
2.2.2.2.52829   3.3.3.3.13724     5840      0 64860      0 ESTABLISHED
2.2.2.2.52831   3.3.3.3.13724     5840      0 49680      0 ESTABLISHED
...snip, dbclient closed the sockets here, but some other process is holding them open.
Tue Jun 23 07:04:30 BST 2009
2.2.2.2.52829   3.3.3.3.13724     5840      0 64860      0 ESTABLISHED
2.2.2.2.52831   3.3.3.3.13724     5840      0 49680      0 ESTABLISHED
Tue Jun 23 07:05:01 BST 2009
2.2.2.2.52829   3.3.3.3.13724     5840      0 64860      0 ESTABLISHED
2.2.2.2.52831   3.3.3.3.13724     5840      0 49680      0 ESTABLISHED
Tue Jun 23 07:05:31 BST 2009
2.2.2.2.52829   3.3.3.3.13724     5840      0 64860      0 ESTABLISHED
2.2.2.2.52831   3.3.3.3.13724     5840      0 49680      0 ESTABLISHED
Tue Jun 23 07:06:01 BST 2009
2.2.2.2.52829   3.3.3.3.13724     5840      0 64860      0 ESTABLISHED
2.2.2.2.52831   3.3.3.3.13724     5840      0 49680      0 CLOSE_WAIT
...snip, the sockets closed on the media server when bpbrm exits.
Tue Jun 23 07:58:33 BST 2009
2.2.2.2.52829   3.3.3.3.13724     5840      0 64860      0 ESTABLISHED
2.2.2.2.52831   3.3.3.3.13724     5840      0 49680      0 CLOSE_WAIT
Tue Jun 23 07:59:03 BST 2009
2.2.2.2.52829   3.3.3.3.13724     5840      0 64860      0 ESTABLISHED
2.2.2.2.52831   3.3.3.3.13724     5840      0 49680      0 CLOSE_WAIT
Tue Jun 23 07:59:33 BST 2009
Tue Jun 23 08:00:04 BST 2009
Tue Jun 23 08:00:34 BST 2009

An overview of the dbclient debug log confirms that if Oracle creates child processes outside of the data transfer, they do not inherit file descriptors for any of the sockets that NetBackup is using.  However, if the Oracle child process is created between the end of the sbtbackup and the beginning of the sbtclose then it inherits the file descriptors and prevents the sockets from closing.

This Oracle backup process completed several successful backups without experiencing a SIGCLD.

01:39:39.806 [22385] <2> sbtbackup: INF - leaving
01:39:41.040 [22385] <2> sbtclose2: INF - entering
01:39:41.043 [22385] <4> closeApi: INF - EXIT STATUS 0: the requested operation was successfully completed
01:39:51.053 [22385] <2> sbtclose2: INF - leaving
01:39:51.053 [22385] <2> sbtinfo2: INF - entering
01:39:51.053 [22385] <2> sbtinfo2: INF - requesting image info for <c-1290166513-20090623-00>
01:39:51.053 [22385] <4> closeApi: INF - server EXIT STATUS = 0: the requested operation was successfully completed
01:39:51.272 [22385] <2> sbtinfo2: INF - leaving
01:40:03.913 [22385] <2> sbtbackup: INF - entering
01:40:24.080 [22385] <2> sbtbackup: INF - leaving
01:40:34.000 [22385] <2> sbtclose2: INF - entering
01:40:34.003 [22385] <4> closeApi: INF - EXIT STATUS 0: the requested operation was successfully completed
01:40:44.014 [22385] <2> sbtclose2: INF - leaving
01:40:44.014 [22385] <2> sbtinfo2: INF - entering
01:40:44.014 [22385] <2> sbtinfo2: INF - requesting image info for <1tki8tni_1_1>
01:40:44.014 [22385] <4> closeApi: INF - server EXIT STATUS = 0: the requested operation was successfully completed
01:40:44.604 [22385] <2> sbtinfo2: INF - leaving
01:55:45.126 [22385] <2> sbtend: INF - entering
01:55:45.126 [22385] <2> sbtend: INF - leaving
01:55:45.126 [22385] <4> sbtend: INF - --- END of SESSION ---

This Oracle backup process performed one successful backup, but experienced a SIGCLD between the next sbtbackup and sbtclose2 (i.e. while a data transfer is in progress with network sockets open) which subsequently caused the backup to fail when the sockets did not close.

01:38:55.948 [22394] <2> sbtclose2: INF - entering
01:38:55.951 [22394] <4> closeApi: INF - EXIT STATUS 0: the requested operation was successfully completed
01:39:05.963 [22394] <2> sbtclose2: INF - leaving
01:39:05.963 [22394] <2> sbtinfo2: INF - entering
01:39:05.963 [22394] <2> sbtinfo2: INF - requesting image info for <1oki869k_1_1>
01:39:05.963 [22394] <4> closeApi: INF - server EXIT STATUS = 0: the requested operation was successfully completed
01:39:06.600 [22394] <2> sbtinfo2: INF - leaving
01:40:03.924 [22394] <2> sbtbackup: INF - entering
01:40:24.081 [22394] <2> sbtbackup: INF - leaving
01:40:24.088 [22394] <2> int_signal_hdlr: INF - ** SIGCLD received **
01:40:29.355 [22394] <2> sbtclose2: INF - entering
01:40:29.363 [22394] <4> closeApi: INF - EXIT STATUS 0: the requested operation was successfully completed
01:55:35.842 [22394] <2> sbtclose2: INF - leaving
01:55:35.842 [22394] <2> sbterror: INF - Error=0: VxBSAEndTxn: Failed with error:
01:55:35.842 [22394] <4> closeApi: INF - server EXIT STATUS = -1: invalid error number

This Oracle process did several successful backups.  The SIGCLD occurred before the data transfer and hence does not cause problems.

07:00:34.471 [11175] <2> sbtbackup: INF - leaving
07:00:38.075 [11175] <2> sbtclose2: INF - entering
07:00:38.079 [11175] <4> closeApi: INF - EXIT STATUS 0: the requested operation was successfully completed
07:00:48.091 [11175] <2> sbtclose2: INF - leaving
07:00:48.091 [11175] <2> sbtinfo2: INF - entering
07:00:48.091 [11175] <2> sbtinfo2: INF - requesting image info for <1vki9gft_1_1>
07:00:48.091 [11175] <4> closeApi: INF - server EXIT STATUS = 0: the requested operation was successfully completed
07:00:49.059 [11175] <2> sbtinfo2: INF - leaving
07:00:49.319 [11175] <2> int_signal_hdlr: INF - ** SIGCLD received **
07:00:56.494 [11175] <2> sbtbackup: INF - entering
07:01:11.658 [11175] <2> sbtbackup: INF - leaving
07:01:12.901 [11175] <2> sbtclose2: INF - entering
07:01:12.905 [11175] <4> closeApi: INF - EXIT STATUS 0: the requested operation was successfully completed
07:01:22.915 [11175] <2> sbtclose2: INF - leaving
07:01:22.915 [11175] <2> sbtinfo2: INF - entering
07:01:22.915 [11175] <2> sbtinfo2: INF - requesting image info for <c-1290166513-20090623-01>
07:01:22.915 [11175] <4> closeApi: INF - server EXIT STATUS = 0: the requested operation was successfully completed
07:01:23.127 [11175] <2> sbtinfo2: INF - leaving

The pfiles and lsof operating system commands were used to identify the Oracle child process that was holding file descriptors and keeping the sockets from closing.  The process was named 'ora_o000_<SID>' and related to Oracle ASM.  The process was observed to continue running for 4-6 hours after the backup had otherwise completed, before exiting and closing the sockets at which time bptm and bpbrm exited with status 0, but the client side had already timed out and failed.

Resolution:
Four sites have contacted Oracle support and have reported the problem resolved after upgrading Oracle components.  The root cause is believed related to MetaLink document 7225720 and reportedly resolved in ASM versions 10.2.0.4.1, 10.2.0.5, 11.1.0.7.1, and 11.2.0.1.

However, another site encountered the same problem on Oracle 11.2.0.2 with ASM 11.2.0.2.

The NetBackup 7.5.0.5 client contains a design change that minimizes the opportunity for this issue to occur.   Previously, the Oracle child process would inherit the socket if forked/created between sbtbackup and sbtclose processing.  That window has been narrowed to a few milliseconds during sbtbackup processing.
 

Supplemental Materials

SourceETrack
Value2717058
Description

(enhancement) Status 6 or status 42 because dbclient sockets are not isolated from Oracle child processes


Legacy ID



329338


Article URL http://www.symantec.com/docs/TECH73130


Terms of use for this information are found in Legal Notices