Application backups fails with status 6 or status 42 when extjob.exe inherits NetBackup sockets
| Article:TECH189853 | | | Created: 2012-05-28 | | | Updated: 2013-03-27 | | | Article URL http://www.symantec.com/docs/TECH189853 |
Problem
Oracle RMAN application backup jobs fail intermittently. Sometimes with a status 42 near the end of the data transfer, or sometimes with a status 6 approximately 15 minutes after the data transfer appears to complete.
The same backup can be rerun and will complete successfully.
Error
The RMAN output shows that the sbtbackup and sbtwrite operations completed the transfer of the backupset piece to the media management layer, but that the sbtclose API call failed to close the image successfully.
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ch00 channel at 02/15/2012 16:44:07
ORA-27192: skgfcls: sbtclose2 returned error - failed to close file
ORA-19511: Error received from media manager layer, error text:
VxBSAEndTxn: Failed with error:
The transaction was aborted.
The dbclient debug log identifies the vnetd hash strings for the Data/Name/New Data sockets, when the New Data socket is connected, and when the data transfer starts over the New Data socket to bptm.
16:42:40.672 [4228.12636] <4> sendRequest: sending BACKUP request to bprd
16:42:40.672 [4228.12636] <4> sendRequest: request = SYSTEM SYSTEM ... user_ops/dbext/logs/12636.0.1330558960 ...
...snip...
16:43:00.937 [4228.12636] <4> serverResponse: 16:42:52 INF - Data socket = mymm.IPC:61310;014c6b27b64fd77bbea296d67cb235c7;4;300
16:43:00.937 [4228.12636] <4> serverResponse: 16:42:52 INF - Name socket = mymm.IPC:61311;686e9f7f98835e5400ad404e5002bf8f;4;300
16:43:00.937 [4228.12636] <4> serverResponse: 16:42:52 INF - Job id = 28038
16:43:00.937 [4228.12636] <4> serverResponse: 16:42:52 INF - Backup id = myclient_1330558966
...snip..
16:43:00.937 [4228.12636] <4> serverResponse: 16:42:54 INF - New data socket = mymm.IPC:61307;5831783f6099df0ccf36cbafd45b3188;4;300
16:43:00.937 [4228.12636] <4> serverResponse: INF - NEW DATA sockfd: 4760
...snip...
16:43:21.045 [4228.12636] <2> int_WriteData: INF - writing buffer # 1 of size 262144
The backupset piece is written to the New Data socket (bound for bptm) in less than 2 seconds and then the meta data about the piece and client status is written to the Name socket (bound for bpbrm) and all three sockets are closed without any error indications from the client host O/S. After a 45 second delay, a server status 42 is unexpectedly found in the comm file and the failure is returned to Oracle.
16:43:22.168 [4228.12636] <4> finishTarImage: INF - dumped size = 786432 bytes
16:43:22.168 [4228.12636] <4> finishTarImage: INF - sending image info on nameSock: FIL - 786432 1 0 15 33152 SYSTEM SYSTEM 786432 1330558960 1330558960 1330558960 1 0 0 15 13 0 0 3 1330558960 1 4 Oracle Database Oracle Backup /arch-s14879-p1
16:43:22.168 [4228.12636] <4> closeApi: INF - EXIT STATUS 0: the requested operation was successfully completed
16:43:22.168 [4228.12636] <4> closeApi: INF - closing commSock 3984
16:43:22.168 [4228.12636] <4> closeApi: INF - closing dataSock 4760
16:43:22.527 [4228.12636] <4> closeApi: INF - setting linger on nameSock 2936
16:43:22.527 [4228.12636] <4> closeApi: INF - closing nameSock 2936
16:43:22.652 [4228.12636] <4> closeApi: serverReadTimeout is <900> seconds
16:43:22.652 [4228.12636] <4> serverExitStatus: entering serverExitStatus.
16:43:22.652 [4228.12636] <4> readCommMessages: Entering readCommMessages
16:44:07.720 [4228.12636] <4> serverExitStatus: read from comm file:<16:44:00 INF - Server status = 42>
16:44:07.720 [4228.12636] <4> closeApi: INF - server EXIT STATUS = 42: network read failed
...snip...
16:44:07.720 [4228.12636] <2> sbtclose2: INF - leaving
16:44:07.720 [4228.12636] <2> sbterror: INF - entering
16:44:07.720 [4228.12636] <2> sbterror: INF - Error=0: VxBSAEndTxn: Failed with error:
16:44:07.720 [4228.12636] <2> sbterror: INF - leaving
The vnetd debug log on the client shows the TCP IP addresses and port numbers associated with the New Data, Data, and Name sockets respectively in the order created. Each has a hash string which matches the dbclient (and bpbrm/bptm) log.
16:42:49.034 [12536.13340] <2> logparams: D:\Veritas\NETBAC~1\bin\VNETD.EXE
16:42:49.049 [12536.13340] <2> ProcessRequests: ... VNETD ACCEPT FROM 2.2.2.2.2739 TO 3.3.3.3.13724 fd = 548
16:42:49.268 [12536.13340] <2> ProcessRequests: ... Request VN_REQUEST_CONNECT_FORWARD_SOCKET(10)
16:42:49.268 [12536.13340] <2> vnet_generate_hashes: ... hash_str1: 5831783f6099df0ccf36cbafd45b3188
...snip...
16:42:49.439 [11156.11092] <2> logparams: D:\Veritas\NETBAC~1\bin\VNETD.EXE
16:42:49.455 [11156.11092] <2> ProcessRequests: ... VNETD ACCEPT FROM 2.2.2.2.2740 TO 3.3.3.3.13724 fd = 404
16:42:49.658 [11156.11092] <2> ProcessRequests: ... Request VN_REQUEST_CONNECT_FORWARD_SOCKET(10)
16:42:49.658 [11156.11092] <2> vnet_generate_hashes: ... hash_str1: 014c6b27b64fd77bbea296d67cb235c7
...snip...
16:42:49.876 [14000.13316] <2> logparams: D:\Veritas\NETBAC~1\bin\VNETD.EXE
16:42:49.892 [14000.13316] <2> ProcessRequests: ... VNETD ACCEPT FROM 2.2.2.2.2741 TO 3.3.3.3.13724 fd = 548
16:42:50.110 [14000.13316] <2> ProcessRequests: ... Request VN_REQUEST_CONNECT_FORWARD_SOCKET(10)
16:42:50.110 [14000.13316] <2> vnet_generate_hashes: ... hash_str1: 686e9f7f98835e5400ad404e5002bf8f
The bpbrm debug log confirms the TCP IP addresses and port numbers associated with the New Data, Data, and Name sockets respectively in the order created. If the backup had been multiplexed, the New Data socket creation would have been in the bptm debug log.
16:42:47.637 [2372.5204] <2> logparams: -backup ... -b myclient_1330558966 -jobid 28038 ... user_ops/dbext/logs/12636.0.1330558960 ...
...snip...
16:42:49.450 [2372.5204] <2> vnet_connect_to_vnetd_extra: ... VNETD CONNECT FROM 2.2.2.2.2739 TO 3.3.3.3.13724 fd = 1716
16:42:49.528 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... VN_REQUEST_CONNECT_FORWARD_SOCKET: 10 0x0000000a
16:42:49.746 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... ipc_string: 61307
16:42:49.887 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... hash_str1: 5831783f6099df0ccf36cbafd45b3188
16:42:49.887 [2372.5204] <2> bpbrm listen_for_client: HOT_ORACLE_DB_BACKUP
16:42:49.887 [2372.5204] <2> bpbrm listen_for_client: ... vnetd connect forward: 0 0 0x00000000
16:42:49.887 [2372.5204] <2> vnet_connect_to_vnetd_extra: ... msg: VNETD CONNECT FROM 2.2.2.2.2740 TO 3.3.3.3.13724 fd = 1696
16:42:49.934 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... VN_REQUEST_CONNECT_FORWARD_SOCKET: 10 0x0000000a
16:42:50.137 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... ipc_string: 61310
16:42:50.325 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... hash_str1: 014c6b27b64fd77bbea296d67cb235c7
16:42:50.325 [2372.5204] <2> bpbrm listen_for_client: ... data_ipc: 61310;014c6b27b64fd77bbea296d67cb235c7;4;300
16:42:50.325 [2372.5204] <2> bpbrm listen_for_client: ... accept_socket_data: 1696 1696 0x000006a0
16:42:50.325 [2372.5204] <2> vnet_connect_to_vnetd_extra: ... msg: VNETD CONNECT FROM 2.2.2.2.2741 TO 3.3.3.3.13724 fd = 1684
16:42:50.371 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... VN_REQUEST_CONNECT_FORWARD_SOCKET: 10 0x0000000a
16:42:50.590 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... ipc_string: 61311
16:42:50.762 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... hash_str1: 686e9f7f98835e5400ad404e5002bf8f
16:42:50.762 [2372.5204] <2> bpbrm listen_for_client: ... name_ipc: 61311;686e9f7f98835e5400ad404e5002bf8f;4;300
16:42:50.762 [2372.5204] <2> bpbrm listen_for_client: ... accept_socket_name: 1684 1684 0x00000694
The log also confirms that the meta data for the backup and client exit status were received successfully over the Name socket at the end of the backup. However, bptm exited with status 42 after a delay of 36 seconds. It should have exited within a second or two with status 0.
16:43:22.653 [2372.5204] <2> bpbrm main: ADDED FILES TO DB FOR myclient_1330558966 1 /arch-s14879-p1
16:43:22.810 [2372.5204] <2> bpbrm main: client myclient EXIT STATUS = 0: the requested operation was successfully completed
...snip...
16:43:23.513 [2372.5204] <2> bpbrm wait_for_child: start
16:43:59.373 [2372.5204] <2> bpbrm wait_for_child: child exit_status = 42
16:43:59.373 [2372.5204] <2> bpbrm kill_child_process_Ex: start
16:44:04.373 [2372.5204] <2> bpbrm Exit: client backup EXIT STATUS 42: network read failed
The bptm debug log shows that the media server O/S reported an error 10054 while waiting for the New Data socket to close. This is approximately 15 seconds after the data transfer began and 13 seconds after dbclient closed the remote end of the sockets.
16:43:13.544 [5228.1892] <4> write_backup: begin writing backup id myclient_1330558966, copy 1, fragment 1, to media id DSS083 on drive Px720_D4 (index 6)
16:43:13.544 [5228.1892] <2> signal_parent: set bpbrm media ready event (pid = 2372)
...snip...
16:43:21.560 [5228.1892] <2> write_data: received first buffer (65536 bytes), begin writing data
… The O/S raises error 10054 so bptm updates bpjobd on the master server.
16:43:36.419 [252.5028] <2> init_cache: vnet_hosts.c.1045: host_cache_size: 200 0x000000c8
16:43:36.623 [252.5028] <2> logconnections: BPJOBD CONNECT FROM 2.2.2.2.2768 TO 1.1.1.1.13724
...snip...
16:43:36.638 [252.5028] <2> set_job_details: Sending Tfile jobid (28038)
16:43:36.638 [252.5028] <2> set_job_details: LOG 1330559016 16 bptm 252 socket operation failed - 10054 (at child.c.1294)
...snip...
16:43:37.607 [252.5028] <16> fill_buffer: socket operation failed - 10054 (at child.c.1294)
...snip...
16:43:38.701 [252.5028] <16> fill_buffer: unable to perform read from client socket, connection may have been broken
16:43:38.810 [5228.1892] <2> wait_for_sigcld: waiting for child 252 to exit, timeout is 300
16:43:38.810 [5228.1892] <2> mm_child_done: SIGCHLD: exit=42, pid=252
A network capture from the client host shows that dbclient has completed writing the backup image to the New Data socket to bptm and the meta data to the Name socket to bpbrm. The TCP stack on the media server host has returned TCP ACKs for all of the transmitted data on both sockets. But when dbclient closes the file descriptors for the sockets at 16:43:22.168 a TCP FIN is not sent by the client TCP stack as expected. Instead, 13 additional seconds elapse before the stack sends a TCP RESET on each of the Name, Data, and New Data sockets.
... dbclient completes sending the last 512 bytes of the backupset piece to bptm over the New Data socket
350 16:43:22.06868 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2739 S=13724 Push Ack=685441171 Seq=1762802400 Len=512 Win=64231
... dbclient sends the meta data about the piece to bpbrm over the Name socket.
351 16:43:22.06899 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2741 S=13724 Push Ack=652357522 Seq=4054606214 Len=163 Win=64231 (4054606377)
48: fae7 0ac4 0000 4649 4c20 2d20 3738 3634 .ç....FIL - 7864
64: 3332 2031 2030 2031 3520 3333 3135 3220 32 1 0 15 33152
80: 5359 5354 454d 2053 5953 5445 4d20 3738 SYSTEM SYSTEM 78
96: 3634 3332 2031 3333 3035 3538 3936 3020 6432 1330558960
112: 3133 3330 3535 3839 13305589
352 16:43:22.06969 2.2.2.2 -> 3.3.3.3 drops: 0 TCP D=13724 S=2739 Ack=1762804448 Seq=685441171 Len=0 Win=65535
... dbclient pads the backup image to bptm over the New Data socket.
353 16:43:22.06971 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2739 S=13724 Push Ack=685441171 Seq=1762807520 Len=1024 Win=64231
...snipped additional pad bytes transmitted...
376 16:43:22.07183 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2739 S=13724 Push Ack=685441171 Seq=1762833120 Len=512 Win=64231 (1762833632)
377 16:43:22.07206 2.2.2.2 -> 3.3.3.3 drops: 0 TCP D=13724 S=2739 Ack=1762831584 Seq=685441171 Len=0 Win=65535
378 16:43:22.07221 2.2.2.2 -> 3.3.3.3 drops: 0 TCP D=13724 S=2739 Ack=1762833120 Seq=685441171 Len=0 Win=65535
... the mm TCP stack ACKs the backup image and pad bytes.
379 16:43:22.21746 2.2.2.2 -> 3.3.3.3 drops: 0 TCP D=13724 S=2739 Ack=1762833632 Seq=685441171 Len=0 Win=65023
... the mm TCP stack ACKs the meta data.
380 16:43:22.21752 2.2.2.2 -> 3.3.3.3 drops: 0 TCP D=13724 S=2741 Ack=4054606377 Seq=652357522 Len=0 Win=65281
... dbclient sends client status to bpbrm over the Name socket.
381 16:43:22.21757 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2741 S=13724 Push Ack=652357522 Seq=4054606377 Len=72 Win=64231 (4054606449)
48: fae7 0a69 0000 494e 4620 2d20 4558 4954 .ç.i..INF - EXIT
64: 2053 5441 5455 5320 303a 2074 6865 2072 STATUS 0: the r
80: 6571 7565 7374 6564 206f 7065 7261 7469 equested operati
96: 6f6e 2077 6173 2073 7563 6365 7373 6675 on was successfu
112: 6c6c 7920 636f 6d70 lly comp
... the mm TCP stack ACKs the client status.
382 16:43:22.43620 2.2.2.2 -> 3.3.3.3 drops: 0 TCP D=13724 S=2741 Ack=4054606449 Seq=652357522 Len=0 Win=65209
...13 second delay before the client host TCP stack sends a TSP RESET on all three sockets.
383 16:43:35.82822 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2741 S=13724 Rst Ack=652357522 Seq=4054606449 Len=0 Win=0
384 16:43:35.82831 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2740 S=13724 Rst Ack=278814088 Seq=15610922 Len=0 Win=0
385 16:43:35.82836 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2739 S=13724 Rst Ack=685441171 Seq=1762833632 Len=0 Win=0
A review of the Task Manager on the client hosts confirms that an extjob.exe process was started from the Oracle.exe process between when the dbclient connected to the New Data socket at 16:43:00.937 and when it closed the socket at 16:43:22.168. That process completed at 16:43:35 which matches when the TCP RESETs were sent.
If/When the extjob.exe process runs for longer than 15 minutes after dbclient closes the sockets, then dbclient will timeout at a hardcoded 900 seconds. This causes RMAN to record the failure 15 minutes after the data transfer completes. E.g.
900 seconds.
19:07:12.418 [4888] <4> closeApi: INF - EXIT STATUS 0: the requested operation
was successfully completed
19:07:12.419 [4888] <4> closeApi: INF - closing commSock 37
19:07:12.419 [4888] <4> closeApi: INF - close of commSock returned <0> errno is <0>
19:07:12.419 [4888] <4> closeApi: INF - closing dataSock 39
19:07:12.419 [4888] <4> closeApi: INF - close of dataSock returned <0> errno is <0>
19:07:12.419 [4888] <4> closeApi: INF - setting linger on nameSock 38
19:07:12.420 [4888] <4> closeApi: INF - closing nameSock 38
19:07:12.420 [4888] <4> closeApi: INF - close of nameSock returned <0> errno is <0>
19:07:12.420 [4888] <4> closeApi: serverReadTimeout is <900> seconds
19:07:12.420 [4888] <4> serverExitStatus: entering serverExitStatus.
19:07:12.420 [4888] <4> readCommMessages: Entering readCommMessages
19:22:17.768 [4888] <16> readCommFile: ERR - timed out after 900 seconds while reading from /usr/openv/netbackup/logs/user_ops/dbext/logs/4888.0.1331060711>
19:22:17.768 [4888] <4> closeApi: INF - server EXIT STATUS = -1: invalid error number
...snip...
19:22:17.769 [4888] <2> sbtclose2: INF - leaving
19:22:17.769 [4888] <2> sbterror: INF - entering
19:22:17.769 [4888] <2> sbterror: INF - Error=0: VxBSAEndTxn: Failed with error:
The transaction was aborted..
19:22:17.769 [4888] <2> sbterror: INF - leaving
...snip...
19:22:27.238 [4892] <4> VxBSATerminate: INF - entering VxBSATerminate.
Environment
NetBackup 6.0, 6.5, 7.0, 7.5
Oracle 10g R2 and 11g R2
Windows 2008 R2 x64
Cause
The oracle.exe process serves multiple purposes. Each RMAN channel is a thread of oracle.exe, as are any extjob.exe tasks that are run from the SYS schema.
Consequently, any extjob.exe task that starts while dbclient is connected to the media server processes will inherit copies of the file descriptors for the sockets. If the extjob.exe task has not yet exited when dbclient closes it's copies of the file descriptors, the client host TCP stack will not send the TCP FINs as expected and the media server will be unaware that the backup is complete because the sockets have not yet closed.
If the extjob.exe completes within 15 minutes, a disordly shutdown of the sockets will occur (because extjob.exe does not know the sockets exist and hence can't perform an orderly close) resulting in the TCP RESETs.
If the extjob.exe does not complete for at least 15 minutes, then dbclient will timeout as noted above.
Solution
The behaviors documented above are not explicitly addressed within the Oracle SBT API and neither Oracle nor Symantec have a code level change to fix this situation at the current time. However, several workarounds are available.
1) Use a scheduler other than extjob.exe for the conflicting tasks.
2) Schedule the external tasks to initiate at a time when RMAN backups are not running.
3) Configure the external tasks to run from an Oracle schema other than SYS. When run from a non-SYS schema, the extjob.exe process will not be spawned directly from oracle.exe and will not conflict with the backup. When run from the SYS schema, the thread is spawned using the parent oracle.exe process which is required to also be used by RMAN for the backups.
Current versions of the NetBackup client contain a design change that minimizes the opportunity for this issue to occur. Previously, the Oracle child process would inherit the socket if forked/created between sbtbackup and sbtclose processing. That window has been narrowed to a few milliseconds during sbtbackup processing.
The formal resolution for this issue (ETrack 2717058) is included in the following patch release(s):
NetBackup 7.5 Maintenance Release 5 (7.5.0.5)
Please visit this link for download and README information:
http://www.symantec.com/business/support/index?page=landing&key=15143
|
|
| Source | ETrack |
| Value | 2717058 |
| Description | (enhancement) Status 6 or status 42 because dbclient sockets are not isolated from Oracle child processes |
Related Articles
Article URL http://www.symantec.com/docs/TECH189853
Terms of use for this information are found in Legal Notices









Thank you.