Application backups fails with status 6 or status 42 when extjob.exe inherits NetBackup sockets

Article:TECH189853  |  Created: 2012-05-28  |  Updated: 2013-03-27  |  Article URL http://www.symantec.com/docs/TECH189853
Article Type
Technical Solution


Issue



Oracle RMAN application backup jobs fail intermittently.  Sometimes with a status 42 near the end of the data transfer, or sometimes with a status 6 approximately 15 minutes after the data transfer appears to complete.

 

The same backup can be rerun and will complete successfully.

 


Error



The RMAN output shows that the sbtbackup and sbtwrite operations completed the transfer of the backupset piece to the media management layer, but that the sbtclose API call failed to close the image successfully.

 

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-03009: failure of backup command on ch00 channel at 02/15/2012 16:44:07

ORA-27192: skgfcls: sbtclose2 returned error - failed to close file

ORA-19511: Error received from media manager layer, error text:

   VxBSAEndTxn: Failed with error:

   The transaction was aborted.

 

 

The dbclient debug log identifies the vnetd hash strings for the Data/Name/New Data sockets, when the New Data socket is connected, and when the data transfer starts over the New Data socket to bptm.

 

16:42:40.672 [4228.12636] <4> sendRequest: sending BACKUP request to bprd

16:42:40.672 [4228.12636] <4> sendRequest:    request = SYSTEM SYSTEM ... user_ops/dbext/logs/12636.0.1330558960 ...

...snip...

16:43:00.937 [4228.12636] <4> serverResponse: 16:42:52 INF - Data socket = mymm.IPC:61310;014c6b27b64fd77bbea296d67cb235c7;4;300

16:43:00.937 [4228.12636] <4> serverResponse: 16:42:52 INF - Name socket = mymm.IPC:61311;686e9f7f98835e5400ad404e5002bf8f;4;300

16:43:00.937 [4228.12636] <4> serverResponse: 16:42:52 INF - Job id = 28038

16:43:00.937 [4228.12636] <4> serverResponse: 16:42:52 INF - Backup id = myclient_1330558966

...snip..

16:43:00.937 [4228.12636] <4> serverResponse: 16:42:54 INF - New data socket = mymm.IPC:61307;5831783f6099df0ccf36cbafd45b3188;4;300

16:43:00.937 [4228.12636] <4> serverResponse: INF - NEW DATA sockfd: 4760

...snip...

16:43:21.045 [4228.12636] <2> int_WriteData: INF - writing buffer # 1 of size 262144

 

The backupset piece is written to the New Data socket (bound for bptm) in less than 2 seconds and then the meta data about the piece and client status is written to the Name socket (bound for bpbrm) and all three sockets are closed without any error indications from the client host O/S.  After a 45 second delay, a server status 42 is unexpectedly found in the comm file and the failure is returned to Oracle.

 

16:43:22.168 [4228.12636] <4> finishTarImage: INF - dumped size = 786432 bytes

16:43:22.168 [4228.12636] <4> finishTarImage: INF - sending image info on nameSock: FIL - 786432 1 0 15 33152 SYSTEM SYSTEM 786432 1330558960 1330558960 1330558960 1 0 0 15 13 0 0 3 1330558960 1 4   Oracle Database Oracle Backup   /arch-s14879-p1

16:43:22.168 [4228.12636] <4> closeApi: INF - EXIT STATUS 0: the requested operation was successfully completed

16:43:22.168 [4228.12636] <4> closeApi: INF - closing commSock 3984

16:43:22.168 [4228.12636] <4> closeApi: INF - closing dataSock 4760

16:43:22.527 [4228.12636] <4> closeApi: INF - setting linger on nameSock 2936

16:43:22.527 [4228.12636] <4> closeApi: INF - closing nameSock 2936

16:43:22.652 [4228.12636] <4> closeApi: serverReadTimeout is <900> seconds

16:43:22.652 [4228.12636] <4> serverExitStatus: entering serverExitStatus.

16:43:22.652 [4228.12636] <4> readCommMessages: Entering readCommMessages

16:44:07.720 [4228.12636] <4> serverExitStatus: read from comm file:<16:44:00 INF - Server status = 42>

16:44:07.720 [4228.12636] <4> closeApi: INF - server EXIT STATUS = 42: network read failed

...snip...

16:44:07.720 [4228.12636] <2> sbtclose2: INF - leaving

16:44:07.720 [4228.12636] <2> sbterror: INF - entering

16:44:07.720 [4228.12636] <2> sbterror: INF - Error=0: VxBSAEndTxn: Failed with error:

16:44:07.720 [4228.12636] <2> sbterror: INF - leaving

 

 

The vnetd debug log on the client shows the TCP IP addresses and port numbers associated with the New Data, Data, and Name sockets respectively in the order created.  Each has a hash string which matches the dbclient (and bpbrm/bptm) log.

 

16:42:49.034 [12536.13340] <2> logparams: D:\Veritas\NETBAC~1\bin\VNETD.EXE

16:42:49.049 [12536.13340] <2> ProcessRequests: ... VNETD ACCEPT FROM 2.2.2.2.2739 TO 3.3.3.3.13724 fd = 548

16:42:49.268 [12536.13340] <2> ProcessRequests: ... Request VN_REQUEST_CONNECT_FORWARD_SOCKET(10)

16:42:49.268 [12536.13340] <2> vnet_generate_hashes: ... hash_str1: 5831783f6099df0ccf36cbafd45b3188

...snip...

16:42:49.439 [11156.11092] <2> logparams: D:\Veritas\NETBAC~1\bin\VNETD.EXE

16:42:49.455 [11156.11092] <2> ProcessRequests: ... VNETD ACCEPT FROM 2.2.2.2.2740 TO 3.3.3.3.13724 fd = 404

16:42:49.658 [11156.11092] <2> ProcessRequests: ... Request VN_REQUEST_CONNECT_FORWARD_SOCKET(10)

16:42:49.658 [11156.11092] <2> vnet_generate_hashes: ... hash_str1: 014c6b27b64fd77bbea296d67cb235c7

...snip...

16:42:49.876 [14000.13316] <2> logparams: D:\Veritas\NETBAC~1\bin\VNETD.EXE

16:42:49.892 [14000.13316] <2> ProcessRequests: ... VNETD ACCEPT FROM 2.2.2.2.2741 TO 3.3.3.3.13724 fd = 548

16:42:50.110 [14000.13316] <2> ProcessRequests: ... Request VN_REQUEST_CONNECT_FORWARD_SOCKET(10)

16:42:50.110 [14000.13316] <2> vnet_generate_hashes: ... hash_str1: 686e9f7f98835e5400ad404e5002bf8f

 

 

The bpbrm debug log confirms the TCP IP addresses and port numbers associated with the New Data, Data, and Name sockets respectively in the order created.  If the backup had been multiplexed, the New Data socket creation would have been in the bptm debug log.

 

16:42:47.637 [2372.5204] <2> logparams: -backup ... -b myclient_1330558966 -jobid 28038 ... user_ops/dbext/logs/12636.0.1330558960 ...

...snip...

16:42:49.450 [2372.5204] <2> vnet_connect_to_vnetd_extra: ... VNETD CONNECT FROM 2.2.2.2.2739 TO 3.3.3.3.13724 fd = 1716

16:42:49.528 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... VN_REQUEST_CONNECT_FORWARD_SOCKET: 10 0x0000000a

16:42:49.746 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... ipc_string: 61307

16:42:49.887 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... hash_str1: 5831783f6099df0ccf36cbafd45b3188

16:42:49.887 [2372.5204] <2> bpbrm listen_for_client: HOT_ORACLE_DB_BACKUP

16:42:49.887 [2372.5204] <2> bpbrm listen_for_client: ... vnetd connect forward: 0 0 0x00000000

16:42:49.887 [2372.5204] <2> vnet_connect_to_vnetd_extra: ... msg: VNETD CONNECT FROM 2.2.2.2.2740 TO 3.3.3.3.13724 fd = 1696

16:42:49.934 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... VN_REQUEST_CONNECT_FORWARD_SOCKET: 10 0x0000000a

16:42:50.137 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... ipc_string: 61310

16:42:50.325 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... hash_str1: 014c6b27b64fd77bbea296d67cb235c7

16:42:50.325 [2372.5204] <2> bpbrm listen_for_client: ... data_ipc: 61310;014c6b27b64fd77bbea296d67cb235c7;4;300

16:42:50.325 [2372.5204] <2> bpbrm listen_for_client: ... accept_socket_data: 1696 1696 0x000006a0

16:42:50.325 [2372.5204] <2> vnet_connect_to_vnetd_extra: ... msg: VNETD CONNECT FROM 2.2.2.2.2741 TO 3.3.3.3.13724 fd = 1684

16:42:50.371 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... VN_REQUEST_CONNECT_FORWARD_SOCKET: 10 0x0000000a

16:42:50.590 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... ipc_string: 61311

16:42:50.762 [2372.5204] <2> vnet_vnetd_connect_forward_socket_begin: ... hash_str1: 686e9f7f98835e5400ad404e5002bf8f

16:42:50.762 [2372.5204] <2> bpbrm listen_for_client: ... name_ipc: 61311;686e9f7f98835e5400ad404e5002bf8f;4;300

16:42:50.762 [2372.5204] <2> bpbrm listen_for_client: ... accept_socket_name: 1684 1684 0x00000694

 

The log also confirms that the meta data for the backup and client exit status were received successfully over the Name socket at the end of the backup.  However, bptm exited with status 42 after a delay of 36 seconds.  It should have exited within a second or two with status 0.

 

16:43:22.653 [2372.5204] <2> bpbrm main: ADDED FILES TO DB FOR myclient_1330558966 1 /arch-s14879-p1

16:43:22.810 [2372.5204] <2> bpbrm main: client myclient EXIT STATUS = 0: the requested operation was successfully completed

...snip...

16:43:23.513 [2372.5204] <2> bpbrm wait_for_child: start

16:43:59.373 [2372.5204] <2> bpbrm wait_for_child: child exit_status = 42

16:43:59.373 [2372.5204] <2> bpbrm kill_child_process_Ex: start

16:44:04.373 [2372.5204] <2> bpbrm Exit: client backup EXIT STATUS 42: network read failed

 

 

The bptm debug log shows that the media server O/S reported an error 10054 while waiting for the New Data socket to close.  This is approximately 15 seconds after the data transfer began and 13 seconds after dbclient closed the remote end of the sockets.

 

16:43:13.544 [5228.1892] <4> write_backup: begin writing backup id myclient_1330558966, copy 1, fragment 1, to media id DSS083 on drive Px720_D4 (index 6)

16:43:13.544 [5228.1892] <2> signal_parent: set bpbrm media ready event (pid = 2372)

...snip...

16:43:21.560 [5228.1892] <2> write_data: received first buffer (65536 bytes), begin writing data

… The O/S raises error 10054 so bptm updates bpjobd on the master server.

16:43:36.419 [252.5028] <2> init_cache: vnet_hosts.c.1045: host_cache_size: 200 0x000000c8

16:43:36.623 [252.5028] <2> logconnections: BPJOBD CONNECT FROM 2.2.2.2.2768 TO 1.1.1.1.13724

...snip...

16:43:36.638 [252.5028] <2> set_job_details: Sending Tfile jobid (28038)

16:43:36.638 [252.5028] <2> set_job_details: LOG 1330559016 16 bptm 252 socket operation failed - 10054 (at child.c.1294)

...snip...

16:43:37.607 [252.5028] <16> fill_buffer: socket operation failed - 10054 (at child.c.1294)

...snip...

16:43:38.701 [252.5028] <16> fill_buffer: unable to perform read from client socket, connection may have been broken

16:43:38.810 [5228.1892] <2> wait_for_sigcld: waiting for child 252 to exit, timeout is 300

16:43:38.810 [5228.1892] <2> mm_child_done: SIGCHLD: exit=42, pid=252

 

 

A network capture from the client host shows that dbclient has completed writing the backup image to the New Data socket to bptm and the meta data to the Name socket to bpbrm.  The TCP stack on the media server host has returned TCP ACKs for all of the transmitted data on both sockets.  But when dbclient closes the file descriptors for the sockets at 16:43:22.168 a TCP FIN is not sent by the client TCP stack as expected.  Instead, 13 additional seconds elapse before the stack sends a TCP RESET on each of the Name, Data, and New Data sockets.

 

... dbclient completes sending the last 512 bytes of the backupset piece to bptm over the New Data socket

350 16:43:22.06868 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2739 S=13724 Push Ack=685441171 Seq=1762802400 Len=512 Win=64231

... dbclient sends the meta data about the piece to bpbrm over the Name socket.

351 16:43:22.06899 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2741 S=13724 Push Ack=652357522 Seq=4054606214 Len=163 Win=64231 (4054606377)

          48: fae7 0ac4 0000 4649 4c20 2d20 3738 3634    .ç....FIL - 7864

          64: 3332 2031 2030 2031 3520 3333 3135 3220    32 1 0 15 33152

          80: 5359 5354 454d 2053 5953 5445 4d20 3738    SYSTEM SYSTEM 78

          96: 3634 3332 2031 3333 3035 3538 3936 3020    6432 1330558960

         112: 3133 3330 3535 3839                        13305589

352 16:43:22.06969 2.2.2.2 -> 3.3.3.3 drops: 0 TCP D=13724 S=2739 Ack=1762804448 Seq=685441171 Len=0 Win=65535

... dbclient pads the backup image to bptm over the New Data socket.

353 16:43:22.06971 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2739 S=13724 Push Ack=685441171 Seq=1762807520 Len=1024 Win=64231

...snipped additional pad bytes transmitted...

376 16:43:22.07183 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2739 S=13724 Push Ack=685441171 Seq=1762833120 Len=512 Win=64231 (1762833632)

377 16:43:22.07206 2.2.2.2 -> 3.3.3.3 drops: 0 TCP D=13724 S=2739 Ack=1762831584 Seq=685441171 Len=0 Win=65535

378 16:43:22.07221 2.2.2.2 -> 3.3.3.3 drops: 0 TCP D=13724 S=2739 Ack=1762833120 Seq=685441171 Len=0 Win=65535

... the mm TCP stack ACKs the backup image and pad bytes.

379 16:43:22.21746 2.2.2.2 -> 3.3.3.3 drops: 0 TCP D=13724 S=2739 Ack=1762833632 Seq=685441171 Len=0 Win=65023

... the mm TCP stack ACKs the meta data.

380 16:43:22.21752 2.2.2.2 -> 3.3.3.3 drops: 0 TCP D=13724 S=2741 Ack=4054606377 Seq=652357522 Len=0 Win=65281

... dbclient sends client status to bpbrm over the Name socket.

381 16:43:22.21757 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2741 S=13724 Push Ack=652357522 Seq=4054606377 Len=72 Win=64231 (4054606449)

          48: fae7 0a69 0000 494e 4620 2d20 4558 4954    .ç.i..INF - EXIT

          64: 2053 5441 5455 5320 303a 2074 6865 2072     STATUS 0: the r

          80: 6571 7565 7374 6564 206f 7065 7261 7469    equested operati

          96: 6f6e 2077 6173 2073 7563 6365 7373 6675    on was successfu

         112: 6c6c 7920 636f 6d70                        lly comp

... the mm TCP stack ACKs the client status.

382 16:43:22.43620 2.2.2.2 -> 3.3.3.3 drops: 0 TCP D=13724 S=2741 Ack=4054606449 Seq=652357522 Len=0 Win=65209

...13 second delay before the client host TCP stack sends a TSP RESET on all three sockets.

383 16:43:35.82822 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2741 S=13724 Rst Ack=652357522 Seq=4054606449 Len=0 Win=0

384 16:43:35.82831 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2740 S=13724 Rst Ack=278814088 Seq=15610922 Len=0 Win=0

385 16:43:35.82836 3.3.3.3 -> 2.2.2.2 drops: 0 TCP D=2739 S=13724 Rst Ack=685441171 Seq=1762833632 Len=0 Win=0

 

 

A review of the Task Manager on the client hosts confirms that an extjob.exe process was started from the Oracle.exe process between when the dbclient connected to the New Data socket at 16:43:00.937 and when it closed the socket at 16:43:22.168.  That process completed at 16:43:35 which matches when the TCP RESETs were sent.

 

 

If/When the extjob.exe process runs for longer than 15 minutes after dbclient closes the sockets, then dbclient will timeout at a hardcoded 900 seconds.  This causes RMAN to record the failure 15 minutes after the data transfer completes.  E.g.

900 seconds.

 

19:07:12.418 [4888] <4> closeApi: INF - EXIT STATUS 0: the requested operation

was successfully completed

19:07:12.419 [4888] <4> closeApi: INF - closing commSock 37

19:07:12.419 [4888] <4> closeApi: INF - close of commSock returned <0> errno is <0>

19:07:12.419 [4888] <4> closeApi: INF - closing dataSock 39

19:07:12.419 [4888] <4> closeApi: INF - close of dataSock returned <0> errno is <0>

19:07:12.419 [4888] <4> closeApi: INF - setting linger on nameSock 38

19:07:12.420 [4888] <4> closeApi: INF - closing nameSock 38

19:07:12.420 [4888] <4> closeApi: INF - close of nameSock returned <0> errno is <0>

19:07:12.420 [4888] <4> closeApi: serverReadTimeout is <900> seconds

19:07:12.420 [4888] <4> serverExitStatus: entering serverExitStatus.

19:07:12.420 [4888] <4> readCommMessages: Entering readCommMessages

19:22:17.768 [4888] <16> readCommFile: ERR - timed out after 900 seconds while reading from /usr/openv/netbackup/logs/user_ops/dbext/logs/4888.0.1331060711>

19:22:17.768 [4888] <4> closeApi: INF - server EXIT STATUS = -1: invalid error number

...snip...

19:22:17.769 [4888] <2> sbtclose2: INF - leaving

19:22:17.769 [4888] <2> sbterror: INF - entering

19:22:17.769 [4888] <2> sbterror: INF - Error=0: VxBSAEndTxn: Failed with error:

   The transaction was aborted..

19:22:17.769 [4888] <2> sbterror: INF - leaving

...snip...

19:22:27.238 [4892] <4> VxBSATerminate: INF - entering VxBSATerminate.

 


Environment



NetBackup 6.0, 6.5, 7.0, 7.5

Oracle 10g R2 and 11g R2

Windows 2008 R2 x64

 

 


Cause



The oracle.exe process serves multiple purposes.  Each RMAN channel is a thread of oracle.exe, as are any extjob.exe tasks that are run from the SYS schema. 

 

Consequently, any extjob.exe task that starts while dbclient is connected to the media server processes will inherit copies of the file descriptors for the sockets.  If the extjob.exe task has not yet exited when dbclient closes it's copies of the file descriptors, the client host TCP stack will not send the TCP FINs as expected and the media server will be unaware that the backup is complete because the sockets have not yet closed.

 

If the extjob.exe completes within 15 minutes, a disordly shutdown of the sockets will occur (because extjob.exe does not know the sockets exist and hence can't perform an orderly close) resulting in the TCP RESETs.

 

If the extjob.exe does not complete for at least 15 minutes, then dbclient will timeout as noted above.

 


Solution



The behaviors documented above are not explicitly addressed within the Oracle SBT API and neither Oracle nor Symantec have a code level change to fix this situation at the current time.  However, several workarounds are available.

 

1) Use a scheduler other than extjob.exe for the conflicting tasks.

 

2) Schedule the external tasks to initiate at a time when RMAN backups are not running.

 

3) Configure the external tasks to run from an Oracle schema other than SYS.  When run from a non-SYS schema, the extjob.exe process will not be spawned directly from oracle.exe and will not conflict with the backup.  When run from the SYS schema, the thread is spawned using the parent oracle.exe process which is required to also be used by RMAN for the backups. 

  

Current versions of the NetBackup client contain a design change that minimizes the opportunity for this issue to occur.   Previously, the Oracle child process would inherit the socket if forked/created between sbtbackup and sbtclose processing.  That window has been narrowed to a few milliseconds during sbtbackup processing.

 

The formal resolution for this issue (ETrack 2717058) is included in the following patch release(s):


NetBackup 7.5 Maintenance Release 5 (7.5.0.5)


Please visit this link for download and README information:
 http://www.symantec.com/business/support/index?page=landing&key=15143

 


Supplemental Materials

SourceETrack
Value2717058
Description

(enhancement) Status 6 or status 42 because dbclient sockets are not isolated from Oracle child processes




Article URL http://www.symantec.com/docs/TECH189853


Terms of use for this information are found in Legal Notices