RMAN backup to SBT_TAPE is much slower than backup to DISK, may fail with status 13

Article:TECH130209  |  Created: 2010-01-19  |  Updated: 2011-09-21  |  Article URL http://www.symantec.com/docs/TECH130209
Article Type
Technical Solution

Product(s)

Environment

Issue



The Oracle backup processing when channels are allocated of TYPE DISK is not exactly the same as when channels are allocated of TYPE SBT_TAPE.  These difference can result in very dissimilar performance during the backup and consequently the NetBackup Oracle extension or other third party media manager - which implements the SBT API - may be mistakenly suspected of providing slower performance.

 

In addition, if the delays are excessive, the SBT API may timeout causing a status 13 if using NetBackup.


Error



There may not be an error if the delays are not excessive, simply slow performance.  The details are as follows.

 

The SBT API used by Oracle to perform backup to third-party media managers is simply a library of function calls, provided by the media manager vendor, that Oracle calls when need.  The function used to transfer data from Oracle to NetBackup is named 'sbtwrite' or 'sbtwrite2' depending on the version of the API.  The NetBackup implementation of sbtwrite and sbtwrite2 is provided within the libobk shared library.  These functions write entries to the dbclient debug log when called and then return control back to the Oracle process of which it is part.  These log entries are recorded at VERBOSE=5 for NetBackup 3.4/4.5/5.x and at VERBOSE=6 at NetBackup 6.5.4 and above, but are not available in NetBackup 6.0 to 6.5.3. These entries can be used to determine if the delays during the backup are on the Oracle side (between calls to sbtwrite/sbtwrite2) or on the network/mediaserver side (before returning control to Oracle).  E.g.

 

Initially the thread is within Oracle.  When Oracle has data to send it calls the sbtwrite/sbtwrite2 function, and dbclient begins logging.  This is an example of Oracle providing a 256 KB buffer of data to NetBackup.

 

16:44:28.272 [8095] <2> xbsa_SendData: INF - entering

16:44:28.272 [8095] <4> VxBSASendData: INF - entering SendData.

16:44:28.272 [8095] <4> dbc_put: INF - sending 262144 bytes

16:44:28.272 [8095] <2> writeToServer: bytesToSend = 262144 bytes

16:44:28.272 [8095] <2> writeToServer: start send -- try=1

 

The 0.003 second elapse between the log entries above and below shows that it took that long to transfer the buffer from the NetBackup Oracle extension onto the file descriptor for the TCP socket to the NetBackup mediaserver.  Delays here could be due to TCP congestion or remote server processing.

 

16:44:28.275 [8095] <2> writeToServer: bytesSent = 262144 bytes

16:44:28.275 [8095] <4> dbc_put: INF - sent: 10 GB + 721944576 Bytes

16:44:28.275 [8095] <2> xbsa_SendData: INF - leaving (0)

 

Once the TCP socket accepts the sent data, the sbtwrite/sbtwrite2 function completes - returning the thread to Oracle per the 'leaving' entry above.  The 0.019 second elapse before the log entries below is how long it took for Oracle to gather the next 256 KB of data and make the next sbtwrite/sbtwrite2 function call.

 

16:44:28.294 [8095] <2> xbsa_SendData: INF - entering

16:44:28.295 [8095] <4> VxBSASendData: INF - entering SendData.

16:44:28.295 [8095] <4> dbc_put: INF - sending 262144 bytes

16:44:28.295 [8095] <2> writeToServer: bytesToSend = 262144 bytes

16:44:28.295 [8095] <2> writeToServer: start send -- try=1

 

By inspecting and totaling the elapse time between and within sbtwrite/sbtwrite2 function calls as represented by the 'leaving' and 'entering' entries, it can be determined whether the delays are while Oracle gathers the data or while the network/mediaserver transfers the data.

 

In some instances it has been observed that Oracle may take many seconds, or even minutes, to make the next sbtwrite/sbtwrite2 function call.  E.g.

 

10:25:10.504 [13097] <2> xbsa_SendData: INF - leaving (0)

...Here, Oracle takes only 0.325 seconds to gather data.

10:25:10.829 [13097] <2> xbsa_SendData: INF - entering

...snipped transfer onto the network...

10:25:10.835 [13097] <2> xbsa_SendData: INF - leaving (0)

...Here, Oracle takes over 18 minutes to gather data.

10:43:37.990 [13097] <2> xbsa_SendData: INF - entering

...snip...

10:46:01.855 [13097] <2> xbsa_SendData: INF - leaving (0)

...Here, Oracle takes only 0.002 seconds to gather data.

10:46:01.857 [13097] <2> xbsa_SendData: INF - entering

...snip...

10:46:01.861 [13097] <2> xbsa_SendData: INF - leaving (0)

...Here, Oracle takes over 21 minutes to gather data.

11:07:45.342 [13097] <2> xbsa_SendData: INF - entering

...snip...

11:09:35.347 [13097] <2> xbsa_SendData: INF - leaving (0)

...Here, Oracle takes only 0.002 seconds to gather data.

11:09:35.349 [13097] <2> xbsa_SendData: INF - entering

...snip...

11:09:35.353 [13097] <2> xbsa_SendData: INF - leaving (0)

...Here, Oracle takes over 6 minutes to gather data.

11:15:54.240 [13097] <2> xbsa_SendData: INF - entering

11:15:54.241 [13097] <4> VxBSASendData: INF - entering SendData.

 

A truss of the Oracle server process during the last delay above shows that all of the threads of the Oracle server process read, and apparently discard, large volumes of data for many minutes before filling the buffer to be provided to NetBackup.

 

...snip...

13097/6:  3644.2506  pread(282, "\0A2\0\0\0\tFF A\0\0\0\0".., 524288, 0x13FE82000) = 524288

13097/8:  3644.2550  pread(283, "\0A2\0\0\0\tFF A\0\0\0\0".., 524288, 0x13FE82000) = 524288

13097/4:  3644.2601  pread(280, "\0A2\0\0\0\tFF81\0\0\0\0".., 524288, 0x13FF02000) = 524288

13097/10: 3644.2608  pread(283, "\0A2\0\0\0\tFF81\0\0\0\0".., 524288, 0x13FF02000) = 524288

13097/6:  3644.2649  pread(281, "\0A2\0\0\0\tFF81\0\0\0\0".., 524288, 0x13FF02000) = 524288

13097/4:  3644.2670  pread(279, "\0A2\0\0\0\tFFC1\0\0\0\0".., 524288, 0x13FF82000) = 524288

13097/4:  3644.2745  pread(283, "\0A2\0\0\0\tFFC1\0\0\0\0".., 524288, 0x13FF82000) = 524288

13097/8:  3644.2779  pread(282, "\0A2\0\0\0\tFF81\0\0\0\0".., 524288, 0x13FF02000) = 524288

13097/6:  3644.2784  pread(280, "\0A2\0\0\0\tFFC1\0\0\0\0".., 524288, 0x13FF82000) = 524288

13097/10: 3644.2859  pread(282, "\0A2\0\0\0\tFFC1\0\0\0\0".., 524288, 0x13FF82000) = 524288

13097/8:  3644.2937  pread(281, "\0A2\0\0\0\tFFC1\0\0\0\0".., 524288, 0x13FF82000) = 524288

...snip...

13097/1:  3644.3041  pread(257, "15C2\0\0\0\0\001\0\0\0\0".., 16384, 16384) = 16384

13097/1:  3644.3043  pread(258, "15C2\0\0\0\0\001\0\0\0\0".., 16384, 16384) = 16384

13097/1:  3644.3045  pread(256, "15C2\0\0\0\0\00F\0\012\n".., 16384, 245760) = 16384

13097/1:  3644.3047  pread(256, "15C2\0\0\0\0\011\0\012\n".., 16384, 278528) = 16384

 

Until eventually data is found for backup and the next sbtwrite/sbtwrite2 function call is made to send the data from the client host to the mediaserver host.

 

13097/1:  3644.3060  getpid()                                        = 13097 [13089]

13097/1:  3644.3061  write(15, " 1 1 : 1 5 : 5 4 . 2 4 0".., 67)     = 67

13097/1:  3644.3062  getpid()                                        = 13097 [13089]

13097/1:  3644.3063  write(15, " 1 1 : 1 5 : 5 4 . 2 4 1".., 60)     = 60

13097/1:  3644.3107  send(22, "06A2\0\0\0 @D2A1 u91C5 @".., 262144, 0) = 262144

 

 

 

If the client host is the mediaserver, the backup will use shared memory, instead of a network socket to bptm, but the log entries are similar in this second backup.  Notice also that every so many buffers, dbclient will send an application keep-alive to the mediaserver so that it knows that the client and mediaserver are progressing.

 

18:35:55.498 [14809] <2> xbsa_SendData: INF - leaving (0)

...Here, 5.477 delay while Oracle gathers the next buffer.

18:36:00.975 [14809] <2> xbsa_SendData: INF - entering

18:36:00.975 [14809] <4> VxBSASendData: INF - entering SendData.

18:36:00.975 [14809] <4> dbc_put: INF - sending 262144 bytes

18:36:00.975 [14809] <2> writeToServer: bytesToSend = 262144 bytes

18:36:00.975 [14809] <2> CopyToShM: Copying 262144 bytes to ShM

...Here, transfer to bptm takes less than 0.001 seconds.

18:36:00.976 [14809] <4> dbc_put: INF - sent: 2359296 Bytes

18:36:00.976 [14809] <2> xbsa_SendData: INF - leaving (0)

...Here, 5.409 delay while Oracle gathers the next buffer.

18:36:06.385 [14809] <2> xbsa_SendData: INF - entering

18:36:06.385 [14809] <4> VxBSASendData: INF - entering SendData.

18:36:06.385 [14809] <4> dbc_put: INF - sending 262144 bytes

18:36:06.385 [14809] <2> writeToServer: bytesToSend = 262144 bytes

18:36:06.385 [14809] <2> CopyToShM: Copying 262144 bytes to ShM

...Here, transfer to bptm takes less than 0.001 seconds.

18:36:06.386 [14809] <4> dbc_put: INF - sent: 2621440 Bytes

...Here, dbclient sends keepalive to bpbrm before returning control to Oracle.

18:36:06.386 [14809] <2> writeToServer: bytesToSend = 1 bytes

18:36:06.386 [14809] <2> writeToServer: start send -- try=1

18:36:06.386 [14809] <2> writeToServer: bytesSent = 1 bytes

18:36:06.386 [14809] <4> dbc_put: INF - sending keepalive on NAME Socket

18:36:06.386 [14809] <2> xbsa_SendData: INF - leaving (0)

...Here, 6.670 delay while Oracle gathers the next buffer.

18:36:13.056 [14809] <2> xbsa_SendData: INF - entering

 

If Oracle pauses for too long between buffers, then the mediaserver will not receive a keep-alive and will assume a network read failure.  The bpbrm debug log will show the delay, followed by timer expiration, termination of bptm, and the reporting of status 13 for the job.

 

18:35:08.210 [19994] <2> bpbrm main: from client scorpion: change timeout to 10800

...3 hour delay waiting for keep-alive or backup meta data from client.

21:36:06.722 [19994] <16> bpbrm readline: socket read failed: errno = 62 - Timer expired

21:36:06.815 [19994] <2> bpbrm kill_child_process: start

21:36:06.815 [19994] <2> bpbrm wait_for_child: start

21:36:18.964 [19994] <2> bpbrm wait_for_child: child exit_status = 82 signal_status = 0

21:36:18.965 [19994] <2> bpbrm Exit: client backup EXIT STATUS 13: file read failed

 

The bptm debug log confirms that it was still waiting for data from the client at the time of termination.

 

18:35:13.639 [19995] <2> write_data: received first buffer (262144 bytes), begin writing data

18:35:13.643 [19995] <2> send_media_kbytes_written_establish_threshold: CINDEX 0, RB Kbytes for monitoring = 1000000

21:36:06.816 [19995] <2> Media_dispatch_signal: calling catch_signal for 1 (bptm.c:23390) delay 0 seconds

21:36:06.816 [19995] <2> Media_siginfo_print: 0: delay 0 signo SIGHUP:1 code 0 pid 19994

21:36:08.816 [19995] <2> set_job_details: Tfile (755705): LOG 1315974968 16 bptm 19995 media manager terminated by parent process

...snip...

23:36:15.874 [19995] <2> catch_signal: EXITING with status 82

 

The dbclient debug log shows that Oracle actually delayed 3 hours and 37 minutes before providing the last buffer.  Had this backup been using a network socket, it would have been detected when this buffer was written to the network.  Because it wasn't, the shared memory update occurred, but when the metadata for the backup was passed up the network socket to bpbrm, that socket was discovered to be broken due to bpbrm having exited 37 minutes prior.  CopyToShM is the name of the function attempting to write to the socket, this failure does not relate to problems with the shared memory.

 

18:36:13.057 [14809] <2> xbsa_SendData: INF - leaving (0)

...waiting for Oracle

22:14:05.450 [14809] <2> xbsa_SendData: INF - entering

22:14:05.450 [14809] <4> VxBSASendData: INF - entering SendData.

22:14:05.450 [14809] <4> dbc_put: INF - sending 262144 bytes

22:14:05.450 [14809] <2> writeToServer: bytesToSend = 262144 bytes

22:14:05.450 [14809] <2> CopyToShM: Copying 262144 bytes to ShM

22:14:05.468 [14809] <4> dbc_put: INF - sent: 3145728 Bytes

22:14:05.468 [14809] <2> xbsa_SendData: INF - leaving (0)

22:14:05.468 [14809] <2> sbtclose2: INF - entering

...snip...

22:14:05.481 [14809] <2> writeToServer: bytesToSend = 512 bytes

22:14:05.481 [14809] <2> CopyToShM: Copying 512 bytes to ShM

22:14:05.481 [14809] <16> CopyToShM: connection broken. Error #40

22:14:05.481 [14809] <16> writeToServer: ERR - CopyToShM() failed, no files backed up

22:14:05.481 [14809] <16> close_image: ERR - failed zeroing to 32K boundary of the image tape block

22:14:05.481 [14809] <16> VxBSAEndTxn: ERR - bsa_close() failed.

...snip...

22:14:05.481 [14809] <2> sbtclose2: INF - leaving

22:14:05.481 [14809] <2> sbterror: INF - entering

22:14:05.481 [14809] <2> sbterror: INF - Error=0: VxBSAEndTxn: Failed with error:

22:14:05.481 [14809] <2> sbterror: INF - leaving


Environment



Any version of NetBackup.
See Oracle technical support for affected versions.


Solution



The long delays between Oracle making the sbtwrite/sbtwrite2 function calls may be due to some of the Oracle database files being pre-allocated with a size that greatly exceeds the amount of data currently being stored.  In this example, some of the files were 30 GB in size and others were 5 GB in size, all of which contained less than 3 GB of data.

 

When RMAN performs a backup to TYPE DISK, if certain conditions are met, it may skip reading the unused portions of the files.  When RMAN performs a backup to TYPE SBT_TAPE it reads all of the files which may result in significant delays before it finds data that actually needs to be backed up.

 

This topic is summarized below and detailed by Oracle in the section titled 'Unused Block Compression Of Datafile Backups to Backup Sets' of  http://download.oracle.com/docs/cd/B19306_01/backup.102/b14194/rcmsynta009.htm#sthref141

 

When backing up datafiles into backup sets, RMAN does not back up the contents of data blocks that have never been allocated. (In previous releases, this behavior was referred to as NULL compression.)  RMAN also skips other datafile blocks that do not currently contain data, if all of the following conditions apply:

 

  • The COMPATIBLE initialization parameter is set to 10.2

  • There are currently no guaranteed restore points defined for the database

  • The datafile is locally managed

  • The datafile is being backed up to a backup set as part of a full backup or a level 0 incremental backup 
  • The backup set is being created on disk.

If the Oracle delays cannot be minimized, a workaround is available to avoid the status 13.  To allow the bpbrm process to wait long enough for the next keep-alive, increase the CLIENT_READ_TIMEOUT on the mediaserver to a value that is long enough to accommodate the Oracle delays.



Legacy ID



354522


Article URL http://www.symantec.com/docs/TECH130209


Terms of use for this information are found in Legal Notices