Error 21 / 25 during an RMAN online backup

Created: 02 May 2014 • Updated: 14 May 2014 | 14 comments
This issue has been solved. See solution.

Hi.

I have a problem with some RMAN online backups.

In the NetBackup Activity Monitor I get the errors "socket open failed  (21)" and "cannot connect on socket  (25)", but I can still restore the database!

What is the reason? The network between my backup servers and the Oracle clients runs... and runs... and runs.

I have attached a picture from my Activity Monitor.

 


Marianne's picture

We see individual streams that are failing.

Please show us all the text in the Details tab of these jobs.

Also check the 'file name' in the Overview tab and make a note of it.

Next, run bplist for this client:

bplist -C <client-name> -t 4 -R /

Do you see the 'file name' seen above?
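
The check above can also be scripted. This is only a rough sketch in Python: it runs bplist with exactly the flags quoted above and looks for the file name in the output. It assumes bplist prints one path per line, and the client and file names in the usage line are hypothetical placeholders.

```python
import subprocess


def list_oracle_backups(client: str) -> str:
    """Run bplist with the flags quoted in the post and return its stdout."""
    result = subprocess.run(
        ["bplist", "-C", client, "-t", "4", "-R", "/"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def backup_piece_listed(listing: str, file_name: str) -> bool:
    """Return True if file_name appears as a line of the bplist output.

    Assumption: one path per line. A leading "/" is ignored on both sides.
    """
    wanted = file_name.strip().lstrip("/")
    return any(
        line.strip().lstrip("/") == wanted for line in listing.splitlines()
    )


# Usage (hypothetical names):
#   backup_piece_listed(list_oracle_backups("my-client"), "EXAMPLE_PIECE_1.bak")
```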

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

carlos_jimenez's picture

Because only 2 out of what appear to be many streams are failing, it is very likely that you are dealing with an environmental or configuration issue. We know the binaries work, because jobs complete, and that communication between master/media server and client exists. It could be related to network stability, a TCP stack issue, or issues with ports on the client or servers.

Status 21 denotes that one of the NetBackup servers couldn't open a socket connection to the client. This is usually because the connection is blocked by a firewall, misrouted to an incorrect host due to host name issues, or hit by port exhaustion on the client: anything that would cause a socket connection to fail to open.

Status 25 implies a socket connection failure, which is a slightly different issue. This could be a socket connection that was established and then dropped as a result of network instability or TCP stack issues.

As Marianne suggested, we would need the detailed status of those jobs to have an idea of which direction to go, but these types of issues, if persistent, can be tricky to troubleshoot. We would need verbose logging of bpbrm and bptm on the media server and of dbclient on the client.
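
For triage, the two status codes described above can be kept in a small lookup. This is a sketch in Python, not anything that ships with NetBackup: the hint texts only paraphrase the explanation above, and the line format ("text  (code)") is assumed from how the Activity Monitor messages are quoted in this thread.

```python
import re

# Status text as quoted in this thread, plus the likely causes named above.
STATUS_HINTS = {
    21: ("socket open failed",
         "server could not open a socket to the client: check firewall, "
         "host name resolution, port exhaustion on the client"),
    25: ("cannot connect on socket",
         "connection failed or dropped: check network stability and the "
         "TCP stack"),
}

# Matches lines such as: socket open failed  (21)
STATUS_LINE = re.compile(r"^(?P<text>.+?)\s+\((?P<code>\d+)\)\s*$")


def parse_status_line(line):
    """Return (code, text) for an Activity Monitor status line, else None."""
    match = STATUS_LINE.match(line.strip())
    if match is None:
        return None
    return int(match.group("code")), match.group("text")


def hint_for(line):
    """Return the troubleshooting hint for a status line, if one is known."""
    parsed = parse_status_line(line)
    if parsed is None:
        return None
    entry = STATUS_HINTS.get(parsed[0])
    return entry[1] if entry else None
```

Codes that are not in the table simply return no hint, so the lookup can be extended as more statuses turn up.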

Dip's picture

Also, the reason you are able to restore the DB is that RMAN backs up all DB files assigned to the failed channel on another active channel. If you look at the RMAN log file on the Oracle server and search for one of the files assigned to the failed channel, it will appear in another channel's file list. This means RMAN is redistributing all files that were not backed up to another active channel. This is the "Channel Failover" feature, and you will see the message "channel ORA_SBT_TAPE_x: backup set failed, re-triable on other channel" in the RMAN log file.
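
To find those failover messages quickly, something like the following Python sketch could be used. It only searches for the message quoted above; the log path in the usage comment is a hypothetical placeholder, and the exact wording may differ between Oracle versions.

```python
import re

# The failover message quoted above, with the channel name captured.
FAILOVER = re.compile(
    r"channel (\S+): backup set failed, re-triable on other channel"
)


def failed_over_channels(rman_log: str) -> list:
    """Return the channel names that reported a failover, in log order."""
    return [match.group(1) for match in FAILOVER.finditer(rman_log)]


# Usage (hypothetical path):
#   with open("/path/to/rman_backup.log") as handle:
#       print(failed_over_channels(handle.read()))
```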

Also, for error 25: I have seen a high number of channels cause failures when the Oracle server does not have enough resources. Find out how many channels are configured (ask the DBA, as this is specified in the RMAN backup script). Normally, 4 channels are more than enough on a Gigabit network connection, but for a small DB, 2 channels will give the best performance.

SOLUTION
Thomas Schulz 3's picture

I asked my DBA for the channel setup; it will take some hours or days to get a response ....

Dip's picture

You can also check in the Activity Monitor how many jobs started at the same time as the RMAN backup. If you see four RMAN jobs starting at the beginning of the backup, the DBA is allocating 4 channels.
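
As a rough illustration of that counting rule, here is a Python sketch. The job start times would have to be copied from the Activity Monitor by hand, and the 60 second window for "at the same time" is an assumption, not a NetBackup setting.

```python
from datetime import datetime, timedelta


def estimate_channels(start_times, window=timedelta(seconds=60)):
    """Count the jobs that started within `window` of the earliest job.

    Assumption: each RMAN channel shows up as one job, and all channels
    are allocated at the start of the backup.
    """
    if not start_times:
        return 0
    first = min(start_times)
    return sum(1 for started in start_times if started - first <= window)


# Hypothetical start times: four jobs at the start, one much later.
starts = [
    datetime(2014, 5, 6, 22, 0, 1),
    datetime(2014, 5, 6, 22, 0, 2),
    datetime(2014, 5, 6, 22, 0, 2),
    datetime(2014, 5, 6, 22, 0, 4),
    datetime(2014, 5, 6, 23, 30, 0),
]
print(estimate_channels(starts))  # 4
```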

Marianne's picture

Have you checked bplist output yet as per my previous post?


Thomas Schulz 3's picture

Hello.

I saw the errors again today (error 25, error 6 and error 13) while backing up my Oracle DB on one server.

The file from the error 25 job was

-> /DBKI_4252087926_20140506_58p7jilb_9384_3.bak

And yes, I do see the file when running bplist.

A restore will work, but these error messages .....

Thomas Schulz 3's picture

Now I have some error 13 failures :-(

This is what the log file says:

...

07.05.2014 11:43:13 - begin writing
07.05.2014 11:43:14 - Info bphdb (pid=11966) dbclient(pid=11966) wrote first buffer(size=262144)
07.05.2014 11:45:13 - Info bptm (pid=21496) waited for full buffer 2893 times, delayed 7483 times
07.05.2014 11:45:44 - Info bptm (pid=21496) EXITING with status 0 <----------
07.05.2014 13:45:13 - Error bpbrm (pid=20162) socket read failed: errno = 104 - Connection reset by peer
07.05.2014 13:45:13 - Info bphdb (pid=11966) done. status: 13: file read failed
07.05.2014 13:45:13 - end writing; write time: 2:02:00
file read failed  (13)

...
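
The job details above can be checked for long silent periods before the failure. The following Python sketch parses the lines as pasted (assuming the DD.MM.YYYY HH:MM:SS timestamp format shown) and reports how much time passed between the first Error entry and the entry logged before it. The function names are made up for this example.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as: 07.05.2014 11:43:13 - begin writing
STAMP = re.compile(r"^(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}) - (.*)$")


def parse_job_details(text):
    """Return (timestamp, message) pairs; lines without a timestamp are skipped."""
    events = []
    for line in text.splitlines():
        match = STAMP.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%d.%m.%Y %H:%M:%S")
            events.append((when, match.group(2)))
    return events


def idle_before_first_error(events):
    """Time between the first 'Error' entry and the entry logged before it."""
    ordered = sorted(events, key=lambda event: event[0])
    for previous, current in zip(ordered, ordered[1:]):
        if current[1].startswith("Error"):
            return current[0] - previous[0]
    return None


SAMPLE = """\
07.05.2014 11:43:13 - begin writing
07.05.2014 11:43:14 - Info bphdb (pid=11966) dbclient(pid=11966) wrote first buffer(size=262144)
07.05.2014 11:45:13 - Info bptm (pid=21496) waited for full buffer 2893 times, delayed 7483 times
07.05.2014 11:45:44 - Info bptm (pid=21496) EXITING with status 0 <----------
07.05.2014 13:45:13 - Error bpbrm (pid=20162) socket read failed: errno = 104 - Connection reset by peer
07.05.2014 13:45:13 - Info bphdb (pid=11966) done. status: 13: file read failed
07.05.2014 13:45:13 - end writing; write time: 2:02:00
file read failed  (13)
"""

print(idle_before_first_error(parse_job_details(SAMPLE)))  # 1:59:29
```

For this log, bptm had already exited with status 0 almost two hours before bpbrm reported the reset connection, so the data transfer itself appears to have finished long before the job failed.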

 

I have attached a new picture.

 

error21_25-2.JPG
carlos_jimenez's picture

The file gets backed up and restores work because of your channel failover settings. Because retries work, we know we can back up the file, and the failures are not reproducible with the same file every time. As I mentioned earlier, that implies an environmental issue, and in particular all the status codes suggest network or TCP related issues. It might also lie outside of that, such as an Oracle resource issue. Dip made an excellent suggestion to test that theory by reducing the number of channels if you are using more than 4.

Beyond that, to get a better idea of what is happening, I would suggest that you carefully (watching for disk space issues) enable verbose logging on the client and media server, in particular for bpbrm and bptm on the media server and for dbclient, user_ops, vnetd, and bpcd on the client.

Ultimately, it is recommended that you open a support case with NetBackup support so that the logs can be reviewed.

 

DOCUMENTATION: How to enable logging to troubleshoot NetBackup for Oracle RMAN

Article:TECH32031  |  Created: 2004-01-09  |  Updated: 2011-10-25  |  Article URL http://www.symantec.com/docs/TECH32031
Dimitrios's picture

Status 25 and 6 are typical failures caused by your RMAN backup script.

@carlos is right, it could be due to the channel settings. How many channels do you have open? I hope I am not talking bullshit, but if you use a tape library with 2 drives, you can't open 3 or 4 channels, can you?!

Check your RMAN backup log file for more information.
 

Thomas Schulz 3's picture

My DBA uses 5 channels, and a lot of Oracle DBs run on the Oracle server. For each Oracle DB, an RMAN script runs with 5 channels.

I talked with my DBA and we'll try 2 channels, because this is only a test system.

And we back up all our data to an ADVD disk device.
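
To get a feel for the load this setup can put on the client and media server, here is a back-of-the-envelope calculation in Python. The number of databases is a hypothetical placeholder (the real count is not stated in this thread), and the sketch assumes the worst case where all RMAN scripts run at the same time.

```python
def concurrent_streams(databases: int, channels_per_database: int) -> int:
    """Worst case: every database backup runs at once, one stream per channel."""
    return databases * channels_per_database


DATABASES = 8  # hypothetical; replace with the real number of Oracle DBs

print(concurrent_streams(DATABASES, 5))  # 40 streams with 5 channels each
print(concurrent_streams(DATABASES, 2))  # 16 streams with 2 channels each
```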

Thomas Schulz 3's picture

We have reduced all the channels to 2 (from 5) and will test it over the next few days.

 

 

Thomas Schulz 3's picture

Hi.

The errors are mostly gone, but today I got one "error 13".

What can that be?

09.05.2014 11:50:20 - Info bpbrm (pid=3096) ildd04-16 is the host to backup data from
09.05.2014 11:50:20 - Info bpbrm (pid=3096) reading file list from client
09.05.2014 11:50:21 - Info bpbrm (pid=3096) listening for client connection
09.05.2014 11:50:21 - Info bpbrm (pid=3096) INF - Client read timeout = 1800
09.05.2014 11:50:22 - Info bpbrm (pid=3096) accepted connection from client
09.05.2014 11:50:22 - Info bphdb (pid=24119) Backup started
09.05.2014 11:50:22 - Info bpbrm (pid=3096) bptm pid: 3244
09.05.2014 11:50:22 - Info bptm (pid=3244) start
09.05.2014 11:50:22 - Info bptm (pid=3244) using 2097152 data buffer size
09.05.2014 11:50:22 - Info bptm (pid=3244) using 32 data buffers
09.05.2014 11:50:23 - Info bptm (pid=3244) start backup
09.05.2014 11:50:24 - Info bptm (pid=3244) backup child process is pid 3327
09.05.2014 11:50:24 - Info bphdb (pid=24119) dbclient(pid=24119) wrote first buffer(size=262144)
09.05.2014 11:50:18 - Info nbjm (pid=5521) starting backup job (jobid=3090013) for client ildd04-16, policy RMAN_Integration, schedule Default-Application-Backup
09.05.2014 11:50:18 - Info nbjm (pid=5521) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=3090013, request id:{5432CB1E-D75F-11E3-B1F9-00212858BFD3})
09.05.2014 11:50:18 - requesting resource ADVD_Oracle_INT_CITI
09.05.2014 11:50:18 - requesting resource nbumaster.NBU_CLIENT.MAXJOBS.ildd04-16
09.05.2014 11:50:18 - requesting resource nbumaster.NBU_POLICY.MAXJOBS.RMAN_Integration
09.05.2014 11:50:18 - granted resource  nbumaster.NBU_CLIENT.MAXJOBS.ildd04-16
09.05.2014 11:50:18 - granted resource  nbumaster.NBU_POLICY.MAXJOBS.RMAN_Integration
09.05.2014 11:50:18 - granted resource  MediaID=@aaaaJ;DiskVolume=/ADVD02;DiskPool=CITI-ADVD-Pool;Path=/ADVD02;StorageServer=media04;MediaServer=media04
09.05.2014 11:50:18 - granted resource  ADVD_Oracle_INT_CITI
09.05.2014 11:50:19 - estimated 0 kbytes needed
09.05.2014 11:50:19 - Info nbjm (pid=5521) started backup (backupid=ildd04-16_1399629018) job for client ildd04-16, policy RMAN_Integration, schedule Default-Application-Backup on storage unit ADVD_Oracle_INT_CITI
09.05.2014 11:50:19 - started process bpbrm (pid=3096)
09.05.2014 11:50:20 - connecting
09.05.2014 11:50:22 - connected; connect time: 0:00:00
09.05.2014 11:50:24 - begin writing
09.05.2014 11:55:48 - Info bptm (pid=3244) waited for full buffer 10883 times, delayed 19491 times
09.05.2014 11:55:52 - Info bptm (pid=3244) EXITING with status 0 <----------
09.05.2014 13:55:47 - Error bpbrm (pid=3096) socket read failed: errno = 104 - Connection reset by peer
09.05.2014 13:55:47 - Info bphdb (pid=24119) done. status: 13: file read failed
09.05.2014 13:55:47 - end writing; write time: 2:05:23
file read failed  (13)
 

Thomas Schulz 3's picture

Thank you very much, the problem is solved.

I reduced the RMAN channels from 5 to 2 and moved the backups to a (mostly) idle media server.