Video Screencast Help
Symantec to Separate Into Two Focused, Industry-Leading Technology Companies. Learn more.

Oracle (RMAN) job fails, but NetBackup job is ok

Created: 22 Nov 2010 • Updated: 08 Mar 2011 | 9 comments
konrad's picture
This issue has been solved. See solution.

Hi All

 

I have a strange situation. The RMAN backup job ends with this error:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup plus archivelog command at 11/21/2010 02:50:37
ORA-00604: error occurred at recursive SQL level 1
ORA-06502: PL/SQL: numeric or value error
ORA-19506: failed to create sequential file, name="<name>", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSACreateObject: Failed with error:
   Server Status:  client process aborted 
 

 

but the NetBackup job has status of "0". This happens during the archive log backup in most cases.

The master, media and clients are on RHEL 5.2. The NetBackup is at 6.5.6. CLIENT_READ_TIMEOUT on a client is set to 5400. The backup job is lunched from a client via cron. The backup data are written to a BasicDisk.

In a dbclient log I can see this:

 

  02:32:06.773 [29038] <2> int_CloseImage: INF - Backup - closing <name>

02:47:12.014 [29038] <16> readCommFile: ERR - timed out after 900 seconds while reading from /usr/openv/netbackup/logs/user_ops/dbext/logs/29038.0.1286324889 

It is like the image cannot be closed and the timeout appears, but why after 900s whet CLIENT_READ_TIMEOUT is 5400?

Do you have any idea what is wrong? 

Comments 9 CommentsJump to latest comment

Marianne's picture

CLIENT_READ_TIMEOUT on the media server is used to determine the timeout.

The timeout error could be a 'red herring'. You will need to look at the entire dbclient log as well as the corresponding period in bprd log on the master server.

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

konrad's picture

The backup job was created - I found it. I thought that the backup was not even initiated.

But in nbpem it looks like this

 11/21/2010 02:48:57.003 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [No context] [Warning] V-116-264 jobid=2242935 may have been lost because of communication problem with nbjm

 

11/21/2010 02:48:57.007 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Info] V-116-14 CLIENT <client>  POLICY <policy>  SCHED <sch_name> EXIT STATUS 25 (cannot connect on socket)

(..)

11/21/2010 02:49:15.492 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Error] V-116-85 backup of client <client> exited with status 25 (cannot connect on socket) 
 
I can see some errors in vnetd on a client, but I am not sure they are relevant.
 
 02:44:53.783 [8005] <2> process_service_socket_plus: vnetd.c.1557: vnet_read_service_file failed: 19 0x00000013

02:44:53.783 [8005] <2> process_service_socket_plus: vnetd.c.1548: service_name: inetd_bpcd 
 
 
In a dbclient log I can see the following:
 
 02:44:52.595 [5658] <4> CreateNewImage: INF - backing up File:</DGGFDF_full_12358_1_7762349491>

02:44:53.491 [5658] <2> vnet_async_connect: vnet_vnetd.c.4228: connect: async CONNECT FROM 10.1.26.21.6427 TO 10.2.190.31.13724 fd = 24

02:44:53.491 [5658] <2> logconnections: BPRD CONNECT FROM 10.1.26.21.6427 TO 10.2.190.31.13724

02:44:53.491 [5658] <4> serverResponse: initial client_read_timeout = <5400>

02:44:53.491 [5658] <4> readCommMessages: Entering readCommMessages

02:45:03.491 [5658] <4> serverResponse: read comm file:<02:44:53 Initiating backup>

02:45:03.492 [5658] <4> readCommMessages: Entering readCommMessages

02:50:28.499 [5658] <4> serverResponse: read comm file:<02:50:19 INF - Server status = 50>

02:50:28.499 [5658] <16> serverResponse: ERR - server exited with status 50: client process aborted

02:50:28.499 [5658] <16> CreateNewImage: ERR - serverResponse() failed 
 
 
Any ideas why is it so?
Marianne's picture
CreateNewImage: ERR - serverResponse() failed 

You need bprd on master to troubleshoot this. Look for incoming request from client's IP.

Check if master is able to resolve IP correctly to hostname that corresponds with name in Policy.

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

konrad's picture

Hi

 

Thank you for your sugestions.

I have looked into bprd log on the master and the request was successfully proccessed.

I fond a problem in nbpem, look at this:

 

 11/21/2010 02:44:56.876 [Application] NB 51216 nbpem 116 PID:16059 TID:4096248720 File ID:116 [jobid=-1 job_group_id=-1 client=<client> type=2 server= task=ID:0xf36d9e6c policy=<policy>] [Info] V-116-82 received start user backup request for client=<client>, policy=<policy>, schedule=<schedule>, stream #=0, client type=Oracle (4), parent jobid=-1
11/21/2010 02:48:57.003 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [No context] [Warning] V-116-264 jobid=2242935 may have been lost because of communication problem with nbjm
11/21/2010 02:48:57.007 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Info] V-116-14 CLIENT <client> POLICY <policy>  SCHED <schedule> EXIT STATUS 25 (cannot connect on socket)
11/21/2010 02:49:15.492 [Application] NB 51216 nbpem 116 PID:16059 TID:4100451216 File ID:116 [jobid=2242935 job_group_id=2242935 client=<client> type=4 server= task=ID:0xab6d120 policy=<policy>] [Error] V-116-85 backup of client <client> exited with status 25 (cannot connect on socket)
11/21/2010 02:50:26.756 [Application] NB 51216 nbpem 116 PID:16059 TID:4108856208 File ID:116 [No context] [Warning] V-116-226 received completion status 50 for jobid=2242935 but know nothing of the job, will finalize the job 

 

Do you have any idea what does it mean? The server is overloaded and cannot handle connections between nbpem and nbjm? The server statistics seems to be fine.

mpatt's picture

Konrad,

Did you ever this resolved. I have atleast 10 clients which started failing with similar error. My timeout settings on the media server are 

 

CLIENT_READ_TIMEOUT = 4800
CLIENT_CONNECT_TIMEOUT = 4800
SERVER_CONNECT_TIMEOUT = 900
 
Would be great if u can post the solution.
Yogesh9881's picture

I have atleast 10 clients whi --- is this database level backup or file level backup ?

make sure etc-host entry  in master / client and database backup scripts.

192.168.100.15   myserver01    myserver01.mydomain.com

If this post has helped you, please vote or mark as solution.

Before break-up, make sure you have a good backup.....  ;-)

Jim-90's picture

If server is running Linux - check to see it IPtables is not running.

If running make sure the data owner between source and backup target is set to the owner of the data and rman script (usaully oracle).  If its set to root then the backups will fail becasue root does not own the RMAN backup and transmission of backup data. 

Stumpr2's picture

First 3 things I check when a system level backup is good but the oracle fails:

1. name resolution for master/media/client forward AND reverse between all. I also place entries for each in each of the local hosts file.

2. the very first line of the bp.conf file must contain

SERVER = masterservername

3. timeouts as described above.

I make sure those 3 things are done befor retesting oracle backups.

VERITAS ain't it the truth?

konrad's picture

The ware a NBRB performace problems, the backup jobs initiated from the client ware not proccessed durig CLIENT_READ_TIMEOUT interval.

The solution was:

1. Increase the CLIENT_READ_TIMEOUT to 5400 to give more time

2. Do some TCP/IP tuning on master server (RHEL 5.2)

net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216 
net.ipv4.tcp_wmem = 4096 65536 16777216 
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 30000
 
3.  The errors seems to appear rarely. But finely we had to replace a Master Server with a new (powerfull) server box to give more performace

After that the problems ware gone.

SOLUTION