
Intermittent status 23 errors for MS SQL backup

Created: 30 Apr 2012 | Updated: 09 May 2012 | 11 comments
WayneLackey's picture
This issue has been solved. See solution.

I have a Windows client running MS SQL that is experiencing intermittent status 23 errors. The file system backup for this client works without issue; the problem is only with the SQL backups. Two databases are backed up. Some days both back up fine, and other days one or both fail with the status 23 error. The environment is as follows:

Master: Windows Server 2003 R2 Enterprise Ed SP2 running NBU Server 7.0

Client: Windows Server 2003 R2 Enterprise Ed SP2 running NBU 7.0 client and MS SQL Server 2000.

What logs should I enable (dbclient? others?), and what else should I look at to resolve this problem?

Thanks!

11 Comments

Marianne van den Berg's picture

A couple of questions before we get to logs:

Is there a firewall between master and media server?
Does backup fail after certain elapsed time?

Logs needed:

On client: dbclient
On media server: bpbrm and bptm
All info in Details tab of failed job.

 

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

WayneLackey's picture

No firewall between media/master server. I'm not sure about the elapsed time; last night's job took only a few minutes to fail after it was granted access to a tape drive. Job details follow:

4/30/2012 6:01:01 PM - requesting resource escsrde47-LTO4-2Drives
4/30/2012 6:01:01 PM - requesting resource escsrde47.NBU_CLIENT.MAXJOBS.usolwqrsql01
4/30/2012 6:01:01 PM - requesting resource escsrde47.NBU_POLICY.MAXJOBS.DCS_SQL
4/30/2012 6:01:01 PM - awaiting resource escsrde47-LTO4-2Drives - Maximum job count has been reached for the storage unit
4/30/2012 6:18:46 PM - awaiting resource escsrde47-LTO4-2Drives - No drives are available
4/30/2012 6:19:19 PM - awaiting resource escsrde47-LTO4-2Drives - Maximum job count has been reached for the storage unit
4/30/2012 6:30:02 PM - awaiting resource escsrde47-LTO4-2Drives - No drives are available
4/30/2012 6:31:21 PM - awaiting resource escsrde47-LTO4-2Drives - Maximum job count has been reached for the storage unit
4/30/2012 6:35:07 PM - awaiting resource escsrde47-LTO4-2Drives - No drives are available
4/30/2012 6:35:26 PM - awaiting resource escsrde47-LTO4-2Drives - Maximum job count has been reached for the storage unit
4/30/2012 6:38:25 PM - awaiting resource escsrde47-LTO4-2Drives - No drives are available
4/30/2012 6:39:45 PM - awaiting resource escsrde47-LTO4-2Drives - Maximum job count has been reached for the storage unit
4/30/2012 6:44:23 PM - granted resource escsrde47.NBU_CLIENT.MAXJOBS.usolwqrsql01
4/30/2012 6:44:23 PM - granted resource escsrde47.NBU_POLICY.MAXJOBS.DCS_SQL
4/30/2012 6:44:23 PM - granted resource 0051L4
4/30/2012 6:44:23 PM - granted resource IBM.ULT3580-HH4.004
4/30/2012 6:44:23 PM - granted resource escsrde47-LTO4-2Drives
4/30/2012 6:44:23 PM - estimated 0 Kbytes needed
4/30/2012 6:44:23 PM - started process bpbrm (1852)
4/30/2012 6:44:29 PM - Error bpbrm(pid=1852) bpcd on usolwqrsql01 exited with status 23: socket read failed  
4/30/2012 6:44:29 PM - end writing
socket read failed(23)

I have also attached the corresponding dbclient log from last night. Will have to retrieve the other logs you mentioned.

Thanks!

Attachment: dbclient043012.txt (15.75 KB)
Omar Villa's picture

Because the failure is intermittent, your network socket is probably being disconnected by a timeout. Raise the client-side timeout value to 7200 and run a test; this should cover most cases. Also, if there is a firewall in the middle, check its timeout for open ports. These timeouts are often short, and the firewall will close any connection that exceeds the limit.

 

Hope this helps.

Regards.

Omar Villa

Netbackup Expert

Winners do not win the race, just love to run.

 

WayneLackey's picture

@Omar, thanks for your suggestion. In what file would I change this timeout value?

Thanks!

Omar Villa's picture

You will find the steps at this link, but instead of opening the master server's properties, make the change under the client's properties.

http://www.symantec.com/business/support/index?page=content&id=HOWTO13869

 

 

Regards.

Marianne van den Berg's picture

The dbclient log shows about 44 minutes between the 'connect' and 'failure' entries:

18:05:55.625 [3588.3100] <2> logconnections: BPRD CONNECT FROM 172.16.37.25.3792 TO 172.16.37.47.13724
18:49:55.676 [3588.3100] <16> serverResponse: ERR - server exited with status 23: socket read failed
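
The gap between those two entries can be worked out directly from the leading timestamps. This is only a minimal Python sketch: the helper names `log_time` and `elapsed` are made up for illustration, and it assumes both lines were written on the same day and start with an HH:MM:SS.mmm timestamp, as in the extract above.

```python
from datetime import datetime, timedelta


def log_time(line: str) -> datetime:
    """Parse the leading HH:MM:SS.mmm timestamp of a debug log line."""
    stamp = line.split()[0]
    return datetime.strptime(stamp, "%H:%M:%S.%f")


def elapsed(first_line: str, last_line: str) -> timedelta:
    """Time between two log lines (assumes the same calendar day)."""
    return log_time(last_line) - log_time(first_line)


# The two dbclient entries quoted in this thread.
connect = ("18:05:55.625 [3588.3100] <2> logconnections: BPRD CONNECT "
           "FROM 172.16.37.25.3792 TO 172.16.37.47.13724")
failure = ("18:49:55.676 [3588.3100] <16> serverResponse: ERR - server "
           "exited with status 23: socket read failed")

gap = elapsed(connect, failure)
print(gap)
```

A log that rolls over midnight would need the date added before subtracting.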

If we look at the job details, the delay seems to be the result of no drives being available.
The connection between the media server and the client seems to have timed out by then:

4/30/2012 6:44:29 PM - Error bpbrm(pid=1852) bpcd on usolwqrsql01 exited with status 23: socket read failed 

Can you tell us if this 'intermittent' problem is usually seen after a long delay at the start of the job, as can be seen here?

We can see that the failure is between the media server's bpbrm process and the client's bpcd. Please get us the media server's bpbrm log and the client's bpcd log (create the log folder if it doesn't exist).

 

WayneLackey's picture

@Marianne, in this environment, it is typical for these backup jobs to wait up to an hour or more in some cases for an available tape drive, since the storage unit only has two available. I will see if I can pull up information for a SQL backup job on this client that was successful and see what kind of wait time it experienced.

The client didn't have a bpcd folder, so I created one and will check back for a log later this evening. I will also see about getting the bpbrm log from the master/media server.

WayneLackey's picture

@Marianne:

I looked back on the master server at the SQL jobs for this client. Of the ones that were successful, there was no more than 13 minutes between the time the job started and when it started writing. The remainder of the jobs experienced a longer wait (sometimes up to four hours) and failed, so there could be a correlation.
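
For anyone who wants to check the same correlation from the job details text, here is a rough Python sketch. The function names `job_time` and `queue_wait` are hypothetical, the 13-minute threshold is simply the figure observed above, and the parsing assumes the "M/D/YYYY H:MM:SS AM/PM - message" layout shown in the job details in this thread (with an English locale for the AM/PM marker).

```python
from datetime import datetime, timedelta
from typing import List

STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def job_time(line: str) -> datetime:
    """Parse the timestamp that precedes ' - ' in a job-details line."""
    stamp, _, _ = line.partition(" - ")
    return datetime.strptime(stamp.strip(), STAMP_FORMAT)


def queue_wait(lines: List[str]) -> timedelta:
    """Time from the first resource request to the start of bpbrm."""
    requested = next(l for l in lines if "requesting resource" in l)
    started = next(l for l in lines if "started process bpbrm" in l)
    return job_time(started) - job_time(requested)


# Three lines taken from the 4/30 job details posted earlier.
job = [
    "4/30/2012 6:01:01 PM - requesting resource escsrde47-LTO4-2Drives",
    "4/30/2012 6:44:23 PM - granted resource escsrde47-LTO4-2Drives",
    "4/30/2012 6:44:23 PM - started process bpbrm (1852)",
]

wait = queue_wait(job)
long_wait = wait > timedelta(minutes=13)  # threshold from the observation above
print(wait, long_wait)
```

Running this over the saved details of each job gives a quick wait-versus-result comparison, but it only shows correlation, not the cause.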

Also, there was no bpbrm folder on the master/media server, so I created the folder and will check back tomorrow morning for it.

WayneLackey's picture

I've attached the dbclient and bpcd logs from the client, and the bpbrm log from the master/media server. Job details for the failed job (one of the two dbs failed with status 23 last night) follow:

5/1/2012 6:30:55 PM - requesting resource escsrde47-LTO4-2Drives
5/1/2012 6:30:55 PM - requesting resource escsrde47.NBU_CLIENT.MAXJOBS.usolwqrsql01
5/1/2012 6:30:55 PM - requesting resource escsrde47.NBU_POLICY.MAXJOBS.DCS_SQL
5/1/2012 6:30:55 PM - awaiting resource escsrde47-LTO4-2Drives - Maximum job count has been reached for the storage unit
5/1/2012 6:36:28 PM - granted resource escsrde47.NBU_CLIENT.MAXJOBS.usolwqrsql01
5/1/2012 6:36:28 PM - granted resource escsrde47.NBU_POLICY.MAXJOBS.DCS_SQL
5/1/2012 6:36:28 PM - granted resource 0076L4
5/1/2012 6:36:28 PM - granted resource IBM.ULT3580-HH4.004
5/1/2012 6:36:28 PM - granted resource escsrde47-LTO4-2Drives
5/1/2012 6:36:28 PM - estimated 0 Kbytes needed
5/1/2012 6:36:29 PM - started process bpbrm (3484)
5/1/2012 6:36:34 PM - Error bpbrm(pid=3484) bpcd on usolwqrsql01 exited with status 23: socket read failed  
5/1/2012 6:36:34 PM - end writing
socket read failed(23)

What other information can I provide?

Thanks!

Attachments:
usolwqrsql01-bpcd-051012.txt (105.04 KB)
usolwqrsql01-dbclient-051012.txt (15.74 KB)
escsrde47-bpbrm-050112.txt (3.52 MB)
Marianne van den Berg's picture

DNS hiccup?

One connection request to the client's bpcd resolves the peer hostname without any problem, and the very next one fails to resolve it.

Extracts from client's bpcd log:

Successful connection for PID 3952:

18:35:51.865 [3952.1980] <2> logconnections: BPCD ACCEPT FROM 172.16.37.47.3353 TO 172.16.37.25.13724
18:35:51.865 [3952.1980] <2> process_requests: setup_sockopts complete
18:35:55.506 [3952.1980] <2> bpcd peer_hostname: Connection from host ESCSRDE47 (172.16.37.47) port 3353
18:35:55.506 [3952.1980] <2> bpcd valid_server: comparing 172.16.37.47 and ESCSRDE47
18:35:55.506 [3952.1980] <4> bpcd valid_server: hostname comparison succeeded

Compare with PID 3204:

18:35:52.099 [3204.1480] <2> logconnections: BPCD ACCEPT FROM 172.16.37.47.3354 TO 172.16.37.25.13724
18:35:52.099 [3204.1480] <2> process_requests: setup_sockopts complete
18:35:56.990 [3204.1480] <8> bpcd peer_hostname: gethostbyaddr failed : The requested name is valid, but no data of the requested type was found. (0)
18:35:56.990 [3204.1480] <16> bpcd peer_hostname: gethostbyaddr failed to return peer host, herrno = 0
18:35:56.990 [3204.1480] <16> process_requests: Couldn't get peer hostname
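
Picking out the failing PID by eye gets tedious in a 100 KB log, so a small filter can help. The following is a minimal Python sketch, not a NetBackup tool: `errors_by_pid` is a made-up helper, the regular expression assumes the "time [pid.tid] <severity> message" layout seen in the extracts above, and it assumes that a higher number in the angle brackets means a more severe entry (the <16> lines here are the errors).

```python
import re
from collections import defaultdict

# Assumed layout: HH:MM:SS.mmm [pid.tid] <severity> message
LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<pid>\d+)\.(?P<tid>\d+)\] "
    r"<(?P<sev>\d+)> (?P<msg>.*)$"
)


def errors_by_pid(lines, min_severity=16):
    """Group messages at or above min_severity by process ID."""
    found = defaultdict(list)
    for line in lines:
        match = LINE.match(line.strip())
        if match and int(match.group("sev")) >= min_severity:
            found[int(match.group("pid"))].append(match.group("msg"))
    return dict(found)


# Lines copied from the bpcd extracts in this post.
bpcd_lines = [
    "18:35:55.506 [3952.1980] <2> bpcd valid_server: comparing 172.16.37.47 and ESCSRDE47",
    "18:35:55.506 [3952.1980] <4> bpcd valid_server: hostname comparison succeeded",
    "18:35:52.099 [3204.1480] <2> process_requests: setup_sockopts complete",
    "18:35:56.990 [3204.1480] <16> bpcd peer_hostname: gethostbyaddr failed to return peer host, herrno = 0",
    "18:35:56.990 [3204.1480] <16> process_requests: Couldn't get peer hostname",
]

errors = errors_by_pid(bpcd_lines)
for pid, messages in errors.items():
    print(pid, messages)
```

Lines that do not match the assumed layout, such as wrapped continuation lines, are skipped silently.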

You may want to take this up with your network/DNS admins, but in the meantime you can ensure consistently successful forward and reverse lookups by adding a hosts file entry on the client for the master/media server.
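
To see whether lookups on the client are flaky, the forward and reverse lookup can be repeated a few times. This is a sketch under assumptions: `lookup_consistent` and `repeat_check` are hypothetical helpers, the resolver functions are passed in as parameters so the check can be tried with stand-ins, and hostnames are compared on their first label only, which is a simplification.

```python
import socket


def lookup_consistent(hostname, expected_ip,
                      forward=socket.gethostbyname,
                      reverse=socket.gethostbyaddr):
    """Return (forward_ok, reverse_ok) for one hostname/IP pair."""
    short = hostname.split(".")[0].lower()
    try:
        forward_ok = forward(hostname) == expected_ip
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        forward_ok = False
    try:
        name, aliases, _ = reverse(expected_ip)
        candidates = [name, *aliases]
        # Simplification: compare only the first label, ignoring case.
        reverse_ok = any(c.split(".")[0].lower() == short for c in candidates)
    except OSError:
        reverse_ok = False
    return forward_ok, reverse_ok


def repeat_check(hostname, expected_ip, attempts=5, **resolvers):
    """Run the check several times to catch intermittent failures."""
    return [lookup_consistent(hostname, expected_ip, **resolvers)
            for _ in range(attempts)]
```

On the client, something like `repeat_check("ESCSRDE47", "172.16.37.47")` (name and address taken from the bpcd log above) should return `(True, True)` every time once the hosts entry is in place. A hosts entry of the form `172.16.37.47 ESCSRDE47` is the usual shape, although the exact names to list depend on how the server is configured in NetBackup.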


SOLUTION
WayneLackey's picture

I added the master/media server's information to the local hosts file on the client; we'll see what happens tonight. Thanks for catching that.

Wayne
