
Backup failing with status code 13 (only for the / file system)

Created: 11 Nov 2013 • Updated: 02 Dec 2013 | 25 comments
This issue has been solved. See solution.
The backup is failing with status code 13 only for the / file system. Other streams are completing successfully.
 
Client server:
OS: Linux 2.6.18-348.1.1.el5
NBU version: 7.5.0.5

Media server:
OS: Linux 2.6.18-194.3.1.el5
NBU version: 7.5.0.5

Master server:
OS: Linux 2.6.18-308.4.1.el5
NBU version: 7.5.0.5
 
Other inputs:
- The backup jobs fail and leave the bpbkar processes in a hung state
- On the media server, CLIENT_READ_TIMEOUT and CLIENT_CONNECT_TIMEOUT are set to 9000 seconds

Nagalla's picture

Check whether the df -k command works on the client and returns to the command prompt.

If it does not, first ask the system admin to fix the df -k issue, then retry the backup.
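
A minimal sketch of how that check could be scripted so that a hanging command does not block the terminal. The helper name probe_cmd and the time limits are assumptions for illustration, not part of NetBackup or the OS. It polls the background process instead of waiting on it, because a process blocked on a stale mount may ignore signals.

```shell
# probe_cmd LIMIT CMD [ARGS...]
# Hypothetical helper: run CMD in the background and report whether it
# finished within LIMIT seconds. Output of CMD itself is discarded.
probe_cmd() (
    limit=$1
    shift
    "$@" >/dev/null 2>&1 &
    pid=$!
    elapsed=0
    # kill -0 only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$limit" ]; then
            echo "HUNG after ${limit}s: $*"
            # Best effort only: a process in uninterruptible sleep ignores this.
            kill "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "OK: $*"
)
```

For example, `probe_cmd 30 df -k` followed by `probe_cmd 30 ls -l /` would show which of the two commands fails to return within 30 seconds.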

Nick J's picture

The command works fine and lists a lot of file systems.

Nick J's picture

I can see the following error in the detailed job status:

 

11/8/2013 6:57:23 PM - Error bpbrm(pid=17054) socket read failed: errno = 62 - Timer expired    
11/8/2013 6:57:23 PM - Info bpbrm(pid=16069) sending message to media manager: STOP BACKUP xxxx_1383959537     
11/8/2013 6:57:24 PM - Info bpbrm(pid=16069) media manager for backup id xxxx_1383959537 exited with status 150: termination requested by administrator
11/8/2013 7:04:42 PM - end writing; write time: 01:00:01
file read failed(13)
Stumpr2's picture

Sounds like a stale mount problem. Reboot, or try Joe's method:

How to fix stale NFS mounts on linux without rebooting
http://joelinoff.com/blog/?p=356

 

See, NetBackup has no issues, as it is backing up the defined streams with no problems. But when root or " / " is used, the backup fails. All NetBackup is trying to do is traverse the directories, and it can't do that because of a stale file or something wrong in the file systems. This isn't a backup problem. It is a file system problem.

VERITAS ain't it the truth?

SOLUTION
Nagalla's picture

Set the client verbosity to 5 and make sure the bpbkar log directory exists at /usr/openv/netbackup/logs/bpbkar on the client.

Then trigger a test backup for the / file system only

and attach the log file to this post.
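
A rough sketch of the preparation step on the client. The function name prepare_bpbkar_logging is hypothetical, and it assumes the verbosity is controlled by a VERBOSE entry in bp.conf under the install directory; it only creates the log directory and shows the current entry, it does not change bp.conf.

```shell
# Hypothetical helper: create the bpbkar log directory and show the
# current VERBOSE entry. The optional argument is the NetBackup base
# directory (default /usr/openv/netbackup).
prepare_bpbkar_logging() (
    base=${1:-/usr/openv/netbackup}
    mkdir -p "$base/logs/bpbkar"
    # Print the VERBOSE line if present, otherwise say that it is missing.
    grep -i '^VERBOSE' "$base/bp.conf" 2>/dev/null \
        || echo "no VERBOSE entry in $base/bp.conf"
)
```

Run as root on the client with no argument, it should print the existing VERBOSE line so you can confirm it reads 5 before starting the test backup.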

 

Mark_Solutions's picture

It also sounds like a keep-alive setting has been breached. Does this happen after a 1- or 2-hour process?

If 1 hour, it could be a firewall causing it; if 2 hours, it could be the operating system, which needs tuning to stop it closing the connection after 2 hours.
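
To check the operating system side, the Linux TCP keepalive settings can be read from /proc. The sketch below assumes the standard Linux sysctl files; show_keepalive is a hypothetical helper name, and the directory argument exists only so the function can be pointed at test data. The Linux default for tcp_keepalive_time is 7200 seconds, which matches the 2-hour case.

```shell
# Hypothetical helper: print the kernel TCP keepalive settings and the
# idle time, in minutes, before the first keepalive probe is sent.
show_keepalive() (
    dir=${1:-/proc/sys/net/ipv4}
    idle=$(cat "$dir/tcp_keepalive_time")
    intvl=$(cat "$dir/tcp_keepalive_intvl")
    probes=$(cat "$dir/tcp_keepalive_probes")
    echo "tcp_keepalive_time=${idle}s intvl=${intvl}s probes=${probes}"
    echo "first probe after $((idle / 60)) minutes idle"
)
```

Running `show_keepalive` on the client, media server and master server makes it easy to compare the values with any idle timeout configured on a firewall in between.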

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someone's advice has solved your issue - and please bring back the Thumbs Up!!

Nick J's picture

I have attached the bpbkar log at a high verbose level.

The backup stays active for one hour (without writing any data) and then fails with status code 13.

Attachment: bpbkar log.zip (101.43 KB)
Nagalla's picture

It looks like the client has a lot of NFS file systems.

Are you planning to back up the NFS mount points as well?

Is "Follow NFS" set even for the test backup of the / file system?

Nick J's picture

As planned, NFS is not selected in the NBU policy or in the test backup run.

Mark_Solutions's picture

Is this a full or incremental backup?

Sounds like a timeout (or keep-alive) issue. Why it should take that long to scan the system is another matter, unless you have an exclusion along the lines of *.pst, which would force it to read the entire file system looking for pst files before it could get going on a backup.

It is worth checking the exclusions for wildcards, as well as the firewall and keep-alive settings.


Nick J's picture

Thank you all for the quick replies.

It is a full backup, and I have attached the latest bpbkar log. No exclude list is set for this client server.

Can you share your findings from the latest log so that I can take them to the respective server admin teams?

 

Attachment: 111113_00001.zip (3 KB)
Nagalla's picture

What about the output of the du -sh /* command? Does it complete successfully, and how long does it take?

Also check ls -l /.

If those commands do not return to the prompt the way df -k did, a reboot with an fsck check would be another option.

SOLUTION
Marianne's picture

11/8/2013 6:57:23 PM - Error bpbrm(pid=17054) socket read failed: errno = 62 - Timer expired    

What is before this entry?

Please show us the policy for this client:
/usr/openv/netbackup/bin/admincmd/bppllist ENG-UNX-FRI_Full-smb1 -U

as well as output of 'df -h' on the client.

Have you tried the bpbkar test on the client?
Logging level of 3 in client's bp.conf will be fine.

Please rename the current bpbkar log on the client, then run the bpbkar test as follows:

/usr/openv/netbackup/bin/bpbkar -nocont -dt 0 -nofileinfo -nokeepalives / > /dev/null
 
You can monitor bpbkar log in another window with :
tail -f <log-name>
 
Let us know if it gets stuck and at which point.

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

Nick J's picture
The 'du -sh /*' and 'ls -l /' commands have not displayed any output for 2.5 hours.
 
Policy list - Attached with this post.
 
Output of 'df -h' - Attached with this post.
 
The bpbkar test shows no further progress. I have attached the latest bpbkar log for reference (it has remained unchanged for 2.5 hours).
 
Attachments:
bpbkar-111213_00001.zip (3.75 KB)
bppllist.txt (3.98 KB)
df -h.txt (10.53 KB)
Marianne's picture

The 'du -sh /*' and 'ls -l /' commands have not displayed any output for 2.5 hours.

This is proof that the problem is at OS level, not NBU.
NBU needs similar information in order to back up the / file system.

Have you logged a call with OS support team?


SOLUTION
Nick J's picture

Thank you for the quick turnaround. I am working along with the OS admin team.

I will keep you posted.

Nicolai's picture

I agree with Marianne's observation.

The issue is a stale NFS mount (a Unix admin will know what this is). NetBackup does a "taste" of every mount point to see what type it is. However, a stale NFS mount never replies to the query, and NetBackup then hangs. Quite likely you can't kill the "du -sh", not even with signal 9.
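
A similar "taste" can be sketched outside NetBackup to find which mount point never answers. This is an illustration only, not how bpbkar does it internally: the helper names nfs_mounts, check_mount and check_all_nfs are hypothetical, and the 5-second limit is an arbitrary assumption. Mount points are read from /proc/mounts and each one is probed with stat in the background, so a stale mount cannot hang the script itself.

```shell
# List NFS mount points from a mounts table (default /proc/mounts).
nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 }' "${1:-/proc/mounts}"
}

# check_mount PATH [LIMIT]
# Stat PATH in the background and report whether it answered in time.
# "answered" also covers a quick error reply; only silence is flagged.
check_mount() (
    path=$1
    limit=${2:-5}
    stat "$path" >/dev/null 2>&1 &
    pid=$!
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "no answer after ${limit}s: $path"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "answered: $path"
)

# Probe every NFS mount point in turn.
check_all_nfs() {
    nfs_mounts "$@" | while read -r mp; do
        check_mount "$mp" 5
    done
}
```

Running `check_all_nfs` as root on the client should print one line per NFS mount; any line starting with "no answer" is a candidate stale mount to hand to the Unix admins.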

Best Regards

Nicolai

 

Assumption is the mother of all mess ups.

If this post answered your question - Please mark as a solution.

Stumpr2's picture

LOL I don't care if Marianne gets marked as a solution since she posts so many times and does not get recognition, but I identified the problem and the solution a little over an hour after Nick made the initial post. I think it just got lost in all the requests for logs and outputs from commands :-)


Marianne's picture

LOL!

Maybe a split solution between Bob and Nagalla?
Nagalla was asking for 'ls' and 'du' output that was proof that NBU is not to blame...


Nicolai's picture

Getting off-topic friends ...

I removed the recommendation when I re-read the thread and saw that Nagalla was on the right track from the beginning.


Nick J's picture
Hi All,
 
The OS admin team did some housekeeping to free up space in the /var/log file system, but the backup is still failing with status code 13.
 
  • Unable to run the commands ls -l /* and du -sh /*
  • bpbkar jobs go into a hung state on the client machine
  • File descriptor status:

# cat /proc/sys/fs/file-nr
12240   0       500000

Currently open file descriptors: 12240
Free allocated file descriptors: 0
Maximum file descriptors for the whole system: 500000

  • Lots of errors are recorded in the /var/log/messages file, as shown below:
 Nov 13 03:24:18 smb1 automount[3820]: tree_make_mnt_tree:668: setmntent: Too many open files
 Nov 13 03:24:24 smb1 automount[3820]: tree_make_mnt_tree:668: setmntent: Too many open files
 Nov 13 03:25:22 smb1 automount[3820]: get_mnt_list:226: setmntent: Too many open files
 Nov 13 03:25:22 smb1 last message repeated 3 times
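
The numbers above can be put side by side with a small sketch. The helper names fd_usage and open_fds are hypothetical. Note that 12240 of 500000 is far from the system-wide limit, and "Too many open files" is normally the per-process error (EMFILE) rather than the system-wide one, so it may be worth counting the descriptors held by the automount process itself (PID 3820 in the log); that reading is an assumption to verify with the OS team.

```shell
# Summarise a file-nr style line: allocated, free, and maximum handles.
# The optional argument defaults to /proc/sys/fs/file-nr.
fd_usage() (
    read -r allocated free max < "${1:-/proc/sys/fs/file-nr}"
    echo "allocated=$allocated free=$free max=$max used=$((allocated * 100 / max))%"
)

# Count the descriptors currently open by one process.
# Needs root to inspect processes owned by other users.
open_fds() {
    ls "/proc/$1/fd" | wc -l
}
```

On the client, `fd_usage` followed by `open_fds 3820` would show whether the automount daemon is close to its own open-file limit even though the system as a whole is not.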
 
I agree the issue is not with NBU, but I need evidence to prove that it is OS related. Please let me know if any other log files are required.
 
Attached the Job Detailed Status, bpbkar, bpcd and /var/log/messages files for further analysis.
Attachments:
Job Detailed Status - 13Nov2013.txt (4.05 KB)
bpbkar log - 111313_00001.zip (912.16 KB)
bpcd log - 111313_00001.zip (24.46 KB)
var log - messages file.zip (47.41 KB)
Marianne's picture

Not sure how much more evidence you need?

  • Unable to run the commands ls -l /* and du -sh /*

tree_make_mnt_tree:668: setmntent: Too many open files

 

Surely OS admins understand that the above has nothing to do with NBU and that these issues will cause other applications to fail as well?

 

Just to prove that any kind of backup will fail, ask the admin to back up / to /dev/null using an OS backup tool such as dump, cpio or tar.
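
A minimal sketch of such a test with tar, assuming GNU tar on the client. The helper name read_tree is hypothetical. The archive is piped through wc rather than written straight to /dev/null, because GNU tar skips reading file contents when the archive file itself is /dev/null; the byte count printed at the end shows that data really was read.

```shell
# Hypothetical helper: read everything under a directory, staying on
# one file system, discard the data, and print how many bytes tar
# produced. tar's own messages are suppressed.
read_tree() {
    tar --one-file-system -cf - "$1" 2>/dev/null | wc -c
}
```

Running `read_tree /` as root with NetBackup out of the picture should hang in the same way if the file system problem is real, which is the evidence the OS admins are asking for.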


Stumpr2's picture

NetBackup as a diagnostic tool.
If there is an OS problem, a network problem or a file system problem, more often than not the first indication is a failed backup. It's not the backup application's fault. NetBackup is just the messenger.

You want proof that NetBackup is not the culprit?
Then uninstall NetBackup and have the admin run as root:

ls -alr /

Ok, I better stop now.


Nick J's picture

With the information posted in this forum I have strongly advised my SysAdmin team to work on this issue.

Thanks all for the time and help.