Status 23 on Linux FS after upgrading to 7.5.0.4

Created: 13 Feb 2013 | 5 comments

All of our Linux file system backups have started becoming unresponsive after 1TB-2TB of data has been backed up. At that point, the bpbkar process for the job terminates, while the bpbrm process remains. The amount of data backed up when this occurs isn't important; it appears to be some sort of timeout in the checkpoint recovery process, which occurs after approximately two hours regardless of backup activity from the file servers.

The backup job stops updating in NetBackup's GUI; however, opening the bpbkar log reveals the EXIT STATUS 23 error.

This doesn't just affect a single job. When one job from a policy becomes unresponsive, all jobs from the same policy are affected.

At this point, the backup jobs cannot be stopped or cancelled from either the NetBackup GUI or the command line. The only way to remove these jobs is to terminate their bpbrm processes. When these processes are terminated, the following message appears in the logs of each affected job:

       could not write checkpoint processed message to COMM_SOCK.
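
For anyone who needs to find the lingering bpbrm processes before terminating them, filtering `ps` output by elapsed time is one option. This is only a rough sketch, assuming a Linux host where `ps -eo pid,etime,comm` is available; the function names and the two-hour threshold are my own illustration and not part of NetBackup.

```python
def etime_to_seconds(etime):
    """Convert a ps elapsed-time string ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    # Pad missing leading fields (hours) with zeros.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def long_running_bpbrm(ps_output, min_seconds=2 * 60 * 60):
    """Return PIDs of bpbrm processes at least min_seconds old.

    ps_output is the text printed by `ps -eo pid,etime,comm`.
    The two-hour default is an assumption based on when the hang appears.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and any other command.
        if len(fields) != 3 or fields[2] != "bpbrm":
            continue
        pid, etime, _ = fields
        if pid.isdigit() and etime_to_seconds(etime) >= min_seconds:
            pids.append(int(pid))
    return pids
```

The output only tells you which bpbrm processes have been running for a long time, not which ones are actually hung, so check each PID against the job details before killing anything.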

These systems were upgraded to 7.5.0.4 from 7.1, and it wasn't until the upgrade that we started seeing this behavior.

The short-term solution has been to disable checkpoint recovery for all Linux policies. Is anyone else seeing this behavior?

5 Comments

RamNagalla's picture

Maybe you need to take this to Symantec support to check whether this is a known bug.

rainvilles's picture

I believe there is a bug and I am in the process of contacting support. The purpose of my post is to see if anyone else is experiencing the same issue.

Stumpr2's picture

We select checkpoint every 60 minutes and have not run into this issue, but our largest image is 0.9TB.

VERITAS ain't it the truth?

rainvilles's picture

I don't believe the amount of data backed up is triggering it, but rather the amount of time the backup has run. Depending on environmental conditions at the time of backup, we can do about 1TB-2TB in two hours from a single file server. That 2TB of data contains hundreds of thousands of files. Our checkpoints were left at the default 15 minutes which, with our large Linux file servers, may be a contributing factor in bpbrm hanging, causing bpbkar to time out and terminate.

Something must have changed between 7.1 and 7.5.0.4 in the way checkpoints are calculated or the way the bpbrm process works.
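
To put rough numbers on the figures above, here is a back-of-the-envelope sketch. It assumes decimal units (1 TB = 1,000,000 MB) and that one checkpoint is taken at every full interval; the helper names are made up for illustration.

```python
def checkpoints_taken(runtime_minutes, interval_minutes):
    """Number of full checkpoint intervals that fit in the runtime."""
    return runtime_minutes // interval_minutes


def throughput_mb_per_s(terabytes, hours):
    """Average throughput in MB/s, using 1 TB = 1,000,000 MB."""
    return terabytes * 1_000_000 / (hours * 3600)


# Two-hour window, default 15-minute interval vs. a 60-minute interval.
default_count = checkpoints_taken(120, 15)
hourly_count = checkpoints_taken(120, 60)

# 1TB-2TB moved in two hours from a single file server.
low_rate = throughput_mb_per_s(1, 2)
high_rate = throughput_mb_per_s(2, 2)
```

Under those assumptions, a job reaches the two-hour mark having taken four times as many checkpoints at the 15-minute default as it would at a 60-minute interval.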

Dyneshia's picture

As a test, remove 'checkpoint restart'.

I have seen status 23 errors caused by an I/O delay between the file system and the NetBackup client, which leads to a lack of activity and results in a timeout (exit status 23).

Do you also use multiple streams?