Status 23 on Linux FS after upgrading to 126.96.36.199
All of our Linux file system backups have started becoming unresponsive after 1TB-2TB of data has been backed up. At that point, the bpbkar process for the job terminates, while the bpbrm process remains. The amount of data backed up doesn't seem to matter: the hang appears to be some sort of timeout in the checkpoint recovery process, and it occurs after approximately 2 hours regardless of backup activity from the file servers.
The backup job stops updating in the NetBackup GUI; however, opening the bpbkar log reveals the EXIT STATUS 23 error.
This doesn't just affect a single job. When one job from a policy becomes unresponsive, all jobs from the same policy are affected.
At this point, the backup jobs cannot be stopped or cancelled from either the NetBackup GUI or the command line. The only way to remove these jobs is to terminate their bpbrm processes. When these processes are terminated, the following message appears in the logs of each affected job:
could not write checkpoint processed message to COMM_SOCK.
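For reference, here is a rough sketch of how the leftover bpbrm processes can be found before terminating them. It assumes a Linux media server with the standard procps `ps`; the helper name `list_bpbrm_pids` is just something made up for this post. Note that it matches every bpbrm process on the host, including those belonging to healthy jobs, so the list needs to be reviewed before anything is killed.

```shell
#!/bin/sh
# Sketch only: list PIDs of bpbrm processes so they can be reviewed.
# list_bpbrm_pids is a hypothetical helper, not a NetBackup command.

list_bpbrm_pids() {
    # Reads "PID COMMAND" lines on stdin and prints the PID of every
    # line whose command name is exactly bpbrm.
    awk '$2 == "bpbrm" { print $1 }'
}

# Dry run: show the candidate PIDs without touching them.
ps -eo pid=,comm= | list_bpbrm_pids

# After confirming that these belong to the hung jobs, they could be
# terminated with something like the following (left commented out):
# ps -eo pid=,comm= | list_bpbrm_pids | xargs -r kill
```

The kill line is commented out on purpose, since it would also take down bpbrm processes for jobs that are still running normally.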
These systems were upgraded from 7.1 to 188.8.131.52, and we only started seeing this behavior after the upgrade.
The short-term solution has been to disable checkpoint recovery for all Linux policies. Is anyone else seeing this behavior?
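For anyone who wants to try the same workaround from the command line instead of the GUI, below is a minimal sketch. It assumes the default install path for `bpplinfo` and that its `-chkpt 0` option (which the command reference describes as checkpoint restart) corresponds to the setting we disabled; the policy names are placeholders. The script only prints the commands so they can be checked before being run on the master server.

```shell
#!/bin/sh
# Sketch only: print the commands that would turn off checkpointing
# for a set of policies. Nothing is modified by this script.

# Placeholder policy names; replace with your own Linux policies.
POLICIES="linux_fs_01 linux_fs_02"

# Assumed default location of bpplinfo on a UNIX/Linux master server.
BPPLINFO=/usr/openv/netbackup/bin/admincmd/bpplinfo

disable_chkpt_cmd() {
    # Builds the command line for one policy and prints it.
    echo "$BPPLINFO $1 -modify -chkpt 0"
}

for policy in $POLICIES; do
    disable_chkpt_cmd "$policy"
done
```

Printing first makes it easy to check the policy list before anything is changed; the output can then be run by hand or piped to `sh`.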