Backup jobs are not scheduled and existing backups hang on very busy systems when a large number of messages are written to the netbackup/db/error log files.
Article: TECH199637 | Created: 2012-11-12 | Updated: 2013-03-11 | Article URL: http://www.symantec.com/docs/TECH199637
At times, on very busy systems, the number of updates written to the netbackup/db/error/log_########## file may cause backup jobs to hang. Additionally, scheduled backup jobs will not kick off until hours after the backup window opens.
If there are numerous TRV (trivial) messages being sent to the master server, the media server bpbrm log will show that it takes a long time to satisfy these calls.
Example of a call that took about 17 seconds to complete, based on the timestamps shown [from the bpbrm log]:
07:58:47.940  <2> logconnections: BPDBM CONNECT FROM 10.0.0.1.37519 TO 10.1.1.2.1556 fd = 5
07:58:47.945  <2> db_end: Need to collect reply
[about 17 seconds later]
07:59:05.188  <4> bpbrm main: from client any_client.domain.com: TRV - /var/spool/sockets/pwgr/client1234 is a socket special file. Skipping.
07:59:05.189  <2> vnet_pbxConnect: pbxConnectEx Succeeded
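To locate stalls like this one in a long bpbrm log, the gaps between consecutive timestamps can be measured. The following is a minimal sketch, not a Symantec tool. It assumes every log entry starts with an HH:MM:SS.mmm timestamp as in the excerpt above; the function name find_gaps and the 5-second default threshold are hypothetical choices made for illustration.

```python
import re
from datetime import datetime, timedelta

# Leading "HH:MM:SS.mmm" timestamp, as seen in the bpbrm excerpt.
TIMESTAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s")


def find_gaps(lines, threshold_seconds=5.0):
    """Return (gap_seconds, previous_line, current_line) for each large gap.

    threshold_seconds is an arbitrary illustrative default, not a
    value recommended by the vendor.
    """
    gaps = []
    previous_time = None
    previous_line = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip lines that do not start with a timestamp
        current_time = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        if previous_time is not None:
            delta = current_time - previous_time
            if delta < timedelta(0):
                delta += timedelta(days=1)  # timestamps wrapped past midnight
            if delta.total_seconds() >= threshold_seconds:
                gaps.append((delta.total_seconds(), previous_line, line))
        previous_time, previous_line = current_time, line
    return gaps
```

Passing an open file handle for the media server's bpbrm log to find_gaps returns one tuple per stall, with the line before and the line after each gap, which makes it easy to see whether the slow calls are the TRV messages sent to the master server.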
When these delays in writing to the error log occur, the nbpem log at the default logging levels (DebugLevel=1 and DiagnosticLevel=6) will show various jobs being submitted to nbjm for processing:
10/01/12 18:01:28.390 [jobid=1234 job_group_id=1234 client= CLIENTA type=4 server= task=ID:0x2aaab853b698 CTX:0x2aaab853b698 policy=POLICY1] [BaseJob::run] jobid=1234 submitted to nbjm for processing
Review of the nbjm logs from the same time period will show that the jobid is not picked up for a long period of time, possibly hours:
10/01/12 19:54:49.017 [jobid=1234 job_group_id=1234 client=CLIENTA Changing job state from PJS_SUBMITTING (1) to PJS_SUBMITTED (3)(../RecoverableJob.cpp:1285)
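The lag between the nbpem submission and the nbjm state change can be measured per jobid. The sketch below is illustrative only. It assumes the log lines are formatted exactly as in the two excerpts above (a timestamp followed by "[jobid=N") and that the date is written month/day/year; the function names parse_entry and submission_delays are hypothetical.

```python
import re
from datetime import datetime

# "MM/DD/YY HH:MM:SS.mmm [jobid=N ..." as in the nbpem and nbjm excerpts.
# The month/day/year field order is an assumption.
ENTRY = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[jobid=(\d+)\b")


def parse_entry(line):
    """Return (jobid, timestamp) for a matching line, or None."""
    match = ENTRY.match(line)
    if not match:
        return None
    stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S.%f")
    return int(match.group(2)), stamp


def submission_delays(nbpem_lines, nbjm_lines):
    """Map jobid -> time between nbpem submission and nbjm pickup."""
    submitted = {}
    for line in nbpem_lines:
        if "submitted to nbjm for processing" not in line:
            continue
        parsed = parse_entry(line)
        if parsed:
            submitted.setdefault(parsed[0], parsed[1])  # keep first submission

    delays = {}
    for line in nbjm_lines:
        if "PJS_SUBMITTING" not in line:
            continue
        parsed = parse_entry(line)
        if parsed and parsed[0] in submitted and parsed[0] not in delays:
            delays[parsed[0]] = parsed[1] - submitted[parsed[0]]
    return delays
```

For the two excerpts above, jobid 1234 was submitted at 18:01:28.390 and changed state at 19:54:49.017, a delay of 1 hour 53 minutes. Delays of this size across many jobids point to the contention described in this article rather than to a problem with an individual policy.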
For a raw count of the TRV messages being written to netbackup/db/error, the following command counts the number of messages of each type written to a single log file or to multiple log files:
# perl -ne 'print "$&\n" if m/(TRV|ERR|WRN|FTL) -/' /usr/openv/netbackup/logs/bpdbm/log.mmddyy | sort | uniq -c | sort -n
The output should show how many trivial, warning and error messages were written to the log (or logs) specified in the above command. Example of good output:
2 ERR -
3 WRN -
192 TRV -
Counts in the hundreds of thousands or millions may indicate that the trivial (TRV) messages are the problem.
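Where perl is not available, the same tally can be produced with a short script. This is a sketch that mirrors the pattern used in the command above (one match counted per line); the function names count_severities and count_in_files are hypothetical, and the caller supplies the log file paths.

```python
import re
from collections import Counter

# Same pattern as the perl one-liner: a severity tag followed by " -".
SEVERITY = re.compile(r"(TRV|ERR|WRN|FTL) -")


def count_severities(lines):
    """Count the first severity tag found on each line."""
    counts = Counter()
    for line in lines:
        match = SEVERITY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_in_files(paths):
    """Aggregate severity counts over one or more log files."""
    total = Counter()
    for path in paths:
        # errors="replace" keeps the scan going if a log has stray bytes.
        with open(path, "r", errors="replace") as handle:
            total.update(count_severities(handle))
    return total
```

Calling count_in_files with a list of log paths returns a Counter keyed by TRV, ERR, WRN and FTL, so the TRV total can be compared directly against the other message types.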
Alternatively, nbjm also has to write data to the netbackup/db/error logs. An O_SYNC call is made, which requires the lock on the netbackup/db/error/errordb.lock file. If a large number of calls are made to bpjobd, the requirement to wait for the file lock may cause similar delays.
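The general pattern behind this delay can be illustrated with a short sketch. It is not NetBackup's implementation: it assumes a Unix system, uses an advisory flock purely for illustration (the locking primitive NetBackup uses is not stated in this article), and the function name append_with_lock is hypothetical. It shows why writers queue up: each one must hold the exclusive lock while a synchronous (O_SYNC) write completes.

```python
import fcntl
import os


def append_with_lock(log_path, lock_path, message):
    """Append one line under an exclusive lock using a synchronous write.

    Illustrative only: every caller is serialized behind the lock, and
    the O_SYNC write does not return until the data reaches stable
    storage, so slow storage stalls all waiting writers.
    """
    lock_fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until every other holder of the lock has released it.
        fcntl.flock(lock_fd, fcntl.LOCK_EX)
        log_fd = os.open(
            log_path,
            os.O_WRONLY | os.O_APPEND | os.O_CREAT | os.O_SYNC,
            0o644,
        )
        try:
            os.write(log_fd, (message + "\n").encode("utf-8"))
        finally:
            os.close(log_fd)
    finally:
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        os.close(lock_fd)
```

With many processes calling a routine like this at once, total throughput is bounded by the time of one synchronous write, which is consistent with the delays described above when message volume is very high.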
This issue affects all versions of NetBackup 7.1.0.x and all versions of 7.5.0.x through 7.5.0.4.
This problem is caused by errordb.lock contention and O_SYNC performance when writing to the netbackup/db/error logs. Two different types of updates can cause this behavior:
- The master server being inundated by trivial (TRV) messages from the bpbrm process on the media server(s)
- A large number of updates being written by nbjm to the netbackup/db/error logs
The formal resolution for this issue (Etrack 2957929) is included in the following release:
- NetBackup 7.5 Maintenance Release 5 (7.5.0.5)
NetBackup 7.5.0.5 is now available - please access the Related Article linked below for download and README information.
If NetBackup cannot immediately be upgraded to a fixed version, the following two steps may work around this issue:
Step 1: In 18.104.22.168 and later versions of NetBackup, the sending of trivial messages by bpbrm can be disabled by adding the following touch file on the media servers:
*** It may be necessary to apply the 22.214.171.124 binary from Etrack 2957751 to completely eliminate all trivial messages.
Step 2: If the problem persists after eliminating the TRV messages, a binary can be obtained from Symantec NetBackup Technical Support, referencing Etrack 2967780. The binary removes the requirement for O_SYNC during the updates to netbackup/db/error.
Numerous backups failing due to /usr/openv/netbackup/db/error/errordb.lock contention and O_SYNC performance