Backup jobs are not scheduled and running backups hang on very busy systems when a large number of messages are written to the netbackup/db/error log files.

Article:TECH199637  |  Created: 2012-11-12  |  Updated: 2013-03-11  |  Article URL http://www.symantec.com/docs/TECH199637
Article Type
Technical Solution


Issue



On very busy systems, the volume of updates written to the netbackup/db/error/log_########## file can cause running backup jobs to hang. In addition, scheduled backup jobs may not start until hours after the backup window opens.


Error



If there are numerous TRV (trivial) messages being sent to the master server, the media server bpbrm log will show that it takes a long time to satisfy these calls.

Example of a call that took about 17 seconds to complete [from the bpbrm log]:

07:58:47.940 [8691] <2> logconnections: BPDBM CONNECT FROM 10.0.0.1.37519 TO 10.1.1.2.1556 fd = 5
07:58:47.945 [8691] <2> db_end: Need to collect reply

[about 17 seconds later]
07:59:05.188 [8691] <4> bpbrm main: from client any_client.domain.com: TRV - /var/spool/sockets/pwgr/client1234 is a socket special file. Skipping.
07:59:05.189 [8691] <2> vnet_pbxConnect: pbxConnectEx Succeeded
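
To locate such delays without reading the whole log, the gaps between consecutive entries can be computed. The following is a minimal sketch, not a supported tool: the function name find_log_gaps is hypothetical, it assumes every entry starts with an HH:MM:SS.mmm timestamp as in the excerpt above, and it does not handle a log that runs past midnight.

```shell
# Report log entries that follow a pause of at least N seconds.
# Assumption: each entry starts with HH:MM:SS.mmm; other lines are ignored.
find_log_gaps() {
    # $1 = log file, $2 = minimum gap in seconds (default 10)
    awk -v min="${2:-10}" '
        /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9]+ / {
            split($1, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
            if (seen && now - prev >= min)
                printf "%.3f s gap before: %s\n", now - prev, $0
            prev = now
            seen = 1
        }
    ' "$1"
}
```

For example, find_log_gaps /path/to/bpbrm/log 10 prints each entry that was written 10 or more seconds after the previous one, together with the length of the pause.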

When these delays in writing to the error log occur, the nbpem log at the default logging levels (DebugLevel=1 and DiagnosticLevel=6) will show various jobs being submitted to nbjm for processing:

10/01/12 18:01:28.390 [jobid=1234 job_group_id=1234 client= CLIENTA type=4 server= task=ID:0x2aaab853b698 CTX:0x2aaab853b698 policy=POLICY1] [BaseJob::run] jobid=1234 submitted to nbjm for processing

Review of the nbjm logs from the same time period will show that the jobid is not picked up for a long period of time, possibly hours:
10/01/12 19:54:49.017  [jobid=1234 job_group_id=1234 client=CLIENTA Changing job state from PJS_SUBMITTING (1) to PJS_SUBMITTED (3)(../RecoverableJob.cpp:1285)
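
The time a job spends waiting between nbpem and nbjm can be estimated from the two log entries. The sketch below assumes the relevant nbpem and nbjm entries have already been saved to plain-text files (the function name job_delay and the file names are hypothetical), that each line starts with a date followed by an HH:MM:SS.mmm time as shown above, and that all entries fall on the same day.

```shell
# Show how long a job ID waited between its first and last log entry.
# Assumptions: field 2 of each line is HH:MM:SS.mmm; all entries are from one day.
job_delay() {
    # $1 = job ID, remaining arguments = text files, earliest log first
    job_id="$1"; shift
    grep -h "jobid=${job_id}[^0-9]" "$@" | awk '
        { split($2, t, ":"); s = t[1] * 3600 + t[2] * 60 + t[3] }
        NR == 1 { first = s }
        END {
            if (NR)
                printf "%d lines, %.0f seconds between first and last\n", NR, s - first
        }
    '
}
```

For example, job_delay 1234 nbpem.txt nbjm.txt reports the number of matching lines and the elapsed seconds between the submission in nbpem and the last matching entry in nbjm.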

For a raw count of the TRV messages that are being written to netbackup/db/error, the following command can be run to count the number of messages written to a single log file or to multiple log files:

# perl -ne 'print "$&\n" if m/(TRV|ERR|WRN|FTL) -/' /usr/openv/netbackup/logs/bpdbm/log.mmddyy | sort | uniq -c | sort -n

The output should show how many trivial, warning and error messages were written to the log (or logs) specified in the above command.  Example of good output:

 2 ERR -
 3 WRN -
192 TRV -

Counts in the hundreds of thousands or millions of messages may indicate that the trivial (TRV) messages are the problem.
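
When several log files are checked at once, a per-file breakdown can show on which days the TRV volume spiked. This is a minimal sketch using only grep and sort; the function name trv_counts is hypothetical, and it counts lines containing "TRV -" rather than individual occurrences, which matches what the perl command above reports.

```shell
# Print "file:count" of lines containing "TRV -" for each file, smallest count first.
trv_counts() {
    # /dev/null is added so grep prefixes the file name even when only
    # one log file is given; its own (zero) count is then filtered out.
    grep -c 'TRV -' "$@" /dev/null | grep -v '^/dev/null:' | sort -t: -k2,2n
}
```

For example, trv_counts /usr/openv/netbackup/logs/bpdbm/log.* lists every log file in that directory with its TRV line count, with the busiest file last.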

Alternatively, nbjm also has to write data to the netbackup/db/error logs. Each write is made with O_SYNC, which requires holding the lock on the netbackup/db/error/errordb.lock file. If a large number of calls are made to bpjobd, the wait for the file lock may cause similar delays.


Environment



This issue affects all versions of NetBackup 7.1.0.x and all versions of 7.5.0.x through 7.5.0.4.
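
The version ranges above can be expressed as a simple check when many servers need to be reviewed. The sketch below is illustrative only: the function name is_affected_version is hypothetical, it expects a four-part version string such as 7.5.0.4, and it treats "7.5.0.x through 7.5.0.4" as 7.5.0.1 to 7.5.0.4, which is an assumption. Shorter strings such as 7.1 or 7.5 are not matched.

```shell
# Return 0 if the given four-part NetBackup version falls in the affected ranges.
# Assumption: "7.5.0.x through 7.5.0.4" means 7.5.0.1 to 7.5.0.4.
is_affected_version() {
    case "$1" in
        7.1.0.*)     return 0 ;;   # all 7.1.0.x releases
        7.5.0.[1-4]) return 0 ;;   # 7.5.0.1 through 7.5.0.4
        *)           return 1 ;;   # 7.5.0.5 and anything else
    esac
}
```

For example, is_affected_version 7.5.0.4 && echo "affected" prints "affected", while the same test with 7.5.0.5 prints nothing.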


Cause



This problem is caused by errordb.lock contention and O_SYNC performance when writing to the netbackup/db/error logs.  Two different types of updates can cause this behavior:

  • The master server being inundated by trivial (TRV) messages from the bpbrm process on the media server(s)
  • A large number of updates being written by nbjm to the netbackup/db/error logs

Solution



The formal resolution for this issue (Etrack 2957929) is included in the following release:

  •  NetBackup 7.5 Maintenance Release 5 (7.5.0.5)

NetBackup 7.5.0.5 is now available - please access the Related Article linked below for download and README information.

Workaround:
If NetBackup cannot immediately be upgraded to a fixed version, up to two steps may be needed to work around this issue:

Step 1:  In 7.1.0.4 and later versions of NetBackup, the sending of trivial messages by bpbrm can be disabled by adding the following touch file on the media servers:

UNIX:  /usr/openv/netbackup/bin/BRM_IGNORE_TRV_MESSAGES

Windows:  <install_path>\VERITAS\NetBackup\bin\BRM_IGNORE_TRV_MESSAGES

*** It may be necessary to apply the 7.1.0.4 binary from Etrack 2957751 to completely eliminate all trivial messages. 
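
On a UNIX media server, the touch file from Step 1 can be created as sketched below. The function name enable_trv_suppression is hypothetical; the directory defaults to the UNIX path given above and can be passed as an argument for a non-default location (an assumption, not something this article documents).

```shell
# Create the BRM_IGNORE_TRV_MESSAGES touch file on a UNIX media server.
enable_trv_suppression() {
    # $1 = NetBackup bin directory (defaults to the path given in Step 1)
    bin_dir="${1:-/usr/openv/netbackup/bin}"
    if [ -d "$bin_dir" ]; then
        # An empty file is sufficient; list it afterwards to confirm it exists.
        touch "$bin_dir/BRM_IGNORE_TRV_MESSAGES" &&
            ls -l "$bin_dir/BRM_IGNORE_TRV_MESSAGES"
    else
        echo "NetBackup bin directory not found: $bin_dir" >&2
        return 1
    fi
}
```

Run enable_trv_suppression on each media server. Removing the file reverses the change.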

Step 2:  If the problem persists after eliminating the TRV messages, a binary can be obtained from Symantec NetBackup Technical Support, referencing Etrack 2967780.   The binary removes the requirement for O_SYNC during the updates to netbackup/db/error.


Supplemental Materials

Source: ETrack
Value: 2957929
Description: Numerous backups failing due to /usr/openv/netbackup/db/error/errordb.lock contention and O_SYNC performance



