
2008 R2 Windows client ignores exclusions for scheduled jobs

Created: 14 Feb 2013 • Updated: 09 May 2013 | 19 comments
This issue has been solved. See solution.

Master: RHEL 5.6, NetBackup 7.1.0.4

Client: Windows Server 2008 R2 SP1, NetBackup 7.1.0.4

The scheduled backup job ignores the exclusion list and fails with status 84 after backing up exactly 50,000,128 KB, but restarting the same failed job honors the exclusion list and completes with status 0.


RamNagalla's picture

Hi,

Please post the exclude list and the client's bpbkar log at verbose level 5.

Please also post the detailed status of the failed job.

Error code 84 indicates a problem with the destination (tape or disk) or the media server.

Which storage device are you using?

TimWillingham's picture

bpbkar log attached

Exclude list:

$RECYCLE.BIN
*.ldf
*.mdf
*.ndf
C:\Program Files\VERITAS\NetBackup\bin\*.lock
C:\Program Files\VERITAS\NetBackup\bin\bprd.d\*.lock
C:\Program Files\VERITAS\NetBackup\bin\bpsched.d\*.lock
C:\Program Files\VERITAS\Volmgr\misc\*
E:\temp
Documents and Settings\All Users\Application Data\Microsoft\Network\Downloader\qmgr0.dat
Documents and Settings\All Users\Application Data\Microsoft\Network\Downloader\qmgr1.dat
Local Settings\Temp
Microsoft Operations Manager
Symantec AntiVirus\SAVRT
System Center Operations Manager 2007\Health Service State\Health Service Store
System32\Perflib_Perfdata*.dat
pagefile.sys
CLASS:gr_qpw_sql_nt
*Hist*.BAK

Detailed status:

02/13/2013 19:00:30 - Info nbjm (pid=1514) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=737948, request id:{ED769632-7641-11E2-B583-38CEB073C8D1})
02/13/2013 19:00:30 - requesting resource gaqpwfp01_PureDiskATL
02/13/2013 19:00:30 - requesting resource gaxgptb07xs.NBU_CLIENT.MAXJOBS.gaqpwpe04
02/13/2013 19:00:30 - requesting resource gaxgptb07xs.NBU_POLICY.MAXJOBS.gr_qpw_sql_nt
02/13/2013 19:00:30 - granted resource  gaxgptb07xs.NBU_CLIENT.MAXJOBS.gaqpwpe04
02/13/2013 19:00:30 - granted resource  gaxgptb07xs.NBU_POLICY.MAXJOBS.gr_qpw_sql_nt
02/13/2013 19:00:30 - granted resource  MediaID=@aaaad;DiskVolume=PureDiskVolume;DiskPool=PureDiskATL;Path=PureDiskVolume;StorageServer=gaxgppd01;MediaServer=gaqpwfp01
02/13/2013 19:00:30 - granted resource  gaqpwfp01_PureDiskATL
02/13/2013 19:00:40 - estimated 6226878 kbytes needed
02/13/2013 19:00:40 - Info nbjm (pid=1514) started backup job for client gaqpwpe04, policy gr_qpw_sql_nt, schedule frq-inc on storage unit gaqpwfp01_PureDiskATL
02/13/2013 19:00:41 - started process bpbrm (pid=932)
02/13/2013 19:00:45 - Info bpbrm (pid=932) gaqpwpe04 is the host to backup data from
02/13/2013 19:00:45 - Info bpbrm (pid=932) reading file list from client
02/13/2013 19:00:46 - connecting
02/13/2013 19:00:48 - Info bpbrm (pid=932) starting bpbkar32 on client
02/13/2013 19:00:48 - connected; connect time: 0:00:00
02/13/2013 19:00:51 - Info bpbkar32 (pid=6988) Backup started
02/13/2013 19:00:51 - Info bptm (pid=5600) start
02/13/2013 19:00:52 - Info bptm (pid=5600) using 262144 data buffer size
02/13/2013 19:00:52 - Info bptm (pid=5600) setting receive network buffer to 1049600 bytes
02/13/2013 19:00:52 - Info bptm (pid=5600) using 30 data buffers
02/13/2013 19:00:53 - Info bptm (pid=5600) start backup
02/13/2013 19:00:55 - Info bptm (pid=5600) backup child process is pid 6120.2460
02/13/2013 19:00:55 - Info bptm (pid=6120) start
02/13/2013 19:00:55 - begin writing
02/13/2013 20:21:19 - Error bptm (pid=5600) cannot write image to disk, media close failed with status 2060019
02/13/2013 20:21:20 - Critical bptm (pid=5600) sts_delete_image of image gaqpwpe04_1360803640_C1_HDR failed: error 2060018 file not found
02/13/2013 20:21:20 - Critical bptm (pid=5600) image delete failed: error 2060018: file not found
02/13/2013 20:21:24 - Info bptm (pid=5600) EXITING with status 84 <----------
02/13/2013 20:21:24 - Info gaqpwfp01 (pid=5600) StorageServer=PureDisk:gaxgppd01; Report=PDDO Stats for (gaxgppd01): scanned: 50000130 KB, stream rate: 11.69 MB/sec, CR sent: 0 KB, dedup: 100.0%
02/13/2013 20:21:37 - end writing; write time: 1:20:42
02/13/2013 20:31:37 - Info nbjm (pid=1514) starting backup job (jobid=737948) for client gaqpwpe04, policy gr_qpw_sql_nt, schedule frq-inc
02/13/2013 20:31:41 - Info bpbrm (pid=3144) gaqpwpe04 is the host to backup data from
02/13/2013 20:31:42 - Info bpbrm (pid=3144) reading file list from client
media write error  (84)
 

Target is PDDO storage pool (PD 663a + v5 bundle)
 

RamNagalla's picture

02/13/2013 20:21:19 - Error bptm (pid=5600) cannot write image to disk, media close failed with status 2060019
02/13/2013 20:21:20 - Critical bptm (pid=5600) sts_delete_image of image gaqpwpe04_1360803640_C1_HDR failed: error 2060018 file not found
02/13/2013 20:21:20 - Critical bptm (pid=5600) image delete failed: error 2060018: file not found
02/13/2013 20:21:24 - Info bptm (pid=5600) EXITING with status 84

The failure is not caused by the exclude list; it is related to the storage.

See the related technote below for the error you are getting, and check whether it helps:

http://www.symantec.com/business/support/index?pag...

If it does not help, please provide the bptm log from the media server.

TimWillingham's picture

Our STUs are set to 5 concurrent jobs, but we have 24 STUs writing to the same pool.  Would the aggregate concurrent jobs need to be reduced?

And the maximum concurrent jobs on the storage pool is 90.

RamNagalla's picture

24 * 5 adds up to 120 jobs.

Try reducing the maximum I/O streams on the pool and see whether it helps.

Reduce in batches: first cut 10 streams; if that does not help, cut another 10, and so on, until you find the number the pool can handle.

If that number turns out to be too low, we will need to fine-tune further.
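
To make the arithmetic explicit, here is a minimal Python sketch using only the figures quoted in this thread (24 STUs, 5 concurrent jobs each, a pool limit of 90). The function and variable names are illustrative and are not part of NetBackup.

```python
# Figures quoted in this thread; the names are illustrative only.
STORAGE_UNITS = 24       # STUs writing to the same pool
JOBS_PER_STU = 5         # maximum concurrent jobs configured on each STU
POOL_MAX_JOBS = 90       # maximum concurrent jobs configured on the pool


def oversubscription(stus: int, per_stu: int, pool_limit: int) -> int:
    """Return how many jobs the STUs could start beyond the pool limit (0 if none)."""
    return max(0, stus * per_stu - pool_limit)


print(STORAGE_UNITS * JOBS_PER_STU)                                   # 120
print(oversubscription(STORAGE_UNITS, JOBS_PER_STU, POOL_MAX_JOBS))   # 30
```

With these numbers the storage units can collectively request 30 more jobs than the pool allows, which is why the aggregate matters and not just the per-STU setting.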

Will Restore's picture

technote which Nagalla linked above says

Solution

Reduced the max concurrent jobs on the storage units (changed from 50 to 30).

Will Restore -- where there is a Will there is a way

RamNagalla's picture

Go ahead and reduce it, and also check whether the WR T/N below relates to your issue.

TimWillingham's picture

Reduced it to 80.  We'll see if things improve.  This is a PD 663a storage pool.  Is there a command or log we can examine to determine if this is the problem?

Thanks!

Will Restore's picture

Your bpbkar log shows

OS Error: 10054 (An existing connection was forcibly closed by the remote host.)

per this technote, Article URL http://www.symantec.com/docs/TECH137012

Cause

Found Chimney Offload and Autotune enabled on the Media server and client. These features are found to cause unexpected socket drops during network load for backups. Disable these features on the servers involved to resolve the problem.

Solution

- Run this command to check if Chimney Offload and/or Autotune are enabled
netsh int tcp show global

- Disable the features by running these commands
netsh int tcp set global autotuning=disabled
netsh int tcp set global chimney=disabled

TimWillingham's picture

It's Server 2003, so the command is slightly different.  2 NICs, but the "Southernco Domain" is the one of interest.

netsh int ip show offload

Offload Options for interface "Bently Device Network" with index: 10003:
TCP Large Send
TCP Chimney Offload.

Offload Options for interface "Southernco Domain" with index: 10004:

Mark_Solutions's picture

OK, A couple of things here ..

At the bottom of your exclude list you have:

CLASS:gr_qpw_sql_nt
*Hist*.BAK

The job log says it is running the policy named in that CLASS line, so the ONLY exclusion it will use for that job is *Hist*.BAK. It will ignore all of the others unless you copy them into the policy-specific exclusion list as well.

02/13/2013 19:00:30 - requesting resource gaxgptb07xs.NBU_POLICY.MAXJOBS.gr_qpw_sql_nt
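
The behaviour described above can be sketched in Python. This is only a simplified model of what this post describes, not NetBackup's actual parser: the function name `effective_excludes` is hypothetical, and the list below is a shortened copy of the exclude list posted earlier in the thread.

```python
def effective_excludes(entries, policy):
    """Model the behaviour described in this post.

    If the list contains a CLASS:<policy> section matching the running
    policy, only that section's entries apply; otherwise the entries
    listed before any CLASS: line apply.
    """
    sections = {None: []}   # None holds the general (all-policies) entries
    current = None
    for raw in entries:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("CLASS:"):
            current = line[len("CLASS:"):]
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections.get(policy, sections[None])


# Shortened copy of the exclude list from this thread.
exclude_list = [
    "$RECYCLE.BIN", "*.ldf", "*.mdf", "*.ndf", "E:\\temp", "pagefile.sys",
    "CLASS:gr_qpw_sql_nt",
    "*Hist*.BAK",
]

print(effective_excludes(exclude_list, "gr_qpw_sql_nt"))   # ['*Hist*.BAK']
print(effective_excludes(exclude_list, "another_policy"))  # the six general entries
```

Under this model, a job for gr_qpw_sql_nt never sees *.mdf, *.ldf or *.ndf, which matches the symptom of the scheduled job backing up files the general list was supposed to exclude.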

Next ... this could be a timeout issue causing the media failure. When the system is very busy it does not get acknowledgment from the Media Servers, so it considers the sessions broken.

Try adding this file (it has no extension and is case sensitive) to the Master and all Media Servers, then open it in Notepad and add a value of 800 to it:

/usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO (unix)

<install path>\veritas\netbackup\db\config\DPS_PROXYDEFAULTRECVTMO (Windows)

This does need a NetBackup Service re-start to register it on each server but it can really help with this type of error
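
As a rough illustration, the file could also be created with a few lines of Python instead of by hand. The helper name `write_timeout_file` is hypothetical, and writing the value followed by a newline is an assumption; the demonstration uses a scratch directory so it can be run safely anywhere.

```python
import tempfile
from pathlib import Path

# Exact name from the post: upper case, no extension.
FILE_NAME = "DPS_PROXYDEFAULTRECVTMO"


def write_timeout_file(config_dir, value=800):
    """Create or overwrite the file in config_dir so it holds a single value."""
    path = Path(config_dir) / FILE_NAME
    path.write_text(f"{value}\n")
    return path


# Demonstration against a scratch directory. On a real UNIX server the
# directory named in the post is /usr/openv/netbackup/db/config.
with tempfile.TemporaryDirectory() as scratch:
    created = write_timeout_file(scratch)
    print(created.name, created.read_text().strip())
```

The same helper would need to be pointed at the config directory on the Master and on each Media Server, followed by the service restart mentioned above.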

Hope this helps

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someone's advice has solved your issue - and please bring back the Thumbs Up!!

SOLUTION
TimWillingham's picture

I have copied the appropriate exclusions to the sql policy.  We'll see how it goes tonight.

The master server already has the DPS_PROXYDEFAULTRECVTMO file with a value of 1800.  Should I reduce it or leave it as is?

The media server did not have the file, so I created it with a value of 800.

Thanks for your help!

Mark_Solutions's picture

Reduce it to 800 - should get a better result and more stable communications during busy periods

This is for all servers (Master and Media Servers)

TimWillingham's picture

Your solution did resolve the problem I was having with excludes.

I am still having the timeout problem, even with the new file in place and after restarting NBU on the master and media servers.

Thanks for your help.

Mark_Solutions's picture

That's good news on the excludes! Glad that one is sorted!

The only other thing to check, assuming the systems are actually healthy, is other kinds of timeouts (keep alive and port usage).

So you can look at the keep-alive settings on the Master, Media and Client, and at port usage on the client. Here are some suggestions:

Windows Client:

Port usage and keep-alive: these values need to be created, and a reboot is needed for them to take effect.
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\
DWORD - TcpTimedWaitDelay - Decimal value of 30
DWORD - MaxUserPort - Decimal value of 65534
DWORD - KeepAliveTime - Decimal value of 510000
DWORD - KeepAliveInterval - Decimal value of 3

Unix:

Check the settings first:

# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

Recommended settings:

# echo 510 > /proc/sys/net/ipv4/tcp_keepalive_time
# echo 3 > /proc/sys/net/ipv4/tcp_keepalive_intvl
# echo 3 > /proc/sys/net/ipv4/tcp_keepalive_probes

The changes can be made persistent with an addition such as the following to /etc/sysctl.conf:

## Keepalive at 8.5 minutes

# start probing for heartbeat after 8.5 idle minutes (default 7200 sec)
net.ipv4.tcp_keepalive_time=510

# close connection after 3 unanswered probes (default 9)
net.ipv4.tcp_keepalive_probes=3

# wait 3 seconds for response to each probe (default 75)
net.ipv4.tcp_keepalive_intvl=3

Then run : chkconfig boot.sysctl on

These don’t need a restart to take effect.
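
To see what the Linux settings above change in practice, here is a small Python sketch. It assumes the usual Linux keepalive behaviour: probing starts after the idle time, probes are sent one interval apart, and the connection is dropped once the configured number of probes goes unanswered. The function name is illustrative.

```python
def dead_peer_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Approximate seconds before an idle connection to a dead peer is dropped."""
    return keepalive_time + keepalive_intvl * keepalive_probes


# Linux defaults shown in the "check the settings" output above.
print(dead_peer_detection_seconds(7200, 75, 9))   # 7875

# Values recommended in this post.
print(dead_peer_detection_seconds(510, 3, 3))     # 519
```

The more important effect for long backups is the first term: keepalive packets start flowing after 8.5 idle minutes instead of two hours, which helps stop firewalls or other devices from silently discarding an idle session.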

See if these help

TimWillingham's picture

One thing I forgot to mention is that we have a single file over 50GB that we can't get backed up on the client.  I can manually copy the file to the media server and back it up from there with no problem.

Mark_Solutions's picture

That could well tie in with the keep alive setting if that is the case, depending on how long it takes to copy it across
