Tuning Windows 2003 and 2008 for Symantec NetBackup
Symantec NetBackup is an enterprise-class data protection product with a large portfolio of features and broad platform and application support. Because it is used in heterogeneous environments, however, the out-of-the-box operating system and NetBackup settings are not always sufficient. NetBackup supports many platforms for running the master server, media servers, and clients, including Windows 2003 and Windows 2008 (and 2008 R2 as of version 7). Unix and Linux platforms almost always seem to cope better with the I/O load, whereas the Windows platform is not equally well suited.
Fortunately, there are many tuning parameters available in the NT kernel, and some are relevant to NetBackup. This article will cover a few of these parameters and settings that I have discovered over the years working with NetBackup. Microsoft decided, for some reason, not to use good default values for high I/O load, although with Windows 2008, a lot more parameters are auto-tuned for this type of load. This article covers Windows 2003 as well as Windows 2008, and differences in tuning will be pointed out.
For a backup product, the I/O paths are the primary concern: data must be moved as efficiently as possible to minimize the strain on the infrastructure and keep backup windows short. What not everyone considers is that the data must also be restored efficiently, so fast I/O from tape and disk is just as important.
Note: All numeric values given for Windows registry parameters are in decimal, not the hexadecimal that the registry editor uses by default.
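Because the registry editor and exported .reg files show DWORD values in hexadecimal, it can help to convert the decimal values before comparing them with what is on a server. The following is a minimal sketch; the helper name to_reg_dword is hypothetical, and the sample values are decimal values that appear in this article.

```python
def to_reg_dword(value: int) -> str:
    """Format a decimal value as a .reg-file DWORD literal (8 hex digits)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a DWORD holds values from 0 to 4294967295")
    return f"dword:{value:08x}"


# Decimal values used in this article, shown next to their hexadecimal form.
for decimal in (1, 65535, 900000):
    print(decimal, "->", to_reg_dword(decimal))
```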
For an application such as NetBackup, a key factor for success is to design the I/O paths so that the server’s backplane can deliver maximum throughput. Typically, data enters through the network interfaces, passes through the CPU, and is then sent on to the tape drives or disks.
With regard to network interfaces, it is preferable to team multiple interfaces for incoming traffic. This requires configuring the network switch to allow IEEE 802.3ad link aggregation. It is important that the switch distributes incoming packets across the links so that the bandwidth is fully utilized.
Ordinary host-based teaming usually supports only failover and outbound load balancing. For NetBackup, outbound load balancing is seldom useful, except when vaulting between media servers or when using a network-based disk pool appliance such as PureDisk, NetApp, or DataDomain.
With regard to HBAs for SAN connectivity, disk I/O and tape I/O should be kept separate. Tape I/O is synchronous and can severely impact disk I/O. Also, use several HBA ports to distribute the traffic to the tape drives. For example, a 4 Gbit HBA port can serve up to four LTO3 drives, but real-world experience shows that a maximum of two drives per port works better because of I/O interrupt handling and other hardware and kernel constraints. If possible, use several single-port HBAs rather than one multi-port HBA, and distribute them over the available I/O slots in the server. This typically improves the balance of I/O across the backplane, CPU, and memory.
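The two-drives-per-port rule of thumb translates directly into a port count when sizing a media server. A minimal sketch, assuming tape and disk traffic use separate ports; the function name hba_ports_needed is hypothetical.

```python
import math


def hba_ports_needed(tape_drives: int, drives_per_port: int = 2) -> int:
    """Number of HBA ports to dedicate to tape for a given drive count."""
    if tape_drives < 0 or drives_per_port < 1:
        raise ValueError("drive counts must be non-negative, drives per port at least 1")
    return math.ceil(tape_drives / drives_per_port)


# Eight LTO3 drives: practical rule (2 per port) versus the nominal limit (4 per port).
print(hba_ports_needed(8), hba_ports_needed(8, drives_per_port=4))
```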
It is of the utmost importance to configure every HBA with persistent bindings so as not to “confuse” the Windows kernel if a path disappears and later becomes active again, or if the server is rebooted. In many cases the kernel will otherwise allocate a new internal path name, making the path that NetBackup is using non-functional. The symptom in NetBackup is MISSING PATH in the Device Monitor. Each HBA vendor has its own tools for configuring HBA settings, so refer to the respective vendor’s documentation and tools when configuring persistent bindings.
Windows 2003 R2 with all applicable updates is preferred. Additional software for SAN connectivity is required when disk and/or tape drives are SAN attached, and should be the latest recommended versions from the respective vendors.
Anti-virus software is common on Windows servers, and to maintain performance it must be configured to exclude the NetBackup processes and directories. For instance, see tech note 295599 (Symantec, 2008b) for information on excluding directories and processes in McAfee.
Many services start automatically on any Windows server. Some can safely be left running, but one service in particular should always be disabled: the Removable Storage service.
This service tends to interfere with NetBackup’s device management and should be disabled in the Services section of the Computer Management Console. Follow the instructions in tech note 245559 (Symantec, 2003).
As a consequence of disabling the Removable Storage service, the system may log DCOM errors in the system event log. These errors are harmless, and a workaround is presented in tech note 240378 (Symantec, 2008a).
Another consequence is that when the NetBackup servers themselves are backed up, the bpbkar process logs an error because the RSM is not running. This can be solved by excluding the <system_drive>:\WINDOWS\system32\ntmsdata directory on those servers. Please see tech note 247001 (Symantec, 2004) for more information.
Always strive to use the latest supported combination of device drivers and firmware for network adapters and HBAs. Today, most drivers come with default settings that are close to optimal. We can, however, further configure the OS kernel in order to remove some overhead and avoid some issues.
Disable device driver verification
By default, Windows 2003 and 2008 randomly test device drivers. Disabling this improves performance, as we do not want the kernel to spend time on debugging tests of drivers that we know are working fine. This parameter is documented for Windows 2003 and 2008, but no conclusive evidence has been found yet for Windows 2008 R2.
HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\DontVerifyRandomDrivers: 1
Disable Test Unit Ready
When using tape libraries and tape drives on a SAN where the tape drives are shared among several media servers, it is highly recommended to disable the Test Unit Ready (TUR) functionality for the tape device drivers. Follow the procedure documented by Microsoft (2009a). The impact is primarily seen where NetBackup is configured with the Shared Storage Option (SSO) for tape drives, as any Windows-based media server may send SCSI commands to the drives to check whether they are ready. In SSO configurations, a tape drive may very well be in use by another host, and SCSI commands sent from a different server interfere with it, so that backup and restore operations experience problems such as slow performance or even failures.
It is important to size the virtual memory swap file properly prior to installing NetBackup. A general recommendation is to have a swap file at least twice the size of physical memory. It must be preset to that size and not left to auto-extend.
The reason is that when a swap file has to be extended automatically, the I/O operation in memory is denied and a failure is reported in NetBackup (most likely a status 81 for the jobs). This in turn effectively aborts the backup job on the media server. This is the behavior of the Windows operating system, and it can only be avoided by pre-sizing the swap file.
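The sizing rule above can be sketched as a small calculation. The function name swap_file_settings_mb is hypothetical; the factor of two comes from the recommendation above, and setting the initial and maximum size to the same value is what prevents auto-extension.

```python
from typing import Tuple


def swap_file_settings_mb(physical_memory_gb: int, factor: int = 2) -> Tuple[int, int]:
    """Return (initial_mb, maximum_mb) for a fixed-size swap file."""
    if physical_memory_gb <= 0 or factor < 1:
        raise ValueError("memory size and factor must be positive")
    size_mb = physical_memory_gb * 1024 * factor
    # Identical initial and maximum sizes mean the file never has to grow.
    return size_mb, size_mb


initial_mb, maximum_mb = swap_file_settings_mb(16)
print(f"initial={initial_mb} MB, maximum={maximum_mb} MB")
```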
By default the Windows 2003 operating system is optimized for file services, and thus will prioritize the file system cache in memory. For Media servers sending data directly to tape, NAS device, or other OpenStorage devices it may be better to optimize the kernel for applications instead. Media servers having Disk Storage Units (DSU) of Basic or Enterprise type may be better off with the default setting though, in order to have a file cache.
Two registry variables are of interest when tuning the file system cache: Size and LargeSystemCache.
The default for the Size variable is 3, which maximizes throughput for file sharing as well as for network applications in general.
The LargeSystemCache variable should be set to 0 in order to minimize the file system cache and thus allow more memory for network applications. On servers with plenty of memory, say 8GB or more, the settings may very well be left unchanged.
Disabling Last accessed
The NTFS file system records the last access time for each file and directory, which adds to the I/O operations required when accessing files. An access is defined as any type of operation, such as listing a directory, or reading, writing, or otherwise updating a file or directory.
If the last access information is not required by company or audit policies, the NetBackup master server can benefit from disabling it. As the catalog database consists of many thousands, if not millions, of files, having the kernel update the time stamp on every file access adds overhead.
The variable does not exist by default and has to be added with the DWORD type. Set the value to 1 in order to disable last access time stamping.
Disabling 8.3 file names
The NTFS file system keeps a short (8.3) name for every file in order to maintain compatibility with older operating systems. This is not required on a NetBackup master server, and disabling it decreases the number of I/O operations needed per file creation. Note that once short names are disabled, 16-bit applications must not be run on the master server.
There are a number of TCP parameters that can be tuned in order to accommodate typical NetBackup I/O. The I/O pattern for a Windows server is normally not a sustained data transfer, but rather short bursts of I/O.
TCP keepalive time
There may be a delay before a NetBackup master server detects that its connection to a media server has been lost. For example, if a media server goes down while running a backup, the master server may not notice for some time that the media server is no longer available. While at first it may appear that there is a problem with the NetBackup master server, the delay is actually the result of a TCP/IP configuration parameter called KeepAliveTime, which is set to 7200000 (two hours, in milliseconds) by default. Decrease the value to 900000 (15 minutes).
The effect of this delay is that NetBackup jobs running on that media server appear to be active for a period of time after the connection to the media server has gone down. In some cases this can result in an undesirable delay before the current backup job fails and is subjected to the normal NetBackup retry logic for execution on a different media server, if one is available.
Another scenario where a low timeout is important is when a firewall is in the I/O path. Typically this is the case in secure networks or when backing up servers in a DMZ or an otherwise untrusted network.
Firewalls typically drop a session if no traffic occurs for a set time. NetBackup does not respond very well to this, and the jobs fail. This usually happens during incremental backups, as it can take a very long time before the client sends data to the media server. Setting KeepAliveTime to a value lower than the firewall's timeout solves this problem.
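The choice of KeepAliveTime can be sketched as follows: use the 15 minutes recommended above unless a firewall with a shorter idle timeout sits in the path, in which case stay below that timeout. The function name and the 60-second safety margin are assumptions made for illustration, not values from NetBackup or Microsoft documentation.

```python
from typing import Optional

DEFAULT_KEEPALIVE_MS = 7_200_000      # Windows default: two hours
RECOMMENDED_KEEPALIVE_MS = 900_000    # value recommended above: 15 minutes


def keepalive_time_ms(firewall_idle_timeout_s: Optional[int] = None,
                      margin_s: int = 60) -> int:
    """Pick a KeepAliveTime (milliseconds) below the firewall idle timeout."""
    if firewall_idle_timeout_s is None:
        return RECOMMENDED_KEEPALIVE_MS
    if firewall_idle_timeout_s <= margin_s:
        raise ValueError("firewall timeout is shorter than the safety margin")
    below_firewall_ms = (firewall_idle_timeout_s - margin_s) * 1000
    return min(RECOMMENDED_KEEPALIVE_MS, below_firewall_ms)


# A hypothetical firewall that drops idle sessions after 10 minutes.
print(keepalive_time_ms(600))
```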
TCPWindowSize and Window Scaling
In Windows 2003, TCPWindowSize should be set to the maximum value of 65535 for gigabit network interfaces.
Windows 2008 (and R2): this parameter is obsolete and disregarded by the kernel.
For Windows 2003, it may also be useful to enable TCP window scaling in order to allow windows larger than 64 KB. Tuning this may not actually be necessary, and only testing will show whether it improves the I/O throughput. Windows supports the RFC 1323 window scaling option.
With window scaling enabled, the TCPWindowSize variable can be set to a value of up to 1 GB. Once the variables are set and the system has been rebooted, the TCP/IP stack will support large windows.
Windows 2008/2008 R2: As TCPWindowSize is deprecated in Windows 2008 (and R2), the same holds true for Tcp1323Opts.
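Whether window scaling is worth trying on Windows 2003 can be estimated from the bandwidth-delay product: if more than 65535 bytes must be in flight to keep the link full, an unscaled window becomes the limit. The sketch below assumes a hypothetical 1 ms round-trip time and uses the RFC 1323 maximum shift count of 14; the function names are illustrative only.

```python
MAX_UNSCALED_WINDOW = 65535   # largest window without scaling (16-bit field)
MAX_SCALE_SHIFT = 14          # largest shift count allowed by RFC 1323


def bandwidth_delay_product(bits_per_second: int, rtt_ms: int) -> int:
    """Bytes in flight needed to keep the link full for one round trip."""
    return bits_per_second * rtt_ms // 8000


def required_scale_shift(window_bytes: int) -> int:
    """Smallest RFC 1323 shift count whose scaled window covers window_bytes."""
    for shift in range(MAX_SCALE_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= window_bytes:
            return shift
    raise ValueError("window exceeds the RFC 1323 maximum of about 1 GB")


bdp = bandwidth_delay_product(1_000_000_000, 1)   # gigabit link, 1 ms round trip
print("bytes in flight:", bdp, "shift needed:", required_scale_shift(bdp))
print("largest scaled window:", MAX_UNSCALED_WINDOW << MAX_SCALE_SHIFT)
```

A shift count of 0 means the plain 65535-byte window is enough for that link and latency.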
On media servers with many concurrent connections, for example with high multiplexing and many concurrent sessions to disk, it may be useful to set the variable to a higher value than the default. The default is calculated as 128 * CPUs^2. The maximum value is 65535 (DWORD).
Windows 2008/2008 R2: this parameter is obsolete and disregarded by the kernel.
By default this variable is calculated as CPUs^2. This may not be the best setting for servers with 8 or more CPUs. For most large servers it is better to use a value equal to 4 x CPUs.
Windows 2008/2008 R2: this parameter is obsolete and disregarded by the kernel.
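The two CPU-based formulas quoted above are easy to tabulate for a given Windows 2003 server. This is only a sketch of the arithmetic; the function names are hypothetical and do not correspond to registry value names.

```python
MAX_VALUE = 65535  # maximum stated above for the first variable


def default_first_variable(cpus: int) -> int:
    """Default of the first variable: 128 * CPUs^2."""
    return 128 * cpus ** 2


def default_second_variable(cpus: int) -> int:
    """Default of the second variable: CPUs^2."""
    return cpus ** 2


def suggested_second_variable(cpus: int) -> int:
    """Value suggested above for large servers: 4 x CPUs."""
    return 4 * cpus


for cpus in (2, 4, 8, 16):
    print(cpus, default_first_variable(cpus),
          default_second_variable(cpus), suggested_second_variable(cpus))
```

For 8 CPUs and more, the CPUs^2 default grows much faster than 4 x CPUs, which is the case the recommendation above addresses.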
The default number of ports per IP address is only 5000. In a large NetBackup domain this may not be sufficient to allow a large number of concurrent connections between the master server, media servers, and clients. The variable is really only useful on master and media servers, unless the client is heavily loaded as well, for example when it also serves as a web or database server.
Windows 2003 supports up to 65534 concurrent ports per IP address. The variable does not exist by default and must be created manually. The first 1024 ports are reserved, so it makes little sense to set it to the maximum value. If a host has more than 60000 concurrent connections, we probably have other problems such as CPU and disk bottlenecks, but a value of 60000 at least leaves ample room.
In Windows 2008, including Windows 2008 R2, the way of setting this has changed, and the netsh command is used to configure the start port and the size of the range. By default, the start port is 49152 and the end port is 65535, which gives 16384 usable dynamic ports. If the NetBackup environment is very large, the available range may still have to be tuned. The following commands provide 50000 dynamic ports, starting at port 10000:
netsh int ipv4 set dynamicport tcp start=10000 num=50000
netsh int ipv4 set dynamicport udp start=10000 num=50000
netsh int ipv6 set dynamicport tcp start=10000 num=50000
netsh int ipv6 set dynamicport udp start=10000 num=50000
The UDP ports are just set to have the same range, but NetBackup does not really use UDP ports.
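Before running the netsh commands, the resulting range can be checked with a few lines of arithmetic. The sketch below assumes the limits Microsoft documents for the dynamic port range (start port of at least 1025, at least 255 ports, end port no higher than 65535); the function name is hypothetical.

```python
from typing import Tuple


def dynamic_port_range(start: int, num: int) -> Tuple[int, int]:
    """Return (first_port, last_port) for a netsh start/num pair."""
    if start < 1025:
        raise ValueError("start port must be 1025 or higher")
    if num < 255:
        raise ValueError("the range must contain at least 255 ports")
    end = start + num - 1
    if end > 65535:
        raise ValueError("the range would extend past port 65535")
    return start, end


print(dynamic_port_range(49152, 16384))   # Windows 2008 default range
print(dynamic_port_range(10000, 50000))   # range set by the commands above
```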
By default, the Windows operating system does not optimize the kernel settings for many concurrent threads. When the OS is started the kernel allocates structures for the kernel worker threads which will carry out the actual work that the running processes require, such as device driver I/O, the kernel itself and other internal components.
NetBackup puts a very high load on the master and media servers, as many processes are started on them for each active job. Typically, the master server is maxed out with the default kernel thread settings when the domain reaches approximately 300 clients.
We could spread the backup window for the clients, but that may not always be possible due to other constraints. What we can do is to allocate the maximum possible kernel threads, so that the kernel can serve as many processes as possible at any time.
We are interested in three variables covering kernel threads: DefaultNumberofWorkerThreads, AdditionalDelayedWorkerThreads, and AdditionalCriticalWorkerThreads.
DefaultNumberofWorkerThreads controls the number of threads allocated for each work queue in the kernel. Note: Allocating too many threads may use more system resources than is optimal.
Delayed worker threads are used for work that is not real-time or otherwise time-critical. Memory for these threads may be swapped out from CPU cache and memory while they are queued.
Critical worker threads serve time-critical processes; they have high priority, and their memory pages must stay in CPU cache or memory.
All three variables use DWORD as type. The AdditionalDelayedWorkerThreads and AdditionalCriticalWorkerThreads variables should already exist, but the RpcXdr\Parameters\DefaultNumberofWorkerThreads path and variable will have to be created.
The AdditionalDelayedWorkerThreads and AdditionalCriticalWorkerThreads variables should be set to a value of 16, and DefaultNumberofWorkerThreads to 64.
On media servers with many CPUs it can benefit the I/O throughput to control which CPUs handle network I/O and which CPUs handle tape or disk I/O. By controlling this we can tell the OS kernel thread scheduler not to do unnecessary context switches, but to let the various I/O threads stay on their respective CPUs. Context switching and memory page faults are very expensive in high I/O load applications such as NetBackup.
The CPU affinity can be configured by using the Interrupt Filter Configuration Tool (intfiltr.exe) available in the Windows 2003 Resource Kit Tools.
NOTE: Use great care with this tool, and work from the physical console. The tool lists the devices present in the system. Select a network device and add it to the interrupt filter. It may be necessary to select “Don’t Restart Device when Making Changes” prior to adding the device to the filter in order to avoid a service interruption or a system crash.
Once the device is present in the filter, the CPU mask can be set by clicking the “Set Mask” button in the “Interrupt Affinity Mask” box.
NOTE: Some devices may not work with the affinity setting. A reboot may be necessary, and if the device still does not work after a reboot, removal of the filter is required, and no CPU affinity can be used for that device.
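An affinity mask is a bit field in which bit n stands for CPU n, so the masks for network and storage devices can be worked out before opening the tool. A minimal sketch; the split of CPUs between network and HBA interrupts is a hypothetical example, and the function name is illustrative.

```python
from typing import Iterable


def affinity_mask(cpus: Iterable[int]) -> int:
    """Build a CPU affinity bit mask: bit n is set when CPU n is allowed."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers start at 0")
        mask |= 1 << cpu
    return mask


network_mask = affinity_mask([0, 1])   # network adapters on CPUs 0 and 1
storage_mask = affinity_mask([2, 3])   # HBAs on CPUs 2 and 3
print(hex(network_mask), hex(storage_mask))

# The two groups should not share a CPU if the aim is to keep them apart.
assert network_mask & storage_mask == 0
```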
On Windows 2008 R2, the kernel provides better control of resources using the NUMA (non-uniform memory access) architecture. Applications that demand high performance can be written so that their threads are distributed over several cores or kept on one CPU. In general, applying the principle of locality generates fewer context switches on the CPUs.
In Windows 2008, the intfiltr.exe tool has been replaced by the IntPolicy tool (Microsoft, 2007).
Microsoft (2003) Performance Tuning Guidelines for Windows Server 2003. [Online]. Available from: http://download.microsoft.com/download/2/8/0/2800a... (Accessed: 22 July, 2010)
Microsoft (2007) Interrupt-Affinity Policy Tool. [Online]. Available from: http://www.microsoft.com/whdc/system/sysperf/IntPo... (Accessed: 2 August, 2010)
Microsoft (2009a) Windows Server 2003 cannot perform backup jobs to tape devices on a storage area network. [Online]. Available from: http://support.microsoft.com/kb/842411 (Accessed: 22 July, 2010)
Microsoft (2009b) Performance Tuning Guidelines for Windows Server 2008 R2. [Online]. Available from: http://www.microsoft.com/whdc/system/sysperf/Perf_... (Accessed: 21 July, 2010)
Symantec (2003) How to disable the Removable Storage Manager service to avoid conflict with VERITAS NetBackup. [Online]. Available from: http://seer.entsupport.symantec.com/docs/245559.htm
Symantec (2004) Problems report showing Removable Storage Management Win32 1058 error. [Online]. Available from: http://seer.entsupport.symantec.com/docs/247001.htm
Symantec (2008a) GENERAL ERROR: After disabling Removable Storage Management (RSM) services on Windows 2000 and 2003, the system event viewer log reports Evt ID: 10005. NtmsSvc DCOM errors. [Online]. Available from: http://seer.entsupport.symantec.com/docs/240378.htm
Symantec (2008b) 3RD PARTY: NetBackup Services are randomly shutting down on Windows servers. [Online]. Available from: http://seer.entsupport.symantec.com/docs/295599.htm
Symantec (2010a) Symantec NetBackup™ Backup Planning and Performance Tuning Guide - UNIX, Windows, and Linux - Release 6.5. [Online]. Available from: ftp://exftpp.symantec.com/pub/support/products/Net... (Accessed: 21 July, 2010)