A potential for data loss has been discovered in NetBackup PureDisk 6.5 and later when the storage cache becomes incomplete due to an excessive amount of data segments on the system.

Article:TECH69065  |  Created: 2009-01-24  |  Updated: 2010-01-30  |  Article URL http://www.symantec.com/docs/TECH69065
Article Type
Technical Solution

Issue



A potential for data loss has been discovered in NetBackup PureDisk 6.5 and later when the storage cache becomes incomplete due to an excessive amount of data segments on the system.

Solution



Introduction:
The PureDisk Content Router (CR) uses an internal memory cache to hold data signatures for performance purposes.  In some instances, the system may have more signatures than the cache can hold.  This condition causes the cache to become incomplete and can lead to data loss in certain cases where the "incomplete flag" is not properly set.

Because this issue occurs only under specific conditions, it was discovered only recently. However, it is present in PureDisk Remote Office Edition 6.5 through 6.5.1.1.


What is Affected:
At the time of this writing, this problem is only known to occur with the following PureDisk versions:
- PureDisk Remote Office Edition 6.5
- PureDisk Remote Office Edition 6.5.0.1
- PureDisk Remote Office Edition 6.5.1
- PureDisk Remote Office Edition 6.5.1.1


How to Determine if Affected:
Data loss has been known to occur if BOTH of the following conditions have been met:
- One of the PureDisk versions mentioned above is being run
- Due to an excessive amount of data segments on the system, the max node signature count (as explained below) is exceeded.  

This has been observed on systems that have the 8GB of minimum required memory yet handle a volume of data whose signatures exceed the memory available for them.  Although it is far less likely and has not been observed at the time of this writing, this issue could also occur on Content Routers with more than 8GB of memory.


Verifying if the max node signature count is being exceeded:
1. When the PureDisk server is not busy with jobs, dump the spoold cache information by identifying the spoold process and sending it a HUP signal.
spasrvr:~ # ps -ef |grep spoold
root      7644     1 16 Mar24 ?        03:59:34 ./spoold

spasrvr:~ # kill -HUP 7644

2. Examine the log at the end of /Storage/log/spoold/spoold.log

March 25 16:42:03 INFO [47707600451712]: ===  Storage Cache Manager ===
March 25 16:42:03 INFO [47707600451712]: Cache load locking       : enabled, lock release at 200 nodes scanned
March 25 16:42:03 INFO [47707600451712]: Cache Garbage Collection : when free node count exceeds 500000 nodes (17.17MB)
March 25 16:42:03 INFO [47707600451712]: Lock Cache Pages         : Yes
March 25 16:42:03 INFO [47707600451712]: Cache Slab Size          : 1024 kb (36 bytes per node)
March 25 16:42:03 INFO [47707600451712]: Node replacement (random): max nodes: 174464740, allocs: 0, success: 0

In this example, the maximum number of signatures the cache can hold (max nodes) is 174464740.
If the allocs value is not zero, the cache has overflowed, which can cause data loss.  Below is an example line from a log where the cache has overflowed:

spoold.log.3:March 17 09:03:35 INFO [47602009149296]: Node replacement (random): max nodes: 116387496, allocs: 506776, success: 506776

March 25 16:42:03 INFO [47707600451712]: Node replacement tlogid  : 152736
March 25 16:42:03 INFO [47707600451712]: Hash Usage               : 4009241/4194304 (95.59%)
March 25 16:42:03 INFO [47707600451712]: Cache Usage              : 466.00 MB, 13091793 nodes, 480923 free, 0 hits, 1797873 miss
March 25 16:42:03 INFO [47707600451712]: Cache Completeness       : complete
March 25 16:42:03 INFO [47707600451712]: Cache MMU                : 477184 kb, 29127 nodes/slab, 1024 kb/slab
March 25 16:42:03 INFO [47707600451712]: Segment Object Usage     : 10473711 nodes, 0 hits, 1539993 miss
March 25 16:42:03 INFO [47707600451712]: Data Object Usage        : 7852518 nodes, 0 hits, 257880 miss

10473711 + 7852518 = 18326229 is the total number of signatures in the system at this point (the node counts from the Segment Object Usage and Data Object Usage lines).  If this number is close to or larger than the max nodes value (174464740 in this example), the system is exposed to possible data loss, and this should be addressed.

3. Also examine previous spoold log files, as this information is dumped periodically.  Note that checking these values only once does not indicate that the problem will not exist, or that it hasn't existed in the past.  
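
The allocs check from step 2 can be repeated across the current and older log files with a small filter. The following is only a sketch: the function name find_cache_overflow is hypothetical, and it assumes the "Node replacement (random)" line keeps the format shown in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: list "Node replacement (random)" lines whose
# allocs value is not zero, prefixed with the log file they came from.
find_cache_overflow() {
    awk '
        /Node replacement \(random\)/ && match($0, /allocs: *[0-9]+/) {
            # Keep only the digits of the matched "allocs: N" text.
            allocs = substr($0, RSTART, RLENGTH)
            gsub(/[^0-9]/, "", allocs)
            # A non-zero allocs value indicates a cache overflow.
            if (allocs + 0 != 0)
                print FILENAME ":" $0
        }
    ' "$@"
}
```

For example, find_cache_overflow /Storage/log/spoold/spoold.log* would scan the current log and any older log files in the same directory.  No output means that none of the dumps found in those files recorded a non-zero allocs value; as noted in step 3, this does not prove that the problem has never occurred.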

Example: If a Content Router is configured with the 8GB of minimum required memory, the cache can hold approximately 116 million signatures, which is often sufficient to store all signatures.  If this amount is exceeded, data loss can occur.
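
The comparison from step 2 (Segment Object Usage plus Data Object Usage against max nodes) can also be computed from a dump rather than by hand. This is a minimal sketch: the function name signature_usage and its output format are hypothetical, and it assumes the three log lines keep the format shown in the example dump. If the input contains several dumps, the last value seen for each field is used.

```shell
#!/bin/sh
# Hypothetical helper: report max nodes, the total signature count
# (Segment Object nodes + Data Object nodes) and the share of the
# cache capacity in use, from spoold log files or standard input.
signature_usage() {
    awk '
        /Node replacement \(random\)/ && match($0, /max nodes: *[0-9]+/) {
            max = substr($0, RSTART, RLENGTH); gsub(/[^0-9]/, "", max)
        }
        /Segment Object Usage/ && match($0, /: *[0-9]+ nodes/) {
            seg = substr($0, RSTART, RLENGTH); gsub(/[^0-9]/, "", seg)
        }
        /Data Object Usage/ && match($0, /: *[0-9]+ nodes/) {
            data = substr($0, RSTART, RLENGTH); gsub(/[^0-9]/, "", data)
        }
        END {
            # All three values are needed for the comparison.
            if (max == "" || seg == "" || data == "") {
                print "no complete cache dump found" > "/dev/stderr"
                exit 1
            }
            total = seg + data
            printf "max_nodes=%d signatures=%d usage=%.1f%%\n", max, total, 100 * total / max
        }
    ' "$@"
}
```

For the example dump above, the total is 18326229 signatures, roughly 10.5% of the 174464740 max nodes.  The article does not define a numeric threshold for "close to" the maximum, so the percentage is only an aid for judging how much room remains.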


Workaround:
Disabling compaction can prevent data loss from occurring:  
/opt/pdcr/bin/crcontrol --compactoff

However, this also prevents data from being expired and should only be considered as a temporary solution.  Also, this command will need to be repeated after every restart of the Content Router.  While disabling compaction will prevent data loss due to this issue, it is recommended to contact Symantec Technical Support if the above spoold logs appear to be close to exceeding the max signature count at any point.


Formal Resolution:
Symantec has acknowledged that the above mentioned issue (Etrack 1587284) is present in the current version(s) of the product(s) mentioned at the end of this article. The formal resolution to this issue is addressed in the following release:

- PureDisk Remote Office Edition 6.5.1.2 - Currently scheduled to be available at the end of April 2009.

Once released, 6.5.1.2 will be available in the Related Documents section below and at the following location:    http://www.symantec.com/business/support/downloads.jsp?pid=52672

It is recommended to apply 6.5.1.2 as soon as possible if potentially affected by this issue.  If PureDisk 6.5.1.2 is not yet available, or a workaround cannot be applied, please contact Symantec support, referencing this TechNote ID and Etrack 1587284 to obtain a fix for this issue.

The 6.5.1.2 release will contain a formal resolution to prevent this issue from occurring in the future and will contain a tool to detect if a Content Router is affected, or has been in the past.  If the Content Router is found to be affected, please contact Symantec Enterprise Technical Support for further instructions on how to proceed.

If further information becomes available about this issue, this document will be updated accordingly.  Subscribe to this document directly to be informed of any changes by clicking the following link:    http://maillist.support.veritas.com/notification.asp?doc=321531


Best Practices:
Symantec strongly recommends the following best practices:
1. Always perform a full backup prior to and after any changes to your environment.
2. Always make sure that your environment is running the latest version and patch level.
3. Perform periodic "test" restores.
4. Subscribe to technical articles.

How to Subscribe to Software Alerts:
If you have not received this TechNote from the Symantec Email Notification Service as a Software Alert, please subscribe at the following link:  http://maillist.entsupport.symantec.com/subscribe.asp




Supplemental Materials

Source: ETrack
Value: 1587284
Description: ETrack (NetBackup) 1587284: NB_PUREDISK_REMOTE_OFFICE Data losses with 16TB multi-CR performance testing


Legacy ID
321531

