Duplication from an MSDP storage server to a PDDO storage server may fail silently and cause data loss when duplicated disk images contain 10 or more fragments

Article:TECH172357  |  Created: 2011-10-20  |  Updated: 2014-11-04  |  Article URL http://www.symantec.com/docs/TECH172357
Article Type
Technical Solution

Issue



Document History:
November 8, 2011 - The Workaround section of this TechNote was updated to include availability of a new EEB package for Appliances.
December 13, 2011 - The Workaround section of this TechNote was updated to include availability of new EEB hotfixes for NetBackup and Appliances.  See the Related Articles below.
December 20, 2011 - A resolution for this issue is now available.  See the Solution section below.

Introduction:
A potential for data loss has been found for backups that have been duplicated from a Media Server Deduplication Pool (MSDP) to a PureDisk Deduplication Option (PDDO) storage pool.  Data loss can occur on the PDDO storage pool (only) due to metadata being incorrectly imported into the PureDisk metabase.  The original image on the MSDP storage server is still restorable until it expires.

This issue can occur when the MSDP storage server is either a NetBackup media server or a NetBackup 5200/5220 Appliance. It affects both PureDisk storage pools and PureDisk Appliance (5000/5020) storage pools.

The issue only occurs when the image being duplicated has 10 or more fragments.  In certain cases, the number of fragments can be as high as 250 before the issue is encountered.  Please refer to the tables below for additional detail on the number of fragments and versions affected.
 


Error



Status codes 83 or 191 may be reported along with failures restoring or verifying images from a PDDO storage server.  Messages similar to the following may appear in the Job Details:

10/20/2011 17:06:33 - Info bptm (pid=25827) waited for empty buffer 0 times, delayed 0 times
10/20/2011 17:06:33 - end reading; read time: 0:00:01
10/20/2011 17:06:33 - Critical bptm (pid=25827) image open failed: error 2060018: file not found
10/20/2011 17:06:33 - Info bptm (pid=25827) EXITING with status 83 <----------
10/20/2011 17:06:48 - Error bpbrm (pid=25823) from client myhost: The following files/folders were not restored:
10/20/2011 17:06:48 - Error bpbrm (pid=25823) from client myhost: UTF - /var/largefile.tmp
10/20/2011 17:06:48 - Info myhost (pid=25827) StorageServer=PureDisk:mysserver; Report=PDDO Stats for (mysserver): read: 182019 KB, stream rate: 20.51 MB/sec, CR received: 182096 KB, dedup: 0.0%
10/20/2011 17:06:48 - Info tar (pid=25826) done. status: 5
10/20/2011 17:06:48 - Info tar (pid=25826) done. status: 83: media open error
10/20/2011 17:06:48 - Error bpbrm (pid=25823) client restore EXIT STATUS 83: media open error
10/20/2011 17:06:48 - restored from image myhost_1319120297; restore time: 0:00:29
10/20/2011 17:06:48 - Warning bprd (pid=25809) Restore must be resumed prior to first image expiration on Thu 03 Nov 2011 09:18:17 AM CDT
10/20/2011 17:06:49 - end Restore; elapsed time 0:00:31

10/20/2011 12:24:17 - Critical bpdm (pid=9237) image open failed: error 2060018: file not found
10/20/2011 12:24:18 - Error bpbrm (pid=9229) from client myhost: ERR - Unexpected EOF on archive file
10/20/2011 12:24:18 - Info myhost (pid=9237) StorageServer=PureDisk:mysserver; Report=PDDO Stats for (mysserver): read: 182022 KB, stream rate: 23.50 MB/sec, CR received: 182096 KB, dedup: 0.0%
10/20/2011 12:24:18 - Info tar (pid=9236) done. status: 3
10/20/2011 12:24:18 - Error bpbrm (pid=9229) ERR - Unexpected EOF reading image. The image information is not complete.
10/20/2011 12:24:18 - Error bpverify (pid=9218) from host myhost, ERR - Unexpected EOF reading image. The image information is not complete.
10/20/2011 12:24:18 - Error bpverify (pid=9218) Verify of policy 250_Frag_Policy, schedule Full (myhost_1319120297) failed, tar had an unexpected error.
10/20/2011 12:24:18 - Error bpverify (pid=9218) Status = no images were successfully processed.
10/20/2011 12:24:18 - end Verify; elapsed time 0:00:28
10/20/2011 12:24:18 - Info tar (pid=9236) done. status: 83: media open error
no images were successfully processed  (191)

 


Environment



What is Affected:

When using NetBackup PureDisk:

MSDP Source                    Necessary Conditions For Data Loss (PureDisk)
-----------------------------  ---------------------------------------------
7.0.1 + ET 2233961 EEB         >= 100 Fragments
7.1                            >= ~250 Fragments
7.1.0.1 or 5200/5220 2.0       >= ~250 Fragments
7.1.0.1 EEB Rollup             >= 10 Fragments
7.1.0.2 or 5200/5220 2.0.1     >= 10 Fragments

 

When using NetBackup PureDisk 5000/5020 Appliances:

MSDP Source                    Necessary Conditions For Data Loss (NBU 50x0 Appliance)
-----------------------------  -------------------------------------------------------
7.0.1 + ET 2233961 EEB         >= 100 Fragments
7.1                            >= ~250 Fragments
7.1.0.1 or 5200/5220 2.0       >= ~250 Fragments
7.1.0.1 EEB Rollup             >= 10 Fragments if appliance version < 1.4, otherwise >= ~250 Fragments
7.1.0.2 or 5200/5220 2.0.1     >= 10 Fragments if appliance version < 1.4, otherwise >= ~250 Fragments

How to Determine if Affected:
To find out whether any images are (or may be) affected, run the following command on the master server to check the number of fragments in each image:

# /usr/openv/netbackup/bin/admincmd/bpimagelist -l | grep IMAGE | awk '{print $6 ", Policy: " $7 ", Copies: " $21 ", Fragments: " $22/$21}'

Refer to the tables above.  If any image is reported with a fragment count equal to or higher than the applicable threshold listed there, there is a potential for data loss if corrective action is not taken.
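
The thresholds in the tables differ by version, so it can help to pass the threshold as a parameter rather than reading every line of output.  The following is a minimal sketch, not a Symantec-supplied tool.  It assumes the same field layout as the command above ($6 = backup ID, $7 = policy, $21 = copies, $22 = total fragments) and assumes that image records begin with the word IMAGE.  The function name flag_images and its threshold argument are illustrative only.

```shell
#!/bin/sh
# Sketch: list images whose per-copy fragment count meets a chosen threshold.
# Assumed field layout (same as the command above):
#   $6 = backup ID, $7 = policy, $21 = number of copies, $22 = total fragments
# flag_images and its argument are hypothetical names, not NetBackup options.
flag_images() {
    threshold="$1"
    awk -v t="$threshold" '
        # Only IMAGE records; skip records with zero copies to avoid dividing by 0.
        $1 == "IMAGE" && $21 > 0 {
            frags = $22 / $21
            if (frags >= t)
                print $6 ", Policy: " $7 ", Copies: " $21 ", Fragments: " frags
        }'
}

# Example use on a master server, with the threshold taken from the tables:
#   /usr/openv/netbackup/bin/admincmd/bpimagelist -l | flag_images 250
```

Choose the threshold from the row of the table that matches the MSDP source version; when in doubt, the lowest value (10) gives the most conservative list.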


Cause



There are three potential causes for metadata being improperly imported into the PureDisk metabase during optimized duplication from MSDP to PDDO:

  1. MSDP does not properly format its PO (path object) queries for each fragment.  The query for the first fragment incorrectly returns all fragments that start with "1" (F1, F10, F11, F12...).  This query causes duplicate POs for each fragment >= 10 to be inserted into the PureDisk metabase.  These duplicate POs and the associated data are then removed by PureDisk data removal.
  2. MSDP inserts POs in batches of 1000 into the PureDisk metabase.  1000 POs translates to roughly 250 fragments (likely a few less than 250).  However, after inserting the first batch, the batch is not reset properly.  The result is that POs from the first batch are inserted along with POs in the second batch, causing duplicate POs.  These duplicate POs and the associated data are then removed by PureDisk data removal.
  3. With NetBackup MSDP 7.0.1 where the Emergency Engineering Binary (EEB) from Etrack 2233961 is installed, a replication job is started for each fragment in the data set instead of one replication job for the whole data set.  Combined with the improper queries described in scenario 1 above, duplicate POs can be inserted into the PureDisk metabase.  These duplicate POs and the associated data are then removed by PureDisk data removal.
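
The first two causes can be illustrated with toy models.  The sketch below is not the MSDP code; it only mimics the behavior described above.  The fragment names F1..Fn, the function names, and the one-PO-per-loop-step simplification are all assumptions made for illustration.

```shell
#!/bin/sh
# Toy models of causes 1 and 2. All names here are hypothetical.

# Print fragment names F1..Fn, one per line.
fragments() {
    awk -v n="$1" 'BEGIN { for (i = 1; i <= n; i++) print "F" i }'
}

# Cause 1, faulty lookup: returns every name that STARTS WITH the key,
# so a lookup for F1 also returns F10, F11, F12, ...
lookup_prefix() {
    awk -v key="$1" 'index($0, key) == 1'
}

# Intended lookup: returns only the name that EQUALS the key.
lookup_exact() {
    awk -v key="$1" '$0 == key'
}

# Total rows returned when each of n fragments is looked up once
# with the given lookup function.
total_rows() {
    n="$1"; lookup="$2"; total=0
    for name in $(fragments "$n"); do
        rows=$(fragments "$n" | "$lookup" "$name" | wc -l)
        total=$((total + rows))
    done
    echo "$total"
}

# Cause 2: n POs are sent in batches of b. With reset=0 the batch buffer
# is never cleared after a send, so each later send repeats earlier POs.
# Prints the total number of POs sent.
sent_pos() {
    awk -v n="$1" -v b="$2" -v reset="$3" 'BEGIN {
        buffered = 0; sent = 0
        for (i = 1; i <= n; i++) {
            buffered++
            if (i % b == 0 || i == n) {      # batch full, or last PO
                sent += buffered
                if (reset) buffered = 0      # the step the faulty code omits
            }
        }
        print sent
    }'
}
```

In this model, total_rows 9 lookup_prefix prints 9 (no duplicates below 10 fragments), while total_rows 12 lookup_prefix prints 15 because F10, F11, and F12 are each returned twice.  Likewise, sent_pos 1500 1000 0 prints 2500, whereas sent_pos 1500 1000 1 prints 1500; the extra 1000 are the first batch being sent again.  At roughly 4 POs per fragment (1000 POs for about 250 fragments), the second batch starts near the 250-fragment mark, which matches the thresholds in the tables.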

Solution



The formal resolution to this issue (Etrack 2585027) is included in the following release:

  • NetBackup MSDP 7.1 Maintenance Release 3 (7.1.0.3)

More information on NetBackup 7.1.0.3 is available in the Related Article linked below.

This issue was scheduled to be addressed in the following release:

  • NetBackup 5200/5220 Appliance 2.0.2

When this version is released, please visit the following link for download and readme information:
 http://www.symantec.com/business/support/index?page=landing&key=58991

Note that applying a fix prevents future duplications from encountering this issue, but it does not repair duplications that occurred before the fix was applied.

Workaround:
Disable the PDDO Data Removal and Metabase Garbage Collection policies on the PureDisk storage pool to prevent further data loss until a formal resolution can be applied.

To identify images that are vulnerable to being lost, but have not yet been lost (PDDO Data Removal has not run since the images were duplicated), run the following query on the PureDisk metabase:

SELECT dirname, basename, cdoref, hashref, count(*) FROM <ds_raw_x> WHERE modtype <> 'D' AND type = 0 AND statusid = 0 AND basename LIKE '%.img' GROUP BY dirname, basename, cdoref, hashref HAVING count(*) > 1 ORDER BY basename;
 
...where the x in <ds_raw_x> is replaced with the ID of the PDDO data selection (normally 3 for a PD appliance and 2 for PureDisk).
 
For images that have likely already been lost, run the following command on the master server to determine which images contain 10 or more fragments.  Only images residing on a PDDO storage server that were duplicated from an MSDP storage server are vulnerable to data loss.
 
# /usr/openv/netbackup/bin/admincmd/bpimagelist -l | grep IMAGE | awk '{if ($22/$21 >=10) print $6 ", Policy: " $7 ", Copies: " $21 ", Fragments: " $22/$21}'
 
Images that have been lost and cannot be verified by NetBackup should be expired to remove any remaining entries from the PureDisk metabase.

If duplicate POs are found in the metabase, please contact Symantec technical support for assistance with cleanup.

For NetBackup 7.1.0.2, an Emergency Engineering Binary (EEB) package containing a fix for this issue is available through the link in the Related Articles section below.

For NetBackup 5200/5220 Appliances running version 2.0.1, an EEB package containing a fix for this issue is available through the link in the Related Articles section below.

Note: No patches or EEBs are needed on the server or Appliance where the destination (PDDO) pool is located.  Corrective action is only required on the server or Appliance hosting the source (MSDP). 

Please subscribe to this document (see below) to receive notification of updates.

Best Practices:
Symantec strongly recommends the following best practices:
1. Always perform a full DR backup prior to making any changes to your environment.
2. Always make sure that your environment is running the latest version and patch level.
3. Perform periodic "test" restores.
4. Subscribe to technical articles.

How to Subscribe to Email Notification:
Directly to this Article:
Subscribe to this article by clicking on the Subscribe via email link on this page to receive notification when this article is updated with Release Information.
 
Software Alerts:
If you have not received information about this TechNote from the Symantec Email Notification Service as a Software Alert, you may subscribe via email and/or RSS using the links provided at the following pages:

NetBackup Enterprise Server: http://www.symantec.com/business/support/index?page=content&key=15143&channel=ALERTS

 


Supplemental Materials

Source: ETrack
Value: 2585027
Description: spad sends duplicate POs when replicating images with 10 or more fragments



