TECH ALERT: Data loss of disk fragments can occur when master/media communication is interrupted during optimized duplication.

Article:TECH69597  |  Created: 2009-01-12  |  Updated: 2009-01-04  |  Article URL http://www.symantec.com/docs/TECH69597
Article Type
Technical Solution

Product(s)

Environment

Problem



TECH ALERT: Data loss of disk fragments can occur when master/media communication is interrupted during optimized duplication.

Solution



Note: This TechNote was updated May 1, 2009 with additional information.

Introduction:
When using OpenStorage (OST) optimized duplication, in the rare event that there is a break in communication between the master and media server, original disk images can become corrupted when fragments in the process of being duplicated are removed.  This is not a defect of any specific vendor, but rather an issue involving the behavior of NetBackup's disk manager bpdm during optimized duplication.

What is Affected:
The following product versions are affected on all supported platforms where optimized duplication occurs:
  • NetBackup Enterprise Server 6.5, 6.5.1, 6.5.2, 6.5.3, and 6.5.3.1
How to Determine if Affected:
To determine if an OST plug-in has been installed, examine the output of tpconfig -dsh -all_hosts:

# /usr/openv/volmgr/bin/tpconfig -dsh -all_hosts
==============================================================================
Media Server:                   mymedia
Storage Server:                 myapp.example.veritas.com
User Id:                        ost-myapp
  Storage Server Type:        NearStore
  Storage Server Type:        SnapVault
  Storage Server Type:        <OSTtype>
==============================================================================

tpconfig -dsh -all_hosts can be run from any media server and can be found in the following paths:
UNIX: /usr/openv/volmgr/bin
Windows: <install_path>\VERITAS\Volmgr\bin

When a media server is performing an optimized duplication of a disk image and one of the following occurs:
  • The master server crashes
  • The master server is manually stopped
  • Communication between the master and media servers is interrupted (for example, due to a network issue)
The media server will continue duplication of the current fragment, but it will fail.  A socket error will be reported in the bpdm log upon completion of that fragment:
14:55:05.986 [14922] <16> logERROR: cannot log message to error database, error = cannot connect on socket
14:55:05.986 [14922] <8> replace_disk_frag: cannot update image database to add completed fragment, error = cannot connect on socket

bpdm then proceeds to delete the fragment it was about to copy when it lost communication with the master server:
14:55:05.996 [14922] <4> xx03: /usr/openv/lib/ost-plugins/libstspi<vendor>.so:stspi_delete_image STS_EOK pathname =xx03_nbutest/nbutest03_1236281108_C1_F3_R3:1236281108:nbutest03_testpolicy:4:1:: deleted

In the log above, fragment 3 of copy 1 was deleted after a failed duplication.

When communication is restored with the master server and the duplication is retried, the optimized duplication fails, because the previously deleted image fragment cannot be found.  In this example, the job proceeds normally until reaching the point where fragment 3 is expected but not found:
03/05/2009 15:05:09 - begin Duplicate
03/05/2009 15:05:14 - Info Duplicate (pid=12603) Initiating optimized duplication from @aaaab to @aaaad
03/05/2009 15:05:15 - started process bpdm (pid=15749)
03/05/2009 15:05:08 - requesting resource LCM_nbutest03_alpxx03-nbutest
03/05/2009 15:05:08 - granted resource LCM_nbutest03_alpxx03-nbutest
03/05/2009 15:05:08 - started process RUNCMD (pid=12603)
03/05/2009 15:05:10 - requesting resource nbutest03_alpxx03-nbutest
03/05/2009 15:05:11 - granted resource MediaID=@aaaad;DiskVolume=alpxx03_nbutest;DiskPool=alpxx03_nbutest;Path=alpxx03_nbutest;StorageServer=alpxx03;MediaServer=nbutest03
03/05/2009 15:05:11 - granted resource nbutest03_alpxx03-nbutest
03/05/2009 15:05:12 - requesting resource @aaaab
03/05/2009 15:05:13 - granted resource MediaID=@aaaab;DiskVolume=xx03_ghnbutest;DiskPool=xx03_nbutest;Path=xx03_ghnbutest;StorageServer=xx03;MediaServer=nbutest03
03/05/2009 15:05:21 - begin writing
03/05/2009 15:05:44 - end writing; write time: 0:00:23
03/05/2009 15:05:46 - begin writing
03/05/2009 15:05:57 - end writing; write time: 0:00:11
03/05/2009 15:05:58 - Critical bpdm (pid=15749) image open failed: error 2060013: no more entries
03/05/2009 15:05:59 - Error bpduplicate (pid=12603) host ghnbutest03 backup id nbutest03_1236281108 optimized duplication failed, media open error (83).
03/05/2009 15:06:00 - Error bpduplicate (pid=12603) Duplicate of backup id nbutest03_1236281108 failed, media open error (83).
03/05/2009 15:06:00 - Error bpduplicate (pid=12603) Status = no images were successfully processed.

Optimized duplication can occur as part of a Storage Lifecycle Policy (SLP), Vault policy or via a manual duplication, either initiated from the GUI or via the bpduplicate command.

Status 83 and 191 failures for Duplication jobs in the Activity Monitor may point to this issue.

In the case of SLP (storage lifecycle manager), the duplication job then proceeds with a normal duplication - which also fails when the missing fragment isn't found.  This results in the parent job failing, reporting status code 191 in the Activity Monitor:
03/05/2009 15:06:00 - Error bpduplicate (pid=12603) Continuing with normal duplication.
03/05/2009 15:06:03 - started process bptm (pid=15786)
03/05/2009 15:06:00 - requesting resource nbutest03_alpxx03-nbutest
03/05/2009 15:06:00 - requesting resource @aaaab
03/05/2009 15:06:00 - reserving resource @aaaab
03/05/2009 15:06:01 - resource @aaaab reserved
03/05/2009 15:06:01 - granted resource MediaID=@aaaad;DiskVolume=alpxx03_nbutest;DiskPool=alpxx03_nbutest;Path=alpxx03_nbutest;StorageServer=alpxx03;MediaServer=nbutest03
03/05/2009 15:06:01 - granted resource nbutest03_alpxx03-nbutest
03/05/2009 15:06:01 - granted resource MediaID=@aaaab;DiskVolume=xx03_nbutest;DiskPool=xx03_nbutest;Path=xx03_ghnbutest;StorageServer=xx03;MediaServer=nbutest03
03/05/2009 15:06:11 - started process bpdm (pid=15791)
03/05/2009 15:06:15 - begin reading
03/05/2009 15:29:35 - end reading; read time: 0:23:20
03/05/2009 15:29:36 - begin reading
03/05/2009 15:47:25 - end reading; read time: 0:17:49
03/05/2009 15:47:25 - Critical bpdm (pid=15791) image open failed: error 2060013: no more entries
03/05/2009 15:52:26 - Error bpduplicate (pid=12603) host nbutest03 backup id nbutest03_1236281108 read failed, media open error (83).
03/05/2009 15:52:26 - Error bpduplicate (pid=12603) host nbutest03 backupid nbutest03_1236281108 write failed, media manager killed by signal (82).
03/05/2009 15:52:26 - Error bpduplicate (pid=12603) Duplicate of backup id nbutest03_1236281108 failed, media manager killed by signal (82).
03/05/2009 15:52:26 - Error bpduplicate (pid=12603) Status = no images were successfully processed.
03/05/2009 15:52:27 - end Duplicate; elapsed time 0:47:18 no images were successfully processed (191)

nbstserv then marks the SLP status of the image as COMPLETE, as seen in the below nbstlutil list -U output.  Note that information about the missing fragment is included, even though the fragment has already been erroneously deleted:
Image:
master_server= nbutest01
backup_id= nbutest03_1236281108
client= nbutest03
backup_time= 1236281108(Thu Mar  5 14:25:08 2009)
policy= nbutest03_testpolicy
client_type= 0
schedule_type= 0
storage_service_name= standard_nbutest03-xx03_2week
storage_service_state= 3(COMPLETE)
time_in_process= 1236282472(Thu Mar  5 14:47:52 2009)
class_id=

Copy:
master_server= nbutest01
backup_id= nbutest03_1236281108
copy_number= 1
expire_time= 1237490708(Thu Mar 19 15:25:08 2009)
try_to_keep= 1237490708(Thu Mar 19 15:25:08 2009)
destination= nbutest03_xx03-nbutest
duplication status= 3(COMPLETE)
job_id= 0
expire_type= 0(FIXED)
mpxstate= 0

Fragment:
master_server= nbutest01
backup_id= nbutest03_1236281108
copy_number= 1
fragment_number= 1
resume_count= 2
media_id= @aaaab
media_server= nbutest03
storage_server=
stu_type= 0(DISK)
stu_subtype= 6(STSDYNAMIC)
fragment_state= 1(ACTIVE)
fragment_size= 3349151744
delete_header= 1
fragment_id= @aaaab

Fragment:
master_server= nbutest01
backup_id= nbutest03_1236281108
copy_number= 1
fragment_number= 2
resume_count= 2
media_id= @aaaab
media_server= nbutest03
storage_server=
stu_type= 0(DISK)
stu_subtype= 6(STSDYNAMIC)
fragment_state= 1(ACTIVE)
fragment_size= 3207331840
delete_header= 1
fragment_id= @aaaab

Fragment:
master_server= nbutest01
backup_id= nbutest03_1236281108
copy_number= 1
fragment_number= 3
resume_count= 3
media_id= @aaaab
media_server= nbutest03
storage_server=
stu_type= 0(DISK)
stu_subtype= 6(STSDYNAMIC)
fragment_state= 1(ACTIVE)
fragment_size= 1723006976
delete_header= 1
fragment_id= @aaaab

Copy:
master_server= nbutest01
backup_id= nbutest03_1236281108
copy_number= 2
expire_time= 1236286346(Thu Mar  5 15:52:26 2009)
try_to_keep= 0(Wed Dec 31 19:00:00 1969)
destination= nbutest03_alpxx03-nbutest
duplication status= 3(COMPLETE)
job_id= 7
expire_type= 0(FIXED)
mpxstate= 0

Fragment:
master_server= nbutest01
backup_id= nbutest03_1236281108
copy_number= 2
fragment_number= 1
resume_count= 4
media_id= @aaaad
media_server= nbutest03
storage_server=
stu_type= 0(DISK)
stu_subtype= 6(STSDYNAMIC)
fragment_state= 2(TO_BE_DELETED)
fragment_size= 3349151744
delete_header= 1
fragment_id= @aaaad

Fragment:
master_server= nbutest01
backup_id= nbutest03_1236281108
copy_number= 2
fragment_number= 1
resume_count= 5
media_id= @aaaad
media_server= nbutest03
storage_server=
stu_type= 0(DISK)
stu_subtype= 6(STSDYNAMIC)
fragment_state= 2(TO_BE_DELETED)
fragment_size= 3349151744
delete_header= 1
fragment_id= @aaaad

Fragment:
master_server= nbutest01
backup_id= nbutest03_1236281108
copy_number= 2
fragment_number= 1
resume_count= 6
media_id= @aaaad
media_server= nbutest03
storage_server=
stu_type= 0(DISK)
stu_subtype= 6(STSDYNAMIC)
fragment_state= 2(TO_BE_DELETED)
fragment_size= 0
delete_header= 1
fragment_id= @aaaad

Fragment:
master_server= nbutest01
backup_id= nbutest03_1236281108
copy_number= 2
fragment_number= 2
resume_count= 4
media_id= @aaaad
media_server= nbutest03
storage_server=
stu_type= 0(DISK)
stu_subtype= 6(STSDYNAMIC)
fragment_state= 2(TO_BE_DELETED)
fragment_size= 3207331840
delete_header= 1
fragment_id= @aaaad

Fragment:
master_server= nbutest01
backup_id= nbutest03_1236281108
copy_number= 2
fragment_number= 2
resume_count= 5
media_id= @aaaad
media_server= nbutest03
storage_server=
stu_type= 0(DISK)
stu_subtype= 6(STSDYNAMIC)
fragment_state= 2(TO_BE_DELETED)
fragment_size= 3207331840
delete_header= 1
fragment_id= @aaaad

Fragment:
master_server= nbutest01
backup_id= nbutest03_1236281108
copy_number= 2
fragment_number= 3
resume_count= 4
media_id= @aaaad
media_server= nbutest03
storage_server=
stu_type= 0(DISK)
stu_subtype= 6(STSDYNAMIC)
fragment_state= 2(TO_BE_DELETED)
fragment_size= 0
delete_header= 1
fragment_id= @aaaad

The end result is no duplicated image and a corrupted original, now missing a fragment.

Formal Resolution:
The formal resolution to this issue (Etrack 1539913) is scheduled to be included in the following release:
  • NetBackup 6.5 Release Update 4 (6.5.4), scheduled for release in the second quarter of 2009.
When NetBackup 6.5.4 is released, please visit the following link for download and readme information:
 http://www.symantec.com/business/support/overview.jsp?pid=15143

Workaround:
Cancel all running jobs prior to shutting down NetBackup.  Please note that it can take an extended amount of time before all jobs cease.
If this workaround is unfeasible, please contact Symantec technical support, referencing this document ID and Etrack 1640198 to obtain a fix for this issue.

Best Practices:
Symantec strongly recommends the following best practices:
1. Always perform a full backup prior to and after any changes to your environment.
2. Always make sure that your environment is running the latest version and patch level.
3. Perform periodic "test" restores.
4. Subscribe to technical articles.

How to Subscribe to Email Notification:

Article Subscription:
Subscribe to this TechNote for any updates that are made to this article, by clicking on the following link:
 http://maillist.support.veritas.com/notification.asp?doc=323058

Software Alerts:
If you have not received this from the Symantec Technical Support Email Notification Service, please click on the following link to subscribe to future Notifications:
 http://maillist.entsupport.symantec.com/subscribe.asp
 


Supplemental Materials

SourceETrack
Value1539913
DescriptionData loss of disk fragments when master/media communication is interrupted (i.e master server shutdown).

SourceError Code
Value191
Descriptionno images were successfully processed

SourceError Code
Value82
Descriptionmedia manager killed by signal

SourceError Code
Value83
Descriptionmedia open error


Legacy ID



323058


Article URL http://www.symantec.com/docs/TECH69597


Terms of use for this information are found in Legal Notices