
Oracle DB restore fails where NB attempts to access non-existent media

Created: 18 Oct 2012 | 4 comments

I am trying to perform a restore of an Oracle database for a DR test and have run into a problem where NetBackup is attempting to access backups that are not accessible from the DR site.

Overview: My production Oracle database server is actually a pair of Solaris servers running Symantec HA with one node active and one node in standby.  My backups are scheduled through NetBackup which calls an Oracle RMAN script to back up the databases.  Full database backups are generated every night and archive log backups are run at noon and 6:00 PM.  Backups go to a pair of DataDomain devices, one in my primary datacenter and one at my disaster recovery site.  At my disaster recovery site, I have a single Solaris server as a database server.  Finally, I have an Oracle RMAN catalog that runs on a Windows server.  This catalog is shut down twice a day and a Windows backup of just the RMAN directory is made, again to the DataDomain devices.

As noted, I am testing my disaster recovery.  We have stopped the replication between the local and remote DataDomain devices.  I connect to the Unix server at the DR site and am attempting to recover the databases from backup.  I have successfully recovered the control files and started the recovery of the database files.  Here is the RMAN command I'm using:

connect catalog rman10@rmandb
connect target backup_dba
startup nomount
run {
Allocate CHANNEL CH1 type 'SBT_TAPE';
Allocate CHANNEL CH2 type 'SBT_TAPE';
Allocate CHANNEL CH3 type 'SBT_TAPE';
Allocate CHANNEL CH4 type 'SBT_TAPE';
Send 'NB_ORA_CLIENT=st31bora01, NB_ORA_POLICY=ORA_PASPROD';
set until time "to_date('14-Oct-12 01:57:50','dd-Mon-yy hh24:mi:ss')";
alter database mount;
restore database;
recover database;
alter database open resetlogs;
Release CHANNEL CH1;
Release CHANNEL CH2;
Release CHANNEL CH3;
Release CHANNEL CH4;
}

 

The first 67 of 120 files restore cleanly from the backup copy sitting on the DataDomain device at the DR site.  These first successful file restores make reference to "MediaID=@aaaae" which is the backup copy on the remote DataDomain device.

For some reason, the 68th of 120 files (Oracle file #84) attempts to recover from Media ID @aaaac which is the copy on the DD at our primary site.  Since we are simulating the destruction of our primary datacenter, MediaID @aaaac is not available. 

10/18/2012 12:03:27 PM - begin Restore
10/18/2012 12:03:30 PM - restoring image st31bora01_1350200802
10/18/2012 12:03:31 PM - requesting resource @aaaac
10/18/2012 12:03:31 PM - Error nbjm(pid=5684) NBU status: 2074, EMM status: Disk volume is down    
10/18/2012 12:03:31 PM - Error nbjm(pid=5684) NBU status: 2074, EMM status: Disk volume is down    
10/18/2012 12:03:35 PM - end Restore; elapsed time: 00:00:08
allocation failed(10)

 

Oracle's Recovery Manager then fails over to the previous backup and begins to restore files from that.  Oracle file #84 and two others are restored from the next previous backup.   The problem is that after these three files are restored, NetBackup then begins to request recovery from physical tape!  Again, since we are simulating the destruction of our primary datacenter, there is no tape drive or physical tapes available.  Below are some of the details from the NetBackup Administration console.  Note that APA999 is the naming convention we use for physical tapes.

10/18/2012 12:23:56 PM - begin Restore
10/18/2012 12:23:59 PM - 1 images required
10/18/2012 12:23:59 PM - media APA371 required
10/18/2012 12:24:03 PM - restoring image st31bora01_1349596319
10/18/2012 12:24:06 PM - Info bpbrm(pid=7216) telling media manager to start restore on client     
10/18/2012 12:24:10 PM - Info bpbrm(pid=7496) listening for client connection         
10/18/2012 12:24:14 PM - requesting resource APA371
10/18/2012 12:24:14 PM - awaiting resource APA371 A pending request has been generated for this resource request.
     Operator action may be required. Pending Action: No action.,
     Media ID: APA371, Barcode: APA371, Density: hcart, Access Mode: Read,
     Action Drive Name: N/A, Action Media Server: N/A, Robot Number: N/A, Robot Type: NONE,
     Volume Group: 000_00000_TLD, Action Acs: N/A, Action Lsm: N/A
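
One way to see at a glance which resources a restore job asked for, and which of them do not exist at the DR site, is to scan the job details text. The sketch below is in Python and assumes the wording shown in the logs above ("requesting resource ..."). The function names and the set of DR-available media are hypothetical and would need to match the real environment.

```python
import re

# Matches the "requesting resource <id>" lines seen in the job details above.
REQUEST_RE = re.compile(r"requesting resource (\S+)")


def requested_resources(job_details):
    """Return each requested media ID or disk volume once, in first-seen order."""
    found = []
    for line in job_details.splitlines():
        match = REQUEST_RE.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


def missing_at_dr(job_details, available_at_dr):
    """Return requested resources that are not in the (assumed) DR inventory."""
    return [r for r in requested_resources(job_details) if r not in available_at_dr]


if __name__ == "__main__":
    sample = "\n".join([
        "10/18/2012 12:03:31 PM - requesting resource @aaaac",
        "10/18/2012 12:24:14 PM - requesting resource APA371",
        "10/18/2012 12:24:14 PM - awaiting resource APA371 A pending request has been generated",
    ])
    # Assumption: only the DR DataDomain copy (@aaaae) is reachable during the test.
    print(missing_at_dr(sample, {"@aaaae"}))
```

This only summarizes what the job log already says. It does not change which copy NetBackup selects.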

 

So my question is: How do I get NetBackup to smarten up and only use the backups it has access to?

Any help you can offer would be greatly appreciated.

Ken


Marianne's picture

NBU will always request the restore from the primary copy.

The assumption is that the disk copy has expired and tape is the only unexpired copy that is left?

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

khemmerl's picture

OK, so in the event there is a catastrophic failure at the primary site, how do I get NetBackup to validate which backups are actually available?

 

Ken Hemmerling
Alberta Pensions Services Corporation
Database Administrator
5103 Windermere Blvd. SW
Edmonton, AB T6W 0S9

Marianne's picture

Tapes are meant for long-term retention and for DR. For this reason, most users eject tapes on a daily basis and send them to the DR site or a vaulting company for safekeeping.

You need to ensure that all media needed for the restore are available at the DR site.

Or be prepared to perform a 'restore until ...' to another date that is not on tape but on DD only.


khemmerl's picture

Earlier this year we upgraded our NetBackup so that it is aware of the two copies on the DataDomain boxes.  (Previously NetBackup only saw the copy at the primary site and the replication to the DataDomain device at the DR site was done by DD without any knowledge on the part of NetBackup.)

One of the things that is different this year is that prior to restoring we have to go into the NetBackup Admin console, navigate to Catalog, search for the backups associated with a particular policy, select them all, right-click and choose "Set Primary Copy". 

Where I get confused is that the Catalog view displays a Backup ID like "st31bora01_1350281779" but nowhere in the output of my RMAN backup logs is anything with a similar name.  Here's an example of some lines from my RMAN backup log:

channel T1: starting full datafile backupset
channel T1: specifying datafile(s) in backupset
input datafile fno=00020 name=/u02/oradata/pasprod/extlrg_tables03.dbf
channel T1: starting piece 1 at 12-OCT-12
piece handle=PASPROD_df20_20121012_135436 tag=DB_HOT_DAILY_PASPROD_20121012 comment=API Version 2.0,MMS Version 5.0.0.0
channel T1: finished piece 1 at 12-OCT-12
channel T1: backup set complete, elapsed time: 00:12:22

I'm trying to trace backwards to find the problem with my restore.  I have the Oracle file name, the piece handle and the tag.  How can I find the associated Backup ID?
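
As a starting point for that trace, a NetBackup Backup ID is normally the client name followed by the backup's start time as a Unix epoch value. A minimal Python sketch, assuming that convention holds for these images, decodes the IDs quoted in this thread into UTC timestamps that can be compared with the dates in the RMAN log. The function name `parse_backup_id` is hypothetical.

```python
from datetime import datetime, timezone


def parse_backup_id(backup_id):
    """Split 'client_epoch' into (client, start time as an aware UTC datetime)."""
    client, _, epoch = backup_id.rpartition("_")
    if not client or not epoch.isdigit():
        raise ValueError("not in client_epoch form: %r" % backup_id)
    return client, datetime.fromtimestamp(int(epoch), tz=timezone.utc)


if __name__ == "__main__":
    # Backup IDs taken from the job details and Catalog view quoted in this thread.
    for backup_id in (
        "st31bora01_1350200802",
        "st31bora01_1349596319",
        "st31bora01_1350281779",
    ):
        client, started = parse_backup_id(backup_id)
        print(backup_id, client, started.isoformat())
```

The decoded times are in UTC, while the RMAN log prints dates in the database server's local time, so allow for the offset when matching. This narrows down which image is likely to hold a given piece; it is a cross-check and not a lookup in the NetBackup catalog.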

 
