Video Screencast Help

bplabel -erase didn't actually leave assigned tapes alone. Ouch.

Created: 09 Oct 2012 • Updated: 10 Oct 2012 | 14 comments
RLeon's picture
This issue has been solved. See solution.

This happened. It is kind of a sad story with a moral lesson and a warning to all.

Netbackup 7.0.

So there was this media pool with some 50+ VTL tapes.
(Yes, I know, VTLs are a dying breed.)
Most of the tapes in the pool were assigned and full.
The rest were unassigned because all backup images inside them had expired.

I wanted to do a quick erase on each of the unassigned tapes in the pool.
This is a common practice for VTLs because only then would an unassigned tape actually has its data deleted inside the VTL.
Some VTLs do this automatically but our's isn't one that does.

For more information about erasing unassigned VTL tapes and why it is necessary:
  https://www-secure.symantec.com/connect/idea/optio...
  https://www-secure.symantec.com/connect/forums/can...
  https://www-secure.symantec.com/connect/forums/scr...

Anyhow, the Netbackup admin console GUI doesn't actually allow you to erase an assigned tape,
so I thought its command-line equivalent would behave the same.
If you do this to an assigned tape:
  bplabel.exe -erase -m vt0001 -d hcart3 -o -p VTL_pool -host nbumaster

It will say "Media is assigned, and will not be labeled".
And if you check the Activity Monitor, you will find a Media Erase job that says the following:
  Info bptm(pid=1076) INF - Waiting for mount of media id VT0001 on server nbumaster for reading.
  mounted; mount time: 00:00:07
  Info bptm(pid=1076) EXITING with status 0 <----------        
  Error bplabel(pid=3196) Media is assigned, and will not be labeled     
  end Erase; elapsed time: 00:00:14
  cannot overwrite media, data on it is protected(168)

So it won't let you erase assigned tapes by mistake.
Sounds reassuring.
If you still don't believe it, check the catalog, you will find that all backup images are still there on the tape.

So bplabel -erase does protect you from erasing assigned media.
Except when it doesn't.
What would actually happen - at least in my case - is that even though Netbackup would keep its word on not touching the tape,
the VTL would somehow still receive the erase command and delete all data on the tape. Magic.
You'll know when you try to restore some files from the tape and couldn't.
You'll get some kind of read timeout error.
The Netbackup catalog-data about the backup images on the tape becomes irrelevant,
because the data will not be on the tape anymore.

Now, when you are ready for this:
I scheduled and ran a daily-recurring script that does bplabel -erase against all tapes in the VTL media pool.
It should leave the assigned tapes alone right?
The script didn't get to run the second time when we figured out what happened.

In short: Netbackup kept its promise (apparently) while the VTL destroyed all.
I couldn't even import the tapes because our VTL apparently does a very good job in securely deleting the data associated with the tapes when it is told to erase them.

A stupid destructive script was clearly ran against the media pool,
and Netbackup had clearly skipped all assigned tapes and protected all catalog-data about the backup images on the tapes.
Our VTL vendor says that there is nothing they could do about it because the VTL was clearly told to erase the tapes.
Only that it shouldn't have been told to.
We haven't logged a support case with Symantec yet, because we're pretty sure that "the VTL shouldn't have done it!".
I know, we probably should. But not before we are sober.
 

Comments 14 CommentsJump to latest comment

Yasuhisa Ishikawa's picture

What would actually happen - at least in my case - is that even though Netbackup would keep its word on not touching the tape,
the VTL would somehow still receive the erase command and delete all data on the tape. Magic.

Well, how did you check if the VTL recieve the command?
I could not find any clue in messages you post. If NetBackup did not send erase command, who did it?

Have you already tried to restore after this test?

Authorized Symantec Consultant(ASC) Data Protection in Tokyo, Japan

RLeon's picture

Well, how did you check if the VTL recieve the command?

Have you already tried to restore after this test?

In my post:

You'll know when you try to restore some files from the tape and couldn't.
You'll get some kind of read timeout error.

It was actually the restore errors that clued us on checking the VTL.
In the VTL admin interface, we checked all the tapes that the bplabel -erase command ran against.
All of them have had their data deleted, and disk space reclaimed. Unrecoverable.
Meanwhile, Netbackup still happily tells me that all images on the assigned tapes are fine, despite that they were bplabel -erase'd.
Checked both the catalog and the media pool, and Netbackup did not lie.

Well, how did you check if the VTL recieve the command?

As said above, we actually checked from the VTL's own interface. All tapes in that pool, assigned or not, have had their data deleted and disk space reclaimed.

I could not find any clue in messages you post. If NetBackup did not send erase command, who did it?

I feel your pain....

To be fair, and also to be sure, we tested the whole scenario again, using dummy data.
The same thing happened.
Before bplabel -erase, we could restore the data.
After bplabel -erase, Netbackup would boldly claim that it didn't touch assigned tapes, while the VTL would show that the tape data has just been deleted.
Test a restore now, you could still see all the files in BAR because the catalog image db data about the images is untouched, but when you actually go ahead with the restore, you will get read timeout errors, because as we know, the tape is empty now, but Netbackup doesn't know it.
 

Yasuhisa Ishikawa's picture

Sorry, I could not make sense from your first post.

What you would to say is that NetBackup erase tapes althought it looks like failed when you run "bplabel -erase" with "-o" option. Right?

Now, I have tested with NetBackup 7.5 and mhvtl, but backups are not erased and can be restored.
I'm not sure why backups are erased in your environment, but if you have to keep data, you should not use "force" option like "-o".

Authorized Symantec Consultant(ASC) Data Protection in Tokyo, Japan

RLeon's picture

Sorry, I could not make sense from your first post.

No worries, I couldn't make sense of any of this at all either, especially now that you are telling me that it doesn't behave that way in 7.5 and mhvtl.

but if you have to keep date, you should not use "force" option like "-o".

The -o option was used because I run bplabel in a scheduled daily-recurring non-interactive script.
But then, even if I drop the -o and manually press 'yes' to all the promots, I bet the same thing would still happen.

As I said in the top post:

A stupid destructive script was clearly ran against the media pool

I admit to part of the responsibility of doing something risky to our backed up data. But let me add that it wasn't considered risky at the time because bplabel -erase was proven to skip assigned media.
And it does! But my data is gone. This is so confusing.

mph999's picture

What OS are you on ?

I've just run your command on my system (not VTL) but bear with me ...

Eg.

I have two tapes :

R0TP00 - Unassigned

R0TP01 - Assigned with data on it

I ran :

bplabel -erase -m <mediaid> -d 4mm -o -p NetBackup -host womble

Both times the tape was loaded, but, only R0TO00 was 'erased'.  For R0TP01 the message came back saying assigned.

Both times, I used a script that used dtrace to 'trace' the sg scsi ioctls.  Comparing the outputs, this was the only reall difference, that is , these lines only appeared when the tape was erased for sure ...

 

 
pid=18226 Entry: Command=bptm, Device=/devices/pci@1e,600000/pci@0/pci@9/pci@0,2/pci@1/scsi@2/sg@5,0:raw
pid=18226   CDB len=0xa, CDB addr=0xffffffff7fffe884
pid=18226   l2=0x144, p2=0x100a81460
pid=18226   l3=0x0, p3=0x0
pid=18226   Flags: 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
pid=18226      SCSI CDB bytes: 0x4d, 0x0, 0x6e, 0x0, 0x0, 0x0
pid=18226 Exit: Command=bptm, retval=0, errno=0
pid=18226   CDB len=0xa, CDB addr=0xffffffff7fffe884
pid=18226   l2=0x144, p2=0x100a81460
pid=18226   l3=0x0, p3=0x0
pid=18226   Flags: 0x0, 0x80, 0x0, 0x0, 0x1f, 0x0
pid=18226 Entry: Command=bptm, Device=/devices/pci@1e,600000/pci@0/pci@9/pci@0,2/pci@1/scsi@2/sg@5,0:raw
pid=18226   CDB len=0x6, CDB addr=0xffffffff7fffe690
pid=18226   l2=0x50, p2=0x100a81460
pid=18226   l3=0x0, p3=0x0
pid=18226   Flags: 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
pid=18226      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x50, 0x0
pid=18226 Exit: Command=bptm, retval=0, errno=0
pid=18226   CDB len=0x6, CDB addr=0xffffffff7fffe690
pid=18226   l2=0x50, p2=0x100a81460
pid=18226   l3=0x0, p3=0x0
pid=18226   Flags: 0x0, 0x80, 0x0, 0x0, 0x1f, 0x6
pid=18226 Entry: Command=bptm, Device=/devices/pci@1e,600000/pci@0/pci@9/pci@0,2/pci@1/scsi@2/sg@5,0:raw
pid=18226   CDB len=0x6, CDB addr=0xffffffff7fffe888
pid=18226   l2=0x0, p2=0x0
pid=18226   l3=0x0, p3=0x0
pid=18226   Flags: 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
pid=18226      SCSI CDB bytes: 0x17, 0x0, 0x0, 0x0, 0x0, 0x0
pid=18226 Exit: Command=bptm, retval=0, errno=0
pid=18226   CDB len=0x6, CDB addr=0xffffffff7fffe888
pid=18226   l2=0x0, p2=0x0
pid=18226   l3=0x0, p3=0x0
pid=18226   Flags: 0x0, 0x80, 0x0, 0x0, 0x17, 0x0
 
I think these therefore are the 'scsi commands' sent in this sections :
 
pid=18226   Flags: 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
pid=18226      SCSI CDB bytes: 0x4d, 0x0, 0x6e, 0x0, 0x0, 0x0
pid=18226   Flags: 0x0, 0x80, 0x0, 0x0, 0x1f, 0x0
pid=18226   Flags: 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
pid=18226      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x50, 0x0
pid=18226   Flags: 0x0, 0x80, 0x0, 0x0, 0x1f, 0x6
pid=18226   Flags: 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
pid=18226      SCSI CDB bytes: 0x17, 0x0, 0x0, 0x0, 0x0, 0x0
pid=18226   Flags: 0x0, 0x80, 0x0, 0x0, 0x17, 0x0
 
 
4d - log sense
6e - ?
12 - Inquiry
50 - XDWRITE
17 - Release
80 - Write filemarks
1f  - ?
 
Now, this is getting a little beyond my capabilities ... but only in this sections for the tape the was erased  did I see 0x50 (write filemark)  - clearly we're noty writing 'anything' to the tape that was assigned, and indeed, no "write filemarks" are sent.
 
What I can't determine is that if any of the commands in this section mean 'erase' at the sec.  Still at least it is a start, and this shows inf nothing else that there is a difference between the scsi commands sent when the tape is really erased, and when it is not because it is assigned.

 

Regards,  Martin
 
Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805
 
RLeon's picture

Thank you for spending time testing things.

I just want to share some information about this potential pitfall.
We have accepted that whatever happened to our data, it cannot be reversed, and that I don't believe anyone could do anything to help now.

We are on Win 2003 SP2 32bit, connected to a VTL through FC HBAs.
I'm not familiar of anything like DTrace on Windows, and wouldn't know what to make out of the data that such a tool captures.
 

mph999's picture

I'm struggling ... there is no dtrace on windows ... you might need to log a call ....

From the above, I'm actually slighthly wrong , the scsi caomnds sent are just these ...

This is when the tape really was labelled :

The first hex number in the line is the 'command'

pid=18135      SCSI CDB bytes: 0x16, 0x0, 0x0, 0x0, 0x0, 0x0

 

pid=18135      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x50, 0x0
pid=18136      SCSI CDB bytes: 0x1a, 0x8, 0x1d, 0x0, 0x18, 0x0
pid=18136      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x60, 0x0
pid=18136      SCSI CDB bytes: 0xb8, 0x4, 0x0, 0x1, 0x0, 0x1
pid=18136      SCSI CDB bytes: 0xb8, 0x2, 0x0, 0x2, 0x0, 0x6
pid=18136      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x60, 0x0
pid=18136      SCSI CDB bytes: 0xa5, 0x0, 0x0, 0x0, 0x0, 0x2
pid=18129      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x50, 0x0
pid=18129      SCSI CDB bytes: 0xa2, 0x20, 0x0, 0x10, 0x0, 0x0
pid=18129      SCSI CDB bytes: 0x3, 0x0, 0x0, 0x0, 0x14, 0x0
pid=18129      SCSI CDB bytes: 0x19, 0x0, 0x0, 0x0, 0x0, 0x0
pid=18129      SCSI CDB bytes: 0x1a, 0x0, 0x0, 0x0, 0xc, 0x0
pid=18129      SCSI CDB bytes: 0x44, 0x1, 0x0, 0x0, 0x0, 0x0
pid=18226      SCSI CDB bytes: 0x4d, 0x0, 0x6e, 0x0, 0x0, 0x0
pid=18227      SCSI CDB bytes: 0x1a, 0x8, 0x1d, 0x0, 0x18, 0x0
pid=18227      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x60, 0x0
pid=18227      SCSI CDB bytes: 0xb8, 0x4, 0x0, 0x1, 0x0, 0x1
pid=18227      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x60, 0x0
pid=18227      SCSI CDB bytes: 0xa5, 0x0, 0x0, 0x0, 0x0, 0x1
pid=18226      SCSI CDB bytes: 0x4d, 0x0, 0x6e, 0x0, 0x0, 0x0
pid=18226      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x50, 0x0
pid=18226      SCSI CDB bytes: 0x17, 0x0, 0x0, 0x0, 0x0, 0x0

 

the 12th line down is :

 

pid=18129      SCSI CDB bytes: 0x19, 0x0, 0x0, 0x0, 0x0, 0x0
 
0x12  = Erase
 
Looking at the same lines when the tape was assigned, and not overwritten:
 
pid=18439      SCSI CDB bytes: 0x16, 0x0, 0x0, 0x0, 0x0, 0x0
pid=18439      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x50, 0x0
pid=18440      SCSI CDB bytes: 0x1a, 0x8, 0x1d, 0x0, 0x18, 0x0
pid=18440      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x60, 0x0
pid=18440      SCSI CDB bytes: 0xb8, 0x4, 0x0, 0x1, 0x0, 0x1
pid=18440      SCSI CDB bytes: 0xb8, 0x2, 0x0, 0x2, 0x0, 0x6
pid=18440      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x60, 0x0
pid=18440      SCSI CDB bytes: 0xa5, 0x0, 0x0, 0x0, 0x0, 0x3
pid=18433      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x50, 0x0
pid=18433      SCSI CDB bytes: 0xa2, 0x20, 0x0, 0x10, 0x0, 0x0
pid=18433      SCSI CDB bytes: 0x3, 0x0, 0x0, 0x0, 0x14, 0x0
pid=18488      SCSI CDB bytes: 0x4d, 0x0, 0x6e, 0x0, 0x0, 0x0
pid=18489      SCSI CDB bytes: 0x1a, 0x8, 0x1d, 0x0, 0x18, 0x0
pid=18489      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x60, 0x0
pid=18489      SCSI CDB bytes: 0xb8, 0x4, 0x0, 0x1, 0x0, 0x1
pid=18489      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x60, 0x0
pid=18489      SCSI CDB bytes: 0xa5, 0x0, 0x0, 0x0, 0x0, 0x1
pid=18488      SCSI CDB bytes: 0x4d, 0x0, 0x6e, 0x0, 0x0, 0x0
pid=18488      SCSI CDB bytes: 0x12, 0x0, 0x0, 0x0, 0x50, 0x0
pid=18488      SCSI CDB bytes: 0x17, 0x0, 0x0, 0x0, 0x0, 0x0
 
There is no 0x19 in this output, so in the case of Solaris at least, we do not send an erase command when the tape is assigned.
 
I have no idea how to do this in windows (very sorry) but if you were to log a call and shown this, then hopefully they can find someone who can, to see what is gping on and who sent what ....
 
If the NBU code behaves the same way in wiondows as Solaris, then it wasn't us (based on this test) but Windows doesn't have a 'pass thru' driver which Solaris does.
 
Basically, the above is like a network snoop but on teh scsi commands sent via the sg driver.
 
Hope that it helpss in some way, if only to give an idea how this could be approached.  
 
Martin

 

 

Regards,  Martin
 
Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805
 
RLeon's picture

I fully follow the concept and logic behind your method, and understand the results.
Yea, Windows can be annoying. But I'm not worrying about the pass thru driver thing.

I understand what happened in our environment shouldn't have, because as you have shown, that Netbackup shouldn't send that scsi command to assigned tapes.

I just want to tell people not to do this against a media pool, like I did.
Or, at least script it so that it filters for the unassigned tapes before doing bplabel -erase. (I.e., Script the skipping logic yourself, don't let bplabel do it.)

mph999's picture

*** NOTE ***

Cannot edit this post for some reason, I'm not quite right as it happens, so have continued below ...

Martin

 

Regards,  Martin
 
Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805
 
Will Restore's picture

man page says

-o  Unconditionally overwrites the selected media ID.

 

and it sounds like it did

 

 

Will Restore -- where there is a Will there is a way

mph999's picture

And here is the answer ...

A look through eTracks led to this ..

 

The customer was running the Windows Admin console and generated a list of 
unassigned VTL media. Unfortunately this list was generated from a non-Scratch 
production Volume Pool. Later without refreshing this list they went through 
and ran BPLABEL quik erase on the list of 200 tapes. Unfortunately during the 
time between the list was generated and the BPLABEL command was run some of the 
media had been reused and assigned.

... which led me to this ...

http://www.symantec.com/docs/TECH144380

The GUI checks if the media is assigned, the command line doesn't ...

So in summary, proir to NBU 7.5 an erase run from the command line will do exactly that, even on assigned media.

"When executed via the command line, the bplabel command will not check the assignment status of the tapes being labeled or erased. "

This fairly major flaw has been fixed at 7.5 (which is what I test on).

I would argue that you shouldn't run a erase command on an assigned media BUT, fully understand that this could happen and you would expect the software to protect against this, esp. if the GUI does (well, mostly as explained in the TN).

Big thanks to RLeon for taking the time to explain his issue, I certainly learnt something I didn't know today.  I also appreciate his calm, professional approach.

Martin

 

 

Regards,  Martin
 
Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805
 
RLeon's picture

Martin:
I started this thread not intending to find a solution but you found me one anyway! Thanks!
It is unfortunate that the incident happened in a nbu domain that just so happened to be running 7.0, not 7.5.
Oh if only we had upgraded it last month when we have the resources to. But welp.
On the bright side, someone else lost 200 tapes! I only lost around 50...Heh.
This is no laughing matter.

wr:
I did read up on bplabel -o from both the 7.0 and 7.5 command guides.
If only I had interpreted it as something more than just a silent mode.
But then again, in 7.5, even with -o it does still skip over and ignore assigned medias, so I guess the victim count must have been pretty high at some point prior to 7.5.
 

Thanks all,

RLeon

RLeon's picture

I started this thread not intending to find a solution but you found me one anyway! Thanks!

....... For anyone who is here with the same problem, the solution marked here is not the kind that could rescue your data. Get over it. I marked it to be some kind of monument...

mph999's picture

Pleased to help - it was bugging me ...

I think clearly, although I 'could' argue that NBU was working as designed, that part is a bad design and as mentioned, has been changed.

Normally we 'know' about these potentail issues, but that one was new to me - perhaps not that many people run the erase command with a VTL from the command line.

Regards,

Martin

 

Regards,  Martin
 
Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805