All restores not working but backups are

Created: 23 Jan 2013 • Updated: 04 Feb 2013 | 15 comments
This issue has been solved. See solution.

Hi guys,

Since yesterday I have been facing a very strange problem in my NetBackup environment. I didn't make any changes.

The problem is that none of my restores work. Each one starts and runs until the point where it asks for the media it needs:

09:11:50 (78636.xxx) Restore job id 78636 will require 1 image.
09:11:50 (78636.xxx) Media id XXX014 is needed for the restore.

It happens with any client, with different tapes and different robots.

Backups are working fine. That's the strangest part.

I don't know where to start looking...

Yesterday evening I ran a test. Right after I restarted the NetBackup services on the master and media servers, one restore ran fine, but after a while (about 15-30 minutes) all restores ended up in the same state, waiting for a long time. For one of those restores I got a different message:

17:14:49 (77387.001) (77387.001) ERR - Timed out waiting for the media to be mounted and positioned.
17:14:49 (77387.001) (77387.001) ERR - Timed out waiting for the media to be mounted and positioned.

There is no problem with robot communication, since backups are working fine.

I found a thread that is not exactly related, since in my case even small restores (single files) show the same behavior:

In NetBackup 7.1, restore jobs for more than 100,000 files may be delayed for several hours before starting to write data (http://www.symantec.com/business/support/index?pag...)

 

 

 


Nagalla's picture

Hi,

Are you using multiplexed backups?

If yes, please check how a restore behaves for non-multiplexed images.

Is this restore issue related to a specific media, media server, or drive?

You need to isolate the issue at that level.

quebek's picture

Are all the needed tapes in the library? Run an inventory of the library and make sure the tapes needed for the restore are there!

You can check that from the CLI with this command:

vmquery -b -m XXX014

If the robot type column in the output shows NONE, the tape is not inside the library.

Here is the output for two tapes from my environment: the first is out of the library, the second is in the library:

C:\>vmquery -b -m QA8658
media   media  robot  robot  robot  side/  optical  # mounts/      last
 ID     type   type     #    slot   face   partner  cleanings    mount time
-------------------------------------------------------------------------------
QA8658  HCART  NONE     -      -     -       -           2     12/28/2012 20:50

C:\>vmquery -b -m QA8651
media   media  robot  robot  robot  side/  optical  # mounts/      last
 ID     type   type     #    slot   face   partner  cleanings    mount time
-------------------------------------------------------------------------------
QA8651  HCART  TLD      0       4     -       -          12     01/23/2013 13:44
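As a rough illustration of the check described above, the sketch below reads the brief `vmquery -b -m` output and reports whether the robot type column is NONE. It assumes the column layout shown in the two samples (media ID first, robot type third); the function name `tape_in_library` is hypothetical and not part of NetBackup.

```python
def tape_in_library(vmquery_output: str, media_id: str) -> bool:
    """Return True if the brief vmquery output shows the tape inside a robot.

    Assumes the layout in the samples above: the data row starts with the
    media ID and its third column is the robot type (NONE = not in a robot).
    """
    for line in vmquery_output.splitlines():
        fields = line.split()
        # Header and separator rows never start with the media ID.
        if len(fields) >= 3 and fields[0] == media_id:
            return fields[2] != "NONE"
    raise ValueError(f"media id {media_id!r} not found in output")


# The two sample outputs from the post, copied verbatim.
OUT_OF_LIBRARY = """\
media   media  robot  robot  robot  side/  optical  # mounts/      last
 ID     type   type     #    slot   face   partner  cleanings    mount time
-------------------------------------------------------------------------------
QA8658  HCART  NONE     -      -     -       -           2     12/28/2012 20:50
"""

IN_LIBRARY = """\
media   media  robot  robot  robot  side/  optical  # mounts/      last
 ID     type   type     #    slot   face   partner  cleanings    mount time
-------------------------------------------------------------------------------
QA8651  HCART  TLD      0       4     -       -          12     01/23/2013 13:44
"""

if __name__ == "__main__":
    print(tape_in_library(OUT_OF_LIBRARY, "QA8658"))
    print(tape_in_library(IN_LIBRARY, "QA8651"))
```

On a real server the text would come from running `vmquery -b -m <media id>` and capturing its output; that step is left out here so the sketch can be run anywhere.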

Marianne's picture

Is the restore hanging at 'media id needed'?

If so, it seems that the bprd process is hanging. See this discussion:

https://www-secure.symantec.com/connect/forums/master-7503-media-servers-60mp4

The media mount timeout is a separate issue.
Post ALL the text in the job details, as well as the media server bptm log, as a file attachment.

 

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

ldias's picture

I had already checked the inventory and it was fine for all robots. No inconsistencies.

This morning I rebooted the master server and now everything is working.

It looks like a NetBackup bug, since we didn't change anything else.

Let's see how long it takes until the ghosts come back... :)

ldias's picture

@Marianne van den Berg. Your points are very relevant.

They explain why restores worked for a few minutes after I restarted the NetBackup services on the master server and then stopped working again. Most probably there were zombie bprd processes still running. Today's reboot, besides restarting every process, killed all of those zombie bprd processes.

In which situation(s) can the bprd process hang forever? Is it a NetBackup bug?

 

Thanks in advance.

ldias's picture

 

Since I haven't had the problem again so far, I marked @Marianne van den Berg's post as the solution. It seems the most plausible answer. If I have the problem again, I'll try the actions recommended in https://www-secure.symantec.com/connect/forums/master-7503-media-servers-60mp4

I solved the problem by rebooting the master server machine.

Marianne's picture

In which situation(s) can the bprd process hang forever? Is it a NetBackup bug?

I have only ever seen the problem where a FORCE_RESTORE_MEDIA_SERVER entry is in place.

I believe this is a bug. The problem is that very few users have a level 5 bprd log available when they log a support call. It is easy enough to kill the hanging bprd process or else restart NBU.

I once logged a call for NBU 5.1 on an AIX master server. This was back in 2005.
At times the various engineers asked for more logs: bpcd on the master, bptm and bpcd on the media servers, etc.
After a very long time of running with level 5 bprd logging, monitoring file systems, and copying away and deleting the previous day's bprd log each day, we finally managed to get a log that showed the internal comms failure.
Symantec issued a new bprd binary.
The problem was never seen again in that environment, even through NBU upgrades.

It seems that Symantec has never acknowledged the problem across all platforms, which is why the problem is still around.

 

 


ldias's picture

@Marianne van den Berg.

I need your help again. I'm having the same problem again, and it's becoming more frequent.

I had the problem this morning, and only restarting the machine solved it. I tried stopping NetBackup and killing all bprd processes, and that didn't work.

Now I'm having the problem again. I don't know what else I can do... Which logs should I look into?

I don't see errors in the bprd logs. There is this line in the bprd logs from one of the hanging restores:

line handle_image_status: clientname_1359229512 restfiles pid 14080 bpbrm pid 8486 status = 59
 

I didn't find a FORCE_RESTORE_MEDIA_SERVER line on my master server.

Marianne's picture

Which OS is your master server?

What is NBU patch level on your master?

Please show us output of 'bpps' on the master when you see the hanging restore.
Also post bprd log as File attachment.
No guarantees that we will be able to find the issue in bprd log - as per my post above - it took many uploads of level 5 logs before Symantec backline engineer found the issue.

If there are lots of files to be restored, it is always best to select smaller chunks of files/folders at a time. We have seen situations where the bprd process runs out of internal memory. This was supposed to have been resolved in some NBU patch or other. I will see if I can find the details.

 


ldias's picture

OS version: Red Hat Enterprise Linux Server release 6.3, 64-bit

NBU patch level of the master and media servers: 7.5.0.4

Output of bpps: file attached

ps -ef | grep bprd

root      3030     1  0 12:41 ?        00:00:02 /usr/openv/netbackup/bin/bprd
root      5104 15944  0 16:14 pts/2    00:00:00 grep bprd
root     14094     1  0 14:05 ?        00:00:00 bprd -dontfork -mpxmain
root     19966  3030  0 14:53 ?        00:00:00 /usr/openv/netbackup/bin/bprd
root     27670  3030  0 15:12 ?        00:00:00 /usr/openv/netbackup/bin/bprd

I found a TECH: http://www.symantec.com/business/support/index?pag.... I don't know if it is related to my issue, since in my case all restores are hanging, even though the media needed are different.

The attached bprd logs start from the moment I started a restore job.

Attachment                                 Size
bpps_x.txt                                 8.27 KB
bprd_logs__restore_clientname_vmgat.txt    131.25 KB
ldias's picture

I found a post from someone else having the same problem: https://www-secure.symantec.com/connect/forums/una...

In case I want to try killing the bprd process: is it possible to kill just the bprd processes and then start bprd again? If there are backups running (RMAN and regular backups), can I do that without interrupting them?

Based on your comment "If Activity Monitor sits for more than 2 minutes with 'media needed', I know what to look for...", this seems to be an old problem.

In your past experience, has the NetBackup engineering team developed fixes for particular cases?

Marianne's picture

I have only ever needed to kill the 'bprd -dontfork -mpxmain' process (PID 14094 in your case)  and cancel the restore job in Activity Monitor. This will in no way affect your Oracle backup.
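To make the advice above concrete, here is a minimal sketch that picks the `bprd -dontfork -mpxmain` process out of `ps -ef` text, so that only that PID is considered for killing. It assumes the eight-column `ps -ef` layout shown earlier in the thread; the function name `find_mpxmain_pids` is hypothetical, and the sketch only reports PIDs, it does not kill anything.

```python
import os
from typing import List


def find_mpxmain_pids(ps_output: str) -> List[int]:
    """Return PIDs of bprd processes started with the -mpxmain flag.

    Assumes `ps -ef` columns: UID PID PPID C STIME TTY TIME CMD.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 7)  # keep the full command in fields[7]
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header row or malformed line
        cmd = fields[7].split()
        # Match the bprd binary itself, not e.g. "grep bprd".
        if cmd and os.path.basename(cmd[0]) == "bprd" and "-mpxmain" in cmd[1:]:
            pids.append(int(fields[1]))
    return pids


# The ps listing posted earlier in the thread, copied verbatim.
SAMPLE_PS = """\
root      3030     1  0 12:41 ?        00:00:02 /usr/openv/netbackup/bin/bprd
root      5104 15944  0 16:14 pts/2    00:00:00 grep bprd
root     14094     1  0 14:05 ?        00:00:00 bprd -dontfork -mpxmain
root     19966  3030  0 14:53 ?        00:00:00 /usr/openv/netbackup/bin/bprd
root     27670  3030  0 15:12 ?        00:00:00 /usr/openv/netbackup/bin/bprd
"""

if __name__ == "__main__":
    print(find_mpxmain_pids(SAMPLE_PS))
```

The main bprd daemon (PID 3030 in the listing) and its children are deliberately not matched, since the post only recommends killing the `-mpxmain` process and cancelling the restore job.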

As per my post above, I have only ONCE been able to get a permanent solution for this issue because we were able to collect level 5 logs at the right time for Support.

BTW - FORCE_RESTORE_... entries are located in bp.conf on the master server.
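A quick way to check for such entries is to scan the text of bp.conf, which on a UNIX master server is /usr/openv/netbackup/bp.conf. The sketch below assumes the usual `FORCE_RESTORE_MEDIA_SERVER = fromhost tohost` form; the host names in the sample and the function name `force_restore_entries` are made up for illustration.

```python
from typing import List, Tuple


def force_restore_entries(bp_conf_text: str) -> List[Tuple[str, str]]:
    """Return (from_host, to_host) pairs for FORCE_RESTORE_MEDIA_SERVER lines.

    Assumes entries of the form: FORCE_RESTORE_MEDIA_SERVER = fromhost tohost
    """
    entries = []
    for line in bp_conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "FORCE_RESTORE_MEDIA_SERVER":
            hosts = value.split()
            if len(hosts) == 2:
                entries.append((hosts[0], hosts[1]))
    return entries


# Hypothetical bp.conf content; the host names are placeholders.
SAMPLE_BP_CONF = """\
SERVER = master01
SERVER = media01
FORCE_RESTORE_MEDIA_SERVER = oldmedia01 media01
"""

if __name__ == "__main__":
    print(force_restore_entries(SAMPLE_BP_CONF))
```

An empty result would match what the original poster reported: no FORCE_RESTORE_MEDIA_SERVER line on the master server.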

 


SOLUTION
ldias's picture

Last Friday I followed the procedure you suggested earlier in the thread, with one modification:

Step 1: Killed all bprd processes

Step 2: Stopped netbackup master services

Step 3: Added file /usr/openv/netbackup/NON_MPX_RESTORE

Step 4: Restarted netbackup master services

Good to know that I don't have to stop and start the NetBackup master services. If I have the problem again, I will try just killing the 'bprd -dontfork -mpxmain' process and see if that solves it.

I found a TECH suggesting creating the file /usr/openv/netbackup/NON_MPX_RESTORE. I checked, and the 'bprd -dontfork -mpxmain' process is no longer running. Since it's the culprit, I hope the problem doesn't happen anymore.

If I am not mistaken, that file is related to multiplexed restores. I don't use multiplexing in my environment. What are the consequences of having the file /usr/openv/netbackup/NON_MPX_RESTORE?

And thanks for your accurate answers.

 

Marianne's picture

What are the consequences of having the file /usr/openv/netbackup/NON_MPX_RESTORE?

Our experience was that having this file caused real multiplexed restores to fail.
(Part of my original case history mentioned above). 
 
