Big issue on the activity monitor with 7.5.0.5

Created: 25 Feb 2013 • Updated: 07 Mar 2013 | 50 comments
Fabrice P.
This issue has been solved. See solution.

Hello,

We installed 7.5.0.5 last Friday, and since then we have had a lot of issues with the "Activity Monitor", which seems to be inconsistent.

Basically, we have a lot of backup jobs (more than 200) that are either queued, active or "unknown". If we look more closely at the details, we can see that the jobs have actually finished.

  • We can see "done, status 0" in the details, but the job is stuck at "validating image for client xxx".
  • The backup is somehow successful: it is validated in the catalog and we can restore data from it.
  • OpsCenter does not see those jobs at all.

Also, if I check "Report/Problems" on the console, we see a lot of "socket open failed" critical errors. It seems that some NBU processes fail to communicate with each other.

We did not have any issues of this type before 7.5.0.5 (we were on 7.5.0.4).

I logged a support case more than 2 days ago and was supposed to be contacted within 2 hours.

...I don't have any answer yet.

Regards,

Mark_Solutions

I cannot assist with the issue (but would like to be kept updated), but I would suggest phoning support back and asking to speak to the manager, or asking for the case to be "escalated", to get a faster response.

Keep us updated please

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someone's advice has solved your issue - and please bring back the Thumbs Up!!.

Fabrice P.

Sure, I will. I also noticed the same issue on the duplication (OST optimized) jobs. The status is 50 ("client process aborted"), but they seem to be successful as well...

Overall, 10% of our weekend jobs were affected.

Authorised Symantec Customer ;)

angus the bull

I upgraded to 7.5.0.5 last weekend and have managed to hit both issues. I have logged a call with support and will update here when I get a fix.

Fabrice P.

I also have an open case with support. Maybe it's time to remove the download link?

The issue is totally random: I did not get any problems with Monday evening's backups, but yesterday I did.

angus the bull

Does anyone know of a fix yet? Three days have passed since I opened my support call, and apart from sending screenshots of the issues, I have heard nothing.

Fabrice P.

Nothing yet, I'm just sending logs...

TimWillingham

Seeing the same issue here after the upgrade to 7.5.0.5.

angus the bull

I sent an email first thing Friday morning asking for an update, and Symantec can't even be bothered to respond. Why haven't they pulled this software? Would anyone from Symantec like to respond?

Ankit Maheshwari

Same issue with one of my upgraded servers.

Can we move back to 7.5.0.4?

Ankit Maheshwari

dthor

I have been using NBU 7.5.0.5 for some time now. Can you send me a screenshot of your Activity Monitor showing the problems you are seeing? It seems I might have had the same issue and might be able to help.

Fabrice P.

They are still investigating the log files for my case. More than 250 jobs were affected this weekend. This bug is annoying because it makes the whole platform almost unmanageable, and it is very hard to be 100% sure that all the jobs completed successfully.

dthor, there is nothing particular to see on a screenshot. The issue is that the jobs are in a running or queued state, but if you check the details you will see that the job has actually completed (status 0) and is stuck at the "validating client image" step. Duplication jobs exit with status 50 ("client process aborted"). If you have the same issue, please open a ticket.

Marianne

I suggest all of you exchange case numbers via PM and forward to your Symantec case engineers.

If all the relevant engineers start talking to one another, the escalation to back line may be expedited.

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

dthor

OK, I just thought I might have had the same issue before and opened a case. I can look back at my cases and see if it is the same problem you are having.

angus the bull

I also have a third issue with the monitor on 7.5.0.5: a job that has finished shows the state DONE in the GUI, but the Type column is empty, as is the job policy, and when you check the job details the client is blank.

Fabrice P.

Yes, I have this issue too and some "unknown" jobs as well...

dthor

This has been happening since 6.5.x. Here is what I have done:

To clear up the error 50s:
1. Stop NetBackup (netbackup stop).
2. Run bpps -a and kill any remaining processes.
3. cd to /usr/openv/netbackup/db/jobs
4. rm bpjobd.act.db
5. cd to the restart folder
6. pwd to verify, then rm all files
7. cd ../trylogs
8. pwd to verify, then rm all files
9. cd ../ffilelogs
10. pwd to verify, then rm all files
11. Restart NetBackup (netbackup start).
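
For a Unix master, the file cleanup in steps 3 to 10 can be sketched in Python. This is a minimal sketch, not a Symantec tool: it assumes NetBackup has already been stopped (steps 1 and 2) and that the directory names are exactly the ones used above. By default it only prints what it would remove; the `--delete` switch is defined by the script itself, not by NetBackup.

```python
import os
import sys

# Paths taken from the steps above; confirm they exist on your master server.
JOBS_DIR = "/usr/openv/netbackup/db/jobs"
SUBDIRS = ("restart", "trylogs", "ffilelogs")  # folders whose files get removed


def files_to_clear(jobs_dir=JOBS_DIR):
    """List what the manual procedure deletes: bpjobd.act.db plus every
    regular file inside the restart, trylogs and ffilelogs folders."""
    targets = []
    act_db = os.path.join(jobs_dir, "bpjobd.act.db")
    if os.path.isfile(act_db):
        targets.append(act_db)
    for sub in SUBDIRS:
        folder = os.path.join(jobs_dir, sub)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            full = os.path.join(folder, name)
            if os.path.isfile(full):  # leave any subfolders alone
                targets.append(full)
    return targets


def clear_job_files(jobs_dir=JOBS_DIR, dry_run=True):
    """Print (dry run) or delete the files. Only run while NetBackup is stopped."""
    targets = files_to_clear(jobs_dir)
    for target in targets:
        if dry_run:
            print("would remove", target)
        else:
            os.remove(target)
    return targets


if __name__ == "__main__":
    # Default is a dry run; pass --delete to actually remove the files.
    clear_job_files(dry_run="--delete" not in sys.argv[1:])
```

Running it once without `--delete` gives the same check as the "pwd to verify" steps: you see the full list of paths before anything is removed.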

I have also installed the Windows Administration Console, and instead of doing the above I am able to cancel the jobs there.

I believe this is a case I opened a few years ago; support might be able to reference it: 411-938-339.

TimWillingham

I opened a case with them last week and escalated it this morning.  Support has found nothing on the issue so far.  Make sure you are opening cases on the problems.

I discovered another issue in the process of troubleshooting which is minor but irritating nonetheless: You can no longer resize the details window for Disk Pools under Devices.

I upgraded my Java console to 7.5.0.5 as well, just to see if it made a difference, but no luck.

dthor

This is what support had me do: the same steps as in my previous post.

Another way I have found to do this, without having to stop and start NBU, is to install the Windows Administration Console. I have been able to clear these jobs up from there.

TimWillingham

That clears up existing jobs, but does it prevent future occurrences?

Jonathan D.

Hi, I have the same problem. Any answers yet?

Can you send me the reference numbers of the cases you opened with Symantec? I will cross-reference them with mine.

Thanks

Mark_Solutions

Interesting comment by dthor ...

So are you saying that this is simply a Java Console issue rather than a NetBackup issue (i.e. the Java Console is not receiving all updates but the Windows one is)?

If that is the case then that would not be quite so bad!

Does anyone see this issue that is using the Windows Admin Console?

Thanks

Fabrice P.

The solution support proposed to dthor only cures the symptom, not the root cause. I bet the issue will return on the next backup run! I'm using the Windows console *only* and have run into these issues since day one.

My finding so far is that it seems to be an inter-process communication issue during the NBU activity spike (basically when the backup window kicks in). At that particular time I get a lot of "socket open failed" errors in the report/errors console log.

Mark_Solutions

Fabrice, thanks for that, exactly what I needed to know. So it really is a NetBackup issue with processes.

Have they asked you to do a netstat -a? Just wondering if this version uses a lot more ports for what ever reason and the system is just running out.

I already greatly increase the number of ports on a master when I do installations anyway (at least on Windows masters), but I'm wondering whether this version needs even more?
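
As a rough way to check for port pressure, here is a minimal Python sketch that tallies TCP connections by state from `netstat -an` (the numeric form of the `netstat -a` suggested above, so no name lookups). It assumes the state is the last column of each TCP line, which holds for the default `netstat -an` layout on Linux and Windows but not for `-ano`, which appends a PID column. A very large TIME_WAIT or ESTABLISHED count during the backup window might hint that the master is running short of ports; it proves nothing on its own.

```python
import subprocess
from collections import Counter


def tcp_state_counts(netstat_output):
    """Count TCP connections per state in `netstat -an` text.

    Assumes the state is the last whitespace-separated field of every TCP line
    (true for the default Linux and Windows layouts; "tcp6" lines are included).
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if fields and fields[0].lower().startswith("tcp"):
            counts[fields[-1]] += 1
    return counts


if __name__ == "__main__":
    try:
        out = subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout
    except OSError as exc:  # netstat missing or not on PATH
        print(f"could not run netstat: {exc}")
        out = ""
    for state, n in tcp_state_counts(out).most_common():
        print(f"{state:15} {n}")
```

Taking one sample before the backup window and another while it is running makes the comparison more meaningful than a single snapshot.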

dthor

I opened a case about this on 4/13/2010 (411-938-339). It has not stopped the problem from happening again. I also opened another case about this after upgrading to 7.5.0.3 but cannot find it; I know they gave me the same answer as before. Let me ask my co-worker if he remembers anything about this.

I don't know if this is a Java problem or not, but I am not able to cancel these jobs through the GUI, whereas I have been able to cancel them using the Windows Console.

Fabrice P.

FYI, my case has been internally escalated and I got confirmation that similar situations have been encountered with 7.5.0.5. The whole thing is currently being reviewed by engineering.

Stay tuned (and away from 7.5.0.5 for the time being) !

Mark, I was also wondering about an issue with TCP ports, but I think in that case we would probably see other, more serious side effects (jobs failing or timing out, etc.).

TimWillingham

Got a call from support moments ago confirming engineering was working on this.  The good news is backups are working, you just can't tell it.

Get ready for 7.5.0.5.1 or 7.5.0.5a!!

Mark_Solutions

Fabrice, thanks for that, keep us updated. I wonder if a memory leak has crept in again?

My money is on nbjm, but it could also be pbx_exchange or nbsl.

Is any process using unusually large amounts of memory (or CPU)?

It would be interesting to see what the file versions of those files say. I will take a look to see which ones were replaced in 7.5.0.5 and which versions they are.

Omar Villa

Regarding the socket errors when the backup window starts: have you tried raising the file descriptor limit, maybe doubling it, to see if they go away? NBU 7.5 uses PBX more than past versions, so it is probable that the demand for file descriptors is higher on 7.5.0.5.

Just a thought.

Regards.

Omar Villa

Netbackup Expert

Twitter: @omarvillaNBU

TimWillingham

Symantec recommends a minimum of 8000.  Our master was set to 65535:

[root]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 399360
max locked memory       (kbytes, -l) 3145728
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 32768
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16384
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
 

Do you think that could still be too low?
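
On a Unix master, the same open-files limit that `ulimit -n` prints can be read from Python's standard `resource` module. A minimal sketch, using the 8000 minimum quoted above as the threshold: it reports the limits of the process that runs it, so run it from the same environment that starts NetBackup. `resource` is Unix-only, so this does not help on a Windows master.

```python
import resource

RECOMMENDED_MIN = 8000  # minimum open-files limit quoted in the post above


def nofile_limits():
    """Return (soft, hard) open-file limits; the soft one is what `ulimit -n` shows."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def meets_minimum(soft, minimum=RECOMMENDED_MIN):
    """True if the soft limit is unlimited or at least `minimum`."""
    return soft == resource.RLIM_INFINITY or soft >= minimum


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"soft={soft} hard={hard} meets {RECOMMENDED_MIN}: {meets_minimum(soft)}")
```

With the 65535 shown in the listing above, `meets_minimum(65535)` is true, so this check alone would not explain the socket errors.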

Fabrice P.

How can I increase the file descriptor limit on my NBU Windows 2008 master server?

I saw this but I'm not sure if it applies: http://www.ehow.com/how_6103386_increase-file-handles.html

Dip

Thank you for this forum. We were planning to upgrade from 7.1.0.4 to 7.5.0.5 in a week, but now we will wait and watch.

RDG

Symantec is currently investigating the problem with high urgency. We will keep you posted.

TimWillingham

Etrack Et3106719 has been created to address this issue.

Fabrice P.

I think the link to the download should be removed...

fdassonville

I have the same problem in my NetBackup infrastructure.

Fabrice P.

An update: I'm currently testing an EEB with support to fix the issue. I will keep you informed.

TimWillingham

I installed the EEB last night also and AM looks normal this morning after a full slate of backups.  Symantec owned this problem!  Great job!

Fabrice P.

Good to know ! Full weekend backups will be the real test !

RDG

We have found an issue which causes the Activity Monitor to misreport the success of some backups. We are testing a fix and will be releasing it shortly. The backups are successful and usable, but the Activity Monitor reports them as not completing. We have a very small number of customers who have reported this issue and a lot of customers who have successfully installed 7.5.0.5. Thank you for contributing to this forum and thank you for your patience.

Marianne

Hopefully a TN will be created?

This post (and the (small??) number of users who said that they had the same problem) was enough to create widespread panic.

CRZ

Wide-spread panic?  A small (and vocal) group of users posted about this issue here, in addition to opening Support cases.  Support and Engineering followed their processes (which, admittedly, never seem to go fast enough when YOU'RE the one experiencing the problem) and through those processes, we're close to being able to say that we know with certainty what this issue is, how it arises and how to resolve it.

Keep in mind that all the indications we've had thus far are that despite how they are appearing in the Activity Monitor, the backups in question SUCCEEDED.  No data has been lost.  NetBackup is not indicating that your systems are protected when they're not - in fact, you could interpret it as the exact opposite - we're indicating that your system may not be protected when in fact it is!  And, as has been said previously in this thread, a very large majority of the folks who have applied 7.5.0.5 haven't seen this issue at all.  As problems go...well, we'd prefer to have zero problems, obviously, but this one - while aggravating to those experiencing it - is rather tame when you place it on the "catastrophe" scale.  I'm not trying to make any excuses, and I would much rather have been able to say "7.5.0.5 is defect free!", but I think we all knew I was never going to be able to say that despite all our best efforts prior to this release.  What's the saying?  "Stuff happens."  This sure happened!  It MAY have hit the fan as well.  We're working hard to fix it AND get you feeling better.

Of course, we will release a TechNote - but we can't do it until we have all of the information.  We want to be sure we've fixed it before we tell you we've fixed it!  We also want to be sure we've narrowed the scope as much as we can, because as has been noted, this only affected a small number of folks who have applied 7.5.0.5.  We want to be able to provide a specific set of conditions so you can quickly determine if you might be affected (or not!) and if you'd need to call us for the EEB.

When the TechNote is released, the Late Breaking News will also be updated to point to the document.  I'm absolutely sure someone will let you know in this thread as well.  I don't want to give any specific dates on WHEN that'll happen, but know that it will be as soon as humanly possible.

Finally, you've probably already figured this out, but we're not pulling 7.5.0.5 over this issue.  Some people may find they'll need to install an EEB after they upgrade to 7.5.0.5.  Some people may want to wait on a 7.5.0.5 upgrade until we announce the availability of that EEB.  Some folks may apply 7.5.0.5 anyway and discover that they're not affected.  (Most folks, I hope!)  Some folks may apply 7.5.0.5, find their Activity Monitor IS affected, and live with it until they can call us up for an EEB.  And some folks will do something completely different which I haven't even dreamed up, because that's how the world works.  :)

What I WILL say is the ONLY way to get a defect addressed is to open a Support case and get Symantec working on it.  You can post here - and posting here is a great way to see if other people are sharing your trouble, and if they ARE, to get more people aware there are some cases that need to be resolved - but if you don't have that Support case, Connect alone will probably not get you help if you're experiencing a well and true defect in NetBackup.  This is probably obvious to everybody, but I add it here because there are some folks who try to use Connect as a substitute for our technical support - and sometimes you can even get away with that, but the truth is while Connect is great for some stuff, it should only be a SUPPLEMENT to a true tech support case when you have a "real" issue.  Let us help you!


bit.ly/76LBN | APPLBN | 761LBN

SOLUTION
Fabrice P.

I can confirm this morning that the issue is completely gone after installing the EEB.

Thanks a lot for the responsiveness, and good job!

HEMANPR

Hello

I have the same issue. Where can I download this EEB?

Thanks

Please MARK AS SOLUTION If my Post Help You. I use the following Symantec Products: Veritas Netbackup 7.5 On Windows Enterprise 2003 SP2 - / - Symantec EndPoint 12.1.100.157 RU1 On Windows Standard 2003 SP2 - / - Symantec

Fabrice P.

For the time being, you have to open a support ticket.

dthor

Does anyone have the EEB number? I have opened a case in preparation for the upgrade to 7.5.0.5.

CRZ

Followup!

A TechNote AND hotfix are now available for this issue.

After upgrading a master server to NetBackup 7.5.0.5 (or applying NetBackup 5200/5220 Appliances 2.5.2 to an Appliance running as a master server), the Activity Monitor reports inconsistent information on job status for some jobs. Additionally, some jobs may not display in OpsCenter.
 http://symantec.com/docs/TECH203521

The short version is the Java GUI wasn't happy when it encountered a filename that contained an apostrophe/single quote (').
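
To get a rough idea of whether a client's data contains such names, here is a minimal Python sketch that walks a directory tree and lists every file or folder whose name contains an apostrophe. It is an assumption-based helper, not part of the TechNote: it only estimates exposure, and the hotfix remains the actual fix.

```python
import os
import sys


def paths_with_apostrophe(root):
    """Yield every file or directory path under `root` whose name contains '."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if "'" in name:
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Usage: python find_apostrophes.py /path/to/backup/selection
    if len(sys.argv) > 1:
        for path in paths_with_apostrophe(sys.argv[1]):
            print(path)
```

Pointing it at one of a policy's backup selections on a client gives a quick yes/no; an empty result suggests that selection is unlikely to trigger this particular defect.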

Our apologies to the early adopters who ended up discovering this defect!  But, happily, we now have this hotfix available to work around the issue.

The LBN has been updated with a link to this doc (may take some time for the cache to clear if you don't see it right now):
 http://symantec.com/docs/TECH178334

Ray Esperanzate

Just want to add that I am experiencing this issue as well, where the jobs are showing as active and stuck at the "validating backup image" step after going to 7.5.0.5.  It's quite random and doesn't seem to happen on the same clients all the time.  Going to open up a case now to get that EEB.

EDIT: Looks like Chris posted something as I was typing my comment :) Off to download and apply the EEB.

HEMANPR

Thanks CRZ