Video Screencast Help

Disk Volume Pool Down (2074)

Created: 25 Feb 2013 • Updated: 27 Feb 2013 | 17 comments
Spartacus81's picture
This issue has been solved. See solution.

Friends,

given up on fixing disk pool volume down issue after spending whole morning.. this is the typical one..

nbdevquery and nbdevconfig works successfully.. you can inventory the disk pool from.. you can see the storage configuration properties under credentials tab.. disk pool stays alive less than 10secs and it goes down..  here is the spoold logs from that path.

gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad

number of errors being reported of connection reset by peer.. i can bet my brand new BMW 325D on this.. the name resolution/nslookup and everything working okay.. no problem.. this problem arose over the weeked due where the media server (built in storage) went offline in the middle of the backup due to unexpected power outage..  looking at the logs below its suggesting connection reset by peer; REceive session shutdown

any suggestion will highly be appreciated.

 

February 25 13:49:19 INFO [140449875887872]: Task Type         : spad
February 25 13:49:19 INFO [140449875887872]: Started at        : Mon Feb 25 13:49:19 2013
February 25 13:49:19 INFO [140449875887872]: Task ID           : 0
February 25 13:49:19 INFO [140449875887872]: Thread ID         : 140449875887872
February 25 13:49:19 INFO [140449875887872]: Session ID        : 1563389238
February 25 13:49:19 INFO [140449875887872]: Data Selection    : 1
February 25 13:49:19 INFO [140449875887872]: Agent Name        : spoold
February 25 13:49:19 INFO [140449875887872]: Agent Address     : gsrugnbapp01.groupservices.co.uk:39478
February 25 13:49:19 INFO [140449875887872]: OS Family         : Linux-x86_64
February 25 13:49:19 INFO [140449875887872]: Software Version  : 7.0003.0012.091
February 25 13:49:19 INFO [140449875887872]: Protocol Version  : 6.6.1
February 25 13:49:19 INFO [140449875887872]: Crypto engine     : server
February 25 13:49:19 INFO [140449875887872]: Data Encryption   : enabled
February 25 13:49:19 INFO [140449875887872]: Data Compression  : enabled
February 25 13:49:19 INFO [140449875887872]: Accelerated Access: disabled
February 25 13:49:19 INFO [140449875887872]: task 0 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39478: spad handler running
February 25 13:49:48 INFO [140449875887872]: newEvent Adding new event. type:4
February 25 13:50:14 INFO [140449875887872]: Task Type         : spad
February 25 13:50:14 INFO [140449875887872]: Started at        : Mon Feb 25 13:50:14 2013
February 25 13:50:14 INFO [140449875887872]: Task ID           : 0
February 25 13:50:14 INFO [140449875887872]: Thread ID         : 140449875887872
February 25 13:50:14 INFO [140449875887872]: Session ID        : 1393531371
February 25 13:50:14 INFO [140449875887872]: Data Selection    : 1
February 25 13:50:14 INFO [140449875887872]: Agent Name        : spoold
February 25 13:50:14 INFO [140449875887872]: Agent Address     : gsrugnbapp01.groupservices.co.uk:39494
February 25 13:50:14 INFO [140449875887872]: OS Family         : Linux-x86_64
February 25 13:50:14 INFO [140449875887872]: Software Version  : 7.0003.0012.091
February 25 13:50:14 INFO [140449875887872]: Protocol Version  : 6.6.1
February 25 13:50:14 INFO [140449875887872]: Crypto engine     : server
February 25 13:50:14 INFO [140449875887872]: Data Encryption   : enabled
February 25 13:50:14 INFO [140449875887872]: Data Compression  : enabled
February 25 13:50:14 INFO [140449875887872]: Accelerated Access: disabled
February 25 13:50:14 INFO [140449875887872]: task 1 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39494: spad handler running
February 25 13:50:32 INFO [140449875887872]: newEvent Adding new event. type:1
February 25 13:50:46 INFO [140449875887872]: newEvent Adding new event. type:1
February 25 13:53:08 INFO [140449875887872]: newEvent Adding new event. type:1
February 25 13:53:08 ERR [140449875887872]: 25000: NetSelect: invalid pipe
February 25 13:53:08 ERR [140449875887872]: 25036: binaryMessageRead:3398, Could not receive binary message: expected 4 bytes got 0 bytes. Cause: connection reset by peer
February 25 13:53:08 ERR [140449875887872]: 25036: binaryMessageRead:3398, Could not receive binary message: expected 4 bytes got -1 bytes. Cause: connection reset by peer
February 25 13:53:08 INFO [140449875887872]: task 1 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39494: spad handler returns
February 25 13:53:08 INFO [140449875887872]: task 0 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39478: spad handler returns
February 25 13:53:08 INFO [140449875887872]: Receive session shutdown
February 25 13:53:08 INFO [140449875887872]: Receive session shutdown
February 25 13:53:08 INFO [140449875887872]: Uptime                  : 173.54 sec
February 25 13:53:08 INFO [140449875887872]: Uptime                  : 228.61 sec

February 25 13:53:08 INFO [140449875887872]: Messages Received       : 19
February 25 13:53:08 INFO [140449875887872]: Messages Received       : 8
February 25 13:53:08 INFO [140449875887872]: Bytes Received          : 3996
February 25 13:53:08 INFO [140449875887872]: Bytes Received          : 2104
February 25 13:53:08 INFO [140449875887872]: Seconds spent receiving : 0.00
February 25 13:53:08 INFO [140449875887872]: Seconds spent receiving : 0.00
February 25 13:53:08 INFO [140449875887872]: Messages Sent           : 19
February 25 13:53:08 INFO [140449875887872]: Messages Sent           : 8
February 25 13:53:08 INFO [140449875887872]: Bytes Sent              : 1820
February 25 13:53:08 INFO [140449875887872]: Bytes Sent              : 744
February 25 13:53:08 INFO [140449875887872]: Seconds spent sending   : 0.00
February 25 13:53:08 INFO [140449875887872]: Seconds spent sending   : 0.00
February 25 13:53:08 INFO [140449875887872]: 0.00 MB sent in 173.54 sec
February 25 13:53:08 INFO [140449875887872]: 0.00 MB sent in 228.61 sec
February 25 13:53:08 INFO [140449875887872]: spad completed at: Mon Feb 25 13:53:07 2013
February 25 13:53:08 INFO [140449875887872]: spad completed at: Mon Feb 25 13:53:07 2013
February 25 13:53:43 INFO [140449875887872]: Task Type         : spad
February 25 13:53:43 INFO [140449875887872]: Started at        : Mon Feb 25 13:53:43 2013
February 25 13:53:43 INFO [140449875887872]: Task ID           : 0
February 25 13:53:43 INFO [140449875887872]: Thread ID         : 140449875887872
February 25 13:53:43 INFO [140449875887872]: Session ID        : 291392991
February 25 13:53:43 INFO [140449875887872]: Data Selection    : 1
February 25 13:53:43 INFO [140449875887872]: Agent Name        : spoold
February 25 13:53:43 INFO [140449875887872]: Agent Address     : gsrugnbapp01.groupservices.co.uk:39541
February 25 13:53:43 INFO [140449875887872]: OS Family         : Linux-x86_64
February 25 13:53:43 INFO [140449875887872]: Software Version  : 7.0003.0012.091
February 25 13:53:43 INFO [140449875887872]: Protocol Version  : 6.6.1
February 25 13:53:43 INFO [140449875887872]: Crypto engine     : server
February 25 13:53:43 INFO [140449875887872]: Data Encryption   : enabled
February 25 13:53:43 INFO [140449875887872]: Data Compression  : enabled
February 25 13:53:43 INFO [140449875887872]: Accelerated Access: disabled
February 25 13:53:43 INFO [140449875887872]: task 0 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39541: spad handler running
February 25 13:54:17 INFO [140449875887872]: newEvent Adding new event. type:4
February 25 13:54:45 INFO [140449875887872]: Task Type         : spad
February 25 13:54:45 INFO [140449875887872]: Started at        : Mon Feb 25 13:54:45 2013
February 25 13:54:45 INFO [140449875887872]: Task ID           : 0
February 25 13:54:45 INFO [140449875887872]: Thread ID         : 140449875887872
February 25 13:54:45 INFO [140449875887872]: Session ID        : 1820742354
February 25 13:54:45 INFO [140449875887872]: Data Selection    : 1
February 25 13:54:45 INFO [140449875887872]: Agent Name        : spoold
February 25 13:54:45 INFO [140449875887872]: Agent Address     : gsrugnbapp01.groupservices.co.uk:39553
February 25 13:54:45 INFO [140449875887872]: OS Family         : Linux-x86_64
February 25 13:54:45 INFO [140449875887872]: Software Version  : 7.0003.0012.091
February 25 13:54:45 INFO [140449875887872]: Protocol Version  : 6.6.1
February 25 13:54:45 INFO [140449875887872]: Crypto engine     : server
February 25 13:54:45 INFO [140449875887872]: Data Encryption   : enabled
February 25 13:54:45 INFO [140449875887872]: Data Compression  : enabled
February 25 13:54:45 INFO [140449875887872]: Accelerated Access: disabled
February 25 13:54:45 INFO [140449875887872]: task 9 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39553: spad handler running
February 25 13:56:23 INFO [140449875887872]: newEvent Adding new event. type:1
February 25 13:57:42 ERR [140449875887872]: 25036: binaryMessageRead:3398, Could not receive binary message: expected 4 bytes got 0 bytes. Cause: connection reset by peer
February 25 13:57:42 INFO [140449875887872]: task 0 [thread 140449875887872] for gsrugnbapp01.groupservices.co.uk:39541: spad handler returns
February 25 13:57:42 INFO [140449875887872]: Receive session shutdown
February 25 13:57:42 INFO [140449875887872]: Uptime                  : 238.56 sec
February 25 13:57:42 INFO [140449875887872]: Messages Received       : 6

 

Operating Systems:

Comments 17 CommentsJump to latest comment

Mark_Solutions's picture

Hi

Sounds like corruption may have crept in but the repeated failures during lack of connection may have caused other issues

do a crcontrol --get mode first to seee what the state of play is with the storage

Next a crcontrol --queueinfo and --processqueueinfo to see what the process queue is up to (if anything)

The spoold log should actually have the answer to take a look there or if it is not too big post it on here.

I have had situations where we just need to kick queue processing off manually to bring it back to life due to 3 failures (runs as 12 noon so would have been trying at the time of the failure) (storaged logs will say something to the effect that queue processing has failed 3 time and the storage is being shut down - call support!)

Also when a fragement has been orphaned and needs manually deleting from the de-dupe catalog

Hopefully it is the first and manully running queue processing will bring it back to life (crcontrol --processqueue)

Hope this helps

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someones advice has solved your issue - and please bring back the Thumbs Up!!.

Spartacus81's picture

Thanks Mark... up'd the diskpool and ran quickly process queue info before it goes down on another session... fingers crossed to see if that helps...

gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad # /usr/openv/pdde/pdcr/bin/crcontrol --queueinfo
total queue size : 2156929592
creation date of oldest tlog : Mon Feb 25 12:20:48 2013

gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad # /usr/openv/pdde/pdcr/bin/crcontrol --processqueueinfo
Busy   : no
Pending: yes
gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad # /usr/openv/pdde/pdcr/bin/crcontrol --processqueueinfo
Busy   : no
Pending: yes
gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad # /usr/openv/pdde/pdcr/bin/crcontrol --processqueue
OK
gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad # /usr/openv/pdde/pdcr/bin/crcontrol --processqueue
OK
gsrugnbapp01:/dedup_data/log/spad/gsrugnbapp01.groupservices.co.uk/spoold/spad #

 

 

 

 

 

Bugs comes in through open windows!

Mark_Solutions's picture

OK - so busy No but Pending Yes - so it is not running, or at least wasn't.

what does --getmode show?

As you have kicked off the processqueue it may resolve things but after one of these failure it wont actually be usable again until the getmode shwos YES for all of the important bits (PUT and GET)

Think we may need the storaged log for this one as it is obviously not playing ball at the moment.

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someones advice has solved your issue - and please bring back the Thumbs Up!!.

Spartacus81's picture

Mark its showing BUSY since i ran the queueprocessing and am constantly monitoring the storaged.log

tail -f storaged.log

and its moving so far so good.. disk pool hasnt gone down yet.. fingers crossed! i had couple of dupications pending which have now completed.. just noticed so things are sort of normal.. backups are scheduled to run at 22:00 and the diskpool is only 10TB 48% used (built in) so hopefully queueprocessing will finish that time...

--getmode is showing the restult as expected..

gsrugnbapp01:~ # /usr/openv/pdde/pdcr/bin/crcontrol --getmode
Mode : GET=Yes PUT=Yes DEREF=Yes SYSTEM=Yes STORAGED=Yes REROUTE=No COMPACTD=Yes RECOVERCRDB=No
 

--dsstat is also appearing.. inventory and storage server properties in the console all working okay..

one thing my little brain is struggling to understand this error 83/84 every now and again when you have client side dedupliation enabled.. when you change it to media server deduplication it works.... Grrr

Bugs comes in through open windows!

Spartacus81's picture

BINGO... the disk pool has gone down during the queueprocessing which ran okay for like half an hour...

trying to up it .. command works but --getmode gives the connection error???

Bugs comes in through open windows!

Mark_Solutions's picture

OK - looking good then - the processqueue has probably re-enabled the system so it should be OK now

Once it has finished kick it off another couple of times to get it down as it was quite large and need it running a couple of times after a break like this to get it back up.

The 83/84 error for client side de-dupe may cost you a BMW as this is when the client talks directly to the storage to write its backups so if it doesn't resolve it correctly it will fail.

If it is just one or two clients you could take a look at the clients ost-plugins directory and find the media server folder within it - delete it and let it try again or take a look to see if anything in there looks wrong.

I think you will be OK on your systems now as the PUT and GET are both saying Yes.

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someones advice has solved your issue - and please bring back the Thumbs Up!!.

Spartacus81's picture

Thanks Mark.. i have tried all those tricks already but no joy (83/84).. trust me i wont let my BMW go that easy..

did you see my last reply.. disk pool gone down again during the queue processing...

 

 

Bugs comes in through open windows!

Mark_Solutions's picture

Sorry - missed your last post - sorry it has gone down again

Is the storaged log small enough to zip up and post on here?

Have you rebooted the server? This tends to flush the logs and free up more space as the queue processing tends to fill the disk up (how is the disk space? - wondering if you need to run a garbage compaction to clean it all up too - if if will stay up that is!)

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someones advice has solved your issue - and please bring back the Thumbs Up!!.

Spartacus81's picture

They are quite big to be honest.. i have been struggling with it whole day yesterday and had no choice except to get Symanetc involoved since we are master specialist we get straight through to Advanced Engineer.. so long story around 1am this morning symantec confirmed there is a known bug with 7.5.0.4 where spoold crash before loading the fingerprints and finishing rebasing.. issue is been fixed in 7.5.5 which is only out last week and am so paranoid to go for it.. i was on 7.5.0.3 everything was working fine but all of sudden the Java console stopped working where i couldnt open up the schedules properties if i have storage lifecycle overridden option used.. the fix was to upgrade to 7.5.0.4 which i did; resolved the issue ont he master server but the on Java console it was still the same even upgrading it to 7.5.0.4. that was a couple of weeks ago. last week got hit by the diskpool issue and now 7.5.5.. will that ever going to easy..

Bug: http://www.symantec.com/business/support/index?page=content&id=TECH199067

  Fix: http://www.symantec.com/business/support/index?page=content&id=TECH199269

Mark are you running 7.5.5 anywhere?

am not going to upgrade i will have to go for an EEB...

Bugs comes in through open windows!

Mark_Solutions's picture

I dont have anyone running 7.5.0.5 yet (apart from OpsCenter)

I have see a couple of mentions on here so my initial feeling is to go for the EEB

Glad you got it sorted - sorry I couldn't be more help this time!

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someones advice has solved your issue - and please bring back the Thumbs Up!!.

Spartacus81's picture

Its okay mark i couldnt help myself too.. let me get that EEB and applied and i will post the feed back here before we close the thread.

while you are here do you use Java Console to manage your customers or you login on each server directly? just asking as i havent had any joy with JC yet...

Bugs comes in through open windows!

Mark_Solutions's picture

Just a personal thing but i cannot stand the Java Console!

Even for NetBackup Appliances I install the Windows Admin Console on their OpsCenter or Search/Index servers and use that for everything.

Guess I am just a Windows person at heart!!

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someones advice has solved your issue - and please bring back the Thumbs Up!!.

Spartacus81's picture

Applied the EEB last night only on to effected media servers (3 of 22) and all stable..

didnt apply the EEB on rest of them as they have not broken yet.. fingers crossed!!

Solution:

Bug: http://www.symantec.com/business/support/index?page=content&id=TECH199067

Fix: http://www.symantec.com/business/support/index?page=content&id=TECH199269

 

 

Bugs comes in through open windows!

SOLUTION
CRZ's picture

Your "Fix" link goes to 7.5.0.5.  Did you apply 7.5.0.5 or did you get an EEB for 7.5.0.4?

(Support would always rather you apply 7.5.0.5, of course. ;-) )


bit.ly/76LBN | APPLBN | 75LBN

Spartacus81's picture

i got the EEB as my bosses including myself will not go for 7.5.5 if it only out for like a week ago.. ;-)

in house testing in progress before we role out to our customers...

 

Bugs comes in through open windows!

CRZ's picture

PLEASE stop calling it "7.5.5" wink


bit.ly/76LBN | APPLBN | 75LBN

Spartacus81's picture

ohhhhh have i been must .. sorry for all the typos i made.. you are definitley very picky chris ;)

all, i meant 7.5.0.5... happy NOW!

Bugs comes in through open windows!