
VCS nodes keep rebooting

Created: 08 Apr 2010 • Updated: 25 May 2010 | 4 comments
This issue has been solved. See solution.

Hi

I wonder if you kind people can help me again.

I have a 3-node cluster on Sun x4240 servers, on which I have installed VCS v5.0.
Only about 9 service groups have been created on them, and these just contain mounts and volumes, so there is no real load on them.

The issue I am seeing is that, at random, one server in the cluster drops off the network, and then I can't access it via the console as root.
This lasts for about 15 minutes, then it fixes itself, and then another server does the same.

I have noticed that the heartbeat connections go first.

My cluster setup is:
Red Hat 5.4 x86
VCS v5.0 RP3
Heartbeats on eth1 and eth3 (100 Mb full duplex)
All the servers are built exactly the same with no variation.

Has anyone come across this before?

Thanks

Sparmar

4 Comments

Marianne

Check /var/log/messages. Look for the section prior to the server booting again.
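A rough sketch of how that log check might look from the shell. The function names are made up for illustration, and the default boot marker (`syslogd .*restart`) is only an assumption about how the syslog daemon announces startup, so adjust it to whatever your own logs show at boot time.

```shell
#!/bin/sh
# Sketch only: helper names and the default boot marker are assumptions.

# Print the N lines logged just before each line matching the boot marker.
#   $1 = log file (default /var/log/messages)
#   $2 = number of lines of context before the marker (default 40)
#   $3 = regular expression identifying a boot (assumed default)
lines_before_boot() {
    logfile="${1:-/var/log/messages}"
    context="${2:-40}"
    marker="${3:-syslogd .*restart}"
    grep -B "$context" -e "$marker" "$logfile"
}

# Print only the entries that mention LLT (case-sensitive match).
#   $1 = log file (default /var/log/messages)
llt_entries() {
    grep -e 'LLT' "${1:-/var/log/messages}"
}
```

For example, `lines_before_boot /var/log/messages 60` would show the last 60 lines before each boot marker, and `llt_entries | tail` would show the most recent LLT messages.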

We have recently seen the situation described in this TechNote:
http://seer.entsupport.symantec.com/docs/184301.htm

The solution was to track down and troubleshoot the process causing CPU usage to spike, which was leaving the system unresponsive.
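One possible way to spot such a process, sketched under the assumption that the standard Linux `ps` (procps) is available. The helper name `rank_by_cpu` is hypothetical; it simply sorts "PID %CPU COMMAND" lines by the second column.

```shell
#!/bin/sh
# Sketch only: rank_by_cpu is a made-up helper name.
# Reads lines of the form "PID %CPU COMMAND" on stdin and prints the
# N lines with the highest %CPU value (default 5).
rank_by_cpu() {
    sort -k2,2 -nr | head -n "${1:-5}"
}

# Typical use (not run here):
#   ps -eo pid,pcpu,comm --no-headers | rank_by_cpu 5
```

Because the node may be unresponsive while the spike is happening, it may be more useful to run this every minute from cron and append the output to a file, then read that file after the node recovers.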

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows

vcs_man

Is the VCS version you mentioned correct, or is it VCS 5.0MP3RP3?

I believe that the VCS 5.0 base version was not supported on RHEL 5.

Please give us the following outputs:

# rpm -aq | grep VRTSvcs
# had -version

Also, when you say that you cannot access the console once you lose the network, does it affect all the hosts at the same time? I mean, are you able to connect to any other node, either on the console or through ssh/telnet?

How is it configured in your network? Did you check from the network side?
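A host-side check of the heartbeat interfaces could look something like the sketch below. It assumes `ethtool` is installed and that the heartbeat links are eth1 and eth3, as stated in the original post. The helper name `link_summary` is made up for illustration.

```shell
#!/bin/sh
# Sketch only: link_summary is a made-up helper name.
# Filters ethtool output on stdin down to the speed, duplex and link lines.
link_summary() {
    grep -E 'Speed:|Duplex:|Link detected:'
}

# Typical use (not run here), for the heartbeat NICs named in the post:
#   for nic in eth1 eth3; do
#       echo "== $nic"
#       ethtool "$nic" | link_summary
#   done
#
# On the VCS side, "lltstat -nvv" lists each node's LLT links and whether
# they are UP or DOWN, and "gabconfig -a" shows GAB port membership.
```

Comparing this output across all three nodes, and against the switch port settings, may show a speed or duplex mismatch, or a link that is flapping.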

Thanks,
Mandar

sparmar

Hi

Here's the output from had -version:

Engine Version=5.0
PSTAMP: Veritas-5.0MP3-07/16/08-02:01:00

And the output from rpm -aq | grep VRTSvcs:

VRTSvcs-5.0.30.00-MP3_GENERIC
VRTSvcsvr-5.0.30.00-MP3_GENERIC
VRTSvcsag-5.0.30.00-MP3_RHEL5
VRTSvcsor-5.0.30.00-MP3_RHEL5
VRTSvcs-5.0.30.00-MP3_RHEL5
VRTSvcsdr-5.0.30.00-MP3_RHEL5
VRTSvcsmn-5.0.30.00-MP3_GENERIC

I have installed RP3 as well.

There do seem to be a lot of LLT errors in the messages logs on the servers which go off the network and become unreachable via the console.

I've also checked with our network guys, and there don't seem to be any issues with the switch or the network, so I figure it must be down to the software, or I've not installed something correctly.

Thanks

sparmar

sparmar

Just to let you all know, the issue was a faulty network card for one of the heartbeats, which has now been replaced.
There was also an issue with a PCI card which connects via a fibre cable as a media server for NetBackup, and which seemed to hang the servers on reboot (it keeps scanning down the lpfc).

So, it looks as if it was hardware related.
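For anyone hitting something similar, a failing NIC can sometimes be spotted from its error counters. The sketch below assumes a Linux kernel that exposes per-interface counters under /sys/class/net. The helper name `nic_errors` and its optional second argument are made up for illustration.

```shell
#!/bin/sh
# Sketch only: nic_errors is a made-up helper name.
# Prints "<interface> <counter> <value>" for a few error counters.
#   $1 = interface name, e.g. eth1
#   $2 = base directory holding the counters (default /sys/class/net)
nic_errors() {
    iface="$1"
    base="${2:-/sys/class/net}"
    for counter in rx_errors tx_errors rx_crc_errors; do
        printf '%s %s %s\n' "$iface" "$counter" \
            "$(cat "$base/$iface/statistics/$counter")"
    done
}

# Typical use (not run here), for the heartbeat NICs named in the post:
#   nic_errors eth1
#   nic_errors eth3
```

Counters that keep climbing on one heartbeat interface but stay flat on the other would point towards the card, cable or switch port on that link.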

Many thanks for all the input in helping me get to some resolution.

Sparmar

SOLUTION