
DMP / VCS configuration looks ok, but OS shows errors

Created: 18 Apr 2013 | 12 comments
Remco Etten

Good day,

 

I have a VCS cluster running and, as far as I can tell, everything looks OK. It is a 2-node cluster connected directly (no switch) to a Sun/LSI 6180 disk array. The connections look fine and DMP shows no errors, but the messages file on the OS constantly shows SCSI errors:

 

Mar 20 16:00:11 MIRTL01 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci111d,806e@0/pci111d,806e@4/pci1077,171@0,1/fp@0,0/disk@w20140080e518345e,2 (sd12):
Mar 20 16:00:11 MIRTL01  Error for Command: read(10)                Error Level: Retryable
Mar 20 16:00:11 MIRTL01 scsi: [ID 107833 kern.notice]  Requested Block: 288                       Error Block: 288
Mar 20 16:00:11 MIRTL01 scsi: [ID 107833 kern.notice]  Vendor: SUN                                Serial Number:   4^   9N7 
Mar 20 16:00:11 MIRTL01 scsi: [ID 107833 kern.notice]  Sense Key: Unit Attention
Mar 20 16:00:11 MIRTL01 scsi: [ID 107833 kern.notice]  ASC: 0x8b (<vendor unique code 0x8b>), ASCQ: 0x2, FRU: 0x0
Mar 22 10:41:00 MIRTL01 explorer: [ID 702911 daemon.notice] Explorer started

 

I don't understand where this is coming from. Any advice on where I should be looking? Log files are available if needed.

 

Thanks
Remco


Comments (12)

starflyfly

Hi, Remco

For this kind of SCSI error, you had better consult the hardware vendor for further troubleshooting first.

If there is an array-level mirror that may be read-only, or the LUN has a reservation key on it, that kind of condition can show up as errors like this.
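Something like this would show whether a reservation key is held and how VxVM sees the disk (the device name sun6180-0_0 is only an example, and the vxfenadm flag differs between SF releases, so check the man page on your version):

# list any SCSI-3 registration/reservation keys present on the LUN
vxfenadm -s /dev/vx/rdmp/sun6180-0_0

# confirm the disk state and flags as VxVM sees them
vxdisk list sun6180-0_0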

If the answer has helped you, please mark as Solution.

stinsong

Hi Remco,

First, the error was reported by the OS kernel, not by DMP or VxVM, so you may want to ask the OS vendor about it first.

This is a SCSI read error returned by the SCSI target on the disk array, so checking the array may be helpful.

From a VxVM/DMP perspective, I suggest you check the supported mode and host configuration for the array. Please check the DMP configuration guide for the detailed requirements:

http://www.symantec.com/business/support/resources...
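As a rough sketch (output names will differ on your hosts), the following would confirm which ASL is claiming the 6180 and how DMP has categorised the enclosure and its paths:

# list installed ASLs and the arrays they claim
vxddladm listsupport all

# enclosure name, array type and status as seen by DMP
vxdmpadm listenclosure all

# every path DMP knows about, with its state and controller
# (on older releases add ctlr=<name> or dmpnodename=<name>)
vxdmpadm getsubpaths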

Remco Etten

Thanks for your reply. What I'm getting at is this: if DMP is working properly, should the OS be showing these errors at all?

Let me clarify a bit more:

 

The OS messages I'm seeing are on the active node of the cluster. The DMPevents.log file is also showing entries that I do not fully understand, but they lead me to believe that something is not right:

------LOGGING START------
Wed Feb 27 01:57:54.737: Enabled Disk array sun6180-0
Wed Feb 27 01:57:54.737: Enabled Disk array disk
Wed Feb 27 01:57:54.737: Added Dmpnode sun6180-0_0
Wed Feb 27 01:57:54.737: Added Dmpnode sun6180-0_1
Wed Feb 27 01:57:54.753: Dmpnode disk_0 has migrated from enclosure - to disk
Wed Feb 27 01:57:54.753: Disabled Disk array -
Wed Feb 27 01:58:07.000: Initiated SAN topology discovery
Wed Feb 27 01:58:07.000: Completed SAN topology discovery
Wed Feb 27 02:29:12.400: Lost 325 DMP I/O statistics records
Wed Feb 27 03:14:55.509: Lost 9262 DMP I/O statistics records
Wed Mar  6 01:25:06.258: Lost 10060 DMP I/O statistics records
Wed Mar  6 01:25:07.258: Lost 9996 DMP I/O statistics records
Wed Mar  6 01:25:08.258: Lost 8380 DMP I/O statistics records
Wed Mar 20 02:08:22.905: Lost 3917 DMP I/O statistics records
Wed Mar 20 02:08:23.915: Lost 6303 DMP I/O statistics records
Wed Mar 20 02:08:24.915: Lost 2935 DMP I/O statistics records
Wed Mar 20 03:08:28.062: Lost 9914 DMP I/O statistics records
Wed Mar 20 03:08:29.072: Lost 1276 DMP I/O statistics records
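Could the "Lost N DMP I/O statistics records" lines simply mean that the DMP statistics daemon's buffer overflowed during an I/O burst, rather than that any I/O failed? If so, I assume something like this (syntax may differ between SF releases) would enlarge the buffer:

# restart DMP I/O statistics collection with a larger per-CPU buffer
vxdmpadm iostat stop
vxdmpadm iostat start memory=65536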

 

Thanks so far! ;-)

Marianne

There is nothing wrong with the paths to the disk - the errors are on the disk itself.
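You can confirm that from the host with something like this (assuming the enclosure is named sun6180-0, as in your DMPevents.log):

# every path to the 6180 should show as ENABLED here, none DISABLED
# (on older SF releases, use getsubpaths dmpnodename=<name> per LUN)
vxdmpadm getsubpaths enclosure=sun6180-0

# and the disks themselves should be online, with no failing or error flags
vxdisk -e list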

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

Yasuhisa Ishikawa

As per the DSM document for this array, this ASC/ASCQ pair means the LU is in a quiesce condition.

Please ask Oracle for further information and the root cause of this.

http://docs.oracle.com/cd/E19373-01/820-4737-13/ch...

Authorized Symantec Consultant(ASC) Data Protection in Tokyo, Japan

Remco Etten

Hello Yasuhisa and Marianne,

 

So you think the issue is on the array itself? I have checked the array and see no errors there. There are 2 volumes built on the array that are mapped to the cluster. Would it be possible to work out which array controller is causing the issue? I have explorers from both systems. The errors are only visible on the active node.
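Would something like this at least tell me which controller owns the failing path? The w20140080e518345e in the messages above looks like the target port WWN, so I assume I can match it against the port WWNs that CAM reports for controllers A and B:

# paths and host controllers for one of the LUNs (example DMP node name from DMPevents.log)
vxdmpadm getsubpaths dmpnodename=sun6180-0_0

# more detailed per-path attributes, including the array port, on newer SF releases
vxdmpadm list dmpnode dmpnodename=sun6180-0_0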

Thanks

Remco

Yasuhisa Ishikawa

 

Yes, the array controller returns this ASC/ASCQ pair to the initiator (host) in response to a read request. The wiring should be fine, as the host receives this code correctly; a wiring issue would more likely cause an FC link fault.
 
First, you should ask Oracle what these codes mean and why they are reported. Then ask both Oracle and Symantec whether there are any known issues or similar cases.

Authorized Symantec Consultant(ASC) Data Protection in Tokyo, Japan

Marianne

Have you had a look at the document that stinsong referred you to?

It is extremely important to verify correct host settings as well as array settings.

For one, if this is Solaris SPARC, MPxIO must be disabled.
The 6180 array settings are covered on the last page of the doc.
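A quick way to confirm MPxIO really is out of the picture (paths assume Solaris 10):

# FC multipathing should be disabled for the HBA ports
grep mpxio-disable /kernel/drv/fp.conf

# stmsboot should likewise report that MPxIO is not managing these devices
stmsboot -L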

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

Remco Etten

Hi Marianne,

Yes, I have checked and double-checked that MPxIO is disabled. The firmware settings are correct and they are using the correct host settings (AVT enabled). I was hoping we could offline one controller at a time to see whether the problem follows a particular controller, but the customer is not willing to try this.
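If it comes to that, I suppose a less intrusive test would be to disable the paths through one controller from the host side instead of offlining it on the array, along these lines (c3 is only an example controller name):

# list the host controllers DMP knows about
vxdmpadm listctlr all

# temporarily stop DMP using the paths through one controller, then re-enable it
vxdmpadm disable ctlr=c3
vxdmpadm enable ctlr=c3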

Thanks

 

stinsong

Hi Remco,

If you have checked all the points everyone mentioned, I suggest you map 2 new LUNs from the array and create a DG/volume on them with I/O running, to test whether the same error appears on the new LUNs. That way we can see whether it is an array controller issue or a LUN issue.
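A rough sketch of that test, assuming the new LUNs show up as sun6180-0_2 and sun6180-0_3 (adjust names and sizes to your environment):

# initialise the two new LUNs for VxVM use
vxdisksetup -i sun6180-0_2
vxdisksetup -i sun6180-0_3

# create a throw-away disk group and a small test volume on them
vxdg init testdg testdg01=sun6180-0_2 testdg02=sun6180-0_3
vxassist -g testdg make testvol 2g

# drive some write and read I/O (testvol holds no data, so overwriting it is fine)
dd if=/dev/zero of=/dev/vx/rdsk/testdg/testvol bs=1024k count=1024
dd if=/dev/vx/rdsk/testdg/testvol of=/dev/null bs=1024k count=1024

# then watch /var/adm/messages to see whether the same errors appear for these LUNs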

sunny_anthony@symantec.com

It looks like there is some issue with the assigned HDDs/LUNs themselves.

Try issuing a SCSI inquiry to the devices:

vxscsiinq /dev/vx/rdmp/<devicename>

I assume this will pass; if it does, check for any error events in the array event logs.
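To run the inquiry against every LUN of the array at once, something like this should work (vxscsiinq may live under /etc/vx/diag.d if it is not in your PATH):

# SCSI inquiry against each DMP device of the array
for d in /dev/vx/rdmp/sun6180-0_*; do
    echo "=== $d ==="
    vxscsiinq "$d"
done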

hashim alnajjar

Dear Remco,

Try running iostat -En and post the output; if you have any hardware errors, they should show up there.

Also, please let us know which OS version you are using, because you apparently have an OS- or hardware-related issue.
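For example:

# per-device error counters since boot; the sd instance from the messages should stand out
iostat -En

# OS release details
uname -a
cat /etc/release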