Troubleshooting Failing Disks, Missing Disks and the "failed was" status

Article:TECH200618  |  Created: 2012-12-06  |  Updated: 2014-04-17  |  Article URL http://www.symantec.com/docs/TECH200618
Article Type
Technical Solution


Issue



This article contains a procedure for troubleshooting the "failed" or "failed was" status, as reported by vxdisk.

 


Error



# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
disk_0       auto:cdsdisk    -            (vxfendg)    online
disk_1       auto:cdsdisk    -            (vxfendg)    online
disk_2       auto:cdsdisk    -            (vxfendg)    online
disk_3       auto:cdsdisk    datadg01     datadg       online
disk_4       auto            -            -            error
disk_5       auto:cdsdisk    datadg03     datadg       online
disk_6       auto:cdsdisk    datadg04     datadg       online
disk_7       auto:cdsdisk    -            (sambadg)    online
disk_8       auto:cdsdisk    -            -            online
disk_9       auto:cdsdisk    -            -            online
sda          auto:none       -            -            online invalid
-            -         datadg02     datadg       failed was:disk_4


Solution




Table of Contents



1. Introduction
2. "Failed" versus "failing" disks
3. Making an emergency backup of the disk group configuration
4. Has the disk been excluded by vxvm.exclude?
5. Has the disk been overwritten by another logical volume manager solution?
6. Determining if a disk can be reattached
7. Verifying that a disk is readable to the operating system
8. Have the paths to the disk been disabled?
9. Restoring the disk group configuration using vxconfigrestore
10. Restoring the disk group configuration manually, using UDIDs and Disk IDs
11. Restarting the volume



 

1. Introduction

This article contains a procedure for troubleshooting the "failed" or "failed was" status, as reported by vxdisk.

A "failed" status is a record of a disk that is no longer accessible. This is often caused by sustained I/O errors to the disk that prevent it from being read by the operating system (OS). It may also be the result of corruption within the Veritas private region.
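
On systems with many devices, the "failed was" records can be picked out of the vxdisk listing with a small filter. The following is a minimal sketch, not part of the product: the function name list_failed_records is hypothetical, and it assumes the column layout shown in the Error section above.

```shell
# Reads "vxdisk -o alldgs list" output on stdin.
# Prints one line per "failed was" record:
#   <disk_media_name> <disk_group> <last_known_device>
list_failed_records() {
    awk '$1 == "-" && $2 == "-" && $5 == "failed" && $6 ~ /^was:/ {
        sub(/^was:/, "", $6)   # strip the "was:" prefix from the device name
        print $3, $4, $6
    }'
}
```

Typical use would be: vxdisk -o alldgs list | list_failed_records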

The private region is the portion of the disk where Veritas stores records about the disk group, such as disks, volumes, subdisks and plexes. This can be contrasted with the public region, which contains the actual volumes, including user data.




 

2. "Failed" versus "failing" disks

The status of "failed" should not be confused with a status of "failing." This article primarily discusses the "failed" status, as reported by vxdisk. For information on troubleshooting the "failing" status, see http://www.symantec.com/docs/TECH61915.




3. Making an emergency backup of the disk group configuration

Before making any further changes, use vxconfigbackup to create an emergency backup of the private region for the remaining disks in the affected disk group.

Vxconfigbackup does not back up the actual data that is contained within the volumes. Instead, it backs up the Veritas private region configuration database that resides on the disks, along with some information about the disks themselves. The configuration database stores information about which disks are contained by the disk group, volume structures, plexes and subdisks.

If vxconfigbackup is not available, vxprivutil can be used to dump a copy of the configuration database.
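
As an illustration, the backup step could be wrapped as shown below. This is a sketch under stated assumptions: the function name backup_dg_config and the DRY_RUN variable are hypothetical, and the optional directory is passed with the "-l" option of vxconfigbackup. Confirm the options against the manual page for the installed release before relying on them.

```shell
# Back up the configuration of one disk group with vxconfigbackup.
# With DRY_RUN=1 the command is printed instead of being executed.
# The second argument (a backup directory) is optional; when it is
# omitted, vxconfigbackup writes to its default location.
backup_dg_config() {
    dg=$1
    dir=${2:-}
    if [ -z "$dg" ]; then
        echo "usage: backup_dg_config <disk_group> [backup_dir]" >&2
        return 2
    fi
    if [ -n "$dir" ]; then
        set -- vxconfigbackup -l "$dir" "$dg"
    else
        set -- vxconfigbackup "$dg"
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, setting DRY_RUN=1 and running "backup_dg_config datadg" prints the command that would be run for the disk group used in this article.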
 


More details about vxconfigbackup and vxprivutil, including syntax and examples, can be found in this article:

"Using vxconfigbackup and vxprivutil to back up the disk group configuration of the Veritas private region"
http://www.symantec.com/docs/TECH201329


 

 

4. Has the disk been excluded by vxvm.exclude?

/etc/vx/vxvm.exclude maintains a list of paths, controllers and products that are excluded. Check to see if the disk, or its associated path or controller, is listed in this file.

If the value of "exclude_all" is 1, all devices will be excluded.


Figure 1 - Default contents of /etc/vx/vxvm.exclude

 
exclude_all 0
paths
#
controllers
#
product
#
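
The check described above can be sketched as a short filter. The function name check_vxvm_exclude is hypothetical, and the sketch assumes the layout shown in Figure 1, where each section name stands alone on a line and "#" closes the section.

```shell
# Summarize an exclude file: print the exclude_all value and any entries
# listed under the "paths", "controllers" and "product" sections.
# The file defaults to /etc/vx/vxvm.exclude when no argument is given.
check_vxvm_exclude() {
    awk '
        $1 == "exclude_all" { print "exclude_all=" $2; next }
        NF == 1 && ($1 == "paths" || $1 == "controllers" || $1 == "product") {
            section = $1; next
        }
        $1 == "#" { section = ""; next }
        section != "" && NF > 0 { print section ": " $0 }
    ' "${1:-/etc/vx/vxvm.exclude}"
}
```

With the default contents the only output is "exclude_all=0". Any additional line, or a value of 1, means that something is being excluded and should be compared with the affected disk.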
 

 



5. Has the disk been overwritten by another logical volume manager solution?

If vxdisk shows that the disk type includes the words "LVM" or "ZFS," then the disk may have been overwritten by another logical volume manager (LVM) solution. It is also possible that there is a problem with the SAN zoning which may have caused disks to be presented to the wrong systems. Before making any further changes, ensure that the disk is not supposed to be zoned to another system.

To bring a disk back into its original, Veritas disk group, the disk must first be removed from the control of the other LVM solution and then initialized for Veritas, using vxdisksetup. Refer to the documentation for the appropriate vendor for information about removing a disk from the control of their LVM solution.
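
To spot such disks quickly, the TYPE column of the vxdisk listing can be filtered. This is a minimal sketch: the function name list_foreign_lvm_disks is hypothetical, and it assumes that the type string contains "LVM" or "ZFS" as described above.

```shell
# Reads "vxdisk -o alldgs list" output on stdin and prints the device
# name and type of every disk whose TYPE column mentions LVM or ZFS.
list_foreign_lvm_disks() {
    awk 'toupper($2) ~ /LVM|ZFS/ { print $1, $2 }'
}
```

Typical use would be: vxdisk -o alldgs list | list_foreign_lvm_disks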




6. Determining if a disk can be reattached


Vxreattach is used to restore the original disk media name and reattach the disk back to the disk group. It can normally only be used if the status of the disk is "online" (see Figure 2).

Run vxreattach, using the "-c" argument, to determine if a disk can be reattached to the disk group.


Figure 2 - Using vxreattach, with the "-c" argument, to check if a reattach is possible


Syntax:

vxreattach -c <device_name>


Example, with typical output:

# vxreattach -c disk_4
datadg datadg02

 

In this case, "datadg" is the name of the disk group while "datadg02" is the disk media name, as shown by vxdisk.

# vxdisk -o alldgs list
DEVICE       TYPE           DISK        GROUP        STATUS
disk_0       auto:cdsdisk   -            (vxfendg)   online
disk_1       auto:cdsdisk   -            (vxfendg)   online
disk_2       auto:cdsdisk   -            (vxfendg)   online
disk_3       auto:cdsdisk   datadg01     datadg      online
disk_4       auto:cdsdisk   -            (datadg)    online
disk_5       auto:cdsdisk   datadg03     datadg      online
disk_6       auto:cdsdisk   datadg04     datadg      online
disk_7       auto:cdsdisk   -            (sambadg)   online
disk_8       auto:cdsdisk   -            -           online
disk_9       auto:cdsdisk   -            -           online
sda          auto:none      -            -           online invalid
-            -         datadg02     datadg       failed was:disk_4

 




If vxreattach -c returns a disk group and disk media name, without returning any errors, proceed with reattaching the disk (Figure 3). If a reattach is not possible, a V-5-2-238 error will appear.

Figure 3 - Using vxreattach to reattach a disk to the disk group


Syntax:

vxreattach -br <device_name>


Example, with typical output:

# vxreattach -br disk_4
 

Notice that vxdisk now shows a disk media name, "datadg02," for disk_4.

# vxdisk -o alldgs list

DEVICE       TYPE           DISK        GROUP        STATUS
disk_0       auto:cdsdisk   -            (vxfendg)   online
disk_1       auto:cdsdisk   -            (vxfendg)   online
disk_2       auto:cdsdisk   -            (vxfendg)   online
disk_3       auto:cdsdisk   datadg01     datadg      online
disk_4       auto:cdsdisk   datadg02     datadg      online
disk_5       auto:cdsdisk   datadg03     datadg      online
disk_6       auto:cdsdisk   datadg04     datadg      online
disk_7       auto:cdsdisk   -            (sambadg)   online
disk_8       auto:cdsdisk   -            -           online
disk_9       auto:cdsdisk   -            -           online
sda          auto:none      -            -           online invalid
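
The check-then-reattach sequence from Figures 2 and 3 could be scripted roughly as follows. This is a sketch, not a supported tool: the function name reattach_if_possible is hypothetical, and it assumes that a failed check either returns a non-zero exit status, prints a V-5-2-238 message, or prints nothing at all.

```shell
# Check whether a device can be reattached and, only if the check
# returns a disk group and disk media name, perform the reattach.
reattach_if_possible() {
    dev=$1
    check=$(vxreattach -c "$dev" 2>&1)
    status=$?
    # Assumption: any of these three conditions means "cannot reattach".
    if [ "$status" -ne 0 ] || [ -z "$check" ] ||
       printf '%s\n' "$check" | grep -q 'V-5-2-238'; then
        echo "reattach not possible for $dev: $check" >&2
        return 1
    fi
    echo "reattaching $dev ($check)"
    vxreattach -br "$dev"
}
```

For the example in this article, "reattach_if_possible disk_4" would run the reattach only after the check reports "datadg datadg02".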

 




7. Verifying that a disk is readable to the operating system

Use native OS commands to confirm that the OS can read the disk, including the disk label.

 

  • Use commands such as prtvtoc, fdisk, lspv or diskinfo to read the disk label.
  • Use dd to read a block from the disk.

Veritas depends on the OS device drivers to communicate with disks. If the OS is unable to read a disk, Veritas will also fail to read it. If a disk does not have a label, or the label has been corrupted, Veritas will not recognize the disk. Completing these steps will assist with identifying the source of a disk outage.
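
The dd test mentioned above can be as simple as reading one block and discarding it. The sketch below assumes a 512-byte block size, and the function name can_read_block is hypothetical. The device path to pass in depends on the platform and is not shown in this article.

```shell
# Try to read the first 512-byte block of a device through the OS.
# Returns 0 if dd can open and read the device, non-zero otherwise.
can_read_block() {
    dd if="$1" of=/dev/null bs=512 count=1 2>/dev/null
}
```

A non-zero return for the raw device behind disk_4, while other disks on the same controller return 0, points at the disk or its paths rather than at Veritas.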
 


More details, including syntax and examples, can be found in this article:

"Verifying that a disk is readable by the OS"
http://www.symantec.com/docs/TECH201356




8. Have the paths to the disk been disabled?

 

Use vxdmpadm to determine the status of the paths to the disks (Figure 4).

Veritas will disable a path if serious or sustained I/O errors occur. When all paths to a disk are disabled, the server will be unable to read or write to the volume. If a path has been disabled, review the syslog for events that are reported by "vxdmp," or "scsi" for I/O errors.

Although a path can be re-enabled using "vxdmpadm enable," vxdmp should automatically evaluate the status of a path at five-minute intervals using a SCSI inquiry. If the inquiry is successful, the path is automatically re-enabled. If a path remains disabled beyond this interval, it is possible that I/O errors are still being detected, which warrants further investigation. Paths are not automatically re-enabled if the disk group has been disabled, or if vxesd is stopped. The behavior of vxdmp in response to disabled paths can be modified via the DMP tunables, which can be viewed using "vxdmpadm gettune."
 


Note: Although the syslog may show that vxdmp is the source of an I/O error, vxdmp itself is not usually the origin. Veritas depends on the OS device drivers to communicate with disks. When I/O errors occur, they are reported to Veritas by the device drivers. Vxdmp will report the errors that have been passed to it by the device drivers and may disable a path in response to the events.



Figure 4 - Example of a disabled path, as reported by vxdmpadm


Syntax:

vxdmpadm getsubpaths


For example:

# vxdmpadm getsubpaths

NAME         STATE[A]   PATH-TYPE[M] DMPNODENAME  ENCLR-NAME   CTLR
======================================================================
sdm          ENABLED(A)   -          disk_0       disk         c8
sdp          ENABLED(A)   -          disk_0       disk         c3
sdb          ENABLED(A)   -          disk_1       disk         c8
sdc          ENABLED(A)   -          disk_1       disk         c3
sdq          ENABLED(A)   -          disk_2       disk         c8
sdt          ENABLED(A)   -          disk_2       disk         c3
sdd          ENABLED(A)   -          disk_3       disk         c8
sdf          ENABLED(A)   -          disk_3       disk         c3
sdi          DISABLED     -          disk_4       disk         c8
sdl          DISABLED     -          disk_4       disk         c3
sde          ENABLED(A)   -          disk_5       disk         c8
sdh          ENABLED(A)   -          disk_5       disk         c3
sdk          ENABLED(A)   -          disk_6       disk         c8
sdn          ENABLED(A)   -          disk_6       disk         c3
sdr          ENABLED(A)   -          disk_7       disk         c8
sdu          ENABLED(A)   -          disk_7       disk         c3
sdg          ENABLED(A)   -          disk_8       disk         c8
sdj          ENABLED(A)   -          disk_8       disk         c3
sdo          ENABLED(A)   -          disk_9       disk         c8
sds          ENABLED(A)   -          disk_9       disk         c3
sda          ENABLED(A)   -          sda          disk         c2
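
Because a disk only becomes unreadable when all of its paths are disabled, it can help to summarize the listing per DMP node. The following is a sketch that assumes the column order shown in Figure 4. The function name summarize_disabled_paths is hypothetical.

```shell
# Reads "vxdmpadm getsubpaths" output on stdin. For every DMP node with
# at least one disabled path, prints "<node>: <disabled>/<total> disabled"
# and flags nodes that have no enabled path left.
summarize_disabled_paths() {
    awk '
        $2 ~ /^ENABLED/  { total[$4]++ }
        $2 ~ /^DISABLED/ { total[$4]++; disabled[$4]++ }
        END {
            for (node in disabled) {
                flag = (disabled[node] == total[node]) ? " ALL PATHS DISABLED" : ""
                print node ": " disabled[node] "/" total[node] " disabled" flag
            }
        }
    ' | sort
}
```

Typical use would be: vxdmpadm getsubpaths | summarize_disabled_paths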

 

 



9. Restoring the disk group configuration using vxconfigrestore

If a vxreattach is not possible, use vxconfigrestore to recover the disk group.

Vxconfigrestore does not restore the actual data that is contained within the volumes. It only restores the Veritas configuration database that is located within the private region of the disks. The configuration database stores information about which disks are contained by the disk group, volume structures, plexes and subdisks.
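
As a rough orientation before reading the linked article, a restore is normally staged with a precommit that can be reviewed and then either committed or abandoned. The sketch below only prints that sequence and executes nothing. The function name print_restore_plan is hypothetical, and the options shown (-p, -c, -d) should be confirmed against the vxconfigrestore manual page for the installed release.

```shell
# Print the staged restore sequence for a disk group, for review only.
# No command is run; each line is simply echoed.
print_restore_plan() {
    dg=$1
    echo "vxconfigrestore -p $dg   # precommit: load the backup for inspection"
    echo "vxprint -g $dg -ht       # review the restored records"
    echo "vxconfigrestore -c $dg   # commit if the records look correct"
    echo "vxconfigrestore -d $dg   # or abandon the precommit instead"
}
```

For example, "print_restore_plan datadg" lists the four commands for the disk group used in this article.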
 

 


More details about vxconfigrestore, including syntax and examples, can be found in this article:

"Restoring the disk group configuration using vxconfigrestore"
http://www.symantec.com/docs/TECH201366

 



10. Restoring the disk group configuration manually, using UDIDs and Disk IDs

If using vxconfigrestore is not possible, another method for recovering the disks is to compare the UDID or Disk ID attributes of the disks with the records that are contained within the private region configuration database.
 

 


More details about comparing UDIDs and Disk IDs, including syntax and examples, can be found in this article:

"Restoring the disk group configuration manually, using udids or disk IDs"
http://www.symantec.com/docs/TECH201367




11. Restarting the volume

Once the original disk has been added back to its disk group, additional steps may be needed to recover the volume. Use vxprint to determine the current status (Figure 5).

 

  • For mirrored volumes:
    • If at least one plex was not affected by the outage, the other plexes should be resynchronized when they are reattached to the volume. It may be necessary to use vxrecover to initiate this process (Figure 7).
    • If all plexes were affected by the outage, it may be necessary to manually review each plex to determine which contain the most recent updates.

WARNING: Do not simply force-start a mirrored volume. This may cause a plex that contains old, or corrupt, blocks to overwrite a plex that contains up-to-date data. A procedure for manually determining the most up-to-date mirror plex can be found in this article:

"Manually determining which mirror plex contains the most recent data and then resynchronizing"
http://www.symantec.com/docs/TECH202503


  • For non-mirrored volumes:
    • It may be necessary to manually restart the volume using vxvol after adding the disk back to the disk group (Figure 6).
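
Which of the two cases above applies can usually be read from the number of plexes per volume. The sketch below counts "pl" records in vxprint output. The function name count_plexes is hypothetical, and the count is only a first indication, since it also includes any log plexes that a volume may have.

```shell
# Reads "vxprint -g <disk_group> -ht" output on stdin and prints
# "<volume> <plex_count>" for every volume that owns at least one plex.
count_plexes() {
    awk '$1 == "pl" { n[$3]++ } END { for (v in n) print v, n[v] }' | sort
}
```

A count greater than 1 suggests a mirrored volume, in which case the warning above about force-starting applies.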



Figure 5 - Using vxprint to determine the status of a volume


Syntax:

vxprint -g <disk_group> -ht


Example, with typical output:

In this case, vxprint shows that the volume "vol1" is disabled. The plex status is "IOFAIL," which indicates that a sustained I/O interruption to the volume has occurred. After the associated disk is added back to the disk group, the volume will need to be restarted manually using vxvol.


# vxprint -g datadg -ht

dg datadg       default      default  1000     1336573086.38.Server101

dm datadg01     disk_3       auto     65536    2027264  -
dm datadg02     disk_4       auto     65536    2027264  -
dm datadg03     disk_5       auto     65536    2027264  -
dm datadg04     disk_6       auto     65536    2027264  -

v  locks        -            ENABLED  ACTIVE   102400   SELECT    -        fsgen
pl locks-01     locks        ENABLED  ACTIVE   102400   CONCAT    -        RW
sd datadg04-01  locks-01     datadg04 0        102400   0         disk_6   ENA

v  vol1         -            DISABLED ACTIVE   6010880  SELECT    -        fsgen
pl vol1-01      vol1         DISABLED IOFAIL   6010880  CONCAT    -        RW
sd datadg01-01  vol1-01      datadg01 0        2027264  0         disk_3   ENA
sd datadg02-01  vol1-01      datadg02 0        2027264  2027264   disk_4   ENA
sd datadg03-01  vol1-01      datadg03 0        1956352  4054528   disk_5   ENA

 




Figure 6 - Using vxvol to start a volume and using vxprint to review any changes in the status of the volume


Syntax:

vxvol -g <disk_group> -fa startall


Example, with typical output:

# vxvol -g datadg -fa startall


Vxprint now shows that the volume has been started.

# vxprint -g datadg -ht
dg datadg       default      default  1000     1336573086.38.Server101

dm datadg01     disk_3       auto     65536    2027264  -
dm datadg02     disk_4       auto     65536    2027264  -
dm datadg03     disk_5       auto     65536    2027264  -
dm datadg04     disk_6       auto     65536    2027264  -

v  locks        -            ENABLED  ACTIVE   102400   SELECT    -        fsgen
pl locks-01     locks        ENABLED  ACTIVE   102400   CONCAT    -        RW
sd datadg04-01  locks-01     datadg04 0        102400   0         disk_6   ENA

v  vol1         -            ENABLED  ACTIVE   6010880  SELECT    -        fsgen
pl vol1-01      vol1         ENABLED  ACTIVE   6010880  CONCAT    -        RW
sd datadg01-01  vol1-01      datadg01 0        2027264  0         disk_3   ENA
sd datadg02-01  vol1-01      datadg02 0        2027264  2027264   disk_4   ENA
sd datadg03-01  vol1-01      datadg03 0        1956352  4054528   disk_5   ENA

 



Figure 7 - Using vxrecover to finish the recovery, or start the resynchronization, of a volume


Syntax:

vxrecover -sb <volume>


Example, with typical output:

# vxrecover -sb vol1


Vxprint now shows that "vol1" is "ACTIVE."

# vxprint -g datadg -ht
dg datadg       default      default  10000    1336408747.34.Server101

dm datadg01     disk_3       auto     65536    2027264  -
dm datadg02     disk_4       auto     65536    2027264  -
dm datadg03     disk_5       auto     65536    2027264  -
dm datadg04     disk_6       auto     65536    2027264  -

v  locks        -            ENABLED  ACTIVE   102400   SELECT    -        fsgen
pl locks-01     locks        ENABLED  ACTIVE   102400   CONCAT    -        RW
sd datadg04-01  locks-01     datadg04 0        102400   0         disk_6   ENA

v  vol1         -            ENABLED  ACTIVE   102400   SELECT    -        fsgen
pl vol1-01      vol1         ENABLED  ACTIVE   102400   CONCAT    -        RW
sd datadg01-01  vol1-01      datadg01 0        102400   0         disk_3   ENA
pl vol1-02      vol1         ENABLED  ACTIVE   102400   CONCAT    -        RW
sd datadg04-02  vol1-02      datadg04 102400   102400   0         disk_6   ENA
pl vol1-03      vol1         DISABLED ACTIVE   102400   CONCAT    -        RW
sd datadg03-01  vol1-03      datadg03 0        102400   0         disk_5   ENA

 

 

keywords: failed, failing, failed disk, failed disks, failing disk, failing disks



