Solaris hosts using the sd SCSI driver with VxVM 5.1SP1 and later must increase the Timebound timeout from the default 300 seconds

Article:TECH186667  |  Created: 2012-04-17  |  Updated: 2012-04-20  |  Article URL http://www.symantec.com/docs/TECH186667
Article Type
Technical Solution

Issue



Solaris hosts that use the 'sd' SCSI driver for VxVM 5.1SP1 disks must increase the Enclosure Timebound timeout (iotimeout) to a value greater than the default of 300 seconds. When a Solaris host uses the sd driver, the default SCSI timeout is also 300 seconds [ sd_io_time x sd_retry_count ]. With both timers at 300 seconds, the Enclosure Timebound timeout can fail the entire device before the SCSI timeout has failed the individual device path. The SCSI variable sd_io_time is commonly set to at least 60 seconds in /etc/system as part of configuration requirements from array vendors (EMC and Hitachi, for example).

This setting is most important for external, SAN-connected disks, which are the most likely to experience an unresponsive disk device due to a transport failure (for example, a SAN fabric failure). Internal disks (with only a single path to the disk) are also affected, in that the device will be disabled; however, there are no alternate paths to try, and SCSI would have disabled the device at the 300 second point anyway.

This tunable is set at the enclosure level. The change can be made online and is non-disruptive. The setting persists across reboots.


Error



 NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 49/0x32


Environment



Solaris hosts using the 'sd' SCSI driver with VxVM 5.1SP1 and later for SAN disks.  Solaris hosts that use the embedded Leadville driver 'ssd' are not affected by this issue.  Internal drives (except Fibre-attached drives) that use the sd driver have only one path to the device, so a path failure is fatal in any case. Both the SCSI timeout and the Enclosure Timebound iotimeout would return fatal errors at the default 300 seconds.


Cause



sd_io_time defaults to 60 seconds and sd_retry_count defaults to 5, so 60 seconds x 5 retries = 300 seconds pass before a SCSI command reports failure to DMP. This means that the Enclosure Timebound timeout is triggered before, or at the same time as, the SCSI timeout for the command. This leads to an unnecessary device failure instead of the expected path failure.

The iotimeout value should be high enough to allow the underlying SCSI driver to fail a path to a device before the iotimeout expires. If the iotimeout expires before the path is failed (for example, when a device is unresponsive due to a SAN fabric failure), alternate paths are not tried and the DMP node (the whole device) is disabled.  A value of more than 300 seconds is required to ensure that the other paths are tried before the device is disabled.

The total SCSI timeout defaults for the Solaris embedded driver ssd are different: there is a built-in 20 second delay for port events and the ssd_retry_count is 3, yielding a total unresponsive device timeout of:

[ ssd_io_time (60) x ssd_retry_count (3) + 20 second FCP timer = 200 seconds total ]
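
The arithmetic above can be checked with a short shell sketch. It is an illustration only, not a Symantec-supplied tool: the function names total_timeout and fails_path_first are hypothetical, the values are the defaults quoted in this article, and a shell with $(( )) arithmetic (ksh or bash, for example) is assumed.

```shell
# Sketch only: timeout arithmetic from this article, not a Symantec tool.
# total_timeout IO_TIME RETRY_COUNT [EXTRA_DELAY] prints the number of
# seconds before the SCSI driver reports a command failure to DMP.
total_timeout() {
    echo $(( $1 * $2 + ${3:-0} ))
}

# fails_path_first SCSI_TIMEOUT IOTIMEOUT succeeds only when the SCSI
# driver gives up strictly before the DMP iotimeout expires.
fails_path_first() {
    [ "$1" -lt "$2" ]
}

iotimeout=300                         # default Timebound iotimeout
sd_total=$(total_timeout 60 5)        # sd:  60 x 5
ssd_total=$(total_timeout 60 3 20)    # ssd: 60 x 3 + 20 second FCP timer

for entry in "sd:$sd_total" "ssd:$ssd_total"; do
    driver=${entry%%:*}
    total=${entry##*:}
    if fails_path_first "$total" "$iotimeout"; then
        echo "$driver: ${total}s < ${iotimeout}s, path is failed first"
    else
        echo "$driver: ${total}s >= ${iotimeout}s, whole device may be disabled"
    fi
done
```

With the default values, only the sd driver reaches the 300 second iotimeout, which matches the Environment statement that ssd hosts are not affected.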


Solution



Symantec recommends increasing the iotimeout to at least 15 seconds longer than the sd driver timeout period (300 seconds). Take care not to exceed the timeout allowed by the upper-layer applications in use (for example, a database instance or other application using the DMP device). Use the command 'vxdmpadm getattr enclosure <enclosure_name>' to display the current settings. The default is Timebound with a 300 second iotimeout.

#vxdmpadm getattr enclosure emc0 
ENCLR_NAME      ATTR_NAME                     DEFAULT        CURRENT
============================================================================
emc0            iopolicy                      MinimumQ       MinimumQ
emc0            partitionsize                 512            512
emc0            use_all_paths                 -              -
emc0            failover_policy               Global         Global
emc0            recoveryoption[throttle]      Nothrottle[0]  Nothrottle[0]
emc0            recoveryoption[errorretry]    Timebound[300] Timebound[300]   <-- iotimeout at 300 seconds
emc0            redundancy                    0              0
emc0            dmp_lun_retry_timeout         0              0
emc0            failovermode                  -              -
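
For scripted checks, the CURRENT column of the recoveryoption[errorretry] row can be extracted from output in the format shown above. The following is a minimal sketch under stated assumptions: the function name current_iotimeout is hypothetical, the column layout is assumed to match the sample display, and an awk with gsub() support is assumed (nawk on Solaris).

```shell
# Sketch only: reads 'vxdmpadm getattr enclosure <name>' output on stdin
# and prints the current Timebound iotimeout in seconds.
# Prints nothing if the error-retry policy is not Timebound.
current_iotimeout() {
    awk '$2 == "recoveryoption[errorretry]" && $4 ~ /^Timebound\[[0-9]+\]$/ {
        value = $4                   # CURRENT column, e.g. Timebound[300]
        gsub(/[^0-9]/, "", value)    # keep only the digits
        print value
    }'
}

# Self-contained demonstration using two rows copied from the sample display.
current_iotimeout <<'EOF'
emc0            recoveryoption[throttle]      Nothrottle[0]  Nothrottle[0]
emc0            recoveryoption[errorretry]    Timebound[300] Timebound[300]
EOF
```

On a live host the function would be fed from the real command, for example: vxdmpadm getattr enclosure emc0 | current_iotimeout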
 

For this example, we increase the value to 315 seconds:

#vxdmpadm setattr enclosure emc0 recoveryoption=timebound iotimeout=315

The change takes effect immediately while the enclosure is online, is non-disruptive, and persists across reboots.
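
Where sd_io_time or sd_retry_count differ from the defaults, the iotimeout can be derived rather than hard-coded. The sketch below is illustrative only: recommend_iotimeout and app_limit are hypothetical names, the 600 second application limit is an invented placeholder, and the script prints the vxdmpadm command instead of running it.

```shell
# Sketch only: derives an iotimeout from the sd tunables and prints the
# vxdmpadm command for review; it does not change any setting.
sd_io_time=60       # assumed value from /etc/system, in seconds
sd_retry_count=5    # assumed value
margin=15           # minimum margin recommended in this article, in seconds
app_limit=600       # hypothetical upper-layer application timeout, in seconds

recommend_iotimeout() {
    # $1 = sd_io_time, $2 = sd_retry_count, $3 = margin
    echo $(( $1 * $2 + $3 ))
}

iotimeout=$(recommend_iotimeout "$sd_io_time" "$sd_retry_count" "$margin")

# Warn if the derived value would exceed what the application tolerates.
if [ "$iotimeout" -gt "$app_limit" ]; then
    echo "warning: iotimeout ${iotimeout}s exceeds application limit ${app_limit}s" >&2
fi

echo "vxdmpadm setattr enclosure emc0 recoveryoption=timebound iotimeout=$iotimeout"
```

With the default sd values this reproduces the 315 second example; substitute the enclosure name and the tunables that apply to the host before using the printed command.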




