What is the general flow of an I/O between a volume and a disk (including DMP)?

Article:TECH32080  |  Created: 2004-01-13  |  Updated: 2009-01-22  |  Article URL http://www.symantec.com/docs/TECH32080
Article Type
Technical Solution

Issue



What is the general flow of an I/O between a volume and a disk (including DMP)?

Solution



What is the general flow of an I/O between a volume and a disk (including DMP)?

The general flow of I/O through a multi-pathed stack can be summarized in the diagram below:

                    Application
                         |
                         FS (e.g., VxFS or UFS)
                         |
                       vxio (VxVM)
                         |
                       vxdmp (DMP)
                         |
                   -------------
                  |             |
                 sd             sd (OS disk driver)
                  |             |
                  HBA           HBA (Host Bus Adapter)
                  |             |
                 SAN           SAN
                  |             |
                   -------------
                         |
                     Disk/LUN

Applications that run on top of a file system typically send I/O requests to files, while databases often issue I/O directly to raw devices. In the former case, based upon the pathname of the file, the I/O is directed to the relevant file system. The file system intercepts this request and channels its own I/O to the volume device that it resides on: /dev/vx/[r]dsk/diskgroup_name/volume_name. As I/Os enter this device they are received by the VxVM kernel driver, vxio, which maintains the volume-plex-subdisk-disk configuration in the kernel. For a given I/O, vxio ascertains from the in-memory volume configuration which disk(s) the I/O is destined to be serviced by, and sends the I/O (buffer) to the relevant DMP metanode for that disk device.
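
For example, the volume device that the file system resides on, and the in-kernel volume-plex-subdisk-disk hierarchy that vxio works from, can be inspected as follows (the disk group and volume names are illustrative):

    # "datadg" and "datavol" are example names - substitute your own objects
    ls -l /dev/vx/dsk/datadg/datavol /dev/vx/rdsk/datadg/datavol

    # Show the volume-plex-subdisk-disk hierarchy held by vxio
    vxprint -g datadg -ht datavol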

The DMP metanode is a pseudo device located in /dev/vx/[r]dmp/ and is a representation of the disk with all of its paths. When the I/O is directed at the DMP metanode device, it is handled by the DMP kernel module, vxdmp. The vxdmp driver creates its own buffer to service the I/O and piggybacks the incoming (vxio) I/O on this buffer. DMP selects one of the sub-paths on which to send the I/O and passes the buffer to the disk driver instance for that path. The buffer includes the lbolt value (the number of clock ticks since boot time) at the time the I/O is issued to the disk driver, and also the number of times the I/O has been retried. DMP then waits on the I/O, which has now left its domain and is in the SCSI disk driver.
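
The metanodes, and the paths they aggregate, can be listed as shown below (the device name passed to vxdmpadm is illustrative):

    # DMP metanodes - one pseudo device per disk/LUN, representing all of its paths
    ls /dev/vx/dmp /dev/vx/rdmp

    # Paths behind a given metanode ("c2t0d0s2" is an example device name)
    vxdmpadm getsubpaths dmpnodename=c2t0d0s2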

Note:
  • If VxVM foreign device support is used, the I/O bypasses the DMP layer, i.e. vxio passes the buffer directly to the relevant third-party driver.
  • On Solaris, I/Os enter the DMP layer even if only one path exists.
  • In the case of HP-UX, even though vxio sends the buffer on a raw DMP metanode, DMP sends it down the block interface of the SCSI driver.
  • In the case of AIX, if the LUN is single-pathed, the I/O is sent directly to the SCSI driver, bypassing DMP. This fastpathing is done only for single-pathed LUNs. Also, if the SCSI driver returns the I/O with an error, the I/O is then retried through vxdmp.

The SCSI disk driver (sd, ssd, etc.) now processes the I/O and sends it to the relevant HBA driver, which eventually sends it across to the relevant disk/LUN. On completion of the I/O, the buffer returns all the way up the stack, back to the calling system call that initiated the I/O.


What happens if there is an I/O failure?

If there is an I/O failure in the disk sub-system, the error is propagated up the I/O stack. If the SCSI disk driver detects the error, it goes through its own timeout and retry mechanism (controlled by sd_retry_count and sd_io_time in the case of the Solaris sd driver). When these are exhausted, the SCSI disk driver returns the buffer to DMP with the B_ERROR flag set. The buffer is then placed on the DMP error processing queue. The DMP error daemon then tests the problematic path to determine whether the problem is transient or permanent. This is done by sending a SCSI inquiry ioctl to the path to check whether the device is accessible. If the SCSI inquiry succeeds, DMP re-issues the I/O down the same path. This process is repeated dmp_retry_count times by vxdmp. If all the retries are exhausted, the failure is interpreted as a media error. DMP assumes that some driver above it in the stack may wish to retry the I/O (with data relocation etc.), and hence the error is returned to vxio without marking the DMP metanode as "failed". (No vxdmp message is logged.)
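
For reference, on Solaris the sd driver's retry behaviour and DMP's retry count are kernel tunables. The /etc/system fragment below is an illustrative sketch only: the values are examples, and the exact tunable names and their location (for example /kernel/drv/vxdmp.conf on later VxVM releases) vary by release:

    * Example /etc/system entries - values are illustrative, not recommendations
    * Number of times the sd driver retries a command before failing the I/O
    set sd:sd_retry_count=5
    * Seconds the sd driver allows an I/O before timing it out
    set sd:sd_io_time=60
    * Number of DMP retries after a successful SCSI inquiry on the same path
    * (tunable name and location depend on the VxVM release)
    set vxdmp:dmp_retry_count=5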

If the SCSI inquiry fails, DMP logs the path as "disabled" and then checks the state of the device by sending SCSI inquiries to the other paths in the path set. If at least one path inquiry succeeds, the I/O is re-issued down the next enabled and active path. If all of the path inquiries fail, the device is determined to be a dead LUN and the DMP metanode is marked as "failed". The vxio buffer that has piggybacked on the buffer with the I/O error is then extracted and passed back to vxio, which in turn logs a vxio read/write error for the relevant subdisk and block offset.
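
The resulting path states can be checked with vxdmpadm (the controller name below is illustrative):

    # Path states (enabled/disabled) for the paths through one controller
    vxdmpadm getsubpaths ctlr=c2

    # Summary of all controllers known to DMP
    vxdmpadm listctlr all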

vxio then initiates its own error processing. Depending on the configuration of the volume, the plex containing the failed subdisk is detached. If the volume has redundancy, the error is not reported to the upper layer (file system etc.), as the volume is tolerant of the failure. If the volume has no redundancy, the I/O to the volume has failed and the error is propagated to the file system layer above. The file system logs the failed I/O and returns an error to the system call of the application. In VxFS, depending on the failure and the mount options, the file system itself may be disabled (and this is logged).
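
Once the underlying path or device problem has been corrected, the plex and subdisk states can be reviewed and the volume recovered; the disk group and volume names below are illustrative and the exact steps depend on the failure:

    # Review plex/subdisk states (look for detached or disabled plexes)
    vxprint -g datadg -ht datavol

    # Typical recovery once the hardware problem is resolved
    vxreattach                      # re-attach disks that went offline, if applicable
    vxrecover -g datadg -s datavol  # start and resynchronize the volume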

Once a DMP metanode is marked as failed, subsequent I/Os on that dmpnode (that is, to that LUN) return immediately from DMP. If multiple I/O buffers are returned with errors (say for different LUNs/regions), these are processed serially by vxdmp.
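
A failed metanode is also visible at the disk level (the device name below is illustrative):

    # Disks whose DMP metanode has failed are reported as failed here
    vxdisk list

    # Per-disk detail, including the multipathing (path state) information
    vxdisk list c2t0d0s2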

When does a temporarily "failed" path get enabled again?

A path may be marked as "failed" due to a transient error caused by cabling, switch, or other issues. The retry mechanism described above is meant to avoid disabling a path straight away when the I/O failed due to a transient error. Once a path is marked as "failed", it remains in that state until the DMP restore daemon checks the health of the path and determines it to be good. The DMP restore daemon wakes up every "dmp_restore_daemon_interval" seconds and checks whether the paths are okay by issuing a SCSI inquiry ioctl. The paths chosen for the health check depend on the "dmp_restore_daemon_policy". If the daemon finds a path to be good, the path is marked as active and enabled and is available for sending down I/Os. dmp_restore_daemon_interval defaults to 300 seconds and can be changed with vxdmpadm.
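
The restore daemon settings are changed by stopping and restarting it with vxdmpadm, as in the illustrative example below (the available policy names are listed in the VxVM administrator's guide for the release in use):

    # Stop the restore daemon, then restart it with a new interval and policy
    vxdmpadm stop restore
    vxdmpadm start restore interval=300 policy=check_disabled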

What is a "disabled" path?

A given path, or the paths under a given controller, can be marked as "disabled" through administrative intervention. When the vxdmpadm command is used to disable path(s) in this manner, they are marked as "disabled". Such paths can be brought back to the "active enabled" state only through the administrative command vxdmpadm enable.
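
For example (the controller name is illustrative):

    # List the controllers, then disable and later re-enable all paths through one of them
    vxdmpadm listctlr all
    vxdmpadm disable ctlr=c2
    vxdmpadm enable ctlr=c2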

Active/Active (A/A) and Active/Passive (A/P) arrays and DMP behavior
In the case of A/A arrays, DMP performs load balancing by distributing the I/O across the multiple paths that exist to a given LUN. In the case of A/P arrays, the I/Os are issued through the primary path(s) of the DMP metanode, which correspond to the "active" port on the array. The secondary paths of the dmpnode correspond to the "passive" port. If the primary path fails for some reason, DMP fails over to the secondary path, so that the secondary becomes the active path. When the primary path comes back, DMP fails back to it. In the case of A/PF (explicit failover) arrays, a command is sent to enable the new path on failover/failback, while in auto-trespass mode the first I/O that goes through the new path makes it the active path.
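
On releases that support enclosure-based naming, the enclosure/array information and the primary or secondary designation of each path can be displayed as follows (the device name is illustrative):

    # Enclosure and array information known to DMP
    vxdmpadm listenclosure all

    # Primary/secondary attribute of each path under a metanode (A/P arrays)
    vxdmpadm getsubpaths dmpnodename=c2t0d0s2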

Insane device behavior
It has been seen in extreme and rare scenarios that devices can go "insane", where I/O failures are not returned for long periods of time, yet SCSI inquiries respond promptly. This can cause a hang condition for the I/O. Consider a case where vxdmp issues the I/O to the sub-system, which holds on to the I/O for ten minutes but then returns it as failed. DMP tests the path with a SCSI inquiry, which succeeds, and then re-issues the I/O to the problem path because the failure appears to be a transient error. This I/O may also take ten minutes, and the process repeats until dmp_retry_count is exhausted. In such cases I/Os have taken many minutes to fail, clearly causing an issue for the upper-level application.

To counter this insane device behavior, and in order to prevent a hang, a tunable threshold was introduced in DMP: dmp_failed_io_threshold. In an insane device scenario, when an errored buffer is eventually returned to DMP from the layer below, DMP compares the current lbolt value against the one sent in the buffer when the I/O was initiated. If the elapsed time exceeds dmp_failed_io_threshold, the buffer does not go through DMP error processing (no DMP test and retry) and the error is immediately returned to vxio (no vxdmp messages are logged). This prevents the retry mechanism from delaying the I/O error any further.

If mirrored volumes are in use, the threshold value lessens the time taken to detach the plex containing the insane device, so that normal I/O can continue with the remaining plex(es). If simple volumes with no redundancy are in use, it may be preferable to keep retrying the device even if this prolongs the hang; in that case, the threshold should be increased to a large value. Without the threshold limit, insane devices could be retried over a long period of time. If this scenario does occur, the driver stack below DMP should be investigated.

To configure and display dmp_failed_io_threshold:
Solaris
Configure in /etc/system for 3.2, and in /kernel/drv/vxdmp.conf for 3.5 onwards. The value can be displayed with echo "dmp_failed_io_threshold/DD" | adb -k, or with prtconf -vP on 3.5 and later. The defaults are:
  3.2                        60000 1/100ths of a second (10 minutes)
  3.2 Patch05                57600 seconds (16 hours)
  3.5, 3.5MP1 & 3.5MP2         600 seconds (10 minutes)
  3.5MP3                     57600 seconds (16 hours)
  4.0                    600000000 microseconds (10 minutes)
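
Illustrative configuration entries are shown below; the exact tunable spelling (including any module prefix in /etc/system) and its units depend on the release, so treat the table above as the reference:

    * VxVM 3.2: example /etc/system entry (value in 1/100ths of a second)
    set vxdmp:dmp_failed_io_threshold=60000

    # VxVM 3.5 onwards: example /kernel/drv/vxdmp.conf entry (value in seconds)
    dmp_failed_io_threshold=600;

    # Display the current in-kernel value
    echo "dmp_failed_io_threshold/DD" | adb -k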

HPUX
Not applicable

AIX
In 3.2.2.1 onwards, configure via SMIT (vxvm); the default is 600 seconds (10 minutes).

Linux
Not applicable

If using 3.2 Patch05 or 3.5MP3, a message "Reached dmp Threshold IO TimeOut" is logged when this scenario is encountered.

Hung I/O
The components in the I/O stack should always complete the I/O request by signalling success or failure. If any component holds onto the I/O without signalling success or failure, the I/O hangs. As more requests are channelled through the I/O subsystem, the hung I/O may cause a bottleneck and the whole system may grind to a halt. In such hung systems, the cause of the hang should be investigated.
Note: VxVM/DMP has no initial timeout mechanism of its own and relies upon the error notification of the driver layers beneath it (sd etc.). If the lower-layer drivers do not detect the error, or hold on to the I/O, then DMP is unable to process the error and rectify the problem (such as by using another path).


Legacy ID



268035

