Avoiding Service Interruption with I/O Shipping Technology
One of the new features introduced in Storage Foundation Cluster File System HA 6.0.1 is I/O Shipping. This feature improves resiliency across Cluster File System nodes and reduces downtime. I/O Shipping ships I/O from one cluster node to another when a problem occurs in the I/O path. Once enabled, it works transparently for the user, reacting automatically to failures.
Think of a backup job on your media server that is writing to disk, a batch process that has been running for a few hours, or a database transaction that should not be interrupted. Any outage in the path down to the storage may affect those jobs and therefore impact your business. There is no need to suffer such an outage when working in a cluster, because the I/O can still be performed by the other nodes in the cluster.
To show how this feature works, a four-node cluster will be used. Each node has read and write access to the same file system (/data01) using Cluster File System. A write workload will be generated on the file system from node cfs02, and disk1 will be abruptly unplugged from that node. All the I/O generated from node cfs02 to the missing disk will be shipped to the other nodes in the cluster.
I/O Shipping is not enabled by default, so the first step is to enable it using the vxdg command:
# vxdg -g dg01 set ioship=on
Once I/O Shipping is enabled, let's simulate some workload on the file system. Because the volume uses a RAID0 layout, all five disks are used.
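The tool used to generate the workload is not shown here, so the following is only a minimal sketch of one way to produce a sustained write load, not necessarily the method used in this test. The function name generate_writes and the file name prefix ioship_test_ are hypothetical, and conv=fsync assumes GNU dd.

```shell
#!/bin/sh
# Hypothetical helper (not part of Storage Foundation):
# write COUNT files of SIZE_MB mebibytes each into TARGET_DIR.
# It returns non-zero as soon as one write fails, so an I/O error
# reaching the application would be visible immediately.
generate_writes() {
    target_dir=$1
    count=$2
    size_mb=$3
    i=1
    while [ "$i" -le "$count" ]; do
        # conv=fsync (GNU dd) flushes the data to storage before dd exits,
        # so each file really travels down the I/O path.
        dd if=/dev/zero of="$target_dir/ioship_test_$i" \
           bs=1048576 count="$size_mb" conv=fsync 2>/dev/null || {
            echo "write $i failed" >&2
            return 1
        }
        i=$((i + 1))
    done
    echo "wrote $count files"
}
```

Run on node cfs02 as, for example, generate_writes /data01 100 64. If I/O Shipping behaves as described, the loop should keep completing writes while the disk is unplugged instead of stopping with a failure.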
Now a path failure is simulated by removing the first disk from node cfs02.
The disk is now shown as local failed:
The disk has also disappeared from the OS output, but notice that the writes continue on the other disks:
Looking at the disk activity on the other nodes, we can see that they are all writing to the disk that failed locally on node cfs02.
To return to the original state, we just need to reattach the storage (or fix whatever issue caused the path to fail). Once the path is recovered, the disk is presented to the OS again and the I/O is performed locally once more:
All disks are fine now:
Therefore, I/O Shipping technology enhances service availability: no transaction is lost and no running job is broken. It allows the application to continue running, avoiding any recovery. Once the issue has been fixed, the I/O moves back to the local node, again transparently for the application.
You may have noticed a small drop in performance, given that the I/O goes through the private links. We are already working on a new release that will bring an exciting technology to avoid that performance impact. Stay tuned!