Everything you think you know about clustering is wrong: If your clustering is expensive, you're doing it wrong
Created: 13 Apr 2012 | Updated: 16 Apr 2012
A couple of days ago, I blogged about the related myths that high availability (HA) clustering is complex and unreliable. Today I want to spend a little time on the myth that clustering is expensive.
Early high availability clustering was as simple as it was primitive. Shared storage was generally limited to two nodes via dual-attached SCSI disks, and communication between nodes typically consisted of each node pinging the other periodically to check its state. If the standby node decided the active node was dead, it would respond to that failure by running local copies of the failed node's startup scripts to restart the applications that had been running there.
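The failover decision in that primitive scheme boils down to counting consecutive missed heartbeats. Here is a minimal sketch in Python; the missed-ping threshold and the idea of feeding the logic a recorded list of ping results are my own illustrative assumptions, not how any particular product implemented it:

```python
def should_fail_over(ping_results, missed_limit=3):
    """Decide whether the standby should take over.

    ping_results: sequence of booleans, one per heartbeat interval,
    True if the active node answered the ping. The standby declares
    the active node dead (and triggers failover) only after
    `missed_limit` consecutive missed pings, so a single dropped
    packet doesn't cause a spurious takeover.
    """
    missed = 0
    for alive in ping_results:
        missed = 0 if alive else missed + 1
        if missed >= missed_limit:
            return True  # peer presumed dead: run its startup scripts locally
    return False

# One transient drop is tolerated; three misses in a row trigger failover.
print(should_fail_over([True, False, True, True]))    # False
print(should_fail_over([True, False, False, False]))  # True
```

Note the weakness this exposes: the standby can't distinguish a dead peer from a broken heartbeat link, which is exactly why later cluster solutions needed richer membership protocols.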
But SAN and NAS technologies which allowed many more nodes to share a common storage pool rendered this approach obsolete, and HA cluster vendors responded by supporting cluster sizes greater than two nodes. This required that the cluster solution become much more sophisticated, and cluster vendors took divergent approaches to that. At Symantec (Veritas at the time), we responded by replacing our FirstWatch product with Veritas Cluster Server (VCS), designed and built from the ground up with larger, more modern clusters in mind. Simple ping heartbeats were replaced by more sophisticated cluster communications carrying cluster membership and resource state change messages, delivered to a sophisticated HA engine executing in lock-step on all nodes in the cluster. This approach allowed us to support cluster sizes of up to 32 nodes, and the introduction of Service Group Workload Management ensured VCS would make intelligent failover target selections in a large cluster.
While large clusters meant we no longer needed two nodes for every critical application, many people's mindsets were still stuck in 1997 and they continued to view HA clustering in a 2-node, active/passive context. Well, if you deploy two servers for every critical app, of course clustering is expensive.
Here's a quick example of how larger clusters can deliver economies of scale and reduce the cost of HA.
Back in the days before Symantec and Veritas Software merged, I was a high availability solutions architect in the professional services group at Veritas, and a frequent customer of mine had a voracious appetite for Veritas Cluster Server. They were in rapid expansion mode at the time, and every time they rolled out a new customer-facing application, their standards called for it to be deployed in a VCS cluster. It seemed I was visiting this customer every other month to deploy a new 2-node VCS cluster, and over time they'd built up quite the collection of 2-node, active/passive clusters. Something like 30 of them, if I recall correctly.
When it came time for a technology refresh, the server engineers and architects cringed at the thought of replacing a bunch of server hardware that had spent nearly all its time idle, so they asked me back in to assist with rearchitecting their HA environment. When we were done, we had reduced their 30 or so clusters down to around half a dozen, with each cluster consisting of five to seven nodes, including a single spare. The result of all this was a reduction of around 20 clustered servers. As I mentioned earlier, VCS supports cluster sizes of up to 32 nodes and I've seen customers deploy clusters that size or close to it, allowing them to include just 2 or 3 spares in the cluster.
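The back-of-the-envelope arithmetic behind that consolidation is easy to sketch. The exact node counts below are my own illustrative choices, picked to roughly match the figures above (30 two-node clusters before; about six clusters of six to seven nodes, one spare each, after):

```python
def servers_active_passive(num_apps):
    """Each app in its own 2-node cluster: one active node, one standby."""
    return num_apps * 2

def servers_consolidated(active_per_cluster, spares_per_cluster, num_clusters):
    """Apps packed into larger clusters that share a few spare nodes."""
    return num_clusters * (active_per_cluster + spares_per_cluster)

before = servers_active_passive(30)          # 30 clusters x 2 nodes = 60 servers
after = servers_consolidated(6, 1, 6)        # 6 clusters x 7 nodes = 42 servers
print(before - after)                        # 18 fewer servers, same protection
```

The saving comes entirely from standby capacity: the active/passive layout dedicates one idle server per application, while the consolidated layout amortizes one spare across every application in the cluster.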
HA clustering is very much like insurance: yes, it's an added cost, but people deploy HA for much the same reasons they buy insurance or install redundant power sources in their data center. The cost of not having it when it's needed is unacceptable.
In my next post, we'll take a look at how VCS has evolved over time into an operational management tool.