
Everything you think you know about clustering is wrong: If your clustering is expensive, you're doing it wrong

Created: 13 Apr 2012 • Updated: 11 Jun 2014 • 4 comments
A couple days ago, I blogged about the related myths of complexity and unreliability regarding high availability (HA) clustering. Today I want to spend a little time on the myth that clustering is expensive.
 
Early high availability clustering was as simple as it was primitive. Shared storage was generally limited to two nodes via dual-attached SCSI disks, and communication between nodes typically consisted of each node just pinging the other periodically to check its state. If the standby node decided the active node was dead, it would respond to that failure by firing off local copies of the failed node's startup scripts to restart applications that had been running there.
 
But SAN and NAS technologies which allowed many more nodes to share a common storage pool rendered this approach obsolete, and HA cluster vendors responded by supporting cluster sizes greater than two nodes. This required that the cluster solution become much more sophisticated, and cluster vendors took divergent approaches to that. At Symantec (Veritas at the time), we responded by replacing our FirstWatch product with Veritas Cluster Server (VCS), designed and built from the ground up with larger, more modern clusters in mind. Simple ping heartbeats were replaced by more sophisticated cluster communications carrying cluster membership and resource state change messages, delivered to a sophisticated HA engine executing in lock-step on all nodes in the cluster. This approach allowed us to support cluster sizes of up to 32 nodes, and the introduction of Service Group Workload Management ensured VCS would make intelligent failover target selections in a large cluster.
 
While large clusters meant we no longer needed two nodes for every critical application, many people's mindsets were still stuck in 1997 and they continued to view HA clustering in a 2-node, active/passive context. Well, if you deploy two servers for every critical app, of course clustering is expensive.
 
Here's a quick example of how larger clusters can deliver an economy of scale and reduce the cost of HA...
 
Back in the days before Symantec and Veritas Software merged, I was a high availability solutions architect in the professional services group at Veritas, and a frequent customer of mine had a voracious appetite for Veritas Cluster Server. They were in rapid expansion mode at the time, and every time they rolled out a new customer-facing application, their standards called for it to be deployed in a VCS cluster. It seemed I was visiting this customer every other month to deploy a new 2-node VCS cluster, and over time they'd built up quite the collection of 2-node, active/passive clusters. Something like 30 of them, if I recall correctly.
 
When it came time for a technology refresh, the server engineers and architects cringed at the thought of replacing a bunch of server hardware that had spent nearly all its time idle, so they asked me back in to assist with rearchitecting their HA environment. When we were done, we had reduced their 30 or so clusters down to around half a dozen, with each cluster consisting of five to seven nodes, including a single spare. The result of all this was a reduction of around 20 clustered servers. As I mentioned earlier, VCS supports cluster sizes of up to 32 nodes and I've seen customers deploy clusters that size or close to it, allowing them to include just 2 or 3 spares in the cluster.
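If you want to see the back-of-the-envelope math, here's a quick sketch in Python. The node counts are just the rough figures from that engagement, so treat it as illustrative rather than exact:

```python
# Rough server-count math for the consolidation described above.
# All figures are illustrative, taken from the story in this post.
before_clusters = 30             # roughly 30 two-node, active/passive clusters
before_servers = before_clusters * 2

after_clusters = 6               # consolidated to about half a dozen clusters...
nodes_per_cluster = 7            # ...of five to seven nodes each, including one spare
after_servers = after_clusters * nodes_per_cluster

print("Servers before:", before_servers)                  # 60
print("Servers after: ", after_servers)                   # 42 with these assumptions
print("Reduction:     ", before_servers - after_servers)  # around 20 fewer servers
```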
 
HA clustering is very much like insurance: yes, it's an added cost, but people deploy HA for much the same reason they buy insurance or install redundant power sources in their data center...the cost of not having it when it's needed is unacceptable.
 
In my next post, we'll take a look at how VCS has evolved over time into an operational management tool.

4 Comments

Glen B:

Eric, you are on target about the evolution of clustering. Both old and new customers need to review what their BC objectives are and look at what they can do to meet those objectives. I look forward to your next blog on VCS and hope that you throw in some VOM awareness.

Glen Bellomy                                     

Eric.Hennessey:

Thanks, Glen!

My latest post includes a bit about VOM, and I'll have more posts coming up drilling down into some VOM specifics in the near future.

Cheers!

Business Continuity Solutions Evangelist

Seann Herdejurgen:

N+1 clustering is what it's all about.  As you said, most people think of two-node active/passive clusters where N=1.

Here's a real ROI example of why you should grow your capacity horizontally instead of vertically. Let's say you have a total capacity of 16 CPUs and 256GB of memory for running multiple database instances, and that capacity is split evenly between two servers (i.e. two 8 CPU / 128GB systems). Using this configuration, you should only plan to use 8 CPUs and 128GB of memory so you can sustain the failure of a single node in the cluster.

Let's say you need to double your database capacity.  One way would be to scale vertically and upgrade each database server to 16 CPUs and 256GB of memory, for a total capacity of 32 CPUs and 512GB of memory.  Another way to double your capacity is to scale horizontally and add a third node (8 CPU / 128GB) to your cluster, for a total capacity of 24 CPUs and 384GB of memory.  In either scenario, if a single node fails, you still have a total capacity of 16 CPUs and 256GB of memory available to run multiple database instances.
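If you like to see it in code, here's a minimal Python sketch of that comparison. The helper function is just for illustration; "capacity" here is simply CPU count and GB of RAM:

```python
# Compare vertical vs. horizontal scaling for the database example above.
def surviving_capacity(nodes, cpus_per_node, mem_gb_per_node):
    """Capacity still available after losing one node (N+1 planning)."""
    return (nodes - 1) * cpus_per_node, (nodes - 1) * mem_gb_per_node

# Vertical: stay at two nodes, but double each to 16 CPUs / 256 GB
print(surviving_capacity(2, 16, 256))   # (16, 256) -- 32 CPUs / 512 GB purchased

# Horizontal: add a third node, each still 8 CPUs / 128 GB
print(surviving_capacity(3, 8, 128))    # (16, 256) -- only 24 CPUs / 384 GB purchased
```

Either way you can survive a single node failure with 16 CPUs and 256GB still available, but the horizontal option gets there with less total hardware.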

Using a horizontal approach to clustering allows you to increase the utilization on each of your nodes while still retaining the capacity necessary to support your business. As an example, in a 2-node cluster you need to make sure your maximum utilization is 50%, but in a 3-node cluster you can increase your utilization to 67%, and in general an N-node cluster lets you run at up to (N-1)/N of capacity. Here's a quick reference so you don't have to do the math in your head (or see the short sketch after the table if you'd rather compute it):

Cluster Size    Maximum Utilization
2-node          50%
3-node          67%
4-node          75%
5-node          80%
6-node          83%
7-node          86%
8-node          88%
9-node          89%
10-node         90%
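And if you'd rather generate that table than memorize it, it boils down to a couple of lines of Python (just a sketch of the (N-1)/N formula above):

```python
# Maximum safe utilization with one node's worth of failover headroom: (N - 1) / N
for n in range(2, 11):
    print(f"{n}-node  {(n - 1) / n:.0%}")
```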

When a database vendor licenses their software by total available CPU capacity, growing the cluster by adding nodes rather than scaling up each node means fewer overall CPUs to license, and it raises your maximum usable capacity utilization while maintaining high availability. Scaling horizontally does require additional infrastructure hardware and software licensing, but depending on your situation it may be more advantageous to go that route.

If you want even higher availability, then you can go with N+2 clustering, but that is a story for another day...

Eric.Hennessey:

Nice way to illustrate the case, Seann.

You mention N+2, which is something I've seen in larger clusters, typically where N is more than 10 or 12. With that many systems, it just makes sense to have more than one spare.

Business Continuity Solutions Evangelist
