Some of the data I plan to analyze in this blog involve scale/size, including utilization. I've heard industry analysts and experts cite storage utilization rates around 40%; Nicholas Carr recently spoke at Symantec and noted that at roughly 30%, storage was much more utilized than CPU (~12%). Since running a production website, I've gained a visceral awareness of what I previously knew conceptually. Namely, storage utilization is a tradeoff between two goals: keeping spare capacity in case it is needed, and minimizing storage cost.
To look deeper into utilization, I analyzed aggregate data uploaded to SORT by customers. This analysis of file system utilization suggests that measuring only average utilization is misleading; the overall distribution gives a richer picture.
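To make the point about averages concrete, here is a minimal Python sketch. The utilization values are made up for illustration and are not the SORT data; the `histogram` helper and the 20-point bin width are likewise just assumptions chosen to keep the example small.

```python
from statistics import mean

# Hypothetical utilization percentages for ten file systems.
# Illustrative values only; these are NOT taken from the SORT data.
utilizations = [1, 2, 1, 3, 2, 98, 100, 99, 100, 97]


def histogram(values, bin_width=20):
    """Count values in fixed-width bins covering 0-100%."""
    n_bins = 100 // bin_width
    counts = [0] * n_bins
    for v in values:
        # A value of exactly 100% is placed in the last bin.
        idx = min(int(v // bin_width), n_bins - 1)
        counts[idx] += 1
    return counts


avg = mean(utilizations)
counts = histogram(utilizations)

print(f"average utilization: {avg:.1f}%")
for i, count in enumerate(counts):
    low = i * 20
    print(f"{low:3d}-{low + 20:3d}%: {'#' * count}")
```

With these made-up values the average lands near the middle of the range, yet no file system in the list is anywhere near half full. The binned counts show what the single number hides.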
As seen in Figure 1 below, storage utilization is bimodal, i.e., it has two distinct peaks, one at 1% and one at 100%. In the context of...