Sun's Comparison of VxFS and ZFS Scalability is Flawed
The comparisons with VxFS appear to be objective, but in fact the performance comparisons are chosen quite selectively. In addition, the most recent white paper contains a few significant errors.
Going through the most recent white paper from beginning to end, the first thing to strike me was a set of significant errors in the discussion of file system scalability. These include the claims that "The maximum size of a Veritas File System is 32TB", and that "Solaris ZFS uses 128-bit block addresses".
Maximum Supported File System Size
As far as VxFS is concerned, the author confuses the maximum supported file system size of VxFS with the theoretical scalability of the disk layout. While it's true that the maximum supported file system size of VxFS 4.1 on Solaris is 32 Tbyte, the author goes on to claim that this is the maximum scalability of VxFS. That's obviously incorrect since the maximum supported file system size of VxFS 5.0 on Solaris is 32 Gblock (256 Tbyte with an 8 Kbyte block size). It's also a bit strange that the author chose to cite VxFS 4.1 since VxFS 5.0 was released July 11, 2006, almost a year before the paper was finished.
The maximum supported file system size of VxFS in a given release represents the largest file system that we're confident will work in a customer environment based on our testing and the scalability of algorithms used in VxFS. It grows over time as CPUs increase in speed, memory becomes cheaper, algorithms are improved, and as customer requirements dictate. It has grown over time, and will continue to grow in the future, assuring customers the continued ability to maintain all of their data on VxFS file systems. The maximum supported file system size does not represent, in any sense, a theoretical maximum file system size.
Nevertheless, the white paper compares the maximum supported file system size for VxFS 4.1 with the theoretical scalability of the ZFS disk layout.
Although I have not contacted Sun's support organization to check, it's unlikely that the maximum supported file system size of ZFS is 2^128 blocks since it's highly improbable that Sun has actually tested a ZFS file system to anywhere near 2^128 blocks in size. If Sun hasn't tested a file system size anywhere near this large, how can they claim to support it?
Theoretical Maximum File System Size
So let's do a real comparison of "theoretical maximum file system size". The theoretical maximum file system size of VxFS, with the current version 7 disk layout, is 2^85 bytes (32 yottabytes or 32,768 zettabytes). With some knowledge of the VxFS disk layout, this is easy to calculate. In a multi-device file system, VxFS reserves 16 bits for the device number, 56 bits for the file system block number, and has a maximum file system block size of 8192 bytes (2^13 bytes). Putting these together we get a theoretical maximum file system size of 2^16 devices * 2^56 blocks/device * 2^13 bytes/block == 2^85 bytes.
(Note that going past 2^63 bytes in a file or 2^64 blocks in a file system will probably require a change to the operating system APIs, which currently use 64-bit fields to hold these numbers.)
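The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation, not anything taken from the VxFS source: the constant names are made up for illustration, and the yottabyte and zettabyte figures assume binary units (2^80 and 2^70 bytes).

```python
# Bit widths quoted in the text for the version 7 disk layout.
# The names are illustrative, not identifiers from VxFS itself.
DEVICE_BITS = 16          # bits reserved for the device number
BLOCK_NUMBER_BITS = 56    # bits reserved for the file system block number
BLOCK_SIZE_BYTES = 8192   # maximum file system block size

def log2_exact(n: int) -> int:
    """Return log2(n) for a positive power of two."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1

# Multiplying powers of two is the same as adding their exponents.
vxfs_bits = DEVICE_BITS + BLOCK_NUMBER_BITS + log2_exact(BLOCK_SIZE_BYTES)
vxfs_bytes = 1 << vxfs_bits

yottabytes = vxfs_bytes >> 80   # assuming 1 yottabyte = 2^80 bytes
zettabytes = vxfs_bytes >> 70   # assuming 1 zettabyte = 2^70 bytes

print(f"2^{vxfs_bits} bytes = {yottabytes} YB = {zettabytes:,} ZB")
# prints: 2^85 bytes = 32 YB = 32,768 ZB
```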
Of course that's just with the current disk layout. VxFS has gone through 6 revisions of its disk layout since 1989 and each time we've provided an online upgrade from the previous version(s) of the disk layout. Further, we support older versions of the disk layout for several years after we introduce a newer version, so upgrading is relatively painless for our customers.
This year has seen the first shipment of 1 Tbyte disk drives. If current trends continue (which seems unlikely), we'll see 512 Ebyte (exabyte) disk drives in about 21 years, around which time a new version of the VxFS disk layout would be required. Of course we'll probably have revised the disk layout to offer other new features before then, but this should give some feeling for the scalability of the current layout.
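For a feel of what that projection implies, here is a rough sketch of the growth rate behind it. It assumes binary units (1 Tbyte = 2^40 bytes) and takes the 512 Ebyte figure as the per-device limit of the current layout, 2^56 blocks of 2^13 bytes. The variable names are illustrative only.

```python
# Per-device limit of the current layout: 2^56 blocks * 2^13 bytes/block.
PER_DEVICE_BITS = 56 + 13     # 2^69 bytes, i.e. 512 Ebyte in binary units
CURRENT_DRIVE_BITS = 40       # 1 Tbyte = 2^40 bytes (assumed binary units)
YEARS = 21                    # time span quoted in the text

# Each doubling of drive capacity adds one bit to the exponent.
doublings = PER_DEVICE_BITS - CURRENT_DRIVE_BITS
months_per_doubling = YEARS * 12 / doublings

print(f"{doublings} doublings, one every {months_per_doubling:.1f} months")
```

In other words, the 21-year estimate corresponds to drive capacity doubling a little faster than once every nine months.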
ZFS is Not Quite 128 Bits
Sun's paper contains another error when it claims that the scalability of ZFS is "128 bits". While I've already discussed the mistake of comparing the maximum supported file system size of a VxFS file system to the theoretical maximum scalability of ZFS, it appears that the maximum size of a ZFS file system is a good deal less than the 2^128 blocks that is claimed. Based on this description of the ZFS disk layout, a block pointer consists of a 32 bit device number, a 63 bit block offset (number), and some other information (see the description of block pointers at the start of Chapter 2 or look at the definition of blkptr_t in /usr/src/uts/common/fs/zfs/sys/spa.h in the Open Solaris source code).
Since ZFS block offsets are always in units of 512 bytes, this means the maximum size of a ZFS file system is 2^32 devices * 2^63 blocks * 2^9 bytes/block == 2^104 bytes. This is not exactly the 2^128 blocks claimed in the white paper.
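The same kind of check works for the ZFS figure. The sketch below uses only the field widths quoted above. The constant names are made up for illustration and are not identifiers from spa.h, and the comparison assumes the "blocks" in the 2^128 claim are the same 512-byte units.

```python
# Field widths as described in the text; names are illustrative only.
ZFS_DEVICE_BITS = 32     # device number in a block pointer
ZFS_OFFSET_BITS = 63     # block offset in a block pointer
ZFS_SECTOR_SHIFT = 9     # offsets are in units of 512 bytes (2^9)
CLAIMED_BLOCK_BITS = 128 # "2^128 blocks" from the white paper

# Addressable 512-byte blocks, then addressable bytes.
zfs_block_bits = ZFS_DEVICE_BITS + ZFS_OFFSET_BITS
zfs_byte_bits = zfs_block_bits + ZFS_SECTOR_SHIFT

# How far the addressable block count falls short of the claim.
shortfall_bits = CLAIMED_BLOCK_BITS - zfs_block_bits

print(f"addressable: 2^{zfs_block_bits} blocks = 2^{zfs_byte_bits} bytes")
print(f"short of the claimed 2^{CLAIMED_BLOCK_BITS} blocks by a factor of 2^{shortfall_bits}")
# prints: addressable: 2^95 blocks = 2^104 bytes
# prints: short of the claimed 2^128 blocks by a factor of 2^33
```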
Now, 2^104 bytes for ZFS is still a lot more than the 2^85 bytes for VxFS, but for all practical purposes they're the same: larger than anything required for the foreseeable future. And, frankly, I think allowing ZFS 2^104 bytes is overly generous, since that includes an assumption of 4 billion devices. I have a difficult time imagining more than 100,000 disk devices in a data center. If we limit ZFS to 131,072 devices (2^17 devices), then the maximum file system size drops to 2^89 bytes (2^17 devices * 2^63 blocks * 2^9 bytes/block), which is pretty darn close to VxFS.
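To see how the device count drives the comparison, the sketch below recomputes the ZFS limit for a restricted number of devices and sets it against the VxFS figure. Summing the exponents for 2^17 devices gives 17 + 63 + 9 = 89. The function name and constants are hypothetical, chosen only for this illustration.

```python
ZFS_OFFSET_BITS = 63     # block offset bits, as described above
ZFS_SECTOR_SHIFT = 9     # 512-byte units
VXFS_BITS = 16 + 56 + 13 # VxFS version 7 layout: 2^85 bytes

def zfs_size_bits(device_count: int) -> int:
    """log2 of the ZFS size limit for a power-of-two device count."""
    assert device_count > 0 and device_count & (device_count - 1) == 0
    device_bits = device_count.bit_length() - 1
    return device_bits + ZFS_OFFSET_BITS + ZFS_SECTOR_SHIFT

full_bits = zfs_size_bits(2**32)       # every device number in use
limited_bits = zfs_size_bits(131_072)  # 2^17 devices
ratio = 1 << (limited_bits - VXFS_BITS)

print(f"all devices: 2^{full_bits} bytes")
print(f"131,072 devices: 2^{limited_bits} bytes, {ratio}x the VxFS limit")
# prints: all devices: 2^104 bytes
# prints: 131,072 devices: 2^89 bytes, 16x the VxFS limit
```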
Sun has made a number of comparisons between ZFS and VxFS. In the area of scalability, Sun has considerably exaggerated any differences that might exist between the two file systems. In subsequent blog entries I'll look at some of the other issues Sun raises, particularly performance.
Much like benchmark results, claims of scalability need to be examined carefully and treated with skepticism.