Compression - A short explanation
Every now and then, an aggrieved user feels that a tape has not delivered its full measure. For example, the label on a box of LTO3 tapes specifies 400GB, yet the user is not getting anything close to this. To add to the confusion, LTO tapes are often marketed with two capacities, e.g., LTO3 would be marketed as 800GB/400GB. What do all these numbers mean?
In the example above, 400GB is the native capacity of the LTO3 tape, i.e. it can hold 400GB of uncompressed data. The tape can store 800GB of data only if the data compresses to half its original size, which is a compression ratio of 2:1. This level of compression is quite difficult to achieve. Typically, with a mix of file types, one would expect a compression ratio of about 1.2:1 to 1.3:1.
How does the type of data affect the compression ratio?
To understand why some types of files compress better than others, we will use a simple scheme to compress data. Suppose we have to compress this string of 35 characters
We can put a digit in front of each character to denote the number of such consecutive characters in the string. We then end up with this string
which is only 10 characters long. We have achieved a compression ratio of 35:10 or 3.5:1, which is very good. If we were to use this scheme to compress this second string, we would get
which is 20 characters long and the compression ratio is 10:20 or 0.5:1. We ended up with a string which is double the size of what we started with.
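The scheme described above is a form of run-length encoding, and it can be sketched in a few lines of Python. The input string below is made up purely to match the lengths discussed (35 characters in five runs); capping each run at 9 so that every count stays a single digit is also an assumption of this sketch.

```python
from itertools import groupby


def rle_encode(text: str) -> str:
    """Prefix each run of identical characters with its length."""
    out = []
    for char, run in groupby(text):
        n = len(list(run))
        # Assumption: split long runs so every count is a single digit.
        while n > 9:
            out.append(f"9{char}")
            n -= 9
        out.append(f"{n}{char}")
    return "".join(out)


# Hypothetical input, chosen only to match the lengths in the text.
first = "A" * 9 + "B" * 7 + "C" * 6 + "D" * 8 + "E" * 5  # 35 characters
once = rle_encode(first)   # 10 characters
twice = rle_encode(once)   # compressing the compressed string: 20 characters

for original, packed in ((first, once), (once, twice)):
    ratio = len(original) / len(packed)
    print(f"{len(original)} -> {len(packed)} characters, ratio {ratio:.1f}:1")
```

Feeding the output of the first pass back into the same function is what doubles the size: no character repeats any more, so every character gains a "1" in front of it.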
In terms of LTO3 tapes, a 3.5:1 compression ratio would let us cram 1.4TB of data onto one tape, while a 0.5:1 compression ratio means that one tape can hold only 200GB of data. The former compression ratio is rarely achieved, but the latter is not so uncommon.
We have just seen the effect of compressing already-compressed data. It can result in a bigger file than what we started with. This phenomenon applies to all compression algorithms, regardless of their sophistication.
What kinds of compressed files are we likely to back up? In addition to the obvious zipped files, audio-visual files such as MP3s, AVIs and JPEGs are already compressed and will compress badly. If you are unsure how well a particular file will be compressed by BE, just zip it up and compare the file sizes before and after. The compression ratio may not be the same as that achieved by BE because different compression algorithms are used, but it should give you a good indication of how well the file compresses when backed up with BE.
BE does not report compression ratios of less than 1:1, which you will get when the compressed version of a file is bigger than the original file. The compression ratio in such cases is reported as 1:1.
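The arithmetic behind these figures is simply native capacity multiplied by the compression ratio. A minimal sketch, using the 400GB LTO3 native capacity from the text (the function name is just for illustration):

```python
NATIVE_GB = 400  # LTO3 native capacity, as quoted in the text


def effective_capacity_gb(ratio: float, native_gb: float = NATIVE_GB) -> float:
    """Amount of source data that fits on one tape at a given compression ratio."""
    return native_gb * ratio


for ratio in (2.0, 3.5, 0.5):
    print(f"{ratio}:1 -> {effective_capacity_gb(ratio):.0f} GB of source data per tape")
```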
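If you would rather script the "zip it and compare" test, something along these lines should do. It uses Python's zlib module (DEFLATE, the algorithm behind most zip files), which is not the algorithm BE or a tape drive uses, so treat the result as a rough indication only. The function name is hypothetical.

```python
import sys
import zlib
from pathlib import Path


def estimate_ratio(path: str, chunk_size: int = 1 << 20) -> float:
    """Return original size / compressed size for a file, using DEFLATE."""
    compressor = zlib.compressobj()
    original = compressed = 0
    with Path(path).open("rb") as fh:
        # Read in chunks so that large files need not fit in memory.
        while chunk := fh.read(chunk_size):
            original += len(chunk)
            compressed += len(compressor.compress(chunk))
    compressed += len(compressor.flush())
    if original == 0:
        return 1.0  # an empty file has nothing to compress
    return original / compressed


if __name__ == "__main__":
    # Usage: python estimate_ratio.py file1 [file2 ...]
    for name in sys.argv[1:]:
        print(f"{name}: {estimate_ratio(name):.2f}:1")
```

A result well above 1:1 suggests the file should compress nicely when backed up, while a result at or just below 1:1 suggests it is already compressed.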
Software vs. Hardware Compression
Hardware compression is only available if you are using a tape drive to store your backup data. If you are using a B2D folder to store your backup data, then you should use software compression because hardware compression is not available.
When software compression is used, the data is compressed on whatever machine it is on before it is sent to the media. This means that if the data is on a remote server, it will be compressed on the remote server before it is sent to the media server. This increases the CPU load on the remote server, but it saves on bandwidth between the remote and media servers.
On the other hand, hardware compression is done by the tape drive. Hence, there is no CPU load on either the remote server or the media server. However, uncompressed data is sent from the remote server to the media server, which requires more bandwidth than software compression does.
Compression and Encryption
The basis of all compression algorithms is finding and eliminating redundant patterns in data. The more often patterns repeat in a file, the better it compresses. To foil frequency analysis, a good encryption algorithm randomizes the data in the final encrypted file as much as possible, so that patterns no longer repeat. Thus encrypted data compresses very badly.
LTO3 and earlier drives do not support hardware encryption. A common mistake is to specify software encryption together with hardware compression. By the time the encrypted data reaches the tape drive, there is nothing left to compress. For LTO3, if you require both encryption and compression, then both have to be done in software to achieve any compression. The data is compressed and then encrypted on the server. The result is that, regardless of whether you choose hardware or software compression, the compression ratio of the data on the tape is 1:1, since what reaches the drive is already encrypted.
Hardware encryption is available for LTO4 and above. For these drives, you can specify both hardware compression and hardware encryption. The raw data is sent to the tape drive, where it is compressed and then encrypted. BE compares the amount of raw data sent to the tape drive with the amount of data written to tape, and from these it can calculate the compression ratio.
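You can see this effect for yourself with a quick experiment. The sketch below compresses a highly repetitive buffer and a buffer of random bytes of the same size. The random bytes are only a stand-in for the output of a good encryption algorithm (no actual encryption is performed here), and zlib is used rather than whatever BE or the tape drive uses.

```python
import os
import zlib


def ratio(data: bytes) -> float:
    """Original size divided by DEFLATE-compressed size."""
    return len(data) / len(zlib.compress(data))


repetitive = b"backup " * 100_000           # the same pattern over and over
random_like = os.urandom(len(repetitive))   # stand-in for well-encrypted data

print(f"repetitive data : {ratio(repetitive):.1f}:1")
print(f"random-like data: {ratio(random_like):.3f}:1")
```

The repetitive buffer shrinks dramatically, while the random-looking one does not shrink at all and typically comes out very slightly larger.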
Same job, different compression ratio
You may encounter a situation where a backup job spans a couple of tapes and the compression ratio is different for each tape. This can happen because the compression ratio is calculated from the amount of data sent to the tape (circled in green in the diagram below) and the amount of data that is written onto the tape (circled in red).
If the job is big and backs up many resources, then it is conceivable that all the databases are backed up to the first tape while the second tape contains mainly audio-visual files. Databases are mainly text, which compresses very well and gives a high compression ratio, while audio-visual files compress badly. If you check the statistics, the first and second tapes will have different compression ratios although they are the product of the same job.
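To make the per-tape calculation concrete, here is a small sketch that works out the ratio for each tape as data sent divided by data written. The byte counts are invented for illustration and do not come from any real job log, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TapeStats:
    label: str
    bytes_sent: int     # data sent to the drive for this tape
    bytes_written: int  # data actually written onto the tape

    @property
    def ratio(self) -> float:
        return self.bytes_sent / self.bytes_written


GB = 10**9

# Hypothetical figures for one job that spans two tapes.
tapes = [
    TapeStats("tape 1 (mostly databases)", 720 * GB, 400 * GB),
    TapeStats("tape 2 (mostly audio-visual)", 300 * GB, 300 * GB),
]

for tape in tapes:
    print(f"{tape.label}: {tape.ratio:.2f}:1")

total_sent = sum(t.bytes_sent for t in tapes)
total_written = sum(t.bytes_written for t in tapes)
print(f"whole job: {total_sent / total_written:.2f}:1")
```

With these made-up numbers, the same job shows a high ratio on the first tape and 1:1 on the second, with the figure for the whole job falling in between.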
I hope that I have explained some of the mysteries surrounding compression. This article is by no means exhaustive, but it should be a good start to understanding compression.