Dedupe ratio predictor
I'm thinking of writing a small application in C++ (so that it is portable between Unix and Windows) that traverses the backup file tree and computes an MD5 hash for each block, or for the whole file if it is smaller than a block. The block size would be a parameter.
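To make the scanning step concrete, here is a rough sketch, assuming C++17 and std::filesystem. The names hash_blocks, scan_tree and fnv1a64 are made up for illustration. The C++ standard library has no MD5, so a 64-bit FNV-1a hash stands in for it here only to keep the sketch self-contained; a real run would swap in an actual MD5 implementation, because a 64-bit non-cryptographic hash can produce false matches on large data sets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// ASSUMPTION: stand-in for MD5. 64-bit FNV-1a, not collision-resistant.
std::uint64_t fnv1a64(const char* data, std::size_t len) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 1099511628211ULL;                  // FNV-1a 64-bit prime
    }
    return h;
}

// Hash a stream in fixed-size blocks. A file smaller than one block, or a
// short final block, is hashed as is. An empty stream yields no hashes.
std::vector<std::uint64_t> hash_blocks(std::istream& in, std::size_t block_size) {
    assert(block_size > 0);
    std::vector<std::uint64_t> hashes;
    std::vector<char> buf(block_size);
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(block_size));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;
        hashes.push_back(fnv1a64(buf.data(), static_cast<std::size_t>(got)));
    }
    return hashes;
}

// Walk the tree under root and collect the block hashes of every regular
// file that can be opened. Unreadable directories and files are skipped.
std::vector<std::uint64_t> scan_tree(const fs::path& root, std::size_t block_size) {
    std::vector<std::uint64_t> all;
    const auto opts = fs::directory_options::skip_permission_denied;
    for (const auto& entry : fs::recursive_directory_iterator(root, opts)) {
        if (!entry.is_regular_file()) continue;
        std::ifstream file(entry.path(), std::ios::binary);
        if (!file) continue;
        const auto h = hash_blocks(file, block_size);
        all.insert(all.end(), h.begin(), h.end());
    }
    return all;
}
```

Calling scan_tree("/path/to/backup", 4096) would return one hash per block; printing them one per line gives a file that Excel or awk can read. Note that fixed-size blocks only detect duplicates that sit at the same block offset, so the estimate may come out lower than what a product using variable-size chunking would achieve.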
There is no need for incremental runs, since the tool only computes hashes, but version 2 could add them using some free, portable DB.
All the hashes would then be processed with either Excel or awk to produce a histogram, which should give some idea of the expected dedupe ratio.
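For reference, the histogram step can also be sketched in C++ instead of Excel or awk. This is only an illustration under a few assumptions: the hashes arrive as text strings (for example hex digests, one per block), the names DedupeEstimate and estimate_dedupe are invented here, and the ratio is taken as total blocks divided by distinct blocks, which ignores short final blocks, compression and any per-block metadata a real dedupe product adds.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

struct DedupeEstimate {
    std::size_t total_blocks = 0;   // every block seen
    std::size_t unique_blocks = 0;  // distinct hash values
    // histogram[n] = number of distinct hashes that occurred exactly n times
    std::map<std::size_t, std::size_t> histogram;

    // Estimated ratio in blocks: total / unique (1.0 when there is no data).
    double ratio() const {
        if (unique_blocks == 0) return 1.0;
        return static_cast<double>(total_blocks) / static_cast<double>(unique_blocks);
    }
};

// Count how often each hash occurs, then bucket the hashes by that count.
DedupeEstimate estimate_dedupe(const std::vector<std::string>& hashes) {
    std::unordered_map<std::string, std::size_t> counts;
    for (const auto& h : hashes) ++counts[h];

    DedupeEstimate est;
    est.total_blocks = hashes.size();
    est.unique_blocks = counts.size();
    for (const auto& kv : counts) ++est.histogram[kv.second];
    assert(est.total_blocks >= est.unique_blocks);  // sanity check
    return est;
}
```

For example, six blocks whose hashes are a, b, a, c, a, b contain three distinct values, so the estimated ratio is 6 / 3 = 2, and the histogram has one hash seen once, one seen twice and one seen three times.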
I wonder whether anybody has done this already, or whether you think something is wrong with this idea.
I'd appreciate your input.