gzip vs dedup: I shrink, therefore I am

I stole “I shrink, therefore I am” from my wife’s good friend Arun Verma, who is incredibly creative, and makes some of the best lamps ever. He also does websites and ads if you are interested.

I have a MacBook and use VMware Fusion to run a Windows XP VM. I keep all my data in a hosted folder on the Mac’s operating system, so the VM is basically programs and user settings. In addition I have several images which I work with: Red Hat Enterprise, Ubuntu, Win 2K3 etc. Not atypical of someone who either develops or tinkers with technology.

My problem is that out of a 120GB hard disk, I am up to 100GB, and a whopping 60GB of that is virtual images. I have about 8 of them. So I wanted to see if I could compress the virtual images in some fashion. I decided to run a small test of how much dedup would buy me over gzip.

w2k3.vhd: Original size: 1.6GB
w2k3.vhd.gz: 712 MB

Further analysis of the image showed that there were
14K zero-filled blocks, and
about 40K blocks that occurred more than once

So the in-image dedup optimization: (14K + 40K) blocks * 4K ~ 200MB
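
A minimal Python sketch of this kind of block analysis, assuming fixed 4K blocks and SHA-1 digests to spot repeats. The function name `block_stats` and the way savings are counted (every zero block, plus every copy of a block after its first) are illustrative assumptions, not the tool used for the numbers above.

```python
import hashlib
import io
from collections import Counter

BLOCK_SIZE = 4096  # 4K blocks, as used in this post


def block_stats(stream, block_size=BLOCK_SIZE):
    """Count zero-filled and repeated fixed-size blocks in a binary stream."""
    zero_block = bytes(block_size)
    seen = Counter()  # digest of each non-zero block -> number of occurrences
    total = zero = 0
    for block in iter(lambda: stream.read(block_size), b""):
        total += 1
        if block == zero_block:
            zero += 1
        else:
            seen[hashlib.sha1(block).digest()] += 1
    # Every occurrence after the first copy of a block is redundant.
    duplicates = sum(n - 1 for n in seen.values())
    return {
        "total": total,
        "zero": zero,
        "duplicates": duplicates,
        "saved_bytes": (zero + duplicates) * block_size,
    }


if __name__ == "__main__":
    # Tiny synthetic "image": 3 zero blocks, one block stored twice, one unique block.
    a, b = b"\x01" * BLOCK_SIZE, b"\x02" * BLOCK_SIZE
    image = bytes(BLOCK_SIZE) * 3 + a + a + b
    print(block_stats(io.BytesIO(image)))
```

For a real image, pass a file opened in binary mode (for example `open("w2k3.vhd", "rb")`) instead of the in-memory stream.
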
Next I added a Windows XP image:
wxp.vhd: 2GB
wxp.vhd.gz: 921 MB
23K zero blocks
43K additional blocks repeated between this and the previous image
Dedup optimization: 66K * 4K ~ 250MB

Clearly gzip wins over a simple dedup. Even with two images, XP and W2K3, I guess there are just not enough blocks to make dedup shine. Less than 10% of the blocks are being found. Cloning in some sense avoids large matches in a small set of images like on the desktop.
So the obvious next question was: well, how about dedup + gzip? Here things got a little more interesting:
gzip + dedup on w2k3.vhd: 720 MB (yes larger than just gzip)
gzip + dedup on wxp.vhd: 963 MB (also larger than gzip)
I was not expecting it to be larger. The raw file is not, but if you add the metadata you have to keep for the blocks, it begins to add up. It's close to gzip + metadata, which means that gzip already does a pretty good job with zero-filled blocks and also the repeated blocks.
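
To make the "gzip + metadata" accounting concrete, here is a rough Python sketch of one way to do dedup followed by gzip. The layout is an assumption for illustration only: unique 4K blocks are concatenated and gzipped, and the metadata is a hypothetical block map with one 4-byte index per original block. The actual metadata format behind the numbers above is not described, so the sizes will differ.

```python
import gzip
import hashlib
import struct

BLOCK_SIZE = 4096  # 4K blocks


def gzip_size(data):
    """Size in bytes of the data after plain gzip compression."""
    return len(gzip.compress(data))


def dedup_then_gzip(data, block_size=BLOCK_SIZE):
    """Return (compressed_payload_bytes, metadata_bytes) for dedup + gzip."""
    unique = []     # first occurrence of each distinct block, in order
    index_of = {}   # block digest -> position in `unique`
    block_map = []  # one entry per original block: which unique block it is
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        key = hashlib.sha1(block).digest()
        if key not in index_of:
            index_of[key] = len(unique)
            unique.append(block)
        block_map.append(index_of[key])
    payload = gzip.compress(b"".join(unique))
    # Hypothetical metadata: a 4-byte little-endian index per original block.
    metadata = struct.pack(f"<{len(block_map)}I", *block_map)
    return len(payload), len(metadata)


if __name__ == "__main__":
    # Synthetic data: 100 zero blocks, which both approaches handle well.
    data = bytes(BLOCK_SIZE) * 100
    payload, metadata = dedup_then_gzip(data)
    print("gzip only:    ", gzip_size(data))
    print("dedup + gzip: ", payload + metadata, f"(metadata {metadata})")
```

The point of returning the two sizes separately is that the block map is a fixed cost per block whether or not it is a duplicate, while gzip alone pays nothing of the kind.
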
PS: Blocks in this context are 4K

Written by RS

September 10, 2009 at 10:39 pm

