shareVM- Share insights about using VM's

Simplify the use of virtualization in everyday life

Posts Tagged ‘dedup’

EMC FAST (Fully Automated Storage Tiering) for storage savings


Chuck Hollis, VP of Global Marketing and CTO at EMC, describes FAST over three blog posts. The technology has been in beta use with several customers during 2009.

The premise

When you analyze the vast majority of application I/O profiles, you’ll realize that a small amount of data is responsible for the majority of I/Os, while almost all of the rest is infrequently accessed.

The principle

Watch how the data is being accessed, and dynamically keep the most popular, frequently accessed data (usually the small amount) on flash drives, and the vast majority of infrequently accessed data on big, slow SATA drives.
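As a rough illustration of that principle (my sketch, not EMC's algorithm; the tier names, extent granularity, and ranking policy are assumptions), a tiering engine can periodically rank chunks of data by access count and keep only the most active ones on flash:

from dataclasses import dataclass

@dataclass
class Extent:
    """A chunk of data tracked by a hypothetical tiering engine."""
    extent_id: int
    access_count: int = 0    # bumped on every read or write
    tier: str = "sata"       # where the extent currently lives

def retier(extents, flash_capacity_extents):
    """Promote the most frequently accessed extents to flash and
    demote everything else to SATA. Purely illustrative."""
    ranked = sorted(extents, key=lambda e: e.access_count, reverse=True)
    for i, extent in enumerate(ranked):
        extent.tier = "flash" if i < flash_capacity_extents else "sata"

# Example: six extents, room for only two of them on flash.
pool = [Extent(i, count) for i, count in enumerate([950, 3, 12, 870, 1, 40])]
retier(pool, flash_capacity_extents=2)
for extent in pool:
    print(extent.extent_id, extent.access_count, extent.tier)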

The storage savings solution

  • FAST: place the right information on the right media based on frequency of access.
  • Thin: thin (virtual) provisioning allocates physical storage when it is actually used, rather than when it is provisioned.
  • Small: compression, single-instancing, and data deduplication technologies eliminate information redundancies.
  • Green: a significant amount of enterprise information is used *very* infrequently. So infrequently, in fact, that the disk drives can be spun down, or at least made semi-idle.
  • Gone: policy-based lifecycle management, i.e., archiving and deletion, plus federation to the cloud through private and public cloud integration. As an option, the information can be shopped out to a specialized service provider.

 

… and life goes on!

One thing hasn’t changed, though: the information beast continues to grow.

Written by paule1s

December 11, 2009 at 9:29 am

EMC Celerra NAS/SAN Deduplication


EMC Celerra Deduplication is substantially different in concept, implementation, and benefits from the block-level deduplication offered by NetApp, Data Domain, and others in their products. To understand the differences, let us first compare the data reduction technologies:

Data reduction technologies

Technology                      Typical Space Savings    Resource Footprint
File-level deduplication        10%                      Low
Fixed-block deduplication       20%                      High
Variable-block deduplication    28%                      High
Compression                     50%                      Medium

 

  • File-level deduplication provides relatively modest space savings.
  • Fixed-block deduplication provides better space savings, but consumes more CPU to calculate hashes for each block of data, and more memory to hold the indices used to determine if a given hash has been seen before.
  • Variable-block deduplication provides slightly better space savings, but the difference is not significant when applied to file system data. It is most effective when applied to data sets that contain repeated but block-misaligned data, such as backup data in backup-to-disk or virtual tape library (VTL) environments.
  • Compression differs from file-level or block-level deduplication in the granularity at which it applies; it can be described as infinitely variable, bit-level, intra-object deduplication. It offers the greatest space savings of all the techniques listed for typical NAS data, and its resource footprint is relatively modest: it is comparatively CPU-intensive but requires very little memory.

The storage space savings realized by compression are far greater than those offered by the other techniques, and its resource requirements are quite modest by comparison. However, compression has a disadvantage: there is a potential performance “penalty” associated with decompressing the data when it is read or modified. This “penalty” can work both ways, though. Reading a compressed file can often be quicker than reading a non-compressed file, because the reduction in the amount of data that must be retrieved from disk more than offsets the additional processing required to decompress it.
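As a back-of-envelope model of that trade-off (my own illustration, not from EMC; the throughput numbers are made up), a compressed read wins whenever the disk time saved by reading fewer bytes exceeds the CPU time spent decompressing:

def read_seconds(size_mb, disk_mb_per_s):
    """Time to read a file straight off the disk."""
    return size_mb / disk_mb_per_s

def compressed_read_seconds(original_mb, ratio, disk_mb_per_s, decompress_mb_per_s):
    """Time to read the smaller compressed file plus the CPU time to expand it."""
    compressed_mb = original_mb * ratio
    return compressed_mb / disk_mb_per_s + original_mb / decompress_mb_per_s

# Illustrative numbers only: a 100 MB file compressed 2:1,
# an 80 MB/s disk, and 400 MB/s of decompression throughput.
plain = read_seconds(100, 80)                          # 1.25 s
packed = compressed_read_seconds(100, 0.5, 80, 400)    # 0.625 s + 0.25 s
print(f"uncompressed read: {plain:.3f} s, compressed read: {packed:.3f} s")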

Celerra Data Deduplication

Celerra Data Deduplication combines file-level deduplication and compression to provide maximum space savings for file system data based on

  • Frequency of file access: files that are not “new” (creation time older than a configurable threshold) and not “hot”, i.e., not in active use (access time or modification time older than a configurable threshold).
  • File size: Celerra avoids compressing files that are either so small that the anticipated space savings are minimal, or so large that decompression could degrade performance and impact file-access service levels.

Deduplication is enabled at the file system level and is transparent to access protocols. Mark Twomey‘s post provides an excellent overview of Celerra Data Deduplication.

The space reduction process

Celerra Data Deduplication has a flexible policy engine that specifies data for exclusion from processing and decides whether to deduplicate specific files based on  their age. When enabled on a file system, Celerra Data Deduplication periodically scans the file system for files that match the policy criteria and then compresses them. The compressed file data is hashed to  determine if the file has been identified before. If the compressed file data has not been identified before, it is copied into a hidden portion of the file system. The space that the file data occupied in the user portion of the file system is freed and the file’s internal metadata is updated to reference an existing copy of the data. If the data associated with the file has been identified before, the space it occupies is freed and the internal file metadata is updated. Note that Celerra detects non-compressible files and stores them in their original form. However, these files can still benefit from file-level deduplication.

Celerra Data Deduplication employs SHA-1 (Secure Hash Algorithm 1) for its file-level deduplication. SHA-1 can take a stream of data less than 2^64 bits in length and produce a 160-bit hash that is designed to be unique to the original data stream. The likelihood of two different files hashing to the same value is vanishingly small; the best known attack requires on the order of 2^69 hash operations to find a collision. Unlike compression, file-level deduplication can be disabled in Celerra Data Deduplication.
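To make the flow concrete, here is a minimal sketch of the file-level approach described above (my illustration, not EMC's implementation; the policy thresholds, helper names, and use of gzip are assumptions): compress an eligible file, hash the compressed bytes with SHA-1, and either keep the data in the hidden portion of the file system or point the file's metadata at an existing copy.

import gzip
import hashlib

store = {}        # hidden portion of the file system: SHA-1 digest -> stored bytes
file_index = {}   # per-file metadata: path -> digest of the data it references

def space_reduce(path, data, min_size=8 * 1024, max_size=200 * 1024 * 1024):
    """Compress and file-level-deduplicate one eligible file (illustrative policy)."""
    if not (min_size <= len(data) <= max_size):
        return "skipped"                    # too small to matter, or too big to decompress cheaply
    compressed = gzip.compress(data)
    if len(compressed) >= len(data):
        compressed = data                   # non-compressible: keep the original form
    digest = hashlib.sha1(compressed).hexdigest()
    if digest not in store:
        store[digest] = compressed          # first instance: keep the data
        outcome = "stored"
    else:
        outcome = "deduplicated"            # seen before: only a reference is kept
    file_index[path] = digest               # free the user-portion copy, reference the stored one
    return outcome

print(space_reduce("/fs/report_v1.doc", b"quarterly numbers " * 1000))    # stored
print(space_reduce("/fs/report_copy.doc", b"quarterly numbers " * 1000))  # deduplicated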

Designed to minimize client impact

Celerra Data Deduplication processes the bulk of the data in a file system without affecting the production workload. All deduplication processing is performed as a background asynchronous operation that acts on file data after it is written into the file system. This avoids latency in the client data path, because access to production data is sensitive to latency. By policy, deduplication is performed only for those files that are not in active use. This avoids introducing any performance penalty on the data that clients and users are using to run their business.

Written by paule1s

December 7, 2009 at 12:51 pm

NetApp features for virtualization storage savings


The feature set that gives customers storage savings is described in a 42-minute informative video, Hyper-V and NetApp Storage – Overview. I have summarized it in the five-minute read below.

Enterprise System Storage Portfolio

The enterprise product portfolio consists of the FAS series and V-Series storage systems. These systems have a unified storage architecture based on Data ONTAP, the operating system running across all storage arrays. Data ONTAP provides a single application interface and supports protocols such as FC SAN, FCoE SAN, IP SAN (iSCSI), and NAS (NFS and CIFS). The V-Series controllers also offer multi-vendor array support, i.e., they can provide the same features on disk arrays manufactured by NetApp’s competitors.

Features

  • Block-level de-duplication, or de-dupe, retains exactly one instance of each unique disk block. When applied to live production systems, it can reduce data by as much as 95% for full backups, especially when there are identical VM images created from the same template, and by 25%-55% for most data sets. (A minimal sketch of this block-level idea appears after this list.)
  • Snapshot copies of a VM are lightweight because they share disk blocks with the parent and do not require as much space as a full copy. If a disk block is updated after a snapshot is taken, e.g., when a configuration parameter is customized for an application or a patch is applied, the Write Anywhere File Layout (WAFL) file system writes the updated block to a new location on disk, leaving the original block and its references intact for the snapshot copy. Snapshot copies therefore impose negligible storage performance impact on running VMs.
  • Thin provisioning allows users to define storage pools (FlexVol volumes) whose physical storage is allocated dynamically from the storage array on demand. FlexVol can be enabled at any point while the storage system is in operation.
  • Thin replication between disks provides data protection. Differential backups and mirroring over the IP network work at the block level, copying only the changed blocks; compressed blocks are sent over the wire. This enables virtual restores of full, point-in-time data at granular levels.
  • Double-parity RAID, called RAID-DP, provides superior fault tolerance and saves roughly 46% of capacity versus mirrored data or RAID 10. You can think of it as RAID 6 (RAID 5 plus one additional parity disk). RAID-DP can lose any two disks in the RAID stripe without losing data. It offers availability equivalent to RAID 1 and allows lower-cost, higher-capacity SATA disks to be used for applications. The industry-standard best practice is to use RAID 1 for important data and RAID 5 for other data.
  • Virtual clones (FlexClone): you can clone a volume, a LUN, or individual files. Savings = size of the original data set minus the blocks subsequently changed in the clone. This eases dev and test cycles. Typical use cases: building a tree of clones (clones of clones), cloning a sysprep‘ed VHD, DR testing, and VDI.
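Here is the minimal sketch of block-level deduplication promised above (my illustration, not NetApp's code; the 4 KB block size and SHA-256 hashing are assumptions): keep exactly one copy of each unique block, keyed by its hash, and track logical versus physical bytes to measure the savings.

import hashlib
import os

BLOCK_SIZE = 4096   # 4 KB blocks, an assumption for this illustration

class BlockStore:
    """Keeps exactly one copy of each unique block, keyed by its hash."""
    def __init__(self):
        self.blocks = {}     # digest -> block data
        self.logical = 0     # bytes written by clients
        self.physical = 0    # bytes actually stored

    def write(self, data):
        refs = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block      # new block: store it once
                self.physical += len(block)
            self.logical += len(block)
            refs.append(digest)
        return refs    # a "file" is just an ordered list of block references

    def savings(self):
        return 1 - self.physical / self.logical

store = BlockStore()
golden_image = os.urandom(400 * BLOCK_SIZE)   # stand-in for a VM image built from a template
store.write(golden_image)                     # first VM: every block is new
store.write(golden_image)                     # identical clone: no new blocks stored
print(f"space saved: {store.savings():.0%}")  # ~50% for two identical images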

There are several other videos on the same site that show the setup of the storage arrays. They are worth watching to get an idea of what is involved in getting all the machinery working so you can leverage the features above. It involves many steps and seems quite complex. (The hallmark of an “Enterprise-class” product? 😉 ) The SEs have done a great job of making it seem simple. Hats off to them!

NetApp promises to reduce your virtualization storage needs by 50%


50% Storage Savings Guarantee

NetApp’s Virtualization Guarantee Program promises that you will install 50% less storage for virtualization than if you buy from their competition, provided you:

  • Engage them to plan your virtualization storage needs
  • Implement the best practices they recommend
  • Leverage features such as deduplication, thin provisioning, RAID-DP (double-parity RAID), and NetApp Snapshot copies

If you don’t end up using 50% less storage, you get the required additional capacity at no additional cost.

I learned about this in a 42-minute informative video, Hyper-V and NetApp Storage – Overview.

Written by paule1s

November 29, 2009 at 11:20 am

DropBox dedup only in the cloud


I had observed in my earlier article that DropBox performs de-duplication in the cloud. This would mean that de-duplication is not performed at the client. In order to test my hypothesis, I performed the following experiment:

I first looked at the size of the DropBox folder on Windows and found it to be 1,723,871,232 bytes.

Next, in the DropBox client, I opened the DropBox folder and simply duplicated the contents of the Public folder by copying the 1.68 GB file and pasting it as a copy. I looked at the size of the folder once again, and it had doubled to 3,446,513,664 bytes.

If DropBox had been performing dedup at the client, it should have detected the duplicate blocks between the parent and its copy at the source, and the folder should not have grown in size at all. My conclusion, therefore, is that DropBox dedupes only in the cloud, not at the client.

Wait, there’s more:

I repeated the same experiment on the Mac after deleting the duplicate file. Here’s what I started out with:

Last login: Thu Sep 17 15:30:58 on ttys000
mace1s:~ paule1s$ du -k DropBox
1152 DropBox/Photos/Sample Album
1516 DropBox/Photos
1682636 DropBox/Public
368 DropBox/sharevm
1684880 DropBox

Notice that the total size of the folder (the last line of the listing above) is 1.68GB.

Next, in the DropBox client, I opened the DropBox folder and simply duplicated the contents of the Public folder by copying the 1.68 GB file and pasting it as a copy. I looked at the size of the folder once again and saw:

mace1s:~ paule1s$ du -k DropBox
1152 DropBox/Photos/Sample Album
1516 DropBox/Photos
2600140 DropBox/Public
368 DropBox/sharevm
2602384 DropBox

This is very interesting. I had expected the storage requirement to double to 3,369,760 KB; however, it grew by only about 1 GB, to 2,602,384 KB. What happened to the remaining ~750 MB? Did the DropBox client truncate the file? If so, why?

Readers, can you shed some light?

Written by paule1s

September 17, 2009 at 5:34 pm

Compressed VM file transfer using DropBox


I am using DropBox to transfer compressed files, including VMs, between my environment at home (a Mac running Windows XP SP3 in VMware Fusion 2.0.5) and the test machine, a Windows XP SP3 system located in the office lab. Each machine has a DropBox folder linked to the same account.

Neat product!

I love the simplicity and ease of use. A lot of thought has gone into making the product easy to install, and the integration with the host OS (Windows and Mac) is seamless and sets a benchmark for how UIs for downloadable products should be designed.

Usage model

I compress each file using the Mac’s native file compression and drop it into my DropBox folder. DropBox seems to follow a two-step file transfer process:

  1. It first uploads the file completely from the source DropBox folder to the DropBox folder in the cloud
  2. After the upload is complete, the file is then downloaded from the DropBox folder in the cloud to the destination DropBox folders.

Setup

Speed ratings are from here. I have been able to correlate these speeds with the end-to-end transfer times.

Transfer Type    Speed Rating for my ISP    Observed DropBox Transfer Rate
Upload           120 KB/sec                 70 KB/sec
Download         360 KB/sec                 210 KB/sec

Near real-time transfer for uncompressed files

DropBox transfers uncompressed files almost instantaneously between the two machines. The files are transferred sequentially and seem to arrive in order. For example,  I transferred a 1.72 GB folder containing 400 photographs and the photos started appearing sequentially 10 – 15 seconds apart.

Compressed files

Compressed files are transferred as a unit, although dedup applies to the blocks contained within them. The transfer times are recorded below:

Original Size    Compressed Size    Upload Time    Download Time    Total Time
4.30 GB          1.6800 GB          6h 40m         2h 12m           8h 52m
2.15 GB          0.6714 GB          2h 27m         0h 48m           3h 15m
1.10 GB          0.2371 GB          0h 56m         0h 18m           1h 14m

Dedup works well with compressed files

DropBox examines the file to be transferred and builds an index of the blocks it contains. Its de-duplication technology is smart enough to figure out when not to transfer blocks that are duplicates, i.e., blocks that have already been transferred before. For example, when I transferred two clones, the first one took a long time (a few hours), but the second transfer was very rapid (under five minutes).
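Here is a minimal sketch of what such client-side deduplication could look like (my guess at the general technique, not DropBox's actual protocol; the block size and hashing scheme are assumptions): split the file into blocks, hash each one, and transfer only the blocks whose hashes the server has not seen before.

import hashlib
import os

BLOCK_SIZE = 4 * 1024 * 1024    # 4 MB blocks, an illustrative choice

cloud_blocks = {}               # stands in for blocks already stored in the cloud

def upload(data):
    """Upload a file, skipping blocks the cloud already has."""
    sent = skipped = 0
    manifest = []                                # ordered list of block digests
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        manifest.append(digest)
        if digest in cloud_blocks:
            skipped += 1                         # duplicate: only the hash crosses the wire
        else:
            cloud_blocks[digest] = block         # new block: actually transfer it
            sent += 1
    return manifest, sent, skipped

first_clone = os.urandom(5 * BLOCK_SIZE)         # 20 MB stand-in for a compressed VM
_, sent, skipped = upload(first_clone)
print(f"first clone:  {sent} blocks sent, {skipped} skipped")
_, sent, skipped = upload(first_clone)           # the second, identical clone
print(f"second clone: {sent} blocks sent, {skipped} skipped")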

Since I am using the free account, I deleted a 2GB VM from my DropBox folder in order to begin my next transfer. I was pleasantly surprised to see that the next VM transfer was very rapid. I suspect this was because the VM that was transferred earlier was still residing in DropBox’s cache even though I had deleted it, so that DropBox discovered common/duplicate blocks and did not upload them from my Mac.

Summary

Nifty tool. Love it. Will use it a lot.

A few feature requests

  • Subfolders: I would like to organize the files by date and category.
  • Timers: I would like to time the uploads and downloads easily.
  • Profile my usage and suggest how long an end-to-end transfer will take
  • Speed up compressed file transfers: improve my effective transfer rate from ~60% to ~80% of the rated bandwidth. I would like to saturate the available bandwidth for uploads and downloads.

Thanks 🙂

Written by paule1s

September 13, 2009 at 5:42 pm

gzip vs dedup: I shrink, therefore I am


[reposted from rosensharma.wordpress.com]

I stole “I shrink, therefore I am” from my wife’s good friend Arun Verma, who is incredibly creative, and makes some of the best lamps ever. He also does websites and ads if you are interested.

I have a MacBook and use VMware Fusion to run a Windows XP VM. I keep all my data in a folder hosted on the Mac’s operating system, so the VM is basically programs and user settings. In addition, I have several images that I work with: Red Hat Enterprise, Ubuntu, Win 2K3, etc. Not atypical of someone who either develops or tinkers with technology.

My problem is that out of a 120 GB hard disk, I am up to 100 GB used, and a whopping 60 GB of that is virtual images; I have about 8 of them. So I wanted to see if I could compress the virtual images in some fashion, and I decided to run a small test of how much dedup would buy me over gzip.

w2k3.vhd: original size 1.6 GB
w2k3.vhd.gz: 712 MB

Further analysis of the image showed that there were:

  • 14K zero-filled blocks, and
  • about 40K blocks that occurred more than once

So the in-image dedup optimization: 14K + 40K blocks ~ 200 MB.

Next I added a Windows XP image:

wxp.vhd: 2 GB
gzip wxp.vhd –> 921 MB

  • 23K zero blocks
  • 43K additional blocks repeated between this and the previous image

Dedup optimization: 66K * 4K ~ 250 MB.

Clearly gzip would win over a simple dedup. Even with two images, XP and W2K3, I guess there are just not enough duplicate blocks to make dedup shine: less than 10% of the blocks are being found. Cloning, in some sense, avoids large matches in a small set of images like the ones on a desktop.

So the obvious next question was: how about dedup + gzip? Here things got a little more interesting:

gzip + dedup on w2k3.vhd: 720 MB (yes, larger than just gzip)
gzip + dedup on wxp.vhd: 963 MB (also larger than gzip)

I was not expecting it to be larger. The raw file is not, but once you add the metadata you have to keep for the blocks, it begins to add up. It’s close to gzip + metadata, which means that gzip does a pretty good job with both the zero-filled blocks and the repeated blocks.

PS: Blocks in this context are 4 KB.
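For anyone who wants to repeat the experiment, here is a minimal sketch of the kind of scan described above (my reconstruction, not the original script; the script name is hypothetical): walk a disk image in 4 KB blocks, count the zero-filled blocks and the repeat occurrences of non-zero blocks, and estimate the in-image dedup savings. Compare the result with the size of the gzip'ed file produced separately (e.g., gzip -k w2k3.vhd).

import hashlib
import os
import sys
from collections import Counter

BLOCK = 4 * 1024        # the post's 4 KB block size

def analyze(path):
    zero_block = b"\x00" * BLOCK
    counts = Counter()
    zero_blocks = total_blocks = 0
    with open(path, "rb") as image:
        while True:
            block = image.read(BLOCK)
            if not block:
                break
            total_blocks += 1
            if block == zero_block:
                zero_blocks += 1
            else:
                counts[hashlib.sha1(block).hexdigest()] += 1
    # Blocks dedup would not have to store again: every zero block beyond the
    # first, plus every repeat occurrence of a non-zero block.
    repeats = sum(count - 1 for count in counts.values() if count > 1)
    saved_bytes = (max(zero_blocks - 1, 0) + repeats) * BLOCK
    print(f"{total_blocks} blocks, {zero_blocks} zero-filled, {repeats} repeated occurrences")
    print(f"image size: {os.path.getsize(path) / 2**20:.0f} MB, "
          f"estimated in-image dedup savings: {saved_bytes / 2**20:.0f} MB")

if __name__ == "__main__":
    analyze(sys.argv[1])    # e.g. python dedup_estimate.py w2k3.vhd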

Written by RS

September 10, 2009 at 10:39 pm