
EMC Celerra NAS SAN Deduplication


EMC Celerra Deduplication is substantially different in concept, implementation, and benefits from the block-level deduplication offered by NetApp, Data Domain, and others in their products. To understand the differences, let us first compare the data reduction technologies:

Data reduction technologies

Technology                     Typical space savings   Resource footprint
File-level deduplication       10%                     Low
Fixed-block deduplication      20%                     High
Variable-block deduplication   28%                     High
Compression                    50%                     Medium

  • File-level deduplication provides relatively modest space savings.
  • Fixed-block deduplication provides better space savings, but consumes more CPU to calculate hashes for each block of data, and more memory to hold the indices used to determine if a given hash has been seen before.
  • Variable-block deduplication provides slightly better space savings, but the difference is not significant when applied to file system data. It is most effective on data sets that contain repeated but block-misaligned data, such as backup data in backup-to-disk or virtual tape library (VTL) environments.
  • Compression differs from file-level or block-level deduplication in the granularity at which it applies; it can be described as infinitely variable, bit-level, intra-object deduplication. It offers the greatest space savings of all the techniques listed for typical NAS data, and its resource footprint is relatively modest: it is fairly CPU-intensive but requires very little memory.

The storage space savings realized by compression are far greater than those offered by the other techniques, and its resource requirements are quite modest by comparison. Compression does carry a potential performance “penalty”: data must be decompressed when it is read or modified. However, this “penalty” can work both ways. Reading a compressed file can often be quicker than reading an uncompressed one, because the reduction in the amount of data that must be retrieved from disk more than offsets the additional processing required to decompress it.
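To make the intra-object, bit-level nature of compression concrete, here is a minimal sketch using Python’s standard zlib module. The sample data and the resulting ratio are purely illustrative; zlib simply stands in for a compressor, since the post does not say which algorithm Celerra uses internally.

    import zlib

    # Illustrative only: compress a buffer of text-like file data and compare
    # sizes. Real-world savings vary with the data being compressed.
    original = b"Quarterly report: revenue, expenses, and projections.\n" * 200

    compressed = zlib.compress(original, 6)
    restored = zlib.decompress(compressed)

    assert restored == original          # compression is lossless
    ratio = len(compressed) / len(original)
    print(f"original:   {len(original)} bytes")
    print(f"compressed: {len(compressed)} bytes ({ratio:.0%} of original)")

Reading the compressed form means pulling far fewer bytes off disk, which is why the decompression overhead can be more than offset in practice.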

Celerra Data Deduplication

Celerra Data Deduplication combines file-level deduplication and compression to provide maximum space savings for file system data, based on:

  • Frequency of file access: only files that are not “new” (creation time older than a configuration parameter) and not “hot”, i.e., not in active use (access time or modification time older than a configuration parameter), are processed.
  • File size: files are not compressed if they are so small that the anticipated space savings would be minimal, or so large that decompressing them could degrade performance and impact file access service levels. (A sketch of such a policy check follows this list.)
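As a rough illustration of how such a policy might be evaluated, here is a small sketch. The threshold names and values are hypothetical, not actual Celerra configuration parameters.

    import time
    from dataclasses import dataclass
    from typing import Optional

    # Hypothetical thresholds -- the real Celerra parameter names and defaults
    # are not given in this post; these values are purely illustrative.
    MIN_AGE_SECONDS = 30 * 24 * 3600      # "new"/"hot" if touched within ~30 days
    MIN_SIZE_BYTES = 24 * 1024            # skip small files: little space to gain
    MAX_SIZE_BYTES = 200 * 1024 * 1024    # skip large files: decompression cost

    @dataclass
    class FileInfo:
        size: int        # bytes
        ctime: float     # creation time (epoch seconds)
        atime: float     # last access time
        mtime: float     # last modification time

    def is_eligible(f: FileInfo, now: Optional[float] = None) -> bool:
        """Illustrative policy check: not new, not hot, and neither too
        small nor too large."""
        now = time.time() if now is None else now
        is_new = (now - f.ctime) < MIN_AGE_SECONDS
        is_hot = (now - max(f.atime, f.mtime)) < MIN_AGE_SECONDS
        size_ok = MIN_SIZE_BYTES <= f.size <= MAX_SIZE_BYTES
        return not is_new and not is_hot and size_ok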

Deduplication is enabled at the file system level and is transparent to access protocols. Mark Twomey’s post provides an excellent overview of Celerra Data Deduplication.

The space reduction process

Celerra Data Deduplication has a flexible policy engine that specifies data for exclusion from processing and decides whether to deduplicate specific files based on their age. When enabled on a file system, Celerra Data Deduplication periodically scans the file system for files that match the policy criteria and then compresses them. The compressed file data is hashed to determine whether that data has been seen before. If it has not, the compressed data is copied into a hidden portion of the file system; the space the file data occupied in the user-visible portion of the file system is freed, and the file’s internal metadata is updated to reference the copy in the hidden store. If the data has been seen before, the space it occupies is simply freed and the file’s metadata is updated to reference the existing copy. Note that Celerra detects non-compressible files and stores them in their original form; these files can still benefit from file-level deduplication.
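The flow above can be summarized in a short sketch. The in-memory dictionaries below are hypothetical stand-ins for Celerra’s hidden store and internal file metadata, which are not documented in this post.

    import hashlib
    import zlib

    # Hypothetical stand-ins for the internal structures described above.
    hidden_store = {}     # SHA-1 digest -> compressed (or original) file data
    file_metadata = {}    # file path    -> digest of the data it now references

    def space_reduce(path, data):
        """One pass over a single file: compress, hash, deduplicate."""
        compressed = zlib.compress(data)
        if len(compressed) >= len(data):
            compressed = data                  # non-compressible: keep original form
        digest = hashlib.sha1(compressed).hexdigest()
        if digest not in hidden_store:
            hidden_store[digest] = compressed  # first occurrence: copy into hidden store
        # In either case the user-visible copy is freed and the file's metadata
        # is updated to reference the single stored instance.
        file_metadata[path] = digest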

Celerra Data Deduplication employs SHA-1 (Secure Hash Algorithm) for its file-level deduplication. SHA-1 takes a stream of data up to 2^64 bits in length and produces a 160-bit hash that is designed to be unique to the original data stream. The likelihood of two different files hashing to the same value is extremely low; the best reported attack requires on the order of 2^69 hash operations to produce a collision. Unlike compression, file-level deduplication can be disabled in Celerra Data Deduplication.
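For illustration, the lines below (using Python’s hashlib, not Celerra’s internal implementation) show that SHA-1 always yields a 160-bit digest and that identical data always hashes to the same value, which is what makes the digest usable as a file identity.

    import hashlib

    digest = hashlib.sha1(b"compressed file contents").digest()
    print(len(digest) * 8)    # 160 bits, regardless of the input size

    # Two files are considered duplicates only if their (compressed) data
    # produces exactly the same digest.
    a = hashlib.sha1(b"identical compressed bytes").hexdigest()
    b = hashlib.sha1(b"identical compressed bytes").hexdigest()
    print(a == b)             # True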

Designed to minimize client impact

Celerra Data Deduplication processes the bulk of the data in a file system without affecting the production workload. All deduplication processing is performed as a background, asynchronous operation that acts on file data after it has been written into the file system, which keeps added latency out of the client data path. By policy, deduplication is performed only on files that are not in active use, so it introduces no performance penalty on the data that clients and users rely on to run their business.

Written by paule1s

December 7, 2009 at 12:51 pm
