EMC Celerra NAS SAN Deduplication
EMC Celerra deduplication differs substantially in concept, implementation, and benefits from the block-level deduplication that NetApp, Data Domain, and others offer in their products. To understand the differences, first compare the available data reduction technologies:
Data reduction technologies
|Technology||Typical Space Savings||Resource footprint|
|Fixed block deduplication||20%||High|
- File-level deduplication provides relatively modest space savings.
- Fixed-block deduplication provides better space savings, but consumes more CPU to calculate hashes for each block of data, and more memory to hold the indices used to determine if a given hash has been seen before.
- Variable-block deduplication provides slightly better space savings, but the difference is not significant when applied to file system data. It is most effective when applied to data sets that contain repeated but block-misaligned data, such as backup data in backup-to-disk or virtual tape library (VTL) environments.
- Compression is different from file-level or block-level deduplication in the granularity at which it applies. It is described as infinitely variable, bit-level, intra-object deduplication. It offers the greatest space savings of all the techniques listed for typical NAS data, and is relatively modest in terms of its resource footprint. It is relatively CPU-intensive but requires very little memory.
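To make the point about block-misaligned data concrete, here is a minimal Python sketch. It is not how any vendor implements deduplication: the 4 KiB block size, the helper names `block_hashes` and `shared_blocks`, and the synthetic payload are all assumptions chosen for illustration. It shows that fixed-block hashing finds every duplicate block in an identical copy, but none once the same content is shifted by a single byte.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> set[str]:
    """Return the set of SHA-1 digests of each fixed-size block of data."""
    return {
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    }


def shared_blocks(a: bytes, b: bytes) -> int:
    """Count distinct fixed-size blocks that occur in both a and b."""
    return len(block_hashes(a) & block_hashes(b))


# Deterministic, non-repeating synthetic payload: 2048 * 32 bytes = 16 blocks.
payload = b"".join(
    hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(2048)
)

identical_copy = bytes(payload)
shifted_copy = b"\x00" + payload  # same content, misaligned by one byte

# File-level deduplication: identical files produce identical whole-file hashes.
same_file = hashlib.sha1(payload).digest() == hashlib.sha1(identical_copy).digest()

print(same_file)                               # whole-file match
print(shared_blocks(payload, identical_copy))  # all blocks match
print(shared_blocks(payload, shifted_copy))    # misalignment defeats fixed blocks
```

Variable-block schemes avoid this by choosing block boundaries from the content itself, which is why they suit backup streams better than file system data.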
The storage space savings realized by compression are far greater than those offered by the other techniques, and its resource requirements are quite modest by comparison. However, compression carries a potential performance “penalty”, because the data must be decompressed when it is read or modified. This decompression “penalty” can work both ways: reading a compressed file can often be quicker than reading a non-compressed file, because the reduction in the amount of data that must be retrieved from disk more than offsets the additional processing required to decompress it.
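The trade-off can be sketched with a simple back-of-the-envelope model. Everything here is an assumption for illustration: the function names, the throughput figures, and the simplification that disk transfer and decompression happen one after the other with no overlap. None of the numbers are measurements from a Celerra system.

```python
def uncompressed_read_seconds(size_mb: float, disk_mb_per_s: float) -> float:
    """Time to read the file as stored, with no decompression step."""
    return size_mb / disk_mb_per_s


def compressed_read_seconds(size_mb: float, disk_mb_per_s: float,
                            ratio: float, decompress_mb_per_s: float) -> float:
    """Time to read a compressed file.

    ratio is compressed size / original size; decompress_mb_per_s is the
    rate at which decompressed output is produced.
    """
    transfer = (size_mb * ratio) / disk_mb_per_s
    decompress = size_mb / decompress_mb_per_s
    return transfer + decompress


# Hypothetical 100 MB file on a 100 MB/s disk, compressed to half its size.
plain = uncompressed_read_seconds(100, 100)
fast_cpu = compressed_read_seconds(100, 100, ratio=0.5, decompress_mb_per_s=400)
slow_cpu = compressed_read_seconds(100, 100, ratio=0.5, decompress_mb_per_s=150)

print(plain, fast_cpu, slow_cpu)
```

Under these assumed figures the compressed read wins when decompression is fast relative to the disk and loses when it is slow, which is the sense in which the penalty works both ways.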
Celerra Data Deduplication
Celerra Data Deduplication combines file-level deduplication and compression to provide maximum space savings for file system data. It selects files based on:
- Frequency of file access: it processes only files that are neither “new” (creation time older than a configuration parameter) nor “hot”, that is, in active use (access time or modification time older than a configuration parameter).
- File size: it avoids compressing files that are so small that the anticipated space savings are minimal, or so large that decompressing them could degrade performance and impact file access service levels.
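The selection criteria above can be sketched as a small eligibility check. This is an illustration only: `DedupPolicy`, `FileInfo`, `is_eligible`, and every threshold value are hypothetical names and numbers, not Celerra's actual configuration parameters, and the sketch uses a single idle threshold for both access and modification time.

```python
from dataclasses import dataclass

DAY = 86_400  # seconds


@dataclass(frozen=True)
class DedupPolicy:
    # Hypothetical thresholds; not Celerra's real parameter names or defaults.
    min_age_s: float = 30 * DAY      # files created more recently are "new"
    min_idle_s: float = 30 * DAY     # files touched more recently are "hot"
    min_size: int = 24 * 1024        # smaller files save too little space
    max_size: int = 200 * 1024 ** 2  # larger files cost too much to decompress


@dataclass(frozen=True)
class FileInfo:
    size: int     # bytes
    ctime: float  # creation time, seconds since epoch
    atime: float  # last access time
    mtime: float  # last modification time


def is_eligible(f: FileInfo, policy: DedupPolicy, now: float) -> bool:
    """Return True if the file is neither new nor hot and its size is in range."""
    is_new = now - f.ctime < policy.min_age_s
    is_hot = now - max(f.atime, f.mtime) < policy.min_idle_s
    size_ok = policy.min_size <= f.size <= policy.max_size
    return not is_new and not is_hot and size_ok


now = 1000 * DAY
policy = DedupPolicy()
old = now - 365 * DAY

cold_file = FileInfo(size=1024 ** 2, ctime=old, atime=old, mtime=old)
hot_file = FileInfo(size=1024 ** 2, ctime=old, atime=now - DAY, mtime=old)
tiny_file = FileInfo(size=512, ctime=old, atime=old, mtime=old)

print(is_eligible(cold_file, policy, now))
print(is_eligible(hot_file, policy, now))
print(is_eligible(tiny_file, policy, now))
```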
The space reduction process
Celerra Data Deduplication has a flexible policy engine that specifies data for exclusion from processing and decides whether to deduplicate specific files based on their age. When enabled on a file system, Celerra Data Deduplication periodically scans the file system for files that match the policy criteria and then compresses them. The compressed file data is hashed to determine if the file has been identified before. If the compressed file data has not been identified before, it is copied into a hidden portion of the file system, the space that the file data occupied in the user portion of the file system is freed, and the file’s internal metadata is updated to reference the stored copy. If the data associated with the file has been identified before, the space it occupies is freed and the file’s internal metadata is updated to reference the existing copy. Note that Celerra detects non-compressible files and stores them in their original form. However, these files can still benefit from file-level deduplication.
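The flow described above (compress, hash, store once, repoint metadata) can be modeled with a toy in-memory store. This is a sketch under stated assumptions, not Celerra's implementation: the `DedupStore` class and its methods are hypothetical, `zlib` stands in for whatever compressor the product uses, and Python dictionaries stand in for the hidden store and the file metadata.

```python
import hashlib
import zlib


class DedupStore:
    """Toy model: 'hidden' holds unique data once, 'metadata' maps names to it."""

    def __init__(self) -> None:
        # digest -> (is_compressed, stored bytes); stands in for the hidden store
        self.hidden: dict[str, tuple[bool, bytes]] = {}
        # file name -> digest; stands in for per-file internal metadata
        self.metadata: dict[str, str] = {}

    def process(self, name: str, data: bytes) -> bool:
        """Deduplicate one file. Return True if its data was new to the store."""
        compressed = zlib.compress(data)
        # Non-compressible data is kept in its original form.
        is_compressed = len(compressed) < len(data)
        stored = compressed if is_compressed else data
        digest = hashlib.sha1(stored).hexdigest()
        is_new = digest not in self.hidden
        if is_new:
            self.hidden[digest] = (is_compressed, stored)
        # In both cases the file now references the single stored copy.
        self.metadata[name] = digest
        return is_new

    def read(self, name: str) -> bytes:
        """Return the original file contents, decompressing if needed."""
        is_compressed, stored = self.hidden[self.metadata[name]]
        return zlib.decompress(stored) if is_compressed else stored


store = DedupStore()
report = b"quarterly report line\n" * 1000
first = store.process("a/report.txt", report)
second = store.process("b/copy-of-report.txt", report)
print(first, second, len(store.hidden))
```

Both names resolve to the same stored copy, so the second file consumes no additional data space in this model.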
Celerra Data Deduplication employs SHA-1 (Secure Hash Algorithm 1) for its file-level deduplication. SHA-1 takes a stream of data less than 2^64 bits in length and produces a 160-bit hash, which is designed to be effectively unique to the original data stream. The likelihood of different files hashing to the same value is extremely low: finding a collision has been reported to take on the order of 2^69 hash operations. Unlike compression, file-level deduplication can be disabled in Celerra Data Deduplication.
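As an illustration of hashing a data stream with SHA-1, the sketch below uses Python's standard `hashlib` module. The helper name `stream_sha1` is an assumption made for this example; it feeds the hash incrementally, so the whole stream never has to be held in memory.

```python
import hashlib
from typing import Iterable


def stream_sha1(chunks: Iterable[bytes]) -> str:
    """Hash a stream delivered as successive chunks and return the hex digest."""
    h = hashlib.sha1()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


digest_bits = hashlib.sha1().digest_size * 8  # size of a SHA-1 hash in bits

# The digest depends only on the byte stream, not on how it is chunked.
whole = stream_sha1([b"abc"])
split = stream_sha1([b"a", b"bc"])
print(digest_bits, whole == split)
```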
Designed to minimize client impact
Celerra Data Deduplication processes the bulk of the data in a file system without affecting the production workload. All deduplication processing is performed as a background asynchronous operation that acts on file data after it is written into the file system. This avoids latency in the client data path, because access to production data is sensitive to latency. By policy, deduplication is performed only for those files that are not in active use. This avoids introducing any performance penalty on the data that clients and users are using to run their business.