Data Compression vs. Deduplication
By Drew Robb
August 11, 2010
Cuba Gooding Jr. certainly did a fine job in the movie "Jerry Maguire." Tom Cruise gave one of the best performances of his career in that same film. Yet it was Gooding who won the Oscar.
A similar thing might be happening in data storage. Deduplication is winning all the plaudits, yet potentially bigger and better things are going on that the media is largely ignoring.
It's easy to see why dedupe gets all the attention. When you boast 20-to-1 data reduction rates, that means 1 TB takes up only 50 GB of space. But what is often missed in all this is that data compression of larger volumes of data recovers far more storage.
EMC, for example, recently made several announcements on dedupe and data compression. Its Data Domain Boost release received maximum fanfare, its own press conference and room during Chairman Joe Tucci's keynote. Another announcement on data compression wasn't even printed off and handed out. It was just mentioned briefly among other news.
But that data compression technology, being given away as a free added feature for EMC Clariion storage arrays, could have a far bigger impact on the overall storage networking picture. What EMC is saying is that it can now achieve a 2-to-1 compression ratio on large quantities of data on Clariion. So, if you have 100 TB on an array, this new feature frees up 50 TB. In contrast, Data Domain dedupe appliances at best free up a few TB and cost a significant amount in added hardware.
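The arithmetic behind that comparison can be sketched in a few lines of Python. The 100 TB array and the 2-to-1 and 20-to-1 ratios come from the figures above; the 5 TB backup set is purely an assumed number standing in for "a few TB," and `space_freed` is a hypothetical helper, not anything EMC ships.

```python
def space_freed(original_tb: float, reduction_ratio: float) -> float:
    """Capacity reclaimed (in TB) when data shrinks by reduction_ratio-to-1."""
    return original_tb - original_tb / reduction_ratio


# 2-to-1 compression across a full 100 TB array (figures from the article).
array_freed = space_freed(100.0, 2.0)

# 20-to-1 dedupe on an ASSUMED 5 TB backup set (illustrative only).
dedupe_freed = space_freed(5.0, 20.0)

print(array_freed)   # 50.0
print(dedupe_freed)  # 4.75
```

Under these assumptions, the modest ratio applied to the larger volume reclaims roughly ten times the capacity of the headline-grabbing ratio applied to the smaller one.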
"Clariion and its compression announcement are not getting enough love," said Greg Schulz, an analyst with StorageIO Group. "Being able to reduce a storage array from 100 TB to 50 TB is huge for storage administrators."
He pointed out that for sheer volume of TBs, the Clariion announcement will exert a far bigger impact than the Boost release. Clariion is a long-established best seller among EMC's disk arrays, and the company ships many PBs of Clariion capacity every month. Data Domain is probably lagging behind this by several orders of magnitude.
These data compression functions first appeared in file-based systems a couple of years ago. This technology now operates on block-level data in Clariion-based SANs.
"EMC Celerra boxes have had data compression on the file side for a few years," said Barry Ader, senior director of product management at EMC. "Now that we have added it for the block side, the average space reclamation rate is 50 percent. Clariion is the first block-level storage device to have compression."
In essence, what it does is compress inactive data within the system. EMC has designed algorithms to detect such data, so the bits and bytes can be crushed closer together in a similar way to a Zip file.
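EMC has not published how its detection algorithms work, so the following is only a generic sketch of the idea: find blocks that have not been touched for a while and compress them with DEFLATE, the same family of compression used in Zip files, via Python's standard `zlib` module. The `Block` class, the 30-day idle threshold, and the function names are all assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass

IDLE_SECONDS = 30 * 24 * 3600  # assumed threshold: 30 days without access


@dataclass
class Block:
    data: bytes
    last_access: float  # seconds since some epoch
    compressed: bool = False


def compress_inactive(blocks, now, idle_seconds=IDLE_SECONDS):
    """Compress blocks idle for at least idle_seconds; return bytes saved."""
    saved = 0
    for block in blocks:
        if block.compressed or now - block.last_access < idle_seconds:
            continue  # already compressed, or still active
        packed = zlib.compress(block.data)
        if len(packed) < len(block.data):  # keep only if it actually shrinks
            saved += len(block.data) - len(packed)
            block.data = packed
            block.compressed = True
    return saved


def read_block(block, now):
    """Return the original bytes, decompressing on the fly if needed."""
    block.last_access = now
    return zlib.decompress(block.data) if block.compressed else block.data
```

In this sketch, active blocks are left alone so that reads of hot data never pay a decompression cost, which is one plausible reason to target only inactive data.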
Schulz, though, doesn't think it's a case of dedupe versus compression. He believes organizations must employ several technologies to reduce the data footprint.
"You have to look at the bigger picture of reducing the data footprint via different techniques," he said. Dedupe is one way, but there is also compression, thin provisioning and other methods.
FAST Times at Hopkinton High
As well as its dedupe and compression announcements, EMC has also released the second iteration of Fully Automated Storage Tiering (FAST). When EMC introduced FAST last year, it was widely criticized because it could not operate below the Logical Unit Number (LUN) level. Other data storage products on the market could do this, so EMC was perceived as being behind the times. This release of FAST remedies that deficiency so that users of Clariion and Celerra arrays will now be able to harness it to move data at a sub-LUN level.
Ader explained that FAST can be deployed in a variety of applications. The most obvious one is to set up tiers of storage: Tier One for often-accessed, high-priority data; Tier Two for less-used data; and Tier Three as an archive for older files. But in some cases, it is being teamed up with solid state drives (SSD) to create a smaller Tier Zero to supply even more horsepower for the most demanding data sets.
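The tiering scheme described above can be illustrated with a toy placement rule. This is not how FAST decides placement; the access-rate and age thresholds below are invented for the example, and `assign_tier` is a hypothetical function.

```python
def assign_tier(accesses_per_day: float, age_days: int) -> int:
    """Pick a storage tier from access rate and age (all thresholds assumed)."""
    if accesses_per_day >= 1000:
        return 0  # Tier Zero: SSD for the most demanding data sets
    if accesses_per_day >= 10:
        return 1  # Tier One: often-accessed, high-priority data
    if age_days >= 365:
        return 3  # Tier Three: archive for older files
    return 2      # Tier Two: less-used data


print(assign_tier(5000, 1))   # 0
print(assign_tier(100, 10))   # 1
print(assign_tier(1, 30))     # 2
print(assign_tier(0, 1000))   # 3
```

A real tiering engine would re-evaluate placement continuously and, at the sub-LUN level, would apply a rule like this to slices of a volume rather than to whole volumes.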
"People are using FAST to move data around in different tiers, while using SSD as a Tier Zero for highest performance," said Ader.
Alternatively, it is being used to establish a FAST cache rather than a storage tier. What that means is up to 2 TB of non-volatile cache can be made available to EMC Clariion and Celerra arrays.
Drew Robb is a freelance writer specializing in technology and engineering. Currently living in California, he is originally from Scotland, where he received a degree in geology and geography from the University of Strathclyde. He is the author of Server Disk Management in a Windows Environment (CRC Press).