Data Deduplication – Part 2 – Approaches

Many organizations struggle with large numbers of duplicate files prematurely filling up their file systems. Some of the reasons for this were discussed in the previous post.

The primary task of deduplication is, of course, to find duplicated (or redundant) files. A common method for doing this is to process each file, or block of data, with one or more hashing algorithms. These algorithms produce what is called a hash – a compact value that reflects the contents of the data.

In theory, each unique file will have its own hash value. If you add a character, delete a space, or change ANYTHING in a file, its hash changes. For this reason, it's almost certain that each file's hash value will be unique to that specific file, so if two or more files have identical hash values, the assumption is that the files themselves are identical. The deduplication hardware or software maintains a database of hash values – when a file is detected with the same hash as one already in the database, the second and later copies are deleted and replaced with a 'flag' or pointer that refers back to the original file and its hash value. Because it isn't impossible for two different files to have the same hash value (a collision), some systems also compare the files byte by byte to confirm that their contents really are identical.
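The hash-and-compare process described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's actual implementation: it hashes each file with SHA-256, keeps a small "database" mapping each hash to the first file seen, and, when a hash repeats, verifies the match with a byte-for-byte comparison before declaring a duplicate.

```python
import hashlib
import filecmp

def file_hash(path, chunk_size=65536):
    """Compute a SHA-256 hash of a file's contents, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    """Return (duplicate, original) pairs: files whose hash matches an
    earlier file AND whose bytes compare equal (to rule out collisions)."""
    seen = {}          # hash -> path of the first (original) copy
    duplicates = []
    for path in paths:
        digest = file_hash(path)
        if digest in seen and filecmp.cmp(path, seen[digest], shallow=False):
            duplicates.append((path, seen[digest]))
        else:
            seen.setdefault(digest, path)
    return duplicates
```

A real deduplication system would go one step further and replace each detected duplicate with a pointer to the original, rather than merely reporting it.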

(Figure: data deduplication diagram)

An example given in the first blog on this subject mentioned a 50 MB message sent to 100 employees. If the mail system undergoes deduplication, only ONE instance of the 50 MB message will be stored; the remaining instances will refer back to the original file. When an employee wants to view the message again, the flag for the file refers back to the original, and the original file is retrieved.

Data deduplication is available from many companies, in hardware versions, software versions, or both. At this time, it is primarily targeted at organizations storing large amounts of data and hasn't migrated down to the end user. For end users, duplicate file finder software may help to reduce file redundancy, but it typically deletes duplicate files outright rather than 'flagging' them as references to another copy.

Deduplication can be applied at a few different points. It can be part of a backup process, in which duplicate files are detected and flagged while the source is being backed up. It can also be implemented inline, creating a hash of data as it is written to storage – duplicate hashes generate flags that prevent redundant data from being stored.
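The inline, write-time approach can be illustrated with a toy content-addressed store – a hedged sketch of the idea only, with a plain dictionary standing in for the real hash database. Identical payloads are kept once; every later write of the same content records only a pointer (the 'flag') to the stored copy.

```python
import hashlib

class DedupStore:
    """Toy write-time deduplicating store: identical data is stored once,
    and each logical file keeps only a pointer to the stored copy."""

    def __init__(self):
        self.blocks = {}   # hash -> data (the single stored copy)
        self.index = {}    # file name -> hash (the 'flag', or pointer)

    def write(self, name, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blocks:     # first copy: actually store the bytes
            self.blocks[digest] = data
        self.index[name] = digest         # later copies: store only the pointer

    def read(self, name) -> bytes:
        # Follow the pointer back to the single stored copy.
        return self.blocks[self.index[name]]
```

With the earlier 100-recipient mail example, 100 writes of the same message leave exactly one stored copy plus 100 small pointers.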

Depending on the vendor, hashes can be computed from block data (equal-sized blocks of data) or file data (a single hash over a complete file). Other methods are also available, and technologists continue to improve the capabilities and performance of deduplication.

The significant reductions in data storage requirements have, in many cases, made it practical to perform backups to hard drives rather than tape. To protect that backup data, it is often prudent to make copies of the hard drive(s) containing it. A hard drive duplicator, like those offered by Aleratec, makes this a fast, easy, and affordable process.

Mark Brownstein is a technology journalist and technology consultant who specializes in explaining and interpreting new technologies, and clarifying how to integrate these new products into current systems. He has been Editor-In-Chief at computer technology and networking publications, has held significant editorial positions at major technology magazines, and is a frequent contributor to various technology magazines. He has written seven books. He is Microsoft Certified, and spends much of his time testing hardware and software products, running his own networks, and learning the best ways to get computer systems running and to keep them running.

