Many organizations have problems with large numbers of duplicate files prematurely filling up their file systems. Some of the reasons for these problems were discussed in the previous post.
The primary task of deduplication is, of course, to find duplicated (or redundant) files. A common method for doing this is to process each file, or block of data, using one or more algorithms. These algorithms create what is called a hash – a compact value that reflects the contents of the data.
In theory, each unique file produces its own hash value. Add a character, delete a space, or change ANYTHING in a file, and its hash changes. For this reason, a file's hash value is, for practical purposes, unique to that file's contents. If two or more files have identical hash values, the assumption is that the files themselves are identical. The deduplication hardware or software maintains a database of hash values – if more than one file is found with the same hash, the second and later copies are deleted and replaced with a 'flag' or pointer that refers back to the original file. Because it isn't impossible for two different files to share a hash value, some systems also compare the files byte for byte to make sure the contents really are identical before deduplicating.
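The idea above can be sketched in a few lines of Python. This is a simplified illustration, not any vendor's actual implementation: it hashes each file, and when two files share a hash it does the extra byte-for-byte comparison mentioned above before declaring them duplicates.

```python
import hashlib

def file_hash(path, algo="sha256"):
    """Compute a hash of a file's contents, reading in chunks."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    """Map each duplicate file to the first file with the same contents."""
    seen = {}          # hash -> first path seen with that hash
    duplicates = {}    # duplicate path -> original path
    for path in paths:
        digest = file_hash(path)
        if digest in seen:
            # Same hash: almost certainly identical, but a cautious
            # system compares the bytes to rule out a collision.
            with open(path, "rb") as a, open(seen[digest], "rb") as b:
                if a.read() == b.read():
                    duplicates[path] = seen[digest]
        else:
            seen[digest] = path
    return duplicates
```

In a real deduplicating system, the entries in `duplicates` would be replaced with pointers to the original file rather than simply reported.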
An example given in the first blog on this subject mentioned a 50 MB message sent to 100 employees. If the mail system undergoes deduplication, only ONE instance of the 50 MB message will be stored; the remaining instances will refer back to the original file. When an employee wants to view the message again, the flag points back to the original file, and the original file is retrieved.
Actual data deduplication is available from many companies, in hardware versions, software versions, or both. At this time, data deduplication is primarily targeted at organizations storing large amounts of data; it hasn't migrated down to the end user. For end users, duplicate file finder software may help reduce file redundancy, but it typically deletes duplicate files outright rather than 'flagging' them as references to another copy.
Deduplication is available in a few different ways. It can be part of a backup process, in which duplicate files are detected and flagged while the source is being backed up. It can also be implemented so that it creates a hash of data when it is written to storage – duplicate hashes will generate flags that prevent redundant data from being stored.
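The write-time approach can be illustrated with a toy content-addressed store (again a sketch, not a product's API): the first write of a given piece of data stores the bytes, and every later write of the same data stores only a pointer to the existing copy.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: unique data is kept once; writes of
    duplicate data record only a pointer ('flag') to the stored copy."""

    def __init__(self):
        self.blobs = {}   # hash -> data, each unique blob stored once
        self.index = {}   # name -> hash, the pointer for every write

    def write(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data   # first copy: store the bytes
        self.index[name] = digest       # every copy: store only a pointer

    def read(self, name):
        # Following the pointer returns the original data transparently.
        return self.blobs[self.index[name]]
```

Writing the same 50 MB attachment under 100 different names would leave exactly one entry in `blobs` and 100 lightweight entries in `index`.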
Depending on the vendor, hashes can be computed from block data (equal-sized blocks of data) or file data (a hash over a complete file). Other methods are also available, and technologists continue to improve the capabilities and performance of deduplication.
The significant reductions in data storage requirements have, in many cases, made it possible to perform backups to hard drives rather than tape. To safeguard those backups, it is often prudent to make copies of the hard drive(s) containing the backup data. A hard drive duplicator, like those offered by Aleratec, makes this a fast, easy, and affordable process.