Data Deduplication Pt 1 – The Problem

Organizations, and even many individuals, wind up with duplicate files on their computer disks. Put simply, exact copies of the same file get saved again and again.

For an individual, this may be as simple as downloading the same music or video files a few times. It may be easier to get another copy than it is to search for it on your hard drive.

For organizations, it could be considerably worse.  For example, an organization may handle storage for its employees on a central server or storage network.  To each user, it may seem as if the files are stored locally on the computer’s hard drive, when in fact, the data comes from a space on the storage drives that is dedicated to that particular user.

Now, imagine a scenario in which one user creates a Word file and sends it to colleagues within the company for comment.  One file – to ten colleagues.  One file, in ten user areas in the storage system.  One file – with ten duplicates.  Even if the colleagues respond without making any changes or suggestions, these duplicate files will remain on the server.

If the colleagues respond with changes, the changed file will go out to everyone on the review list and create another batch of duplicate files.  The more changes that are made, the more duplicates are created.  A simple, relatively small file can leave a much larger footprint purely as a result of duplication.

Or, for example, an employee finds a great video on YouTube and forwards a copy to coworkers.  Or, perhaps, the CEO decides to send a personal message to all employees as a video attachment to company mail.  A simple 50 megabyte video greeting sent to 100 employees quickly eats up 5 gigabytes of company storage.
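To make the arithmetic concrete, here is a minimal sketch of that calculation. The function name is made up for illustration, and it assumes decimal units (1 gigabyte = 1,000 megabytes) and that every recipient's copy is stored separately.

```python
def duplicated_footprint_mb(file_size_mb: float, copies: int) -> float:
    """Total megabytes consumed when one file is stored `copies` times."""
    return file_size_mb * copies


# The video greeting from the example: 50 MB stored in 100 mailboxes.
total_mb = duplicated_footprint_mb(50, 100)

# Everything beyond the first copy is pure duplication.
wasted_mb = total_mb - 50

print(f"Total:  {total_mb:.0f} MB ({total_mb / 1000:.1f} GB)")
print(f"Wasted: {wasted_mb:.0f} MB")
```

Only 50 of those 5,000 megabytes hold unique content; the other 4,950 are copies.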

Problems also occur if a file is saved using a different name.  Same content.  Different name.  A diligent employee looking for duplicates on his or her system could reasonably expect that files with very similar names and the same size are duplicates, and may consider deleting the extras.  Files of the same size with unrelated names are harder to recognize as duplicates.

Confusion can occur when the same file is saved to different drives or directories.  For example, an employee may save the file that he or she just CAN’T lose by copying it to one or more folders so that, if something happens to the file in one location, it can still be retrieved from the secondary folder.
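Both situations, the same content under a different name and the same file copied into several folders, have one thing in common: the names and locations differ, but the bytes do not. As a rough sketch of how such copies might be found (one possible approach, not a description of any particular deduplication product), the Python below ignores file names entirely. It groups files by size first, then compares SHA-256 digests of their contents. The function names `file_digest` and `find_duplicates` are invented for this illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of paths under `root` that share size and SHA-256 digest."""
    # Pass 1: bucket files by size, which is cheap to read.
    by_size = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files that share a size with at least one other file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            by_digest[(size, file_digest(path))].append(path)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Calling `find_duplicates` on a folder returns a list of groups, each holding the paths of files with matching content, whatever they are named and wherever they sit under that folder. The sketch only reports candidates; deciding which copy to keep is a separate question.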

The problem of duplicated files is a big one.  According to some companies involved in data deduplication, effectively detecting and eliminating duplicates could reclaim as much as 90% of a drive’s capacity in some cases.  In most cases, the savings are probably considerably smaller, but even for individuals, the issue of duplicate files is real.

In the next part of this blog, we’ll look at some of the methods available for dealing with file duplication.

Mark Brownstein is a technology journalist and technology consultant who specializes in explaining and interpreting new technologies, and clarifying how to integrate these new products into current systems. He has been Editor-In-Chief at computer technology and networking publications, has held significant editorial positions at major technology magazines, and is a frequent contributor to various technology magazines. He has written seven books. He is Microsoft Certified, and spends much of his time testing hardware and software products, running his own networks, and learning the best ways to get computer systems running and to keep them running.
