Get RID of Dirty Data

If your information environment has been in use for 10 years or more, chances are very good that your data is not as clean as you need it to be. On average, 25% of the data in a typical infrastructure is dirty – duplicates, multiple spellings and codes, or just plain wrong – which introduces an enormous amount of ambiguity into the business decision-making process.

Nearly everyone has the problem, but what can we do about it? Master Data Management provides a comprehensive approach to getting the data quality monster under control, but with such an overwhelming amount of dirt in the environment, it can be difficult to get started.

Garbage is garbage, and in today’s Big Data landscape, if the data is dirty, you need to clean it up or throw it out. Ignoring the garbage doesn’t make it disappear.

The task is daunting, but achievable. If you are serious about getting RID of dirty data, the right plan of attack can get the dirty data mess tidied up.

Recognize, Isolate, and Deal with it (RID)

Recognize

Do you know what clean data looks like? This is important. Without this vision, it will be difficult to distinguish dirty data within your environment. Identifying bad data (which is different from dirty data) is easier, because myriad business and system rules already exist to pick out obviously wrong field values.

Imagine that you need to pick up a pint of strawberries at the grocery store, but when you get to the produce section, there is a watermelon in the same bin as the berries. The watermelon in this analogy represents the “bad” data. It is unlikely that you will end up taking home a watermelon instead of the strawberries.

But how do you know if the strawberries you select are bug- and pesticide-free? As you look over each pint, you may see bruising and discoloration on some of the strawberries in each container. This is akin to dirty data. One way to make sure that you select the best strawberries available is to examine each container and compare the strawberries within it to your vision of the ideal strawberry. In this example, you already know what a good strawberry looks like, so you can decide which container to purchase based on this idea. Likewise, if you know what good data looks like, finding the dirty data will become as simple as selecting the perfect strawberry.
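To make the watermelon-versus-bruised-strawberry distinction concrete, here is a minimal sketch in Python. The field names (state, zip), the reference list, and the variant map are all hypothetical stand-ins for whatever your own definition of clean data turns out to be; this is an illustration, not a prescribed rule set.

```python
import re

# Hypothetical reference list: what "clean" looks like for the state field.
VALID_STATES = {"OH", "MI", "IN"}

# Hypothetical map of known non-standard spellings to their clean form.
STATE_VARIANTS = {"ohio": "OH", "oh.": "OH", "michigan": "MI", "indiana": "IN"}


def classify(record):
    """Return 'bad', 'dirty', or 'clean' for one customer record."""
    state = record.get("state", "")
    zip_code = record.get("zip", "")

    # Bad data (the watermelon): a simple field rule rejects it outright.
    if not re.fullmatch(r"\d{5}", zip_code):
        return "bad"

    # Clean data: the value already matches the reference list exactly.
    if state in VALID_STATES:
        return "clean"

    # Dirty data (the bruised strawberry): recognizable, but not in clean form.
    normalized = state.strip().lower()
    if normalized in STATE_VARIANTS or normalized.upper() in VALID_STATES:
        return "dirty"

    # Anything else is not recognizable at all, so treat it as bad.
    return "bad"
```

Under these assumptions, a record with state "OH" is clean, "Ohio" or " oh " is dirty, and a five-character ZIP containing a letter is bad. Notice that the dirty check is only possible because the clean form was defined first.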

Isolate

Once you can recognize dirty data, the next step is to isolate it. This will ensure that your business decision-making process is based on clean data, and not muddied by the dirty data.

Going back to the strawberry analogy, as you find containers that have undesirable strawberries in them, you may put them to the side, isolating them from the rest of the containers that have not yet been examined. If you were in the strawberry business, you might have a similar process to set aside packages with blemished fruit for further examination prior to sending them to market. After the strawberry quality expert takes another look at those packages, it may be possible to take the unblemished fruit from several of the packages containing blemished berries and to create a package that is marketable.

This is the same process – granted, without the summer sweetness of fresh strawberries – that needs to happen with your dirty data. Define the bin where the dirty data will be contained, and get your data quality expert, commonly referred to as the Data Steward, prepared for the next step: dealing with it.
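One way to picture the bin is a simple split: every record goes either into the clean set or into a quarantine set, and nothing is changed or thrown away yet. The sketch below assumes records are Python dictionaries; the function names and the name-formatting check are hypothetical examples, not a recommended rule.

```python
def looks_dirty(record):
    """Hypothetical check: the name has stray whitespace or inconsistent case."""
    name = record["name"]
    # Collapse runs of whitespace and title-case the result; a clean name
    # is assumed to already be in that form.
    return name != " ".join(name.split()).title()


def isolate(records, is_dirty):
    """Split records into (clean, quarantine) without modifying any record."""
    clean, quarantine = [], []
    for record in records:
        if is_dirty(record):
            quarantine.append(record)  # set aside for the Data Steward
        else:
            clean.append(record)       # safe to use for decisions
    return clean, quarantine


records = [{"name": "Acme Corp"}, {"name": "acme  corp"}, {"name": "Globex"}]
clean, quarantine = isolate(records, looks_dirty)
```

Here the quarantine list ends up holding only the "acme  corp" record. Passing the check in as a parameter keeps the bin separate from the rules, so the rules can change as your picture of clean data improves.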

Deal with it

The role of the Data Steward in any Master Data Management effort is extremely important. Once the dirty data is cordoned off from the rest of the data population, the Data Steward determines whether the data should be cleaned or discarded.

If the data is to be cleaned, the Data Steward works with the business process and technology owners to ensure that all rules concerning the data are considered in the data cleansing solution. With multiple parties involved, this is no easy task. To be successful, the Data Steward needs to navigate both the business and technology environments to ensure that the data is cleaned appropriately.

In the best of situations, the Data Steward is also able to determine where the dirty data originated, and work with the business and technology owners to ensure that this source of dirty data is addressed by a system or process change so that no additional dirty data is generated.
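A first pass at finding where the dirt comes from can be as simple as counting quarantined records by the system that produced them. This sketch assumes each record carries a "source" field, which is a hypothetical convention; your environment may track lineage differently or not at all.

```python
from collections import Counter


def dirty_by_source(quarantined):
    """Count quarantined records per originating system, largest first."""
    counts = Counter(
        # Records with no recorded origin are grouped under "unknown".
        record.get("source", "unknown")
        for record in quarantined
    )
    return counts.most_common()


quarantined = [
    {"name": "acme  corp", "source": "crm"},
    {"name": "GLOBEX", "source": "web_form"},
    {"name": "initech ", "source": "crm"},
    {"name": "umbrella"},
]
ranking = dirty_by_source(quarantined)
```

The system at the top of the ranking is a reasonable place for the Data Steward to start the conversation with business and technology owners. A large "unknown" count is itself a finding, since it means the origin of the data is not being recorded.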

Summary

In the past 2 years, more data has been produced than in all of the previous years combined. With these volumes of data flowing through the super-charged information engines in businesses today, getting RID of the 25% of dirty data is an important challenge to address. We have become experts at identifying and excluding bad data, but dirty data is a new test of existing capabilities. The organizations that win the dirty data race will be able to trust their data, and use it to make better business decisions.

Understanding dirty vs. bad data and developing a comprehensive strategy to address dirty data can be challenging. Data Stewards require trusted partners experienced with data quality management who understand complex systems and who have the know-how to reach across various business groups to effect real solutions. For assistance with recognizing, isolating, and dealing with your dirty data, contact the Data Integrity Response Team at ComResource.

“The only thing worse than dirty data is making business decisions based on dirty data.”