Responding to a Dirty Data Infection

Part one of this series is “The High Cost of Treating a Dirty Data Infection.”

The best thing you can do to protect your information is to make sure that dirty data never enters your environment. Blatantly bad data is simple to detect, and normal data validation routines keep it at bay. But dirty data can be as subtle as a slight variance in a minor field. With customer data, it’s easy to see how invalid data can wreak havoc in your environment.
Let’s take a look at a common data governance issue caused by dirty data. Kathryn Smith, Catherine Smith, and Kathy Smith are all listed with separate customer profiles on different systems, but they are actually the same person. Which scenario is most likely to occur in your organization?

  • You send each customer profile a copy of your marketing materials, tripling your marketing expense.
  • You detect that these profiles represent the same person, but you address your materials to Kathy, even though she prefers her full name to the nickname.
  • You quarantine the record so that no marketing materials are delivered to Ms. Smith.

If any of the scenarios described above becomes your reality, any marketing materials that are delivered to Ms. Smith are likely to be redirected to her trash can.
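One way to catch the Kathryn, Catherine, and Kathy profiles before they reach the mailing list is to compare records on a normalized key. The following is a minimal sketch, not a full matching engine: the nickname table, the field names, and the function names are all hypothetical, and a real table would be far larger.

```python
from collections import defaultdict

# Hypothetical nickname table; a real one would be far larger and curated.
CANONICAL_FIRST_NAMES = {
    "kathryn": "katherine",
    "catherine": "katherine",
    "kathy": "katherine",
}


def match_key(first_name, last_name):
    """Build a comparison key that ignores case, spacing, and known name variants."""
    first = first_name.strip().lower()
    last = last_name.strip().lower()
    return (CANONICAL_FIRST_NAMES.get(first, first), last)


def find_probable_duplicates(profiles):
    """Group profiles by match key and keep only groups with more than one profile."""
    groups = defaultdict(list)
    for profile in profiles:
        key = match_key(profile["first_name"], profile["last_name"])
        groups[key].append(profile)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Hypothetical sample data drawn from three different systems.
profiles = [
    {"system": "billing", "first_name": "Kathryn", "last_name": "Smith"},
    {"system": "crm", "first_name": "Catherine", "last_name": "Smith"},
    {"system": "support", "first_name": "Kathy", "last_name": "Smith"},
    {"system": "crm", "first_name": "John", "last_name": "Doe"},
]
duplicates = find_probable_duplicates(profiles)
```

A shared name is only a hint, since two different people can both be Katherine Smith. Treat the groups this returns as candidates for review against other fields, such as address or email, rather than as records to merge automatically.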

Dirty data behaves much like a contagion. To be successful, it needs a host, an entry point, a way to spread within the affected system, and a connection to infiltrate additional systems. Fortunately, it can be stopped at any step in this progression. Follow these basic guidelines to limit your exposure to a dirty data contagion:

  1. Make sure that only valid data enters your environment.
As with biological diseases, prevention is more cost-efficient than treatment.
  2. Construct your databases so that potential threats are obvious.
If you can identify dirty data before it enters your system, you can resolve a lot of headaches before they happen.
  3. Install validation logic between your databases to prevent the spread of the contagion.
If some dirty data gets through your primary defenses, logic embedded within the data transfer mechanisms can help to limit its impact on your environment.

Making sure that your systems and business processes can recognize, quarantine, and dispose of dirty data, whether from internal or external connections, helps to ensure that any resulting infection is localized and manageable.
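The recognize-and-quarantine step can be sketched as a small gate that sits in front of a database or between two of them. This is only an illustration under stated assumptions: the two rules, the field names, and the function names are hypothetical, and your own rules will depend on your schema.

```python
import re

# Hypothetical rule: a loose shape check for email addresses, not full validation.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_customer(record):
    """Return a list of problems found in the record; an empty list means it passed."""
    problems = []
    if not (record.get("last_name") or "").strip():
        problems.append("last_name is missing")
    if not EMAIL_PATTERN.match(record.get("email") or ""):
        problems.append("email is not well formed")
    return problems


def transfer(records):
    """Split incoming records into those safe to load and those held for review."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate_customer(record)
        if problems:
            # Keep the reasons with the record so a reviewer can fix or discard it.
            quarantined.append({"record": record, "problems": problems})
        else:
            accepted.append(record)
    return accepted, quarantined


# Hypothetical incoming batch: one clean record and one dirty record.
incoming = [
    {"last_name": "Smith", "email": "k.smith@example.com"},
    {"last_name": "", "email": "not-an-email"},
]
accepted, quarantined = transfer(incoming)
```

Holding rejected records alongside the reasons they failed, instead of silently dropping them, keeps the infection localized while leaving a trail that shows where the dirty data entered.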

Levels of Dirty Data Infections

Although not as dangerous to life and limb as a biological epidemic, dirty data behaves similarly. There are four levels of dirty data infection, each more severe and more complex than the last. Each step up also raises the cost to identify, isolate, and eradicate the dirty data.

Level 1 Infection        

A Level 1 Infection represents a small-scale, local issue. Only a few datasets are infected, their user base is small, and most business processes will function as expected. This type of infection is easy to overlook, as the audience is generally a couple of users, and the impact on the normal flow of business is minimal. Executive leadership may become aware of the problem only after it is resolved.

Resolve a Level 1 Infection by locating the dirty data and eradicating it, either by fixing or by deleting the infected records.

Level 2 Infection        

Level 2 Infections are a bit trickier, involving multiple datasets and their users, and making a few business processes unwieldy. This scenario is harder to resolve quietly, as the audience is much larger than with a Level 1 Infection. As more business processes are affected, the volume is turned up for the technical support team, forcing that team to operate in a crisis mode where mistakes are more common.

Resolve a Level 2 Infection by isolating the affected systems and eradicating the dirty data, just as with a Level 1 Infection. In addition, examine all of the connected systems, or routes of contagion, for signs of the infection. Activate contingency plans for all affected business processes. Get coffee and order pizzas.
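Examining the routes of contagion is easier with a map of which systems feed which. Here is a minimal sketch, assuming you can describe your data feeds as a simple dictionary; the system names and the function name are hypothetical. It walks outward from the known infected systems and lists everything downstream that deserves inspection.

```python
from collections import deque


def systems_to_inspect(feeds, infected):
    """Return every system reachable downstream from the infected ones via data feeds."""
    to_inspect = set(infected)
    queue = deque(infected)
    while queue:
        system = queue.popleft()
        for downstream in feeds.get(system, []):
            if downstream not in to_inspect:
                to_inspect.add(downstream)
                queue.append(downstream)
    return to_inspect


# Hypothetical map: each system lists the systems it sends data to.
feeds = {
    "web_signup": ["crm"],
    "crm": ["billing", "marketing"],
    "billing": ["warehouse"],
    "hr": ["payroll"],
}
suspects = systems_to_inspect(feeds, ["crm"])
```

With this sample map, an infection found in the CRM system flags billing, marketing, and the warehouse, while HR and payroll are left alone. The sketch follows feeds in the downstream direction only; running the same traversal over a reversed map would point back toward possible entry points such as the signup form.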

Level 3 Infection        

Level 3 represents an epidemic of dirty data, with nearly all systems and datasets infected. The majority of business processes cannot function normally, and visibility is at an enterprise-wide level. Irritated users storm the Help Desk, customers cancel their accounts, and the city where your business functions declares a paper shortage emergency due to your contingency plans, which require pen and paper to process the transactions that your technology systems can no longer process. This is a very dark period for the IT department as they struggle to bring systems back into alignment with expected operating parameters, flushing the dirty data into containment, and usually flushing a good deal of clean data with it.

Responding to a Level 3 Infection successfully requires resources that most organizations do not have at their disposal. All systems are quarantined and examined. Business processes slow to snail speed and costs for paper skyrocket as contingency plans are activated. Connections to outside data sources are suspended, causing a backlog when those services are restarted. Cleaning datasets is just part of the problem in a Level 3 situation. Keeping them clean as the environment comes back online is equally challenging.

Level 4 Infection        

Level 4 is characterized by rampant ambiguity and mistrust of datasets. As one dataset is cleaned, ten more are being infected, and it doesn’t take long for your newly cleaned dataset to get dirty again. Customers lose faith in your ability to protect and manage their information, and move their business elsewhere. Employees realize the extent of the contagion, and decide that their skills will be better appreciated by competitor organizations that maintain a level of clarity and control over their internal mechanisms. There is not enough paper available to conduct all business processes in contingency mode.

If you have a Level 4 Infection, you’re going to need every available resource to identify remaining pockets of clean data, reroute the internal and external connections to use clean datasets exclusively, and rebuild your information environment. This process is akin to switching out the transmission on a car while it is driving down the highway. At this point, though, it is likely that your car is doing about 5 MPH in a zombie-infested post-apocalyptic landscape, and your mechanic, if you can find one, will only work on the engine from the passenger compartment.

The best approach is to avoid a Level 4 Infection.

Understanding complex systems and targeting the entry and transfer points for dirty data can be challenging. For assistance with isolating your dirty data and developing a treatment strategy to achieve clean, trusted data, contact the Data Integrity Response Team at ComResource.

“The only thing worse than dirty data is making business decisions based on dirty data.”