The Dangers of Too Much Data

Many people love a good ham sandwich. Many, but not all.

For some, it’s a question of taste, or a rejection of all meats on ethical, environmental or spiritual grounds. For others, it’s simply an aversion to the long-term health dangers associated with ultra-processed foods, including cured meats. When it comes to food choices, as in most things, we all walk the line between now and later, between our principles and pleasures.

3 Qualities of Good Data

It’s smart. Data must have context as well as labels that help give it sense.
It’s clean. The dataset you are optimizing against must be completely free of signals based on bot activity.
It’s purposeful. Data must be accurate and complete and also have purpose.

Navigating between reactive rewards and long-term interests is also common to many modern businesses. It’s often a question of balancing immediate business needs against ultimate goals, or of taking advantage of the moment while making sure we are tracking toward our objectives.

Should we use all the data we have available? Given regulatory and governance pressures, we increasingly need to know where our data has been sourced, how it has been processed and who made it. Is it safe? Is it of high quality? How much of it can we store? Who can we share it with?

More Data Does Not Mean Better Data

Decision-making science tells us that having some data is generally better than having no data. Researchers once ran a study in which a group of professional gamblers were given progressively more data while the accuracy of their bets was continually measured.

The study confirmed that some data is better than none. But after a certain point, giving a gambler more data actually decreased the accuracy of their bets rather than increasing it (Slovic and Lichtenstein 1973).

This is largely due to what can be referred to as the signal-to-noise ratio. In any dataset, there is signal (important information that you must heed) and noise (meaningless, distracting information). As a rule, more data means more noise, not more signal. If there is too much data, its quality and utility can become questionable.
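
A rough way to see why extra data tends to add noise: suppose each additional variable carries no real signal, yet has a small chance of looking meaningful purely by coincidence. The sketch below illustrates that assumption only and is not a model of the gambling study. The function name and the 5 percent false-alarm rate are chosen for the example, and the variables are assumed to be independent.

```python
def chance_of_false_signal(noise_variables: int, false_alarm_rate: float = 0.05) -> float:
    """Probability that at least one pure-noise variable looks meaningful by chance.

    Assumes the variables are independent and that each one has the same
    small chance (false_alarm_rate) of passing for signal.
    """
    if noise_variables < 0:
        raise ValueError("noise_variables must be non-negative")
    if not 0.0 <= false_alarm_rate <= 1.0:
        raise ValueError("false_alarm_rate must be between 0 and 1")
    # Chance that every noise variable is correctly ignored, then its complement.
    return 1.0 - (1.0 - false_alarm_rate) ** noise_variables


if __name__ == "__main__":
    for count in (1, 10, 50, 100):
        share = chance_of_false_signal(count)
        print(f"{count:>3} noise variables: {share:.0%} chance of a false signal")
```

Under these assumptions, the chance of being misled climbs with every variable added, while the amount of genuine signal stays exactly the same.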

Unmanaged, Inaccurate Data Can Endanger Consumers

A perfect illustration of how the unfettered use of big data can go wrong is the story of James and Theresa Arnold.

Butler County, Kansas, sits close to the calculated geographical center of the 48 contiguous United States. It’s Bullseye USA for data and map geeks.

The Arnolds moved into their 623-acre farm in Butler County, Kansas, in March 2011. Over the next few years, they had countless visits from law enforcement authorities investigating a series of crimes. Tax fraud, stolen cars, stolen credit cards and even the illicit production of pornographic films were all connected to this one Butler County farm. Either that location was a one-farm crimewave, involving a horrific concentration of events, or a systematic error led to this family being falsely interrogated.

It was the latter, and the cause was an IP geolocation analytics company. These companies store, process and help connect IP addresses to wider datasets. Specifically, they provide geographical coordinates for IP addresses. Give them an IP address and they will tell you where it’s officially registered. For the most part.

But IP addresses can be unreliable sources of information. Geolocation analytics companies know the ins and outs of geographically classifying IP addresses. Whenever they come across IP addresses that are particularly hard to locate, they put them in a digital bucket. That bucket is simply assigned the exact geographical center of the United States (or a convenient set of coordinates near that center).

Whenever a tech-savvy criminal masked an IP address, the company would classify the activity accordingly. The location of that Butler County farm would then pop up in the database, which authorities tapped into, resulting in visits and raids on the innocent farm, day and night. This went on for years until the family took legal action.
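
A consumer of this kind of database could, in principle, guard against the problem by treating any coordinate shared by an implausible number of unrelated IP addresses as a placeholder rather than a real location. The sketch below assumes a simple list of (IP address, latitude, longitude) records. The function name, the threshold and the sample coordinates are hypothetical and are not taken from any real geolocation product.

```python
from collections import defaultdict


def find_suspect_coordinates(records, max_ips_per_point=3):
    """Return coordinates that an implausible number of distinct IPs resolve to.

    records: iterable of (ip_address, latitude, longitude) tuples.
    max_ips_per_point: hypothetical threshold; it would need tuning on real data.
    """
    ips_by_point = defaultdict(set)
    for ip_address, latitude, longitude in records:
        ips_by_point[(latitude, longitude)].add(ip_address)
    # Keep only the points shared by more distinct IPs than the threshold allows.
    return {
        point: len(ips)
        for point, ips in ips_by_point.items()
        if len(ips) > max_ips_per_point
    }


# Made-up sample: five unrelated IPs all resolve to one "default" point.
sample = [
    ("203.0.113.1", 40.0, -100.0),
    ("203.0.113.2", 40.0, -100.0),
    ("198.51.100.7", 40.0, -100.0),
    ("198.51.100.8", 40.0, -100.0),
    ("192.0.2.44", 40.0, -100.0),
    ("192.0.2.45", 41.88, -87.63),
    ("192.0.2.46", 34.05, -118.24),
]

if __name__ == "__main__":
    print(find_suspect_coordinates(sample))  # {(40.0, -100.0): 5}
```

Records that land on a flagged point are better read as "location unknown" than as evidence about whoever happens to live there.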

Unchecked Data Can Spark Privacy Risks

User privacy and data security have become focal issues for the digital measurement industry. Monitoring and tracking user behavior is increasingly unsustainable, and existing approaches that track, monitor or fingerprint users will face growing privacy challenges.

The challenge for most businesses is that the type and nature of data that can be considered personal is a fluid and expanding beast. It is no longer confined to email addresses or definitive personal identifiers, but extends to anything that can be combined with other datasets to profile an individual. The collection and use of IP addresses themselves, a natural byproduct of most digital advertising campaigns, is increasingly in the crosshairs of regulators.

As regulators expand their privacy regulations, any business that doesn’t filter and manage its collected, bought and borrowed datasets will run the risk of attracting fines, reputational damage and more. Facebook’s 1.2 billion euro fine is the most recent large example.

More Incoming Bad Data Means More Misinformation

We are all witnessing the birth of popular generative AI tools. ChatGPT has become one of the fastest-growing consumer applications and is finding everyday uses in many business areas.

According to some (including some governments in Europe), it is also facilitating the provision of inaccurate or misleading information, while failing to notify users of its data collection practices and failing to meet any of the GDPR-level justifications for processing personal data.

This will leave many businesses open to statutory risks that are just emerging. It will also put a premium on business processes that manage and filter any incoming AI-generated data being relied upon for fundamental decision making. In some sectors, such as advertising, determining what is fake and what is real was already a challenge for most companies.

Determining brand safety and suitability of environment only becomes more challenging as the noise drowns out the signal. Imagine a world where most data and imagery are AI-generated. If we thought the internet to date was the Wild West, we are now on the verge of a veritable Oklahoma land rush.

More Processed Data Affects the Environment

The cloud is now used to describe any remote data storage and computing. It’s weightless and intentionally vague: your data is up there somewhere, in a better place, where you can forget about it. It’s in sharp contrast to the industrial reality of millions of remote servers, tucked sometimes underground in data centers that are gigantic, loud and require tremendous amounts of energy. We may imagine the digital cloud as placeless, mute, ethereal and unmediated. Yet the reality of the cloud is embodied in thousands of these massive data centers.

The planet contains more than seven million such data centers, any one of which can use as much electricity as a mid-sized town. They are also, notably, the biggest contributor to carbon emissions in global IT.

By some estimates, data centers worldwide use more than 2 percent of the world’s electricity and generate the same volume of carbon emissions as the global airline industry (in fuel consumption terms).

The Solution to Too Much Data

The solution is to prioritize quality data gathered over a long period over big data gathered within a short period. Quality and time separate signal from noise. Where possible, seek data that is granular, privacy-safe and broad in coverage. It must be smart, clean and purposeful:

Smart. The data must have context. It has to have labels that help give it sense. Naked numbers are pieces of data that have no context or sense. Naked numbers are pervasive in digital advertising and pollute the ecosystem. A lot of naming conventions and labeling systems are frankly all over the place. Companies that do well operate strict naming conventions and encourage their tech partners to do the same.

Clean. Given that, by some estimates, as much as half of the digital ecosystem is driven by bots, and increasingly infused with AI, there must be a guarantee that the dataset you are optimizing against is completely free of signals based on bot activity.

Purposeful. Data needs to not only be accurate and complete, but it also has to have purpose. At the most basic level, the purpose of a dataset is driven by who is paying for it. But the best types of measures are ones with purpose.
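
As a minimal sketch of how the first two qualities might be enforced before a dataset is used, the snippet below drops rows that lack a well-formed label and rows that are not confirmed to be free of bot activity. The field names (`campaign`, `is_bot`, `clicks`), the naming pattern and the existence of an upstream bot flag are all assumptions made for illustration. Detecting bots in the first place is a separate and harder problem that this example does not attempt.

```python
import re

# Hypothetical naming convention: brand_channel_YYYYMM, all lowercase.
LABEL_PATTERN = re.compile(r"[a-z0-9]+_[a-z0-9]+_\d{6}")


def keep_smart_clean_rows(rows):
    """Keep rows that carry a conforming label and are confirmed not to be bots."""
    kept = []
    for row in rows:
        label = row.get("campaign") or ""
        if not LABEL_PATTERN.fullmatch(label):
            continue  # "naked number": no usable context or label
        # A missing flag is treated as unverified, so the row is dropped.
        if row.get("is_bot", True):
            continue
        kept.append(row)
    return kept


# Made-up sample rows for illustration only.
sample_rows = [
    {"campaign": "acme_search_202401", "is_bot": False, "clicks": 120},
    {"campaign": "", "is_bot": False, "clicks": 95},
    {"campaign": "acme_social_202401", "is_bot": True, "clicks": 4000},
    {"campaign": "Acme Display Jan", "is_bot": False, "clicks": 30},
    {"campaign": "acme_video_202401", "clicks": 12},
]

if __name__ == "__main__":
    for row in keep_smart_clean_rows(sample_rows):
        print(row)
```

Only the first sample row survives: the others are unlabeled, flagged as bot traffic, off-convention or unverified. Purpose is not something a filter like this can check; it has to be settled before the data is collected.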