Big Data Limitations


Big Data is here to stay, and companies that don’t understand it could eventually be left behind.

Unfortunately, many people don’t fully understand what Big Data can do. Just as important, though, many people don’t understand what Big Data can’t do. Understanding the limitations of data analysis is essential to using Big Data to its fullest extent. In order to improve anything, we must understand what needs to be improved.

What is Big Data?

Big Data, in its simplest form, is a collection of data so large that a company’s internal servers generally can’t handle the scale of information. Big Data is most prominent in healthcare, meteorology, business, physics, large government surveys, and countless other fields. In the marketing world, Big Data is used to find everything there is to know about the consumer or business. Marketers realize that Big Data is an incredibly useful tool, but mistakes are inevitable with so much data. The goal, then, should be to limit mistakes as much as possible so that the results are as close to real life as possible.

“Correlation is not causation.”

This phrase is uttered by proponents and opponents of data analysis alike. Understanding that correlation is not causation is a strength, not a weakness. Knowing that much of our data only correlates allows us to avoid making overconfident decisions.

For example, leads correlate very strongly with new business but don’t guarantee it. Data analysts rely on correlation much more than causation because it avoids dealing in absolutes. Big Data attempts to model real-world events, where almost nothing is absolute. This means that extremely confident predictions, or anyone claiming to have proven causation, should be met with skepticism. It is tempting to want a sure thing, but these predictions often don’t pan out. It may be more beneficial to use correlation in these instances rather than blindly follow an incorrect analysis.
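To make the leads example concrete, here is a minimal sketch in Python. The monthly figures are invented for illustration, and `pearson_r` is a hand-written helper rather than anything from a particular analytics tool. It measures how strongly leads and new customers move together, then lists the months where leads rose but new customers fell.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a series never changes")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly figures, made up for this example.
leads = [120, 135, 150, 160, 180, 210]
new_customers = [10, 12, 11, 15, 14, 19]

r = pearson_r(leads, new_customers)
print(f"Correlation between leads and new customers: {r:.2f}")

# Months (by index) where leads went up but new customers went down.
exceptions = [
    i for i in range(1, len(leads))
    if leads[i] > leads[i - 1] and new_customers[i] < new_customers[i - 1]
]
print(f"Months where more leads meant fewer new customers: {exceptions}")
```

With these made-up numbers the correlation is strong, yet two of the five month-to-month changes go the opposite way. A high correlation describes a tendency; it is not a guarantee for any single month, and it says nothing about what actually caused the new business.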

Data is Lossy

Much like converting analog music to digital, Big Data tries to code real events into numbers, and some detail is lost along the way. In sports, stats help show who the best players are, but the data does not say whether players are liked by the rest of the team, how hard they train, or what motivates them. It is impossible to tell an entire story with data, but segmenting the data may help. Again in the context of sports, perhaps you believe your favorite team is winning more after a new coach is brought in. You could segment the data to look at team performance with the old coach and compare that to team performance with the new coach. This adds context to the numbers and helps eliminate some of the questions that data can’t answer. Looking at data often leaves you wondering “why” or “how” something happened. That is because nothing is better than data analysis at explaining what is happening, but it needs help when trying to go beyond that.
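The coach comparison can be sketched in a few lines of Python. The game results below are hypothetical, and `win_rate_by_segment` is just an illustrative helper name. The idea is simply to split the same data by a piece of context (who was coaching) before summarizing it.

```python
from collections import defaultdict

# Hypothetical results; each record is (coach, won_the_game).
games = [
    ("old coach", True), ("old coach", False), ("old coach", False),
    ("old coach", True), ("old coach", False),
    ("new coach", True), ("new coach", True), ("new coach", False),
    ("new coach", True),
]

def win_rate_by_segment(records):
    """Return {segment: (wins, games_played, win_rate)} for each segment."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [wins, games_played]
    for segment, won in records:
        totals[segment][0] += int(won)
        totals[segment][1] += 1
    return {seg: (wins, n, wins / n) for seg, (wins, n) in totals.items()}

summary = win_rate_by_segment(games)
for segment, (wins, n, rate) in summary.items():
    print(f"{segment}: {wins}/{n} wins ({rate:.0%})")
```

The split shows what changed between the two periods, not why it changed. A new coach, an easier schedule, or a healthier roster would all produce the same numbers, and with only a handful of games per segment the difference could also be chance.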

Context is Everything

Data isn’t inherently good or bad; context is what makes data useful or useless. Data needs context to be effective, and this context is provided by whoever is looking at the data. It can be sculpted and arranged to fit arguments. Be wary of “vanity” metrics, or data that seeks to impress rather than to tell a larger story. Many times, vanity metrics aren’t supported by meaningful underlying data. Vanity metrics are great starting points, but they don’t tell you anything about conversions. They can be anything from likes and fans on Facebook to pageviews and visits on a web page.

Sample Size

The amount of data being analyzed is also a factor worth considering. For example, if I say, “the unemployment rate is at 30% in population A,” that statement has considerably different meanings for a population of 10 people compared to 10,000 people. Again, it is all about context and patience. If the immediate results aren’t great, don’t change anything just yet.

The same can be said about positive results. Don’t get caught up in excitement either. Results aren’t always significant. Chance, external factors, and industry trends need to be on your mind.
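To put rough numbers on the 30% unemployment example, the sketch below computes a 95% Wilson score interval for the same observed rate at both sample sizes. This is one common way to express the uncertainty in a proportion. It assumes the people counted are a random sample from a larger group, which the example above does not state, and `wilson_interval` is a hand-written helper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 is roughly 95% confidence."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

intervals = {}
for n in (10, 10_000):
    unemployed = n * 30 // 100  # 30% of the sample
    low, high = wilson_interval(unemployed, n)
    intervals[n] = (low, high)
    print(f"n={n}: observed 30%, plausible range {low:.1%} to {high:.1%}")
```

With 10 people, the plausible range runs from roughly 11% to 60%, so the “30%” figure tells you very little. With 10,000 people, it narrows to about 29% to 31%. The headline number is identical in both cases; only the sample size reveals how much weight it can bear.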

Respecting Data

In the end, it comes down to respecting data. Understanding what it can and can’t do is the only way to avoid assuming false causation, making decisions on small samples of data, and misrepresenting your findings. This respect needs to be constant, because as long as you understand the limitations of Big Data, it will prove to be the powerful tool it’s meant to be.

Many of the ideas here are covered further in Nate Silver’s book, The Signal and the Noise. I highly recommend reading the book if you want an interesting, more in-depth look at data prediction.