So today I’m unleashing my pent-up rage on the “Big Data” crew: devotees and neophytes alike.
You do not have a big data problem. You have a functional ignorance problem.
Go back and read that a few times if necessary. Or to put it another way:
“Before you turned to big data, did you first try ‘small data’(tm)?”
Or to put it yet a third and more direct way:
“What’s your question?”
Most people who are “turning to big data” in their time of need don’t even know the question that they are questing for. As a result, many of the current “big data” set (pun intended) are collecting exabytes of data to hide their collective ignorance…
Start-ups are cheapening the term by using it to prop up an endless series of questionable business models and generally bad ideas.
The companies that are at the forefront of “Big Data” are there because they are solving interesting problems - at scale. Most people mistake problems at scale for the scale of their problems. What do I mean by this? Consider: Google set out to build a system that could search the web and help you find the information you wanted when you wanted it. They did not set out to build a business on top of MapReduce… Netflix started out with the mission of helping me find the movies that I want to watch and get them to me… They didn’t start out with the mission of becoming the leader in applied machine learning and recommendations. The product provides the reason for the technology.
What we now call machine learning was called descriptive statistics in the ’60s and ’70s…
Models often have hidden flaws that only get exposed in the real world. Your model can be fine with test data, but break on production data. The sad part is that you won’t know that it’s broken until your auto-suggestion algorithm starts recommending home euthanasia kits to people searching for elder care books.
This post isn’t to vilify every company that mentions data analysis as being core to their product… This post is an attempt to throw the wet blanket of reality onto the bonfire of investment that seems to be throwing perfectly good VC cash down the drain in the hopes that analyzing the data from your last game-mechanic-social-coupon-buying hot thing will finally have them making money instead of spending it…
Three places to look before you hit big data
Classical descriptive statistics.
Every time you map reduce without drawing a box plot, God murders a marmoset. For most data sets, starting with box plots and histograms does no harm and provides valuable insight on how to proceed. Far too often, we have one tiny nail to drive and we reach for a gigantic sledgehammer.
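To make the "small data first" point concrete, here is a minimal sketch of the numbers behind a box plot and a histogram, using nothing but Python's standard library. The function names `five_number_summary` and `text_histogram` are made up for this illustration, and the choice of the "inclusive" quartile method is an assumption; your plotting tool may compute quartiles slightly differently.

```python
import statistics


def five_number_summary(values):
    """Return the five numbers a box plot is drawn from."""
    data = sorted(values)
    # "inclusive" treats the data as the whole population (an assumption).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


def text_histogram(values, bins=5):
    """Count how many values fall into each of `bins` equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid a zero width when all values match
    counts = [0] * bins
    for v in values:
        # The maximum value lands in the last bucket rather than overflowing.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data
    print(five_number_summary(sample))
    for count in text_histogram(sample, bins=4):
        print("#" * count)
```

If a few lines like these already answer your question, you probably did not need the cluster.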
Computers are awesome at simulating things… Instead of recording every possible piece of information - try recording a little bit of information and simulating the rest. This very technique lies at the heart of one of the more powerful techniques in the statistical toolbox: Monte Carlo Simulation.
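As a rough sketch of "record a little, simulate the rest": suppose you only wrote down one week of daily order counts and want a feel for what a 30-day month might total. One simple Monte Carlo approach (a bootstrap-style resampling, which is my choice for this example, not the only way to do it) is to draw days at random from the week you recorded, many times over. The numbers, the function names, and the assumption that days are independent and look like your recorded week are all hypothetical.

```python
import random
import statistics


def simulate_totals(sample, days=30, trials=10_000, seed=0):
    """Simulate `trials` possible period totals by resampling recorded days."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    return [sum(rng.choices(sample, k=days)) for _ in range(trials)]


def summarize(totals):
    """Return the mean and a rough 5th-to-95th percentile range."""
    ordered = sorted(totals)
    low = ordered[int(0.05 * len(ordered))]
    high = ordered[int(0.95 * len(ordered)) - 1]
    return statistics.mean(ordered), low, high


if __name__ == "__main__":
    recorded_week = [12, 15, 9, 20, 14, 11, 17]  # hypothetical daily counts
    mean, low, high = summarize(simulate_totals(recorded_week))
    print(f"expected monthly total ~{mean:.0f}, likely between {low} and {high}")
```

Seven recorded numbers and a loop will not replace real measurement, but they are often enough to tell you whether the question is worth a bigger hammer.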
Ignore it and it might go away / be unimportant
That’s right - my favorite technique is to go focus on something else: like your business model… Some questions just aren’t that important or intriguing. Let me revise that: MOST questions just aren’t that important. Are you asking questions that are central to improving your product or service?
We live in interesting times for data. If we are careful with the ideas that are now available to us, and the terms we use to describe them, data big and small will start driving decisions that can help everyone.
If we continue down this road of using “big data” as a crutch for weak business models and terrible “products” … VCs, investors and industry will see our statistically driven ideas as mere tech-bubble snake oil.
This is worth the time to scroll through, as it documents the effort necessary to dispel an incorrect rumor propagated via social media. The means of dispelling it was also social media, Twitter in particular: one man and his followers. Note how they had to go through it all a second time, just like a lump in a rug that gets pushed down in one place only to pop up in another!
There remains some lingering uncertainty whether the correction was heeded. That is not their fault.
Andy and his Tweeps did a great job.
How Andy Carvin used Twitter to debunk a geopolitical rumor via soupsoup:
This morning, I pulled out my iPad to read The New York Times feature entitled Growing Up Digital, Wired for Distraction. After reading a few hundred words, I tweeted about reading it. Then I realized it was something like 4,000 words, so I took a break to go check Twitter. Then Facebook. Then my email. Then Yammer. Then I came back to reading — for another 1,000 words or so, before an Instagram Push Notification popped up. I hopped over there. Then I came back and finished the article…