A Data Scientist You’ve Never Heard of Is Now the Master of Your Domain

You’d imagine that Allstate is pretty good by now at predicting which kinds of cars are most likely to get into accidents, wouldn’t you? After all, this is kind of what they do for a living, and they’ve got over 80 years of experience doing it. They also employ a lot of people to build mathematical models that predict accident rates, and to refine these models over time.

So how can it be that a small team of people who don’t even work for the company was able, in three months, to achieve a 340% improvement over Allstate’s ability to predict bodily injury insurance claims based on car characteristics? And how was the team able to do this while working only with disguised data — without, in other words, knowing the true makes and models of the cars?

Improvement in prediction accuracy over time in Allstate-sponsored contest at Kaggle. Image from Kaggle (http://host.kaggle.com/casestudies/allstate)

Welcome to the weird new world of big data.

I’ve spent the past two days at O’Reilly Media’s Strata conference, trying to get my mind around the phenomenon of big data. I went in kind of skeptical, thinking that there might not be a lot of novelty here, that ‘big data’ is just today’s marketing term for what we were calling ‘analytics’ or ‘business intelligence’ yesterday. But as I’m sitting here on the flight home, I’m quickly coming to the conclusion that big data is in fact real, new, and important.

I see it as the confluence of four trends:

  • Continuing improvement in computing resources (processing, bandwidth, storage, memory, power consumption, etc.) to the point that most of us have what Autodesk CEO Carl Bass calls ‘infinite computing.’
  • Ever-greater quantities of digital data from search engines, weblogs, social networks, mobile devices, sensors, and so on. Potentially valuable data used to be relatively scarce and pretty well-behaved, sitting in relational databases. Now it’s all over the place and unruly as hell.
  • New tools to grab all this data and make use of it. This is what Hadoop is for, and its extraordinary growth shows how much demand there is for software that brings together the above two trends.
  • Advances in the discipline of machine learning, which is just about what it sounds like. We’ve gotten a lot smarter about teaching machines how to get smarter at tasks ranging from driving cars to playing Jeopardy! to detecting fraud to predicting bodily injury insurance claims.

There’s a lot more to say about each of these and I’ll do so in later posts and writing. Right now I want to home in on the final trend listed above.

My data scientist friends tell me that machine learning has come a long way in recent years, and that the algorithms, tools, and techniques are much better than they were even a decade ago. This implies that it’s a young person’s field (and the crowd at Strata certainly bore that conclusion out). It also implies that large incumbent companies like Allstate might not be too good at it yet, and that the models they use would be considered old school by today’s data scientists.

So wouldn’t head-to-head competitions between the old and new schools of using data to make predictions be interesting? If you answered ‘yes,’ go check out Kaggle, which runs data prediction competitions. Companies like Allstate upload a dataset (scrubbed enough to deal with privacy and confidentiality issues), specify what they’re trying to predict, post a reward, then let the games begin.

Individuals and teams from all over the world can download the data, build a model that uses these data to generate a prediction, upload the prediction to Kaggle and instantly learn how accurate it is, and keep iterating through this cycle. Kaggle keeps track of and posts the leading score. In Allstate’s case, the lead changed well over a dozen times in the course of the three-month competition.
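For readers who think in code, here is a rough sketch of that download, model, submit, score cycle. Everything in it is an assumption made for illustration: the data are synthetic (a single made-up car feature and a made-up claim amount), the scoring function is plain mean squared error rather than whatever metric a real contest uses, and the two "entrants" are toy models. It is not Kaggle's API or Allstate's data, just the shape of the loop.

```python
import random

def score(predictions, actuals):
    """Mean squared error, lower is better. A stand-in for the host's metric."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)

def fit_mean(train):
    """Toy entrant 1: predict the average claim for every car."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean

def fit_line(train):
    """Toy entrant 2: one-feature least-squares line."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def make_data(n, rng):
    """Synthetic (feature, claim) pairs: claim = 2 * feature + noise."""
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 2.0 * x + rng.gauss(0, 0.5)))
    return rows

rng = random.Random(0)
train = make_data(200, rng)    # what entrants download
holdout = make_data(100, rng)  # what the host keeps hidden for scoring

leaderboard = []  # every submission, in order
best = None       # current leader as (name, score)
for name, fit in [("mean baseline", fit_mean), ("least-squares line", fit_line)]:
    model = fit(train)                             # build a model
    preds = [model(x) for x, _ in holdout]         # generate predictions
    s = score(preds, [y for _, y in holdout])      # host scores the upload
    leaderboard.append((name, s))
    if best is None or s < best[1]:                # the lead changes hands
        best = (name, s)

for name, s in leaderboard:
    print(f"{name}: {s:.3f}")
print("leader:", best[0])
```

The point of keeping the holdout answers on the host's side is that entrants only ever see a score, so the only way to climb is to build a model that genuinely predicts better.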

It’s worth repeating: the eventual winner was almost 3.5 times better at predicting insurance claims using car data than Allstate itself was. Of course, this is a highly limited test. Allstate takes into account many things beyond car data in order to arrive at its best prediction. But do you think that if the data scientists who congregate around Kaggle had access to all this data, they could beat Allstate’s best prediction, and beat it by, if not 340%, then by a lot?

I do, which is why I’m a big data believer.

Early-stage results like the ones we’re seeing from Kaggle should start causing companies in many industries to ask themselves some very uncomfortable questions, like:

Are we really the ‘domain experts’ that we think we are? If a bunch of kids and outsiders, who don’t know our customers, markets, and histories at all, can make much better predictions than we can in critically important areas, what does this imply? Do we need to completely rethink how we make our predictions? Do we need to outsource or crowdsource the making of predictions, even if we currently think of it as a ‘core competence’? In the world of big data, how much has relevant expertise shifted, and where has it shifted to?

I’ll be diving in on these questions. I suggest you do the same…