A Data Scientist You’ve Never Heard of Is Now the Master of Your Domain

by Andrew McAfee on March 2, 2012

You’d imagine that Allstate is pretty good by now at predicting which kinds of cars are most likely to get into accidents, wouldn’t you? After all, this is kind of what they do for a living, and they’ve got over 80 years of experience doing it. They also employ a lot of people to build mathematical models that predict accident rates, and to refine these models over time.

So how can it be that a small team of people who don’t even work for the company was able, in three months, to achieve a 340% improvement over Allstate’s ability to predict bodily injury insurance claims based on car characteristics? And how was the team able to do this while working only with disguised data — without, in other words, knowing the true makes and models of the cars?

Improvement in prediction accuracy over time in Allstate-sponsored contest at Kaggle. Image from Kaggle (http://host.kaggle.com/casestudies/allstate)

Welcome to the weird new world of big data.

I’ve spent the past two days at O’Reilly Media’s Strata conference, trying to get my mind around the phenomenon of big data. I went in kind of skeptical, thinking that there might not be a lot of novelty here, that ‘big data’ is just today’s marketing term for what we were calling ‘analytics’ or ‘business intelligence’ yesterday. But as I’m sitting here on the flight home, I’m quickly coming to the conclusion that big data is in fact real, new, and important.

I see it as the confluence of four trends:

  • Continuing improvement in computing resources (processing, bandwidth, storage, memory, power consumption, etc.) to the point that most of us have what Autodesk CEO Carl Bass calls ‘infinite computing.’
  • Ever-greater quantities of digital data from search engines, weblogs, social networks, mobile devices, sensors, and so on. Potentially valuable data used to be relatively scarce and pretty well-behaved, sitting in relational databases. Now it’s all over the place and unruly as hell.
  • New tools to grab all this data and make use of it. This is what Hadoop is for, and its extraordinary growth shows how much demand there is for software that brings together the above two trends.
  • Advances in the discipline of machine learning, which is just about what it sounds like. We’ve gotten a lot smarter about teaching machines how to get smarter at tasks ranging from driving cars to playing Jeopardy! to detecting fraud to predicting bodily injury insurance claims.

There’s a lot more to say about each of these and I’ll do so in later posts and writing. Right now I want to home in on the final trend listed above.

My data scientist friends tell me that machine learning has come a long way in recent years, and that the algorithms, tools, and techniques are much better than they were even a decade ago. This implies that it’s a young person’s field (and the crowd at Strata certainly bore that conclusion out). It also implies that large incumbent companies like Allstate might not be too good at it yet, and that the models they use would be considered old school by today’s data scientists.

So wouldn’t head-to-head competitions between the old and new schools of using data to make predictions be interesting? If you answered ‘yes,’ go check out Kaggle, which runs data prediction competitions. Companies like Allstate upload a dataset (scrubbed enough to deal with privacy and confidentiality issues), specify what they’re trying to predict, post a reward, then let the games begin.

Individuals and teams from all over the world can download the data, build a model that uses these data to generate a prediction, upload the prediction to Kaggle and instantly learn how accurate it is, and keep iterating through this cycle. Kaggle keeps track of and posts the leading score. In Allstate’s case, the lead changed well over a dozen times over the course of the three-month competition.
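The iterate-and-score loop works even without a leaderboard: hold out a validation set to play the role of Kaggle’s hidden test data, and check whether each new model beats the last one’s score. Here is a minimal toy sketch in Python. Everything in it is invented for illustration — the single “horsepower” feature, the synthetic claim amounts, and RMSE as the scoring metric bear no relation to Allstate’s actual data or the contest’s real evaluation measure.

```python
import random

def rmse(preds, actual):
    """Root mean squared error -- a stand-in for whatever metric a contest uses."""
    return (sum((p - a) ** 2 for p, a in zip(preds, actual)) / len(actual)) ** 0.5

# Toy "claims" data: payout loosely tied to one made-up car feature.
random.seed(0)
horsepower = [random.uniform(80, 300) for _ in range(1000)]
claims = [0.02 * hp + random.gauss(0, 1) for hp in horsepower]

# Hold out a validation set, playing the role of Kaggle's hidden leaderboard data.
train_x, valid_x = horsepower[:800], horsepower[800:]
train_y, valid_y = claims[:800], claims[800:]

# Iteration 1: predict the training mean for everyone (the naive baseline).
mean_claim = sum(train_y) / len(train_y)
baseline_score = rmse([mean_claim] * len(valid_y), valid_y)

# Iteration 2: a one-feature least-squares fit, computed by hand.
n = len(train_x)
mx = sum(train_x) / n
my = sum(train_y) / n
slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / \
        sum((x - mx) ** 2 for x in train_x)
intercept = my - slope * mx
model_score = rmse([slope * x + intercept for x in valid_x], valid_y)

print(f"baseline RMSE: {baseline_score:.3f}, model RMSE: {model_score:.3f}")
```

Each pass through this loop — build a model, score it against held-out data, keep it only if it beats the incumbent — is exactly what Kaggle’s leaderboard automates for thousands of competitors at once.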

It’s worth repeating: the eventual winner was almost 3.5 times better at predicting insurance claims using car data than Allstate itself was. Of course, this is a highly limited test. Allstate takes into account many things beyond car data in order to arrive at its best prediction. But do you think that if the data scientists who congregate around Kaggle had access to all of those data, they could beat Allstate’s best prediction, and beat it, if not by 340%, then by a lot?

I do, which is why I’m a big data believer.

Early-stage results like the ones we’re seeing from Kaggle should start causing companies in many industries to ask themselves some very uncomfortable questions, like:

Are we really the ‘domain experts’ that we think we are? If a bunch of kids and outsiders, who don’t know our customers, markets, and histories at all, can make much better predictions than we can in critically important areas, what does this imply? Do we need to completely rethink how we make our predictions? Do we need to outsource or crowdsource the making of predictions, even if we currently think of it as a ‘core competence?’ In the world of big data, how much has relevant expertise shifted, and where has it shifted to?

I’ll be diving in on these questions. I suggest you do the same…

chieftech March 2, 2012 at 2:39 pm

Andrew – you might find the presentation Nicholas Gruen (chairman of Kaggle at the time) gave at the Dachis Group’s Social Business Summit last year in Singapore of interest, where he talks about Kaggle, big data, and open data:

The Coming Revolution in Data

Anonymous March 2, 2012 at 3:37 pm

Well said, Andrew. 2 thoughts come to mind:
1 – how are orgs positioning themselves to run this kind of analysis? In the pre-big data era, data sat in silos that were heavily structured and didn’t speak to each other. That’s just beginning to change now, but curious about your thoughts on what orgs need to do internally to even get to the point where unstructured data is clean enough to gain deep insight
2 – should we be concerned/hopeful about companies that have theoretical access to all the data? Cloud consolidation means that SFDC and co have the ability to manipulate data in ways that Google and others have done on the personal front. I’m less worried about privacy than I am about understanding what they actually know and have permission to do. 

My own thoughts on analytics in the era of big data: http://wp.me/p1OWHB-fs 

Guest March 2, 2012 at 4:48 pm

The dataset in the Allstate challenge is only about 340 MB (zipped). This has nothing to do with big data, just with a smart application of machine learning and statistics.

Guest March 3, 2012 at 12:48 pm

Interesting post, Andrew. I wonder: how much of that 340% improvement do you think was due to new big data techniques, and how much to the competition format motivating data scientists in a way that they aren’t motivated inside big companies like Allstate?

Andrew McAfee March 4, 2012 at 1:26 pm

It’s a good question. Successful data scientists are interviewed at the Kaggle blog, and they often talk about how motivating it was to be in competition – to see someone else doing better than them, then try to catch and surpass them. 
And when I read about the methods they’re employing, it’s often like they’re speaking a foreign language. The tools and approaches are pretty new, and pretty exciting.

Andrew McAfee March 4, 2012 at 1:27 pm

Which is why I stressed in my post that one of the pillars of big data is “Advances in the discipline of machine learning, which is just about what it sounds like. We’ve gotten a lot smarter about teaching machines how to get smarter at tasks ranging from driving cars to playing Jeopardy! to detecting fraud to predicting bodily injury insurance claims.”

Andrew McAfee March 4, 2012 at 1:30 pm

The tools for dealing with data silos are getting better – this is what Hadoop and Cloudera are all about — but org design challenges remain, and are serious. Tom Davenport is investigating this issue, and I look forward to hearing what he has to say about it.
