Checking in, Late 2013 Edition

So once again, it’s been a while since I last posted. What have I been up to? Well, to start, this came out in the spring:

And then in early September, this happened:

And at the end of it, I got these:


So I’m now in DC for the foreseeable future, doing very interesting things with very obscene quantities of data. I have a few invited talks and conference presentations coming up, so hopefully sometime soon I’ll be able to share some of those materials on here as well.

When Can You Trust a Data Scientist?

Pete Warden’s Monkey Cage post, “Why You Should Never Trust a Data Scientist” (original version from his blog), illustrates one of the biggest challenges facing both consumers and practitioners of data science: the issue of accountability. And while I suspect that Warden—a confessed data scientist himself—was being hyperbolic when choosing the title for his post, I worry that some readers may well take it at face value. So for those who are worried that they really can’t trust a data scientist, I’d like to offer a few reassurances and suggestions.

Data science (sometimes referred to as “data mining”, “big data”, “machine learning”, or “analytics”) has long been subject to criticism from more traditional researchers. Some of these critiques are justified, others less so, but in reality data science has the same baby/bathwater issues as any other approach to research. Its tools can provide tremendous value, but we also need to accept their limitations. Those limitations are far too extensive to get into here, and that’s indicative of the real problem Warden identified: as a data scientist, nobody checks your work, mostly because few of your consumers even understand it.

As a political scientist by training, I found this a strange thing to accept when I left the ivory tower (or its Southern equivalent, anyway) last year to do applied research. A client hires someone like me because I know how to do things they don’t, but that also means that they can’t really tell if I’ve done my job correctly. It’s ultimately a leap of faith: the work we do often looks, as one client put it, like “magic.” But that magic can offer big rewards when done properly, because it can provide insights that simply aren’t available any other way.

So for those who could benefit from such insights, here are a few things to look for when deciding whether to trust a data scientist:

  • Transparency: Beware the “black box” approach to analysis that’s all too common. Good practitioners will share their methodology when they can, explain why when they can’t, and never use the words, “it’s proprietary,” when they really mean, “I don’t know.”
  • Accessibility: The best practitioners are those who help their audience understand what they did and what it means, as much as possible given the audience’s technical sophistication. Not only is it a good sign that they understand what they’re doing, it will also help you make the most of what they provide.
  • Rigor: There are always multiple ways to analyze a “big data” problem, so a good practitioner will try different approaches in the course of a project. This is especially important when using methods that can be opaque, since it’s harder to spot problems along the way.
  • Humility: Find someone who will tell you what they don’t know, not just what they do.

These are, of course, fundamental characteristics of good research in any field, and that’s exactly my point. Data science is to data as political science is to politics, in that the approach to research matters as much as the raw material. Identifying meaningful patterns in large datasets is a science, and so my best advice is to find someone who treats it that way.

The Fundamentals of the New Hampshire Primary

Last night’s caucuses in Iowa apparently succeeded in thinning out the GOP herd: Bachmann’s out, Perry’s returning to Texas to “reflect” for a while, and Jon Huntsman was less than a hundred votes away from officially receiving 0%.* It seems we’ve finally reached the point where the race can be discussed in terms of more than just Ron Paul’s eyebrows, Rick Perry’s arsenal, or Michele Bachmann’s steadfast opposition to the concept of “facts”. And so in honor of the occasion, I present the first installment of my commentary on the 2012 elections.

Today, I’m in a Granite State of mind.** I’ve always liked New Hampshire. When I was growing up just over the MA border, it was where everyone went for tax-free shopping, illicit fireworks, socialized liquor distribution, better skiing, and vacations when you didn’t have the money to buy plane tickets. I still have many friends there and get back as often as I can, and even wonder whether this will be the year I finally make it up Mt. Washington.***

So it’s been something of a peeve of mine that in nearly all the coverage of the NH primary (both this year and in previous cycles), commentators have treated the state as either (a) a quirky, unrepresentative backwater with inflated electoral power, or (b) a mystical land of cold-tempered sages who control our political destinies. While I hate to be a buzzkill, the reality is that New Hampshire is just a regular state, with regular people, which has traditionally held the first primary and which continues to do so because nobody’s found a politically feasible way to change that yet.

To understand what the NH primary actually means, it’s important to take a more precise look at what distinguishes New Hampshire relative to the rest of the country. Two main idiosyncrasies tend to come up with regard to this year’s primary. First, because of New Hampshire’s size, many discussions are premised upon the idea that it isn’t representative of the broader electorate. And second, Romney is widely expected to win in New Hampshire because it’s in the former MA governor’s “backyard.” To see whether those ideas hold water, I decided to turn to something rarely found in recent coverage: actual data.

[Figure: data from the 2008 and 2012 surveys]

Because of the wide variation in survey methodology across polls (which makes it hard to compare polls’ results to one another or to national-level data) and the scarcity of raw data, I decided to turn to the 2008 National Annenberg Election Survey. While the NAES’s state-level sample size is smaller than I’d like, it provides the ability to compare responses across groups and states beyond what’s possible with individual primary election polls. And just as importantly, the NAES contains a wide array of variables providing highly detailed responses, so the resulting data can be directly compared to data from other sources, in this case the results of the Iowa caucus entrance polls. (While using 2008 data in the context of the 2012 race naturally introduces some degree of error, in this context it’s unlikely to be a major concern: my interest is in the prevalence of certain “fundamental” variables that are highly stable from year to year.)

I’ve tallied the results of questions on the NAES which correspond with those found on the Iowa entrance polls, and present results for all respondents in (a) New Hampshire and (b) the US as a whole, then provide the same results for (c) GOP primary voters nationwide in 2008 and (d) Iowa caucusgoers in 2012. First, some observations about the uniqueness of New Hampshire’s residents:

  • In terms of race and ethnicity, NH is much more representative of the broader electorate than one might think. Its residents are indeed much more likely to be white, non-Hispanic, and native-born than those of the rest of the country, but in the context of analyzing the GOP primary, that hardly makes a difference: all three groups are similarly overrepresented among 2008 primary voters and made up the vast majority of this year’s caucusgoers as well.
  • With regard to age, education, and income, NH differs only modestly from national averages.
  • The biggest difference in terms of demographics is the low prevalence of evangelicals in NH, at less than half the national average.
  • Politically, NH residents are far more likely than residents of other states to call themselves independent, but this matters more in terms of identity than ideology: the number of self-identified conservatives is only slightly below the national average, and the proportion of those who call themselves “very conservative” is similar.
  • The much higher rate of independent identification does not mean NH is full of moderates—breaking the data down further (not shown), the percentage of respondents describing themselves as “moderate” is less than 4 points higher in NH than elsewhere. While there may be some differences in ideology, they’re not nearly as extreme as the party identification rates might suggest.
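
For readers curious how tallies like these are produced, here is a minimal sketch in Python of computing the percentage of each answer within each sample. It is not the code behind the numbers above: the function name, the field names, and the toy respondents are all made up for illustration, and it ignores the survey weights that a real analysis of the NAES would likely need.

```python
from collections import Counter


def share_by_group(records, group_key, answer_key):
    """Return {group: {answer: percent}} for a list of respondent dicts."""
    counts = {}
    for row in records:
        # One Counter of answers per group (e.g. per state or sample).
        counts.setdefault(row[group_key], Counter())[row[answer_key]] += 1
    return {
        group: {
            answer: 100.0 * n / sum(tally.values())
            for answer, n in tally.items()
        }
        for group, tally in counts.items()
    }


# Made-up respondents for illustration only; these are NOT NAES data.
toy_respondents = [
    {"sample": "NH", "evangelical": "no"},
    {"sample": "NH", "evangelical": "no"},
    {"sample": "NH", "evangelical": "no"},
    {"sample": "NH", "evangelical": "yes"},
    {"sample": "US", "evangelical": "yes"},
    {"sample": "US", "evangelical": "no"},
]

shares = share_by_group(toy_respondents, "sample", "evangelical")
for sample, answers in sorted(shares.items()):
    print(sample, answers)
```

The same function can be pointed at any pair of columns (age bracket, education, party identification), which is what makes side-by-side comparisons across samples straightforward once the questions are coded consistently.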

So when looking at the GOP primary, New Hampshire isn’t nearly the backwater many imagine it to be. With the (admittedly big) exception of the rates of evangelicalism, New Hampshire actually does a fairly good job of reflecting the broader electorate. Given that the main argument for the value of the NH primary is that New Hampshire’s small size forces candidates to engage in on-the-ground campaigning and allows the voters to learn about the candidates, it’s hard to think of another state so small in both population and geography that would better serve as a stand-in for the rest of the country.

This leads into the second point I brought up above—whether Romney’s presumed strength in New Hampshire is due to proximity. This notion has always seemed suspect to me, since Romney was never that popular in Massachusetts, and there’s nothing in his record that would specifically appeal to the New Hampshire GOP (given that his only notable accomplishment was enacting universal healthcare with an individual mandate). Considering the fundamentals in New Hampshire alongside the cross-tabs of the Iowa entrance poll, Romney’s supposed advantage would be more plausibly attributed to the characteristics of NH voters, especially in contrast to Iowa’s caucusgoers:

  • In terms of race, ethnicity, and age, there’s little reason to suspect that these will have much impact on the outcome.
  • But Romney does stand to do well in NH due to a greater number of college graduates and high-income households in NH than in the rest of the country, two groups that favored Romney in Iowa.****
  • With regard to religion, NH has very few evangelicals, while Iowa is at the opposite end of the spectrum (57% of caucusgoers). Given that non-evangelicals supported Romney at nearly three times the rate of evangelicals, this is a huge advantage for him in NH.
  • Nearly half of Iowa’s caucusgoers identified as “very conservative”, but that pattern almost certainly won’t repeat in NH. Given that Romney does much better with moderate and slightly conservative voters, Romney is again in a stronger position in NH than he was in Iowa, and to a lesser extent should do better in NH than in the rest of the country as well for the same reason.
  • Though Ron Paul triumphed among independent caucusgoers in Iowa, Romney came in second with independents. Barring an unlikely Ron Paul landslide, the high number of independents in NH should also give Romney an edge over his opponents.
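
The logic running through these points is a simple composition argument: if a candidate's support differs across groups, changing the mix of groups changes his overall share even if nobody changes their mind. A rough sketch of that arithmetic in Python follows. Only the 57% evangelical share among Iowa caucusgoers comes from the entrance poll cited above; the two support rates (chosen so that one is three times the other) and the 20% evangelical share for the second electorate are hypothetical placeholders, not poll results.

```python
def expected_support(evangelical_share, support_evangelical, support_other):
    """Overall support as a share-weighted average of two groups' support rates."""
    if not 0.0 <= evangelical_share <= 1.0:
        raise ValueError("evangelical_share must be between 0 and 1")
    return (evangelical_share * support_evangelical
            + (1.0 - evangelical_share) * support_other)


# HYPOTHETICAL support rates with a 3:1 ratio; not taken from any poll.
SUPPORT_EVANGELICAL = 0.12
SUPPORT_OTHER = 0.36

# 57% evangelical is the Iowa entrance-poll figure mentioned in the post.
iowa_like = expected_support(0.57, SUPPORT_EVANGELICAL, SUPPORT_OTHER)

# 20% evangelical is an ASSUMED share for a less evangelical electorate.
low_evangelical = expected_support(0.20, SUPPORT_EVANGELICAL, SUPPORT_OTHER)

print(f"Iowa-like electorate:       {iowa_like:.1%}")
print(f"Low-evangelical electorate: {low_evangelical:.1%}")
```

With these placeholder inputs, the same candidate goes from roughly 22% to roughly 31% purely because of who shows up. The real gap depends on the actual support rates and on how the other factors above (education, income, ideology) overlap with religion.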

So with all of those factors considered, a strong showing by Romney in New Hampshire would be likely no matter where he came from. At best, he might get a bit of a boost from increased name recognition, but those benefits would be minimal given his prominence in the campaign and his 2008 run—all else equal, by now he would be very well known in New Hampshire whether he had been governor of Massachusetts or Montana. But even if he were helped to some degree by familiarity, it would not do him much good if he weren’t already a match for the state’s electorate. After all, as he’s become more known throughout the country during this campaign, it hasn’t appeared to help him at all with voters; as often as not, those who find out who he is learn that they’d rather have someone else.

Finally, what does this mean for the other candidates? For many of the same reasons Romney does well in New Hampshire, Santorum should suffer relative to his Iowa performance, though such a decline could be somewhat offset if he gets most of Perry’s and Bachmann’s supporters (who appear most similar to his own in terms of fundamentals), and maybe a few of Gingrich’s as well. The wildcard here is Ron Paul: given the independent nature of NH voters, he should conceivably do better there than in most other places, but his eccentricities (both personal and political) suggest a ceiling on his potential support. He received 8% of the NH vote in 2008 and stands to improve upon that next week, but it’s hard to see him becoming truly competitive on a national scale; South Carolina won’t be nearly so welcoming as New Hampshire.

Notes: