Slides from my MPSA 2014 presentation on forecasting turnout

For those of you who missed it, here are the slides from my talk at last weekend’s MPSA conference:

Past + Present = Future: A New Approach to Predicting Voter Turnout

The example shown is more of a toy model than one I’d actually put into practice, but it should give a general sense of the concepts behind the framework I’m proposing. As always, feel free to get in touch with questions.

New Article in Political Behavior, with Replication Files

My article with Josh Tucker and Ted Brader, “Cross-Pressure Scores: An Individual-Level Measure of Cumulative Partisan Pressures Arising from Social Group Memberships”, has just been printed in Political Behavior. I also created a GitHub repo with replication code and data to reproduce the results in the paper’s figures and tables. (You can also download a zip file containing all of the same files.) Assembling those files meant piecing together code from a variety of analyses conducted since we started the project in 2008, so please let me know if any of the files don’t work or produce results that don’t line up with those in the paper.

One of the biggest challenges in putting these files together was condensing code written over a long period of time and by multiple authors (I wrote the original code for our analyses, but Josh and Ted made their own modifications as well) into something that makes it straightforward for others to reproduce the results in our paper. I also had to strip out plenty of extraneous material to avoid confusion, and I wouldn’t be surprised if there were a few bugs in this initial release that I’ll need to remedy (a major reason why I’m hosting the files on GitHub instead of a static site). Gathering replication files was a very informative exercise, and I’d highly recommend that others do it with their own work, despite the frustration that can be involved. Aside from the logistics of gathering the right code from among the many versions we produced along the way, there were also several changes I would have made if I were conducting the analysis now. I had barely finished my second year of grad school when we started, and my programming skills have grown immensely since then, so some of the steps I took in that paper are things I wouldn’t do now for a project like this.

Most galling to me is the randomness in the hotdecking procedure I used for imputing lightly missing data. The basic way I handle missing data in this paper is to hotdeck variables with few missing values, so that these now-complete variables can then be used as predictors when multiply imputing other variables with more serious missingness. (If the latter variables were continuous, I could of course just impute them all simultaneously, but nearly all of the variables used in these analyses were categorical and thus required complete data for modeling.) I still think that’s a sound practical approach to dealing with missing data—though not as common in political science as elsewhere, hotdecking is a popular approach in survey research more generally and has attractive statistical properties—but this particular implementation is problematic because of the use of the hotdeckvars package in Stata. As far as I can tell (and I invite readers to correct me if I’ve overlooked something), the “set seed” command in Stata doesn’t affect the randomization in the hotdecking process, so the results aren’t consistent from one run of the code to the next (as they would be with built-in methods that respect the seed).

This isn’t a major issue, given that the variation in the final results is negligible—if it weren’t, that would mean our whole imputation strategy is flawed—and I still use that package on occasion when doing quick, one-off analyses. But it’s a real annoyance when prepping code for replication purposes, since it means that others can’t exactly reproduce the results in the paper. If I were to do this analysis again now, I’d write my own code for the hotdecking process, so that each run produces identical results given the same seed. So, lesson learned!
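Just to illustrate what I have in mind, here’s a minimal sketch of a seeded random-donor hotdeck, written in Python/pandas rather than Stata (the hotdeck function name and the union_member variable are made up for the example, not taken from our replication files). A real implementation would typically draw donors within cells defined by covariates rather than from the full sample, but the reproducibility point carries over either way:

```python
import numpy as np
import pandas as pd

def hotdeck(series, seed):
    """Fill missing values by drawing donors at random from the observed values."""
    rng = np.random.default_rng(seed)  # seeded generator, so every run is identical
    out = series.copy()
    missing = out.isna()
    donors = out[~missing].to_numpy()
    out[missing] = rng.choice(donors, size=int(missing.sum()), replace=True)
    return out

# Toy data: a categorical variable with a couple of missing values.
df = pd.DataFrame({"union_member": ["yes", "no", None, "no", None, "yes"]})

imputed_a = hotdeck(df["union_member"], seed=2008)
imputed_b = hotdeck(df["union_member"], seed=2008)
assert imputed_a.equals(imputed_b)  # same seed, same imputations, every time
```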

(I’ve debated writing my own Stata package to do reproducible hotdecking, but haven’t done so because most of my work these days is in R and Python. But feel free to get in touch if you think it would be useful—if there’s enough demand, it might be worth spending an afternoon on anyway.)

Finally, I should also note that there are other analyses referenced in the article’s text and supplemental online appendices that aren’t included in these files for the sake of brevity. If you’re particularly interested in some of that, though, let me know and I can probably get it for you. We may also be making a few small edits to the supplemental materials, based upon late-stage revisions to the paper itself, so I’ll post separately about that if it happens.

Checking in, Late 2013 Edition

So once again, it’s been a while since I last posted. What have I been up to? Well, to start, this came out in the spring:

And then early September, this happened:

[photos]

And at the end of it, I got these:

[photo: gqr_card]

So I’m now in DC for the foreseeable future, doing very interesting things with very obscene quantities of data. I have a few invited talks and conference presentations coming up, so hopefully sometime soon I’ll be able to share some of those materials on here as well.

When Can You Trust a Data Scientist?

Pete Warden’s Monkey Cage post, “Why You Should Never Trust a Data Scientist” (original version from his blog), illustrates one of the biggest challenges facing both consumers and practitioners of data science: the issue of accountability. And while I suspect that Warden—a confessed data scientist himself—was being hyperbolic when choosing the title for his post, I worry that some readers may well take it at face value. So for those who are worried that they really can’t trust a data scientist, I’d like to offer a few reassurances and suggestions.

Data science (sometimes referred to as “data mining”, “big data”, “machine learning”, or “analytics”) has long been subject to criticism from more traditional researchers. Some of these critiques are justified, others less so, but in reality data science has the same baby/bathwater issues as any other approach to research. Its tools can provide tremendous value, but we also need to accept their limitations. Those limitations are far too extensive to get into here, and that’s indicative of the real problem Warden identified: when you’re a data scientist, nobody checks your work, mostly because few of your consumers even understand it.

As a political scientist by training, I found this a strange thing to accept when I left the ivory tower (or its Southern equivalent, anyway) last year to do applied research. Clients hire someone like me because I know how to do things they don’t, but that also means they can’t really tell whether I’ve done my job correctly. It’s ultimately a leap of faith—the work we do often looks, as one client put it, like “magic.” But that magic can offer big rewards when done properly, because it can provide insights that simply aren’t available any other way.

So for those who could benefit from such insights, here are a few things to look for when deciding whether to trust a data scientist:

  • Transparency: Beware the “black box” approach to analysis that’s all too common. Good practitioners will share their methodology when they can, explain why when they can’t, and never use the words, “it’s proprietary,” when they really mean, “I don’t know.”
  • Accessibility: The best practitioners are those who help their audience understand what they did and what it means, as much as possible given the audience’s technical sophistication. Not only is it a good sign that they understand what they’re doing, it will also help you make the most of what they provide.
  • Rigor: There are always multiple ways to analyze a “big data” problem, so a good practitioner will try different approaches in the course of a project. This is especially important when using methods that can be opaque, since it’s harder to spot problems along the way.
  • Humility: Find someone who will tell you what they don’t know, not just what they do.

These are, of course, fundamental characteristics of good research in any field, and that’s exactly my point. Data science is to data as political science is to politics, in that the approach to research matters as much as the raw material. Identifying meaningful patterns in large datasets is a science, and so my best advice is to find someone who treats it that way.