When Can You Trust a Data Scientist?

Pete Warden’s Monkey Cage post, “Why You Should Never Trust a Data Scientist” (original version from his blog), illustrates one of the biggest challenges facing both consumers and practitioners of data science: the issue of accountability. And while I suspect that Warden—a confessed data scientist himself—was being hyperbolic when choosing the title for his post, I worry that some readers may well take it at face value. So for those who are worried that they really can’t trust a data scientist, I’d like to offer a few reassurances and suggestions.

Data science (sometimes referred to as “data mining”, “big data”, “machine learning”, or “analytics”) has long been subject to criticism from more traditional researchers. Some of these critiques are justified, others less so, but in reality data science has the same baby/bathwater issues as any other approach to research. Its tools can provide tremendous value, but we also need to accept their limitations. Those limitations are far too extensive to get into here, and that’s indicative of the real problem Warden identified: as a data scientist, nobody checks your work, mostly because few of your consumers even understand it.

As a political scientist by training, I found this a strange thing to accept when I left the ivory tower (or its Southern equivalent, anyway) last year to do applied research. A client hires someone like me because I know how to do things they don’t, but that also means that they can’t really tell if I’ve done my job correctly. It’s ultimately a leap of faith: the work we do often looks, as one client put it, like “magic.” But that magic can offer big rewards when done properly, because it can provide insights that simply aren’t available any other way.

So for those who could benefit from such insights, here are a few things to look for when deciding whether to trust a data scientist:

  • Transparency: Beware the “black box” approach to analysis that’s all too common. Good practitioners will share their methodology when they can, explain why when they can’t, and never use the words, “it’s proprietary,” when they really mean, “I don’t know.”
  • Accessibility: The best practitioners are those who help their audience understand what they did and what it means, as much as possible given the audience’s technical sophistication. Not only is it a good sign that they understand what they’re doing, it will also help you make the most of what they provide.
  • Rigor: There are always multiple ways to analyze a “big data” problem, so a good practitioner will try different approaches in the course of a project. This is especially important when using methods that can be opaque, since it’s harder to spot problems along the way.
  • Humility: Find someone who will tell you what they don’t know, not just what they do.

These are, of course, fundamental characteristics of good research in any field, and that’s exactly my point. Data science is to data as political science is to politics, in that the approach to research matters as much as the raw material. Identifying meaningful patterns in large datasets is a science, and so my best advice is to find someone who treats it that way.

On the role of gender in vote choice models

Josh Tucker poses an intriguing question over on the Monkey Cage blog, about why we persist in including gender in models of vote choice. I posted my thoughts as a comment on that page, but decided to repost them here as well (given that they exceeded the length of the original post) to continue the conversation:

Like other demographic variables (race, religion, income, etc.), gender can serve as a useful proxy for unobserved issue preferences. Even in the US, where there’s no women’s party (or black party, Christian party, or workers’ party, at least not officially), there are certainly issues on which the parties and their candidates differ, where the cleavages at least partially split along gender lines.

For example, in some data I ran from the 2004 NAES, women:

  • supported the assault weapons ban at a rate 10% higher than men,
  • supported increasing the minimum wage at a rate 11% higher, and
  • supported making health insurance more available to children and workers at rates of 7 and 11% higher (respectively).

While in the NAES we have this data directly measured (of course) and could thus use it on its own in vote models, in many surveys we don’t. Or, when we do, we have such fine-grained measures that aggregation is problematic. In either case, gender serves to capture some of this variation, so it’s useful for keeping vote models simple but meaningful.
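To make the proxy argument concrete, here is a minimal sketch on purely synthetic data (not the NAES). Every number in it is an assumption chosen for illustration: women support a hypothetical policy at a higher rate than men, and vote choice depends only on that issue position, never on gender directly. The function names `simulate` and `dem_share` are made up for this example.

```python
import random


def simulate(n=200_000, seed=0):
    """Generate synthetic voters as (woman, supports, votes_dem) tuples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        woman = rng.random() < 0.5
        # Assumption: women support the policy more often (0.60 vs 0.50).
        supports = rng.random() < (0.60 if woman else 0.50)
        # Assumption: the vote depends only on the issue position, not gender.
        votes_dem = rng.random() < (0.70 if supports else 0.30)
        rows.append((woman, supports, votes_dem))
    return rows


def dem_share(rows, woman, supports=None):
    """Democratic vote share for one gender, optionally within one issue position."""
    votes = [v for w, s, v in rows
             if w == woman and (supports is None or s == supports)]
    return sum(votes) / len(votes)


rows = simulate()

# Gender gap when the issue preference is unobserved.
raw_gap = dem_share(rows, True) - dem_share(rows, False)

# Gender gap once we compare only people with the same issue position.
gap_supporters = dem_share(rows, True, True) - dem_share(rows, False, True)
gap_opponents = dem_share(rows, True, False) - dem_share(rows, False, False)

print(f"raw gender gap:          {raw_gap:+.3f}")
print(f"gap among supporters:    {gap_supporters:+.3f}")
print(f"gap among opponents:     {gap_opponents:+.3f}")
```

In this toy setup, gender shows a gap in vote choice only because it stands in for the issue position; once the issue is held fixed, the gap should shrink to roughly zero. Real survey data won't be this clean, of course, since gender may also carry effects that no measured issue captures.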

Of course, there’s also the path dependency side of the equation; because everyone uses gender, it’s much easier to include it than exclude it. That’s a problem, of course, when we do have preference data, because the correlations between gender and the issue preferences often drown out the significance of the issue preferences in regressions.

(Maybe that’s why the issue voting lit is still fairly primitive—social pressure to include demographic confounders makes the burden of proof that issues matter so much higher? Interesting topic for another time.)

Not sure what the status of the lit is on this question, but I imagine somebody’s tackled it before, seeing whether gender matters when everything else is controlled for as well. For what it’s worth, in my most recent paper (presented at APSA, on campaign effects), I found gender to be highly significant for predicting 2004 presidential vote choice, even after accounting for partisanship, ideology, issue salience, and aggregated issue preferences. Didn’t run it with each issue separately, however, so that might have changed the results.

Now back to dissertation writing.