2/24/2010

The Data is Shit

After many months of working frenetically to get the data in front of the client, that was their response.

“The data is shit.” – (unhappy) (I mean, *really unhappy*) client

Or so the story goes. Fortunately, I was not invited to that particular client meeting. I don’t know if the client actually used those words, but I wouldn’t be surprised if they did because it was a pretty concise description of the data we had been working with for months.

We had politely been referring to the herculean effort as one of (unprecedented) data “harmonization,” so forgive me if I chuckle every time I hear the term “data harmonization”. Seeing as how I played a leading role in this data harmonization effort, you might be surprised that a) I find the client response amusing and b) I would publicly fess up to coordinating this effort.

The Client


I was recruited to work on this project because of my expertise in data management *and* because I had experience working with a (successful) data archiving effort for this particular client. I hope I don’t get in trouble for this, but in the age of “government 2.0” and transparency in government, here goes. The client was SAMHSA, the Substance Abuse and Mental Health Services Administration at the US Department of Health and Human Services.

I had previously been working at the official SAMHSA data archive (SAMHDA), housed at the Inter-university Consortium for Political and Social Research (ICPSR), part of the University of Michigan’s Institute for Social Research, so I was quickly snatched up by an HIT (health information technology) ‘beltway bandit’ company at the end of 2006 when they realized they were in over their heads on a 5-year $25M project for SAMHSA. The project involved standardizing data from disparate sources for their Data Coordination and Consolidation Center (DCCC).

The Data


There was nothing inherently wrong with the data. Really, it was a matter of expectations. Not many people understand data and what it can and cannot do for you. We were given the challenge of consolidating data from significantly different data streams. This is not impossible, but it is a delicate effort. Arguably, the more you work with data in this way, the less it will do for you. Even ‘good data’ has its limits.

It’s All about Context


Part of the reason I didn’t take this whole abysmal failure thing too seriously is that I could see it coming months ahead of time, and it was kind of a relief to get it over with. Beyond the ordinary challenges of working with data and managing client expectations of said data was the larger context of the project. It was one of those huge government contracts that didn’t, um, how should I say this? It didn’t really make sense.

The original contract called for this whole data harmonization effort as well as the creation of a data analysis tool to view and analyze the data. Sounds simple enough, right? Simple, yes, but also expensive and unnecessary given that such tools already exist. Apparently the contract was set up in a way that incentivized creating a whole new data analysis system from scratch. And so the HIT company did exactly that. It spent two years and a few million bucks reinventing the wheel. While this expensive and unnecessary toy was being developed, I worked with a team of SAS programmers to harmonize the data to put into the shiny new object.

Should anyone have been surprised when the client, at least $7M later, looked at the data and said it was shit? They took the luxury car (that they had asked for in the original contract) for a test drive and when it only drove 3 MPH, they were livid. Was the problem that the engine maxed out at 3 MPH or that they accidentally ordered a luxury car when a lawn mower would have been more appropriate?

1 comment:

  1. Totally great post about the annals, but really anals, of data analysis!
