Elevation and Suicide

The other day, I got into a somewhat protracted battle on Facebook with a fellow whom I charitably describe as “absolutely fine with confirmation bias.” The nature of that conversation inspired me to look a bit deeper into this story when someone else posted it as a link in another subthread of the same conversation. You don’t have to read it. The gist of it is this: there’s a Utah neurologist who thinks that living at altitude affects the mood-altering/controlling neurotransmitters serotonin and dopamine, and that this is why Utah (and the other mountain west states) have such high rates of suicide. How good is this theory?

There is no question that these states have high rates of suicide. Pictured to the left is 2012 data from the CDC. You can see that the “mountain west” states are 7 of the top 10, and all mountain west states are in the top 12. If you assume that Oregon’s and Alaska’s high rates are connected to those who live in or near the mountains those states have, then altitude explains 11 of the top 12 highest rates.

It’s starting off as a pretty strong case. So I set off in search of some data. I’ll skip what I found first. I ultimately decided on getting county-level suicide data from here, and county-level elevation data from here.

So far, so good. Both datasets required a bit of cleaning (I used R), and then I joined the two datasets using a combined State FIPS + County FIPS variable. I ran a simple linear regression using “Average Elevation of the County” as the input and “Crude Suicide Rates per 100,000” as the output. And here it is.

[Figure: scatterplot of suicide rate against elevation]
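The join and the regression described above were done in R; what follows is only a minimal Python sketch of the same idea, using the standard library. The rows, the field names (`state_fips`, `county_fips`, `crude_rate`, `mean_elev_ft`) and the helper functions are all hypothetical, and the numbers are invented for illustration. They are not the CDC or elevation data.

```python
# Toy rows, invented for illustration only. Field names are assumptions.
suicide_rows = [
    {"state_fips": 49, "county_fips": 35, "crude_rate": 20.0},
    {"state_fips": 6, "county_fips": 37, "crude_rate": 8.0},
    {"state_fips": 8, "county_fips": 31, "crude_rate": 16.0},
]
elevation_rows = [
    {"state_fips": 6, "county_fips": 37, "mean_elev_ft": 1000.0},
    {"state_fips": 8, "county_fips": 31, "mean_elev_ft": 5000.0},
    {"state_fips": 49, "county_fips": 35, "mean_elev_ft": 7000.0},
    {"state_fips": 1, "county_fips": 1, "mean_elev_ft": 300.0},
]


def fips_key(row):
    """Zero-pad state (2 digits) and county (3 digits) into one 5-character key."""
    return f"{int(row['state_fips']):02d}{int(row['county_fips']):03d}"


def inner_join(left, right):
    """Keep only counties present in both tables, matched on the combined key."""
    right_by_key = {fips_key(r): r for r in right}
    joined = []
    for row in left:
        key = fips_key(row)
        if key in right_by_key:
            joined.append({"fips": key, **row, **right_by_key[key]})
    return joined


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


merged = inner_join(suicide_rows, elevation_rows)
intercept, slope = fit_line(
    [m["mean_elev_ft"] for m in merged],
    [m["crude_rate"] for m in merged],
)
print(f"rate = {intercept:.2f} + {slope:.4f} * elevation_ft")
```

The zero-padding matters: if the state and county codes are stored as integers, concatenating them without padding can make two different counties collide on the same key. Note also that an inner join silently drops any county that appears in only one table, which is relevant to the missing-data problem discussed later.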

And the theory looks pretty good. There seems to be a strong correlation between elevation and suicide. In fact, this model shows this correlation is statistically significant at p<.001. And I can say I used two good, trustworthy data sources: the CDC for mortality data and USGS for elevation data. I didn’t do anything sneaky statistics-wise. The Q-Q plot of the residuals looks good and confirms that no major assumptions of the linear model were violated. The only thing you might want for a more robust finding is to control for other variables (like poverty).

And that’s why I thought this was worthy of a post. It isn’t that this data analysis is “bad” per se, but it’s woefully incomplete. Stopping here would be a bad thing to do, and not just because I didn’t control for other variables (which are likely more important than elevation; R-squared for this model is only 0.1663).
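To put that R-squared in perspective: in a regression with a single predictor, R-squared is just the squared Pearson correlation, so 0.1663 implies a correlation of about 0.41 and leaves roughly 83% of the county-to-county variation unexplained by elevation. Here is a small Python sketch of that relationship. The toy points are made up and the function name is my own; only the 0.1663 comes from the model above.

```python
import math


def r_squared(xs, ys):
    """R-squared of a one-predictor linear fit (equals Pearson r squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Made-up points, only to show the function running.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 1.0, 4.0, 3.0]
toy_r2 = r_squared(toy_x, toy_y)

# The value reported for the elevation model in this post.
reported_r2 = 0.1663
implied_r = math.sqrt(reported_r2)      # size of the correlation
unexplained_share = 1.0 - reported_r2   # variance not accounted for by elevation

print(f"toy R^2 = {toy_r2:.2f}")
print(f"implied |r| = {implied_r:.2f}, unexplained = {unexplained_share:.0%}")
```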

No, the real problem with this data shows up in the data I found first, from the WISQARS interactive database the CDC recently launched. On there I was able to make a query and generate a map and, theoretically, download the data that generated the map. But alas, this functionality seems to be broken at the moment. So if I wanted to run my own regression (and I did), I had to go get my data elsewhere. Here is the map I generated.

[Figure: WISQARS county-level map]

Notice that big white band running north-south in the middle of the country? And all those giant white islands in the sea of brown further west? Those are really important. That’s missing data. And that data is “missing” because the CDC considered it “unreliable” and “suppressed” it. That data is “unreliable” primarily because those counties are either really sparsely populated (so a single suicide would generate an incredibly high rate per 100,000) or had too few suicides in the five-year time span to calculate a genuine rate.
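The arithmetic behind that instability is easy to check. Below is a minimal Python sketch: the populations and death counts are invented, and the `min_deaths` cutoff is a hypothetical stand-in for whatever rule the CDC actually applies, not its real threshold.

```python
def crude_rate(deaths, population):
    """Crude rate per 100,000 residents."""
    return deaths * 100_000 / population


def reportable(deaths, min_deaths=20):
    """Hypothetical suppression rule: too few deaths means no published rate.

    The default cutoff is an assumption for illustration only.
    """
    return deaths >= min_deaths


# Invented counties: (name, deaths over the period, population).
examples = [
    ("tiny county", 1, 800),
    ("large county", 125, 1_000_000),
]

for name, deaths, population in examples:
    rate = crude_rate(deaths, population)
    status = "reported" if reportable(deaths) else "suppressed"
    print(f"{name}: {rate:.1f} per 100,000 ({status})")
```

One death in the tiny county produces a rate ten times that of the large county, and one fewer death would drop it to zero. A rate that swings that much on a single event says very little about underlying risk, which is the case for suppressing it.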

That’s really important because that means that the missing data is not random and the reason it’s missing is directly related to the hypothesis under investigation.

What that means is, for the most part, these counties had very few suicides. And (and this is important) most of those counties exist at elevations higher than the 0ft – 1000ft area where the suicides cluster on the left-hand side of the scatterplot.
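One quick way to check whether missingness lines up with elevation is to split counties by whether they have a rate and compare the elevations of the two groups. This is a Python sketch on invented numbers (not the real county data), with `None` standing in for a suppressed rate.

```python
from statistics import median

# Invented (elevation_ft, crude_rate) pairs; None marks a suppressed rate.
counties = [
    (200, 10.0), (500, 11.0), (900, 9.0), (4000, 18.0), (6000, 22.0),
    (3000, None), (4500, None), (5500, None), (7000, None), (6500, None),
]

reported_elev = [elev for elev, rate in counties if rate is not None]
suppressed_elev = [elev for elev, rate in counties if rate is None]

median_reported = median(reported_elev)
median_suppressed = median(suppressed_elev)

print(f"median elevation, reported:   {median_reported} ft")
print(f"median elevation, suppressed: {median_suppressed} ft")
```

If the two medians were similar, the suppression would be less of a worry for this particular hypothesis. A large gap means the regression only ever saw a skewed slice of the high-elevation counties.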

When I got rid of counties with no data, I went from the 3,000+ counties the US has down to 483, a loss of around 85% of total counties. In the remaining dataset the lowest suicide rate was 4.7 suicides per 100,000 people. So I ran a simulation where all missing rates were replaced with this number. And what happened? The correlation vanished.
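The shape of that check looks something like the following Python sketch: compute the correlation on the counties that have a rate, fill every missing rate with the lowest observed rate, and compute it again. The data here is invented and far smaller than the real thing, and the function names are mine, so it shows the mechanics only and does not reproduce my result.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def fill_with_minimum(rates):
    """Replace each missing rate (None) with the lowest observed rate."""
    floor = min(r for r in rates if r is not None)
    return [floor if r is None else r for r in rates]


# Invented toy data; None marks a suppressed rate.
elev = [200, 500, 900, 4000, 6000, 3000, 4500, 5500, 7000, 6500]
rate = [10.0, 11.0, 9.0, 18.0, 22.0, None, None, None, None, None]

# Correlation using only the counties that have a published rate.
observed = [(e, r) for e, r in zip(elev, rate) if r is not None]
r_observed = pearson([e for e, _ in observed], [r for _, r in observed])

# Correlation after filling the gaps with the lowest observed rate.
filled = fill_with_minimum(rate)
r_filled = pearson(elev, filled)

print(f"complete cases only: r = {r_observed:.2f}")
print(f"after min-fill:      r = {r_filled:.2f}")
```

Filling with a single constant is a blunt instrument. It is a sensitivity check on how much the conclusion depends on the dropped counties, not an estimate of what their rates really are.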

Now this is not an entirely fair way to test this data. But it’s not entirely unfair either: 4.7 suicides per 100,000 is a pretty low rate, but it’s also a much higher rate than any of these counties actually experience (which is why the data is missing). Consistent variations within these ~2500 counties might still lead to a detectable correlation with altitude. But I doubt it.

And I doubt it because what appears to a Utah-based neurologist to be an issue with elevation is probably much more strongly correlated with other features that also correlate with elevation: poverty, rurality, machismo, gun culture, high levels of drug and alcohol abuse. All of these things, in turn, correlate with suicide in general and can help explain the rise in rates that these states have seen in the last few years. That is, elevation may affect dopamine and serotonin levels, but it was doing that in 2005, 2000, 1995, … and on and on. So we can’t use elevation to explain the rise in rates even if it helped explain the high base rate (which it probably doesn’t).