04 September 2007

Beating the Fluke Effect

Anyone who regularly deals with statistics is aware of the fluke effect, which is really just the flip side of the law of large numbers: the smaller the absolute number of data points behind a category in a data set, the more likely that category's figure is to be unusually high or low.

For example, in the world of crime statistics, local murder rates vary more from year to year than any other measure of crime. Murders are rare enough that a handful of additional incidents can have a dramatic effect on a community's overall rate.
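To make the effect concrete, here is a minimal simulation sketch (mine, not part of the original argument): two places with the same underlying murder rate, one small and one large, and the year-to-year swings in their observed rates. The populations, the 5-per-100,000 rate, and the Poisson model are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    TRUE_RATE = 5 / 100_000   # assumed underlying rate: 5 murders per 100,000 people
    YEARS = 20

    def yearly_rates(population):
        """Simulated murders per 100,000 for each year; counts drawn as Poisson."""
        counts = rng.poisson(lam=TRUE_RATE * population, size=YEARS)
        return counts / population * 100_000

    # Same true rate, wildly different year-to-year stability.
    print("town of 10,000:   ", np.round(yearly_rates(10_000), 1))
    print("city of 1,000,000:", np.round(yearly_rates(1_000_000), 1))

The small town's observed rate lurches between zero and multiples of the true rate, while the big city's barely moves.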

It is, of course, possible to quantify the variability associated with a small sample. Margin of error calculations are commonplace, and you can find calculators on the web not only for the "stupid margin of error" that assumes an infinitely large total population and a 50% result, but also for more sophisticated margin of error calculations.
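Both flavors are only a few lines of arithmetic. The sketch below is mine, not the post's: a worst-case margin of error that assumes a 50% result and an effectively infinite population, and a more careful version that uses the observed proportion plus an optional finite-population correction. The sample sizes in the example calls are made up.

    import math

    Z_95 = 1.96  # z-score for a two-sided 95% confidence level

    def crude_moe(n):
        """Worst-case margin of error: assumes p = 0.5 and an infinite population."""
        return Z_95 * math.sqrt(0.25 / n)

    def refined_moe(n, p, population=None):
        """Margin of error using the observed proportion p, with an optional
        finite-population correction when the sample is a large share of the whole."""
        moe = Z_95 * math.sqrt(p * (1 - p) / n)
        if population is not None:
            moe *= math.sqrt((population - n) / (population - 1))
        return moe

    print(f"crude, n = 1,000:             +/- {crude_moe(1000):.1%}")
    print(f"refined, n = 1,000, p = 0.10: +/- {refined_moe(1000, 0.10):.1%}")
    print(f"...with population = 5,000:   +/- {refined_moe(1000, 0.10, population=5000):.1%}")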

The trouble is that this variability is rarely reflected in the presentation of data. When you rank 200 cities by murder rate, for example, you almost always order them by the actual murder rate, relegating the margin of error to a footnote that sophisticated analysts intuitively account for but that neophytes simply ignore, because they don't know how to use it. The only time the margin of error usually enters lay analysis of statistics is when someone retorts that a survey result (say, Candidate A versus Candidate B) is meaningless because the two figures are within the margin of error of each other (which, strictly speaking, isn't the right test, since the margin of error for the gap between two results is not the same as the margin for either result alone).

How could a ranked data set be presented in a way that conveys both the actual results and their likely variability? One way would be to rank not on the middle of each confidence interval, but on one of its extremes.

Suppose you want to prepare a ranking of how extreme the high temperatures were in 3,000 U.S. cities this year, expressed as the percentage of days on which temperatures came within two degrees of the all-time high for that city.
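The raw statistic itself is straightforward to compute. The little sketch below is purely illustrative: the function name, the toy temperatures, and the 108-degree record are invented, with only the two-degree threshold taken from the example above.

    def near_record_share(daily_highs, all_time_high, threshold=2.0):
        """Fraction of days whose high came within `threshold` degrees of the record."""
        near = sum(1 for t in daily_highs if t >= all_time_high - threshold)
        return near / len(daily_highs)

    # A toy "year" of ten daily highs measured against a 108-degree record.
    print(near_record_share([101, 99, 107, 108.5, 95, 106, 104, 110, 97, 103], 108))  # 0.4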

Traditionally, you would do this by simply calculating that percentage for each city and ranking the results. But this approach disproportionately overrepresents locations with short histories of recorded weather data. Excluding short records would eliminate the flukes, but would also miss trends in newly settled areas. The compromise is to assign each result a margin of error based on its sample size, and then rank on the lower bound of the 95% confidence interval, to be conservative.
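Here is a minimal sketch of that compromise, again with invented numbers: each city's raw percentage gets a normal-approximation margin of error scaled by a sample-size proxy (years of records is one plausible reading, not a prescription), and the cities are then ordered by the lower bound of the resulting 95% interval.

    import math

    Z_95 = 1.96

    def conservative_score(p, n):
        """Lower bound of a normal-approximation 95% confidence interval
        around proportion p, using n as the sample-size proxy; floored at zero."""
        margin = Z_95 * math.sqrt(p * (1 - p) / n)
        return max(0.0, p - margin)

    # (city, share of days near the all-time high this year, years of records)
    cities = [
        ("Coastal city, 400 years of records",    0.08, 400),
        ("Nevada hamlet, 20 years of records",    0.20,  20),
        ("Midwestern city, 150 years of records", 0.10, 150),
    ]

    for name, p, n in sorted(cities, key=lambda c: conservative_score(c[1], c[2]), reverse=True):
        print(f"{name:40s} raw {p:5.1%}   ranks on {conservative_score(p, n):5.1%}")

On these made-up numbers, the hamlet's eye-popping 20% raw figure drops behind both long-record cities once its wide twenty-year interval is applied, which is exactly the muting effect described below.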

For an East Coast or Southwestern city with four hundred years of continuous record keeping, the confidence interval will be small, and the actual rate will closely correspond to the bottom of the interval. For a suburban Nevada hamlet with only twenty years of records, no ordinary year could rank it highly, and even a truly phenomenal year would appear, on a muted basis, somewhere in the middle to upper part of the rankings.

The chart wouldn't be too hard to explain. You could say: this list shows, for each city, the percentage of days that we are 95% sure this year's temperatures came within two degrees of its all-time high as of last year. The presentation would also serve the useful purpose of highlighting the inherent limitations of even the non-fluke data, limitations that would otherwise be buried in minutiae or simply relegated to the reader's general knowledge of statistics and probability.

The result would be to focus readers' attention on the substantive task of parsing the statistically significant data points, rather than spending valuable "first impression" attention on explaining flukes in the data set, which, whatever their curiosity value or technical interest, don't tell the most important story the data have to show.
