20 January 2011

Experts Make Bad Prophets

When based on the same evidence, the predictions of SPRs [statistical prediction rules] are at least as reliable as, and are typically more reliable than, the predictions of human experts for problems of social prediction. . . even when experts are given the results of SPRs, they still can't outperform those SPRs.


From here.

This is something that I've known for a long time, but the post linked is notable because it shows just how widespread and pervasive this tendency is.

Some of the better known examples of statistical prediction rules that are actually used on a regular basis include:

1. Pre-trial services evaluations of criminal defendants for pre-trial release.
2. A formula used by the Colorado Department of Corrections to assign new state prison inmates to the most appropriate security level prison facility.
3. Credit scores used to make loan decisions and to estimate the likelihood that a borrower will go bankrupt.
4. College admissions decisions based on test score and GPA formulas.
5. Test score based standards for deciding whether to accept military recruits into particular services and military occupational specialties.

There is a simple SPR for predicting marital happiness, and I recently heard an NPR report describing an SPR that predicted the likelihood that unmarried couples would stay together based on how similar their speech patterns were.

Uncritical reliance on test scores by low-level bureaucrats produces better results than consideration of those scores as one factor among many by highly qualified experts. Finessing the result with expert analysis undermines the very virtues that make SPRs work.

There are lots of situations where there simply aren't SPRs available, and in those cases experts may be the next best thing. But when you have a regularized situation where there is an advantage to be had from making accurate predictions, you are better off devising an SPR from a good data set than relying on experts to make the calls. And in the era of mass data collection and cheap computing resources to mine it, it is much easier to create a decent SPR than it used to be.
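To make that concrete, here is a minimal sketch in Python of deriving an SPR from historical data: fit a simple model on a training set, check it on held-out cases, and then apply the fitted weights mechanically. The scenario, the predictors, and the coefficients below are all invented for illustration.

```python
# A minimal sketch of deriving an SPR from historical data.
# Everything here is synthetic and invented for illustration; a real
# SPR would start from actual case records with known outcomes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Hypothetical predictors for a pre-trial release decision.
prior_ftas = rng.poisson(0.5, n)          # prior failures to appear
community_ties = rng.normal(0.0, 1.0, n)  # strength-of-ties score
charge_severity = rng.integers(1, 6, n)   # 1 (minor) to 5 (serious)

# Synthetic "truth": the outcome depends on the first two factors only.
logit = -2.0 + 0.9 * prior_ftas - 0.7 * community_ties
failed_to_appear = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([prior_ftas, community_ties, charge_severity])
X_train, X_test, y_train, y_test = train_test_split(
    X, failed_to_appear, random_state=0)

# The "SPR" is nothing more than the fitted weights, applied mechanically.
spr = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", spr.score(X_test, y_test))
print("fitted weights:   ", spr.coef_[0])
```

Notice that nothing about this requires expert judgment once the rule is built; the discipline comes from checking the rule against outcomes it has never seen.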

This isn't the only situation where a general cognitive bias toward undue faith in the capacity of experts, or of decision-makers with access to large amounts of data, to make better judgments comes into play. Studies have shown that attorneys systematically overestimate the strength of their cases. And people, like judges and jurors, who are able to watch the demeanor of witnesses testifying on the stand are less effective at distinguishing true from untrue statements than people forced to rely on transcripts of the same testimony.

One of the main things that an SPR does is toss out of the model all of the factors that somebody thought might be relevant but that don't actually have empirical predictive power. But experts and others with a wider fact set are tempted to consider all of the information, even the irrelevant bits, which inserts non-empirically supported biases into the decision-making process.
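A crude way to see this empirically: compare a model's out-of-sample performance with and without a candidate factor. A factor that sounds plausible but carries no signal won't improve held-out error, and that is the empirical license to toss it. A sketch with synthetic data:

```python
# Sketch: an irrelevant-but-plausible factor adds nothing out of sample.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 2000

signal = rng.normal(size=n)           # genuinely predictive factor
plausible_noise = rng.normal(size=n)  # sounds relevant, carries no signal
outcome = 2.0 * signal + rng.normal(size=n)

with_noise = np.column_stack([signal, plausible_noise])
without_noise = signal.reshape(-1, 1)

# Cross-validated R^2 is essentially identical; the extra factor
# earns no place in the SPR.
print(cross_val_score(LinearRegression(), without_noise, outcome).mean())
print(cross_val_score(LinearRegression(), with_noise, outcome).mean())
```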

Indeed, one way to understand the structure of legal rules in statutes and precedents is as SPRs: they force a highly diverse set of expert judges to consider only a small subset of the facts in a situation, combined according to a particular formula, to get a result, rather than relying on the totality of their life experience in making those kinds of decisions.

Limits of SPRs

This doesn't mean that SPRs are the be-all and end-all solution to all of society's woes.

It is possible to design rules that look like SPRs but lack accurate empirical grounding linking potential causes to potential effects, or that are based on incorrect data or assumptions. Perhaps the classic example is the United States Sentencing Guidelines, which relied on assumptions like the 100:1 crack-to-powder cocaine ratio that were just plain wrong and produced grossly unfair results. Indeed, that example is an apt cautionary tale, because a bad SPR affects large numbers of people in a systemic way, while mistakes by experts tend to be diffuse and to vary in differing directions from the correct prediction.

It is also possible to build an SPR that is the best possible SPR given the data used to build it, but that remains quite inaccurate because there are factors it fails to consider. For example, an SPR designed to predict natural gas usage that fails to consider average monthly temperature as an input is going to be much less accurate than one that does.
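The natural gas example is easy to simulate. The sketch below (synthetic data, invented coefficients) fits the same kind of model with and without a temperature-based input, and the gap in explained variance shows the cost of the omitted factor:

```python
# Sketch: omitting a relevant input (temperature) cripples an otherwise
# well-built SPR for monthly natural gas usage. Data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 240  # twenty years of monthly observations

avg_temp_f = rng.uniform(10, 90, n)  # average monthly temperature
household_size = rng.integers(1, 6, n)

# Invented ground truth: usage is dominated by heating need,
# which is itself a function of temperature.
heating_need = np.maximum(65 - avg_temp_f, 0)
usage = 20 + 2.5 * heating_need + 5 * household_size + rng.normal(0, 10, n)

full = np.column_stack([heating_need, household_size])
no_temp = household_size.reshape(-1, 1)

print("R^2 with temperature:   ",
      LinearRegression().fit(full, usage).score(full, usage))
print("R^2 without temperature:",
      LinearRegression().fit(no_temp, usage).score(no_temp, usage))
```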

One can also build a suboptimal SPR by designing it to predict something other than the thing you actually want to measure. A recent practical example of that kind of problem is the use of law school grades by large law firms to predict success at the firm. Law school grades are an important factor (indeed, a better one than social class or law school prestige), but they turn out to have diminishing returns once all of your applicants have very good grades and academic ability. Indeed, lawyers with good law school grades from less prestigious schools, who on average had lower LSAT scores and undergraduate GPAs than their peers at more prestigious schools, actually perform better when hired by big law firms, and are happier, than lawyers with the same law school grades from more prestigious schools. Other factors that matter aren't measured at all, and work to develop efficient ways to test those factors in a format such as a two-hour test is in its infancy.

This is a classic case of an SPR with a design problem. LSATs and undergraduate grades predict law school grades, particularly in the first year of law school, as well as any other available measure, but far more abilities are relevant to the practice of law than to being a law student. Two dimensions that did predict lawyer effectiveness (correlated strongly with each other, but more weakly with "g heavy" LSATs and academic grade point averages) were multiple choice tests of "situational judgment" and "biographical information," which were calculated to predict effectiveness measures such as practical judgment, organization, self-discipline, creativity, and effectiveness in interpersonal communication.

Another problem with SPRs is that they are often formulated in one context and then used in another to which they are ill suited and where they have not been validated. For example, an IQ test administered in Standard American English works just fine as a predictor of academic success and many other things when the people taking the test are native speakers of Standard American English. But it works rather less well when the people taking the test are native speakers of Spanish or Chinese who have only been learning English for a few years.

A recent Army test designed to measure "spiritual fitness" may be reasonably reliable for the Christians who make up most soldiers, and for a significant number of people who practice other religions. But the questions it presents may rest on category errors that make it invalid when given to atheists and agnostics. The fact that the test was developed by a psychologist who also inspired the CIA's torture program casts further doubt on the methodology used to create it.

More generally, one should be suspicious of SPRs in any situation where there are two or more populations of people who are very different in relevant ways, one of those populations is much larger than another, and the SPR isn't separately validated for each subpopulation. For example, the body measurements that indicate optimal fitness for men aren't likely to be the body measurements that indicate optimal fitness for women.
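The check that matters here is to validate the rule separately in each subpopulation rather than only on the pooled sample, where the larger group's results swamp the smaller group's. A sketch with synthetic data in which the predictor works in opposite directions in the two groups:

```python
# Sketch: a rule that validates on the pooled sample can still fail
# badly for a small subpopulation whose outcome structure differs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_big, n_small = 9500, 500

# The predictor works one way in the large group...
x_big = rng.normal(size=n_big)
y_big = rng.random(n_big) < 1 / (1 + np.exp(-2 * x_big))

# ...and the opposite way in the small group.
x_small = rng.normal(size=n_small)
y_small = rng.random(n_small) < 1 / (1 + np.exp(2 * x_small))

X = np.concatenate([x_big, x_small]).reshape(-1, 1)
y = np.concatenate([y_big, y_small])
group = np.array([0] * n_big + [1] * n_small)

spr = LogisticRegression().fit(X, y)

print("pooled accuracy:     ", spr.score(X, y))
print("large-group accuracy:", spr.score(X[group == 0], y[group == 0]))
print("small-group accuracy:", spr.score(X[group == 1], y[group == 1]))
```

The pooled and large-group numbers look fine; the small-group number, which the pooled figure hides, is worse than a coin flip.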

Likewise, SPRs in the criminal justice system that are predominantly validated with ordinary "blue collar" criminal defendants, and that show validity across a wide range of those defendants, may not be very useful for accurately predicting how "white collar" criminal defendants will act, since white collar defendants make up a tiny part of the validation sample and differ from blue collar defendants demographically and in a great many other respects.

Further, many SPRs only make sense for individuals within the "normal range" of whatever they were designed to measure. SAT scores from high school, for example, are a poor instrument for distinguishing graduate students who will go on to be professors at less prestigious institutions from those who will go on to be professors at more prestigious ones. Everyone who finishes graduate school and becomes a professor is very academically able, and the SAT is not very discriminating at the high end of the scale. Similarly, the SAT is a poor instrument for distinguishing between a mild and a severe developmental disability, because it isn't designed to be particularly discriminating or accurate at the very low end of the scale.
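Both failure modes amount to range restriction plus a score ceiling or floor, which is easy to demonstrate. In the sketch below (synthetic data), a test that tracks ability well in the full population carries almost no information among top scorers, because nearly all of them hit the ceiling:

```python
# Sketch: a capped score loses discriminating power at the high end.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

ability = rng.normal(size=n)
raw = ability + rng.normal(scale=0.5, size=n)
score = np.minimum(raw, 1.5)  # the test tops out; everyone above looks alike

# Correlation with ability in the whole population vs. among the
# top 2 percent (roughly, "everyone who goes on to be a professor").
top = ability > np.quantile(ability, 0.98)
print("full population:", np.corrcoef(score, ability)[0, 1])
print("top 2 percent:  ", np.corrcoef(score[top], ability[top])[0, 1])
```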

Using Imperfect SPRs

Expert supplementation of SPRs probably works best when one knows what an SPR does measure, what factors were ruled out as irrelevant in formulating it, what flaws have been demonstrated in it, and what relevant factors it does not measure.

For example, credit scores are probably as good as one is going to get at predicting a borrower's propensity to default. Any loan officer effort to second-guess that propensity from other data (like "good character") is likely to be counterproductive. But credit scores don't measure ability-to-pay factors such as income and assets. For example, default rates and bad debt losses are extremely low for purchase money mortgages with large down payments, even when the borrower has a very low credit score.
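One way to act on that knowledge, sketched below with synthetic data and invented effect sizes, is to keep the credit score as the measure of willingness to pay and add the ability-to-pay factor (here, the loan-to-value ratio) as a separate input to the rule, rather than letting a loan officer override the score informally:

```python
# Sketch: supplementing a credit-score SPR with an ability-to-pay
# factor it does not measure. Data and effect sizes are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n = 20_000

credit_score = rng.normal(680, 60, n)
loan_to_value = rng.uniform(0.5, 1.0, n)  # big down payment => low LTV

# Invented truth: both willingness and ability to pay drive default.
logit = -4 - 0.02 * (credit_score - 680) + 6 * (loan_to_value - 0.8)
default = rng.random(n) < 1 / (1 + np.exp(-logit))

score_only = (credit_score - 680).reshape(-1, 1)
score_plus_ltv = np.column_stack([credit_score - 680, loan_to_value])

# Ranking quality (AUC) improves when the rule gets the factor
# the score alone cannot see.
model = LogisticRegression()
print(cross_val_score(model, score_only, default,
                      scoring="roc_auc").mean())
print(cross_val_score(model, score_plus_ltv, default,
                      scoring="roc_auc").mean())
```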

Similarly, an expert may be helpful in identifying circumstances when an SPR (perhaps a "second best" one) is and is not likely to have validity. Someone who got high SAT scores in high school almost certainly does not have a severe developmental disability, unless some sort of brain injury has taken place since then. But in the case of someone who got low SAT scores in high school, the SAT may be unhelpful.
