Useless Statistical Almanach – n°2
(Don’t forget to participate in the census if you have not yet done so)
Today, a very simple quiz:
A medical condition happens in the population with a frequency of 1 in 5000 people.
There is a test to decide if you have the disease, but it has a 5% error rate.
You get tested, and are found positive for the disease.
What is the probability that you actually have the condition?
(Answer in a little while. Propose your bets in the comments, don’t explain – yet – how you got there, but tell us how confident you are in your proposal…)
Sadly and worryingly, the proportion of MDs who got it right is pretty low.
MarcinGomulka’s solution is correct, but his explanation isn’t all that clear. Just for kicks, here’s another attempt, which may appeal more to those of a more visual frame of mind. (This is blatantly stolen from pages 46 to 49 of Larry Gonick’s Cartoon Guide to Statistics, except that I am leaving out the entertaining little pictures and jokes.)
Let’s make a table, showing all the possible combinations of tests and diseases. It will look something like this:
                 Disease   No Disease   Total
Tests Positive   A         B            A+B
Tests Negative   C         D            C+D
Total            A+C       B+D          A+B+C+D
Now to fill in the table with some “real” numbers. Let’s say there are 1 million people in the population. Then “A+B+C+D” is 1 million. We also know that 1 person in 5000 has the disease, so “A+C” (the total for the “disease” column) is 1/5000 of the total population, or 200. That means that “B+D” (the total for the “no disease” column) has to be 999800, because the bottom row has to add up to 1 million. So the table now looks like:
                 Disease   No Disease   Total
Tests Positive   A         B            A+B
Tests Negative   C         D            C+D
Total            200       999800       1000000
Now, to work upward: we are told that the test has a 5% error rate. Usually there are separate error rates for false positives and false negatives, but here they are the same. That means that 5% of the people who have the disease will be told they don’t have it, and 5% of the people who don’t have the disease will be told that they do have it. 5% of that 200 is 10, and 5% of 999800 is 49990. So now the table looks like:
                 Disease   No Disease   Total
Tests Positive   A         49990        A+B
Tests Negative   10        D            C+D
Total            200       999800       1000000
Since the columns have to add up, we can do some subtraction (200 – 10 = 190, and 999800 – 49990 = 949810) and fill in some more:
                 Disease   No Disease   Total
Tests Positive   190       49990        A+B
Tests Negative   10        949810       C+D
Total            200       999800       1000000
And last but not least, the rows have to sum (190 + 49990 = 50180, and 10 + 949810 = 949820) so the completed table says:
                 Disease   No Disease   Total
Tests Positive   190       49990        50180
Tests Negative   10        949810       949820
Total            200       999800       1000000
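For anyone who would rather let a machine do the bookkeeping, the table-filling steps above can be sketched in a few lines of Python. This is only an illustration: the function name `build_table` and the dictionary keys are made up for this sketch, and it assumes (as the walkthrough does) that the same 5% error rate applies to false positives and false negatives.

```python
from fractions import Fraction

def build_table(population, prevalence, error_rate):
    """Fill in the 2x2 table of test result vs. disease status.

    Assumes one shared error rate for false positives and false
    negatives, as in the walkthrough above.
    """
    diseased = population * prevalence      # "A+C" column total
    healthy = population - diseased         # "B+D" column total
    false_neg = diseased * error_rate       # C: sick, but test says no
    false_pos = healthy * error_rate        # B: healthy, but test says yes
    true_pos = diseased - false_neg         # A: column must add up
    true_neg = healthy - false_pos          # D: column must add up
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "true_neg": true_neg,
        "tests_positive": true_pos + false_pos,   # row total A+B
        "tests_negative": false_neg + true_neg,   # row total C+D
    }

# Exact fractions avoid floating-point rounding in the counts.
table = build_table(1_000_000, Fraction(1, 5000), Fraction(5, 100))
for name, count in table.items():
    print(f"{name:>15}: {count}")
```

Using `Fraction` keeps every count an exact whole number, so the printed cells can be compared directly against the table.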
The question was: if you test positive, what is the likelihood that you have the disease? Well, from the chart, we see that 50180 people will test positive, but that only 190 of that group will actually have the disease. That means the chance of having the disease will be 190 out of 50180, or about 0.38%. That’s less than one percent (which would be roughly 502 out of 50180), and less than even half of one percent (251 out of 50180). So a positive result is still far from a rock-solid certainty. (Also note that even though the disease only infects one fiftieth of one percent of the population, over five percent will test positive!)
But as was pointed out before, by getting that positive result, we jumped from having a 1 in 5000 chance of having the problem (which is not particularly worrisome) to having about a 1 in 264 chance, which is where you need to start taking things very seriously. (Consult your insurance policy, talk to your doctor, update your will if the disease is a serious one…)
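For those who prefer a formula to a table, the same numbers fall out of Bayes' theorem. The notation here is an assumption added for illustration: D means "has the disease" and + means "tests positive", with the 5% error rate applied in both directions.

```latex
\begin{aligned}
P(D \mid +)
  &= \frac{P(+ \mid D)\,P(D)}
          {P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)} \\[4pt]
  &= \frac{0.95 \times 0.0002}
          {0.95 \times 0.0002 + 0.05 \times 0.9998} \\[4pt]
  &= \frac{0.00019}{0.05018}
   \approx 0.0038
   \approx \frac{1}{264}
\end{aligned}
```

Multiply the top and bottom of the last fraction by 1 million and you get 190 out of 50180, exactly the "Tests Positive" row of the table.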
Among other names, this is called the “False Positive Paradox”.
There is one… um… sociological aspect of it, though, which strikes me as being a bit off: there is a hidden variable in this scenario which is nearly always there in real life, but is not in the table or the math. Specifically, for most diseases, a doctor won’t give you the test in the first place unless there is some reason to suspect you might have the disease — symptoms or a genetic predisposition. (As a trivial example: my doctor, at least, does not give me a strep throat test every time I show up. He reserves that for when I seem to have symptoms of strep throat.) So in reality, I suspect that the probability that you actually have the disease, given a positive result, is much higher. Still not in the “certainty” range, but more like double digits (depending on how closely the symptoms are linked to the disease, of course; for a fictional disease, we could have it be one of those “you feel just fine until the day you drop dead” maladies out of science fiction and DHS terrorism scenarios).
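To get a feel for how much the starting odds matter, here is a rough sketch that reruns the calculation with different prior probabilities. The function name `posterior` and the priors other than 1 in 5000 are hypothetical values picked purely for illustration, not figures for any real disease or test.

```python
def posterior(prior, error_rate=0.05):
    """Chance of having the disease given a positive test.

    Assumes the same error rate for false positives and false
    negatives, as in the quiz.
    """
    true_pos = (1.0 - error_rate) * prior    # sick and correctly flagged
    false_pos = error_rate * (1.0 - prior)   # healthy but flagged anyway
    return true_pos / (true_pos + false_pos)

# 1 in 5000 is the quiz's figure; the other priors are made-up stand-ins
# for "the doctor already had some reason to suspect the disease".
for prior in (1 / 5000, 1 / 100, 1 / 20, 1 / 4):
    print(f"prior {prior:.4f} -> posterior {posterior(prior):.3f}")
```

Under these assumptions a prior of 1 in 100 is already enough to push the answer into double-digit percentages, and a prior of 1 in 20 makes a positive result a coin flip.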
Posted by: Blind Misery | Jan 14 2005 3:17 utc | 49
Nope, Alabama, you’re just flat-out wrong in this case. (Don’t screw with me, man, I have a B.A. in math! 😉 ) Percentages of statistically distinct groups do not add like that. (And “people who have the disease” and “people who do not have the disease” are distinct.) To demonstrate this, consider any example where the two groups do not overlap.
One quick one of my own: suppose that there are 100 people in a building and half are men and half are women. Those two groups are distinct, yes? (Or, at least, for our purposes they are: every man is not a woman, and every woman is not a man. Our building is the Republican National Headquarters; no deviant gender-swapping allowed.)
Now, suppose I stick “kick me” signs on the backs of 10% of the men at random and 10% of the women at random. You would say that 20% of the people in the building now have signs, because 10% plus 10% is 20%, but this is not true: there are 50 men and 50 women (half of 100 each). That means “10% of the men” is 5 people, and “10% of the women” is also 5 people, for a total of 10, which is 10% of the whole.
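If you don't trust the arithmetic, the sign example is easy to check by brute force. This is a minimal sketch; the variable names are invented for the illustration.

```python
men, women = 50, 50      # the 100 people in the building
sign_percent = 10        # 10% of each group gets a sign

signed_men = men * sign_percent // 100       # 10% of 50 men
signed_women = women * sign_percent // 100   # 10% of 50 women
total_signed = signed_men + signed_women

# Share of the whole building that ends up wearing a sign.
overall_percent = 100 * total_signed / (men + women)
print(signed_men, signed_women, total_signed, overall_percent)
```

The overall figure comes out as 10%, not 20%, because each 10% is taken from only half of the building.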
The percentages only alter each other if the groups either overlap or do not cover the whole population, and then the result isn’t always predictable. Suppose I had said “I put signs on 10% of the men chosen at random, then put signs on 10% of the building’s occupants chosen at random.” Then things get messy. A man is also an occupant, and I didn’t specify that the second batch were additional people, so I might have put two signs on a single man. At one extreme, all 5 of the signed men are also among the 10 random occupants, and only 10 people (10%) carry signs. At the other, the 10 random occupants are all women, which means there were 5 men and 10 women, for 15 people total, or 15%. Or something in between. The best we can say is that I played a juvenile prank on at least 10 people, and that at least 5 of them were men.
In the disease case, “people who have the disease” and “people who don’t have the disease” are distinct groups, which together make up the whole population. So 5% of one and 5% of the other is 5% of the whole population.
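The same check works on the disease numbers. Again this is only a sketch with made-up variable names, using the 1 million population from the table earlier in the thread.

```python
population = 1_000_000
diseased = population // 5000        # 1 in 5000 has the disease
healthy = population - diseased
error_percent = 5                    # same error rate for both groups

wrong_diseased = diseased * error_percent // 100   # sick but told "no"
wrong_healthy = healthy * error_percent // 100     # healthy but told "yes"
wrong_total = wrong_diseased + wrong_healthy

# Share of the whole population that gets a wrong result.
overall_percent = 100 * wrong_total / population
print(wrong_diseased, wrong_healthy, wrong_total, overall_percent)
```

The two groups are wildly different in size, but because they don't overlap and together cover everyone, 5% of each still adds up to 5% of the total.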
Posted by: Blind Misery | Jan 14 2005 4:23 utc | 51