## Friday, April 18, 2014

### Studying rare diseases with survey data.


This post looks at issues associated with sampling when an event or occurrence is rare.

Question: A team of statisticians is collecting data on a wide range of medical conditions. The likelihood that a person has one particular disease is 1/50,000, or 0.00002.

The team questions 50,000 people. What is the likelihood that no person in the sample has the disease in question? What is the likelihood that the point estimate of the incidence of this disease in this population obtained from this sample is greater than or equal to 0.00004?

What are the answers to these questions if the team questions 100,000 people?

Methodology:

The number of people in the sample with the disease is binomially distributed, with the probability of having the disease equal to 0.00002 and the probability of not having the disease equal to 0.99998.

The easiest way to calculate binomial distribution probabilities is to use the BINOM.DIST function in Excel.

The BINOM.DIST function has four arguments: the number of successes (in this case the number of people with the disease), the number of trials, the probability of a success on each trial, and a logical variable set to 0 for the probability mass function (the probability that X=k) or set to 1 for the cumulative distribution function (the probability that X<=k).
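For readers without Excel, the same two quantities can be sketched in a few lines of Python using only the standard library. The function names `binom_pmf` and `binom_cdf` are my own labels for this illustration, not part of Excel or of any library, and the sketch assumes small values of k like the ones used in this post.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p); plays the role of BINOM.DIST(k, n, p, 0)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k); plays the role of BINOM.DIST(k, n, p, 1)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Hypothetical usage with the numbers from this post.
print(f"P(X = 0)  = {binom_pmf(0, 50_000, 0.00002):.6f}")
print(f"P(X <= 1) = {binom_cdf(1, 50_000, 0.00002):.6f}")
```

Summing the mass function term by term is fine here because only a handful of terms matter; a statistics package would be a better choice for large k.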

Answers: The table below lists the probability that X=k for a sample of 50,000 people.

| Number of people with the disease | Trials | Probability of having the disease | Binomial probability |
|---|---|---|---|
| 0 | 50000 | 0.00002 | 0.367876 |
| 1 | 50000 | 0.00002 | 0.367883 |
| 2 | 50000 | 0.00002 | 0.183942 |
| 3 | 50000 | 0.00002 | 0.061313 |
| 4 | 50000 | 0.00002 | 0.015328 |
| 5 | 50000 | 0.00002 | 0.003065 |
| 6 | 50000 | 0.00002 | 0.000511 |
| 7 | 50000 | 0.00002 | 0.000073 |
| 8 | 50000 | 0.00002 | 0.000009 |
| 9 | 50000 | 0.00002 | 0.000001 |
| 10 | 50000 | 0.00002 | 0.000000 |
| 11 | 50000 | 0.00002 | 0.000000 |
| 12 | 50000 | 0.00002 | 0.000000 |

The probability that no one in a sample of 50,000 people has this disease is 0.368.

When the number of successes is greater than or equal to 2, the estimated incidence of the disease (successes divided by 50,000) is greater than or equal to 0.00004. The likelihood of obtaining such an estimate is one minus the probability of 0 or 1 successes, or 0.264.
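The step from an incidence threshold to a count of successes is easy to get wrong, so here is a small Python sketch of that arithmetic for the 50,000-person sample. The variable names are hypothetical, and the threshold is held as an exact fraction so that floating-point rounding cannot push the cutoff from 2 to 3.

```python
from fractions import Fraction
from math import ceil, comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 50_000, 0.00002
threshold = Fraction(4, 100_000)  # an estimated incidence of 0.00004

# Smallest number of cases k for which the estimate k / n reaches the threshold.
k_min = ceil(threshold * n)

# P(no cases in the sample).
p_zero = binom_pmf(0, n, p)

# P(estimate >= threshold) = 1 - P(X < k_min), by the complement rule.
p_overestimate = 1 - sum(binom_pmf(k, n, p) for k in range(k_min))

print(k_min, round(p_zero, 3), round(p_overestimate, 3))
```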

The table below lists the probability that X=k for a sample of 100,000 people.

| Number of people with the disease | Trials | Probability of having the disease | Binomial probability |
|---|---|---|---|
| 0 | 100000 | 0.00002 | 0.135333 |
| 1 | 100000 | 0.00002 | 0.270671 |
| 2 | 100000 | 0.00002 | 0.270673 |
| 3 | 100000 | 0.00002 | 0.180449 |
| 4 | 100000 | 0.00002 | 0.090224 |
| 5 | 100000 | 0.00002 | 0.036089 |
| 6 | 100000 | 0.00002 | 0.012029 |
| 7 | 100000 | 0.00002 | 0.003437 |
| 8 | 100000 | 0.00002 | 0.000859 |
| 9 | 100000 | 0.00002 | 0.000191 |
| 10 | 100000 | 0.00002 | 0.000038 |
| 11 | 100000 | 0.00002 | 0.000007 |
| 12 | 100000 | 0.00002 | 0.000001 |

The probability that no one in a sample of 100,000 people has this disease is 0.135.

When the number of successes (people with the disease) is greater than or equal to 4, the estimated incidence of the disease is greater than or equal to 0.00004. This likelihood is one minus the probability of 3 or fewer successes, or 0.143.
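To put the two sample sizes side by side, the calculation can be wrapped in one function and run for both. This is only a sketch: `survey_risks` is a name I made up for this post, and it assumes the same prevalence (0.00002) and the same threshold (0.00004) used above.

```python
from fractions import Fraction
from math import ceil, comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def survey_risks(n, p, threshold):
    """Return (P(no cases), P(estimated incidence >= threshold)) for a sample of n.

    `threshold` should be a Fraction so the cutoff count is computed exactly.
    """
    k_min = ceil(threshold * n)  # fewest cases that reach the threshold
    p_zero = binom_pmf(0, n, p)
    p_over = 1 - sum(binom_pmf(k, n, p) for k in range(k_min))
    return p_zero, p_over

p = 0.00002
threshold = Fraction(4, 100_000)
results = {n: survey_risks(n, p, threshold) for n in (50_000, 100_000)}

for n, (p_zero, p_over) in results.items():
    print(f"n = {n:>7,}: P(no cases) = {p_zero:.3f}, P(estimate >= 0.00004) = {p_over:.3f}")
```

Doubling the sample roughly halves the chance of a badly inflated estimate, but it does not make that chance small.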

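Concluding Thoughts: General-population surveys like the Medical Expenditure Panel Survey (MEPS) are poorly suited to studying issues like the costs associated with a relatively rare disease or the characteristics of people with a particularly rare disease. Even a sample of 100,000 people is expected to contain only two people with a disease this rare, has a 13.5 percent chance of containing none at all, and has a 14.3 percent chance of producing an incidence estimate at least double the true value.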

People seemed to like my previous post on high-cost patients and health plan type.