Running back missing data – difference between zero and not available
This post illustrates that one must be careful not to code a
missing observation as zero. It builds
on my previous post on running backs and outliers. Larry Stengent got injured in the preseason
of his rookie year, and his actual career yardage is zero. But it is incorrect to assert (as I previously
did) that his yards-per-carry figure was zero.
Interestingly, this post suggests that the skew statistic or
changes in skew are a good way to flag the existence of outliers or data
anomalies.
Question: Use the previously published running back
data to estimate the average, standard deviation, and skew of total career
yards and career yards per carry.
Raw data is in file below.
Calculate all statistics for the entire sample of
first-choice picks and for the sample with Larry Stengent omitted. Comment on the results.
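The calculation asked for here can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the yards-per-carry numbers are made up for illustration and are not the running back dataset, the helper names `sample_skew` and `summarize` are hypothetical, and the skew formula is the adjusted sample skewness used by Excel's SKEW function, which may or may not be the one behind the figures in this post.

```python
import statistics


def sample_skew(values):
    """Adjusted sample skewness (the formula used by Excel's SKEW function)."""
    n = len(values)
    if n < 3:
        raise ValueError("skew needs at least three observations")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in values)


def summarize(values):
    """Return the three statistics the question asks for."""
    return {
        "average": statistics.mean(values),
        "std_dev": statistics.stdev(values),
        "skew": sample_skew(values),
    }


# Hypothetical yards-per-carry figures, NOT the dataset from this post.
# The final 0.0 stands in for a missing value that was miscoded as zero.
ypc = [3.6, 3.8, 3.9, 4.0, 4.1, 4.2, 4.4, 0.0]

with_zero = summarize(ypc)          # entire sample, miscoded value included
without_zero = summarize(ypc[:-1])  # sample with the miscoded value omitted

print("entire sample:  ", with_zero)
print("player omitted: ", without_zero)
```

In this made-up sample the single zero moves the standard deviation and the skew far more than it moves the average, which is the kind of change worth flagging.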
Answer: The statistics are presented below.
Entire Sample

                      Total Yards   Yards Per Carry
Average                      6506               3.9
Standard Deviation           4978               0.8
Skew                          0.5               3.8

Sample Minus Missing Player

                      Total Yards   Yards Per Carry
Average                      6709               4.0
Standard Deviation           4917               0.4
Skew                          0.4               0.1

Observations:
Removal of Larry Stengent
(the player who got injured in the preseason of his rookie year) does not have a
huge impact on the average, standard deviation, or skew of total career yardage.
The removal of this one data point totally alters the
standard deviation and skew of yards per carry: the standard deviation falls from 0.8 to 0.4 and the skew from 3.8 to 0.1.
It is incorrect to code yards per carry for this player as zero. But for his unfortunate injury, he would have
had (nonzero) yards per carry.
It is not unusual for players to have short careers and low
career totals.
It is unusual for a top pick to have zero yards per carry.
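One way to keep "zero" and "not available" apart in code is sketched below. It assumes a player with no carries should have a yards-per-carry value of `None` rather than 0, since zero yards divided by zero carries is undefined. The career totals are made up for illustration and are not the data from this post, and `yards_per_carry` is a hypothetical helper.

```python
import statistics


def yards_per_carry(yards, carries):
    """Return yards per carry, or None when the player never carried the ball."""
    if carries == 0:
        return None  # 0 / 0 is undefined: not available, not zero
    return yards / carries


# Hypothetical (yards, carries) career totals, NOT the dataset from this post.
# The last player never played, so both totals are legitimately zero.
careers = [(8000, 2000), (4200, 1000), (1900, 500), (0, 0)]

total_yards = [yards for yards, _ in careers]
ypc = [yards_per_carry(yards, carries) for yards, carries in careers]
ypc_available = [value for value in ypc if value is not None]

# Zero is a real observation for total yards, so it stays in the average.
avg_total_yards = statistics.mean(total_yards)

# The missing yards-per-carry value is skipped rather than counted as zero.
avg_ypc = statistics.mean(ypc_available)

# For comparison: the average when the missing value is miscoded as zero.
avg_ypc_miscoded = statistics.mean([value if value is not None else 0.0 for value in ypc])

print(avg_total_yards, avg_ypc, avg_ypc_miscoded)
```

The design choice is to make the missing value impossible to average by accident: `None` has to be filtered out explicitly, whereas a zero slips silently into every statistic.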
Be careful how you treat anomalies in your datasets!