Sunday, July 20, 2014

Running back missing data -- difference between zero and not available.


This post illustrates that one must be careful not to code a missing observation as zero. It builds on my previous post on running backs and outliers. Larry Stengent got injured in the preseason of his rookie year, and his actual career yardage is zero. But it is incorrect to assert (as I previously did) that his yards-per-carry figure was zero.


Interestingly, this post suggests that the skew statistic or changes in skew are a good way to flag the existence of outliers or data anomalies.
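One way to act on this idea is to drop each observation in turn and see how far the skew moves. The sketch below is only an illustration: the numbers are made up (they are not the running back data), the function names are my own, and the skew formula is the adjusted Fisher-Pearson sample skewness, which is one common definition and may not be the one used for the table in this post.

```python
import statistics


def sample_skew(values):
    """Adjusted Fisher-Pearson sample skewness (needs at least 3 values)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)


def skew_shift_by_point(values):
    """For each index, return the change in skew when that one point is dropped."""
    full = sample_skew(values)
    return [
        sample_skew(values[:i] + values[i + 1:]) - full
        for i in range(len(values))
    ]


# Hypothetical yards-per-carry values, NOT the data from this post.
# The last entry is a missing value that was wrongly coded as zero.
sample = [4.1, 3.8, 4.4, 3.6, 4.0, 4.3, 3.9, 0.0]

shifts = skew_shift_by_point(sample)
# The observation whose removal moves the skew the most is the one to inspect.
suspect = max(range(len(sample)), key=lambda i: abs(shifts[i]))
print("Most influential observation:", suspect, sample[suspect])
```

A large shift does not prove the observation is wrong. It only says that one point is driving the shape of the distribution and deserves a closer look.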

Question:  Use the previously published running back data to estimate the average, standard deviation, and skew of total career yards and career yards per carry.  


Raw data is in file below.



Calculate all statistics for the entire sample of first-choice picks and the sample with Larry Stengent omitted.     Comment on the results.

Answer:  The statistics are presented below.



Entire Sample
                      Total Yards   Yards Per Carry
Average                      6506               3.9
Standard Deviation           4978               0.8
Skew                          0.5              -3.8

Sample Minus Missing Player
                      Total Yards   Yards Per Carry
Average                      6709               4.0
Standard Deviation           4917               0.4
Skew                          0.4              -0.1

Observations:

Removal of Larry Stengent (the player who got injured in the preseason of his rookie year) does not have a huge impact on the average, standard deviation, or skew of total career yardage.

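The removal of this one data point totally alters the standard deviation and skew of yards per carry: the standard deviation falls from 0.8 to 0.4 and the skew moves from -3.8 to -0.1. It is incorrect to code yards per carry for this player at zero. With no carries, the ratio is undefined (not available) rather than zero. But for his unfortunate injury he would have had (non-zero) yards per carry.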

It is not unusual for players to have short careers and low career totals.

It is unusual for a top pick to have zero yards per carry.
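The difference between "zero" and "not available" can be sketched in a few lines. The values below are hypothetical (they are not the running back data), `None` is my assumed marker for a missing observation, and `summarize` is an illustrative helper that uses the adjusted Fisher-Pearson sample skewness, which may differ from the formula behind the table above.

```python
import statistics


def summarize(values):
    """Return (mean, sample std dev, sample skew), ignoring None (missing) entries."""
    observed = [v for v in values if v is not None]
    n = len(observed)
    if n < 3:
        raise ValueError("need at least three observed values")
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)
    skew = n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in observed)
    return mean, sd, skew


# Hypothetical yards-per-carry values, NOT the data from this post.
# None marks a player with no carries, so the ratio is not available.
ypc_missing = [4.1, 3.8, 4.4, 3.6, 4.0, 4.3, 3.9, None]

# The mistake: replacing the missing value with zero.
ypc_as_zero = [0.0 if v is None else v for v in ypc_missing]

mean_na, sd_na, skew_na = summarize(ypc_missing)
mean_zero, sd_zero, skew_zero = summarize(ypc_as_zero)

print(f"Missing kept as missing: mean={mean_na:.2f} sd={sd_na:.2f} skew={skew_na:.2f}")
print(f"Missing coded as zero:   mean={mean_zero:.2f} sd={sd_zero:.2f} skew={skew_zero:.2f}")
```

In this made-up sample, coding the missing value as zero lowers the mean, inflates the standard deviation, and produces a strongly negative skew, which is the same pattern as in the yards-per-carry column above.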


Be careful how you treat anomalies in your datasets!!!!!!!
