Thursday, July 15, 2021

Statistics Fundamentals Part 2 (Getting good data)

Cluster sampling

The dataset we've been working with was scraped from the WNBA's website, which centralizes data on basketball games and players in the WNBA. Let's suppose for a moment that such a site didn't exist, and the data were instead scattered across each individual team's website. There are twelve unique teams in our dataset, which means we'd have to scrape twelve different websites, each of which would require its own scraping script.

This scenario is quite common in the data science workflow: you want to answer some questions about a population, but the data are scattered in such a way that collecting them is either time-consuming or close to impossible. For instance, say you want to analyze how people review and rate movies as a function of movie budget. Many websites can help with data collection, but how do you go about it so that gathering the data you need takes a day or two rather than a month or two?

One way is to list all the data sources you can find, then randomly pick only a few of them to collect data from. You can then sample each of the sources you've picked individually. We call this sampling method cluster sampling, and we call each of the individual data sources a cluster.
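
To make this concrete, here's a minimal sketch of cluster sampling on wnba.csv with pandas. The column names it uses (Team, Height, Age, BMI, PTS) appear in the dataset's index shown in the output below; the choice of four clusters and the random_state are my own assumptions, so the exact teams picked and the resulting sampling errors won't match the numbers below.

    import pandas as pd

    wnba = pd.read_csv('wnba.csv')

    # Each of the twelve teams is a cluster.
    teams = wnba['Team'].unique()
    print(teams, type(teams))

    # Randomly pick four clusters (four is an arbitrary choice here),
    # then collect every player belonging to the picked teams.
    clusters = pd.Series(teams).sample(4, random_state=0)
    sample = wnba[wnba['Team'].isin(clusters)]
    print(sample.shape)

    # Sampling error = population parameter - sample statistic,
    # here for the means of four columns from the dataset.
    for col in ['Height', 'Age', 'BMI', 'PTS']:
        print(col, wnba[col].mean() - sample[col].mean())

Because only a few clusters get picked, cluster sampling trades some accuracy for much cheaper data collection; the sampling errors printed at the end measure that trade-off.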




Output
    ['DAL' 'LA' 'CON' 'SAN' 'MIN' 'SEA' 'PHO' 'CHI' 'WAS' 'NY' 'ATL' 'IND']
    <class 'numpy.ndarray'>
    ATL
    (46, 33)
    Index(['Name', 'Team', 'Pos', 'Height', 'Weight', 'BMI', 'Birth_Place',
           'Birthdate', 'Age', 'College', 'Experience', 'Games Played', 'MIN',
           'FGM', 'FGA', 'FG%', '15:00', '3PA', '3P%', 'FTM', 'FTA', 'FT%',
           'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS', 'DD2',
           'TD3', 'Pts_per_game'],
          dtype='object')
    -0.06400121617511445
    -1.401337792642142
    0.23095444165951662
    -27.79674673152934
