Wednesday, July 14, 2021

Statistics Fundamentals (Getting good data)

 What is Statistics? 

It is the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.

Statistics is the grammar of data science.

Statistics is essential before you can produce high-quality models. Machine Learning starts out as Statistics and then advances; Linear Regression, for instance, is an age-old statistical analysis technique.

Knowledge of statistical techniques and metrics like mean, median, mode, variance, standard deviation, z-scores, confidence intervals, probability estimation, and hypothesis testing is essential. These tools help find structure in your data and provide deeper insights. Statistics is the most important discipline for analyzing and quantifying uncertainty.

It all begins with getting data for analysis. Once we have structured and measured our data, we can move on to visualizing large amounts of data and finding patterns.

Let us begin by exploring the kinds of problems we can solve with Statistics.

Here is the workflow of the statistical techniques we use...



Using statistical techniques, we can organize, summarize, and visualize large amounts of data to find patterns that otherwise would remain hidden.

Populations and Samples

As data analysts, we will often need to use a small dataset to answer questions about a much larger dataset.

In Statistics, we call the set of ALL individuals relevant to a particular statistical question a Population.

We call a smaller group selected from a population a Sample. When we select a smaller group from a population we are sampling.

Populations don't necessarily consist of people. A population can be monkeys ;) or companies, stars, planets, vegetables, factory-produced equipment, and so on.

The individual parts of a population go by many names like individuals, units, events, and observations. 

Population and Sample dataset example:


Say we have a dataset of individuals working in IT roles at a big company like Nike. If we wanted to answer a question about all of the individuals in the company (working in IT and non-IT roles), then the dataset we have is a sample dataset.
The dataset for all individuals in the company is the population.

Sampling Error

A sample is, by definition, an incomplete dataset for the question we are trying to answer. For this reason, there's almost always some difference between the metrics of a population and the metrics of a sample. We see this difference as an error, and because it is the result of sampling, we call it Sampling Error.

We call a metric specific to a population a parameter and we call a metric specific to a sample a statistic.
 
In other words, sampling error = parameter − statistic

Now let's play with a real-world dataset to understand Statistics.


The dataset is about basketball players in the WNBA (Women's National Basketball Association) 

Kaggle link: https://www.kaggle.com/jinxbe/wnba-player-stats-2017

Exploring the dataset:

This dataset contains all players who have played at least one game, so it is a population relative to the question we are trying to answer.
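Here is a minimal sketch of the exploration code (assuming the dataset is saved locally as wnba.csv):

import pandas as pd

# Load the WNBA player dataset
wnba = pd.read_csv('wnba.csv')

# Inspect the first and last five rows, and the dataset's dimensions
print(wnba.head())
print(wnba.tail())
print(wnba.shape)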

Output (condensed): the first and last five rows of the DataFrame print out (columns Name, Team, Pos, Height, Weight, BMI, Birth_Place, Birthdate, Age, College, Experience, Games Played, MIN, FGM, FGA, FG%, ..., PTS, DD2, TD3), followed by its shape: (143, 32), i.e. 143 players and 32 columns.
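A sketch of listing all the column names:

# All 32 column names of the dataset
print(wnba.columns)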



Output:
Index(['Name', 'Team', 'Pos', 'Height', 'Weight', 'BMI', 'Birth_Place', 'Birthdate', 'Age', 'College', 'Experience', 'Games Played', 'MIN', 'FGM', 'FGA', 'FG%', '15:00', '3PA', '3P%', 'FTM', 'FTA', 'FT%', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS', 'DD2', 'TD3'], dtype='object')
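Next, a sketch of drawing a simple random sample of 30 values from the Games Played column (the random_state value here is an assumption):

# Simple random sample of 30 players' games played
sample = wnba['Games Played'].sample(30, random_state=1)
print(sample)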


Output (condensed): a random sample of 30 values from the Games Played column, e.g. player 78 → 30 games, player 116 → 29, player 31 → 26, ..., player 97 → 21 (Name: Games Played, Length: 30, dtype: int64).
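The output below is consistent with taking the maximum games played in the population as the parameter, the maximum within a random sample of 30 as the statistic, and their difference as the sampling error; a sketch:

# Population metric (parameter): maximum games played by any player
parameter = wnba['Games Played'].max()

# Sample metric (statistic): maximum within a random sample of 30 players
statistic = wnba['Games Played'].sample(30, random_state=1).max()

# Sampling error = parameter - statistic
print(parameter, statistic, parameter - statistic)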



Output: 32 30 2 (the population maximum is 32 games, the sample maximum is 30, so the sampling error is 32 − 30 = 2).


Simple Random Sampling 

In statistical terms, we want our samples to be representative of their corresponding populations. If a sample is representative, the sampling error is low. The more representative a sample is, the smaller the sampling error; the less representative it is, the greater the sampling error.




Say we want to find the mean height of people in the USA. To make our samples representative, we can give every individual in the population an equal chance of being selected. We want a very tall individual to have the same chance of being selected as a short individual. To give every individual an equal chance of selection, we need to sample randomly.

One way to perform random sampling is to generate random numbers and use them to select a few sample units from the population. In statistics, this sampling method is called simple random sampling, often abbreviated as SRS.


"Series.sample() method performs simple random sampling by generating an array of random numbers, and then using those numbers to select values from a Series at the indices corresponding to those random numbers. We can also extend this method for DataFrame objects, where we can sample random rows or columns.

When we use the random_state parameter, like Series.sample(30, random_state = 1), we make the generation of random numbers predictable. This is because Series.sample() uses a pseudorandom number generator. A pseudorandom number generator uses an initial value to generate a sequence of numbers that has properties similar to those of a sequence that is truly random. With random_state, we specify that initial value used by the pseudorandom number generator.

If we want to generate a sequence of five numbers using a pseudorandom generator, beginning from an initial value of 1, we'll get the same five numbers no matter how many times we run the code. So if we run wnba['Games Played'].sample(5, random_state = 1), we get the same sample every time.
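For instance, a quick check that a fixed seed reproduces the same sample:

# Two calls with the same random_state return identical samples
s1 = wnba['Games Played'].sample(5, random_state=1)
s2 = wnba['Games Played'].sample(5, random_state=1)
print(s1.equals(s2))  # True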

Pseudorandom number generators are useful in scientific research where reproducible work is necessary. In our case, they let anyone following along work with the same samples, which in turn allows for meaningful answer-checking.
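Here is a sketch of the simulation behind the output below: take 100 simple random samples of 10 players each, record the mean points (PTS) of every sample, and plot the sample means against the population mean (using the loop index as the seed is an assumption):

import matplotlib.pyplot as plt

population_mean = wnba['PTS'].mean()

# 100 simple random samples of size 10; one mean per sample
sample_means = []
for i in range(100):
    sample = wnba['PTS'].sample(10, random_state=i)
    sample_means.append(sample.mean())

plt.scatter(range(1, 101), sample_means)  # one point per sample mean
plt.axhline(population_mean)              # horizontal line at the population mean
plt.show()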

Output (condensed): the population mean of PTS is 201.79, and the 100 sample means range roughly from 115 to 301.

[Scatter plot: 100 sample means scattered around a horizontal line at the population mean]





The importance of Sample Size

Sample size 10: [scatter plot: the sample means spread widely around the population mean line]

Sample size 100: [scatter plot: the sample means cluster tightly around the population mean line]




Observations and conclusions:


1. A sample not being representative of the population is a problem. This can be addressed by increasing the sample size: as we increase the sample size, the sample means vary less around the population mean, and the chances of getting an unrepresentative sample decrease.

2. Simple random sampling is not a reliable sampling method when the sample size is small. Try to get as large a sample as possible: a larger sample decreases the variability of the sampling process, which in turn decreases the chance of getting an unrepresentative sample. A numerical check is sketched below.
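A minimal sketch of that check (not part of the original exercise): compare how widely the sample means spread for sample sizes 10 and 100.

# Spread (max - min) of 100 sample means for two sample sizes
for size in (10, 100):
    means = [wnba['PTS'].sample(size, random_state=i).mean() for i in range(100)]
    print(size, round(max(means) - min(means), 1))  # spread shrinks as size grows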

Stratified Sampling

Because simple random sampling is entirely random, it can leave out individuals who are relevant to some of our questions.

In other words, it is not guaranteed that we will have a representative sample.





Now let us perform stratified sampling on the real-world WNBA dataset:
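A sketch of the setup: compute each player's points per game and list the distinct playing positions, which will serve as our strata:

# Points scored per game for each player
wnba['points_per_game'] = wnba['PTS'] / wnba['Games Played']
print(wnba['points_per_game'])

# The five distinct playing positions (our strata)
print(wnba['Pos'].unique())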

Output (condensed): a points_per_game value for each of the 143 players (11.625, 7.233, 8.385, ..., 5.033; Name: points_per_game, Length: 143, dtype: float64), and the five distinct positions: ['F' 'G/F' 'G' 'C' 'F/C'].
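A sketch of stratified sampling by position: sample 10 players from each position stratum, compute the mean points per game per stratum, and find the position that scores the most (the per-stratum sample size and random_state are assumptions):

points_per_position = {}
for position in wnba['Pos'].unique():
    # Stratum: all players of one position; sample 10 of them
    stratum = wnba[wnba['Pos'] == position]
    points_per_position[position] = stratum['points_per_game'].sample(
        10, random_state=0).mean()

# Position with the highest sampled mean points per game
best_position = max(points_per_position, key=points_per_position.get)
print(points_per_position, best_position)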


Output: {'C': 9.833761394334251, 'F': 8.702175158545568, 'G/F': 6.817264935760487, 'G': 7.092926195632343, 'F/C': 9.059118773946361} C

(Centers, 'C', have the highest sampled mean points per game.)

Proportional Stratified Sampling

Approximately 72.7 percent of the players played 23 or more games in the 2016-2017 season, which means this category of players probably has the greatest influence on the mean number of points.

So, when we randomly sample, we might end up with a sample where only 2% of the players played 23 or more games. This leads to underestimation.

Or we might end up with a sample where 98% of the players played 23 or more games. This leads to overestimation.

Both underestimation and overestimation are a problem, and both are common with small samples.

One solution to this problem is to use stratified sampling while being mindful of the proportions in the population.
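A sketch of how the proportions can be measured: check the range of Games Played and the percentage of players falling into each of three equal-width bins:

# Minimum and maximum games played
print(wnba['Games Played'].min(), wnba['Games Played'].max())

# Percentage of players in each of three equal-width bins
print(wnba['Games Played'].value_counts(bins=3, normalize=True) * 100)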




Output: the minimum is 2 games and the maximum is 32; the Games Played bins break down as

(22.0, 32.0]     72.73%
(12.0, 22.0]     18.18%
(1.969, 12.0]     9.09%
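A sketch of proportional stratified sampling: split the players into the three games-played strata and, in each of 100 iterations, draw 1, 2, and 7 players from them (roughly matching the 9%/18%/73% population proportions in a sample of 10), then record the mean points:

# The three strata, using the bin boundaries found above
under_12 = wnba[wnba['Games Played'] <= 12]
btw_13_22 = wnba[(wnba['Games Played'] > 12) & (wnba['Games Played'] <= 22)]
over_23 = wnba[wnba['Games Played'] > 22]

proportional_means = []
for i in range(100):
    # Sample each stratum in proportion to its share of the population
    sample = pd.concat([
        under_12['PTS'].sample(1, random_state=i),
        btw_13_22['PTS'].sample(2, random_state=i),
        over_23['PTS'].sample(7, random_state=i),
    ])
    proportional_means.append(sample.mean())

print(wnba['PTS'].mean())
print(proportional_means)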


Output (condensed): the population mean of PTS is 201.79, and the 100 proportional-sample means still range roughly from 118 to 330.





You might not have been impressed by the results we got by sampling proportionally. 

We see there is no big difference between simple random sampling and proportional sampling: the sample means are still unrepresentative, often falling far from the population mean. This poor performance is a result of choosing bad strata; stratifying the data by the number of games played was a bad idea after all.

It makes more sense to stratify the data by the number of minutes played than by the number of games played: minutes played is a much better indicator of how much a player scored in a season.




So how do we choose the right strata?

1. Minimize the variability within each stratum
2. Maximize the variability between strata
3. The stratifying criterion should correlate strongly with the property you're trying to measure.

Minimize the variability within each stratum

Say, for example, Brindha has scored 10 points and Mani has scored 1000 points. Avoid putting Brindha and Mani in the same stratum, since the variability of points within it would be high.

If the variability is high, it is a sign that you either need more granular stratification (more strata) or need to change the criterion of stratification (for example, from games played to minutes played).

Maximize the variability between strata

Good strata are different from one another.
If your strata are similar to one another with respect to what you want to measure, change the stratification criterion or go for more granular stratification.

For example, when the stratification criterion was Games Played, it resulted in strata that were similar to each other regarding the distribution of total points.

So we changed the criterion to minutes played and managed to increase the variability between strata.

The stratifying criterion should correlate strongly with the property you're trying to measure.

The minutes played column (the criterion) and the total number of points (the property) are strongly correlated.
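A sketch of stratifying by minutes played instead: split MIN into five equal-width intervals, draw 12 players from each stratum per iteration (the interval count and per-stratum sample size are inferred from the output below), and plot the mean points of each combined sample:

# Percentage of players in each of five equal-width bins of minutes played
print(wnba['MIN'].value_counts(bins=5, normalize=True) * 100)

# Label each player with their minutes-played stratum
wnba['MIN_bin'] = pd.cut(wnba['MIN'], bins=5)

sample_means = []
for i in range(100):
    # 12 players from each of the five strata: 60 players per sample
    sample = pd.concat(
        wnba[wnba['MIN_bin'] == interval]['PTS'].sample(12, random_state=i)
        for interval in wnba['MIN_bin'].unique()
    )
    sample_means.append(sample.mean())

plt.scatter(range(1, 101), sample_means)
plt.axhline(wnba['PTS'].mean())  # population mean for reference
plt.show()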




Output (condensed): the five MIN strata each hold roughly 17–22 percent of the players:

(213.2, 414.4]     22.38%
(615.6, 816.8]     20.28%
(10.993, 213.2]    20.28%
(816.8, 1018.0]    19.58%
(414.4, 615.6]     17.48%

The population mean of PTS is 201.79, and the 100 sample means now cluster tightly, roughly between 189 and 218.

[Scatter plot: sample means hugging the population mean line]


Let us continue in another blog...




























