Wednesday, July 14, 2021

Statistics Fundamentals (Getting good data)

 What is Statistics? 

It is the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.

Statistics is the grammar of data science.

Statistics is essential before you can produce high-quality models. Machine Learning starts out as Statistics and then advances; Linear Regression, for instance, is an age-old statistical analysis technique.

Knowledge of statistical techniques and metrics like mean, median, mode, variance, standard deviation, z-scores, confidence intervals, probability estimation, and hypothesis testing is essential. These tools help find structure in your data and provide deeper insights. Statistics is the most important discipline for analyzing and quantifying uncertainty.

It all begins with getting data for analysis. Once we have structured and measured our data, we can move on to visualizing large amounts of data and finding patterns.

Let us begin by exploring the kinds of problems we can solve with Statistics.

Here is the workflow of the statistical techniques we use...



Using statistical techniques, we can organize, summarize, and visualize large amounts of data to find patterns that otherwise would remain hidden.

Populations and Samples

As data analysts, we will often need to use a small dataset to answer questions about a much larger dataset.

In Statistics, we call the set of ALL individuals relevant to a particular statistical question a Population.

We call a smaller group selected from a population a Sample. When we select a smaller group from a population we are sampling.

Populations don't necessarily consist of people. A population can be monkeys ;) or companies, stars, planets, vegetables, factory-produced equipment, and so on.

The individual parts of a population go by many names like individuals, units, events, and observations. 

Population and Sample dataset example:


Say we have a dataset of individuals working in IT roles at a big company like Nike. If we wanted to answer a question about all of the individuals in the company (working in IT and non-IT roles), then the dataset we have is a sample dataset.
The dataset for all individuals in the company is the population.

Sampling Error

A sample is, by definition, an incomplete dataset for the question we are trying to answer. For this reason, there's almost always some difference between the metrics of a population and the metrics of a sample. We see this difference as an error, and because it is the result of sampling, we call it Sampling Error.

We call a metric specific to a population a parameter and we call a metric specific to a sample a statistic.
 
In other words, sampling error = parameter − statistic

Now let's play with a real-world dataset to understand Statistics.


The dataset is about basketball players in the WNBA (Women's National Basketball Association) 

Kaggle link: https://www.kaggle.com/jinxbe/wnba-player-stats-2017

Exploring the dataset:

This dataset contains all players who have played at least one game, so it is a population relative to the question we are trying to answer.
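Here is a minimal sketch of the exploration code (assuming the dataset is saved locally as wnba.csv):

import pandas as pd

# Load the WNBA player dataset
wnba = pd.read_csv('wnba.csv')

# Inspect the first and last five rows, and the dataset's dimensions
print(wnba.head())
print(wnba.tail())
print(wnba.shape)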

Output (condensed): the first and last five rows of the DataFrame print out (columns Name, Team, Pos, Height, Weight, BMI, Birth_Place, Birthdate, Age, College, Experience, Games Played, MIN, FGM, FGA, FG%, ..., PTS, DD2, TD3), followed by its shape: (143, 32), i.e. 143 players and 32 columns.
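A sketch of listing all the column names:

# All 32 column names of the dataset
print(wnba.columns)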



Output:
Index(['Name', 'Team', 'Pos', 'Height', 'Weight', 'BMI', 'Birth_Place', 'Birthdate', 'Age', 'College', 'Experience', 'Games Played', 'MIN', 'FGM', 'FGA', 'FG%', '15:00', '3PA', '3P%', 'FTM', 'FTA', 'FT%', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS', 'DD2', 'TD3'], dtype='object')
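Next, a sketch of drawing a simple random sample of 30 values from the Games Played column (the random_state value here is an assumption):

# Simple random sample of 30 players' games played
sample = wnba['Games Played'].sample(30, random_state=1)
print(sample)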


Output (condensed): a random sample of 30 values from the Games Played column, e.g. player 78 → 30 games, player 116 → 29, player 31 → 26, ..., player 97 → 21 (Name: Games Played, Length: 30, dtype: int64).
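The output below is consistent with taking the maximum games played in the population as the parameter, the maximum within a random sample of 30 as the statistic, and their difference as the sampling error; a sketch:

# Population metric (parameter): maximum games played by any player
parameter = wnba['Games Played'].max()

# Sample metric (statistic): maximum within a random sample of 30 players
statistic = wnba['Games Played'].sample(30, random_state=1).max()

# Sampling error = parameter - statistic
print(parameter, statistic, parameter - statistic)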



Output: 32 30 2 (the population maximum is 32 games, the sample maximum is 30, so the sampling error is 32 − 30 = 2).


Simple Random Sampling 

In statistical terms, we want our samples to be representative of their corresponding populations. If a sample is representative, the sampling error is low. The more representative a sample is, the smaller the sampling error; the less representative it is, the greater the sampling error.




Say we want to find the mean height of people in the USA. To make our samples representative, we can give every individual in the population an equal chance of being selected. We want a very tall individual to have the same chance of being selected as a short individual. To give every individual an equal chance of selection, we need to sample randomly.

One way to perform random sampling is to generate random numbers and use them to select a few sample units from the population. In statistics, this sampling method is called simple random sampling, often abbreviated as SRS.


"Series.sample() method performs simple random sampling by generating an array of random numbers, and then using those numbers to select values from a Series at the indices corresponding to those random numbers. We can also extend this method for DataFrame objects, where we can sample random rows or columns.

When we use the random_state parameter, like Series.sample(30, random_state = 1), we make the generation of random numbers predictable. This is because Series.sample() uses a pseudorandom number generator. A pseudorandom number generator uses an initial value to generate a sequence of numbers that has properties similar to those of a sequence that is truly random. With random_state, we specify that initial value used by the pseudorandom number generator.

If we want to generate a sequence of five numbers using a pseudorandom generator, beginning from an initial value of 1, we'll get the same five numbers no matter how many times we run the code. So if we run wnba['Games Played'].sample(5, random_state = 1), we get the same sample every time.
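For instance, a quick check that a fixed seed reproduces the same sample:

# Two calls with the same random_state return identical samples
s1 = wnba['Games Played'].sample(5, random_state=1)
s2 = wnba['Games Played'].sample(5, random_state=1)
print(s1.equals(s2))  # True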

Pseudorandom number generators are useful in scientific research where reproducible work is necessary. In our case, they let anyone following along work with the same samples, which in turn allows for meaningful answer-checking.
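Here is a sketch of the simulation behind the output below: take 100 simple random samples of 10 players each, record the mean points (PTS) of every sample, and plot the sample means against the population mean (using the loop index as the seed is an assumption):

import matplotlib.pyplot as plt

population_mean = wnba['PTS'].mean()

# 100 simple random samples of size 10; one mean per sample
sample_means = []
for i in range(100):
    sample = wnba['PTS'].sample(10, random_state=i)
    sample_means.append(sample.mean())

plt.scatter(range(1, 101), sample_means)  # one point per sample mean
plt.axhline(population_mean)              # horizontal line at the population mean
plt.show()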

Output (condensed): the population mean of PTS is 201.79, and the 100 sample means range roughly from 115 to 301.

[Scatter plot: 100 sample means scattered around a horizontal line at the population mean]





The importance of Sample Size

Sample size 10: [scatter plot: the sample means spread widely around the population mean line]

Sample size 100: [scatter plot: the sample means cluster tightly around the population mean line]




Observations and conclusions:


1. A sample not being representative of the population is a problem. This can be addressed by increasing the sample size: as we increase the sample size, the sample means vary less around the population mean, and the chances of getting an unrepresentative sample decrease.

2. Simple random sampling is not a reliable sampling method when the sample size is small. Try to get as large a sample as possible: a larger sample decreases the variability of the sampling process, which in turn decreases the chance of getting an unrepresentative sample. A numerical check is sketched below.
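A minimal sketch of that check (not part of the original exercise): compare how widely the sample means spread for sample sizes 10 and 100.

# Spread (max - min) of 100 sample means for two sample sizes
for size in (10, 100):
    means = [wnba['PTS'].sample(size, random_state=i).mean() for i in range(100)]
    print(size, round(max(means) - min(means), 1))  # spread shrinks as size grows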

Stratified Sampling

Because simple random sampling is entirely random, it can leave out individuals who are relevant to some of our questions.

In other words, it is not guaranteed that we will have a representative sample.





Now let us perform stratified sampling on the real-world WNBA dataset:
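A sketch of the setup: compute each player's points per game and list the distinct playing positions, which will serve as our strata:

# Points scored per game for each player
wnba['points_per_game'] = wnba['PTS'] / wnba['Games Played']
print(wnba['points_per_game'])

# The five distinct playing positions (our strata)
print(wnba['Pos'].unique())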

Output (condensed): a points_per_game value for each of the 143 players (11.625, 7.233, 8.385, ..., 5.033; Name: points_per_game, Length: 143, dtype: float64), and the five distinct positions: ['F' 'G/F' 'G' 'C' 'F/C'].
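A sketch of stratified sampling by position: sample 10 players from each position stratum, compute the mean points per game per stratum, and find the position that scores the most (the per-stratum sample size and random_state are assumptions):

points_per_position = {}
for position in wnba['Pos'].unique():
    # Stratum: all players of one position; sample 10 of them
    stratum = wnba[wnba['Pos'] == position]
    points_per_position[position] = stratum['points_per_game'].sample(
        10, random_state=0).mean()

# Position with the highest sampled mean points per game
best_position = max(points_per_position, key=points_per_position.get)
print(points_per_position, best_position)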


Output: {'C': 9.833761394334251, 'F': 8.702175158545568, 'G/F': 6.817264935760487, 'G': 7.092926195632343, 'F/C': 9.059118773946361} C

(Centers, 'C', have the highest sampled mean points per game.)

Proportional Stratified Sampling

Approximately 72.7 percent of the players played 23 or more games in the 2016-2017 season, which means this category of players probably has the greatest influence on the mean number of points.

So, when we randomly sample, we might end up with a sample where only 2% of the players played 23 or more games. This leads to underestimation.

Or we might end up with a sample where 98% of the players played 23 or more games. This leads to overestimation.

Both underestimation and overestimation are a problem, and both are common with small samples.

One solution to this problem is to use stratified sampling while being mindful of the proportions in the population.
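A sketch of how the proportions can be measured: check the range of Games Played and the percentage of players falling into each of three equal-width bins:

# Minimum and maximum games played
print(wnba['Games Played'].min(), wnba['Games Played'].max())

# Percentage of players in each of three equal-width bins
print(wnba['Games Played'].value_counts(bins=3, normalize=True) * 100)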




Output: the minimum is 2 games and the maximum is 32; the Games Played bins break down as

(22.0, 32.0]     72.73%
(12.0, 22.0]     18.18%
(1.969, 12.0]     9.09%
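A sketch of proportional stratified sampling: split the players into the three games-played strata and, in each of 100 iterations, draw 1, 2, and 7 players from them (roughly matching the 9%/18%/73% population proportions in a sample of 10), then record the mean points:

# The three strata, using the bin boundaries found above
under_12 = wnba[wnba['Games Played'] <= 12]
btw_13_22 = wnba[(wnba['Games Played'] > 12) & (wnba['Games Played'] <= 22)]
over_23 = wnba[wnba['Games Played'] > 22]

proportional_means = []
for i in range(100):
    # Sample each stratum in proportion to its share of the population
    sample = pd.concat([
        under_12['PTS'].sample(1, random_state=i),
        btw_13_22['PTS'].sample(2, random_state=i),
        over_23['PTS'].sample(7, random_state=i),
    ])
    proportional_means.append(sample.mean())

print(wnba['PTS'].mean())
print(proportional_means)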


Output (condensed): the population mean of PTS is 201.79, and the 100 proportional-sample means still range roughly from 118 to 330.





You might not have been impressed by the results we got by sampling proportionally. 

We see there is no big difference between simple random sampling and proportional sampling: the sample means are still unrepresentative, often falling far from the population mean. This poor performance is a result of choosing bad strata; stratifying the data by the number of games played was a bad idea after all.

It makes more sense to stratify the data by the number of minutes played than by the number of games played: minutes played is a much better indicator of how much a player scored in a season.




So how do we choose the right strata?

1. Minimize the variability within each stratum
2. Maximize the variability between strata
3. The stratifying criterion should correlate strongly with the property you're trying to measure.

Minimize the variability within each stratum

Say, for example, Brindha has scored 10 points and Mani has scored 1000 points. Avoid putting Brindha and Mani in the same stratum, since the variability of points within it would be high.

If the variability is high, it is a sign that you either need more granular stratification (more strata) or need to change the criterion of stratification (for example, from games played to minutes played).

Maximize the variability between strata

Good strata are different from one another.
If your strata are similar to one another with respect to what you want to measure, change the stratification criterion or go for more granular stratification.

For example, when the stratification criterion was Games Played, it resulted in strata that were similar to each other regarding the distribution of total points.

So we changed the criterion to minutes played and managed to increase the variability between strata.

The stratifying criterion should correlate strongly with the property you're trying to measure.

The minutes played column (the criterion) and the total number of points (the property) are strongly correlated.
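A sketch of stratifying by minutes played instead: split MIN into five equal-width intervals, draw 12 players from each stratum per iteration (the interval count and per-stratum sample size are inferred from the output below), and plot the mean points of each combined sample:

# Percentage of players in each of five equal-width bins of minutes played
print(wnba['MIN'].value_counts(bins=5, normalize=True) * 100)

# Label each player with their minutes-played stratum
wnba['MIN_bin'] = pd.cut(wnba['MIN'], bins=5)

sample_means = []
for i in range(100):
    # 12 players from each of the five strata: 60 players per sample
    sample = pd.concat(
        wnba[wnba['MIN_bin'] == interval]['PTS'].sample(12, random_state=i)
        for interval in wnba['MIN_bin'].unique()
    )
    sample_means.append(sample.mean())

plt.scatter(range(1, 101), sample_means)
plt.axhline(wnba['PTS'].mean())  # population mean for reference
plt.show()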




Output (condensed): the five MIN strata each hold roughly 17–22 percent of the players:

(213.2, 414.4]     22.38%
(615.6, 816.8]     20.28%
(10.993, 213.2]    20.28%
(816.8, 1018.0]    19.58%
(414.4, 615.6]     17.48%

The population mean of PTS is 201.79, and the 100 sample means now cluster tightly, roughly between 189 and 218.

[Scatter plot: sample means hugging the population mean line]


Let us continue in another blog...




























