What is Statistics?
It is the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.
Statistics is the grammar of data science.
Statistics is an essential foundation before you can produce high-quality models. Machine Learning starts out as Statistics and then advances. Linear Regression, for example, is an age-old statistical analysis concept.
Knowledge of statistical techniques and metrics like mean, median, mode, variance, standard deviation, z-scores, confidence intervals, probability estimation, and hypothesis testing is essential. These tools help find structure in your data and provide deeper insights. Statistics is the most important discipline for analyzing and quantifying uncertainty.
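As a quick illustration, here is a minimal sketch in Python of a few of these metrics, using pandas on made-up numbers (not data from this post):

import pandas as pd

data = pd.Series([4, 8, 15, 16, 23, 42])  # made-up example values

print(data.mean())    # mean
print(data.median())  # median
print(data.mode())    # mode (can return more than one value)
print(data.var())     # sample variance (pandas uses ddof=1 by default)
print(data.std())     # sample standard deviation
print((data - data.mean()) / data.std())  # z-score of every value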
It all begins by getting data for analysis. Once we have structured and measured our data, we can move on to visualizing large amounts of data and finding patterns.
Let us begin by exploring the kinds of problems we can solve with Statistics.
Here is the workflow of the statistical techniques we use:
Using statistical techniques, we can organize, summarize, and visualize large amounts of data to find patterns that otherwise would remain hidden.
Populations and Samples
As data analysts, we will often need to use a small dataset to answer questions about a much larger dataset.
In Statistics, we call the set of ALL individuals relevant to a particular statistical question a Population.
We call a smaller group selected from a population a Sample. When we select a smaller group from a population, we are sampling.
Populations don't necessarily consist of people. A population can even be monkeys ;) or companies, stars, planets, vegetables, factory-produced equipment, etc.
The individual parts of a population go by many names like individuals, units, events, and observations.
Population and Sample dataset example:
Say we have a dataset of individuals working in IT roles at a big company like Nike. If we wanted to answer a question about all of the individuals in the company (working in IT and non-IT roles), then the dataset we have is a sample.
The dataset covering all individuals in the company is the population.
Sampling Error
A sample is, by definition, an incomplete dataset for the question we are trying to answer. For this reason, there's almost always some difference between the metrics of a population and the metrics of a sample. We see this difference as an error, and because it is the result of sampling, we call it Sampling Error.
We call a metric specific to a population a parameter and we call a metric specific to a sample a statistic.
In other words, Sampling Error = parameter - statistic
Now let's play with a real-world dataset to understand Statistics.
The dataset is about basketball players in the WNBA (Women's National Basketball Association)
Kaggle Link https://www.kaggle.com/jinxbe/wnba-player-stats-2017
Exploring the dataset:
This dataset contains all players who have played at least one game, so it is a population relative to the questions we are trying to answer.
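To see the sampling error formula from above in action, here is a minimal sketch. The filename WNBA Stats.csv and the use of the PTS column (total points scored) are assumptions about how the Kaggle file is saved and structured:

import pandas as pd

wnba = pd.read_csv('WNBA Stats.csv')  # filename is an assumption

parameter = wnba['PTS'].mean()                             # population metric
statistic = wnba['PTS'].sample(30, random_state=1).mean()  # sample metric
print(parameter - statistic)                               # sampling error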
Simple Random Sampling
In Statistical terms, we want our samples to be representative of their corresponding populations. If a sample is representative, the sampling error is low. The more representative a sample is, the smaller the sampling error; the less representative a sample is, the greater the sampling error.

Say we want to find the mean height of people in the USA... To make our samples representative, we can try to give every individual in the population an equal chance for selection in our samples. We want a very tall individual to have the same chance of being selected as a short individual. To give every individual an equal chance at selection, we need to sample randomly.
One way to perform random sampling is to generate random numbers and use them to select a few sample units from the population. In statistics, this sampling method is called simple random sampling, often abbreviated as SRS.
"
Series.sample()
method performs simple random sampling by generating an array of random numbers, and then using those numbers to select values from a Series
at the indices corresponding to those random numbers. We can also extend this method for DataFrame
objects, where we can sample random rows or columns.
When we use the random_state
parameter, like Series.sample(30, random_state = 1)
, we make the generation of random numbers predictable. This is because Series.sample()
uses a pseudorandom number generator. A pseudorandom number generator uses an initial value to generate a sequence of numbers that has properties similar to those of a sequence that is truly random. With random_state
, we specify that initial value used by the pseudorandom number generator.
If we want to generate a sequence of five numbers using a pseudorandom generator, and begin from an initial value of 1, we'll get the same five numbers no matter how many times we run the code. If we ran wnba['Games Played'].sample(5, random_state = 1)
we'd get the same sample every time we ran the code.
Pseudorandom number generators are useful in scientific research where reproducible work is necessary. In our case, pseudorandom number generators allow us to work with the same samples as you do in the exercises, which, in turn, allows for meaningful answer-checking."
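The output below appears to come from an experiment along these lines. The original code cell did not survive, so this is a hedged reconstruction continuing from the earlier sketch; the sample size of 10 and the scatter-plus-line plot are assumptions:

import matplotlib.pyplot as plt

print(wnba['PTS'].mean())  # the population mean

sample_means = []
for i in range(100):
    # a different random_state per iteration gives a different sample each time
    sample = wnba['PTS'].sample(10, random_state=i)
    sample_means.append(sample.mean())
print(sample_means)

plt.scatter(range(1, 101), sample_means)  # one point per sample
plt.axhline(wnba['PTS'].mean())           # horizontal line at the population mean
plt.show()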
Output
201.7902097902098
[145.1, 185.4, 140.4, 293.7, 172.7, 124.9, 187.8, 157.0, 188.9, 282.0, 241.5, 178.1, 157.0, 301.4, 212.9, 115.0, 135.3, 197.2, 182.5, 236.8, 145.9, 255.9, 161.2, 184.1, 213.6, 139.7, 176.3, 148.5, 118.2, 166.8, 188.3, 140.9, 182.2, 178.9, 187.1, 174.4, 126.6, 204.5, 156.0, 152.1, 193.6, 232.4, 235.1, 181.6, 230.3, 182.2, 229.4, 225.6, 203.6, 177.1, 157.4, 140.8, 147.4, 176.1, 224.5, 220.5, 132.4, 175.8, 244.3, 160.1, 244.7, 294.3, 127.1, 209.3, 173.0, 159.9, 249.9, 145.3, 144.9, 186.1, 172.9, 248.9, 137.4, 137.3, 176.0, 286.7, 258.5, 138.5, 188.5, 135.5, 178.1, 185.3, 252.0, 242.5, 253.7, 183.6, 172.7, 170.3, 148.4, 174.0, 143.9, 275.3, 152.6, 215.6, 179.8, 200.2, 177.3, 213.2, 187.2, 153.1]
The importance of Sample Size
Let's take 100 samples of increasing size and look at how the sample means behave.
Observations and conclusions:
1. The sample not being representative of the population is a problem. We can mitigate it by increasing the sample size: as we increase the sample size, the sample means vary less around the population mean, and the chances of getting an unrepresentative sample decrease (see the sketch after this list).
2. Simple random sampling is not a reliable sampling method when the sample size is small. Try to get as large a sample as possible. A large sample decreases the variability of the sampling process, which, in turn, decreases the chance that we will get an unrepresentative sample.
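A quick way to see point 1 in action is to rerun the experiment with a larger sample size and compare the spread of the resulting sample means. A minimal sketch, continuing from the earlier code:

for size in (10, 100):
    means = [wnba['PTS'].sample(size, random_state=i).mean() for i in range(100)]
    print(size, max(means) - min(means))  # the spread shrinks as the size grows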
Stratified Sampling
Because simple random sampling is entirely random, it can exclude certain population individuals who are relevant to some of our questions.
In other words, it is not guaranteed that we will have a representative sample.
Now let us perform stratified sampling with real world wnba dataset:
Proportional Stratified Sampling
Approximately 72.7 percent of the players played more than 23 games in the 2016-2017 season, which means that this category of players who played many games probably influenced the mean.
So, when we randomly sample, we might end up with a sample where only 2% of the players played more than 23 games; this leads to underestimation.
Or we might end up with a sample where 98% of the players played more than 23 games; this leads to overestimation.
This underestimation and overestimation is a problem, and it is common with small samples.
One solution to this problem is to use stratified sampling while being mindful of the proportions in the population.
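Here is one way to do that with pandas. This is a hedged reconstruction of the experiment whose output follows; the stratum boundaries and the 1 + 2 + 7 split for a sample of 10 (roughly matching the population proportions mentioned above) are assumptions:

import pandas as pd

under_12 = wnba[wnba['Games Played'] <= 12]
btw_13_22 = wnba[(wnba['Games Played'] > 12) & (wnba['Games Played'] <= 22)]
over_23 = wnba[wnba['Games Played'] > 22]

proportional_means = []
for i in range(100):
    # sample each stratum in proportion to its share of the population
    sample = pd.concat([under_12['PTS'].sample(1, random_state=i),
                        btw_13_22['PTS'].sample(2, random_state=i),
                        over_23['PTS'].sample(7, random_state=i)])
    proportional_means.append(sample.mean())

print(wnba['PTS'].mean())
print(proportional_means)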
Output
201.7902097902098
[185.9, 163.6, 176.0, 305.1, 241.1, 200.1, 272.5, 170.5, 190.5, 138.4, 165.7, 214.9, 130.0, 173.6, 195.0, 148.5, 192.6, 176.5, 267.2, 208.9, 153.6, 176.3, 206.3, 118.4, 268.3, 197.1, 154.7, 294.4, 160.9, 160.8, 203.4, 188.8, 274.9, 201.7, 275.3, 235.1, 141.2, 145.5, 222.7, 187.0, 231.3, 202.0, 230.2, 289.7, 249.1, 120.4, 222.7, 225.8, 217.9, 232.7, 176.0, 197.7, 177.6, 208.8, 144.8, 279.5, 330.3, 169.3, 123.2, 172.7, 169.7, 259.9, 191.9, 239.1, 177.3, 264.2, 151.9, 176.4, 180.5, 189.0, 227.6, 225.5, 161.5, 148.8, 208.7, 173.1, 200.1, 219.7, 260.3, 169.2, 159.0, 216.7, 204.3, 245.0, 234.4, 216.1, 196.9, 201.0, 191.8, 186.7, 202.6, 155.7, 182.5, 162.6, 192.5, 203.7, 230.2, 207.1, 157.4, 196.2]
You might not have been impressed by the results we got by sampling proportionally.
We see there is no big difference between simple random sampling and proportional stratified sampling: the sample mean is still unrepresentative, often falling far from the population mean. This poor performance is the result of choosing bad strata; stratifying the data by the number of games played was a bad idea after all.
It makes more sense to stratify the data by the number of minutes played than by the number of games played. Minutes played are a much better indicator of how much a player scored in a season.
So how do we choose the right strata?
1. Minimize the variability within each stratum
2. Maximize the variability between strata
3. The stratifying criterion should correlate strongly with the property you're trying to measure.
Minimize the variability within each stratum
Say, for example, Brindha has scored 10 points and Mani has scored 1000 points. Avoid placing Brindha and Mani in the same stratum, since the variability of points within it would be high.
If the variability within a stratum is high, it is a sign that you either need more granular stratification (more strata) or need to change the criterion of stratification (an example of a criterion is minutes played).
Maximize the variability between strata
Good strata are different from one another.
If you have strata that are similar to one another, with respect to what you want to measure, then change the stratification criterion or go for more granular stratification.
For example, when the stratification criterion was Games Played, it resulted in strata that were similar to each other regarding the distribution of total points.
So we changed the criterion to minutes played and we managed to increase the variability.
The stratifying criterion should correlate strongly with the property you're trying to measure.
The minutes played column (the criterion) and the total number of points (the property) are strongly correlated.
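The output below shows the share of players in each of five equal-width minutes-played strata, followed by the means of 100 stratified samples. Here is a minimal sketch that produces this kind of output, continuing from the earlier code; drawing 12 players per stratum is an assumption:

# correlation between the criterion (MIN) and the property we measure (PTS)
print(wnba['MIN'].corr(wnba['PTS']))

# split minutes played into five equal-width strata
wnba['MIN_strata'] = pd.cut(wnba['MIN'], bins=5)
print(wnba['MIN_strata'].value_counts(normalize=True) * 100)

stratified_means = []
for i in range(100):
    # draw the same number of players from every stratum
    sample = wnba.groupby('MIN_strata', observed=True).apply(
        lambda stratum: stratum['PTS'].sample(12, random_state=i))
    stratified_means.append(sample.mean())

print(wnba['PTS'].mean())
print(stratified_means)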
Output
(213.2, 414.4] 22.377622
(615.6, 816.8] 20.279720
(10.993, 213.2] 20.279720
(816.8, 1018.0] 19.580420
(414.4, 615.6] 17.482517
Name: MIN, dtype: float64
201.7902097902098
[217.43333333333334, 200.66666666666666, 200.3, 208.51666666666668, 195.7, 209.43333333333334, 202.6, 215.11666666666667, 201.01666666666668, 214.38333333333333, 207.75, 215.0, 201.05, 200.06666666666666, 213.4, 212.23333333333332, 195.16666666666666, 212.63333333333333, 190.0, 197.93333333333334, 209.41666666666666, 206.3, 202.73333333333332, 197.8, 205.01666666666668, 201.68333333333334, 205.26666666666668, 199.45, 202.08333333333334, 198.18333333333334, 194.01666666666668, 202.96666666666667, 196.61666666666667, 210.06666666666666, 201.86666666666667, 210.78333333333333, 195.38333333333333, 209.33333333333334, 204.36666666666667, 203.06666666666666, 210.38333333333333, 198.53333333333333, 203.21666666666667, 209.91666666666666, 212.4, 208.56666666666666, 199.51666666666668, 205.65, 200.68333333333334, 208.6, 197.06666666666666, 196.31666666666666, 210.71666666666667, 204.2, 195.55, 204.25, 198.41666666666666, 199.43333333333334, 208.46666666666667, 204.0, 203.95, 210.13333333333333, 201.81666666666666, 203.93333333333334, 198.61666666666667, 198.0, 208.5, 199.78333333333333, 204.05, 199.68333333333334, 210.56666666666666, 204.01666666666668, 210.98333333333332, 203.18333333333334, 195.78333333333333, 214.38333333333333, 200.56666666666666, 204.08333333333334, 189.55, 201.66666666666666, 202.08333333333334, 208.95, 201.51666666666668, 211.1, 204.05, 205.35, 213.68333333333334, 210.7, 204.78333333333333, 204.1, 202.08333333333334, 204.16666666666666, 216.3, 217.41666666666666, 208.48333333333332, 198.26666666666668, 195.73333333333332, 202.11666666666667, 196.61666666666667, 203.35]
Let us continue in another blog...