Thursday, July 15, 2021

Sampling in Data Science - Use Cases

Sampling use cases

Case 1:
Let's say you work for an e-commerce company that has a database table with more than 10 million rows of online transactions. The marketing team asks you to analyze the data and find categories of customers with a low buying rate, so that they can target their marketing campaigns at the right people. Instead of working with all 10 million rows at each step of your analysis, you can save a lot of running time by sampling several hundred rows and performing your analysis on the sample. You can use simple random sampling, but if you're interested in specific categories beforehand, it might be a better idea to use stratified sampling, which guarantees every category is represented in the sample.
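Here's a rough sketch of both approaches in Python with pandas; the file name and the customer_category column are made up for illustration:

import pandas as pd

# Load the transactions table; the file name and the
# `customer_category` column are hypothetical.
transactions = pd.read_csv("transactions.csv")

# Simple random sampling: 500 rows chosen uniformly at random.
# A fixed seed makes the sample reproducible.
simple_sample = transactions.sample(n=500, random_state=1)

# Stratified sampling: up to 100 rows from every customer category,
# so even small categories are represented in the sample.
stratified_sample = (
    transactions
    .groupby("customer_category", group_keys=False)
    .apply(lambda group: group.sample(n=min(len(group), 100), random_state=1))
)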

Case 2:
Let's say you need to collect data from an API that has a usage limit or isn't free. In this case, you're more or less forced to sample, and knowing how and what to sample becomes highly useful.
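Here's a minimal sketch of that idea, assuming a hypothetical paid API with a known id range and a daily request limit (the URL, record count, and limit below are all invented):

import random
import requests

# Hypothetical endpoint and record count; a real API documents both.
BASE_URL = "https://api.example.com/records/{}"
TOTAL_RECORDS = 1_000_000
DAILY_LIMIT = 500  # e.g., the API allows 500 free requests per day

# Instead of paging through every record, draw a simple random
# sample of ids and spend the request budget only on those.
random.seed(1)
sampled_ids = random.sample(range(1, TOTAL_RECORDS + 1), k=DAILY_LIMIT)

records = []
for record_id in sampled_ids:
    response = requests.get(BASE_URL.format(record_id))
    if response.ok:
        records.append(response.json())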


Case 3:
The wnba dataset we worked on was scraped from the WNBA's website, which centralizes data on basketball games and players in the WNBA. Let's suppose for a moment that such a site didn't exist, and the data were instead scattered across each individual team's website. There are twelve unique teams in our dataset, which means we'd have to scrape twelve different websites, each of which requires its own scraping script.

This scenario is quite common in the data science workflow: you want to answer some questions about a population, but the data is scattered in such a way that data collection is either time-consuming or close to impossible. For instance, let's say you want to analyze how people review and rate movies as a function of movie budget. There are many websites that can help with data collection, but how can you go about it so that you spend a day or two getting the data you need, rather than a month or two?

One way is to list all the data sources you can find, and then randomly pick only a few of them from which to collect. Then you can individually sample each of the sources you've randomly picked. We call this sampling method cluster sampling, and we call each of the individual data sources a cluster.
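Here's a minimal sketch of the two steps, using made-up per-team CSV files to stand in for the twelve scraped websites:

import random
import pandas as pd

# Hypothetical file names: one data source (cluster) per team.
clusters = ["team_{}.csv".format(n) for n in range(1, 13)]

# Step 1: randomly pick a few clusters instead of scraping all twelve.
random.seed(1)
picked_clusters = random.sample(clusters, k=4)

# Step 2: sample individually within each picked cluster.
samples = []
for source in picked_clusters:
    team_data = pd.read_csv(source)  # one team's data
    samples.append(team_data.sample(n=30, random_state=1))

cluster_sample = pd.concat(samples, ignore_index=True)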

Case 4:
Whenever the data is scattered across different locations (different websites, databases, companies, etc.), cluster sampling is a great choice.

Case 5:
You run an international company with over 50,000 employees. You've recently rolled out a company-wide change that made your employees' jobs more difficult. Now you want to determine whether this change has affected the employees negatively in any significant way. If it has, the change may backfire in the future, so it would be wise to revert it while that's still possible.

In this situation, you reach out to your data analyst and ask for her opinion. She says that she can run a survey to collect data and answer your question. Surveying all 50,000 employees would be time-consuming and expensive, so she plans to survey a sample of 100 employees instead.
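One straightforward way to pick those 100 people is a simple random sample of the employee roster. A minimal sketch, assuming employees are identified by sequential ids (an invented detail):

import random

# Hypothetical roster: one id per employee.
employee_ids = range(1, 50_001)

# Draw 100 employees uniformly at random; the fixed seed makes the
# selection reproducible if anyone asks how the group was chosen.
random.seed(1)
survey_group = random.sample(employee_ids, k=100)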



