Monday, July 19, 2021

Correlated subqueries

Correlated subqueries 

  • Uses values from the outer query to generate a result
  • Re-run for every row generated in the final data set
A simple subquery in SELECT FROM and WHERE


SELECT 
    -- Select the stage and average goals from s
    s.stage,
    ROUND(s.avg_goals,2) AS avg_goal,
    -- Select the overall average for 2012/2013
    (SELECT AVG(home_goal + away_goal) FROM match WHERE season = '2012/2013') AS overall_avg
FROM 
    -- Select the stage and average goals in 2012/2013 from match
    (SELECT
         stage,
         AVG(home_goal + away_goal) AS avg_goals
     FROM match
     WHERE season = '2012/2013'
     GROUP BY stage) AS s
WHERE 
    -- Filter the main query using the subquery
    s.avg_goals > (SELECT AVG(home_goal + away_goal) 
                    FROM match WHERE season = '2012/2013');









Both the above queries gives identical OUTPUT:







Why Correlated subqueries in the first place?


Correlated subqueries take longer to produce results as they are evaluated in loops and this significantly slows down query runtime, but they often prevent you from having to create multiple subqueries. They are useful for matching data across multiple column (for selecting data broken into multiple categories).


These queries take a while to load.


Basic Correlated Subqueries 

  •  Correlated Subqueries are subqueries that reference one or more columns in the main query.
  • They depend on the information in the main query to run and thus cannot be execute on their own.
  • They are evaluated in SQL once per row of data retrieved -- a process that takes a lot more computing power and time than a simple subquery.
Exercise:

Examine matches with scores that are extreme outliers for each country -- above 3 times the average score!

SELECT 
    -- Select country ID, date, home, and away goals from match
    main.country_id,
    date,
    main.home_goal, 
    main.away_goal
FROM match AS main
WHERE 
    -- Filter the main query by the subquery
    (home_goal + away_goal) > 
        (SELECT AVG((sub.home_goal + sub.away_goal) * 3)
         FROM match AS sub
         -- Join the main query to the subquery in WHERE
         WHERE main.country_id = sub.country_id);




Exercise:

What was the highest scoring match for each country, in each season?


SELECT 
    -- Select country ID, date, home, and away goals from match
    main.country_id,
    date,
    main.home_goal,
    main.away_goal
FROM match AS main
WHERE 
    -- Filter for matches with the highest number of goals scored
    (home_goal + away_goal) IN 
        (SELECT MAX(sub.home_goal + sub.away_goal)
         FROM match AS sub
         WHERE main.country_id = sub.country_id
               AND main.season = sub.season);



Thank you!



Friday, July 16, 2021

Subqueries everywhere! And best practices!

 

1. Format your queries

The best practice you can start early on in your SQL journey is properly formatting your queries. It's important to properly line up your SELECT, FROM, GROUP BY, and WHERE statements, and all of the information contained in them. This way, you and others you work with can return to a saved query and easily tell if these statements are part of a main query, or a subquery.

2. Annotate your queries

It's also considered best practice to annotate your queries with comments in order to tell the user what it does -- using either a multiple line comment, inside a forward slash, star, and ending with a star, and a forward slash.

3. Annotate your queries

You can also use in-line comments using two dashes. Every piece of information after an in-line comment is treated as text, even if it's a recognized SQL command.

4. Indent your queries

Additionally, make sure that you properly indent all information contained within a subquery. That way, you can easily return to the query and understand what information is being processed first, where you need to apply changes, such as to a range of dates, and what you can expect from your results if you make those changes.

5. Indent your queries

Make sure that you clearly indent all information that's part of a single column, such as a long CASE statement, or a complicated subquery in SELECT. In order to best keep track of all the conditions necessary to set up each WHEN clause, each THEN clause, and how they create the column outcome, it's important to clearly indent each piece of information in the statement. Overall, I highly recommend you read Holywell's SQL Style Guide to get a sense of all the formatting conventions when working with SQL queries.

6. Is that subquery necessary?

When deciding whether or not you need a subquery, it's important to know that each subquery you add requires additional computing power to generate your results. Depending on the size of your database and the number of records you extract in your query, you may significantly increase the amount of time it takes to run your query. So it's always worth asking whether or not a specific subquery is necessary to get the results you need.

7. Properly filter each subquery!

Finally, when constructing a main query with multiple subquery, make sure that your filters are properly placed in every subquery, and the main query, in order to generate accurate results. The query here, for example, filters for the 2013/2014 season in 3 places -- once in the SELECT subquery, once in the WHERE subquery, and once in the main query. This ensures that all data returned is only about matches from the 2013/2014 season.





Subqueries inside FROM clause

 

Subqueries inside FROM clause (returns a temporary table)

Usage:
  • Restructure and transform data into different shape
  • A subquery in FROM is an effective way of answering detailed questions that requires filtering or transforming data before including it in your final results.
  • Pre-filtering data
  • Calculating aggregates of aggregates
    • Which 3 teams has the highest average of home goals scored? (AVG, MAX)
Things to remember:

  • You can create multiple subqueries in one FROM statement
    • Alias them!
    • Join them!
  • You can join a subquery to a table in FROM
    • Include a joining columns in both tables


Subquery as a temporary table in FROM clause. We can have multiple table using a comma in FROM clause.

SELECT DISTINCT monarchs.continent, subquery.max_perc
FROM monarchs,
            (SELECT continent, MAX(women_parli_perc) AS max_perc
                FROM states
                GROUP BY continent) AS subquery
WHERE monarch.continent = subquery.continent
ORDER BY continent;

Question: Determine the number of languages spoken for each country, identified by the country's local name! 

-- Select fields
SELECT local_name, subquery.lang_num
  -- From countries
  FROM countries,
    -- Subquery (alias as subquery)
    (SELECT code, COUNT(*) AS lang_num
     FROM languages
     GROUP BY code) AS subquery
  -- Where codes match
  WHERE countries.code = subquery.code
-- Order by descending number of languages
ORDER BY lang_num DESC;
   

Question:
Generate a subquery using the match table, and then join that subquery to the country table to calculate information about matches with 10 or more goals in total!

SELECT
    -- Select country name and the count match IDs
    c.name AS country_name,
    COUNT(sub.id) AS matches
FROM country AS c
-- Inner join the subquery onto country
-- Select the country id and match id columns
INNER JOIN (SELECT country_id, id 
            FROM match
            -- Filter the subquery by matches with 10+ goals
            WHERE (home_goal + away_goal) >= 10) AS sub
ON c.id = sub.country_id
GROUP BY country_name;


In the previous exercise, you found that England, Netherlands, Germany and Spain were the only countries that had matches in the database where 10 or more goals were scored overall. Let's find out some more details about those matches -- when they were played, during which seasons, and how many of the goals were home vs. away goals.

You'll notice that in this exercise, the table alias is excluded for every column selected in the main query. This is because the main query is extracting data from the subquery, which is treated as a single table.

SELECT
    -- Select country, date, home, and away goals from the subquery
    country,
    date,
    home_goal,
    away_goal
FROM 
    -- Select country name, date, home_goal, away_goal, and total goals in the subquery
    (SELECT c.name AS country, 
            m.date, 
            m.home_goal, 
            m.away_goal,
           (m.home_goal + m.away_goal) AS total_goals
    FROM match AS m
    LEFT JOIN country AS c
    ON m.country_id = c.id) AS subq
-- Filter by total goals scored in the main query
WHERE total_goals >= 10;








Subqueries inside SELECT clause

 

Subqueries inside SELECT clause

Second most common: Inside SELECT

Usage:
  • Returns s single value
    • Include aggregate values to compare to individuals values
  • Used in mathematical calculations
    • Deviation from average
Keep things in mind:
  • Need to return a single value
  • Properly filter both the main and the subquery

Some queries can be written using either a join or a subquery

SELECT DISTINCT continent,
                (SELECT COUNT(*)
                  FROM states
                   WHERE prime_ministers.continent = states.continent) AS countries_num
FROM prime_ministers;

Examples:

Joins: To get summarized results from multiple table we used GROUP BY and INNER JOIN.

Question: Select the top nine countries in terms of number of cities (popular cities) appearing in the cities table.

SELECT countries.name AS country, COUNT(*) AS cities_num
  FROM cities
    INNER JOIN countries
    ON countries.code = cities.country_code
GROUP BY country
ORDER BY cities_num DESC, country
LIMIT 9;

/* 
SELECT ___ AS ___,
  (SELECT ___
   FROM ___
   WHERE countries.code = cities.country_code) AS cities_num
FROM ___
ORDER BY ___ ___, ___
LIMIT 9;
*/

Subquery: Convert the GROUP BY code to use a subquery inside of SELECT


/*
SELECT countries.name AS country, COUNT(*) AS cities_num
  FROM cities
    INNER JOIN countries
    ON countries.code = cities.country_code
GROUP BY country
ORDER BY cities_num DESC, country
LIMIT 9;
*/

SELECT countries.name AS country,
  -- Subquery
  (SELECT COUNT(*)
   FROM cities
   WHERE countries.code = cities.country_code) AS cities_num
FROM countries
ORDER BY cities_num DESC, country
LIMIT 9;


Question:
Construct a query that calculates the average number of goals per match in each country's league.

SELECT 
    l.name AS league,
    -- Select and round the league's total goals
    ROUND(AVG(m.home_goal + m.away_goal), 2) AS avg_goals,
    -- Select & round the average total goals for the season
    (SELECT ROUND(AVG(home_goal + away_goal), 2) 
     FROM match
     WHERE season = '2013/2014') AS overall_avg
FROM league AS l
LEFT JOIN match AS m
ON l.country_id = m.country_id
-- Filter for the 2013/2014 season
WHERE season = '2013/2014'
GROUP BY league;

In the previous exercise, you created a column to compare each league's average total goals to the overall average goals in the 2013/2014 season. In this exercise, you will add a column that directly compares these values by subtracting the overall average from the subquery.



SELECT
    -- Select the league name and average goals scored
    l.name AS league,
    ROUND(AVG(m.home_goal + m.away_goal),2) AS avg_goals,
    -- Subtract the overall average from the league average
    ROUND(AVG(m.home_goal + m.away_goal) -
        (SELECT AVG(home_goal + away_goal)
         FROM match 
         WHERE season = '2013/2014'),2) AS diff
FROM league AS l
LEFT JOIN match AS m
ON l.country_id = m.country_id
-- Only include 2013/2014 results
WHERE season = '2013/2014'
GROUP BY l.name;


Q:
Extract the average number of home and away team goals in two SELECT subqueries. Calculate the average home and away goals for the specific stage in the main query.Filter both subqueries and the main query so that only data from the 2012/2013 season is included.Group the query by the m.stage column.

ECT 
    -- Select the stage and average goals for each stage
    m.stage,
    ROUND(AVG(m.home_goal + m.away_goal),2) AS avg_goals,
    -- Select the average overall goals for the 2012/2013 season
    ROUND((SELECT AVG(home_goal + away_goal) 
           FROM match 
           WHERE season = '2012/2013'),2) AS overall
FROM match AS m
-- Filter for the 2012/2013 season
WHERE season = '2012/2013'
-- Group by stage
GROUP BY m.stage
ORDER BY m.stage;













Subqueries (Joining Data in SQL) - Subqueries inside WHERE clause

What is a subquery?

  • Query nested inside another query 
  • Useful for intermediary transformations
  • Can return a variety of information
    • Scalar quantities (3.14159, -2, 0.001)
    • A list (id = (12, 25, 392, 401, 939))
    • A table
  • Can be in any part of a query
    • WHERE, SELECT, FROM, ON statement in JOINS, GROUP BY

Why subquery?

  • Comparing groups to summarized values
    • How did Liverpool compare to the English Premier League's average performance for that year?
  • Reshaping data
    • What is the highest monthly average of goals scored in the Bundesliga?
  • Combining data that cannot be joined
    • How do you get both the home and away team nams into a table of match results?

Subqueries inside WHERE clause (returns only one column)

Most common: Inside WHERE

These are useful for filtering results based on information you'd have to calculate separately beforehand. 

Subqueries are also useful for generating a filtering list.

You can filter data based on single, scalar values using a subquery in ways you cannot by using WHERE statements or joins.



SELECT name, fert_rate
FROM states
WHERE continent = 'Asia'
        AND fert_rate <
                           (SELECT AVG(fert_rate
                                FROM states);

Question: Which countries had high average life expectancies (at the country level) in 2015.

-- Select fields
SELECT *
  -- From populations
  FROM populations
-- Where life_expectancy is greater than
WHERE life_expectancy >
  -- 1.15 * subquery
  1.15 * (SELECT AVG(life_expectancy)
   FROM populations
   WHERE year = 2015) AND
  year = 2015;

Question: Get the urban area population for only capital cities.

-- Select fields
SELECT c1.name, c1.country_code, urbanarea_pop
  -- From cities
  FROM cities c1
-- Where city name in the field of capital cities
WHERE c1.name IN
  -- Subquery
  (SELECT capital
   FROM countries)
ORDER BY urbanarea_pop DESC;
                

Question: 

Generate a list of matches where the total goals scored (for both teams in total) is more than 3 times the average for games in the matches_2013_2014 table, which includes all games played in the 2013/2014 season.

SELECT 
    -- Select the date, home goals, and away goals scored
    date,
    home_goal,
    away_goal
FROM  matches_2013_2014
-- Filter for matches where total goals exceeds 3x the average
WHERE (home_goal + away_goal) > 
       (SELECT 3 * AVG(home_goal + away_goal)
        FROM matches_2013_2014);


Question: 

Generate a list of teams that never played a game in their home city. Using a subquery, generate a list of unique hometeam_ID values from the unfiltered match table to exclude in the team table's team_api_ID column.

In addition to filtering using a single-value (scalar) subquery, you can create a list of values in a subquery to filter data based on a complex set of conditions. This type of subquery generates a one column reference list for the main query. As long as the values in your list match a column in your main query's table, you don't need to use a join -- even if the list is from a separate table.

SELECT 
    -- Select the team long and short names
    team_long_name,
    team_short_name
FROM team 
-- Exclude all values from the subquery
WHERE team_api_id NOT IN
     (SELECT DISTINCT hometeam_ID  FROM match);

Question:

Create a list of teams that scored 8 or more goals in a home match.

SELECT
    -- Select the team long and short names
    team_long_name,
    team_short_name
FROM team
-- Filter for teams with 8 or more home goals
WHERE team_api_id IN
      (SELECT hometeam_id
       FROM match
       WHERE home_goal >= 8);

Question 

Generate a subquery using the match table, and then join that subquery to the country table to calculate information about matches with 10 or more goals in total!

SELECT
    -- Select country name and the count match IDs
    c.name AS country_name,
    COUNT(sub.id) AS matches
FROM country AS c
-- Inner join the subquery onto country
-- Select the country id and match id columns
INNER JOIN (SELECT country_id, id 
            FROM match
            -- Filter the subquery by matches with 10+ goals
            WHERE (home_goal + away_goal) >= 10) AS sub
ON c.id = sub.country_id
GROUP BY country_name;

SELECT
    -- Select country, date, home, and away goals from the subquery
    country,
    date,
    home_goal,
    away_goal
FROM 
    -- Select country name, date, home_goal, away_goal, and total goals in the subquery
    (SELECT c.name AS country, 
            m.date, 
            m.home_goal, 
            m.away_goal,
           (m.home_goal + m.away_goal) AS total_goals
    FROM match AS m
    LEFT JOIN country AS c
    ON m.country_id = c.id) AS subq
-- Filter by total goals scored in the main query
WHERE total_goals >= 10;

Subqueries inside FROM clause (returns a temporary table)

Usage:
  • Restructure and transform data into different shape
  • A subquery in FROM is an effective way of answering detailed questions that requires filtering or transforming data before including it in your final results.
  • Pre-filtering data
  • Calculating aggregates of aggregates
    • Which 3 teams has the highest average of home goals scored? (AVG, MAX)
Things to remember:

  • You can create multiple subqueries in one FROM statement
    • Alias them!
    • Join them!
  • You can join a subquery to a table in FROM
    • Include a joining columns in both tables


Subquery as a temporary table in FROM clause. We can have multiple table using a comma in FROM clause.

SELECT DISTINCT monarchs.continent, subquery.max_perc
FROM monarchs,
            (SELECT continent, MAX(women_parli_perc) AS max_perc
                FROM states
                GROUP BY continent) AS subquery
WHERE monarch.continent = subquery.continent
ORDER BY continent;

Question: Determine the number of languages spoken for each country, identified by the country's local name! 

-- Select fields
SELECT local_name, subquery.lang_num
  -- From countries
  FROM countries,
    -- Subquery (alias as subquery)
    (SELECT code, COUNT(*) AS lang_num
     FROM languages
     GROUP BY code) AS subquery
  -- Where codes match
  WHERE countries.code = subquery.code
-- Order by descending number of languages
ORDER BY lang_num DESC;
   

Advanced subquery

Question: Identify which country had the maximum inflation rate, and how high it was, using multiple subqueries. (Note: Again, there are multiple ways to get to this solution using only joins, but the focus here is on showing you an introduction into advanced subqueries.)


-- Select fields
SELECT name, continent, inflation_rate
  -- From countries
  FROM countries
  -- Join to economies
  INNER JOIN economies
  -- Match on code
  ON countries.code = economies.code
  -- Where year is 2015
  WHERE year = 2015
    -- And inflation rate in subquery (alias as subquery)
    AND inflation_rate IN (
        SELECT MAX(inflation_rate) AS max_inf
        FROM (
             SELECT name, continent, inflation_rate
             FROM countries
             INNER JOIN economies
             ON countries.code = economies.code
             WHERE year = 2015) AS subquery
      -- Group by continent
        GROUP BY continent);

Question:

Use a subquery to get 2015 economic data for countries that do not have 

  • gov_form of 'Constitutional Monarchy' or 
  • 'Republic' in their gov_form.

Here, gov_form stands for the form of the government for each country. 

-- Select fields
SELECT code, inflation_rate, unemployment_rate
  -- From economies
  FROM economies
  -- Where year is 2015 and code is not in
  WHERE year = 2015 AND code NOT IN
    -- Subquery
    (SELECT code
     FROM countries
     WHERE (gov_form = 'Constitutional Monarchy' OR gov_form LIKE '%Republic%'))
-- Order by inflation rate
ORDER BY inflation_rate;


Challenge 1:

Get the country names and other 2015 data in the economiestable and the countries table for Central American countries with an official language.

-- Select fields
SELECT DISTINCT c.name, total_investment, imports
  -- From table (with alias)
  FROM countries AS c
    -- Join with table (with alias)
    LEFT JOIN economies AS e
      -- Match on code
      ON (c.code = e.code
      -- and code in Subquery
        AND c.code IN (
          SELECT l.code
          FROM languages AS l
          WHERE official = 'true'
        ) )
  -- Where region and year are correct
  WHERE region = 'Central America' AND year = 2015
-- Order by field
ORDER BY c.name;


As above, a subquery can be used inside of ON statement of join.

Challenge 2:

Calculate the average fertility rate for each region in 2015.

-- Select fields
SELECT region, continent, AVG(fertility_rate) AS avg_fert_rate
  -- From left table
  FROM countries AS c
    -- Join to right table
    INNER JOIN populations AS p
      -- Match on join condition
      ON c.code = p.country_code
  -- Where specific records matching some condition
  WHERE year = 2015
-- Group appropriately
GROUP BY continent, region
-- Order appropriately
ORDER BY avg_fert_rate;


Challenge 3:

Determining the top 10 capital cities in Europe and the Americas in terms of a calculated percentage using city_proper_pop and metroarea_pop in cities.

Do not use table aliasing in this exercise.

-- Select fields
SELECT name, country_code, city_proper_pop, metroarea_pop,  
      -- Calculate city_perc
      city_proper_pop / metroarea_pop * 100 AS city_perc
  -- From appropriate table
  FROM cities
  -- Where 
  WHERE name IN
    -- Subquery
    (SELECT capital
     FROM countries
     WHERE (continent = 'Europe'
        OR continent LIKE '%America'))
       AND metroarea_pop IS NOT NULL
-- Order appropriately
ORDER BY city_perc DESC
-- Limit amount
LIMIT 10;


Subqueries inside SELECT clause

Second most common: Inside SELECT

Usage:
  • Returns s single value
    • Include aggregate values to compare to individuals values
  • Used in mathematical calculations
    • Deviation from average
Keep things in mind:
  • Need to return a single value
  • Properly filter both the main and the subquery

Some queries can be written using either a join or a subquery

SELECT DISTINCT continent,
                (SELECT COUNT(*)
                  FROM states
                   WHERE prime_ministers.continent = states.continent) AS countries_num
FROM prime_ministers;

Examples:

Joins: To get summarized results from multiple table we used GROUP BY and INNER JOIN.

Question: Select the top nine countries in terms of number of cities (popular cities) appearing in the cities table.

SELECT countries.name AS country, COUNT(*) AS cities_num
  FROM cities
    INNER JOIN countries
    ON countries.code = cities.country_code
GROUP BY country
ORDER BY cities_num DESC, country
LIMIT 9;

/* 
SELECT ___ AS ___,
  (SELECT ___
   FROM ___
   WHERE countries.code = cities.country_code) AS cities_num
FROM ___
ORDER BY ___ ___, ___
LIMIT 9;
*/

Subquery: Convert the GROUP BY code to use a subquery inside of SELECT


/*
SELECT countries.name AS country, COUNT(*) AS cities_num
  FROM cities
    INNER JOIN countries
    ON countries.code = cities.country_code
GROUP BY country
ORDER BY cities_num DESC, country
LIMIT 9;
*/

SELECT countries.name AS country,
  -- Subquery
  (SELECT COUNT(*)
   FROM cities
   WHERE countries.code = cities.country_code) AS cities_num
FROM countries
ORDER BY cities_num DESC, country
LIMIT 9;