Friday, July 16, 2021

Subqueries (Joining Data in SQL) - Subqueries inside WHERE clause

What is a subquery?

  • Query nested inside another query 
  • Useful for intermediary transformations
  • Can return a variety of information
    • Scalar quantities (3.14159, -2, 0.001)
    • A list (id = (12, 25, 392, 401, 939))
    • A table
  • Can be in any part of a query
    • WHERE, SELECT, FROM, ON statement in JOINS, GROUP BY

Why subquery?

  • Comparing groups to summarized values
    • How did Liverpool compare to the English Premier League's average performance for that year?
  • Reshaping data
    • What is the highest monthly average of goals scored in the Bundesliga?
  • Combining data that cannot be joined
    • How do you get both the home and away team nams into a table of match results?

Subqueries inside WHERE clause (returns only one column)

Most common: Inside WHERE

These are useful for filtering results based on information you'd have to calculate separately beforehand. 

Subqueries are also useful for generating a filtering list.

You can filter data based on single, scalar values using a subquery in ways you cannot by using WHERE statements or joins.



SELECT name, fert_rate
FROM states
WHERE continent = 'Asia'
        AND fert_rate <
                           (SELECT AVG(fert_rate
                                FROM states);

Question: Which countries had high average life expectancies (at the country level) in 2015.

-- Select fields
SELECT *
  -- From populations
  FROM populations
-- Where life_expectancy is greater than
WHERE life_expectancy >
  -- 1.15 * subquery
  1.15 * (SELECT AVG(life_expectancy)
   FROM populations
   WHERE year = 2015) AND
  year = 2015;

Question: Get the urban area population for only capital cities.

-- Select fields
SELECT c1.name, c1.country_code, urbanarea_pop
  -- From cities
  FROM cities c1
-- Where city name in the field of capital cities
WHERE c1.name IN
  -- Subquery
  (SELECT capital
   FROM countries)
ORDER BY urbanarea_pop DESC;
                

Question: 

Generate a list of matches where the total goals scored (for both teams in total) is more than 3 times the average for games in the matches_2013_2014 table, which includes all games played in the 2013/2014 season.

SELECT 
    -- Select the date, home goals, and away goals scored
    date,
    home_goal,
    away_goal
FROM  matches_2013_2014
-- Filter for matches where total goals exceeds 3x the average
WHERE (home_goal + away_goal) > 
       (SELECT 3 * AVG(home_goal + away_goal)
        FROM matches_2013_2014);


Question: 

Generate a list of teams that never played a game in their home city. Using a subquery, generate a list of unique hometeam_ID values from the unfiltered match table to exclude in the team table's team_api_ID column.

In addition to filtering using a single-value (scalar) subquery, you can create a list of values in a subquery to filter data based on a complex set of conditions. This type of subquery generates a one column reference list for the main query. As long as the values in your list match a column in your main query's table, you don't need to use a join -- even if the list is from a separate table.

SELECT 
    -- Select the team long and short names
    team_long_name,
    team_short_name
FROM team 
-- Exclude all values from the subquery
WHERE team_api_id NOT IN
     (SELECT DISTINCT hometeam_ID  FROM match);

Question:

Create a list of teams that scored 8 or more goals in a home match.

SELECT
    -- Select the team long and short names
    team_long_name,
    team_short_name
FROM team
-- Filter for teams with 8 or more home goals
WHERE team_api_id IN
      (SELECT hometeam_id
       FROM match
       WHERE home_goal >= 8);

Question 

Generate a subquery using the match table, and then join that subquery to the country table to calculate information about matches with 10 or more goals in total!

SELECT
    -- Select country name and the count match IDs
    c.name AS country_name,
    COUNT(sub.id) AS matches
FROM country AS c
-- Inner join the subquery onto country
-- Select the country id and match id columns
INNER JOIN (SELECT country_id, id 
            FROM match
            -- Filter the subquery by matches with 10+ goals
            WHERE (home_goal + away_goal) >= 10) AS sub
ON c.id = sub.country_id
GROUP BY country_name;

SELECT
    -- Select country, date, home, and away goals from the subquery
    country,
    date,
    home_goal,
    away_goal
FROM 
    -- Select country name, date, home_goal, away_goal, and total goals in the subquery
    (SELECT c.name AS country, 
            m.date, 
            m.home_goal, 
            m.away_goal,
           (m.home_goal + m.away_goal) AS total_goals
    FROM match AS m
    LEFT JOIN country AS c
    ON m.country_id = c.id) AS subq
-- Filter by total goals scored in the main query
WHERE total_goals >= 10;

Subqueries inside FROM clause (returns a temporary table)

Usage:
  • Restructure and transform data into different shape
  • A subquery in FROM is an effective way of answering detailed questions that requires filtering or transforming data before including it in your final results.
  • Pre-filtering data
  • Calculating aggregates of aggregates
    • Which 3 teams has the highest average of home goals scored? (AVG, MAX)
Things to remember:

  • You can create multiple subqueries in one FROM statement
    • Alias them!
    • Join them!
  • You can join a subquery to a table in FROM
    • Include a joining columns in both tables


Subquery as a temporary table in FROM clause. We can have multiple table using a comma in FROM clause.

SELECT DISTINCT monarchs.continent, subquery.max_perc
FROM monarchs,
            (SELECT continent, MAX(women_parli_perc) AS max_perc
                FROM states
                GROUP BY continent) AS subquery
WHERE monarch.continent = subquery.continent
ORDER BY continent;

Question: Determine the number of languages spoken for each country, identified by the country's local name! 

-- Select fields
SELECT local_name, subquery.lang_num
  -- From countries
  FROM countries,
    -- Subquery (alias as subquery)
    (SELECT code, COUNT(*) AS lang_num
     FROM languages
     GROUP BY code) AS subquery
  -- Where codes match
  WHERE countries.code = subquery.code
-- Order by descending number of languages
ORDER BY lang_num DESC;
   

Advanced subquery

Question: Identify which country had the maximum inflation rate, and how high it was, using multiple subqueries. (Note: Again, there are multiple ways to get to this solution using only joins, but the focus here is on showing you an introduction into advanced subqueries.)


-- Select fields
SELECT name, continent, inflation_rate
  -- From countries
  FROM countries
  -- Join to economies
  INNER JOIN economies
  -- Match on code
  ON countries.code = economies.code
  -- Where year is 2015
  WHERE year = 2015
    -- And inflation rate in subquery (alias as subquery)
    AND inflation_rate IN (
        SELECT MAX(inflation_rate) AS max_inf
        FROM (
             SELECT name, continent, inflation_rate
             FROM countries
             INNER JOIN economies
             ON countries.code = economies.code
             WHERE year = 2015) AS subquery
      -- Group by continent
        GROUP BY continent);

Question:

Use a subquery to get 2015 economic data for countries that do not have 

  • gov_form of 'Constitutional Monarchy' or 
  • 'Republic' in their gov_form.

Here, gov_form stands for the form of the government for each country. 

-- Select fields
SELECT code, inflation_rate, unemployment_rate
  -- From economies
  FROM economies
  -- Where year is 2015 and code is not in
  WHERE year = 2015 AND code NOT IN
    -- Subquery
    (SELECT code
     FROM countries
     WHERE (gov_form = 'Constitutional Monarchy' OR gov_form LIKE '%Republic%'))
-- Order by inflation rate
ORDER BY inflation_rate;


Challenge 1:

Get the country names and other 2015 data in the economiestable and the countries table for Central American countries with an official language.

-- Select fields
SELECT DISTINCT c.name, total_investment, imports
  -- From table (with alias)
  FROM countries AS c
    -- Join with table (with alias)
    LEFT JOIN economies AS e
      -- Match on code
      ON (c.code = e.code
      -- and code in Subquery
        AND c.code IN (
          SELECT l.code
          FROM languages AS l
          WHERE official = 'true'
        ) )
  -- Where region and year are correct
  WHERE region = 'Central America' AND year = 2015
-- Order by field
ORDER BY c.name;


As above, a subquery can be used inside of ON statement of join.

Challenge 2:

Calculate the average fertility rate for each region in 2015.

-- Select fields
SELECT region, continent, AVG(fertility_rate) AS avg_fert_rate
  -- From left table
  FROM countries AS c
    -- Join to right table
    INNER JOIN populations AS p
      -- Match on join condition
      ON c.code = p.country_code
  -- Where specific records matching some condition
  WHERE year = 2015
-- Group appropriately
GROUP BY continent, region
-- Order appropriately
ORDER BY avg_fert_rate;


Challenge 3:

Determining the top 10 capital cities in Europe and the Americas in terms of a calculated percentage using city_proper_pop and metroarea_pop in cities.

Do not use table aliasing in this exercise.

-- Select fields
SELECT name, country_code, city_proper_pop, metroarea_pop,  
      -- Calculate city_perc
      city_proper_pop / metroarea_pop * 100 AS city_perc
  -- From appropriate table
  FROM cities
  -- Where 
  WHERE name IN
    -- Subquery
    (SELECT capital
     FROM countries
     WHERE (continent = 'Europe'
        OR continent LIKE '%America'))
       AND metroarea_pop IS NOT NULL
-- Order appropriately
ORDER BY city_perc DESC
-- Limit amount
LIMIT 10;


Subqueries inside SELECT clause

Second most common: Inside SELECT

Usage:
  • Returns s single value
    • Include aggregate values to compare to individuals values
  • Used in mathematical calculations
    • Deviation from average
Keep things in mind:
  • Need to return a single value
  • Properly filter both the main and the subquery

Some queries can be written using either a join or a subquery

SELECT DISTINCT continent,
                (SELECT COUNT(*)
                  FROM states
                   WHERE prime_ministers.continent = states.continent) AS countries_num
FROM prime_ministers;

Examples:

Joins: To get summarized results from multiple table we used GROUP BY and INNER JOIN.

Question: Select the top nine countries in terms of number of cities (popular cities) appearing in the cities table.

SELECT countries.name AS country, COUNT(*) AS cities_num
  FROM cities
    INNER JOIN countries
    ON countries.code = cities.country_code
GROUP BY country
ORDER BY cities_num DESC, country
LIMIT 9;

/* 
SELECT ___ AS ___,
  (SELECT ___
   FROM ___
   WHERE countries.code = cities.country_code) AS cities_num
FROM ___
ORDER BY ___ ___, ___
LIMIT 9;
*/

Subquery: Convert the GROUP BY code to use a subquery inside of SELECT


/*
SELECT countries.name AS country, COUNT(*) AS cities_num
  FROM cities
    INNER JOIN countries
    ON countries.code = cities.country_code
GROUP BY country
ORDER BY cities_num DESC, country
LIMIT 9;
*/

SELECT countries.name AS country,
  -- Subquery
  (SELECT COUNT(*)
   FROM cities
   WHERE countries.code = cities.country_code) AS cities_num
FROM countries
ORDER BY cities_num DESC, country
LIMIT 9;
























No comments: