10 Advanced SQL Concepts Every Data Scientist Should Master
This guide walks through ten essential advanced SQL concepts—including CTEs, recursive queries, temporary functions, CASE‑based pivoting, EXCEPT vs NOT IN, self‑joins, ranking functions, delta calculations, cumulative totals, and date‑time manipulation—providing clear explanations and runnable examples to help data‑science professionals ace interview challenges.
1. Common Table Expressions (CTEs)
CTEs let you define a temporary result set that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement, making complex queries easier to read and maintain.
SELECT
name,
salary
FROM
People
WHERE
NAME IN (SELECT DISTINCT NAME FROM population WHERE country = "Canada" AND city = "Toronto")
AND salary >= (
SELECT AVG(salary)
FROM salaries
WHERE gender = "Female")Using a CTE to break the same logic into reusable parts:
WITH toronto_ppl AS (
SELECT DISTINCT name
FROM population
WHERE country = "Canada" AND city = "Toronto"
), avg_female_salary AS (
SELECT AVG(salary) AS avgSalary
FROM salaries
WHERE gender = "Female"
)
SELECT name, salary
FROM People
WHERE name IN (SELECT name FROM toronto_ppl)
AND salary >= (SELECT avgSalary FROM avg_female_salary);2. Recursive CTEs
Recursive CTEs reference themselves, similar to recursive functions in programming languages, and are ideal for traversing hierarchical data such as organizational charts or graph structures.
Anchor member – returns the base rows.
Recursive member – repeatedly joins the CTE to itself.
Termination condition – stops the recursion.
WITH org_structure AS (
SELECT id, manager_id
FROM staff_members
WHERE manager_id IS NULL
UNION ALL
SELECT sm.id, sm.manager_id
FROM staff_members sm
INNER JOIN org_structure os ON os.id = sm.manager_id
)3. Temporary Functions
Temporary functions let you encapsulate reusable logic inside a query, improving readability and avoiding repetition.
Break code into smaller, maintainable blocks.
Promote clean, DRY SQL.
Reuse logic similarly to functions in programming languages.
SELECT name,
CASE WHEN tenure < 1 THEN "analyst"
WHEN tenure BETWEEN 1 AND 3 THEN "associate"
WHEN tenure BETWEEN 3 AND 5 THEN "senior"
WHEN tenure > 5 THEN "vp"
ELSE "n/a"
END AS seniority
FROM employees;Using a temporary function to achieve the same result:
CREATE TEMPORARY FUNCTION get_seniority(tenure INT64) AS (
CASE WHEN tenure < 1 THEN "analyst"
WHEN tenure BETWEEN 1 AND 3 THEN "associate"
WHEN tenure BETWEEN 3 AND 5 THEN "senior"
WHEN tenure > 5 THEN "vp"
ELSE "n/a"
END
);
SELECT name, get_seniority(tenure) AS seniority FROM employees;4. CASE WHEN for Pivoting Data
CASE WHEN can be used to transform rows into columns, effectively pivoting data without a dedicated PIVOT clause.
Initial table:
+----+----------+-------+
| id | revenue | month |
+----+----------+-------+
| 1 | 8000 | Jan |
| 2 | 9000 | Jan |
| 3 | 10000 | Feb |
| 1 | 7000 | Feb |
| 1 | 6000 | Mar |
+----+----------+-------+
Result table:
+----+------------+------------+------------+-----+------------+
| id | Jan_Revenue| Feb_Revenue| Mar_Revenue| ... | Dec_Revenue|
+----+------------+------------+------------+-----+------------+
| 1 | 8000 | 7000 | 6000 | ... | null |
| 2 | 9000 | null | null | ... | null |
| 3 | null | 10000 | null | ... | null |
+----+------------+------------+------------+-----+------------+5. EXCEPT vs NOT IN
Both operators compare two result sets, but they differ subtly: EXCEPT removes duplicate rows and returns rows present in the first query but not in the second, while NOT IN checks for non‑membership on a column‑by‑column basis and can behave differently with NULL values.
6. Self‑Join
A self‑join joins a table to itself, useful when hierarchical relationships are stored in a single table.
+----+-------+--------+-----------+
| Id | Name | Salary | ManagerId |
+----+-------+--------+-----------+
| 1 | Joe | 70000 | 3 |
| 2 | Henry | 80000 | 4 |
| 3 | Sam | 60000 | NULL |
| 4 | Max | 90000 | NULL |
+----+-------+--------+-----------+
SELECT a.Name AS Employee
FROM Employee a
JOIN Employee b ON a.ManagerID = b.Id
WHERE a.Salary > b.Salary;7. Rank vs Dense_Rank vs Row_Number
These window functions assign ranking numbers to rows based on an ordering criterion, each handling ties differently.
SELECT Name,
GPA,
ROW_NUMBER() OVER (ORDER BY GPA DESC) AS row_num,
RANK() OVER (ORDER BY GPA DESC) AS rank,
DENSE_RANK() OVER (ORDER BY GPA DESC) AS dense_rank
FROM student_grades;8. Calculating Delta Values
Use LAG or LEAD to compare a row with a previous or next row, enabling calculations such as month‑over‑month changes.
# Comparing each month's sales to last month
SELECT month,
sales,
sales - LAG(sales, 1) OVER (ORDER BY month) AS delta_month
FROM monthly_sales;
# Comparing each month's sales to the same month last year
SELECT month,
sales,
sales - LAG(sales, 12) OVER (ORDER BY month) AS delta_year
FROM monthly_sales;9. Computing Cumulative Totals
The SUM() window function with an ORDER BY clause produces a running total.
SELECT Month,
Revenue,
SUM(Revenue) OVER (ORDER BY Month) AS Cumulative
FROM monthly_revenue;10. Date‑Time Manipulation
Common functions for handling dates include EXTRACT, DATE_ADD, DATE_SUB, and DATE_TRUNC. They allow grouping, formatting, and calculating differences between dates.
Extract components (year, month, day).
Adjust dates with DATE_ADD / DATE_SUB.
Truncate to a specific granularity with DATE_TRUNC.
Example: Find IDs of days where the temperature was higher than the previous day.
SELECT a.Id
FROM Weather a, Weather b
WHERE a.Temperature > b.Temperature
AND DATEDIFF(a.RecordDate, b.RecordDate) = 1;Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
