SQL Interview Questions & Answers

FAANG Interview Preparation

Master SQL for technical interviews — from basic queries to advanced window functions, optimization, and query writing challenges used at FAANG companies.

15 Questions — 4 Easy · 8 Medium · 3 Hard
Topics: Query Writing (7) · Theory (5) · Concepts (2) · Optimization (1)
Q1 Easy Theory

What are the different types of SQL JOINs? Explain with examples

SQL JOINs combine rows from two or more tables based on a related column. INNER JOIN returns only matching rows from both tables — the intersection. LEFT JOIN returns all rows from the left table and matching rows from the right; unmatched right rows show NULL. RIGHT JOIN is the mirror: all right rows, matching left. FULL OUTER JOIN returns all rows from both tables, with NULLs where no match exists. CROSS JOIN produces the Cartesian product — every row from table A paired with every row from table B. SELF JOIN is joining a table to itself, useful for hierarchical data like employees and their managers.

In practice, INNER JOIN is most common for combining related data. LEFT JOIN is essential when you need to preserve all records from the primary table (e.g., all customers including those with no orders). FULL OUTER reveals gaps — rows that exist in one table but not the other. CROSS JOIN is rarely used in production but appears in interview puzzles. SELF JOIN solves "find employees earning more than their manager" by aliasing the same table twice.

SQL
-- Schema: employees(id, name, manager_id, salary), departments(id, name)

-- INNER JOIN: only employees with a department
SELECT e.name, d.name AS dept
FROM employees e
INNER JOIN departments d ON e.dept_id = d.id;

-- LEFT JOIN: all employees, NULL if no department
SELECT e.name, d.name AS dept
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;

-- FULL OUTER: all from both, NULLs where no match
SELECT e.name, d.name AS dept
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.id;

-- CROSS JOIN: every employee with every department
SELECT e.name, d.name FROM employees e CROSS JOIN departments d;

-- SELF JOIN: employee and their manager
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

Key Takeaways

  • INNER = intersection; LEFT/RIGHT = preserve one side; FULL = union with NULLs; CROSS = Cartesian product
  • SELF JOIN uses table aliases to treat the same table as two different entities
  • Always specify the JOIN condition — omitting it (or using comma-separated tables without a WHERE) silently produces a Cartesian product
Q2 Easy Theory

Explain the difference between WHERE and HAVING clauses

WHERE filters rows before aggregation; HAVING filters groups after aggregation. This order matters: WHERE is applied first, reducing the dataset. Then GROUP BY creates groups. Finally, HAVING filters those groups based on aggregate values like COUNT, SUM, or AVG.

You cannot use aggregate functions in WHERE because aggregation has not happened yet. For example, "find departments with more than 5 employees" requires HAVING COUNT(*) > 5. Conversely, "find average salary of employees in departments with more than 5 employees" needs both: WHERE could filter individual rows first (e.g., only active employees), then HAVING filters the grouped result.

A common pattern: WHERE dept_id = 10 filters to one department before GROUP BY and HAVING apply. HAVING dept_id = 10 happens to work when dept_id appears in the GROUP BY (it is then constant per group), but it filters later and obscures intent. Best practice: use WHERE for row-level filters, HAVING for group-level conditions.

SQL
-- Schema: employees(id, name, dept_id, salary)

-- WHERE: filter rows BEFORE grouping
SELECT dept_id, AVG(salary) AS avg_sal
FROM employees
WHERE salary > 50000  -- only employees earning > 50k
GROUP BY dept_id;

-- HAVING: filter groups AFTER grouping
SELECT dept_id, COUNT(*) AS emp_count, AVG(salary) AS avg_sal
FROM employees
GROUP BY dept_id
HAVING COUNT(*) > 5;  -- only depts with more than 5 employees

-- Both together: filter rows first, then filter groups
SELECT dept_id, AVG(salary) AS avg_sal
FROM employees
WHERE hire_date >= '2020-01-01'  -- only recent hires
GROUP BY dept_id
HAVING AVG(salary) > 80000;  -- depts where avg salary of recent hires > 80k

Key Takeaways

  • WHERE filters rows before GROUP BY; HAVING filters groups after
  • Aggregate functions (COUNT, SUM, AVG) can be used for filtering only in HAVING, never in WHERE
  • You can use both in the same query for row-level and group-level filtering
Q3 Easy Theory

What are ACID properties in database transactions?

ACID ensures reliable transaction processing. Atomicity means all-or-nothing: either every operation in a transaction commits, or none do. If a transfer deducts $100 from A but fails before crediting B, the entire transaction rolls back. Consistency means the database moves from one valid state to another, respecting constraints (foreign keys, unique, check). A transfer cannot leave total money changed.

Isolation means concurrent transactions do not see each other's uncommitted changes. Without it, one transaction might read another's dirty data. Levels range from READ UNCOMMITTED (dirty reads) to SERIALIZABLE (strictest). Durability means once committed, data survives crashes. Writes are persisted to disk; power loss does not lose committed data.

In practice, databases implement these via write-ahead logging, locks, and MVCC. Interviewers expect you to explain each property with a concrete scenario: "If the server crashes mid-transfer, Atomicity and Durability ensure we don't lose or corrupt money."

SQL
-- ACID in action: transfer between accounts
BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

-- If both succeed: COMMIT (Atomicity + Durability)
-- If either fails: ROLLBACK (Atomicity)
COMMIT;

-- Isolation: another session won't see uncommitted changes
-- Consistency: CHECK constraints ensure balance >= 0
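
The isolation levels mentioned above can be set explicitly per transaction. A sketch in PostgreSQL syntax (the level names are standard SQL; support and defaults vary by engine):

SQL
-- Choose an isolation level for the current transaction
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
-- READ COMMITTED (PostgreSQL's default) is weaker: it permits
-- non-repeatable reads between statements in the same transaction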

Key Takeaways

  • Atomicity = all or nothing; Consistency = valid state transitions; Isolation = concurrent transactions invisible; Durability = committed data survives crashes
  • Real scenario: bank transfer — Atomicity prevents partial updates, Durability prevents loss after commit
  • Isolation levels trade off consistency for performance (READ COMMITTED vs SERIALIZABLE)
Q4 Medium Theory

Explain database normalization — 1NF through BCNF

Normalization reduces redundancy and update anomalies. 1NF requires atomic values — no repeating groups or arrays in a single cell. A column like "phones" storing "555-1234, 555-5678" violates 1NF; split into separate rows or a related table.

2NF requires 1NF plus no partial dependencies: every non-key attribute must depend on the entire primary key. If the key is (order_id, product_id), storing customer_name with only order_id creates partial dependency — customer_name depends only on order_id. Move customer_name to an orders table.

3NF requires 2NF plus no transitive dependencies: non-key attributes must not depend on other non-keys. If we have (emp_id, dept_id, dept_name), dept_name depends on dept_id, not emp_id. Move dept_name to a departments table.

BCNF (Boyce-Codd) strengthens 3NF: every determinant must be a candidate key. If a non-key determines another attribute, split the table. In practice, 3NF is often sufficient; BCNF handles edge cases with overlapping candidate keys.

SQL
-- 1NF violation: phones as comma-separated list
-- BAD: employees(id, name, phones)
-- GOOD: employees(id, name), employee_phones(emp_id, phone)

-- 2NF violation: order_details(order_id, product_id, qty, customer_name)
-- customer_name depends only on order_id
-- FIX: orders(order_id, customer_name), order_details(order_id, product_id, qty)

-- 3NF violation: employees(emp_id, dept_id, dept_name)
-- dept_name depends on dept_id
-- FIX: employees(emp_id, dept_id), departments(dept_id, dept_name)

-- Normalized schema example
CREATE TABLE departments (id INT PRIMARY KEY, name TEXT);
CREATE TABLE employees (id INT PRIMARY KEY, name TEXT, dept_id INT REFERENCES departments(id));
CREATE TABLE employee_phones (emp_id INT, phone TEXT, PRIMARY KEY (emp_id, phone));
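
Following the comment style above, a sketch of the BCNF case using a hypothetical enrollments table where each instructor teaches exactly one course (instructor → course), yet instructor is not a candidate key:

SQL
-- BCNF violation: enrollments(student_id, course, instructor)
-- instructor determines course, but instructor is not a candidate key
-- FIX: split so every determinant is a key
-- teaches(instructor PRIMARY KEY, course), enrollments(student_id, instructor)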

Key Takeaways

  • 1NF = atomic values; 2NF = no partial key dependencies; 3NF = no transitive dependencies; BCNF = every determinant is a candidate key
  • Denormalization is sometimes used for read-heavy workloads — trade redundancy for query speed
  • Know when to stop: over-normalization can hurt performance with excessive JOINs
Q5 Medium Theory

What are indexes and how do they improve query performance?

Indexes are data structures that speed up lookups by avoiding full table scans. A B-tree index (most common) stores column values in a sorted tree structure, allowing O(log n) lookups instead of scanning every row. Hash indexes provide O(1) lookups for equality checks but do not support range queries or ordering.

Clustered indexes (e.g., primary key in many DBs) store table data in index order — the table is physically sorted. There is typically one per table. Non-clustered indexes store pointers to the actual rows; you can have many. A query filtering on an indexed column uses the index to jump directly to matching rows.

Indexes are not free: they consume storage and slow down INSERTs, UPDATEs, and DELETEs because the index must be maintained. Avoid indexing low-cardinality columns (e.g., boolean), columns rarely used in WHERE/JOIN, or very small tables where a full scan is cheaper. Composite indexes help multi-column filters; column order matters for left-prefix matching.

SQL
-- Create indexes for common query patterns
CREATE INDEX idx_employees_dept ON employees(dept_id);
CREATE INDEX idx_employees_salary ON employees(salary);
CREATE INDEX idx_orders_user_date ON orders(user_id, order_date);  -- composite

-- Check if index is used (PostgreSQL)
EXPLAIN ANALYZE
SELECT * FROM employees WHERE dept_id = 5;

-- When NOT to index:
-- - Columns with few distinct values (e.g., gender, status)
-- - Tables with < 1000 rows
-- - Columns frequently updated
-- - Columns never in WHERE, JOIN, or ORDER BY
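
The left-prefix rule mentioned above, sketched against the composite index idx_orders_user_date(user_id, order_date):

SQL
-- CAN use the composite index (leading column matched):
SELECT * FROM orders WHERE user_id = 42;
SELECT * FROM orders WHERE user_id = 42 AND order_date >= '2024-01-01';

-- CANNOT use it efficiently (leading column user_id missing):
SELECT * FROM orders WHERE order_date >= '2024-01-01';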

Key Takeaways

  • B-tree for general use; hash for equality-only; clustered vs non-clustered affects storage and number of indexes
  • Indexes speed reads but slow writes; balance based on workload
  • Avoid indexing low-cardinality columns and very small tables
Q6 Medium Query Writing

Write a query to find the second highest salary in each department

Use a window function to rank salaries within each department, then filter for rank 2. ROW_NUMBER() assigns unique ranks (1, 2, 3...) and handles ties arbitrarily. DENSE_RANK() gives the same rank to ties but no gaps (1, 2, 2, 3). For "second highest," DENSE_RANK is often preferred: if two people have the highest salary, the next distinct salary is rank 2.

Wrap the ranking in a CTE or subquery, then SELECT WHERE rn = 2 (or dr = 2). Alternatively, use a correlated subquery: for each department, find the MAX salary, then find the MAX salary that is less than that. The window approach is cleaner and more extensible (e.g., top N per department).

SQL
-- Schema: employees(id, name, dept_id, salary)

-- Using DENSE_RANK (preferred: handles ties correctly)
WITH ranked AS (
  SELECT id, name, dept_id, salary,
    DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
  FROM employees
)
SELECT id, name, dept_id, salary
FROM ranked
WHERE rn = 2;

-- Using ROW_NUMBER (ties get different ranks)
WITH ranked AS (
  SELECT id, name, dept_id, salary,
    ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
  FROM employees
)
SELECT id, name, dept_id, salary
FROM ranked
WHERE rn = 2;

-- Alternative: correlated subquery for second-highest
SELECT e1.*
FROM employees e1
WHERE 1 = (
  SELECT COUNT(DISTINCT e2.salary)
  FROM employees e2
  WHERE e2.dept_id = e1.dept_id AND e2.salary > e1.salary
);
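
The window approach extends directly to "top N per department" — only the filter changes. A sketch for top 3, reusing the same schema (DENSE_RANK may return more than three rows per department when salaries tie):

SQL
-- Top 3 salaries per department
WITH ranked AS (
  SELECT id, name, dept_id, salary,
    DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
  FROM employees
)
SELECT id, name, dept_id, salary
FROM ranked
WHERE rn <= 3;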

Key Takeaways

  • DENSE_RANK for "second highest" when ties exist; ROW_NUMBER when ties should get different ranks
  • PARTITION BY dept_id groups within department; ORDER BY salary DESC ranks highest first
  • CTE or subquery required because window functions cannot be used in WHERE directly
Q7 Easy Query Writing

Write a query to find duplicate records in a table

Duplicates are rows that share the same values in key columns. Use GROUP BY on those columns and HAVING COUNT(*) > 1 to find groups with more than one row. To list the actual duplicate rows (not just the key), join back to the original table or use a subquery with IN.

For "show all duplicate rows," either group and join, or use a window function: COUNT(*) OVER (PARTITION BY col1, col2) and filter where count > 1. The GROUP BY approach is simpler for just identifying which keys are duplicated. To deduplicate and keep one row, use DISTINCT ON (PostgreSQL), ROW_NUMBER with a subquery, or DELETE with a self-join targeting duplicates.

SQL
-- Schema: users(id, email, name) — find duplicate emails

-- Which emails are duplicated?
SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Show all rows that are duplicates (every duplicate row)
SELECT u.*
FROM users u
INNER JOIN (
  SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1
) dup ON u.email = dup.email
ORDER BY u.email;

-- Using window function: mark duplicates
SELECT id, email, name,
  COUNT(*) OVER (PARTITION BY email) AS dup_count
FROM users;
-- Filter WHERE dup_count > 1 in outer query

-- Deduplicate: keep one row per email (PostgreSQL)
SELECT DISTINCT ON (email) id, email, name
FROM users
ORDER BY email, id;

Key Takeaways

  • GROUP BY + HAVING COUNT(*) > 1 finds duplicate keys
  • Join back to get full rows, or use window COUNT to mark duplicates
  • DISTINCT ON or ROW_NUMBER can deduplicate while keeping one row per group
Q8 Medium Query Writing

Write a query to find employees earning more than their managers

This is a classic self-join problem. The employees table has manager_id pointing to another employee. Join the table to itself: one alias for the employee, one for the manager. The join condition is employee.manager_id = manager.id. Then filter WHERE employee.salary > manager.salary.

With INNER JOIN, employees who have no manager (e.g., the CEO) are excluded — usually the right behavior for this question; switch to LEFT JOIN only if you also want to list them. For "employees and their managers" with the salary comparison, the self-join is the standard approach. An alternative is a correlated subquery that fetches the manager's salary and compares, but the self-join is cleaner and usually more efficient.

SQL
-- Schema: employees(id, name, manager_id, salary)

-- Self join: employee and manager as two "tables"
SELECT e.name AS employee, e.salary AS emp_salary,
       m.name AS manager, m.salary AS mgr_salary
FROM employees e
INNER JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;

-- Include employees with no manager (show NULL for manager)
SELECT e.name AS employee, e.salary
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary OR m.id IS NULL;

-- Correlated subquery alternative
SELECT name, salary
FROM employees e
WHERE salary > (
  SELECT salary FROM employees WHERE id = e.manager_id
);

Key Takeaways

  • Self-join: alias the same table twice (e and m), join on employee.manager_id = manager.id
  • INNER JOIN excludes employees with no manager; use LEFT JOIN if you need them
  • Filter WHERE employee.salary > manager.salary after the join
Q9 Medium Query Writing

Write a query to calculate running totals

A running total (cumulative sum) adds each row's value to the sum of all previous rows. Window functions make this straightforward: SUM(amount) OVER (ORDER BY date) suffices, because the default frame when ORDER BY is present is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Note that RANGE treats rows with equal dates as peers and gives them the same cumulative sum; specify ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW for a strict row-by-row total.

Order matters: ORDER BY date gives a chronological running total. Use PARTITION BY to reset the total per group (e.g., running total per user). For running totals, ROWS or RANGE between preceding and current row defines the window. RANGE handles ties differently (peers get same sum). ROWS is typically faster.

SQL
-- Schema: orders(id, order_date, amount)

-- Running total of order amounts over time
SELECT order_date, amount,
  SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders;

-- Running total per customer
SELECT customer_id, order_date, amount,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
  ) AS running_total
FROM orders;

-- Explicit ROWS frame (the implicit default with ORDER BY is RANGE)
SELECT order_date, amount,
  SUM(amount) OVER (
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM orders;
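
The ROWS vs RANGE difference shows up on tied dates: with two orders on the same order_date, RANGE gives both rows the same cumulative sum while ROWS advances row by row. A side-by-side sketch:

SQL
-- Compare both frames in one query
SELECT order_date, amount,
  SUM(amount) OVER (ORDER BY order_date
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_total,
  SUM(amount) OVER (ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_total
FROM orders;
-- On duplicate order_date values, range_total is identical for the
-- tied rows (peers), while rows_total increases one row at a time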

Key Takeaways

  • SUM(...) OVER (ORDER BY col) gives a running total; the default frame is RANGE from the first row through the current row's peers
  • PARTITION BY resets the total for each group (e.g., per user)
  • ROWS vs RANGE: ROWS is row-based, RANGE is value-based (affects ties)
Q10 Hard Query Writing

Write a query to pivot rows into columns

Pivoting transforms rows into columns — e.g., months as rows become columns (Jan, Feb, Mar). The standard approach uses conditional aggregation: MAX(CASE WHEN month = 'Jan' THEN amount END) AS jan, and similarly for other months. Each CASE extracts the value for that category; MAX (or MIN) collapses the single non-null value. GROUP BY the row identifier (e.g., product_id).

Some databases have PIVOT syntax (SQL Server, Oracle). PostgreSQL does not, so CASE + GROUP BY is the norm. For dynamic pivoting (unknown number of columns), you need dynamic SQL. The static CASE approach is common in interviews. Ensure you handle NULLs: COALESCE or default values if needed.

SQL
-- Schema: sales(product_id, month, amount)
-- Pivot: one row per product, columns for each month

-- Using CASE + GROUP BY (PostgreSQL, MySQL)
SELECT product_id,
  MAX(CASE WHEN month = 1 THEN amount END) AS jan,
  MAX(CASE WHEN month = 2 THEN amount END) AS feb,
  MAX(CASE WHEN month = 3 THEN amount END) AS mar,
  MAX(CASE WHEN month = 4 THEN amount END) AS apr
FROM sales
GROUP BY product_id;

-- With SUM if multiple rows per product/month
SELECT product_id,
  SUM(CASE WHEN month = 1 THEN amount ELSE 0 END) AS jan,
  SUM(CASE WHEN month = 2 THEN amount ELSE 0 END) AS feb
FROM sales
GROUP BY product_id;

-- Unpivot (columns to rows) with UNION ALL, assuming the pivoted
-- result above is saved as pivoted(product_id, jan, feb, ...)
SELECT product_id, 1 AS month, jan AS amount FROM pivoted
UNION ALL
SELECT product_id, 2 AS month, feb AS amount FROM pivoted;

Key Takeaways

  • CASE WHEN category = X THEN value END + MAX/MIN + GROUP BY = pivot
  • MAX/MIN collapses one non-null per group; use SUM if aggregating multiple values
  • PostgreSQL has no PIVOT; CASE is the standard approach
Q11 Hard Query Writing

Write a query to find consecutive login days for users

Consecutive days form groups when you subtract a row number from the date — dates in the same consecutive streak yield the same difference. Use ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) to number each login. Then date - rn::int (or similar) gives a constant for consecutive days: 2024-01-01 with rn=1 gives 2023-12-31; 2024-01-02 with rn=2 gives 2023-12-31; 2024-01-05 with rn=3 gives 2024-01-02. Group by user_id and this difference to get streaks, then COUNT and filter for streaks of desired length.

Deduplicate login dates first (one row per user per day). The key insight: date - row_number is invariant within a consecutive run.

SQL
-- Schema: logins(user_id, login_date)
-- Find users with 3+ consecutive login days

WITH deduped AS (
  SELECT DISTINCT user_id, login_date FROM logins
),
ranked AS (
  SELECT user_id, login_date,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
  FROM deduped
),
streaks AS (
  SELECT user_id, login_date::date - rn::int AS streak_start  -- date - integer subtracts days
  FROM ranked
)
SELECT user_id, streak_start, COUNT(*) AS streak_length
FROM streaks
GROUP BY user_id, streak_start
HAVING COUNT(*) >= 3;

-- The cast matters in PostgreSQL: ROW_NUMBER() returns bigint, and
-- date - bigint is not defined; date - integer subtracts days

Key Takeaways

  • date - ROW_NUMBER() is constant for consecutive days; group by it to get streaks
  • Deduplicate first: one row per user per day
  • HAVING COUNT(*) >= N filters for streaks of length N or more
Q12 Medium Query Writing

Write a query to find the Nth highest salary without using LIMIT

Without LIMIT/OFFSET, use a subquery that counts how many distinct salaries are greater than the current row's salary. If exactly N-1 salaries are higher, the current row has the Nth highest. For each employee, the correlated subquery counts SELECT COUNT(DISTINCT salary) FROM employees WHERE salary > e.salary. When that count equals N-1, we have the Nth highest.

Alternatively, use DENSE_RANK() or ROW_NUMBER() in a subquery/CTE and filter WHERE rn = N. The correlated subquery approach works in any SQL dialect and demonstrates understanding of correlated subqueries. For N=2, we want exactly 1 distinct salary greater — the second highest. Handle ties: DENSE_RANK treats ties as same rank; the COUNT approach naturally handles ties (multiple people can have 2nd highest).

SQL
-- Schema: employees(id, name, salary)
-- Find 2nd highest salary (N=2)

-- Correlated subquery: count distinct salaries greater than current
SELECT MAX(salary) AS second_highest
FROM employees e
WHERE 1 = (
  SELECT COUNT(DISTINCT e2.salary)
  FROM employees e2
  WHERE e2.salary > e.salary
);

-- For Nth: replace 1 with N-1
-- N=3: WHERE 2 = (SELECT COUNT(DISTINCT ...) WHERE salary > e.salary)

-- Alternative: DENSE_RANK (clearer, but uses window functions)
WITH ranked AS (
  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rn
  FROM employees
)
SELECT DISTINCT salary FROM ranked WHERE rn = 2;

-- All employees with Nth highest salary
SELECT * FROM employees e
WHERE (SELECT COUNT(DISTINCT salary) FROM employees WHERE salary > e.salary) = 1;

Key Takeaways

  • Nth highest: exactly N-1 distinct salaries greater than current → correlated subquery with COUNT
  • DENSE_RANK() OVER (ORDER BY salary DESC) and filter rn = N is cleaner when allowed
  • Use MAX(salary) or DISTINCT to handle multiple employees with same Nth salary
Q13 Medium Concepts

What are Window Functions? Explain ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG

Window functions compute a value for each row based on a set of related rows (the window) without collapsing rows like GROUP BY. They use OVER (PARTITION BY ... ORDER BY ...) to define the window.

ROW_NUMBER assigns unique sequential integers (1, 2, 3...) per partition. Ties get arbitrary order. RANK assigns same rank to ties but leaves gaps: 1, 2, 2, 4. DENSE_RANK also ties but no gaps: 1, 2, 2, 3. Use ROW_NUMBER for unique ordering, RANK for "Olympic" ranking (two gold, no silver), DENSE_RANK when you want no gaps.

LEAD(column, n) returns the value n rows ahead; LAG returns the value n rows behind. Useful for comparing to the previous or next row: LAG(salary) OVER (ORDER BY date) gives the previous period's salary. The offset defaults to 1, and an optional third argument supplies a default value when the offset falls outside the window (otherwise NULL is returned).

SQL
-- Schema: employees(id, name, dept_id, salary)

-- ROW_NUMBER: unique rank, ties arbitrary
SELECT name, salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn FROM employees;

-- RANK: ties same rank, gaps after
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rk FROM employees;

-- DENSE_RANK: ties same rank, no gaps
SELECT name, salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS dr FROM employees;

-- LEAD/LAG: access adjacent rows
SELECT order_date, amount,
  LAG(amount) OVER (ORDER BY order_date) AS prev_amount,
  LEAD(amount) OVER (ORDER BY order_date) AS next_amount
FROM orders;

-- Period-over-period growth; the third argument to LAG supplies a
-- default (here 0) so the first row is not NULL
SELECT order_date,
  amount - LAG(amount, 1, 0) OVER (ORDER BY order_date) AS change
FROM orders;

Key Takeaways

  • ROW_NUMBER=unique; RANK=ties with gaps; DENSE_RANK=ties no gaps
  • LEAD = next row; LAG = previous row; both take optional offset and default
  • Window functions preserve row count; use PARTITION BY for per-group windows
Q14 Medium Concepts

What is a Common Table Expression (CTE) and when should you use it?

A CTE is a named temporary result set defined with WITH ... AS (...) that exists only for the duration of the query. It improves readability by breaking complex queries into named steps. You can reference the CTE in the main query and in subsequent CTEs (chained CTEs).

Use CTEs when: the same subquery is used multiple times (defined once, referenced twice); the logic is complex and benefits from named steps; you need recursive queries. Recursive CTEs have a base case and a recursive part that references itself — essential for hierarchical data (org charts, tree traversal).

CTEs are not always materialized as separate steps; the optimizer may inline them. In PostgreSQL 12+, non-recursive CTEs are inlined by default, and you can force either behavior with the MATERIALIZED or NOT MATERIALIZED keywords — a materialized CTE acts as an "optimization fence." Prefer CTEs over deeply nested subqueries for readability; use recursive CTEs for hierarchical patterns.

SQL
-- Non-recursive CTE: reusable, readable
WITH recent_orders AS (
  SELECT * FROM orders WHERE order_date >= CURRENT_DATE - 30
),
top_customers AS (
  SELECT customer_id, SUM(amount) AS total
  FROM recent_orders
  GROUP BY customer_id
  ORDER BY total DESC LIMIT 10
)
SELECT c.*, tc.total FROM customers c
JOIN top_customers tc ON c.id = tc.customer_id;

-- Recursive CTE: org chart / tree
WITH RECURSIVE report_chain AS (
  SELECT id, name, manager_id, 1 AS level
  FROM employees WHERE manager_id IS NULL
  UNION ALL
  SELECT e.id, e.name, e.manager_id, rc.level + 1
  FROM employees e
  JOIN report_chain rc ON e.manager_id = rc.id
)
SELECT * FROM report_chain;
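
The PostgreSQL materialization hints look like this (PostgreSQL 12+; earlier versions always materialize):

SQL
-- Force materialization (acts as an optimization fence)
WITH recent_orders AS MATERIALIZED (
  SELECT * FROM orders WHERE order_date >= CURRENT_DATE - 30
)
SELECT * FROM recent_orders WHERE customer_id = 42;

-- Allow inlining (the default since PostgreSQL 12), letting the
-- planner push the customer_id filter into the CTE's scan
WITH recent_orders AS NOT MATERIALIZED (
  SELECT * FROM orders WHERE order_date >= CURRENT_DATE - 30
)
SELECT * FROM recent_orders WHERE customer_id = 42;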

Key Takeaways

  • CTE = WITH name AS (query); improves readability and avoids repeating subqueries
  • Recursive CTE: base case UNION ALL recursive part referencing itself
  • Use for hierarchies, multi-step logic, and when same subquery appears multiple times
Q15 Hard Optimization

Explain query execution plans and optimization techniques

An execution plan shows how the database will run a query: which indexes it uses, join algorithms, and cost estimates. Use EXPLAIN (ANALYZE) to see the plan. ANALYZE actually runs the query and reports real timings. Look for: Seq Scan (full table scan — often bad on large tables), Index Scan/Index Only Scan (good), Nested Loop vs Hash Join vs Merge Join, and high cost or actual time.

Optimization techniques: add indexes on columns in WHERE, JOIN, and ORDER BY; avoid functions on indexed columns (WHERE LOWER(name) = 'x' cannot use index on name); use covering indexes to avoid heap lookups; rewrite OR as UNION if it helps index usage; avoid SELECT * when few columns needed; use EXISTS instead of IN for subqueries when appropriate; batch inserts; analyze statistics with ANALYZE. Watch for implicit type casts and unnecessary sorts.

SQL
-- PostgreSQL: get execution plan
EXPLAIN ANALYZE
SELECT e.name, d.name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 100000;

-- Look for: Seq Scan (bad), Index Scan (good), Hash Join vs Nested Loop
-- High "rows" or "cost" = potential problem

-- Optimization: index for filter
CREATE INDEX idx_emp_salary ON employees(salary);

-- Avoid: function on indexed column (can't use index)
-- BAD:  WHERE DATE(created_at) = '2024-01-01'
-- GOOD: WHERE created_at >= '2024-01-01' AND created_at < '2024-01-02'

-- EXISTS vs IN for subqueries (often EXISTS is faster)
SELECT * FROM orders o
WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id AND c.active);
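
A covering index, as mentioned above — a sketch using PostgreSQL 11+ INCLUDE syntax; the table and column choices are illustrative:

SQL
-- Covering index: key column plus an included payload column
CREATE INDEX idx_orders_cust_cover ON orders(customer_id) INCLUDE (amount);

-- This query can potentially use an Index Only Scan, since both
-- referenced columns live in the index (no heap lookup needed)
SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;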

Key Takeaways

  • EXPLAIN ANALYZE shows plan and actual execution time; look for Seq Scan, high cost
  • Index columns in WHERE/JOIN/ORDER BY; avoid functions on indexed columns
  • EXISTS vs IN, covering indexes, and query rewriting can significantly improve performance