Back to blog

SQL Fundamentals: SELECT, JOIN, and Subqueries

sqldatabasepostgresqlbackend
SQL Fundamentals: SELECT, JOIN, and Subqueries

Introduction

SQL is 50 years old and still the most important language in the backend world. Every major database — PostgreSQL, MySQL, SQLite, BigQuery, Snowflake, Redshift — speaks SQL. ORMs generate it behind the scenes. Knowing SQL means you can inspect what your ORM is actually doing, optimize slow queries, and write logic that would be painful to express in application code.

This post covers SQL fundamentals: how to retrieve exactly the data you need, how to combine data from multiple tables, and how to aggregate it into useful summaries. All examples run on PostgreSQL, but the syntax works in MySQL and SQLite too (with minor variations noted where they differ).

Prerequisites: This post is part of the SQL & NoSQL Database Roadmap. Familiarity with what tables and rows are is enough to start.

What You'll Learn

✅ Write SELECT queries from scratch
✅ Filter, sort, and paginate results
✅ Aggregate data with GROUP BY and HAVING
✅ Understand every JOIN type with diagrams
✅ Use subqueries in WHERE and FROM clauses
✅ Know the actual order SQL executes clauses
✅ Avoid common mistakes that silently return wrong results


The Sample Schema

All examples use a simple e-commerce schema. Set it up once and follow along:

CREATE TABLE customers (
    id         SERIAL PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    email      VARCHAR(255) NOT NULL UNIQUE,
    country    VARCHAR(50)  NOT NULL,
    created_at TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);
 
CREATE TABLE products (
    id       SERIAL PRIMARY KEY,
    name     VARCHAR(100)   NOT NULL,
    category VARCHAR(50)    NOT NULL,
    price    NUMERIC(10, 2) NOT NULL
);
 
CREATE TABLE orders (
    id          SERIAL PRIMARY KEY,
    customer_id INT            NOT NULL REFERENCES customers(id),
    total       NUMERIC(10, 2) NOT NULL,
    status      VARCHAR(20)    NOT NULL DEFAULT 'pending',
    created_at  TIMESTAMPTZ    NOT NULL DEFAULT NOW()
);
 
CREATE TABLE order_items (
    id         SERIAL PRIMARY KEY,
    order_id   INT            NOT NULL REFERENCES orders(id),
    product_id INT            NOT NULL REFERENCES products(id),
    quantity   INT            NOT NULL,
    unit_price NUMERIC(10, 2) NOT NULL
);

Insert some sample data:

INSERT INTO customers (name, email, country) VALUES
    ('Alice Johnson', 'alice@example.com', 'US'),
    ('Bob Smith',     'bob@example.com',   'UK'),
    ('Carol White',   'carol@example.com', 'US'),
    ('David Lee',     'david@example.com', 'CA');
 
INSERT INTO products (name, category, price) VALUES
    ('Laptop Pro',    'Electronics', 1299.00),
    ('Wireless Mouse','Electronics',   49.99),
    ('Desk Chair',    'Furniture',    399.00),
    ('Standing Desk', 'Furniture',    699.00),
    ('USB Hub',       'Electronics',   29.99);
 
INSERT INTO orders (customer_id, total, status) VALUES
    (1, 1348.99, 'completed'),
    (1,   399.00, 'completed'),
    (2,   699.00, 'pending'),
    (3,   129.97, 'completed'),
    (4,  1299.00, 'cancelled');
 
INSERT INTO order_items (order_id, product_id, quantity, unit_price) VALUES
    (1, 1, 1, 1299.00),
    (1, 2, 1,   49.99),
    (2, 3, 1,  399.00),
    (3, 4, 1,  699.00),
    (4, 2, 3,   49.99),
    (5, 1, 1, 1299.00);

SELECT: Retrieving Data

The SELECT statement is the foundation of SQL. Its minimal form:

SELECT column1, column2
FROM table_name;

Selecting Specific Columns

Always select only the columns you need. SELECT * is convenient for exploration but wastes bandwidth, makes queries brittle when schema changes, and prevents index-only scans.

-- Bad: returns everything, even columns you don't use
SELECT * FROM customers;
 
-- Good: only what you need
SELECT id, name, email FROM customers;

Computed Columns and Aliases

You can compute new values directly in SELECT and give columns readable names with AS:

SELECT
    name,
    price,
    price * 1.1 AS price_with_tax,
    UPPER(category) AS category
FROM products;

DISTINCT: Removing Duplicates

DISTINCT removes duplicate rows from the result. It operates on the full row, not a single column:

-- All unique countries in the customers table
SELECT DISTINCT country FROM customers;
 
-- Unique (category, country) combinations -- not just unique categories
SELECT DISTINCT category FROM products;

WHERE: Filtering Rows

WHERE filters which rows are included. It's evaluated before SELECT, so you can filter on columns you don't include in the output.

Comparison Operators

SELECT name, price FROM products WHERE price > 100;
SELECT name, price FROM products WHERE price = 49.99;
SELECT name FROM products WHERE category != 'Furniture';

Logical Operators: AND, OR, NOT

-- Electronics under $100
SELECT name, price
FROM products
WHERE category = 'Electronics' AND price < 100;
 
-- Electronics or anything over $500
SELECT name, price
FROM products
WHERE category = 'Electronics' OR price > 500;
 
-- Everything except Furniture
SELECT name FROM products
WHERE NOT category = 'Furniture';

Operator precedence: AND binds tighter than OR. Use parentheses to be explicit:

-- Without parens: category='Electronics' AND (price < 50 OR price > 1000)
-- This is probably NOT what you intended:
SELECT name FROM products
WHERE category = 'Electronics' AND price < 50
   OR price > 1000;
 
-- With parens — clearly either electronics or expensive:
SELECT name FROM products
WHERE (category = 'Electronics' AND price < 50)
   OR price > 1000;

IN: Match Any Value in a List

SELECT name FROM products
WHERE category IN ('Electronics', 'Furniture');
 
-- Equivalent to:
SELECT name FROM products
WHERE category = 'Electronics' OR category = 'Furniture';

NOT IN excludes values — but beware: NOT IN returns no rows if the list contains NULL. This is a common bug:

-- If any status in the list is NULL, this returns nothing
SELECT * FROM orders WHERE status NOT IN ('pending', NULL); -- WRONG
 
-- Safe alternative:
SELECT * FROM orders WHERE status != 'pending';

BETWEEN: Range Filtering

SELECT name, price FROM products
WHERE price BETWEEN 50 AND 500;
-- Equivalent to: WHERE price >= 50 AND price <= 500 (inclusive on both ends)

LIKE and ILIKE: Pattern Matching

% matches any sequence of characters. _ matches exactly one character.

-- Names starting with 'A'
SELECT name FROM customers WHERE name LIKE 'A%';
 
-- Names containing 'son'
SELECT name FROM customers WHERE name LIKE '%son%';
 
-- ILIKE is case-insensitive (PostgreSQL specific)
SELECT name FROM customers WHERE name ILIKE 'alice%';

IS NULL / IS NOT NULL

NULL means "unknown" — it's not a value. You cannot compare with =:

-- WRONG: always returns no rows
SELECT * FROM orders WHERE cancelled_at = NULL;
 
-- CORRECT:
SELECT * FROM orders WHERE cancelled_at IS NULL;
SELECT * FROM orders WHERE cancelled_at IS NOT NULL;

ORDER BY: Sorting Results

ORDER BY sorts the result. Default is ascending (ASC). Use DESC to reverse.

-- Cheapest to most expensive
SELECT name, price FROM products ORDER BY price ASC;
 
-- Most expensive first (DESC is explicit but could be omitted if default)
SELECT name, price FROM products ORDER BY price DESC;
 
-- Sort by category, then by price within each category
SELECT name, category, price
FROM products
ORDER BY category ASC, price DESC;

NULL ordering: By default PostgreSQL sorts NULL last in ASC, first in DESC. Control this explicitly:

ORDER BY created_at DESC NULLS LAST;

LIMIT and OFFSET: Pagination

-- First 3 products
SELECT name FROM products ORDER BY id LIMIT 3;
 
-- Products 4-6 (skip first 3)
SELECT name FROM products ORDER BY id LIMIT 3 OFFSET 3;

Warning: OFFSET pagination is slow on large tables because the database must scan and discard the skipped rows. For production pagination, prefer cursor-based pagination using the last seen ID:

-- Page 1
SELECT id, name FROM products ORDER BY id LIMIT 10;
 
-- Page 2 (cursor-based — fast regardless of depth)
SELECT id, name FROM products
WHERE id > 10  -- last id from previous page
ORDER BY id LIMIT 10;

Aggregation: GROUP BY and HAVING

Aggregation functions collapse many rows into a single summary value.

Aggregate Functions

SELECT COUNT(*)           FROM orders;              -- total orders
SELECT COUNT(DISTINCT customer_id) FROM orders;     -- unique customers who ordered
SELECT SUM(total)         FROM orders;              -- total revenue
SELECT AVG(price)         FROM products;            -- average price
SELECT MIN(price)         FROM products;            -- cheapest product
SELECT MAX(price)         FROM products;            -- most expensive product

GROUP BY: Aggregate Per Group

GROUP BY splits rows into groups and applies aggregate functions to each group independently.

-- Number of products per category
SELECT category, COUNT(*) AS product_count
FROM products
GROUP BY category;

Result:

categoryproduct_count
Electronics3
Furniture2

The GROUP BY rule: Every column in SELECT must either be in GROUP BY or wrapped in an aggregate function. This query is invalid:

-- INVALID: 'name' is neither aggregated nor in GROUP BY
SELECT category, name, COUNT(*) FROM products GROUP BY category;

More useful grouping:

-- Total revenue per customer
SELECT customer_id, SUM(total) AS total_spent
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC;
 
-- Order count per status
SELECT status, COUNT(*) AS count
FROM orders
GROUP BY status;

HAVING: Filtering Groups

WHERE filters rows before grouping. HAVING filters groups after aggregation.

-- Only categories with more than 2 products
SELECT category, COUNT(*) AS product_count
FROM products
GROUP BY category
HAVING COUNT(*) > 2;
 
-- Customers who spent more than $500 total
SELECT customer_id, SUM(total) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(total) > 500;

You cannot use a column alias in HAVING — it hasn't been computed yet at that stage:

-- INVALID in most databases:
SELECT customer_id, SUM(total) AS total_spent
FROM orders
GROUP BY customer_id
HAVING total_spent > 500;  -- 'total_spent' doesn't exist here
 
-- VALID:
HAVING SUM(total) > 500;

SQL Execution Order

This is one of the most important things to understand about SQL. The order you write clauses is not the order they execute:

Execution order:
  1. FROM       — identify the source tables
  2. JOIN       — combine with other tables
  3. WHERE      — filter individual rows
  4. GROUP BY   — group rows
  5. HAVING     — filter groups
  6. SELECT     — compute output columns
  7. DISTINCT   — remove duplicate rows
  8. ORDER BY   — sort the result
  9. LIMIT      — trim the result

This explains several confusing behaviors:

  • You cannot use a SELECT alias in WHERE — WHERE runs before SELECT
  • You cannot use a SELECT alias in HAVING — same reason
  • You can use a SELECT alias in ORDER BY — ORDER BY runs after SELECT
  • WHERE filters before grouping, HAVING filters after
-- This is why this fails:
SELECT total * 1.1 AS total_with_tax
FROM orders
WHERE total_with_tax > 500;  -- ERROR: column doesn't exist yet
 
-- Fix: repeat the expression
WHERE total * 1.1 > 500;

JOINs: Combining Tables

JOINs combine rows from two or more tables based on a related column. The most common join condition is a foreign key relationship.

INNER JOIN

Returns only rows that have a match in both tables.

SELECT
    o.id AS order_id,
    c.name AS customer_name,
    o.total,
    o.status
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;

Only orders with a matching customer are returned. In a well-designed schema with a foreign key constraint, this is every row. But if customer_id could be NULL or reference a deleted customer, INNER JOIN would silently drop those orders.

Table aliases (o for orders, c for customers) are essential when joining — they prevent ambiguous column references and make queries readable.

LEFT JOIN (LEFT OUTER JOIN)

Returns all rows from the left table plus matching rows from the right. Rows with no match get NULL for all right-table columns.

-- All customers, including those with no orders
SELECT
    c.name,
    c.email,
    COUNT(o.id) AS order_count,
    COALESCE(SUM(o.total), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = c.id  -- wrong on purpose for illustration
LEFT JOIN orders o ON o.customer_id = c.id  -- correct
GROUP BY c.id, c.name, c.email;
-- Correct version:
SELECT
    c.name,
    c.email,
    COUNT(o.id) AS order_count,
    COALESCE(SUM(o.total), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name, c.email
ORDER BY total_spent DESC;

COALESCE(value, fallback) replaces NULL with a fallback. Here it turns NULL (customer with no orders) into 0.

Finding rows with no match — classic use of LEFT JOIN:

-- Customers who have never placed an order
SELECT c.name, c.email
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.id IS NULL;

RIGHT JOIN (RIGHT OUTER JOIN)

Mirror of LEFT JOIN — all rows from the right table, plus matches from the left. Right joins are rare in practice because you can always rewrite them as a LEFT JOIN by swapping table order.

-- All orders, including those with no matching customer (unusual with FK constraints)
SELECT c.name, o.id, o.total
FROM customers c
RIGHT JOIN orders o ON o.customer_id = c.id;
 
-- Equivalent LEFT JOIN (more readable):
SELECT c.name, o.id, o.total
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id;

FULL OUTER JOIN

Returns all rows from both tables. Rows with no match in the other table get NULL for the missing side.

-- All customers AND all orders, matched where possible
SELECT c.name, o.id AS order_id, o.total
FROM customers c
FULL OUTER JOIN orders o ON o.customer_id = c.id;

Use cases: reconciling two datasets, finding rows missing from either side.

CROSS JOIN

Returns the Cartesian product — every row of the left table combined with every row of the right table. No ON clause.

-- Every customer paired with every product (rarely useful, but valid)
SELECT c.name, p.name AS product
FROM customers c
CROSS JOIN products p;
-- 4 customers × 5 products = 20 rows

Practical use: generating all combinations for a scheduling grid, pairing sizes with colors, etc.

JOIN Summary

Joining More Than Two Tables

Chain JOINs to combine three or more tables:

-- Order details: customer name, product name, quantity, unit price
SELECT
    c.name     AS customer,
    p.name     AS product,
    oi.quantity,
    oi.unit_price,
    oi.quantity * oi.unit_price AS line_total
FROM order_items oi
INNER JOIN orders   o ON oi.order_id   = o.id
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products  p ON oi.product_id = p.id
ORDER BY c.name, o.id;

Subqueries

A subquery is a SELECT query nested inside another query. They appear in three positions: WHERE, FROM, and SELECT.

Subquery in WHERE

Filter based on the result of another query:

-- Customers who have placed at least one order
SELECT name, email
FROM customers
WHERE id IN (
    SELECT DISTINCT customer_id FROM orders
);
 
-- Products more expensive than the average
SELECT name, price
FROM products
WHERE price > (
    SELECT AVG(price) FROM products
);

The inner query runs first and its result is used by the outer query.

Correlated Subquery

A correlated subquery references columns from the outer query. It runs once per row of the outer query — which makes it powerful but potentially slow:

-- Each customer's most recent order date
SELECT
    c.name,
    (
        SELECT MAX(o.created_at)
        FROM orders o
        WHERE o.customer_id = c.id  -- references outer query's 'c'
    ) AS last_order_date
FROM customers c;

For large tables, this pattern is usually better expressed as a JOIN or window function (covered in DB-3). But for small datasets it's perfectly fine.

EXISTS and NOT EXISTS

EXISTS tests whether a subquery returns any rows at all. It stops as soon as it finds one match, making it more efficient than IN for large datasets:

-- Customers who have placed at least one order (EXISTS version)
SELECT name
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.id
);
 
-- Customers with no orders (NOT EXISTS)
SELECT name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.id
);

SELECT 1 inside EXISTS is conventional — the actual columns don't matter, only whether any row matches.

Subquery in FROM (Derived Table)

A subquery in FROM acts as a temporary table. You must give it an alias:

-- Average of per-customer totals
SELECT AVG(customer_total) AS avg_customer_spend
FROM (
    SELECT customer_id, SUM(total) AS customer_total
    FROM orders
    GROUP BY customer_id
) AS customer_totals;

Derived tables are useful for multi-step aggregations. They're being replaced by CTEs in modern SQL (covered in DB-3), but you'll encounter them frequently in legacy code.


Practical Query Patterns

Real queries combine multiple clauses. Here are patterns you'll write repeatedly.

Top N per Group

Customers who spent the most:

SELECT
    c.name,
    SUM(o.total) AS total_spent
FROM customers c
INNER JOIN orders o ON o.customer_id = c.id
WHERE o.status = 'completed'
GROUP BY c.id, c.name
ORDER BY total_spent DESC
LIMIT 5;

Reporting: Revenue by Month

SELECT
    DATE_TRUNC('month', created_at) AS month,
    COUNT(*)                         AS order_count,
    SUM(total)                       AS revenue
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month;

Finding Orphaned Records

-- Order items referencing a non-existent product (data integrity check)
SELECT oi.id, oi.product_id
FROM order_items oi
LEFT JOIN products p ON oi.product_id = p.id
WHERE p.id IS NULL;

Multi-Condition Aggregation with FILTER

PostgreSQL supports FILTER on aggregate functions — a cleaner alternative to CASE WHEN:

SELECT
    COUNT(*)                               AS total_orders,
    COUNT(*) FILTER (WHERE status = 'completed')  AS completed,
    COUNT(*) FILTER (WHERE status = 'pending')    AS pending,
    COUNT(*) FILTER (WHERE status = 'cancelled')  AS cancelled
FROM orders;

Without FILTER, the same result requires CASE WHEN:

SELECT
    COUNT(*) AS total_orders,
    SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS completed,
    SUM(CASE WHEN status = 'pending'   THEN 1 ELSE 0 END) AS pending,
    SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled
FROM orders;

Common Mistakes

1. Filtering on a JOIN condition vs WHERE

-- These are NOT equivalent:
 
-- Version A: LEFT JOIN with WHERE filter
-- Effectively becomes an INNER JOIN — nulls are filtered out
SELECT c.name, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.status = 'completed';   -- drops customers with no orders
 
-- Version B: Filter in the JOIN condition (keeps all customers)
SELECT c.name, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id AND o.status = 'completed';

Put filters on the LEFT table in WHERE. Put filters on the RIGHT table in the JOIN condition if you want to keep left-table rows.

2. COUNT(*) vs COUNT(column)

COUNT(*)         -- counts all rows, including NULLs
COUNT(column)    -- counts rows where column is NOT NULL
COUNT(DISTINCT column)  -- counts unique non-NULL values

3. Implicit GROUP BY with aggregate and non-aggregate columns

-- This fails because 'name' is not in GROUP BY:
SELECT name, category, COUNT(*) FROM products GROUP BY category;
 
-- Fix: include all non-aggregated columns in GROUP BY:
SELECT name, category, COUNT(*) FROM products GROUP BY name, category;
-- (but this defeats the purpose — one row per name+category combination)

4. LIKE with a leading wildcard disables index use

-- Can use an index on 'name':
WHERE name LIKE 'Alice%'
 
-- Cannot use an index on 'name' — full table scan:
WHERE name LIKE '%Alice%'

For arbitrary substring search, use PostgreSQL's full-text search or pg_trgm extension instead.


Exercises

Set up the sample schema and solve these:

  1. List all products in the 'Electronics' category ordered by price ascending.
  2. Find the total number of orders per customer, including customers with zero orders.
  3. Find the average order total for completed orders only.
  4. List customers who have spent more than $500 total across completed orders.
  5. Find products that appear in at least one order (use both IN and EXISTS — compare them).
  6. For each order, show the customer name, order total, and the number of distinct products in that order.
  7. Find the month with the highest total revenue.

Summary and Key Takeaways

Learning Outcomes:
✅ SELECT retrieves columns; WHERE filters rows; ORDER BY sorts; LIMIT paginates
✅ GROUP BY aggregates rows into groups; HAVING filters those groups
✅ SQL executes in logical order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT
✅ INNER JOIN returns matching rows only; LEFT JOIN keeps all left rows
✅ Subqueries can appear in WHERE, FROM, or SELECT clauses
✅ EXISTS is more efficient than IN for large subquery results
✅ NULL behavior is a source of many bugs — always use IS NULL, never = NULL


What's Next

DB-3: Advanced SQL — Window Functions & CTEs →

Window functions and CTEs let you write queries that were previously only possible with complex subqueries or application-level code: running totals, rankings per group, recursive hierarchies, and multi-step transformations — all in pure SQL.

📬 Subscribe to Newsletter

Get the latest blog posts delivered to your inbox every week. No spam, unsubscribe anytime.

We respect your privacy. Unsubscribe at any time.

💬 Comments

Sign in to leave a comment

We'll never post without your permission.