The Data Engineer's Field Guide
Chapter Two

The Query That Worked
Until It Didn't

Three-valued logicNULL Join cardinalityFan-outWindow functions
Central question

How do I know my query is safe? A query is safe when it returns the right answer not only on the data you tested but on the data you did not — the NULL you did not expect, the duplicate the join introduced, the volume that arrived next quarter. Most SQL does not fail loudly. It returns a number, and the number is wrong.

The incident

A revenue report had run every morning for a year: total the amount of every order that had not been refunded. It was written the way the requirement reads in English — orders whose ID is not in the set of refunded IDs.

SELECT SUM(amount)
FROM orders
WHERE order_id NOT IN (SELECT order_id FROM refunds);
One Monday · revenue reported zero

Not a dip — zero. Every order vanished from the total at once, and nothing errored. The job ran green. The cause was a single row in refunds with a NULL order_id — a refund recorded before its order reference was populated. One NULL zeroed out an entire company's reported revenue.

NOT IN does not mean what its English implies when the value set contains a NULL. SQL does not use two truth values — it uses three: TRUE, FALSE, and UNKNOWN. A comparison against NULL returns UNKNOWN, and a WHERE clause keeps a row only when its condition is TRUE. Run the incident yourself below.

Lab · Reproduce the zero-revenue bug

Here is the exact dataset: three real orders, and a refunds table containing one genuine refund plus that one poisoned NULL row. Run each query and watch what the engine actually does.

orders

order_idamount
101$300
102$150
103$250

refunds

refund_idorder_id
1102
2NULL
SQL playground

Run it against the poisoned data

Expected (correct) answer: $550 — orders 101 and 103
The fix, and why it is safe

EXISTS asks "does any row match?" — strictly TRUE or FALSE, never UNKNOWN. The NULL refund matches no order, so it excludes no order; a real refund still matches and excludes correctly. The query does what the English always meant. For "exclude rows that have a match," prefer NOT EXISTS — for correctness, not speed.

Lab · Why the row gets dropped

Trace the logic. order_id NOT IN (102, NULL) expands to order_id <> 102 AND order_id <> NULL. That last comparison is UNKNOWN for every row — and TRUE AND UNKNOWN is UNKNOWN. Explore how UNKNOWN propagates.

Logic explorer

Three-valued logic

The rule that prevents the most bugs: a WHERE clause keeps a row only when its condition is TRUE. UNKNOWN is discarded as silently as FALSE — which is exactly how one NULL dropped every row.

Lab · The join that multiplied the money

NULL produces wrong answers by dropping rows. Joins produce them by adding rows. Join an orders table (one row per order) to its order-lines (many rows per order), then SUM the order amount — and watch the total inflate.

Visualizer

Fan-out / duplicate amplification

Order 101 has 3 lines; order 102 has 1. Toggle the join and watch o.amount get duplicated once per matching line.

orders (the "one")

101 · $300
102 · $150

order_lines (the "many")

101 · keyboard
101 · mouse
101 · cable
102 · monitor

Rows after the join

The principle under it — same as Chapter 1

A measure has a grain, and you can only safely sum it at its own grain. The order total lives at the order grain; drag it across a join to the line grain and it multiplies by the number of lines. Know the cardinality of every join before you aggregate across it.

Before trusting a query in production

Run this on any query whose number someone will make a decision from. Tap to check.

    The answer to the chapter's question

    You know your query is safe because you stopped trusting the English reading of the SQL and started reasoning in the engine's terms: three truth values, not two; joins as claims about cardinality you can be wrong about; measures summed only at their own grain; and a query plan you have actually read. The query that worked until it didn't worked by luck. A safe query works by construction.