The Data Engineer's Field Guide
Chapter Four

The Backfill That Corrupted Three Years of Data

Idempotency · Incremental loads · Event-time · Late data · Skew
Central question

How do I know my pipeline can recover?

Pipelines fail constantly: feeds arrive late, jobs are retried, history is backfilled. A recoverable pipeline is one where re-running it is safe by design: the same slice loaded twice leaves the same state, and a correction folds in cleanly instead of corrupting what was already there.

The incident

A logic bug was found in a revenue pipeline, so an engineer ran a backfill to correct three years of history. The backfill used the same load step the daily job did — and that step appended its results. Re-running it over three years of slices did not replace those years; it added a second copy of all of them.

The correction that doubled everything

Three years of revenue, doubled, in one run. The job succeeded. No error, no alert — just every historical total now twice what it should be, discovered when a figure failed to reconcile. The fix to a small bug created a far larger one, because the load path was not safe to re-run.

The root cause is one property: the load was not idempotent. Recovery — retries, backfills, reprocessing late data — is only safe when running a load twice leaves the same state as running it once. Everything in this chapter builds on that one idea.

Lab · Run the backfill yourself

Three years of correct revenue are loaded. Now run the correcting backfill. Choose the load strategy first, then press the button — and watch whether history gets fixed or doubled.

Simulator: Append vs. atomic partition replace

The discipline

Establish state; do not accumulate it. A load should replace the slice it owns rather than append to it. Use one load path for daily runs and backfills: a backfill should be the same code run over more slices, not a special script. Wrap delete-and-insert (or partition overwrite) in a transaction so it is atomic: a half-finished re-run never leaves a partition in a torn state.
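A minimal sketch of the difference, using Python's built-in sqlite3 module as a stand-in for a warehouse. The table name `revenue`, its columns, and the helper names `make_db`, `load_append`, `load_replace`, and `day_total` are illustrative assumptions, not part of the incident's actual pipeline. The point is only that the replacing load can be run any number of times, and that a failed re-run rolls back to the previous state.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical revenue table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revenue (day TEXT NOT NULL, amount REAL NOT NULL)")
    return conn


def load_append(conn, day, amounts):
    """Not idempotent: every run adds another copy of the slice."""
    with conn:
        conn.executemany(
            "INSERT INTO revenue (day, amount) VALUES (?, ?)",
            [(day, amount) for amount in amounts],
        )


def load_replace(conn, day, amounts):
    """Idempotent: delete the slice and re-insert it in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM revenue WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO revenue (day, amount) VALUES (?, ?)",
            [(day, amount) for amount in amounts],
        )


def day_total(conn, day):
    """Return the summed amount for one day (0 if the day has no rows)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM revenue WHERE day = ?", (day,)
    ).fetchone()
    return row[0]


slice_rows = [100.0, 250.0]

appended = make_db()
load_append(appended, "2024-01-01", slice_rows)   # daily run
load_append(appended, "2024-01-01", slice_rows)   # "correcting" backfill
print(day_total(appended, "2024-01-01"))          # 700.0: doubled

replaced = make_db()
load_replace(replaced, "2024-01-01", slice_rows)  # daily run
load_replace(replaced, "2024-01-01", slice_rows)  # backfill, same code path
print(day_total(replaced, "2024-01-01"))          # 350.0: unchanged
```

Both functions take the same arguments, so the daily job and the backfill can call the same one over different sets of days. On engines with native partition overwrite, the delete-and-insert pair would be replaced by that operation.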

Lab · The fact that arrived two days late

An event happened Monday but the user's device was offline; it arrived Wednesday. Toggle how the pipeline buckets it — by when it arrived, or by when it occurred — and watch Monday's total.

Timeline: Event-time vs. processing-time

The interlocking design

Partition by event-time so a late fact lands in the day it belongs to. That means a past partition can still change, so you need a watermark or reprocessing window: re-run the last N days every cycle, where N covers your expected lateness. Re-running past partitions is only safe if the load is idempotent. Event-time correctness, late-data handling, and idempotency are one design; none of them is optional.
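The Monday/Wednesday case can be sketched in a few lines of Python. The field names `occurred_on` and `arrived_on`, the sample amounts, and the helpers `daily_totals` and `reprocessing_window` are assumptions made for illustration. The sketch shows how the bucketing choice changes Monday's total, and which partitions a two-day reprocessing window would rebuild on Wednesday's run.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_totals(events, bucket_by):
    """Sum amounts per day, keyed by the chosen date field of each event."""
    totals = defaultdict(float)
    for event in events:
        totals[event[bucket_by]] += event["amount"]
    return dict(totals)


def reprocessing_window(run_date, lateness_days):
    """Partitions to rebuild this cycle: run_date and the N days before it."""
    return [
        run_date - timedelta(days=offset)
        for offset in range(lateness_days, -1, -1)
    ]


MONDAY = date(2024, 1, 1)
WEDNESDAY = date(2024, 1, 3)

EVENTS = [
    {"occurred_on": MONDAY, "arrived_on": MONDAY, "amount": 10.0},
    {"occurred_on": MONDAY, "arrived_on": WEDNESDAY, "amount": 5.0},  # late
    {"occurred_on": WEDNESDAY, "arrived_on": WEDNESDAY, "amount": 7.0},
]

by_arrival = daily_totals(EVENTS, "arrived_on")      # Monday: 10.0, Wednesday: 12.0
by_occurrence = daily_totals(EVENTS, "occurred_on")  # Monday: 15.0, Wednesday: 7.0

# With N = 2, Wednesday's run rebuilds Monday, Tuesday, and Wednesday.
to_rebuild = reprocessing_window(WEDNESDAY, lateness_days=2)
```

Each date in `to_rebuild` would be passed to the same idempotent load used for the daily run, which is why the window is only safe once re-running a partition is safe.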

Lab · Why one worker holds the whole job

A distributed job splits work by a partition key across workers. If the key's values are uneven — one customer is 60% of the rows, or NULLs all hash to one place — one worker gets the lion's share while the rest finish and idle. That is skew.

Visualizer: Partition skew

The judgment

Check the distribution of the partition key for a hotspot before you trust the parallelism — including the NULL bucket, which silently collapses to one worker. Mitigations (salting a hot key, isolating NULLs, choosing a higher-cardinality key) all trade simplicity for balance; pick deliberately.
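One way to run that check, sketched in plain Python over an in-memory list of key values. The helper names `key_distribution`, `hot_keys`, and `salted_key`, the 20% threshold, and the bucket count of 8 are illustrative assumptions; on a real dataset the same counts would usually come from a group-by on the partition key.

```python
from collections import Counter


def key_distribution(keys):
    """Return (key, share_of_rows) pairs, largest first. None is its own bucket."""
    counts = Counter(keys)
    total = sum(counts.values())
    return [(key, count / total) for key, count in counts.most_common()]


def hot_keys(keys, threshold=0.2):
    """Keys (including None) that hold more than `threshold` of all rows."""
    return [key for key, share in key_distribution(keys) if share > threshold]


def salted_key(key, row_index, salt_buckets=8):
    """Spread one hot key over several sub-keys by appending a salt.

    The salt has to be dropped, or results re-aggregated per original key,
    in a later step; that extra step is the simplicity being traded away.
    """
    return (key, row_index % salt_buckets)


# Hypothetical key column: one customer is 60% of rows, 25% are NULL.
keys = ["acme"] * 60 + [None] * 25 + [f"customer_{i}" for i in range(15)]

print(hot_keys(keys))  # ['acme', None]

# Salting the 60 "acme" rows spreads them over 8 sub-keys of 7 or 8 rows each.
salted = Counter(salted_key("acme", i) for i in range(60))
print(max(salted.values()))  # 8
```

The NULL bucket shows up in the same report as any other key, so it cannot hide. Whether to salt, isolate, or re-key is still a judgment call that depends on what the downstream step needs.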

Before trusting a pipeline to recover

The answer to the chapter's question

Your pipeline can recover when every load is idempotent and atomic, daily runs and backfills share one load path, data is partitioned by event-time with a reprocessing window wide enough for realistic lateness, duplicates are removed on a stable key and updates applied latest-wins, and you have checked the partition key for skew. The backfill corrupted three years because one property, idempotency, was missing. With it, re-running is just recovery.
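The "stable key, latest-wins" rule can be sketched as follows. The field names `id` and `updated_at` and the helper `dedupe_latest_wins` are assumptions for illustration; any stable business key and any monotonically increasing version or timestamp would play the same roles.

```python
def dedupe_latest_wins(records, key="id", version="updated_at"):
    """Keep one record per key: the one with the greatest version value.

    On a tie, the record that appears later in the input wins, so ties
    depend on input order; a real pipeline may want an explicit tiebreaker.
    """
    latest = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or record[version] >= current[version]:
            latest[record[key]] = record
    return sorted(latest.values(), key=lambda record: record[key])


records = [
    {"id": 1, "updated_at": 1, "amount": 10},
    {"id": 1, "updated_at": 2, "amount": 12},  # update to id 1
    {"id": 2, "updated_at": 1, "amount": 5},
    {"id": 1, "updated_at": 1, "amount": 10},  # replayed duplicate
]

clean = dedupe_latest_wins(records)
# clean holds id 1 at updated_at=2 (amount 12) and id 2 (amount 5).

# Feeding the same batch in twice gives the same result as feeding it once.
assert dedupe_latest_wins(records + records) == clean
```

Because the output depends only on the set of records seen, not on how many times each one arrived, a retried or replayed feed leaves the result unchanged.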