
Write-Ahead Log, Segmented Log & Low-Water Mark

distributed-systems · system-design · software-architecture · backend · design-patterns

Every distributed system — Kafka, PostgreSQL, etcd, CockroachDB — stores data. And every one of them faces the same terrifying question:

"What happens when the server crashes mid-write?"

The answer starts with three patterns that form the durability foundation of virtually every data system you've ever used:

  • Write-Ahead Log (WAL) — Persist every change before applying it, so you can recover after a crash
  • Segmented Log — Split the ever-growing log into manageable pieces
  • Low-Water Mark — Know which log entries can be safely deleted

These three patterns are so fundamental that you cannot build a reliable distributed system without them. They solve the most basic problem: making sure data survives failures.

In this post, we'll cover:

✅ Write-Ahead Log — why "write first, apply later" prevents data corruption
✅ Segmented Log — how Kafka manages billions of messages with log segments
✅ Low-Water Mark — when and how to safely discard old log entries
✅ How all three patterns compose into a complete durability story
✅ Real-world implementations in PostgreSQL, Kafka, and etcd
✅ Code examples showing each pattern in action

Series: Patterns of Distributed Systems Roadmap
Next: Post #3 — Leader and Followers, HeartBeat & Generation Clock


The Durability Problem

Imagine you're writing a key-value store. A client sends SET user:123 "Alice". Your server:

  1. Receives the request
  2. Updates the in-memory hash map
  3. Returns "OK" to the client

What could go wrong? Everything.

If the server crashes after step 2 but before persisting to disk, the data is gone forever — even though the client received "OK". This is not a theoretical concern. It happens in production. Every day.

The naive fix — "just write to disk on every operation" — creates a different problem:

If you crash mid-write to the data file, you get a corrupted file — potentially worse than losing one record. You might lose the entire database.

We need a pattern that guarantees atomicity (all or nothing) and durability (survives crashes).


Pattern 1: Write-Ahead Log (WAL)

The Core Idea

The Write-Ahead Log pattern solves the durability problem with one rule:

Before applying any change to state, write the change to an append-only log on disk first.

The key insight: appending to a file is nearly atomic. Writing one log record to the end of a file either succeeds completely or fails without corrupting existing data. Updating a complex data structure (B-tree, hash table) in place is not atomic — a crash mid-update leaves the structure in an inconsistent state.

If the server crashes before the WAL write, the client doesn't get "OK" — so there's no data loss from the client's perspective. If the server crashes after the WAL write but before applying to state, we can replay the log on restart and recover the exact state.

Anatomy of a WAL Entry

Every WAL entry needs enough information to replay the operation independently:

interface WALEntry {
  // Unique, monotonically increasing sequence number
  logSequenceNumber: number;  // LSN — identifies position in the log
 
  // What happened
  operation: 'SET' | 'DELETE' | 'UPDATE';
  key: string;
  value?: string;
  previousValue?: string;  // For undo during rollback
 
  // Integrity
  timestamp: number;
  checksum: number;  // CRC32 to detect corruption
}

The log sequence number (LSN) is critical. It provides a total ordering of all operations. During recovery, you replay entries in LSN order and get the exact same state — regardless of when or why the server crashed.
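The checksum field also deserves a concrete sketch. Here's a minimal, self-contained CRC32 (an illustration of the idea, not any real system's record format; it assumes single-byte characters, where real systems hash the raw encoded bytes):

```typescript
// Minimal CRC32 (IEEE polynomial, reflected) for WAL entry integrity checks
function crc32(data: string): number {
  let crc = 0xffffffff;
  for (let i = 0; i < data.length; i++) {
    crc ^= data.charCodeAt(i);
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// The checksum covers everything except the checksum field itself
const body = JSON.stringify({ lsn: 42, operation: 'SET', key: 'user:123', value: 'Alice' });
const checksum = crc32(body);
// On recovery: recompute crc32(body) and compare; a mismatch means corruption
```

During recovery, any entry whose recomputed CRC doesn't match its stored checksum is treated as corrupt, and replay stops at the last valid entry.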

Implementation: A Simple Write-Ahead Log

Here's a minimal WAL implementation that captures the essential pattern:

import {
  readFileSync, existsSync, appendFileSync,
  fsyncSync, openSync, closeSync
} from 'fs';
 
interface WALEntry {
  lsn: number;
  operation: 'SET' | 'DELETE';
  key: string;
  value?: string;
  timestamp: number;
}
 
class WriteAheadLog {
  private currentLSN = 0;
  private logPath: string;
  private state: Map<string, string> = new Map();
 
  constructor(logPath: string) {
    this.logPath = logPath;
    this.recover(); // Replay log on startup
  }
 
  /**
   * Core WAL pattern: write to log FIRST, then apply to state
   */
  set(key: string, value: string): void {
    // Step 1: Create log entry
    const entry: WALEntry = {
      lsn: ++this.currentLSN,
      operation: 'SET',
      key,
      value,
      timestamp: Date.now(),
    };
 
    // Step 2: Persist to log (with fsync for durability)
    this.appendToLog(entry);
 
    // Step 3: Apply to in-memory state
    this.state.set(key, value);
  }
 
  delete(key: string): void {
    const entry: WALEntry = {
      lsn: ++this.currentLSN,
      operation: 'DELETE',
      key,
      timestamp: Date.now(),
    };
 
    this.appendToLog(entry);
    this.state.delete(key);
  }
 
  get(key: string): string | undefined {
    return this.state.get(key); // Reads come from in-memory state
  }
 
  /**
   * Append entry and fsync — the critical durability guarantee
   */
  private appendToLog(entry: WALEntry): void {
    const line = JSON.stringify(entry) + '\n';
    const fd = openSync(this.logPath, 'a');
    appendFileSync(fd, line);
    fsyncSync(fd); // Force write to disk — not just OS buffer
    closeSync(fd);
  }
 
  /**
   * Recovery: replay the entire log to rebuild state
   */
  private recover(): void {
    if (!existsSync(this.logPath)) return;
 
    const content = readFileSync(this.logPath, 'utf-8');
    const lines = content.trim().split('\n').filter(l => l.length > 0);
 
    for (const line of lines) {
      const entry: WALEntry = JSON.parse(line);
      this.replayEntry(entry);
      this.currentLSN = Math.max(this.currentLSN, entry.lsn);
    }
 
    console.log(`Recovered ${lines.length} entries, LSN at ${this.currentLSN}`);
  }
 
  private replayEntry(entry: WALEntry): void {
    switch (entry.operation) {
      case 'SET':
        this.state.set(entry.key, entry.value!);
        break;
      case 'DELETE':
        this.state.delete(entry.key);
        break;
    }
  }
}

Notice the critical detail: fsyncSync(fd). Without fsync, the OS might buffer the write in memory — and a crash would lose it. The fsync call forces the data to physical disk, providing the actual durability guarantee.

The fsync Trade-off

fsync is expensive. On a typical SSD, each fsync takes 0.1-1ms. At 1ms per fsync, you're limited to ~1,000 writes per second. That's why real systems batch writes:

class BatchedWAL {
  private pendingEntries: WALEntry[] = [];
  private pendingCallbacks: (() => void)[] = [];
  private flushIntervalMs = 10; // Flush every 10ms
 
  constructor(private logPath: string) {
    // Periodic flush — batch multiple writes into one fsync
    setInterval(() => this.flush(), this.flushIntervalMs);
  }
 
  append(entry: WALEntry): Promise<void> {
    return new Promise(resolve => {
      this.pendingEntries.push(entry);
      this.pendingCallbacks.push(resolve);
    });
  }
 
  private flush(): void {
    if (this.pendingEntries.length === 0) return;
 
    const entries = this.pendingEntries;
    const callbacks = this.pendingCallbacks;
    this.pendingEntries = [];
    this.pendingCallbacks = [];
 
    // Write ALL pending entries, then ONE fsync
    const fd = openSync(this.logPath, 'a');
    for (const entry of entries) {
      appendFileSync(fd, JSON.stringify(entry) + '\n');
    }
    fsyncSync(fd); // One fsync for the entire batch
    closeSync(fd);
 
    // Notify all waiting clients
    callbacks.forEach(cb => cb());
  }
}

By batching 100 writes into one fsync, you go from 1,000 writes/sec to ~100,000 writes/sec. This is exactly what PostgreSQL, Kafka, and etcd do.

System       Config Parameter    Default            Effect
PostgreSQL   wal_writer_delay    200ms              Flush WAL every 200ms
Kafka        linger.ms           0ms (immediate)    Wait before sending batch
etcd         (no batching flag)  per-write          WAL is fsynced on every write

WAL in PostgreSQL

PostgreSQL's WAL is one of the most mature implementations. Every change — INSERT, UPDATE, DELETE, even index modifications — goes through the WAL:

pg_wal/
├── 000000010000000000000001   # 16MB WAL segment
├── 000000010000000000000002   # Next segment
├── 000000010000000000000003
└── ...

Each WAL record contains:

  • LSN (Log Sequence Number) — position in the WAL stream
  • Transaction ID — which transaction made the change
  • Resource Manager ID — which component (heap, B-tree index, etc.)
  • Record data — the actual change (before/after images of the row)

PostgreSQL uses WAL for three purposes:

  1. Crash recovery — replay WAL from the last checkpoint
  2. Streaming replication — ship WAL records to replicas
  3. Point-in-time recovery (PITR) — replay WAL to any point in time

-- See current WAL position
SELECT pg_current_wal_lsn();
-- Result: 0/1A3B4C0
 
-- See WAL statistics
SELECT * FROM pg_stat_wal;

WAL in Kafka

Kafka takes the WAL concept further — the log is the database. Instead of using a WAL to protect some other data structure, Kafka's primary data store is the append-only log itself:

partition-0/
├── 00000000000000000000.log    # First segment
├── 00000000000000000000.index  # Offset index
├── 00000000000000065536.log    # Next segment (starts at offset 65536)
├── 00000000000000065536.index
└── ...

Every message published to Kafka is appended to a partition's log. Consumers read from the log at their own pace. The log is never modified in place — only appended to, and eventually cleaned up.

This is the fundamental insight: the log is the source of truth, not a side channel for durability.
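A toy sketch of this idea, with the log as the only store and each consumer tracking its own position (names are illustrative, not Kafka's API):

```typescript
// Toy append-only log: records are only ever appended, never modified;
// each consumer keeps its own offset and reads at its own pace.
class AppendOnlyLog<T> {
  private records: T[] = [];

  append(record: T): number {
    this.records.push(record);
    return this.records.length - 1; // offset of the new record
  }

  read(offset: number, maxRecords = 100): T[] {
    return this.records.slice(offset, offset + maxRecords);
  }
}

const log = new AppendOnlyLog<string>();
log.append('order-created'); // offset 0
log.append('order-paid');    // offset 1

const slowConsumerView = log.read(0); // sees both records
const fastConsumerView = log.read(2); // [] until more records arrive
```

Because reads are just offset lookups and writes are pure appends, readers and writers never contend on the same record.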

WAL in MySQL InnoDB

MySQL InnoDB's redo log works slightly differently — it uses a fixed-size circular buffer rather than an ever-growing log:

ib_logfile0   # 48MB (default)
ib_logfile1   # 48MB (default)

InnoDB writes redo log entries sequentially into these files, wrapping around when it reaches the end. The key constraint: InnoDB must flush modified ("dirty") pages from its buffer pool to the data files before the redo log wraps around and overwrites entries that haven't been applied yet.

This creates a natural tension between redo log size and write throughput — a larger redo log allows more writes before a flush is needed, but increases crash recovery time.

-- Check InnoDB log configuration
SHOW VARIABLES LIKE 'innodb_log_file_size';
-- Result: 50331648 (48MB)
 
SHOW VARIABLES LIKE 'innodb_log_files_in_group';
-- Result: 2
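The wrap-around constraint can be captured in a tiny capacity check (a simplified sketch; InnoDB's real accounting is far more involved):

```typescript
// Sketch (assumed simplification): can the circular redo log accept `bytes`
// more data without overwriting entries that haven't been applied yet?
function redoLogHasRoom(
  writeLSN: number,      // next byte position to write (monotonically increasing)
  checkpointLSN: number, // all changes below this are flushed to the data files
  logCapacity: number,   // total bytes across the circular log files
  bytes: number,
): boolean {
  const unflushed = writeLSN - checkpointLSN; // bytes that must be preserved
  return unflushed + bytes <= logCapacity;    // otherwise: force a checkpoint first
}
```

When the check fails, the engine must flush dirty pages (advancing the checkpoint LSN) before the write can proceed — which is exactly why an undersized redo log throttles write throughput.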

Pattern 2: Segmented Log

The Problem with a Single Log File

Our WAL works, but it has a growing problem — literally. The log file grows forever:

Day 1:    wal.log = 10 MB
Day 30:   wal.log = 300 MB
Day 365:  wal.log = 3.6 GB
Year 3:   wal.log = 10.8 GB  ← Recovery takes 30 minutes

A single, ever-growing log file creates three problems:

  1. Recovery time — replaying 10 GB of log entries on startup takes minutes
  2. File system limits — some file systems have maximum file sizes
  3. Cleanup — you can't delete old entries from the middle of a file without rewriting it

The Core Idea

The Segmented Log pattern solves this by splitting the log into multiple smaller files called segments:

When a log segment reaches a configured size or age, close it and start a new one. Only the newest segment accepts writes.

Key properties:

  • Only the active segment (the newest one) accepts writes
  • Sealed segments are immutable — they never change
  • Sealed segments can be individually deleted when no longer needed
  • Each segment has a base offset identifying the first entry it contains

Implementation: Segmented Log

import {
  mkdirSync, readdirSync, statSync, unlinkSync,
  readFileSync, openSync, appendFileSync, fsyncSync, closeSync,
} from 'fs';
import { join } from 'path';
 
interface LogSegment {
  baseOffset: number;     // First LSN in this segment
  path: string;           // File path
  size: number;           // Current size in bytes
  sealed: boolean;        // No longer accepts writes
}
 
class SegmentedLog {
  private segments: LogSegment[] = [];
  private activeSegment: LogSegment | null = null;
  private currentLSN = 0;
  private maxSegmentSize: number;
  private logDir: string;
 
  constructor(logDir: string, maxSegmentSizeBytes = 64 * 1024 * 1024) {
    this.logDir = logDir;
    this.maxSegmentSize = maxSegmentSizeBytes;
    mkdirSync(logDir, { recursive: true });
    this.loadExistingSegments();
  }
 
  append(entry: WALEntry): void {
    // Roll to new segment if current one is full
    if (this.shouldRoll()) {
      this.rollSegment();
    }
 
    const line = JSON.stringify(entry) + '\n';
    const fd = openSync(this.activeSegment!.path, 'a');
    appendFileSync(fd, line);
    fsyncSync(fd);
    closeSync(fd);
 
    this.activeSegment!.size += Buffer.byteLength(line);
    this.currentLSN = entry.lsn;
  }
 
  /**
   * Roll: seal current segment, create a new active one
   */
  private rollSegment(): void {
    if (this.activeSegment) {
      this.activeSegment.sealed = true;
      console.log(`Sealed segment at LSN ${this.currentLSN}`);
    }
 
    const newSegment: LogSegment = {
      baseOffset: this.currentLSN + 1,
      path: join(this.logDir, this.formatName(this.currentLSN + 1)),
      size: 0,
      sealed: false,
    };
 
    this.segments.push(newSegment);
    this.activeSegment = newSegment;
  }
 
  private shouldRoll(): boolean {
    return (
      this.activeSegment === null ||
      this.activeSegment.size >= this.maxSegmentSize
    );
  }
 
  /**
   * Delete segments whose entries are all below the given LSN
   */
  deleteSegmentsBefore(lsn: number): number {
    let deleted = 0;
    this.segments = this.segments.filter(segment => {
      // Only delete sealed segments entirely below the threshold
      if (segment.sealed && this.getSegmentMaxLSN(segment) < lsn) {
        unlinkSync(segment.path);
        deleted++;
        return false;
      }
      return true;
    });
    return deleted;
  }
 
  /**
   * Read all entries across all segments in order
   */
  *readAll(): Generator<WALEntry> {
    for (const segment of this.segments) {
      const content = readFileSync(segment.path, 'utf-8');
      const lines = content.trim().split('\n').filter(l => l.length > 0);
      for (const line of lines) {
        yield JSON.parse(line);
      }
    }
  }
 
  // Segment names use zero-padded offsets for lexicographic sorting
  private formatName(offset: number): string {
    return offset.toString().padStart(20, '0') + '.log';
  }
 
  private getSegmentMaxLSN(segment: LogSegment): number {
    // In production, each segment would track its max LSN in a header
    // Simplified: read the last line of the segment
    const content = readFileSync(segment.path, 'utf-8').trim();
    const lines = content.split('\n');
    const last = JSON.parse(lines[lines.length - 1]);
    return last.lsn;
  }
 
  private loadExistingSegments(): void {
    try {
      const files = readdirSync(this.logDir)
        .filter(f => f.endsWith('.log'))
        .sort();
 
      for (const file of files) {
        const path = join(this.logDir, file);
        const baseOffset = parseInt(file.replace('.log', ''), 10);
        const size = statSync(path).size;
        this.segments.push({ baseOffset, path, size, sealed: true });
      }
 
      // The last segment becomes active
      if (this.segments.length > 0) {
        const last = this.segments[this.segments.length - 1];
        last.sealed = false;
        this.activeSegment = last;
      }
    } catch {
      // No existing segments
    }
  }
}

Segment Naming Conventions

Real systems use zero-padded offset numbers as segment names so that lexicographic sorting equals offset sorting:

# Kafka segment files (20-digit zero-padded offsets)
00000000000000000000.log    # Starts at offset 0
00000000000000065536.log    # Starts at offset 65536
00000000000000131072.log    # Starts at offset 131072
 
# PostgreSQL WAL segments (timeline + segment number)
000000010000000000000001    # Timeline 1, segment 1
000000010000000000000002    # Timeline 1, segment 2

This naming scheme enables:

  • Fast lookup: binary search across segment names to find a specific offset
  • Simple cleanup: delete files lexicographically before a cutoff point
  • Filesystem ordering: ls shows segments in chronological order
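The fast-lookup property can be sketched with a binary search over the parsed base offsets (illustrative helper, not any system's actual code):

```typescript
// Given base offsets parsed from segment file names (already sorted),
// find the segment containing `target`: the last one with baseOffset <= target.
function findSegment(baseOffsets: number[], target: number): number {
  let lo = 0;
  let hi = baseOffsets.length - 1;
  let answer = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (baseOffsets[mid] <= target) {
      answer = mid;   // candidate segment; look for a later one
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return answer; // index into the segment list, or -1 if target precedes all segments
}
```

With thousands of segments this is a handful of comparisons, and no segment file needs to be opened until the right one is found.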

Segmented Log in Kafka

Kafka's segment management is highly tunable:

# Segment size — roll to new segment when size is reached
log.segment.bytes=1073741824  # 1GB (default)
 
# Segment age — roll to new segment after this time
log.roll.ms=604800000  # 7 days
 
# Each segment has companion index files
# .log       — the actual messages
# .index     — sparse offset-to-byte-position index
# .timeindex — timestamp-to-offset index

Each segment in Kafka consists of three files:

00000000000000065536.log        # Message data
00000000000000065536.index      # Offset → byte position (sparse)
00000000000000065536.timeindex  # Timestamp → offset (sparse)

The .index file is a sparse index — it maps every Nth offset to a byte position in the .log file. To find offset 65600:

  1. Look in .index for the largest offset <= 65600 (say, 65580 at byte 42000)
  2. Scan .log starting at byte 42000 until you find offset 65600

This gives fast lookups (a binary search over the small in-memory index, plus a short sequential scan) with minimal memory overhead.
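The two-step lookup can be sketched like this (illustrative types; Kafka's index entries are binary, not objects):

```typescript
// A sparse index entry: every Nth offset mapped to its byte position
type IndexEntry = { offset: number; bytePos: number };

// Step 1: floor lookup — the last indexed entry with offset <= target.
// Step 2 (not shown) would scan the .log file forward from entry.bytePos.
function floorEntry(index: IndexEntry[], target: number): IndexEntry {
  let answer = index[0];
  for (const entry of index) {
    if (entry.offset > target) break;
    answer = entry;
  }
  return answer;
}
```

The sparser the index, the less memory it needs — at the cost of a slightly longer scan in step 2.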

Segmented Log in etcd

etcd splits its WAL into 64MB segments:

member/
└── wal/
    ├── 0000000000000000-0000000000000000.wal   # First segment
    ├── 0000000000000001-0000000000001000.wal   # Second segment
    └── ...

etcd's segment naming encodes both the sequence number and the index of the first entry — making it easy to find which segment contains a given entry.


Pattern 3: Low-Water Mark

The Growing Disk Problem

With WAL and segmented log, we have durability and manageable file sizes. But segments keep accumulating:

Month 1:  10 segments  (640 MB)
Month 6:  60 segments  (3.8 GB)
Month 12: 120 segments (7.6 GB)  ← Disk filling up

We need to delete old segments. But which ones are safe to delete? If we delete a segment that's still needed for recovery, we corrupt our system. If we keep everything, we run out of disk.

The Core Idea

The Low-Water Mark pattern tracks the oldest log entry that's still needed by any component of the system:

The low-water mark is the point in the log below which all entries can be safely discarded. Nothing in the system still needs entries before this point.

The low-water mark advances when:

  • A snapshot captures the complete state (so we don't need the log for recovery)
  • All consumers have read past a certain point (so we don't need the log for replay)
  • A time-based retention policy expires old data

Two Approaches to Low-Water Mark

Approach 1: Snapshot-Based Low-Water Mark

Take periodic snapshots of the full state. After a snapshot, the log entries before the snapshot LSN are no longer needed for recovery.

class SnapshotBasedLWM {
  private state: Map<string, string> = new Map();
  private segmentedLog: SegmentedLog;
  private lastSnapshotLSN = 0;
  private snapshotPath: string;
 
  constructor(segmentedLog: SegmentedLog, snapshotPath: string) {
    this.segmentedLog = segmentedLog;
    this.snapshotPath = snapshotPath;
  }
 
  /**
   * Take a snapshot — this advances the low-water mark
   */
  takeSnapshot(currentLSN: number): void {
    // Serialize entire state to disk
    const snapshot = {
      lsn: currentLSN,
      timestamp: Date.now(),
      data: Object.fromEntries(this.state),
    };
 
    writeFileSync(this.snapshotPath, JSON.stringify(snapshot));
    // Ensure the snapshot is durable before deleting WAL segments
    const fd = openSync(this.snapshotPath, 'r');
    fsyncSync(fd);
    closeSync(fd);
 
    this.lastSnapshotLSN = currentLSN;
 
    // Now safely delete old segments
    const deleted = this.segmentedLog.deleteSegmentsBefore(currentLSN);
    console.log(`Snapshot at LSN ${currentLSN}, deleted ${deleted} segments`);
  }
 
  /**
   * Recovery: load snapshot + replay WAL entries after snapshot
   */
  recover(): void {
    // Step 1: Load latest snapshot
    if (existsSync(this.snapshotPath)) {
      const snapshot = JSON.parse(readFileSync(this.snapshotPath, 'utf-8'));
      this.state = new Map(Object.entries(snapshot.data));
      this.lastSnapshotLSN = snapshot.lsn;
      console.log(`Loaded snapshot at LSN ${snapshot.lsn}`);
    }
 
    // Step 2: Replay WAL entries AFTER the snapshot
    let replayed = 0;
    for (const entry of this.segmentedLog.readAll()) {
      if (entry.lsn > this.lastSnapshotLSN) {
        this.applyEntry(entry);
        replayed++;
      }
    }
    console.log(`Replayed ${replayed} entries after snapshot`);
  }
 
  private applyEntry(entry: WALEntry): void {
    if (entry.operation === 'SET') {
      this.state.set(entry.key, entry.value!);
    } else if (entry.operation === 'DELETE') {
      this.state.delete(entry.key);
    }
  }
}

This is the approach used by etcd and ZooKeeper. etcd takes periodic snapshots and then purges old WAL segments.

Approach 2: Consumer-Based Low-Water Mark

Track the position of every consumer (reader). The low-water mark is the minimum position across all consumers:

class ConsumerBasedLWM {
  private consumerOffsets: Map<string, number> = new Map();
 
  /**
   * Update a consumer's committed position
   */
  commitOffset(consumerId: string, offset: number): void {
    this.consumerOffsets.set(consumerId, offset);
  }
 
  /**
   * Low-water mark = minimum offset across all consumers
   */
  getLowWaterMark(): number {
    if (this.consumerOffsets.size === 0) return 0;
    return Math.min(...this.consumerOffsets.values());
  }
 
  /**
   * Clean up segments below the low-water mark
   */
  cleanup(segmentedLog: SegmentedLog): number {
    const lwm = this.getLowWaterMark();
    return segmentedLog.deleteSegmentsBefore(lwm);
  }
}

This is the approach used by Kafka. Each consumer group tracks its offset, and Kafka's log cleaner uses these offsets (plus retention policies) to determine what can be deleted.

Low-Water Mark in Kafka

Kafka's log cleanup combines consumer offsets with configurable retention policies:

# Time-based retention — delete segments older than 7 days
log.retention.hours=168
 
# Size-based retention — keep at most 100GB per partition
log.retention.bytes=107374182400
 
# Log compaction — keep only the latest value for each key
log.cleanup.policy=compact
 
# Or combine both: delete old data AND compact
log.cleanup.policy=compact,delete

Time-based retention is simple: delete segments whose newest message is older than the retention period. This is the low-water mark advancing with time.

Log compaction is more interesting. Instead of deleting entire segments, Kafka keeps only the most recent value for each key.

Log compaction is essential for changelog topics — topics that represent the current state of an entity. After compaction, replaying the log gives you the latest state of every key without reading the entire history.
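Compaction can be sketched as a last-write-wins pass over the log (a toy version; Kafka's cleaner actually works segment by segment, in the background):

```typescript
// A log record; an undefined value is a tombstone marking the key deleted
interface LogRecord { key: string; value?: string }

// Keep only the latest value per key; tombstoned keys are dropped entirely
function compact(log: LogRecord[]): LogRecord[] {
  const latest = new Map<string, string | undefined>();
  for (const record of log) latest.set(record.key, record.value);

  const compacted: LogRecord[] = [];
  for (const [key, value] of latest) {
    if (value !== undefined) compacted.push({ key, value });
  }
  return compacted;
}
```

Replaying the compacted log yields the same final state as replaying the full history — just without the intermediate versions.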

Low-Water Mark in etcd

etcd uses the snapshot-based approach with a configurable snapshot interval:

# Take a snapshot every 10,000 applied entries
--snapshot-count=10000
 
# Maximum number of WAL files to retain
--max-wals=5

etcd's cleanup flow:

  1. After every 10,000 entries, take a snapshot of the entire state
  2. The snapshot includes the applied index (like an LSN)
  3. WAL files before the snapshot can be deleted
  4. On recovery: load snapshot, then replay WAL entries after snapshot

Low-Water Mark in PostgreSQL

PostgreSQL uses checkpoints as its low-water mark mechanism:

-- Checkpoint configuration
SHOW checkpoint_timeout;        -- default: 5min
SHOW max_wal_size;              -- default: 1GB

A checkpoint flushes all dirty pages (modified data) from shared buffers to disk. After a checkpoint completes, WAL segments before the checkpoint's LSN are no longer needed for crash recovery and can be recycled.

The max_wal_size setting controls how much WAL accumulates before PostgreSQL forces a checkpoint — effectively capping how far back recovery might need to replay.


How the Three Patterns Work Together

These three patterns form a complete durability stack. Each pattern solves a specific problem, and together they handle the full lifecycle of persistent data:

Problem                                     Pattern           Solution
"What if the server crashes mid-write?"    Write-Ahead Log   Append change to durable log before applying
"The log file grows forever"               Segmented Log     Split into fixed-size segments; only the active one is writable
"Old segments waste disk space"            Low-Water Mark    Track what's safe to delete, clean up old segments

The Complete Lifecycle

Let's trace the full lifecycle with a concrete example:

Step 1: Client sends SET user:123 "Alice"
  → WAL appends entry #4001 to active segment
 
Step 2: Active segment reaches 64MB
  → Segmented Log rolls: seals segment-4, creates segment-5
 
Step 3: System takes a snapshot at LSN 4000
  → Low-Water Mark advances to 4000
 
Step 4: Segments 1-3 (LSN 1-3000) are all below the low-water mark
  → Segments 1-3 deleted from disk
 
Step 5: Server crashes and restarts
  → Load snapshot (state at LSN 4000)
  → Replay segment-4 entries 4001-4500
  → Replay segment-5 entries 4501-current
  → State fully recovered

Real-World Pattern Map

Here's how these three patterns are implemented across real systems:

System         WAL                             Segmented Log                              Low-Water Mark
PostgreSQL     pg_wal/ files (16MB segments)   WAL segment files, auto-recycled           Checkpoint LSN — segments before checkpoint can be recycled
Kafka          Partition log files             Configurable segment size (1GB default)   Consumer offsets + retention policy (time/size/compaction)
etcd           WAL files in member/wal/        WAL segments (64MB)                        Snapshot index — WAL files before snapshot are purged
MySQL InnoDB   Redo log (ib_logfile0/1)        Fixed-size circular log files              Checkpoint LSN — pages flushed to tablespace
CockroachDB    Pebble WAL                      SST files (sorted string tables)           Compaction — merges and discards old versions

Practice: Building a Durable Key-Value Store

Let's combine all three patterns into a working key-value store:

class DurableKVStore {
  private state: Map<string, string> = new Map();
  private currentLSN = 0;
  private segmentedLog: SegmentedLog;
  private snapshotPath: string;
  private lastSnapshotLSN = 0;
  private opsAfterSnapshot = 0;
  private snapshotInterval: number;
 
  constructor(
    dataDir: string,
    options: {
      maxSegmentSize?: number;
      snapshotInterval?: number;
    } = {}
  ) {
    this.segmentedLog = new SegmentedLog(
      join(dataDir, 'wal'),
      options.maxSegmentSize ?? 64 * 1024 * 1024,
    );
    this.snapshotPath = join(dataDir, 'snapshot.json');
    this.snapshotInterval = options.snapshotInterval ?? 10000;
 
    this.recover();
  }
 
  set(key: string, value: string): void {
    const entry: WALEntry = {
      lsn: ++this.currentLSN,
      operation: 'SET',
      key,
      value,
      timestamp: Date.now(),
    };
 
    // WAL: persist first
    this.segmentedLog.append(entry);
 
    // Apply to state
    this.state.set(key, value);
 
    // Maybe take a snapshot (advances low-water mark)
    this.maybeSnapshot();
  }
 
  delete(key: string): void {
    const entry: WALEntry = {
      lsn: ++this.currentLSN,
      operation: 'DELETE',
      key,
      timestamp: Date.now(),
    };
 
    this.segmentedLog.append(entry);
    this.state.delete(key);
    this.maybeSnapshot();
  }
 
  get(key: string): string | undefined {
    return this.state.get(key);
  }
 
  /**
   * Snapshot-based low-water mark:
   * After N operations, snapshot state and clean up old segments
   */
  private maybeSnapshot(): void {
    this.opsAfterSnapshot++;
    if (this.opsAfterSnapshot >= this.snapshotInterval) {
      this.takeSnapshot();
    }
  }
 
  private takeSnapshot(): void {
    const snapshot = {
      lsn: this.currentLSN,
      timestamp: Date.now(),
      entries: Object.fromEntries(this.state),
    };
 
    writeFileSync(this.snapshotPath, JSON.stringify(snapshot));
    // Make the snapshot durable before deleting the segments it replaces
    const fd = openSync(this.snapshotPath, 'r');
    fsyncSync(fd);
    closeSync(fd);
 
    // Advance low-water mark and clean up
    const deleted = this.segmentedLog.deleteSegmentsBefore(this.currentLSN);
    this.lastSnapshotLSN = this.currentLSN;
    this.opsAfterSnapshot = 0;
 
    console.log(
      `Snapshot at LSN ${this.currentLSN}, deleted ${deleted} old segments`
    );
  }
 
  /**
   * Recovery: snapshot + WAL replay
   */
  private recover(): void {
    // Step 1: Load snapshot if available
    if (existsSync(this.snapshotPath)) {
      const snapshot = JSON.parse(readFileSync(this.snapshotPath, 'utf-8'));
      this.state = new Map(Object.entries(snapshot.entries));
      this.lastSnapshotLSN = snapshot.lsn;
      this.currentLSN = snapshot.lsn;
      console.log(
        `Loaded snapshot: ${this.state.size} keys at LSN ${snapshot.lsn}`
      );
    }
 
    // Step 2: Replay WAL entries after snapshot
    let replayed = 0;
    for (const entry of this.segmentedLog.readAll()) {
      if (entry.lsn > this.lastSnapshotLSN) {
        this.applyEntry(entry);
        this.currentLSN = entry.lsn;
        replayed++;
      }
    }
 
    if (replayed > 0) {
      console.log(`Replayed ${replayed} WAL entries after snapshot`);
    }
  }
 
  private applyEntry(entry: WALEntry): void {
    if (entry.operation === 'SET') {
      this.state.set(entry.key, entry.value!);
    } else if (entry.operation === 'DELETE') {
      this.state.delete(entry.key);
    }
  }
}

Usage

// Create a durable KV store
const store = new DurableKVStore('./data', {
  maxSegmentSize: 10 * 1024 * 1024,  // 10MB segments
  snapshotInterval: 5000,             // Snapshot every 5000 ops
});
 
// These writes survive crashes
store.set('user:1', 'Alice');
store.set('user:2', 'Bob');
store.set('config:theme', 'dark');
 
// Reads come from memory — fast
console.log(store.get('user:1')); // "Alice"
 
// If the process crashes and restarts, DurableKVStore
// recovers from snapshot + WAL replay automatically

Common Pitfalls

Pitfall 1: Forgetting fsync

// WRONG — data may be in OS buffer, not on disk
appendFileSync(logPath, data);
 
// CORRECT — force data to physical disk
const fd = openSync(logPath, 'a');
appendFileSync(fd, data);
fsyncSync(fd);
closeSync(fd);

Without fsync, your WAL provides no durability guarantee. The OS can buffer writes and lose them on crash. PostgreSQL's fsync=off setting explicitly states: "may result in unrecoverable data corruption."

Pitfall 2: Not Handling Partial Writes

A crash can happen mid-write, leaving a truncated JSON line:

{"lsn":42,"operation":"SET","key":"user:1","value":"Al

Your recovery code must handle this:

private recoverEntry(line: string): WALEntry | null {
  try {
    const entry = JSON.parse(line);
    return entry;
  } catch {
    // Truncated entry — last write was interrupted
    console.warn(`Skipping corrupted entry: ${line.substring(0, 50)}...`);
    return null; // Safe: client never received "OK" for this entry
  }
}

This is safe because: if the write was interrupted, the client never received "OK", so there's no data loss from the client's perspective.

Pitfall 3: Deleting Segments That Are Still Needed

// WRONG — what if a consumer is still reading from 2 days ago?
deleteSegmentsOlderThan(24 * 60 * 60 * 1000);
 
// CORRECT — check all consumers before deleting
const lwm = Math.min(...allConsumerOffsets);
deleteSegmentsBefore(lwm);

Always ensure that no component still needs the data before deleting segments. In a replicated system, this includes checking all follower positions — you can't delete segments that a follower hasn't replicated yet.

Pitfall 4: Snapshots Without WAL Coordination

// WRONG — snapshot doesn't record its LSN
takeSnapshot(state);
deleteAllWAL(); // Lost entries between snapshot and crash!
 
// CORRECT — snapshot records exact LSN
takeSnapshot(state, currentLSN);
deleteWALBefore(currentLSN); // Only delete what snapshot covers

The snapshot must record the exact LSN it represents, so recovery knows where to start replaying the WAL.


When to Use These Patterns

Use WAL When:

  • You need crash recovery — any system that persists state
  • You need replication — WAL entries can be shipped to replicas (PostgreSQL streaming replication)
  • You need audit logging — the WAL is a complete history of changes
  • You're building any database, message broker, or coordination service

Use Segmented Log When:

  • Your WAL grows unbounded — which it always will in production
  • You need to clean up old data without rewriting the entire log
  • You want to limit recovery time — only recent segments need replaying
  • You need parallel reads — different consumers can read different segments simultaneously

Use Low-Water Mark When:

  • You need to reclaim disk space from old log segments
  • You have multiple consumers reading the log at different speeds
  • You take periodic snapshots and want to truncate the log behind them
  • You need retention policies — keep 7 days, keep 100GB, or keep latest-per-key

Don't Use These Patterns When:

  • Your data fits in memory and doesn't need to survive restarts
  • You're using an existing database that already implements these patterns
  • You're building a stateless service with no persistent state

Summary and Key Takeaways

Write-Ahead Log ensures durability: write to the log first, apply to state second — so crashes never corrupt data
fsync is non-negotiable — without it, the OS can lose your "persisted" data on crash
Batching fsync calls is the key optimization — one fsync for 100 writes instead of 100 fsyncs
Segmented Log splits the ever-growing WAL into manageable files — only the active segment accepts writes
Sealed segments are immutable — making them safe to delete, copy, or archive independently
Low-Water Mark tracks what's safe to delete — via snapshots (etcd) or consumer offsets (Kafka)
Log compaction keeps only the latest value per key — essential for changelog topics in Kafka
These three patterns compose: WAL provides durability, segments manage file sizes, low-water mark enables cleanup
Every distributed system starts here — PostgreSQL, Kafka, etcd, MySQL, CockroachDB all implement these patterns
Recovery = snapshot + WAL replay — load the last snapshot, replay WAL entries after it


What's Next

With durability foundations in place, the next question is: how do we replicate this log across multiple machines? That requires answering:

  • Who accepts writes? (Leader)
  • How do followers stay in sync? (Replication)
  • How do we detect a leader failure? (HeartBeat)
  • How do we prevent stale leaders? (Generation Clock)

These are the patterns of leader election — the topic of our next post.


Series: Patterns of Distributed Systems Roadmap
Next: Post #3 — Leader and Followers, HeartBeat & Generation Clock


Have questions about durability patterns or WAL implementations? Feel free to reach out!
