Write-Ahead Log, Segmented Log & Low-Water Mark

Every distributed system — Kafka, PostgreSQL, etcd, CockroachDB — stores data. And every one of them faces the same terrifying question:
"What happens when the server crashes mid-write?"
The answer starts with three patterns that form the durability foundation of virtually every data system you've ever used:
- Write-Ahead Log (WAL) — Persist every change before applying it, so you can recover after a crash
- Segmented Log — Split the ever-growing log into manageable pieces
- Low-Water Mark — Know which log entries can be safely deleted
These three patterns are so fundamental that you cannot build a reliable distributed system without them. They solve the most basic problem: making sure data survives failures.
In this post, we'll cover:
✅ Write-Ahead Log — why "write first, apply later" prevents data corruption
✅ Segmented Log — how Kafka manages billions of messages with log segments
✅ Low-Water Mark — when and how to safely discard old log entries
✅ How all three patterns compose into a complete durability story
✅ Real-world implementations in PostgreSQL, Kafka, and etcd
✅ Code examples showing each pattern in action
Series: Patterns of Distributed Systems Roadmap
Next: Post #3 — Leader and Followers, HeartBeat & Generation Clock
The Durability Problem
Imagine you're writing a key-value store. A client sends SET user:123 "Alice". Your server:
- Receives the request
- Updates the in-memory hash map
- Returns "OK" to the client
What could go wrong? Everything.
If the server crashes after step 2 but before persisting to disk, the data is gone forever — even though the client received "OK". This is not a theoretical concern. It happens in production. Every day.
The naive fix — "just write to disk on every operation" — creates a different problem:
If you crash mid-write to the data file, you get a corrupted file — potentially worse than losing one record. You might lose the entire database.
We need a pattern that guarantees atomicity (all or nothing) and durability (survives crashes).
Pattern 1: Write-Ahead Log (WAL)
The Core Idea
The Write-Ahead Log pattern solves the durability problem with one rule:
Before applying any change to state, write the change to an append-only log on disk first.
The key insight: appending to a file is nearly atomic. Writing one log record to the end of a file either succeeds completely or fails without corrupting existing data. Updating a complex data structure (B-tree, hash table) in place is not atomic — a crash mid-update leaves the structure in an inconsistent state.
If the server crashes before the WAL write, the client doesn't get "OK" — so there's no data loss from the client's perspective. If the server crashes after the WAL write but before applying to state, we can replay the log on restart and recover the exact state.
Anatomy of a WAL Entry
Every WAL entry needs enough information to replay the operation independently:
interface WALEntry {
// Unique, monotonically increasing sequence number
logSequenceNumber: number; // LSN — identifies position in the log
// What happened
operation: 'SET' | 'DELETE' | 'UPDATE';
key: string;
value?: string;
previousValue?: string; // For undo during rollback
// Integrity
timestamp: number;
checksum: number; // CRC32 to detect corruption
}
The log sequence number (LSN) is critical. It provides a total ordering of all operations. During recovery, you replay entries in LSN order and get the exact same state — regardless of when or why the server crashed.
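The checksum field is what lets recovery detect a torn or half-written record before replaying it. As a rough sketch (not any particular library's API), you might compute a simple hash over the serialized entry — FNV-1a stands in for CRC32 here — and verify it during replay:
function checksumOf(payload: string): number {
  // FNV-1a over the string's code units — a stand-in for CRC32 in this sketch
  let hash = 0x811c9dc5;
  for (let i = 0; i < payload.length; i++) {
    hash ^= payload.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

function verifyEntry(entry: { checksum: number } & Record<string, unknown>): boolean {
  const { checksum, ...rest } = entry;
  // A real WAL would checksum the exact bytes written to disk;
  // serializing the remaining fields is enough to illustrate the check.
  return checksumOf(JSON.stringify(rest)) === checksum;
}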
Implementation: A Simple Write-Ahead Log
Here's a minimal WAL implementation that captures the essential pattern:
import {
readFileSync, existsSync, appendFileSync,
fsyncSync, openSync, closeSync
} from 'fs';
interface WALEntry {
lsn: number;
operation: 'SET' | 'DELETE';
key: string;
value?: string;
timestamp: number;
}
class WriteAheadLog {
private currentLSN = 0;
private logPath: string;
private state: Map<string, string> = new Map();
constructor(logPath: string) {
this.logPath = logPath;
this.recover(); // Replay log on startup
}
/**
* Core WAL pattern: write to log FIRST, then apply to state
*/
set(key: string, value: string): void {
// Step 1: Create log entry
const entry: WALEntry = {
lsn: ++this.currentLSN,
operation: 'SET',
key,
value,
timestamp: Date.now(),
};
// Step 2: Persist to log (with fsync for durability)
this.appendToLog(entry);
// Step 3: Apply to in-memory state
this.state.set(key, value);
}
delete(key: string): void {
const entry: WALEntry = {
lsn: ++this.currentLSN,
operation: 'DELETE',
key,
timestamp: Date.now(),
};
this.appendToLog(entry);
this.state.delete(key);
}
get(key: string): string | undefined {
return this.state.get(key); // Reads come from in-memory state
}
/**
* Append entry and fsync — the critical durability guarantee
*/
private appendToLog(entry: WALEntry): void {
const line = JSON.stringify(entry) + '\n';
const fd = openSync(this.logPath, 'a');
appendFileSync(fd, line);
fsyncSync(fd); // Force write to disk — not just OS buffer
closeSync(fd);
}
/**
* Recovery: replay the entire log to rebuild state
*/
private recover(): void {
if (!existsSync(this.logPath)) return;
const content = readFileSync(this.logPath, 'utf-8');
const lines = content.trim().split('\n').filter(l => l.length > 0);
for (const line of lines) {
const entry: WALEntry = JSON.parse(line);
this.replayEntry(entry);
this.currentLSN = Math.max(this.currentLSN, entry.lsn);
}
console.log(`Recovered ${lines.length} entries, LSN at ${this.currentLSN}`);
}
private replayEntry(entry: WALEntry): void {
switch (entry.operation) {
case 'SET':
this.state.set(entry.key, entry.value!);
break;
case 'DELETE':
this.state.delete(entry.key);
break;
}
}
}
Notice the critical detail: fsyncSync(fd). Without fsync, the OS might buffer the write in memory — and a crash would lose it. The fsync call forces the data to physical disk, providing the actual durability guarantee.
The fsync Trade-off
fsync is expensive. On a typical SSD, each fsync takes 0.1-1ms. At 1ms per fsync, you're limited to ~1,000 writes per second. That's why real systems batch writes:
class BatchedWAL {
private pendingEntries: WALEntry[] = [];
private pendingCallbacks: (() => void)[] = [];
private flushIntervalMs = 10; // Flush every 10ms
constructor(private logPath: string) {
// Periodic flush — batch multiple writes into one fsync
setInterval(() => this.flush(), this.flushIntervalMs);
}
append(entry: WALEntry): Promise<void> {
return new Promise(resolve => {
this.pendingEntries.push(entry);
this.pendingCallbacks.push(resolve);
});
}
private flush(): void {
if (this.pendingEntries.length === 0) return;
const entries = this.pendingEntries;
const callbacks = this.pendingCallbacks;
this.pendingEntries = [];
this.pendingCallbacks = [];
// Write ALL pending entries, then ONE fsync
const fd = openSync(this.logPath, 'a');
for (const entry of entries) {
appendFileSync(fd, JSON.stringify(entry) + '\n');
}
fsyncSync(fd); // One fsync for the entire batch
closeSync(fd);
// Notify all waiting clients
callbacks.forEach(cb => cb());
}
}
By batching 100 writes into one fsync, you go from 1,000 writes/sec to ~100,000 writes/sec. This is exactly what PostgreSQL, Kafka, and etcd do.
| System | Config Parameter | Default | Effect |
|---|---|---|---|
| PostgreSQL | wal_writer_delay | 200ms | Flush WAL every 200ms |
| Kafka | linger.ms | 0ms (immediate) | Wait before sending batch |
| etcd | --wal-sync-duration | Per-write | Sync WAL on each write |
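From the caller's side, the batching in the BatchedWAL sketch above is invisible — each append simply awaits the next flush. A hypothetical usage example (the file path is illustrative; run inside an async context):
const wal = new BatchedWAL('./batched-wal.log');

// 100 concurrent appends, all made durable by a handful of fsync calls
await Promise.all(
  Array.from({ length: 100 }, (_, i) =>
    wal.append({
      lsn: i + 1,
      operation: 'SET',
      key: `user:${i}`,
      value: `name-${i}`,
      timestamp: Date.now(),
    })
  )
);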
WAL in PostgreSQL
PostgreSQL's WAL is one of the most mature implementations. Every change — INSERT, UPDATE, DELETE, even index modifications — goes through the WAL:
pg_wal/
├── 000000010000000000000001 # 16MB WAL segment
├── 000000010000000000000002 # Next segment
├── 000000010000000000000003
└── ...
Each WAL record contains:
- LSN (Log Sequence Number) — position in the WAL stream
- Transaction ID — which transaction made the change
- Resource Manager ID — which component (heap, B-tree index, etc.)
- Record data — the actual change (before/after images of the row)
PostgreSQL uses WAL for three purposes:
- Crash recovery — replay WAL from the last checkpoint
- Streaming replication — ship WAL records to replicas
- Point-in-time recovery (PITR) — replay WAL to any point in time
-- See current WAL position
SELECT pg_current_wal_lsn();
-- Result: 0/1A3B4C0
-- See WAL statistics
SELECT * FROM pg_stat_wal;
WAL in Kafka
Kafka takes the WAL concept further — the log is the database. Instead of using a WAL to protect some other data structure, Kafka's primary data store is the append-only log itself:
partition-0/
├── 00000000000000000000.log # First segment
├── 00000000000000000000.index # Offset index
├── 00000000000000065536.log # Next segment (starts at offset 65536)
├── 00000000000000065536.index
└── ...
Every message published to Kafka is appended to a partition's log. Consumers read from the log at their own pace. The log is never modified in place — only appended to, and eventually cleaned up.
This is the fundamental insight: the log is the source of truth, not a side channel for durability.
WAL in MySQL InnoDB
MySQL InnoDB's redo log works slightly differently — it uses a fixed-size circular buffer rather than an ever-growing log:
ib_logfile0 # 48MB (default)
ib_logfile1 # 48MB (default)
InnoDB writes redo log entries sequentially into these files, wrapping around when it reaches the end. The key constraint: InnoDB must flush modified ("dirty") pages from its buffer pool to the data files before the redo log wraps around and overwrites entries that haven't been applied yet.
This creates a natural tension between redo log size and write throughput — a larger redo log allows more writes before a flush is needed, but increases crash recovery time.
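To make that constraint concrete, here's a small sketch (not InnoDB's actual code) of a fixed-size circular redo log: the write position can never lap the checkpoint, so a write that would overwrite un-flushed entries forces a checkpoint first. The flushDirtyPagesUpTo callback is an assumed stand-in for the buffer pool:
class CircularRedoLog {
  private head = 0;          // Logical write position (always increasing)
  private checkpointLSN = 0; // Everything below this is already in the data files

  constructor(
    private capacity: number,                            // Total redo log size in bytes
    private flushDirtyPagesUpTo: (lsn: number) => void,  // Stand-in for the buffer pool flush
  ) {}

  append(recordSize: number): void {
    // The un-checkpointed portion of the log must fit inside the fixed capacity
    if (this.head + recordSize - this.checkpointLSN > this.capacity) {
      // Not enough room: flush dirty pages so the checkpoint (the tail) can advance.
      // A larger capacity makes this rarer, but means more log to replay after a crash.
      this.flushDirtyPagesUpTo(this.head);
      this.checkpointLSN = this.head;
    }
    this.head += recordSize; // Physically, this wraps around modulo `capacity`
  }
}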
-- Check InnoDB log configuration
SHOW VARIABLES LIKE 'innodb_log_file_size';
-- Result: 50331648 (48MB)
SHOW VARIABLES LIKE 'innodb_log_files_in_group';
-- Result: 2
Pattern 2: Segmented Log
The Problem with a Single Log File
Our WAL works, but it has a growing problem — literally. The log file grows forever:
Day 1: wal.log = 10 MB
Day 30: wal.log = 300 MB
Day 365: wal.log = 3.6 GB
Year 3: wal.log = 10.8 GB ← Recovery takes 30 minutes
A single, ever-growing log file creates three problems:
- Recovery time — replaying 10 GB of log entries on startup takes minutes
- File system limits — some file systems have maximum file sizes
- Cleanup — you can't delete old entries from the middle of a file without rewriting it
The Core Idea
The Segmented Log pattern solves this by splitting the log into multiple smaller files called segments:
When a log segment reaches a configured size or age, close it and start a new one. Only the newest segment accepts writes.
Key properties:
- Only the active segment (the newest one) accepts writes
- Sealed segments are immutable — they never change
- Sealed segments can be individually deleted when no longer needed
- Each segment has a base offset identifying the first entry it contains
Implementation: Segmented Log
import { mkdirSync, readdirSync, statSync, unlinkSync, openSync, appendFileSync, fsyncSync, closeSync, readFileSync } from 'fs';
import { join } from 'path';
interface LogSegment {
baseOffset: number; // First LSN in this segment
path: string; // File path
size: number; // Current size in bytes
sealed: boolean; // No longer accepts writes
}
class SegmentedLog {
private segments: LogSegment[] = [];
private activeSegment: LogSegment | null = null;
private currentLSN = 0;
private maxSegmentSize: number;
private logDir: string;
constructor(logDir: string, maxSegmentSizeBytes = 64 * 1024 * 1024) {
this.logDir = logDir;
this.maxSegmentSize = maxSegmentSizeBytes;
mkdirSync(logDir, { recursive: true });
this.loadExistingSegments();
}
append(entry: WALEntry): void {
// Roll to new segment if current one is full
if (this.shouldRoll()) {
this.rollSegment();
}
const line = JSON.stringify(entry) + '\n';
const fd = openSync(this.activeSegment!.path, 'a');
appendFileSync(fd, line);
fsyncSync(fd);
closeSync(fd);
this.activeSegment!.size += Buffer.byteLength(line);
this.currentLSN = entry.lsn;
}
/**
* Roll: seal current segment, create a new active one
*/
private rollSegment(): void {
if (this.activeSegment) {
this.activeSegment.sealed = true;
console.log(`Sealed segment at LSN ${this.currentLSN}`);
}
const newSegment: LogSegment = {
baseOffset: this.currentLSN + 1,
path: join(this.logDir, this.formatName(this.currentLSN + 1)),
size: 0,
sealed: false,
};
this.segments.push(newSegment);
this.activeSegment = newSegment;
}
private shouldRoll(): boolean {
return (
this.activeSegment === null ||
this.activeSegment.size >= this.maxSegmentSize
);
}
/**
* Delete segments whose entries are all below the given LSN
*/
deleteSegmentsBefore(lsn: number): number {
let deleted = 0;
this.segments = this.segments.filter(segment => {
// Only delete sealed segments entirely below the threshold
if (segment.sealed && this.getSegmentMaxLSN(segment) < lsn) {
unlinkSync(segment.path);
deleted++;
return false;
}
return true;
});
return deleted;
}
/**
* Read all entries across all segments in order
*/
*readAll(): Generator<WALEntry> {
for (const segment of this.segments) {
const content = readFileSync(segment.path, 'utf-8');
const lines = content.trim().split('\n').filter(l => l.length > 0);
for (const line of lines) {
yield JSON.parse(line);
}
}
}
// Segment names use zero-padded offsets for lexicographic sorting
private formatName(offset: number): string {
return offset.toString().padStart(20, '0') + '.log';
}
private getSegmentMaxLSN(segment: LogSegment): number {
// In production, each segment would track its max LSN in a header
// Simplified: read the last line of the segment
const content = readFileSync(segment.path, 'utf-8').trim();
const lines = content.split('\n');
const last = JSON.parse(lines[lines.length - 1]);
return last.lsn;
}
private loadExistingSegments(): void {
try {
const files = readdirSync(this.logDir)
.filter(f => f.endsWith('.log'))
.sort();
for (const file of files) {
const path = join(this.logDir, file);
const baseOffset = parseInt(file.replace('.log', ''), 10);
const size = statSync(path).size;
this.segments.push({ baseOffset, path, size, sealed: true });
}
// The last segment becomes active
if (this.segments.length > 0) {
const last = this.segments[this.segments.length - 1];
last.sealed = false;
this.activeSegment = last;
}
} catch {
// No existing segments
}
}
}
Segment Naming Conventions
Real systems use zero-padded offset numbers as segment names so that lexicographic sorting equals offset sorting:
# Kafka segment files (20-digit zero-padded offsets)
00000000000000000000.log # Starts at offset 0
00000000000000065536.log # Starts at offset 65536
00000000000000131072.log # Starts at offset 131072
# PostgreSQL WAL segments (timeline + segment number)
000000010000000000000001 # Timeline 1, segment 1
000000010000000000000002 # Timeline 1, segment 2
This naming scheme enables:
- Fast lookup: binary search across segment names to find a specific offset (see the sketch after this list)
- Simple cleanup: delete files lexicographically before a cutoff point
- Filesystem ordering: ls shows segments in chronological order
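As a sketch of the fast-lookup point above, here's how a reader might locate the segment that contains a given offset by binary searching the sorted, zero-padded file names (the names are hypothetical Kafka-style examples):
function findSegmentForOffset(segmentFiles: string[], target: number): string {
  // Zero padding makes lexicographic order equal numeric order
  const sorted = [...segmentFiles].sort();
  const baseOffsets = sorted.map(name => parseInt(name.replace('.log', ''), 10));

  // Find the last segment whose base offset <= target
  let lo = 0, hi = sorted.length - 1, found = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (baseOffsets[mid] <= target) { found = mid; lo = mid + 1; }
    else { hi = mid - 1; }
  }
  return sorted[found];
}

// findSegmentForOffset(
//   ['00000000000000000000.log', '00000000000000065536.log', '00000000000000131072.log'],
//   70000,
// ) // → '00000000000000065536.log'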
Segmented Log in Kafka
Kafka's segment management is highly tunable:
# Segment size — roll to new segment when size is reached
log.segment.bytes=1073741824 # 1GB (default)
# Segment age — roll to new segment after this time
log.roll.ms=604800000 # 7 days
# Each segment has companion index files
# .log — the actual messages
# .index — sparse offset-to-byte-position index
# .timeindex — timestamp-to-offset index
Each segment in Kafka consists of three files:
00000000000000065536.log # Message data
00000000000000065536.index # Offset → byte position (sparse)
00000000000000065536.timeindex # Timestamp → offset (sparse)
The .index file is a sparse index — it maps every Nth offset to a byte position in the .log file. To find offset 65600:
- Look in .index for the largest offset <= 65600 (say, 65580 at byte 42000)
- Scan .log starting at byte 42000 until you find offset 65600
This gives fast lookups — a search over the small in-memory index plus a short sequential scan — with minimal memory overhead.
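A sketch of that two-step lookup — the index entry format here is illustrative, not Kafka's on-disk layout:
interface SparseIndexEntry {
  offset: number;       // Message offset
  bytePosition: number; // Where that offset starts in the .log file
}

// Step 1: find the starting byte position — the largest indexed offset <= target
function seek(index: SparseIndexEntry[], target: number): number {
  let start = 0;
  for (const entry of index) {
    if (entry.offset <= target) start = entry.bytePosition;
    else break; // Index is sorted by offset
  }
  return start;
}

// Step 2 (done by the reader): scan the .log file forward from the returned
// byte position until the record with offset `target` is reached — at most
// one index interval's worth of data.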
Segmented Log in etcd
etcd splits its WAL into 64MB segments:
member/
└── wal/
├── 0000000000000000-0000000000000000.wal # First segment
├── 0000000000000001-0000000000001000.wal # Second segment
└── ...
etcd's segment naming encodes both the sequence number and the index of the first entry — making it easy to find which segment contains a given entry.
Pattern 3: Low-Water Mark
The Growing Disk Problem
With WAL and segmented log, we have durability and manageable file sizes. But segments keep accumulating:
Month 1: 10 segments (640 MB)
Month 6: 60 segments (3.8 GB)
Month 12: 120 segments (7.6 GB) ← Disk filling up
We need to delete old segments. But which ones are safe to delete? If we delete a segment that's still needed for recovery, we corrupt our system. If we keep everything, we run out of disk.
The Core Idea
The Low-Water Mark pattern tracks the oldest log entry that's still needed by any component of the system:
The low-water mark is the point in the log below which all entries can be safely discarded. Nothing in the system still needs entries before this point.
The low-water mark advances when:
- A snapshot captures the complete state (so we don't need the log for recovery)
- All consumers have read past a certain point (so we don't need the log for replay)
- A time-based retention policy expires old data
Two Approaches to Low-Water Mark
Approach 1: Snapshot-Based Low-Water Mark
Take periodic snapshots of the full state. After a snapshot, the log entries before the snapshot LSN are no longer needed for recovery.
class SnapshotBasedLWM {
private state: Map<string, string> = new Map();
private segmentedLog: SegmentedLog;
private lastSnapshotLSN = 0;
private snapshotPath: string;
constructor(segmentedLog: SegmentedLog, snapshotPath: string) {
this.segmentedLog = segmentedLog;
this.snapshotPath = snapshotPath;
}
/**
* Take a snapshot — this advances the low-water mark
*/
takeSnapshot(currentLSN: number): void {
// Serialize entire state to disk
const snapshot = {
lsn: currentLSN,
timestamp: Date.now(),
data: Object.fromEntries(this.state),
};
writeFileSync(this.snapshotPath, JSON.stringify(snapshot));
// Ensure snapshot is durable before deleting WAL segments
const snapFd = openSync(this.snapshotPath, 'r');
fsyncSync(snapFd);
closeSync(snapFd);
this.lastSnapshotLSN = currentLSN;
// Now safely delete old segments
const deleted = this.segmentedLog.deleteSegmentsBefore(currentLSN);
console.log(`Snapshot at LSN ${currentLSN}, deleted ${deleted} segments`);
}
/**
* Recovery: load snapshot + replay WAL entries after snapshot
*/
recover(): void {
// Step 1: Load latest snapshot
if (existsSync(this.snapshotPath)) {
const snapshot = JSON.parse(readFileSync(this.snapshotPath, 'utf-8'));
this.state = new Map(Object.entries(snapshot.data));
this.lastSnapshotLSN = snapshot.lsn;
console.log(`Loaded snapshot at LSN ${snapshot.lsn}`);
}
// Step 2: Replay WAL entries AFTER the snapshot
let replayed = 0;
for (const entry of this.segmentedLog.readAll()) {
if (entry.lsn > this.lastSnapshotLSN) {
this.applyEntry(entry);
replayed++;
}
}
console.log(`Replayed ${replayed} entries after snapshot`);
}
private applyEntry(entry: WALEntry): void {
if (entry.operation === 'SET') {
this.state.set(entry.key, entry.value!);
} else if (entry.operation === 'DELETE') {
this.state.delete(entry.key);
}
}
}
This is the approach used by etcd and ZooKeeper. etcd takes periodic snapshots and then purges old WAL segments.
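A brief usage sketch, with illustrative paths and an LSN chosen for the example:
const segmentedLog = new SegmentedLog('./data/wal');
const lwm = new SnapshotBasedLWM(segmentedLog, './data/snapshot.json');

// After the store has applied entries up to LSN 4000:
lwm.takeSnapshot(4000); // persists the full state, then deletes segments entirely below 4000

// After a restart:
lwm.recover(); // loads the snapshot, then replays only entries with lsn > 4000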
Approach 2: Consumer-Based Low-Water Mark
Track the position of every consumer (reader). The low-water mark is the minimum position across all consumers:
class ConsumerBasedLWM {
private consumerOffsets: Map<string, number> = new Map();
/**
* Update a consumer's committed position
*/
commitOffset(consumerId: string, offset: number): void {
this.consumerOffsets.set(consumerId, offset);
}
/**
* Low-water mark = minimum offset across all consumers
*/
getLowWaterMark(): number {
if (this.consumerOffsets.size === 0) return 0;
return Math.min(...this.consumerOffsets.values());
}
/**
* Clean up segments below the low-water mark
*/
cleanup(segmentedLog: SegmentedLog): number {
const lwm = this.getLowWaterMark();
return segmentedLog.deleteSegmentsBefore(lwm);
}
}
This is the approach associated with Kafka's consumer groups: each group tracks its own committed offset. Note that Kafka's built-in segment deletion is driven by retention policies (time and size) rather than by consumer positions, so a consumer that falls too far behind can find its data already gone.
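A quick illustration with made-up consumer groups — the slowest reader determines how much of the log must be retained:
const lwm = new ConsumerBasedLWM();
lwm.commitOffset('billing', 3100);
lwm.commitOffset('analytics', 4200);
lwm.commitOffset('search-indexer', 5000);

lwm.getLowWaterMark(); // 3100 — the slowest consumer pins the low-water mark
// lwm.cleanup(segmentedLog) would then delete only segments entirely below 3100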
Low-Water Mark in Kafka
Kafka's log cleanup is governed by configurable retention and compaction policies:
# Time-based retention — delete segments older than 7 days
log.retention.hours=168
# Size-based retention — keep at most 100GB per partition
log.retention.bytes=107374182400
# Log compaction — keep only the latest value for each key
log.cleanup.policy=compact
# Or combine both: delete old data AND compact
log.cleanup.policy=compact,delete
Time-based retention is simple: delete segments whose newest message is older than the retention period. This is the low-water mark advancing with time.
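A sketch of that policy against segments like ours — assuming each sealed segment tracks the timestamp of its newest entry (a field our SegmentedLog above doesn't have, so it's hypothetical here):
interface RetainableSegment {
  path: string;
  sealed: boolean;
  maxTimestamp: number; // Timestamp of the newest entry in the segment
}

function expiredSegments(segments: RetainableSegment[], retentionMs: number): RetainableSegment[] {
  const cutoff = Date.now() - retentionMs;
  // A sealed segment whose newest entry is older than the cutoff contains
  // nothing the retention policy still requires
  return segments.filter(s => s.sealed && s.maxTimestamp < cutoff);
}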
Log compaction is more interesting. Instead of deleting entire segments, Kafka keeps only the most recent value for each key.
Log compaction is essential for changelog topics — topics that represent the current state of an entity. After compaction, replaying the log gives you the latest state of every key without reading the entire history.
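In code, the essence of compaction is a last-writer-wins pass over the log — a simplified sketch using the WALEntry shape from earlier (real compaction also retains delete tombstones for a grace period):
function compact(entries: WALEntry[]): WALEntry[] {
  const latestPerKey = new Map<string, WALEntry>();
  for (const entry of entries) {
    latestPerKey.set(entry.key, entry); // Later entries overwrite earlier ones
  }
  // Re-emit in LSN order so replaying the compacted log yields the same final state
  return [...latestPerKey.values()].sort((a, b) => a.lsn - b.lsn);
}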
Low-Water Mark in etcd
etcd uses the snapshot-based approach with a configurable snapshot interval:
# Take a snapshot every 10,000 applied entries
--snapshot-count=10000
# Maximum number of WAL files to retain
--max-wals=5
etcd's cleanup flow:
- After every 10,000 entries, take a snapshot of the entire state
- The snapshot includes the applied index (like an LSN)
- WAL files before the snapshot can be deleted
- On recovery: load snapshot, then replay WAL entries after snapshot
Low-Water Mark in PostgreSQL
PostgreSQL uses checkpoints as its low-water mark mechanism:
-- Checkpoint configuration
SHOW checkpoint_timeout; -- default: 5min
SHOW max_wal_size; -- default: 1GB
A checkpoint flushes all dirty pages (modified data) from shared buffers to disk. After a checkpoint completes, WAL segments before the checkpoint's LSN are no longer needed for crash recovery and can be recycled.
The max_wal_size setting controls how much WAL accumulates before PostgreSQL forces a checkpoint — effectively capping how far back recovery might need to replay.
How the Three Patterns Work Together
These three patterns form a complete durability stack. Each pattern solves a specific problem, and together they handle the full lifecycle of persistent data:
| Problem | Pattern | Solution |
|---|---|---|
| "What if the server crashes mid-write?" | Write-Ahead Log | Append change to durable log before applying |
| "The log file grows forever" | Segmented Log | Split into fixed-size segments, only active one writable |
| "Old segments waste disk space" | Low-Water Mark | Track what's safe to delete, clean up old segments |
The Complete Lifecycle
Let's trace the full lifecycle with a concrete example:
Step 1: Client sends SET user:123 "Alice"
→ WAL appends entry #4001 to active segment
Step 2: Active segment reaches 64MB
→ Segmented Log rolls: seals segment-4, creates segment-5
Step 3: System takes a snapshot at LSN 4000
→ Low-Water Mark advances to 4000
Step 4: Segments 1-3 (LSN 1-3000) are all below the low-water mark
→ Segments 1-3 deleted from disk
Step 5: Server crashes and restarts
→ Load snapshot (state at LSN 4000)
→ Replay segment-4 entries 4001-4500
→ Replay segment-5 entries 4501-current
→ State fully recovered
Real-World Pattern Map
Here's how these three patterns are implemented across real systems:
| System | WAL | Segmented Log | Low-Water Mark |
|---|---|---|---|
| PostgreSQL | pg_wal/ files (16MB segments) | WAL segment files, auto-recycled | Checkpoint LSN — segments before checkpoint can be recycled |
| Kafka | Partition log files | Configurable segment size (1GB default) | Consumer offsets + retention policy (time/size/compaction) |
| etcd | WAL files in member/wal/ | WAL segments (64MB) | Snapshot index — WAL files before snapshot are purged |
| MySQL InnoDB | Redo log (ib_logfile0/1) | Fixed-size circular log files | Checkpoint LSN — pages flushed to tablespace |
| CockroachDB | Pebble WAL | SST files (sorted string tables) | Compaction — merges and discards old versions |
Practice: Building a Durable Key-Value Store
Let's combine all three patterns into a working key-value store:
class DurableKVStore {
private state: Map<string, string> = new Map();
private currentLSN = 0;
private segmentedLog: SegmentedLog;
private snapshotPath: string;
private lastSnapshotLSN = 0;
private opsAfterSnapshot = 0;
private snapshotInterval: number;
constructor(
dataDir: string,
options: {
maxSegmentSize?: number;
snapshotInterval?: number;
} = {}
) {
this.segmentedLog = new SegmentedLog(
join(dataDir, 'wal'),
options.maxSegmentSize ?? 64 * 1024 * 1024,
);
this.snapshotPath = join(dataDir, 'snapshot.json');
this.snapshotInterval = options.snapshotInterval ?? 10000;
this.recover();
}
set(key: string, value: string): void {
const entry: WALEntry = {
lsn: ++this.currentLSN,
operation: 'SET',
key,
value,
timestamp: Date.now(),
};
// WAL: persist first
this.segmentedLog.append(entry);
// Apply to state
this.state.set(key, value);
// Maybe take a snapshot (advances low-water mark)
this.maybeSnapshot();
}
delete(key: string): void {
const entry: WALEntry = {
lsn: ++this.currentLSN,
operation: 'DELETE',
key,
timestamp: Date.now(),
};
this.segmentedLog.append(entry);
this.state.delete(key);
this.maybeSnapshot();
}
get(key: string): string | undefined {
return this.state.get(key);
}
/**
* Snapshot-based low-water mark:
* After N operations, snapshot state and clean up old segments
*/
private maybeSnapshot(): void {
this.opsAfterSnapshot++;
if (this.opsAfterSnapshot >= this.snapshotInterval) {
this.takeSnapshot();
}
}
private takeSnapshot(): void {
const snapshot = {
lsn: this.currentLSN,
timestamp: Date.now(),
entries: Object.fromEntries(this.state),
};
writeFileSync(this.snapshotPath, JSON.stringify(snapshot));
// Make the snapshot durable before advancing the low-water mark
const snapFd = openSync(this.snapshotPath, 'r');
fsyncSync(snapFd);
closeSync(snapFd);
// Advance low-water mark and clean up
const deleted = this.segmentedLog.deleteSegmentsBefore(this.currentLSN);
this.lastSnapshotLSN = this.currentLSN;
this.opsAfterSnapshot = 0;
console.log(
`Snapshot at LSN ${this.currentLSN}, deleted ${deleted} old segments`
);
}
/**
* Recovery: snapshot + WAL replay
*/
private recover(): void {
// Step 1: Load snapshot if available
if (existsSync(this.snapshotPath)) {
const snapshot = JSON.parse(readFileSync(this.snapshotPath, 'utf-8'));
this.state = new Map(Object.entries(snapshot.entries));
this.lastSnapshotLSN = snapshot.lsn;
this.currentLSN = snapshot.lsn;
console.log(
`Loaded snapshot: ${this.state.size} keys at LSN ${snapshot.lsn}`
);
}
// Step 2: Replay WAL entries after snapshot
let replayed = 0;
for (const entry of this.segmentedLog.readAll()) {
if (entry.lsn > this.lastSnapshotLSN) {
this.applyEntry(entry);
this.currentLSN = entry.lsn;
replayed++;
}
}
if (replayed > 0) {
console.log(`Replayed ${replayed} WAL entries after snapshot`);
}
}
private applyEntry(entry: WALEntry): void {
if (entry.operation === 'SET') {
this.state.set(entry.key, entry.value!);
} else if (entry.operation === 'DELETE') {
this.state.delete(entry.key);
}
}
}
Usage
// Create a durable KV store
const store = new DurableKVStore('./data', {
maxSegmentSize: 10 * 1024 * 1024, // 10MB segments
snapshotInterval: 5000, // Snapshot every 5000 ops
});
// These writes survive crashes
store.set('user:1', 'Alice');
store.set('user:2', 'Bob');
store.set('config:theme', 'dark');
// Reads come from memory — fast
console.log(store.get('user:1')); // "Alice"
// If the process crashes and restarts, DurableKVStore
// recovers from snapshot + WAL replay automatically
Common Pitfalls
Pitfall 1: Forgetting fsync
// WRONG — data may be in OS buffer, not on disk
appendFileSync(logPath, data);
// CORRECT — force data to physical disk
const fd = openSync(logPath, 'a');
appendFileSync(fd, data);
fsyncSync(fd);
closeSync(fd);
Without fsync, your WAL provides no durability guarantee. The OS can buffer writes and lose them on a crash. PostgreSQL's documentation warns that running with fsync=off can result in unrecoverable data corruption.
Pitfall 2: Not Handling Partial Writes
A crash can happen mid-write, leaving a truncated JSON line:
{"lsn":42,"operation":"SET","key":"user:1","value":"AlYour recovery code must handle this:
private recoverEntry(line: string): WALEntry | null {
try {
const entry = JSON.parse(line);
return entry;
} catch {
// Truncated entry — last write was interrupted
console.warn(`Skipping corrupted entry: ${line.substring(0, 50)}...`);
return null; // Safe: client never received "OK" for this entry
}
}
This is safe because the interrupted write was never acknowledged: the client never received "OK" for that entry, so there's no data loss from the client's perspective.
Pitfall 3: Deleting Segments That Are Still Needed
// WRONG — what if a consumer is still reading from 2 days ago?
deleteSegmentsOlderThan(24 * 60 * 60 * 1000);
// CORRECT — check all consumers before deleting
const lwm = Math.min(...allConsumerOffsets);
deleteSegmentsBefore(lwm);
Always ensure that no component still needs the data before deleting segments. In a replicated system, this includes checking all follower positions — you can't delete segments that a follower hasn't replicated yet.
Pitfall 4: Snapshots Without WAL Coordination
// WRONG — snapshot doesn't record its LSN
takeSnapshot(state);
deleteAllWAL(); // Lost entries between snapshot and crash!
// CORRECT — snapshot records exact LSN
takeSnapshot(state, currentLSN);
deleteWALBefore(currentLSN); // Only delete what snapshot covers
The snapshot must record the exact LSN it represents, so recovery knows where to start replaying the WAL.
When to Use These Patterns
Use WAL When:
- You need crash recovery — any system that persists state
- You need replication — WAL entries can be shipped to replicas (PostgreSQL streaming replication)
- You need audit logging — the WAL is a complete history of changes
- You're building any database, message broker, or coordination service
Use Segmented Log When:
- Your WAL grows unbounded — which it always will in production
- You need to clean up old data without rewriting the entire log
- You want to limit recovery time — only recent segments need replaying
- You need parallel reads — different consumers can read different segments simultaneously
Use Low-Water Mark When:
- You need to reclaim disk space from old log segments
- You have multiple consumers reading the log at different speeds
- You take periodic snapshots and want to truncate the log behind them
- You need retention policies — keep 7 days, keep 100GB, or keep latest-per-key
Don't Use These Patterns When:
- Your data fits in memory and doesn't need to survive restarts
- You're using an existing database that already implements these patterns
- You're building a stateless service with no persistent state
Summary and Key Takeaways
✅ Write-Ahead Log ensures durability: write to the log first, apply to state second — so crashes never corrupt data
✅ fsync is non-negotiable — without it, the OS can lose your "persisted" data on crash
✅ Batching fsync calls is the key optimization — one fsync for 100 writes instead of 100 fsyncs
✅ Segmented Log splits the ever-growing WAL into manageable files — only the active segment accepts writes
✅ Sealed segments are immutable — making them safe to delete, copy, or archive independently
✅ Low-Water Mark tracks what's safe to delete — via snapshots (etcd) or consumer offsets (Kafka)
✅ Log compaction keeps only the latest value per key — essential for changelog topics in Kafka
✅ These three patterns compose: WAL provides durability, segments manage file sizes, low-water mark enables cleanup
✅ Every distributed system starts here — PostgreSQL, Kafka, etcd, MySQL, CockroachDB all implement these patterns
✅ Recovery = snapshot + WAL replay — load the last snapshot, replay WAL entries after it
What's Next
With durability foundations in place, the next question is: how do we replicate this log across multiple machines? That requires answering:
- Who accepts writes? (Leader)
- How do followers stay in sync? (Replication)
- How do we detect a leader failure? (HeartBeat)
- How do we prevent stale leaders? (Generation Clock)
These are the patterns of leader election — the topic of our next post.
Series: Patterns of Distributed Systems Roadmap
Next: Post #3 — Leader and Followers, HeartBeat & Generation Clock
Have questions about durability patterns or WAL implementations? Feel free to reach out!