Redis Persistence Internals: How RDB and AOF Work

Tags: redis, database, backend, performance

Introduction

Every time you restart a Redis server, you expect your data to still be there. But Redis is an in-memory database — when the process exits, RAM is wiped. So how does Redis survive a restart?

The answer is persistence: Redis writes data to disk in the background, so it can reload on startup. It offers two complementary mechanisms:

  • RDB (Redis Database) — periodic snapshots of the entire dataset
  • AOF (Append-Only File) — a write-ahead log of every mutating command

Both are implemented in the Redis source code with careful engineering to avoid blocking the event loop or introducing latency spikes. This post reads through rdb.c, aof.c, and supporting files to understand exactly how they work.

What You'll Learn:
✅ How Redis forks its process to take a snapshot without pausing
✅ What copy-on-write means and why it makes BGSAVE safe
✅ How AOF records commands and the three fsync strategies
✅ The dual-buffer trick that prevents AOF rewrite from losing data
✅ How Redis combines RDB and AOF for the best of both worlds


The Two Persistence Files

git clone https://github.com/redis/redis.git
cd redis/src
File        What It Does
rdb.c       RDB snapshot creation, encoding, and loading
rdb.h       RDB format constants and function declarations
aof.c       AOF write, fsync, and background rewrite
server.c    serverCron, the timer that triggers both
server.h    redisServer fields for persistence state

The persistence state in redisServer (from server.h) is worth noting upfront:

struct redisServer {
    // RDB state
    pid_t rdb_child_pid;          // PID of background save process (-1 if none)
    int rdb_bgsave_scheduled;     // BGSAVE requested but blocked by AOF rewrite
    time_t lastsave;              // Unix timestamp of last successful save
    int lastbgsave_status;        // C_OK or C_ERR
    long long dirty;              // Changes since last RDB save
    long long dirty_before_bgsave;// dirty count when BGSAVE started
 
    // AOF state
    int aof_state;                // AOF_OFF, AOF_ON, or AOF_WAIT_REWRITE
    int aof_fd;                   // File descriptor of the AOF file
    pid_t aof_child_pid;          // PID of background rewrite process
    sds aof_buf;                  // In-memory buffer: commands waiting to be flushed
    list *aof_rewrite_buf_blocks; // Secondary buffer: list of blocks holding commands during rewrite
    off_t aof_current_size;       // Current AOF file size in bytes
};

Two child PIDs, two buffers, a dirty counter — already you can see the shape of how the system works. Let's explore each side.


Part 1: RDB — Snapshots with fork()

What an RDB File Is

An RDB file is a compact binary snapshot of the entire Redis dataset at a point in time. It contains every key, every value, every TTL, encoded in Redis's own binary format. On startup, Redis loads this file and reconstructs the in-memory state.

You can trigger a save manually:

SAVE      # Synchronous — blocks Redis until done (avoid in production)
BGSAVE    # Asynchronous — forks a child process, returns immediately

You can also configure automatic saves in redis.conf:

# Save after 900 seconds if at least 1 change was made since the last save
save 900 1
# Save after 300 seconds if at least 10 changes were made
save 300 10
# Save after 60 seconds if at least 10000 changes were made
save 60 10000

These thresholds are checked by serverCron, which runs every 100 ms by default (hz 10).
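The check itself can be sketched in a few lines. This is a minimal illustration, not the code from server.c: the `save_point` struct and `should_bgsave` function are hypothetical names, and the real loop also considers things such as whether a child process is already running.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical mirror of one "save <seconds> <changes>" line. */
struct save_point {
    time_t seconds;
    long long changes;
};

/* Returns 1 if any save point is satisfied: enough changes have
 * accumulated AND enough time has passed since the last save. */
int should_bgsave(const struct save_point *points, size_t n,
                  long long dirty, time_t now, time_t last_save) {
    for (size_t i = 0; i < n; i++) {
        if (dirty >= points[i].changes &&
            now - last_save > points[i].seconds) {
            return 1;
        }
    }
    return 0;
}
```

With the three save lines above, a single change triggers a save only once 900 seconds have elapsed, while 10000 changes trigger one after just over a minute.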

The BGSAVE Implementation

The core of non-blocking snapshot creation is rdbSaveBackground() in rdb.c:

int rdbSaveBackground(int req, char *filename,
                      rdbSaveInfo *rsi, int rdbflags) {
    pid_t childpid;
 
    // Only one child process (RDB save, AOF rewrite, ...) may run at a time
    if (hasActiveChildProcess()) return C_ERR;
 
    server.dirty_before_bgsave = server.dirty;
    server.lastbgsave_try = time(NULL);
 
    // *** THE KEY OPERATION ***
    if ((childpid = redisFork(CHILD_TYPE_RDB)) == 0) {
        // === Child process ===
        int retval;
        redisSetProcTitle("redis-rdb-bgsave");
        redisSetCpuAffinity(server.bgsave_cpulist);
        retval = rdbSave(req, filename, rsi, rdbflags);
        if (retval == C_OK) {
            sendChildCowInfo(CHILD_INFO_TYPE_RDB_COW_SIZE,
                             "RDB");
        }
        exitFromChild((retval == C_OK) ? 0 : 1);
    } else {
        // === Parent process (the main Redis server) ===
        if (childpid == -1) {
            // fork() failed
            server.lastbgsave_status = C_ERR;
            return C_ERR;
        }
        serverLog(LL_NOTICE,
            "Background saving started by pid %ld",
            (long) childpid);
        server.rdb_save_time_start = time(NULL);
        server.rdb_child_pid = childpid;
        return C_OK;
    }
    return C_OK; // unreachable: the child exits above, the parent has already returned
}

The key insight is the fork() call. fork() creates a copy of the Redis process: the child writes the snapshot while the parent continues serving commands. Initially they share the same physical memory pages, so no dataset memory is copied up front; the kernel only has to duplicate the page tables.

Copy-on-Write: Why Fork Is (Almost) Free

After fork(), both the parent and child process point to the same physical memory pages. The OS marks all these pages as copy-on-write (COW).

When the parent modifies a key (because clients are still writing), the OS transparently duplicates just that page. The child still sees the old, unmodified version. This means:

  • The child gets a consistent point-in-time view of the data
  • The parent blocks only for the fork() call itself, then keeps serving requests normally
  • Memory overhead is proportional to how much data changes during the save, not the total dataset size

This is why Redis logs RDB: X MB of memory used by copy-on-write after a save — it's telling you how many pages were duplicated.
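The isolation that fork() provides is easy to observe outside Redis. The sketch below is a standalone demonstration, not Redis code: `value_seen_by_child` is a hypothetical helper in which the parent changes a variable after forking, and the child, released through a pipe only after that change, still reports the value from the moment of the fork.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks, lets the parent modify `value`, then asks the child what it sees.
 * Returns the child's view of `value` (expected: 1), or -1 on error.
 * The parent's own view is stored in *parent_value (expected: 2). */
int value_seen_by_child(int *parent_value) {
    int value = 1;
    int gate[2];
    if (pipe(gate) == -1) return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(gate[0]);
        close(gate[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: wait until the parent has finished its write. */
        char byte;
        close(gate[1]);
        if (read(gate[0], &byte, 1) != 1) _exit(255);
        _exit(value); /* reports the value as of the fork */
    }

    /* Parent: modify the variable, then release the child. */
    close(gate[0]);
    value = 2;
    if (parent_value) *parent_value = value;
    ssize_t written = write(gate[1], "x", 1);
    close(gate[1]);

    int status;
    if (waitpid(pid, &status, 0) == -1) return -1;
    if (written != 1 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

The child reports 1 even though the parent already holds 2. In Redis the same mechanism applies to every page of the dataset, with the kernel copying a page only when the parent first writes to it.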

The Child: Serializing the Dataset

Inside the child process, rdbSave() iterates over all databases and serializes every key-value pair:

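// Simplified from rdb.c
int rdbSave(int req, char *filename,
            rdbSaveInfo *rsi, int rdbflags) {
    char tmpfile[256];
    FILE *fp;
    rio rdb;
    int error = 0;

    // Write to a temp file first, then atomic rename
    snprintf(tmpfile, sizeof(tmpfile), "temp-%d.rdb", (int) getpid());
    fp = fopen(tmpfile, "w");
    if (!fp) return C_ERR;
    rioInitWithFile(&rdb, fp);

    if (rdbSaveRio(req, &rdb, &error, rdbflags, rsi) == C_ERR) {
        errno = error;
        goto werr;
    }

    // fflush: stdio buffer -> kernel page cache
    // fsync:  kernel page cache -> physical disk
    if (fflush(fp) == EOF) goto werr;
    if (fsync(fileno(fp)) == -1) goto werr;
    if (fclose(fp) == EOF) { fp = NULL; goto werr; }
    fp = NULL;

    // Atomic rename: old dump.rdb is replaced in one syscall
    if (rename(tmpfile, filename) == -1) goto werr;

    return C_OK;

werr:
    // Clean up the partial temp file; the existing dump.rdb is untouched
    if (fp) fclose(fp);
    unlink(tmpfile);
    return C_ERR;
}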

Two reliability details stand out:

  1. Write to a temp file first. If the process crashes mid-write, the existing dump.rdb is untouched. An incomplete file never becomes the live snapshot.
  2. rename() is atomic. On POSIX systems, renaming a file is a single syscall. There's no window where a reader could see a half-written file.
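The same temp-file-then-rename pattern works for any file that must never be seen half-written. Here is a minimal standalone sketch; `atomic_write_file` and the `.tmp-<pid>` suffix are illustrative choices, not anything taken from Redis.

```c
#include <stdio.h>
#include <unistd.h>

/* Replaces `path` with `data` so that readers see either the old
 * contents or the new contents, never a partial file.
 * Returns 0 on success, -1 on failure. */
int atomic_write_file(const char *path, const char *data, size_t len) {
    char tmp[512];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp-%d", path, (int)getpid());
    if (n < 0 || (size_t)n >= sizeof(tmp)) return -1;

    FILE *fp = fopen(tmp, "w");
    if (!fp) return -1;

    if (fwrite(data, 1, len, fp) != len) goto fail;
    if (fflush(fp) == EOF) goto fail;        /* stdio buffer -> kernel */
    if (fsync(fileno(fp)) == -1) goto fail;  /* kernel -> disk */
    if (fclose(fp) == EOF) { fp = NULL; goto fail; }
    fp = NULL;

    /* Same directory, so same filesystem: rename() swaps atomically. */
    if (rename(tmp, path) == -1) goto fail;
    return 0;

fail:
    if (fp) fclose(fp);
    unlink(tmp);
    return -1;
}
```

Keeping the temp file in the same directory as the target matters: rename() is only atomic within a single filesystem.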

The RDB Binary Format

The rdbSaveRio() function writes the data in a custom binary format. A simplified view:

[REDIS][version][aux fields]
  For each database:
    [SELECTDB][db number]
    [RESIZE_DB][key count][expire count]
    For each key:
      [optional: EXPIRETIME ms][unix timestamp]
      [type byte]
      [encoded key]
      [encoded value]
[EOF]
[8-byte CRC64 checksum]

The encoding is type-specific and compact. Integers are stored as variable-length integers (not ASCII), strings use length-prefixed bytes, and complex types like sorted sets use their own binary representations. The final CRC64 detects corruption.
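To make "length-prefixed" concrete, here is a sketch of how a length can be packed into one, two, five, or nine bytes, with the top two bits of the first byte selecting the width. It follows my understanding of the RDB length encoding, but treat the details as an assumption and check rdb.h and rdbSaveLen() in rdb.c before relying on them; `rdb_encode_len` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes `len` into `out` (which must hold at least 9 bytes).
 * Returns the number of bytes used. */
size_t rdb_encode_len(uint64_t len, unsigned char *out) {
    if (len < (1u << 6)) {
        out[0] = (unsigned char)len;                  /* 00xxxxxx */
        return 1;
    }
    if (len < (1u << 14)) {
        out[0] = (unsigned char)((len >> 8) | 0x40);  /* 01xxxxxx */
        out[1] = (unsigned char)(len & 0xFF);
        return 2;
    }
    if (len <= UINT32_MAX) {
        out[0] = 0x80;                                /* 32-bit marker */
        for (int i = 0; i < 4; i++)                   /* big-endian */
            out[1 + i] = (unsigned char)(len >> (24 - 8 * i));
        return 5;
    }
    out[0] = 0x81;                                    /* 64-bit marker */
    for (int i = 0; i < 8; i++)                       /* big-endian */
        out[1 + i] = (unsigned char)(len >> (56 - 8 * i));
    return 9;
}
```

Short strings, which dominate most datasets, therefore pay only a single byte of length overhead.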

When the Child Finishes

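The parent detects child completion in serverCron by polling with wait3() and the WNOHANG flag, which returns immediately if no child has exited (newer versions use waitpid() the same way). When the finished child is the RDB saver, its exit status is handed to the done handler: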

// In server.c, serverCron():
if (pid == server.rdb_child_pid) {
    backgroundSaveDoneHandler(exitcode, bysignal);
}

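On success, backgroundSaveDoneHandler records the time of the save, logs the result, and subtracts dirty_before_bgsave from server.dirty rather than zeroing it, so changes made while the child was running still count toward the next save point. If the save failed, Redis logs the error, marks the last background save as failed, and lets serverCron try again later.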
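A non-blocking child check of this kind can be sketched with waitpid() and WNOHANG. The `poll_child` helper is hypothetical; only the way exitcode and bysignal are derived is meant to mirror what the handler receives.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Checks one child without blocking.
 * Returns 1 and fills exitcode/bysignal if the child has finished,
 * 0 if it is still running, -1 on error. */
int poll_child(pid_t pid, int *exitcode, int *bysignal) {
    int status;
    pid_t r = waitpid(pid, &status, WNOHANG);
    if (r == 0) return 0;   /* still running: come back on the next tick */
    if (r == -1) return -1;

    *exitcode = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    *bysignal = WIFSIGNALED(status) ? WTERMSIG(status) : 0;
    return 1;
}
```

Because the call returns immediately when the child is still running, it is cheap enough to run from a timer.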


Part 2: AOF — Write-Ahead Logging

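RDB snapshots are efficient but have a gap: if Redis crashes 50 seconds after a snapshot, you lose 50 seconds of writes. The AOF closes this gap by recording every mutating command to a log file. (Strictly speaking, Redis appends each command after executing it, so the AOF is a redo log rather than a true write-ahead log.)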

How AOF Works at the Call Site

Every command that modifies data calls propagate() after execution. Inside propagate(), feedAppendOnlyFile() is called:

// Simplified from aof.c
void feedAppendOnlyFile(int dictid, robj **argv, int argc) {
    sds buf = sdsempty();
 
    // If the client is on a different database, emit SELECT first
    if (server.aof_selected_db != dictid) {
        char seldb[64];
        snprintf(seldb, sizeof(seldb), "%d", dictid);
        buf = sdscatprintf(buf, "*2\r\n$6\r\nSELECT\r\n$%lu\r\n%s\r\n",
                           (unsigned long)strlen(seldb), seldb);
        server.aof_selected_db = dictid;
    }
 
    // Translate relative expiries to absolute timestamps
    // (same reason as in replication: avoid drift)
    if (argc > 3 && !strcasecmp(argv[0]->ptr, "set")) {
        // SET with options has more than 3 arguments.
        // Rewrite SET key value EX 60
        // to:    SET key value PXAT <absolute_ms>
    }
 
    // Encode the command in RESP format and append to buffer
    buf = catAppendOnlyGenericCommand(buf, argc, argv);
    server.aof_buf = sdscatlen(server.aof_buf, buf, sdslen(buf));
    sdsfree(buf);
}

The command is serialized in RESP format (the same protocol Redis uses for client communication) and appended to server.aof_buf — an in-memory buffer. It does NOT immediately write to disk.
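RESP encoding of a command is simple enough to reproduce by hand: an array header followed by one length-prefixed bulk string per argument. The sketch below uses a hypothetical `resp_encode` helper and plain C strings, so unlike Redis's own sds-based code it is not binary-safe.

```c
#include <stdio.h>
#include <string.h>

/* Encodes argv[0..argc-1] as a RESP array of bulk strings.
 * Returns the number of bytes written (excluding the trailing NUL),
 * or -1 if `out` is too small. */
int resp_encode(char *out, size_t cap, int argc, const char **argv) {
    size_t used = 0;

    /* Array header: *<number of arguments> */
    int n = snprintf(out, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    used += (size_t)n;

    /* One bulk string per argument: $<length>\r\n<bytes>\r\n */
    for (int i = 0; i < argc; i++) {
        n = snprintf(out + used, cap - used, "$%zu\r\n%s\r\n",
                     strlen(argv[i]), argv[i]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Encoding SET mykey hello this way yields the seven-line entry you would find in an AOF file: *3, $3, SET, $5, mykey, $5, hello.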

The Three fsync Strategies

Writing to aof_buf is fast, but durability requires the data to reach the disk. This happens in flushAppendOnlyFile(), called from the event loop's "before sleep" hook:

void flushAppendOnlyFile(int force) {
    ssize_t nwritten;
 
    if (sdslen(server.aof_buf) == 0) {
        // Nothing to flush, but still maybe fsync on a timer
        if (server.aof_fsync == AOF_FSYNC_EVERYSEC && ...) {
            goto try_fsync;
        }
        return;
    }
 
    // Write the buffer to the file descriptor
    nwritten = aofWrite(server.aof_fd,
                        server.aof_buf,
                        sdslen(server.aof_buf));
 
    // Truncate the buffer (data is now in OS page cache)
    sdsclear(server.aof_buf);
 
try_fsync:
    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        // Flush OS page cache to physical disk NOW, before replying to clients
        // Guarantees durability: no acknowledged write is lost on crash
        // Cost: one fsync per event-loop iteration that wrote data
        //       (very slow on spinning disk)
        redis_fsync(server.aof_fd);

    } else if (server.aof_fsync == AOF_FSYNC_EVERYSEC) {
        // Delegate fsync to a background thread, at most once per second
        // About 1 second of data lost on crash
        // Cost: ~1 fsync per second (the practical default)
        if (server.unixtime > server.aof_last_fsync) {
            aof_background_fsync(server.aof_fd);
            server.aof_last_fsync = server.unixtime;
        }

    } else {
        // AOF_FSYNC_NO: never call fsync explicitly
        // OS decides when to flush (on Linux, typically within ~30 seconds)
        // Up to ~30 seconds of data lost on crash
        // Cost: fastest, but lowest durability guarantee
    }
}

The three strategies represent a durability vs. performance trade-off:

Strategy    Max data loss    Performance              Use case
always      0 commands       Slow (fsync per write)   Financial, critical data
everysec    ~1 second        Fast (default)           Most applications
no          OS-dependent     Fastest                  Cache-only, data is reproducible

The write() syscall moves data from aof_buf into the OS kernel's page cache — this is fast but not durable. fsync() forces the OS to flush the page cache to the physical disk. The gap between these two calls is the window where a crash can lose data.
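The two steps can be seen side by side in a small standalone sketch. `open_log` and `append_record` are hypothetical helpers, not Redis functions; the `do_fsync` flag stands in for the choice between the always and no strategies.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Opens (or creates) a log file in append mode. Returns fd or -1. */
int open_log(const char *path) {
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}

/* Appends `len` bytes to `fd`. With do_fsync != 0 the data is also
 * forced to stable storage before returning.
 * Returns 0 on success, -1 on failure. */
int append_record(int fd, const char *buf, size_t len, int do_fsync) {
    size_t off = 0;
    while (off < len) {
        /* write() may be partial or interrupted: loop until done. */
        ssize_t n = write(fd, buf + off, len - off);
        if (n == -1) {
            if (errno == EINTR) continue;
            return -1;
        }
        off += (size_t)n;
    }
    /* After write() the data is only in the kernel page cache. */
    if (do_fsync && fsync(fd) == -1) return -1;
    return 0;
}
```

Calling `append_record(fd, buf, len, 0)` returns as soon as the kernel has the bytes, whereas passing 1 waits for the disk, which is where the latency difference between the strategies comes from.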

AOF Rewrite: Compacting the Log

AOF files grow without bound: every INCR appends another entry, even if the key was incremented a million times. An AOF for a counter that went from 0 to 1,000,000 would hold a million INCR entries, even though the current state is just SET counter 1000000.

The solution is BGREWRITEAOF, which creates a new minimal AOF representing the current state:

BGREWRITEAOF   # Trigger manually
# Redis also does this automatically based on the auto-aof-rewrite-min-size
# and auto-aof-rewrite-percentage config values
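The automatic trigger compares the current AOF size with its size after the previous rewrite. The sketch below shows that arithmetic under stated assumptions: `should_rewrite_aof` is a hypothetical function, and its parameters stand in for the auto-aof-rewrite-min-size and auto-aof-rewrite-percentage settings and for the sizes Redis tracks internally.

```c
/* Returns 1 when an automatic rewrite should start.
 * current_size: AOF size now, in bytes
 * base_size:    AOF size right after the previous rewrite (0 if none)
 * min_size:     do not rewrite files smaller than this
 * percentage:   required growth over base_size; 0 disables the trigger */
int should_rewrite_aof(long long current_size, long long base_size,
                       long long min_size, int percentage) {
    if (percentage <= 0) return 0;
    if (current_size <= min_size) return 0;

    long long base = base_size > 0 ? base_size : 1;  /* avoid divide by zero */
    long long growth = (current_size * 100 / base) - 100;
    return growth >= percentage;
}
```

With a percentage of 100 and a minimum of 64 MB, a file that was 100 MB after the last rewrite is rewritten again once it reaches 200 MB.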

The implementation in aofRewriteBackground() again uses fork():

int aofRewriteBackground(void) {
    pid_t childpid;
 
    if (hasActiveChildProcess()) return C_ERR;
 
    if ((childpid = redisFork(CHILD_TYPE_AOF)) == 0) {
        // === Child process ===
        char tmpfile[256];
        snprintf(tmpfile, 256, "temp-rewriteaof-bg-%d.aof",
                 (int) getpid());
        if (rewriteAppendOnlyFile(tmpfile) == C_OK) {
            sendChildCowInfo(CHILD_INFO_TYPE_AOF_COW_SIZE, "AOF rewrite");
            exitFromChild(0);
        } else {
            exitFromChild(1);
        }
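    } else {
        // === Parent process ===
        if (childpid == -1) {
            // fork() failed
            return C_ERR;
        }
        server.aof_child_pid = childpid;
        // Start accumulating commands into the secondary buffer
        // (the reset call releases any old block list and creates a fresh one)
        aofRewriteBufferReset();
        return C_OK;
    }
    return C_OK; // unreachable
}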

The Dual-Buffer Problem

Here is the most subtle engineering challenge in AOF: what happens to writes that arrive while the rewrite is in progress?

The child has a point-in-time snapshot (via COW), but the parent keeps serving client writes. If those new writes go only to aof_buf (the main AOF file), the new compact AOF won't include them. When the rewrite finishes and we swap to the new file, those writes would be lost.

Redis solves this with a secondary buffer: aof_rewrite_buf_blocks. While rewrite is running, every new mutating command is written to both aof_buf (the current AOF file) and aof_rewrite_buf_blocks (the accumulation buffer):

void feedAppendOnlyFile(int dictid, robj **argv, int argc) {
    // ... encode command into buf ...
 
    // Always append to main AOF buffer
    server.aof_buf = sdscatlen(server.aof_buf, buf, sdslen(buf));
 
    // ALSO append to rewrite buffer if rewrite is in progress
    if (server.aof_child_pid != -1)
        aofRewriteBufferAppend((unsigned char*)buf, sdslen(buf));
}

When the child finishes and signals the parent:

void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    if (!bysignal && exitcode == 0) {
        // 1. Open the new compact AOF file
        newfd = open(tmpfile, O_WRONLY|O_APPEND);
 
        // 2. Append the accumulated secondary buffer to it
        //    (all writes that arrived during rewrite)
        if (aofRewriteBufferWrite(newfd) == -1) { ... }
 
        // 3. Atomic rename: new file replaces old
        rename(tmpfile, server.aof_filename);
 
        // 4. Switch file descriptor to the new file
        oldfd = server.aof_fd;
        server.aof_fd = newfd;
        close(oldfd);  // This can be slow, but happens after rename
    }
}

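This guarantees no writes are lost. The secondary buffer is the bridge between the child's frozen snapshot and the parent's live write stream. Note that this mechanism describes Redis before 7.0; from 7.0 on, multi-part AOF has the parent write new commands to a fresh incremental AOF file during the rewrite, so the in-memory rewrite buffer is no longer needed.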
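The whole rewrite protocol can be modeled in memory with a single counter. This is a toy simulation, not Redis code: every name in it is hypothetical, and the "child" is just a replay of the log as it stood at fork time.

```c
#include <stddef.h>

enum op_type { OP_INCR, OP_SET };
struct op { enum op_type type; long value; };

#define LOG_CAP 64
struct oplog { struct op ops[LOG_CAP]; size_t len; };

/* Appends one entry; entries beyond LOG_CAP are dropped in this toy model. */
static void log_append(struct oplog *l, struct op o) {
    if (l->len < LOG_CAP) l->ops[l->len++] = o;
}

/* Rebuilds the counter by replaying a log from the start. */
long replay(const struct oplog *l) {
    long counter = 0;
    for (size_t i = 0; i < l->len; i++) {
        if (l->ops[i].type == OP_INCR) counter++;
        else counter = l->ops[i].value;
    }
    return counter;
}

/* Returns 1 if the compacted log replays to the same state as the old one. */
int simulate_rewrite(int writes_before, int writes_during) {
    struct oplog old_log = { .len = 0 };
    struct oplog rewrite_buf = { .len = 0 };
    struct oplog new_log = { .len = 0 };
    struct op incr = { OP_INCR, 0 };

    for (int i = 0; i < writes_before; i++) log_append(&old_log, incr);

    /* "fork": the child's view is frozen at this point. */
    long snapshot = replay(&old_log);

    /* Writes during the rewrite go to BOTH the old log and the buffer. */
    for (int i = 0; i < writes_during; i++) {
        log_append(&old_log, incr);
        log_append(&rewrite_buf, incr);
    }

    /* Child output: one SET replaces all earlier INCR entries. */
    log_append(&new_log, (struct op){ OP_SET, snapshot });

    /* Parent: append the buffered writes, then the new log goes live. */
    for (size_t i = 0; i < rewrite_buf.len; i++)
        log_append(&new_log, rewrite_buf.ops[i]);

    return replay(&new_log) == replay(&old_log);
}
```

Dropping the loop that appends `rewrite_buf` makes the function return 0 whenever `writes_during` is non-zero, which is exactly the data loss the secondary buffer prevents.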


Part 3: Loading on Startup

When Redis starts, loadDataFromDisk() in server.c decides what to load:

void loadDataFromDisk(void) {
    long long start = ustime();
 
    if (server.aof_state == AOF_ON) {
        // AOF takes priority — it's more complete
        if (loadAppendOnlyFiles(server.aof_manifest) == AOF_FAILED) {
            exit(1);
        }
    } else {
        // No AOF, try RDB
        rdbSaveInfo rsi = RDB_SAVE_INFO_INIT;
        errno = 0;
        int rdb_flags = RDBFLAGS_MAIN_FILE;
        if (rdbLoad(server.rdb_filename, &rsi, rdb_flags) == C_OK) {
            serverLog(LL_NOTICE, "DB loaded from disk: %.3f seconds",
                      (float)(ustime()-start)/1000000);
        } else if (errno != ENOENT) {
            serverLog(LL_WARNING, "Fatal error loading RDB file: %s. Exiting.",
                      server.rdb_filename);
            exit(1);
        }
    }
}

AOF has priority because it is more up-to-date. As the code shows, the RDB branch runs only when AOF is disabled: with AOF enabled, Redis does not fall back to dump.rdb.

The AOF loader (loadSingleAppendOnlyFile()) replays commands by creating a fake client and executing each command as if it arrived from the network — using the exact same command dispatch path as normal operation. This reuse of the command table means AOF loading benefits from any future command optimizations automatically.


Part 4: RDB + AOF Combined (Hybrid Persistence)

Redis 4.0 introduced a hybrid mode that combines both:

# redis.conf
aof-use-rdb-preamble yes

When BGREWRITEAOF runs in hybrid mode, the child writes an RDB-format preamble (fast binary encoding) followed by AOF-format commands for writes that happened after the snapshot:

[RDB binary data — compact, fast to load]
[AOF commands — only what changed since the RDB preamble]

This means:

  • Load time is as fast as RDB (bulk binary decoding, not command replay)
  • Data loss is as small as AOF (at most 1 second with everysec)
  • File size is smaller than a plain AOF after many small writes

The loader detects the preamble by checking if the file starts with the RDB magic bytes (REDIS). If it does, it loads the RDB portion, then switches to AOF replay for the tail.
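The detection step amounts to reading the first five bytes of the file. A minimal sketch, with `has_rdb_preamble` as a hypothetical name:

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the file starts with the RDB magic "REDIS",
 * 0 if it does not, and -1 if the file cannot be opened. */
int has_rdb_preamble(const char *path) {
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;

    char sig[5];
    size_t n = fread(sig, 1, sizeof(sig), fp);
    fclose(fp);

    /* A plain AOF starts with a RESP array header such as "*2". */
    return n == sizeof(sig) && memcmp(sig, "REDIS", 5) == 0;
}
```

A real loader would keep the file open and continue reading from the same position; this version only answers the yes/no question.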



Key Design Principles

1. Never block the event loop for disk I/O
Both BGSAVE and BGREWRITEAOF use fork() to move the heavy work of serializing the dataset to a child process. The parent pays only for the fork() call itself and then continues serving requests.

2. fork() + copy-on-write makes snapshots cheap
The OS shares memory pages between parent and child until one of them writes. Starting a snapshot of a 4 GB dataset means copying only the page tables, which typically takes milliseconds, rather than copying gigabytes of data.

3. Write to temp file, rename atomically
Both RDB and AOF rewrite use the temp-file-then-rename pattern. A crash mid-write never corrupts the existing file.

4. The secondary buffer bridges two timelines
During AOF rewrite, new writes go to both the live AOF and a secondary buffer. When the rewrite finishes, the buffer is appended to the new file, ensuring no commands fall through the gap.

5. AOF replays commands through the normal dispatch path
Startup loading creates a fake client and calls the same setCommand, incrCommand, etc. that production traffic uses. No special loader logic is needed.


How to Explore the Code Yourself

# Key functions to read in order (run from the repository root):
grep -n "rdbSaveBackground" src/rdb.c     # BGSAVE entry point
grep -n "rdbSave\b" src/rdb.c             # The child's work
grep -n "feedAppendOnlyFile" src/aof.c    # Where commands are logged
grep -n "flushAppendOnlyFile" src/aof.c   # The fsync decision
grep -n "aofRewriteBackground" src/aof.c  # BGREWRITEAOF entry point
grep -n "backgroundRewriteDoneHandler" src/aof.c  # Secondary buffer merge
grep -n "loadDataFromDisk" src/server.c   # Startup loading logic

To see what's in an RDB file:

# Redis ships a tool for this
src/redis-check-rdb dump.rdb
 
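# redis-cli --rdb does not inspect a file: it downloads a fresh snapshot
# from a running server and saves it to the given path
redis-cli --rdb /path/to/dump.rdb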

To see what's in an AOF file — it's human-readable RESP:

head -50 appendonly.aof
# You'll see:
# *3
# $3
# SET
# $5
# mykey
# $5
# hello

Summary

Redis persistence is a study in avoiding the obvious solutions:

✅ Instead of pausing to snapshot, Redis fork()s and uses copy-on-write
✅ Instead of risking corrupt files, Redis writes to temp files and renames atomically
✅ Instead of losing writes during AOF rewrite, Redis accumulates them in a secondary buffer
✅ Instead of special loading code, AOF replay reuses the normal command dispatch path
✅ Instead of choosing between RDB and AOF, hybrid mode combines both

The next time you configure save 300 10 or appendfsync everysec, you'll know exactly what code runs, why those defaults exist, and what trade-off you're making.

