
Galera Cluster Architecture: Synchronous Multi-Master Replication


Post 2 showed you what MySQL replication does well and where it breaks. Async replication has a durability gap. Semi-sync closes the gap partially but can silently fall back. Failover requires external tools and ~30 seconds of downtime. And none of it gives you multi-master writes.

Galera takes a fundamentally different approach. Instead of streaming binary log events from one node to another, Galera packages each transaction into a write-set, broadcasts it to all nodes, and uses a certification algorithm to decide — atomically, on every node — whether that transaction can commit.

This post explains how that works, from protocol layer to operational behavior.


What Makes Galera Different

Traditional MySQL replication is asynchronous and single-primary: one node accepts writes, others replay them later. Galera is virtually synchronous and multi-master: every node can accept writes, and every committed transaction is guaranteed to exist on all nodes.

Key differences at a glance:

Property            | Async Replication              | Galera
--------------------|--------------------------------|-------------------------------
Write nodes         | 1 (primary)                    | All nodes
Commit guarantee    | Local only                     | All nodes in quorum
Replication lag     | Yes (ms to hours)              | None
Failover            | External tool + downtime       | Automatic, zero data loss
Conflict resolution | Not applicable (single writer) | Certification-based
Protocol            | Binlog streaming               | wsrep (Write-Set Replication)

The term "virtually synchronous" is important: Galera doesn't use distributed two-phase commit (2PC). Instead, it uses a group communication protocol where all nodes agree on transaction ordering, then each node independently applies the transaction. The commit is synchronous from the application's perspective — when your COMMIT returns OK, every node has the write-set — but the actual apply step happens asynchronously on remote nodes.


The wsrep API

Galera doesn't modify the MySQL server itself. Instead, it uses a replication plugin API called wsrep (Write-Set Replication). This is the interface between MySQL (or MariaDB, or Percona Server) and the Galera replication library.

The wsrep API defines a set of callbacks:

  • wsrep_apply_cb — called when a remote write-set arrives and needs to be applied locally
  • wsrep_commit_cb — called to commit a remotely applied write-set
  • wsrep_unordered_cb — for unordered events (not commonly used)
  • wsrep_sst_donate_cb — called on the donor node during State Snapshot Transfer

From the MySQL side, the wsrep API hooks into the commit path:

  1. A transaction executes normally on the local node (INSERT, UPDATE, DELETE)
  2. At COMMIT time, MySQL builds a write-set (the set of row changes) and passes it to the Galera library via wsrep
  3. Galera broadcasts the write-set to all nodes using group communication
  4. All nodes certify the write-set (check for conflicts)
  5. If certification passes, the transaction commits on all nodes
  6. If certification fails, the originating node rolls back and returns an error

The key insight: Galera doesn't replicate SQL statements. It replicates the effect of statements — the actual row-level changes. This is similar to row-based binary logging, but the format and protocol are entirely different.

wsrep Provider Options

The Galera library is loaded via the wsrep_provider configuration variable:

[mysqld]
wsrep_provider=/usr/lib64/galera4/libgalera_smm.so
wsrep_cluster_address=gcomm://node1,node2,node3
wsrep_cluster_name=my_cluster
wsrep_node_name=node1
wsrep_node_address=192.168.1.10
wsrep_sst_method=xtrabackup-v2

The wsrep_provider_options variable controls the Galera library's internal behavior:

wsrep_provider_options="gcache.size=1G; evs.keepalive_period=PT3S; evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M"

We'll cover tuning these in detail in Post 8 (Performance Tuning), but the important ones to know now:

  • gcache.size — Size of the Galera write-set cache (used for IST). Default is 128MB, production clusters typically use 1–4GB
  • evs.suspect_timeout — How long before a node that stops responding is suspected of being dead
  • evs.inactive_timeout — How long before a suspected node is evicted from the cluster

Write-Set Replication: How Transactions Flow

This is the heart of Galera. Let's trace a single write from application to commit across all nodes.

Step 1: Local Execution

The application connects to any Galera node and executes a transaction:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
UPDATE accounts SET balance = balance + 100 WHERE id = 99;
COMMIT;

During the BEGIN ... COMMIT block, the transaction executes locally on the connected node — just like any normal MySQL transaction. InnoDB acquires row locks, modifies pages in the buffer pool, writes redo log entries. At this point, no other node knows about this transaction.

Step 2: Write-Set Construction

At COMMIT time, the wsrep hook intercepts the commit. MySQL builds a write-set containing:

  • The row changes (before and after images for each modified row)
  • The key set — primary keys of all rows touched (used for certification)
  • The database and table names
  • The transaction's GTID (Galera uses its own GTID format, separate from MySQL's)

The write-set is a compact binary representation — it doesn't contain SQL text, just the row-level diff.
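As a mental model, here's that structure as a small Python sketch (field names are illustrative, not Galera's actual binary format):

from dataclasses import dataclass, field

@dataclass
class WriteSet:
    # Simplified model of a write-set; names are illustrative,
    # not Galera's wire format.
    gtid: str                                        # cluster UUID + sequence number
    keys: list = field(default_factory=list)         # primary keys of touched rows
    row_changes: list = field(default_factory=list)  # (before_image, after_image) pairs
    tables: list = field(default_factory=list)       # (database, table) pairs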

Step 3: Broadcast (Total Order)

Galera sends the write-set to all nodes using a group communication system (GCS). The GCS ensures total order delivery: every node receives write-sets in exactly the same sequence, even if multiple nodes are sending write-sets simultaneously.

The sequence number (seqno) is critical. It guarantees that all nodes process write-sets in the same order. This is what makes certification deterministic — every node sees the same history and reaches the same certification decision independently.

Step 4: Certification

When a node receives a write-set (including the originating node), it runs the certification test. This is a conflict detection algorithm that checks:

"Does this write-set's key set overlap with any write-set that was certified AFTER this transaction started but BEFORE this transaction's sequence number?"

In simpler terms: did any other transaction modify the same rows between when this transaction started reading and when it tried to commit?

The certification index is an in-memory data structure that maps primary keys to the sequence number of the last write-set that modified them. It's lightweight and fast.
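Here's a toy model of that test in Python (illustrative only; the real index is a C++ structure with interval bookkeeping and purging of entries no active transaction can still conflict with):

class CertificationIndex:
    def __init__(self):
        self.last_modified = {}  # row key -> seqno of the last certified write-set

    def certify(self, keys, start_seqno, seqno):
        # Conflict: some key was modified by a write-set certified after
        # this transaction took its snapshot (start_seqno) but before its
        # own slot in the total order (seqno).
        for key in keys:
            if self.last_modified.get(key, 0) > start_seqno:
                return False  # FAIL: originator rolls back with error 1213
        for key in keys:
            self.last_modified[key] = seqno  # PASS: record this write-set
        return True

Because every node feeds the same totally ordered stream of write-sets through the same test, all nodes reach the same verdict without exchanging a single vote.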

If certification passes: The write-set is added to the apply queue. On the originating node, the transaction commits immediately. On remote nodes, a background applier thread replays the write-set.

If certification fails: The originating node receives a DEADLOCK error (error code 1213) and must retry the transaction. Remote nodes simply discard the write-set.

Why Certification Works

Certification is deterministic: every node has the same certification index (because they process write-sets in the same order), so every node reaches the same PASS/FAIL decision independently. No voting round is needed.

This is fundamentally different from traditional two-phase commit (2PC):

  • 2PC: Coordinator asks "can you commit?" → nodes vote → coordinator decides → nodes commit/abort
  • Galera certification: Each node independently determines if the write-set can commit, and they always agree because they see the same sequence of events

This makes Galera faster than 2PC for normal operations — there's only one network round-trip (broadcast), not two (prepare + commit).


Certification Conflicts: When Writes Collide

In a multi-master cluster, two nodes might accept writes that modify the same row at the same time. Galera handles this through certification — but your application must be prepared for certification failures.

Example: Two Concurrent Updates

Suppose transactions on Node 1 and Node 2 update the same row at nearly the same time. Both write-sets are broadcast, and total ordering assigns Node 1's write-set the lower sequence number, so it certifies first on every node. Node 2's transaction was valid locally, but globally it conflicts with Node 1's transaction that got the lower sequence number. Node 2 returns a deadlock error to the application.

Handling Certification Failures in Your Application

Your application must handle error 1213 with retry logic:

import mysql.connector
import time
 
def execute_with_retry(connection, query, params=None, max_retries=3):
    for attempt in range(max_retries):
        try:
            cursor = connection.cursor()
            # autocommit is off by default, so this opens a transaction
            cursor.execute(query, params)
            connection.commit()
            return cursor
        except mysql.connector.Error as e:
            connection.rollback()  # clear the failed transaction before retrying
            if e.errno == 1213 and attempt < max_retries - 1:
                # Certification failure — retry with exponential backoff
                time.sleep(0.1 * (2 ** attempt))
                continue
            raise

Key insight: Certification failures are expected in multi-master mode. They are not bugs — they are the mechanism that preserves consistency. If you never get certification failures, you're either writing to a single node (which means ProxySQL is routing correctly) or your workload has no write contention.

Minimizing Certification Conflicts

In practice, most production Galera deployments route all writes to a single node via ProxySQL. This eliminates certification conflicts entirely while keeping the multi-master capability as a failover feature.

If you do write to multiple nodes:

  • Avoid hot rows: A single row updated from multiple nodes will conflict frequently
  • Keep transactions small: Fewer rows touched = fewer keys in the write-set = lower conflict probability
  • Avoid long transactions: A transaction that runs for 5 seconds has a much larger window for conflicts than one that runs for 5ms
  • Use optimistic locking in the application: Check-and-set patterns reduce the blast radius of conflicts (sketched below)
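A minimal check-and-set sketch, assuming the table carries a version column (column names and values here are illustrative):

-- Read the current state, remembering the version
SELECT balance, version FROM accounts WHERE id = 42;
-- Suppose this returned balance = 950, version = 7.
-- Write back only if nobody changed the row in the meantime:
UPDATE accounts
SET balance = 850, version = version + 1
WHERE id = 42 AND version = 7;
-- 0 rows affected means another writer won; re-read and retry.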

Galera Node States

Each node in a Galera cluster transitions through a series of states. Understanding these states is essential for operations — they tell you whether a node is healthy, joining, or in trouble.

State Details

State   | wsrep_local_state | Meaning
--------|-------------------|--------------------------------------------------------
Open    | 1                 | Node has joined the group communication but hasn't received the cluster state yet
Primary | 2                 | Node has received the primary component view (knows who's in the cluster)
Joiner  | 3                 | Node is receiving a state transfer (SST or IST) from a donor
Joined  | 4                 | State transfer complete, node is catching up with the apply queue
Synced  | 5                 | Node is fully synchronized — safe for reads and writes
Donor   | 6                 | Node is sending state to a joining node

Monitoring Node State

-- Current node state
SHOW STATUS LIKE 'wsrep_local_state_comment';
-- Output: Synced
 
-- Is the node ready to accept queries?
SHOW STATUS LIKE 'wsrep_ready';
-- Output: ON
 
-- Is the node part of the primary component?
SHOW STATUS LIKE 'wsrep_cluster_status';
-- Output: Primary
 
-- How many nodes in the cluster?
SHOW STATUS LIKE 'wsrep_cluster_size';
-- Output: 3

Important: A node with wsrep_ready = OFF should not receive traffic. ProxySQL checks this automatically via health checks (covered in Post 5).

The Donor State Problem

When a new node joins and needs SST (full state transfer), the donor node enters the Donor state. Depending on the SST method:

  • xtrabackup-v2: Donor remains available for reads and writes during SST. This is the recommended method.
  • rsync: Donor is blocked for writes during the transfer. Avoid in production.
  • mysqldump: Donor is available but under heavy load. Slow for large datasets.

This is why Percona XtraDB Cluster defaults to xtrabackup-v2 — it's the only SST method that keeps the donor fully operational.


Flow Control: The Cluster-Wide Throttle

In any replication system, nodes can fall behind. In async replication, you see this as replication lag. In Galera, the cluster doesn't allow lag — instead, it uses flow control to slow down the entire cluster when a node can't keep up.

How Flow Control Works

Each node has an apply queue (also called the receive queue) — write-sets that have been certified but not yet applied to the local database. If this queue grows too large, the node sends a flow control PAUSE message to the group. All nodes then pause certification until the slow node catches up.

Flow Control Configuration

# Number of write-sets in the receive queue before sending FC_PAUSE
# (gcs.fc_limit, default: 16) and the fraction of fc_limit at which
# FC_RESUME is sent (gcs.fc_factor, default: 0.5, i.e. resume when the
# queue drops to 50% of fc_limit). Note that wsrep_provider_options is
# a single string; a second assignment overrides the first, so combine
# all options in one line:
wsrep_provider_options="gcs.fc_limit=64; gcs.fc_factor=0.5; gcs.fc_master_slave=NO"

Monitoring Flow Control

-- Fraction of time the cluster has been paused due to flow control (0.0 = never, 1.0 = always)
SHOW STATUS LIKE 'wsrep_flow_control_paused';
-- Target: < 0.01 (less than 1% paused)
 
-- Number of flow control PAUSE events sent by this node
SHOW STATUS LIKE 'wsrep_flow_control_sent';
 
-- Current receive queue length
SHOW STATUS LIKE 'wsrep_local_recv_queue_avg';
-- Target: < 1.0

Red flag: If wsrep_flow_control_paused is consistently above 0.1 (10%), your cluster's write throughput is being throttled. Common causes:

  • Slow disks on one node — SSDs vs spinning disks
  • Unequal hardware — one node has less RAM or slower CPU
  • Large transactions — a single INSERT ... SELECT moving millions of rows blocks the apply queue
  • DDL operations — ALTER TABLE under TOI mode blocks all nodes
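A small watchdog sketch for catching this early, assuming a monitoring user exists on each node (wsrep_flow_control_paused reports the paused fraction since the last FLUSH STATUS, so the script flushes after each read to get per-interval values):

import mysql.connector

def check_flow_control(host, threshold=0.1):
    conn = mysql.connector.connect(host=host, user="monitor", password="...")
    cur = conn.cursor()
    cur.execute("SHOW STATUS LIKE 'wsrep_flow_control_paused'")
    _, paused = cur.fetchone()
    cur.execute("FLUSH STATUS")  # reset the counter for the next interval
    conn.close()
    if float(paused) > threshold:
        print(f"{host}: flow control active {float(paused):.0%} of the time")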

Quorum and Split-Brain Protection

Distributed systems must handle network partitions — what happens when nodes can't communicate? Galera uses a quorum-based approach to prevent split-brain (two halves of the cluster accepting conflicting writes independently).

The Quorum Rule

A partition must contain more than half of the last known cluster membership to remain operational. This is called the Primary Component.

For a 3-node cluster:

  • 2 nodes online: Quorum (2 > 3/2). Cluster operates normally.
  • 1 node online: No quorum (1 ≤ 3/2). Node refuses writes.
  • 3 nodes online: Quorum. Normal operation.

Why Odd Numbers Matter

With an even number of nodes, a perfect 50/50 split means neither side has quorum:

Cluster Size | Tolerated Failures | Minimum for Quorum
-------------|--------------------|-------------------
2 nodes      | 0 (!)              | 2
3 nodes      | 1                  | 2
4 nodes      | 1                  | 3
5 nodes      | 2                  | 3
6 nodes      | 2                  | 4
7 nodes      | 3                  | 4

A 2-node cluster is dangerous: if one node dies, the other has no quorum and stops accepting writes. You need at least 3 nodes. And notice that 4 nodes tolerate the same number of failures as 3 — so 3 is the sweet spot for most clusters.

If you need a 2-node setup (e.g., two data centers), you can use a Galera Arbitrator (garbd) — a lightweight daemon that participates in quorum voting but doesn't store data. It acts as the tiebreaker.
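A minimal garbd invocation might look like this (addresses and cluster name must match wsrep_cluster_address and wsrep_cluster_name; the values here are illustrative):

# Run on a third machine that can reach both data nodes
garbd --address gcomm://192.168.1.10,192.168.1.11 \
      --group my_cluster \
      --daemon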

Monitoring Quorum

-- Current cluster status
SHOW STATUS LIKE 'wsrep_cluster_status';
-- "Primary" = in quorum
-- "Non-primary" = lost quorum, refusing writes
 
-- Cluster size (number of nodes in the current view)
SHOW STATUS LIKE 'wsrep_cluster_size';
 
-- This node's ID in the cluster
SHOW STATUS LIKE 'wsrep_local_index';

Recovering from a Non-Primary State

If all nodes lose quorum (e.g., a 3-node cluster reboots simultaneously), you need to bootstrap one node:

# Find the node with the most recent data
# Check grastate.dat on each node:
cat /var/lib/mysql/grastate.dat
 
# Look for the highest seqno:
# ...
# seqno: 12345
# safe_to_bootstrap: 1
# ...
 
# Bootstrap from the most advanced node:
# On the chosen node:
mysqld --wsrep-new-cluster
 
# Then start the other nodes normally:
systemctl start mysql

Never bootstrap a node that doesn't have the latest data — you'll lose committed transactions.


SST vs IST: How Nodes (Re)Join the Cluster

When a node joins the cluster — either for the first time or after a crash — it needs the current dataset. Galera has two mechanisms:

SST (State Snapshot Transfer)

A full copy of the entire dataset from a donor node to the joining node. Used when:

  • A node joins for the first time
  • The node was down so long that the donor's gcache doesn't cover the gap
  • IST is not possible for any reason

SST methods:

Method                | Donor Impact           | Speed    | Recommended
----------------------|------------------------|----------|----------------------
xtrabackup-v2         | Minimal (non-blocking) | Fast     | ✅ Yes
rsync                 | Blocks the donor       | Moderate | ❌ No (for production)
mysqldump             | High CPU/IO            | Slow     | ❌ No
clone (MySQL 8.0.17+) | Minimal                | Fast     | ✅ Alternative

SST can take a long time for large datasets. A 500GB database over a 1Gbps network takes ~70 minutes at best. During this time, the joiner is not serving traffic and the donor (if using rsync) is partially degraded.

IST (Incremental State Transfer)

A partial transfer of only the write-sets the node missed while it was down. Much faster than SST. Used when:

  • The node was down briefly
  • The missing write-sets still exist in the donor's gcache (Galera cache)

IST is orders of magnitude faster than SST — catching up on 500 write-sets takes seconds, not minutes or hours.

gcache: The Key to Fast Rejoins

The gcache (Galera Cache) is a ring buffer on each node that stores recent write-sets. When a node comes back and requests state transfer, the donor checks: "Do I have all the write-sets this node missed in my gcache?"

If yes → IST (fast). If no → SST (slow).

# Size the gcache based on your write volume
# Rule of thumb: enough to hold write-sets for the longest expected downtime
wsrep_provider_options="gcache.size=2G"

How to size gcache: If your cluster generates 100MB of write-sets per hour and you want nodes that were down for 4 hours to rejoin via IST, set gcache.size=400M (plus safety margin). In practice, 1–4GB handles most workloads.
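To measure your actual write-set volume, sample the cumulative byte counters over a known interval, for example:

-- Sample now, and again one hour later; the delta of the two sums
-- approximates the write-set bytes gcache must hold per hour.
SHOW STATUS LIKE 'wsrep_replicated_bytes';  -- bytes this node has sent
SHOW STATUS LIKE 'wsrep_received_bytes';    -- bytes received from other nodes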

You can monitor gcache usage:

SHOW STATUS LIKE 'wsrep_local_cached_downto';
-- Lowest seqno still in gcache
 
SHOW STATUS LIKE 'wsrep_last_committed';
-- Current highest seqno

The difference tells you how many write-sets are cached. If a rejoining node needs write-sets below wsrep_local_cached_downto, IST is impossible and SST kicks in.


Galera Limitations

Galera is powerful, but it's not a universal solution. Understanding the limitations saves you from architectural mistakes.

InnoDB Only

Galera replicates at the InnoDB level. MyISAM, MEMORY, and other storage engines are not replicated. Writes to non-InnoDB tables are local-only — invisible to other nodes.

Percona XtraDB Cluster's Strict Mode (pxc_strict_mode=ENFORCING) rejects non-InnoDB table creation entirely.

All Tables Need Primary Keys

Galera's certification uses primary keys to detect conflicts. Tables without primary keys:

  • Can't be efficiently certified
  • Cause full-table scans during apply on remote nodes
  • Are rejected by PXC Strict Mode

Always define a primary key on every table.
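If a legacy table lacks one, a synthetic auto-increment key is usually the least invasive fix (table and column names here are illustrative):

ALTER TABLE audit_log
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;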

No Large Transactions

Galera write-sets are held in memory during certification and broadcasting. A transaction that modifies millions of rows creates a massive write-set that:

  • Consumes memory on all nodes
  • Takes time to broadcast
  • Triggers flow control while other nodes process it
  • Can cause wsrep_max_ws_size violations (default: 2GB)

Break bulk operations into batches. Instead of:

-- Bad: one massive transaction
DELETE FROM logs WHERE created_at < '2025-01-01';

Use:

-- Good: chunked deletes
DELETE FROM logs WHERE created_at < '2025-01-01' LIMIT 10000;
-- Repeat until 0 rows affected
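A driver loop for that pattern might look like this sketch (connection details and table schema are assumptions):

import mysql.connector

def purge_old_logs(conn, cutoff="2025-01-01", batch=10000):
    cur = conn.cursor()
    while True:
        cur.execute("DELETE FROM logs WHERE created_at < %s LIMIT %s",
                    (cutoff, batch))
        deleted = cur.rowcount   # rows removed in this chunk
        conn.commit()            # each chunk is its own small write-set
        if deleted == 0:
            break

# usage (placeholder credentials):
# conn = mysql.connector.connect(host="node1", user="app", password="...")
# purge_old_logs(conn)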

Write Scalability

Galera does not scale writes horizontally. Every write must be certified and applied on every node. Adding more nodes actually adds more certification overhead. Galera scales reads (any node can serve SELECT queries) but not writes.

If you need write scalability beyond what a single node can handle, look at application-level sharding or Vitess (covered in Post 10).

WAN Latency

Every COMMIT in Galera requires a network round-trip to all nodes. On a LAN (< 1ms latency), this adds negligible overhead. Across a WAN (50ms+ latency), every commit takes at least 50ms regardless of transaction complexity.

Galera supports WAN segments (configurable via gmcast.segment) to optimize communication topology, but the fundamental latency cost remains.
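Segments are assigned per node via gmcast.segment; for example, nodes in a second data center might be configured like this (the segment value is illustrative):

# All nodes in the second data center share segment 1; write-sets then
# cross the WAN once per segment instead of once per node.
wsrep_provider_options="gmcast.segment=1"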


Essential wsrep Status Variables

Here's a monitoring cheat sheet — the variables you should watch in production:

Cluster Health

-- Is the cluster operational?
SHOW STATUS LIKE 'wsrep_cluster_status';     -- Primary = good
SHOW STATUS LIKE 'wsrep_ready';              -- ON = accepting queries
SHOW STATUS LIKE 'wsrep_connected';          -- ON = connected to cluster
SHOW STATUS LIKE 'wsrep_cluster_size';       -- Number of nodes

Replication Performance

-- Flow control (should be near 0)
SHOW STATUS LIKE 'wsrep_flow_control_paused';     -- Fraction of time paused
SHOW STATUS LIKE 'wsrep_flow_control_sent';       -- FC events sent by this node
 
-- Queue lengths (should be near 0)
SHOW STATUS LIKE 'wsrep_local_recv_queue_avg';    -- Avg receive queue
SHOW STATUS LIKE 'wsrep_local_send_queue_avg';    -- Avg send queue
 
-- Certification
SHOW STATUS LIKE 'wsrep_cert_deps_distance';      -- How many write-sets can apply in parallel
SHOW STATUS LIKE 'wsrep_apply_oooe';              -- Out-of-order apply efficiency

Transaction Statistics

-- Write-set statistics
SHOW STATUS LIKE 'wsrep_last_committed';          -- Highest committed seqno
SHOW STATUS LIKE 'wsrep_replicated';              -- Write-sets sent by this node
SHOW STATUS LIKE 'wsrep_replicated_bytes';        -- Bytes sent
SHOW STATUS LIKE 'wsrep_received';                -- Write-sets received from others
SHOW STATUS LIKE 'wsrep_local_cert_failures';     -- Certification failures (conflicts)
SHOW STATUS LIKE 'wsrep_local_bf_aborts';         -- Brute-force aborts (conflicts with remote apply)

Putting It All Together

Here's the complete picture of how a 3-node Galera cluster handles a write: the application commits on one node, the write-set is broadcast in total order, all three nodes certify it independently, the originating node acknowledges the COMMIT, and the other two apply the write-set in the background.

The guarantees:

  1. No data loss on failover — every committed write exists on all surviving nodes
  2. No replication lag — certification is synchronous; reads from any node are current
  3. Automatic conflict resolution — certification catches conflicts without distributed locks
  4. Self-healing — nodes rejoin via IST/SST automatically, no manual intervention
  5. No split-brain — quorum prevents divergence during network partitions

What's Next

You now understand the theory behind Galera. In the next post, we turn theory into practice with Percona XtraDB Cluster (PXC) — Galera's production-ready distribution. You'll bootstrap a 3-node cluster with Docker Compose, configure SST with XtraBackup, test node failure and recovery, and see PXC Strict Mode in action.

Continue to Post 4 → Percona XtraDB Cluster Setup


Series navigation:
Post 2: MySQL Replication Fundamentals
Post 4: Percona XtraDB Cluster Setup
