MySQL High Availability: Galera, Percona & ProxySQL Roadmap

Every production application eventually faces the same hard question: what happens to your database when a server dies?

If the answer is "we wait for someone to notice and restart it," you have a problem. If the answer is "we have async replication with a manual failover script," you have a different problem — one that surfaces at 3 AM during your on-call rotation. If the answer is "the cluster automatically re-elects, traffic re-routes, and users don't notice," you have a high-availability MySQL setup.

This roadmap teaches you exactly how to turn that second answer into the third. We'll go from MySQL replication fundamentals all the way to running a production-grade Galera / Percona XtraDB Cluster with ProxySQL routing traffic intelligently in front of it, Percona XtraBackup keeping backups safe, and Percona Monitoring and Management (PMM) giving you full observability.

Why This Stack?

The open-source MySQL HA ecosystem has converged around a handful of battle-tested tools. Here's why these specific pieces matter:

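Galera Cluster is a synchronous multi-master replication library. Every write-set is replicated to all nodes and certified before the commit is acknowledged, so a transaction that committed on one node is not lost when that node fails. The other nodes apply the write-set asynchronously (which is why Galera is usually described as "virtually synchronous"), and flow control keeps that apply backlog small instead of letting replicas drift seconds or minutes behind. It powers thousands of production databases at companies you've heard of.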

Percona XtraDB Cluster (PXC) is the production-hardened distribution of Galera. Percona bundles Percona Server for MySQL (its MySQL fork, built around the XtraDB enhancements to InnoDB) with the Galera library, and ships Docker images, Kubernetes operator charts, and its own tooling ecosystem alongside it.

ProxySQL sits in front of your cluster as a smart proxy. It reads your MySQL traffic, routes SELECT statements to readers, routes writes to the primary, enforces connection pooling, handles failover health checks, and does query rewrites — all at wire speed with zero application change.

Percona XtraBackup takes non-blocking physical backups of InnoDB tables. No table locks, no downtime, no replication lag from backup operations. It's the industry standard for MySQL backup in production.

Percona Monitoring and Management (PMM) is a full observability stack for MySQL — query analytics, slow query profiling, InnoDB metrics, replication lag graphs, Galera flow control metrics — all in one UI, backed by Grafana and VictoriaMetrics.

The Problem with "Just Use Replication"

Standard MySQL async replication is good enough for scaling reads, but it has well-known failure modes.

The core issue: replication is asynchronous by default. The primary acknowledges your COMMIT before replicas have received or applied the change. If the primary dies, none of the options is clean:

Promoting a replica and accepting potential data loss (the replica might be seconds behind)
Waiting for replicas to catch up — but the primary is gone, so you can't
Failover scripts (MHA, Orchestrator) that automate promotion but can't recover already-lost transactions

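Semi-synchronous replication (MySQL built-in) partially solves the data loss problem: the primary waits for at least one replica to acknowledge receipt before returning from COMMIT. That wait adds latency, and if no replica acknowledges within the configured timeout, the primary falls back to async. And you still need external tooling to automate failover.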

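Galera's approach is fundamentally different: every node can accept writes, and every write is certified across the cluster before commit. When a node falls behind, flow control slows the writers until it catches up, and a node that drops out and rejoins is resynchronized automatically through IST or a full SST.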

Architecture Overview

A minimal production Galera + ProxySQL stack has the application connecting to ProxySQL, which fronts a three-node PXC cluster.

Key properties of this architecture:

Any node can accept writes — but ProxySQL typically routes writes to one node to minimize write conflicts
Cluster survives the loss of any single node (3-node quorum = majority at 2)
ProxySQL detects node failures via health checks and removes failed nodes from rotation in seconds
New nodes join automatically via SST (State Snapshot Transfer) — no manual restore needed
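
The "majority at 2" arithmetic generalizes to any cluster size. Here is a minimal sketch, assuming every node carries equal weight in the quorum vote (Galera also supports weighted quorum, which this ignores). The function names are made up for illustration.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes: int) -> int:
    """How many nodes can disappear while a majority still remains."""
    return total_nodes - quorum_size(total_nodes)


for n in (2, 3, 4, 5):
    print(f"{n} nodes: quorum={quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Note what the numbers say about even-sized clusters: two nodes tolerate no failures at all, and four nodes tolerate no more than three do. That is the usual argument for running an odd number of nodes.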

What You'll Learn in This Series

Series Breakdown

Phase 1 — MySQL Replication Fundamentals

Post 2: MySQL Replication Fundamentals (Async, Semi-Sync, GTID)

Before Galera makes sense, you need to understand what it's replacing and why. This post covers:

How binary log (binlog) replication works — statement-based vs row-based vs mixed
Async vs semi-sync replication — what you gain and what it costs
GTID (Global Transaction Identifiers) — worth understanding before Galera, because Galera tags every write-set with its own global transaction ID (cluster UUID plus sequence number), which is a closely related idea
Replication lag — how to measure it, why it matters, and what Galera does instead
Common replication failure modes and how operators traditionally handled them
Why async replication is still useful (analytics replicas, cross-region disaster recovery)

This post sets the conceptual baseline. If you're already comfortable with MySQL replication internals, you can skim it.

Phase 2 — Galera Cluster Architecture

Post 3: Galera Cluster Architecture — Synchronous Multi-Master Replication

The core of the series. You'll understand:

The wsrep API — Galera's replication hook into MySQL/MariaDB
Write-set replication — how Galera packages a transaction, broadcasts it to all nodes, and certifies it atomically
Certification-based replication vs lock-based — why certification is faster for multi-master
Galera states: Open, Primary, Joiner, Joined, Synced, Donor
Flow control — the cluster-wide mechanism that throttles writes when a node falls behind, and how to tune it
Split-brain and quorum — what happens when nodes can't see each other, how quorum prevents data divergence
SST (State Snapshot Transfer) vs IST (Incremental State Transfer) — full copy vs catching up from gcache
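
To make "certification" less abstract before Post 3, here is a toy model of the idea, not Galera's actual implementation. It assumes each write-set carries the set of row keys it modified plus the sequence number of the newest transaction its origin node had seen. Every name in it (`WriteSet`, `Certifier`, `last_seen_seqno`) is invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WriteSet:
    origin: str           # node that executed the transaction
    keys: frozenset       # row keys the transaction modified
    last_seen_seqno: int  # newest certified seqno visible when it ran


@dataclass
class Certifier:
    """Toy certifier. Each node would run the same logic over the same
    totally ordered stream, so all nodes reach the same verdict."""
    next_seqno: int = 1
    last_writer: dict = field(default_factory=dict)  # key -> seqno of last certified write

    def certify(self, ws: WriteSet) -> bool:
        seqno = self.next_seqno
        self.next_seqno += 1
        # Conflict: a key was changed by a write-set the origin had not seen.
        for key in ws.keys:
            if self.last_writer.get(key, 0) > ws.last_seen_seqno:
                return False
        for key in ws.keys:
            self.last_writer[key] = seqno
        return True


certifier = Certifier()
a = WriteSet("node1", frozenset({"users:42"}), last_seen_seqno=0)
b = WriteSet("node2", frozenset({"users:42"}), last_seen_seqno=0)  # concurrent with a
c = WriteSet("node3", frozenset({"users:7"}), last_seen_seqno=0)
print([certifier.certify(ws) for ws in (a, b, c)])  # [True, False, True]
```

The two writes to `users:42` ran concurrently on different nodes, so whichever arrives second in the total order fails certification and has to be rolled back on its origin. The write to an unrelated row passes untouched. No locks were exchanged between nodes to reach that decision, which is the point of the certification-vs-locking comparison above.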

Phase 3 — Percona XtraDB Cluster

Post 4: Percona XtraDB Cluster (PXC) — Galera Done Right

PXC is the recommended way to run Galera in production. This practical post covers:

PXC vs vanilla Galera on MariaDB — what's different and why Percona's builds are preferred for operational teams
Bootstrapping a 3-node cluster from scratch with Docker Compose
SST methods — xtrabackup-v2 (recommended, non-blocking donor), rsync (blocks the donor while it copies), mariabackup (MariaDB alternative); which methods are available depends on the distribution and version
PXC Strict Mode — what operations are rejected and why (non-transactional tables, tables without a primary key, explicit table locks such as LOCK TABLES)
Node operations: adding a node, removing a node gracefully, recovering a crashed node
PXC on Kubernetes with the Percona Operator
Configuration tunables: wsrep_cluster_address, wsrep_sst_method, wsrep_provider_options (gcache size, segment, EVS parameters)

Deep Dives

Post 5: ProxySQL — Smart Load Balancing & Query Routing

ProxySQL is the piece that makes your cluster invisible to applications. This deep dive covers:

ProxySQL architecture: multi-layer thread pool, in-memory SQLite config, runtime vs disk config
Hostgroups — writer_hostgroup, reader_hostgroup, offline_hostgroup, backup_writer_hostgroup
The Galera-aware ProxySQL scheduler (scheduler.sh that pings wsrep state and updates hostgroups automatically)
Query rules — routing reads to replicas, blocking dangerous queries, query rewriting
Connection pooling — transaction vs session vs multiplexed pooling modes
Query analytics in ProxySQL — slow query log, statistics tables, digests
Zero-downtime ProxySQL config changes via LOAD ... TO RUNTIME; SAVE ... TO DISK
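
The read/write split behind those query rules boils down to ordered pattern matching where the first hit wins. The sketch below imitates that logic in plain Python so you can see it in isolation. It is not ProxySQL code, the hostgroup IDs 10 and 20 are arbitrary, and it ignores transactions (a SELECT inside an open transaction should stay on the writer).

```python
import re
from typing import NamedTuple

WRITER_HOSTGROUP = 10  # hypothetical hostgroup IDs, chosen for this example
READER_HOSTGROUP = 20


class QueryRule(NamedTuple):
    pattern: re.Pattern
    destination_hostgroup: int


# Order matters: locking reads must be matched before the generic SELECT rule.
RULES = [
    QueryRule(re.compile(r"^\s*SELECT\b.*\bFOR\s+UPDATE\b", re.I | re.S), WRITER_HOSTGROUP),
    QueryRule(re.compile(r"^\s*SELECT\b", re.I), READER_HOSTGROUP),
]


def route(sql: str, default_hostgroup: int = WRITER_HOSTGROUP) -> int:
    """Return the hostgroup of the first matching rule, else the default."""
    for rule in RULES:
        if rule.pattern.search(sql):
            return rule.destination_hostgroup
    return default_hostgroup


for query in (
    "SELECT name FROM users WHERE id = 42",
    "SELECT balance FROM accounts WHERE id = 7 FOR UPDATE",
    "UPDATE accounts SET balance = balance - 10 WHERE id = 7",
):
    print(route(query), query)
```

Defaulting to the writer is the conservative choice: a statement that no rule recognizes lands on the node that can always execute it correctly, at the cost of a little extra load there.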

Post 6: Backup & Recovery with Percona XtraBackup

The industry standard for production MySQL backup:

How physical backup (XtraBackup) differs from logical backup (mysqldump)
Full backup workflow — backup, prepare, restore
Incremental backup — LSN-based incremental strategy, how to chain incrementals
Streaming backups to S3 — xbstream + xbcloud for remote backup storage
Point-in-time recovery (PITR) with XtraBackup + binlog position
Encryption at rest — --encrypt and --encrypt-key options
Backup verification — xtrabackup --prepare (innobackupex --apply-log on legacy 2.x releases) and validating restored data
Backup scheduling with cron, monitoring backup freshness
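
"Chaining incrementals" means each incremental must start at the LSN where the previous backup stopped. XtraBackup records those positions as `from_lsn` and `to_lsn` in each backup's `xtrabackup_checkpoints` file. Below is a small sketch of a chain check, assuming every incremental is based on the backup immediately before it (not on the full backup). The LSN values are illustrative, and the helper names are made up.

```python
def parse_checkpoints(text: str) -> dict:
    """Parse 'key = value' lines as found in an xtrabackup_checkpoints file."""
    fields = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def chain_is_contiguous(checkpoints: list) -> bool:
    """True if the chain starts with a full backup (from_lsn 0) and each
    later backup begins exactly where the previous one ended."""
    if not checkpoints or int(checkpoints[0]["from_lsn"]) != 0:
        return False
    for prev, cur in zip(checkpoints, checkpoints[1:]):
        if int(cur["from_lsn"]) != int(prev["to_lsn"]):
            return False
    return True


# Illustrative file contents; the LSNs are not from a real backup.
FULL = "backup_type = full-backuped\nfrom_lsn = 0\nto_lsn = 1626007\n"
INC1 = "backup_type = incremental\nfrom_lsn = 1626007\nto_lsn = 4124244\n"
INC2_GAP = "backup_type = incremental\nfrom_lsn = 5000000\nto_lsn = 6000000\n"

good = [parse_checkpoints(t) for t in (FULL, INC1)]
broken = [parse_checkpoints(t) for t in (FULL, INC1, INC2_GAP)]
print(chain_is_contiguous(good))    # True
print(chain_is_contiguous(broken))  # False
```

A check like this is cheap to run from the same cron job that takes the backup, and it catches a broken chain long before you need to restore from it.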

Post 7: Schema Changes in Galera: pt-osc, gh-ost & Percona Toolkit

Online DDL in Galera is the topic that catches most teams off-guard:

Why ALTER TABLE in Galera is dangerous — TOI (Total Order Isolation) blocks the entire cluster
RSU (Rolling Schema Upgrade) — run DDL node by node, desync from cluster first
pt-online-schema-change (pt-osc) — trigger-based online DDL that works with Galera
gh-ost — triggerless online DDL via binlog streaming (requires careful Galera config)
Other Percona Toolkit utilities: pt-archiver, pt-table-checksum, pt-table-sync
Decision matrix: when to use TOI vs RSU vs pt-osc vs gh-ost

Post 8: Tuning & Performance for Galera Clusters

Galera adds overhead — here's how to manage it:

Galera-specific InnoDB tunables — innodb_flush_log_at_trx_commit, sync_binlog, write-set buffer sizing
Flow control deep dive — wsrep_flow_control_paused, how to read it, how to reduce it
Workload patterns that hurt Galera — large transactions, bulk inserts, hot-row updates, DDL storms
Write scaling strategies — segment Galera (LAN vs WAN segments), sharding at application layer
gcache tuning — sizing the Galera write-set cache for faster IST joins
Benchmarking methodology: sysbench workloads for Galera, reading wsrep_* status variables
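
As a preview of the gcache item, a common rule of thumb sizes the cache from the write-set byte rate multiplied by the longest outage you want a node to recover from via IST. The sketch below assumes you sample the `wsrep_replicated_bytes` and `wsrep_received_bytes` status counters twice. The 1.5x headroom factor and the sample numbers are assumptions for illustration, not recommendations.

```python
import math


def writeset_bytes_per_second(start: dict, end: dict, interval_seconds: float) -> float:
    """Write-set throughput from two samples of the wsrep byte counters."""
    keys = ("wsrep_replicated_bytes", "wsrep_received_bytes")
    delta = sum(end[k] - start[k] for k in keys)
    return delta / interval_seconds


def gcache_bytes_needed(bytes_per_second: float, outage_seconds: float,
                        headroom: float = 1.5) -> int:
    """Bytes of gcache needed to cover an outage window, with safety headroom."""
    return math.ceil(bytes_per_second * outage_seconds * headroom)


# Two hypothetical samples taken 60 seconds apart.
sample_start = {"wsrep_replicated_bytes": 1_000_000_000, "wsrep_received_bytes": 2_000_000_000}
sample_end = {"wsrep_replicated_bytes": 1_060_000_000, "wsrep_received_bytes": 2_120_000_000}

rate = writeset_bytes_per_second(sample_start, sample_end, interval_seconds=60)
needed = gcache_bytes_needed(rate, outage_seconds=3600)
print(f"{rate / 1e6:.1f} MB/s of write-sets, gcache of about {needed / 1e9:.1f} GB for a 1 h outage")
```

Sample during peak write load, not at 3 AM on a quiet night, or the estimate will be too small exactly when a node is most likely to fall over.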

Post 9: Monitoring with PMM (Percona Monitoring and Management)

Visibility into your cluster is non-negotiable:

PMM architecture: pmm-server (Grafana + VictoriaMetrics + Alertmanager) + pmm-client agents
Query Analytics (QAN) — identify your slowest queries, see execution plans, track query trends
MySQL dashboards: InnoDB metrics, buffer pool hit rate, thread state, table I/O
Galera-specific dashboards: wsrep_cluster_size, flow control, replication lag, certification conflicts
Alerting rules — node down, replication lag, flow control %, disk I/O thresholds
PMM with Docker Compose and Kubernetes Helm chart
Custom Grafana panels for application-level SLOs

Post 10: InnoDB Cluster, Group Replication & Vitess — Alternatives Compared

The ecosystem doesn't end at Galera:

MySQL InnoDB Cluster (MySQL Shell + Group Replication + MySQL Router) — Oracle's official HA stack, compared with PXC
Group Replication vs Galera — same goal, different protocol, different trade-offs
Vitess — horizontal sharding proxy, how YouTube runs MySQL at scale
Orchestrator — standalone MySQL topology manager for async replication setups
MaxScale — MariaDB's equivalent to ProxySQL
Decision matrix: which stack fits which use case

Learning Paths

Path 1 — Developer who needs to understand HA

You write application code and want to understand how the database layer handles failures. You're not the primary DBA.

Week	Focus	Posts
1	Replication concepts	Post 2
2	Galera architecture	Post 3
3	ProxySQL from app perspective	Post 5 (connection pooling, query routing)
4	Monitoring — reading dashboards	Post 9

Path 2 — DevOps / Platform engineer setting up HA for the first time

You need to get a cluster running, understand operations, and keep it healthy.

Week	Focus	Posts
1	Replication fundamentals	Post 2
2	Galera architecture	Post 3
3	PXC setup (hands-on)	Post 4
4	ProxySQL setup	Post 5
5	Backup setup	Post 6
6	Schema change strategy	Post 7
7	Monitoring setup	Post 9
8	Performance review	Post 8

Path 3 — Evaluating HA options before committing

You have a write-heavy MySQL workload and need to decide whether Galera, InnoDB Cluster, or something else is right.

Reading	Posts
Start here	Post 3 (Galera architecture)
Compare alternatives	Post 10
Understand operational cost	Post 7 (schema changes are the hardest part)
Understand monitoring investment	Post 9

Prerequisites

Before starting this series you should be comfortable with:

MySQL basics — CREATE TABLE, SELECT, INSERT, UPDATE, EXPLAIN, user management
Docker / Docker Compose — all examples run in containers
Basic Linux operations — systemctl, log tailing, file editing
Networking fundamentals — ports, firewall rules, what a proxy is

If you're weak on any of these:

MySQL basics → SQL & NoSQL Roadmap
Docker → Docker & Kubernetes Roadmap
Networking → Networking for Web Developers

Galera vs The Alternatives — Quick Reference

Feature	Async Replication	Semi-Sync	Galera / PXC	InnoDB Cluster	Vitess
Write destinations	1 primary	1 primary	All nodes	1 primary	Varies (sharding)
Data loss on failover	Possible	Minimal	None	None (sync)	None
Failover automation	External tool	External tool	Built-in	MySQL Router	Built-in
Write throughput	Single-node limit	Single-node limit	Limited by wsrep overhead	Single-node limit	Horizontal scale
Read scale	Replicas (async)	Replicas	All nodes	Replicas	All nodes
Replication lag	Yes	Minimal	Apply lag only (bounded by flow control)	Apply lag only (bounded by flow control)	None
Operational complexity	Low	Medium	Medium-High	Medium	Very High
Best for	Read-heavy, DR	Mission-critical single-primary	Multi-master HA	Oracle ecosystem	Web-scale sharding

Common Misconceptions

"Galera eliminates all latency" — No. Galera adds certification round-trip latency to every commit. WAN clusters feel this significantly. The trade-off is zero data loss, not zero latency.

"ProxySQL makes Galera truly multi-master" — Practically, ProxySQL routes writes to one node to avoid write conflict retries. Galera supports writes to any node, but your application needs to handle deadlock-like certification failures if you actually write to multiple nodes simultaneously.

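Handling those certification failures in application code usually means retrying the whole transaction. Galera reports a failed certification to the client as a deadlock error (MySQL error 1213), so a retry wrapper can key on that code. This is a driver-agnostic sketch: `DatabaseError` is a hypothetical stand-in for whatever exception your MySQL driver raises, and the attempt count and delays are arbitrary.

```python
import time

ER_LOCK_DEADLOCK = 1213  # MySQL error code used for certification failures


class DatabaseError(Exception):
    """Hypothetical stand-in for a driver exception carrying a MySQL error code."""

    def __init__(self, errno: int, message: str = ""):
        super().__init__(message)
        self.errno = errno


def run_with_retry(transaction, max_attempts: int = 3, base_delay: float = 0.05):
    """Run transaction(); retry with exponential backoff on deadlock errors only."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DatabaseError as exc:
            if exc.errno != ER_LOCK_DEADLOCK or attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


# Demo: a transaction that loses certification twice, then commits.
attempts = {"count": 0}


def flaky_transaction():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise DatabaseError(ER_LOCK_DEADLOCK, "certification failed")
    return "committed"


result = run_with_retry(flaky_transaction, base_delay=0.01)
print(f"{result} after {attempts['count']} attempts")  # committed after 3 attempts
```

The callable must re-run the entire transaction from its first statement, since the failed attempt was rolled back as a whole. Any other error is re-raised immediately instead of being retried.
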
"You can use MyISAM tables in a Galera cluster" — PXC Strict Mode will reject this. Galera only replicates InnoDB (transactional) writes. MyISAM writes are local-only.

"SST is fast" — SST is a full data copy from a donor node. For a 500 GB dataset, SST takes time. Size your gcache appropriately so rejoining nodes can use IST (incremental) instead.

"Galera solves database scaling" — Galera solves availability, not throughput. All writes touch all nodes. For write-intensive workloads beyond what one node can handle, you need sharding (Vitess) or a different database.

Essential Tools in This Ecosystem

Tool	Purpose	Where it runs
Galera Cluster	Synchronous replication library	Bundled in PXC / MariaDB
Percona XtraDB Cluster	Production MySQL + Galera distribution	Database nodes
ProxySQL	MySQL proxy, query routing, pooling	Separate proxy layer
Percona XtraBackup	Non-blocking InnoDB backup	Database nodes / backup server
PMM Server	Monitoring, query analytics, dashboards	Dedicated monitoring server
PMM Client	Agent on each database node	Database nodes
Percona Toolkit	`pt-osc`, `pt-table-checksum`, `pt-archiver`	Ops workstation / cron
gh-ost	Triggerless online DDL	Ops workstation
Orchestrator	Topology management (for async setups)	Separate server
MySQL Shell	InnoDB Cluster management (alternative)	Ops workstation

Related Series on This Blog

This series assumes MySQL as the database. For broader context:

SQL & NoSQL Roadmap — SQL fundamentals, schema design, indexing, MongoDB, Redis
Patterns of Distributed Systems — Consensus, leader election, replication patterns that Galera implements
Docker & Kubernetes Roadmap — All examples in this series use Docker Compose; Kubernetes deployment is covered for PXC
Load Balancing Explained — ProxySQL is a specialized Layer 7 proxy; this post gives the general foundation
IoT Data Pipelines — High-write IoT workloads are a common reason teams reach for Galera

Let's Get Started

The next post dives into MySQL replication fundamentals — binlog formats, GTID, async vs semi-sync, and the practical failure modes that Galera was designed to solve.

If you already have solid replication knowledge, jump straight to Post 3: Galera Cluster Architecture — that's where the interesting stuff starts.

Continue to Post 2 → MySQL Replication Fundamentals
