MySQL High Availability: Galera, Percona & ProxySQL Roadmap

Every production application eventually faces the same hard question: what happens to your database when a server dies?
If the answer is "we wait for someone to notice and restart it," you have a problem. If the answer is "we have async replication with a manual failover script," you have a different problem — one that surfaces at 3 AM during your on-call rotation. If the answer is "the cluster automatically re-elects, traffic re-routes, and users don't notice," you have a high-availability MySQL setup.
This roadmap teaches you how to get from that second answer to the third. We'll go from MySQL replication fundamentals all the way to running a production-grade Galera / Percona XtraDB Cluster with ProxySQL routing traffic intelligently in front of it, Percona XtraBackup keeping backups safe, and Percona Monitoring and Management (PMM) giving you full observability.
Why This Stack?
The open-source MySQL HA ecosystem has converged around a handful of battle-tested tools. Here's why these specific pieces matter:
Galera Cluster is a synchronous multi-master replication library. Every write-set is replicated to all nodes and certified before the commit is acknowledged, so a committed transaction survives the loss of a single node. There is no asynchronous replication lag in the usual sense, although other nodes apply a write-set slightly after it is certified (Galera calls this "virtually synchronous"). It is widely deployed in production.
Percona XtraDB Cluster (PXC) is the production-hardened distribution of Galera. Percona ships its own MySQL fork (XtraDB storage engine with InnoDB enhancements) bundled with Galera, along with Docker images, a Kubernetes operator, and Percona's own tooling ecosystem.
ProxySQL sits in front of your cluster as a smart proxy. It reads your MySQL traffic, routes SELECT statements to readers, routes writes to the writer node, enforces connection pooling, handles failover health checks, and does query rewrites, all with low overhead and usually without application changes.
Percona XtraBackup takes non-blocking physical backups of InnoDB tables. No table locks, no downtime, no replication lag from backup operations. It's the industry standard for MySQL backup in production.
Percona Monitoring and Management (PMM) is a full observability stack for MySQL — query analytics, slow query profiling, InnoDB metrics, replication lag graphs, Galera flow control metrics — all in one UI, backed by Grafana and VictoriaMetrics.
The Problem with "Just Use Replication"
Standard MySQL async replication is good enough for reads, but it has well-known failure modes.
The core issue: replication is asynchronous by default. The primary acknowledges your COMMIT before any replica has received, let alone applied, the change. If the primary dies, none of your options are good:
- Promoting a replica and accepting potential data loss (the replica might be seconds behind)
- Waiting for replicas to catch up, which can't happen because the primary is gone
- Running failover scripts (MHA, Orchestrator), which automate promotion but can't recover already-lost transactions
Semi-synchronous replication (MySQL built-in) partially solves the data loss problem, but it adds latency and falls back to async when no replica acknowledges within the configured timeout. And you still need external tooling to automate failover.
Galera's approach is fundamentally different: every node can accept writes, every write is certified across the cluster before commit, and nodes that rejoin after falling out of sync are brought back automatically through a state transfer (IST or SST).
Architecture Overview
A minimal production Galera + ProxySQL stack consists of three database nodes with ProxySQL in front of them.
Key properties of this architecture:
- Any node can accept writes — but ProxySQL typically routes writes to one node to minimize write conflicts
- Cluster survives the loss of any single node (3-node quorum = majority at 2)
- ProxySQL detects node failures via health checks and removes failed nodes from rotation in seconds
- New nodes join automatically via SST (State Snapshot Transfer) — no manual restore needed
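The quorum arithmetic behind "3-node quorum = majority at 2" can be sketched in a few lines. This is a simplified model that assumes every node has equal weight and that nodes disappear abruptly (a crash or network partition rather than a graceful shutdown); the function names are made up for illustration and are not part of Galera.

```python
def has_quorum(total_nodes: int, reachable_nodes: int) -> bool:
    """True if a partition that can see `reachable_nodes` out of
    `total_nodes` equally weighted nodes holds a strict majority."""
    if not 0 <= reachable_nodes <= total_nodes:
        raise ValueError("reachable_nodes must be between 0 and total_nodes")
    return 2 * reachable_nodes > total_nodes


def tolerated_failures(total_nodes: int) -> int:
    """Largest number of simultaneous node losses that still leaves a majority."""
    if total_nodes < 1:
        raise ValueError("total_nodes must be at least 1")
    return (total_nodes - 1) // 2


if __name__ == "__main__":
    for size in (2, 3, 4, 5):
        print(f"{size} nodes: survives {tolerated_failures(size)} abrupt failure(s)")
```

Under these assumptions a 2-node cluster tolerates no abrupt failures and a 4-node cluster tolerates no more than a 3-node one, which is why odd cluster sizes are the usual recommendation.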
What You'll Learn in This Series
Series Breakdown
Phase 1 — MySQL Replication Fundamentals
Post 2: MySQL Replication Fundamentals (Async, Semi-Sync, GTID)
Before Galera makes sense, you need to understand what it's replacing and why. This post covers:
- How binary log (binlog) replication works — statement-based vs row-based vs mixed
- Async vs semi-sync replication — what you gain and what it costs
- GTID (Global Transaction Identifiers): worth understanding before Galera, because Galera also tags every write-set with its own global transaction ID (cluster UUID plus sequence number) and uses it to order and certify transactions
- Replication lag — how to measure it, why it matters, and what Galera does instead
- Common replication failure modes and how operators traditionally handled them
- Why async replication is still useful (analytics replicas, cross-region disaster recovery)
This post sets the conceptual baseline. If you're already comfortable with MySQL replication internals, you can skim it.
Phase 2 — Galera Cluster Architecture
Post 3: Galera Cluster Architecture — Synchronous Multi-Master Replication
The core of the series. You'll understand:
- The wsrep API — Galera's replication hook into MySQL/MariaDB
- Write-set replication — how Galera packages a transaction, broadcasts it to all nodes, and certifies it atomically
- Certification-based replication vs lock-based — why certification is faster for multi-master
- Galera states: Open, Primary, Joiner, Joined, Synced, Donor
- Flow control — the cluster-wide mechanism that throttles writes when a node falls behind, and how to tune it
- Split-brain and quorum — what happens when nodes can't see each other, how quorum prevents data divergence
- SST (State Snapshot Transfer) vs IST (Incremental State Transfer) — full copy vs catching up from gcache
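To make certification less abstract before Post 3, here is a toy model of the "first committer wins" rule. It is a deliberately simplified sketch, not Galera's implementation: the `WriteSet` and `Certifier` classes, the string row keys, and the sequence-number bookkeeping are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class WriteSet:
    origin: str                 # node where the transaction executed
    keys: FrozenSet[str]        # rows the transaction modified
    snapshot_seqno: int         # last global seqno the transaction had seen


class Certifier:
    """Toy certification index; every node would run the same logic
    over the same totally ordered stream of write-sets."""

    def __init__(self) -> None:
        self.seqno = 0
        self.last_writer: Dict[str, int] = {}  # row key -> seqno that last changed it

    def certify(self, ws: WriteSet) -> bool:
        # In this toy model every write-set gets the next seqno, pass or fail.
        self.seqno += 1
        # Conflict: some row was changed by a write-set the transaction never saw.
        if any(self.last_writer.get(k, 0) > ws.snapshot_seqno for k in ws.keys):
            return False
        for k in ws.keys:
            self.last_writer[k] = self.seqno
        return True


a = WriteSet("node1", frozenset({"accounts:42"}), snapshot_seqno=0)
b = WriteSet("node2", frozenset({"accounts:42"}), snapshot_seqno=0)
c = WriteSet("node3", frozenset({"accounts:7"}), snapshot_seqno=0)

node = Certifier()
results = [node.certify(ws) for ws in (a, b, c)]
print(results)
```

Because the check depends only on the ordered stream and the index, every node reaches the same verdict without exchanging locks. In the example, `a` and `b` touch the same row from the same snapshot, so whichever is ordered second fails certification, while `c` touches a different row and passes.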
Phase 3 — Percona XtraDB Cluster
Post 4: Percona XtraDB Cluster (PXC) — Galera Done Right
PXC is the recommended way to run Galera in production. This practical post covers:
- PXC vs vanilla Galera on MariaDB — what's different and why Percona's builds are preferred for operational teams
- Bootstrapping a 3-node cluster from scratch with Docker Compose
- SST methods: xtrabackup-v2 (recommended, non-blocking donor), rsync (blocks donor reads), mariabackup (MariaDB alternative)
- PXC Strict Mode: what operations are rejected and why (non-transactional tables, LOCK TABLES, explicit table locks)
- Node operations: adding a node, removing a node gracefully, recovering a crashed node
- PXC on Kubernetes with the Percona Operator
- Configuration tunables: wsrep_cluster_address, wsrep_sst_method, wsrep_provider_options (gcache size, segment, EVS parameters)
Deep Dives
Post 5: ProxySQL — Smart Load Balancing & Query Routing
ProxySQL is the piece that makes your cluster invisible to applications. This deep dive covers:
- ProxySQL architecture: multi-layer thread pool, in-memory SQLite config, runtime vs disk config
- Hostgroups: writer_hostgroup, reader_hostgroup, offline_hostgroup, backup_writer_hostgroup
- The Galera-aware ProxySQL scheduler (scheduler.sh that pings wsrep state and updates hostgroups automatically)
- Query rules: routing reads to replicas, blocking dangerous queries, query rewriting
- Connection pooling: transaction vs session vs multiplexed pooling modes
- Query analytics in ProxySQL: slow query log, statistics tables, digests
- Zero-downtime ProxySQL config changes via LOAD ... TO RUNTIME; SAVE ... TO DISK
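The read/write split that the query rules above implement can be pictured as an ordered list of patterns with a default destination. The sketch below is a plain-Python model of that idea, not ProxySQL code: the hostgroup IDs 10 and 20, the two patterns, and the `route` function are assumptions chosen for illustration. In ProxySQL itself, such rules live in the mysql_query_rules table.

```python
import re

WRITER_HG = 10   # hypothetical writer hostgroup ID
READER_HG = 20   # hypothetical reader hostgroup ID

# First match wins, so the more specific pattern comes first.
RULES = [
    (re.compile(r"^\s*SELECT\b.*\bFOR\s+UPDATE\b", re.IGNORECASE | re.DOTALL), WRITER_HG),
    (re.compile(r"^\s*SELECT\b", re.IGNORECASE), READER_HG),
]


def route(sql: str, in_transaction: bool = False) -> int:
    """Return the hostgroup a statement would be sent to in this model."""
    # Keep everything inside an open transaction on the writer so a
    # transaction never spans two backends.
    if in_transaction:
        return WRITER_HG
    for pattern, hostgroup in RULES:
        if pattern.search(sql):
            return hostgroup
    # Anything unmatched (INSERT, UPDATE, DDL, ...) goes to the writer.
    return WRITER_HG


if __name__ == "__main__":
    for stmt in ("SELECT * FROM orders",
                 "SELECT * FROM orders WHERE id = 1 FOR UPDATE",
                 "UPDATE orders SET paid = 1 WHERE id = 1"):
        print(route(stmt), stmt)
```

Defaulting to the writer is the conservative choice: a rule that fails to match costs some read scaling, but never sends a write to a reader.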
Post 6: Backup & Recovery with Percona XtraBackup
The industry standard for production MySQL backup:
- How physical backup (XtraBackup) differs from logical backup (mysqldump)
- Full backup workflow: backup, prepare, restore
- Incremental backup: LSN-based incremental strategy, how to chain incrementals
- Streaming backups to S3: xbstream + xbcloud for remote backup storage
- Point-in-time recovery (PITR) with XtraBackup + binlog position
- Encryption at rest: --encrypt and --encrypt-key options
- Backup verification: xtrabackup --prepare (innobackupex --apply-log in older releases) and validating restored data
- Backup scheduling with cron, monitoring backup freshness
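Chaining incrementals comes down to one invariant: each incremental must start at the LSN where the previous backup ended. XtraBackup records these values in an xtrabackup_checkpoints file inside each backup directory. The sketch below checks the invariant over the text of those files; the helper names and the LSN values are invented for illustration, and the check that a full backup starts at LSN 0 is an assumption of this sketch.

```python
from typing import Dict, List


def parse_checkpoints(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines from an xtrabackup_checkpoints file."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def chain_is_contiguous(checkpoints: List[Dict[str, str]]) -> bool:
    """True if the list is a full backup followed by incrementals whose
    from_lsn each equal the previous backup's to_lsn."""
    if not checkpoints:
        raise ValueError("need at least the full backup")
    if int(checkpoints[0]["from_lsn"]) != 0:
        return False  # assumed: a full backup starts at LSN 0
    previous_to = int(checkpoints[0]["to_lsn"])
    for inc in checkpoints[1:]:
        if int(inc["from_lsn"]) != previous_to:
            return False
        previous_to = int(inc["to_lsn"])
    return True


# Illustrative values only, not taken from a real backup.
FULL = "backup_type = full-backuped\nfrom_lsn = 0\nto_lsn = 1000\n"
INC1 = "backup_type = incremental\nfrom_lsn = 1000\nto_lsn = 1800\n"
INC2 = "backup_type = incremental\nfrom_lsn = 1800\nto_lsn = 2500\n"

chain = [parse_checkpoints(t) for t in (FULL, INC1, INC2)]
print(chain_is_contiguous(chain))                 # full + both incrementals
print(chain_is_contiguous([chain[0], chain[2]]))  # middle incremental missing
```

A check like this is cheap to run before a restore, and it catches the most common mistake with incrementals: a missing or out-of-order link in the chain.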
Post 7: Schema Changes in Galera: pt-osc, gh-ost & Percona Toolkit
Online DDL in Galera is the topic that catches most teams off-guard:
- Why ALTER TABLE in Galera is dangerous: TOI (Total Order Isolation) blocks the entire cluster
- RSU (Rolling Schema Upgrade): run DDL node by node, desync from cluster first
- pt-online-schema-change (pt-osc): trigger-based online DDL that works with Galera
- gh-ost: triggerless online DDL via binlog streaming (requires careful Galera config)
- Other Percona Toolkit utilities: pt-archiver, pt-table-checksum, pt-table-sync
- Decision matrix: when to use TOI vs RSU vs pt-osc vs gh-ost
Post 8: Tuning & Performance for Galera Clusters
Galera adds overhead — here's how to manage it:
- Galera-specific InnoDB tunables: innodb_flush_log_at_trx_commit, sync_binlog, write-set buffer sizing
- Flow control deep dive: wsrep_flow_control_paused, how to read it, how to reduce it
- Workload patterns that hurt Galera: large transactions, bulk inserts, hot-row updates, DDL storms
- Write scaling strategies: segment Galera (LAN vs WAN segments), sharding at application layer
- gcache tuning: sizing the Galera write-set cache for faster IST joins
- Benchmarking methodology: sysbench workloads for Galera, reading wsrep_* status variables
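Two of the items above, reading flow control and sizing gcache, reduce to simple arithmetic on wsrep_* counters sampled twice. The sketch below assumes you have already collected the counters (for example from SHOW GLOBAL STATUS) into dictionaries. The helper names, the sample numbers, and the 1.5 safety factor are assumptions for illustration, not recommendations.

```python
import math
from typing import Dict

NS_PER_SECOND = 1_000_000_000


def paused_fraction(paused_ns_before: int, paused_ns_after: int,
                    interval_seconds: float) -> float:
    """Share of the interval that replication was paused by flow control,
    from two samples of wsrep_flow_control_paused_ns."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta_ns = paused_ns_after - paused_ns_before
    if delta_ns < 0:
        raise ValueError("counter went backwards; the node probably restarted")
    return min(1.0, delta_ns / (interval_seconds * NS_PER_SECOND))


def writeset_bytes_per_second(before: Dict[str, int], after: Dict[str, int],
                              interval_seconds: float) -> float:
    """Write-set traffic through a node: bytes replicated out plus bytes received."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    keys = ("wsrep_replicated_bytes", "wsrep_received_bytes")
    return sum(after[k] - before[k] for k in keys) / interval_seconds


def gcache_bytes_for_outage(bytes_per_second: float, outage_seconds: float,
                            safety_factor: float = 1.5) -> int:
    """gcache size that would cover an outage of the given length via IST."""
    return math.ceil(bytes_per_second * outage_seconds * safety_factor)


# Made-up samples taken 60 seconds apart.
before = {"wsrep_flow_control_paused_ns": 0,
          "wsrep_replicated_bytes": 0, "wsrep_received_bytes": 0}
after = {"wsrep_flow_control_paused_ns": 3 * NS_PER_SECOND,
         "wsrep_replicated_bytes": 900_000_000, "wsrep_received_bytes": 300_000_000}
interval = 60.0

fraction = paused_fraction(before["wsrep_flow_control_paused_ns"],
                           after["wsrep_flow_control_paused_ns"], interval)
rate = writeset_bytes_per_second(before, after, interval)
print(f"paused {fraction:.1%} of the interval")
print(f"gcache for a 1-hour outage: {gcache_bytes_for_outage(rate, 3600)} bytes")
```

Sampling during peak write traffic gives a more useful gcache estimate than a daily average, since a node that drops out during the peak needs the most history to rejoin via IST.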
Post 9: Monitoring with PMM (Percona Monitoring and Management)
Visibility into your cluster is non-negotiable:
- PMM architecture: pmm-server (Grafana + VictoriaMetrics + Alertmanager) + pmm-client agents
- Query Analytics (QAN): identify your slowest queries, see execution plans, track query trends
- MySQL dashboards: InnoDB metrics, buffer pool hit rate, thread state, table I/O
- Galera-specific dashboards: wsrep_cluster_size, flow control, replication lag, certification conflicts
- Alerting rules: node down, replication lag, flow control %, disk I/O thresholds
- PMM with Docker Compose and Kubernetes Helm chart
- Custom Grafana panels for application-level SLOs
Post 10: InnoDB Cluster, Group Replication & Vitess — Alternatives Compared
The ecosystem doesn't end at Galera:
- MySQL InnoDB Cluster (MySQL Shell + Group Replication + MySQL Router): Oracle's official HA stack, compared with PXC
- Group Replication vs Galera — same goal, different protocol, different trade-offs
- Vitess — horizontal sharding proxy, how YouTube runs MySQL at scale
- Orchestrator — standalone MySQL topology manager for async replication setups
- MaxScale — MariaDB's equivalent to ProxySQL
- Decision matrix: which stack fits which use case
Learning Paths
Path 1 — Developer who needs to understand HA
You write application code and want to understand how the database layer handles failures. You're not the primary DBA.
| Week | Focus | Posts |
|---|---|---|
| 1 | Replication concepts | Post 2 |
| 2 | Galera architecture | Post 3 |
| 3 | ProxySQL from app perspective | Post 5 (connection pooling, query routing) |
| 4 | Monitoring — reading dashboards | Post 9 |
Path 2 — DevOps / Platform engineer setting up HA for the first time
You need to get a cluster running, understand operations, and keep it healthy.
| Week | Focus | Posts |
|---|---|---|
| 1 | Replication fundamentals | Post 2 |
| 2 | Galera architecture | Post 3 |
| 3 | PXC setup (hands-on) | Post 4 |
| 4 | ProxySQL setup | Post 5 |
| 5 | Backup setup | Post 6 |
| 6 | Schema change strategy | Post 7 |
| 7 | Monitoring setup | Post 9 |
| 8 | Performance review | Post 8 |
Path 3 — Evaluating HA options before committing
You have a write-heavy MySQL workload and need to decide whether Galera, InnoDB Cluster, or something else is right.
| Reading | Posts |
|---|---|
| Start here | Post 3 (Galera architecture) |
| Compare alternatives | Post 10 |
| Understand operational cost | Post 7 (schema changes are the hardest part) |
| Understand monitoring investment | Post 9 |
Prerequisites
Before starting this series you should be comfortable with:
- MySQL basics — CREATE TABLE, SELECT, INSERT, UPDATE, EXPLAIN, user management
- Docker / Docker Compose — all examples run in containers
- Basic Linux operations: systemctl, log tailing, file editing
- Networking fundamentals: ports, firewall rules, what a proxy is
If you're weak on any of these:
- MySQL basics → SQL & NoSQL Roadmap
- Docker → Docker & Kubernetes Roadmap
- Networking → Networking for Web Developers
Galera vs The Alternatives — Quick Reference
| Feature | Async Replication | Semi-Sync | Galera / PXC | InnoDB Cluster | Vitess |
|---|---|---|---|---|---|
| Write destinations | 1 primary | 1 primary | All nodes | 1 primary | Varies (sharding) |
| Data loss on failover | Possible | Minimal | None | None (sync) | None |
| Failover automation | External tool | External tool | Built-in | MySQL Router | Built-in |
| Write throughput | Single-node limit | Single-node limit | Limited by wsrep overhead | Single-node limit | Horizontal scale |
| Read scale | Replicas (async) | Replicas | All nodes | Replicas | All nodes |
| Replication lag | Yes | Minimal | None | None | None |
| Operational complexity | Low | Medium | Medium-High | Medium | Very High |
| Best for | Read-heavy, DR | Mission-critical single-primary | Multi-master HA | Oracle ecosystem | Web-scale sharding |
Common Misconceptions
"Galera eliminates all latency" — No. Galera adds certification round-trip latency to every commit. WAN clusters feel this significantly. The trade-off is zero data loss, not zero latency.
"ProxySQL makes Galera truly multi-master" — Practically, ProxySQL routes writes to one node to avoid write conflict retries. Galera supports writes to any node, but your application needs to handle deadlock-like certification failures if you actually write to multiple nodes simultaneously.
"You can use MyISAM tables in a Galera cluster" — PXC Strict Mode will reject this. Galera only replicates InnoDB (transactional) writes. MyISAM writes are local-only.
"SST is fast" — SST is a full data copy from a donor node. For a 500 GB dataset, SST takes time. Size your gcache appropriately so rejoining nodes can use IST (incremental) instead.
"Galera solves database scaling" — Galera solves availability, not throughput. All writes touch all nodes. For write-intensive workloads beyond what one node can handle, you need sharding (Vitess) or a different database.
Essential Tools in This Ecosystem
| Tool | Purpose | Where it runs |
|---|---|---|
| Galera Cluster | Synchronous replication library | Bundled in PXC / MariaDB |
| Percona XtraDB Cluster | Production MySQL + Galera distribution | Database nodes |
| ProxySQL | MySQL proxy, query routing, pooling | Separate proxy layer |
| Percona XtraBackup | Non-blocking InnoDB backup | Database nodes / backup server |
| PMM Server | Monitoring, query analytics, dashboards | Dedicated monitoring server |
| PMM Client | Agent on each database node | Database nodes |
| Percona Toolkit | pt-osc, pt-table-checksum, pt-archiver | Ops workstation / cron |
| gh-ost | Triggerless online DDL | Ops workstation |
| Orchestrator | Topology management (for async setups) | Separate server |
| MySQL Shell | InnoDB Cluster management (alternative) | Ops workstation |
Related Series on This Blog
This series assumes MySQL as the database. For broader context:
- SQL & NoSQL Roadmap — SQL fundamentals, schema design, indexing, MongoDB, Redis
- Patterns of Distributed Systems — Consensus, leader election, replication patterns that Galera implements
- Docker & Kubernetes Roadmap — All examples in this series use Docker Compose; Kubernetes deployment is covered for PXC
- Load Balancing Explained — ProxySQL is a specialized Layer 7 proxy; this post gives the general foundation
- IoT Data Pipelines — High-write IoT workloads are a common reason teams reach for Galera
Let's Get Started
The next post dives into MySQL replication fundamentals — binlog formats, GTID, async vs semi-sync, and the practical failure modes that Galera was designed to solve.
If you already have solid replication knowledge, jump straight to Post 3: Galera Cluster Architecture — that's where the interesting stuff starts.