MySQL High Availability: Galera, Percona & ProxySQL Roadmap

Every production application eventually faces the same hard question: what happens to your database when a server dies?

If the answer is "we wait for someone to notice and restart it," you have a problem. If the answer is "we have async replication with a manual failover script," you have a different problem — one that surfaces at 3 AM during your on-call rotation. If the answer is "the cluster automatically re-elects, traffic re-routes, and users don't notice," you have a high-availability MySQL setup.

This roadmap teaches you how to get from that second answer to the third. We'll go from MySQL replication fundamentals all the way to running a production-grade Galera / Percona XtraDB Cluster with ProxySQL routing traffic intelligently in front of it, Percona XtraBackup keeping backups safe, and Percona Monitoring and Management (PMM) giving you full observability.


Why This Stack?

The open-source MySQL HA ecosystem has converged around a handful of battle-tested tools. Here's why these specific pieces matter:

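Galera Cluster is a synchronous multi-master replication library. Every write set is replicated to all nodes and certified before the commit is acknowledged, so a committed transaction is not lost when a single node fails. Nodes apply write sets asynchronously, which is why Galera is usually described as "virtually synchronous": a node can trail by a short apply delay, and flow control keeps that delay bounded. It is widely deployed in production.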

Percona XtraDB Cluster (PXC) is Percona's production-hardened distribution of Galera. It bundles Percona Server for MySQL (Percona's MySQL build, with its XtraDB enhancements to InnoDB) with the Galera library, and comes with Docker images, a Kubernetes operator, and Percona's own tooling ecosystem.

ProxySQL sits in front of your cluster as a smart proxy. It reads your MySQL traffic, routes SELECT statements to readers, routes writes to the primary, enforces connection pooling, handles failover health checks, and does query rewrites — all at wire speed with zero application change.

Percona XtraBackup takes non-blocking physical backups of InnoDB tables. No table locks, no downtime, no replication lag from backup operations. It's the industry standard for MySQL backup in production.

Percona Monitoring and Management (PMM) is a full observability stack for MySQL — query analytics, slow query profiling, InnoDB metrics, replication lag graphs, Galera flow control metrics — all in one UI, backed by Grafana and VictoriaMetrics.


The Problem with "Just Use Replication"

Standard MySQL async replication is good enough for read scaling, but it has well-known failure modes.

The core issue: replication is asynchronous by default. The primary acknowledges your COMMIT before replicas have received or applied the change. If the primary dies, these are your options:

  • Promoting a replica and accepting potential data loss (the replica might be seconds behind)
  • Waiting for replicas to catch up — but the primary is gone, so you can't
  • Failover scripts (MHA, Orchestrator) that automate promotion but can't recover already-lost transactions
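
The data-loss window described above can be made concrete with a toy model. This is only a sketch: the class and method names are invented for illustration and nothing here talks to a real MySQL server. It shows that whatever the primary has acknowledged but not yet shipped is exactly what a failover loses.

```python
from dataclasses import dataclass, field


@dataclass
class AsyncPrimary:
    """Toy model of a primary that acknowledges commits before replicas have them."""
    committed: list = field(default_factory=list)  # transactions acknowledged to clients
    shipped: int = 0                               # how many of those a replica has received

    def commit(self, txn: str) -> None:
        # The client gets its acknowledgement here, before any replica has the change.
        self.committed.append(txn)

    def ship_to_replica(self, count: int) -> None:
        # Replication happens later, in the background.
        self.shipped = min(len(self.committed), self.shipped + count)

    def lost_on_failover(self) -> list:
        # Everything acknowledged but not yet shipped is gone if the primary dies now.
        return self.committed[self.shipped:]


primary = AsyncPrimary()
for txn in ["t1", "t2", "t3", "t4"]:
    primary.commit(txn)
primary.ship_to_replica(2)           # the replica is two transactions behind
lost = primary.lost_on_failover()
print(lost)                          # ['t3', 't4']
```

Promotion tooling can pick the most up-to-date replica, but it cannot bring back `t3` and `t4` in this model, because they only ever existed on the dead primary.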

Semi-synchronous replication (built into MySQL) partially solves the data-loss problem, but it adds commit latency and falls back to async when no replica acknowledges within the configured timeout. You still need external tooling to automate failover.

Galera's approach is fundamentally different: every node can accept writes, every write set is certified across the cluster before commit, and a node that falls too far behind or rejoins after an outage is resynchronized automatically through IST or SST.


Architecture Overview

A minimal production Galera + ProxySQL stack is a three-node cluster with ProxySQL in front of it.

Key properties of this architecture:

  • Any node can accept writes — but ProxySQL typically routes writes to one node to minimize write conflicts
  • Cluster survives the loss of any single node (3-node quorum = majority at 2)
  • ProxySQL detects node failures via health checks and removes failed nodes from rotation in seconds
  • New nodes join automatically via SST (State Snapshot Transfer) — no manual restore needed
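
The quorum arithmetic behind "majority at 2" is easy to check by hand. The sketch below assumes every node has equal weight and that nodes fail unexpectedly rather than leaving the cluster gracefully. The function names are illustrative and not part of any Galera API.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest node count that is strictly more than half of the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can fail at once while the rest still form a majority."""
    return cluster_size - quorum_size(cluster_size)


def has_quorum(cluster_size: int, reachable: int) -> bool:
    """True if the nodes that can still see each other form a majority."""
    return reachable >= quorum_size(cluster_size)


for n in (2, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 2 2 0
# 3 2 1
# 4 3 1
# 5 3 2
```

The table shows why odd cluster sizes are the usual choice: a fourth node adds cost without tolerating any more failures than three nodes do.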

Series Breakdown

Phase 1 — MySQL Replication Fundamentals

Post 2: MySQL Replication Fundamentals (Async, Semi-Sync, GTID)

Before Galera makes sense, you need to understand what it's replacing and why. This post covers:

  • How binary log (binlog) replication works — statement-based vs row-based vs mixed
  • Async vs semi-sync replication — what you gain and what it costs
  • GTID (Global Transaction Identifiers) — understanding GTIDs is mandatory for Galera because PXC also uses transaction identifiers for certification
  • Replication lag — how to measure it, why it matters, and what Galera does instead
  • Common replication failure modes and how operators traditionally handled them
  • Why async replication is still useful (analytics replicas, cross-region disaster recovery)

This post sets the conceptual baseline. If you're already comfortable with MySQL replication internals, you can skim it.


Phase 2 — Galera Cluster Architecture

Post 3: Galera Cluster Architecture — Synchronous Multi-Master Replication

The core of the series. You'll understand:

  • The wsrep API — Galera's replication hook into MySQL/MariaDB
  • Write-set replication — how Galera packages a transaction, broadcasts it to all nodes, and certifies it atomically
  • Certification-based replication vs lock-based — why certification is faster for multi-master
  • Galera states: Open, Primary, Joiner, Joined, Synced, Donor
  • Flow control — the cluster-wide mechanism that throttles writes when a node falls behind, and how to tune it
  • Split-brain and quorum — what happens when nodes can't see each other, how quorum prevents data divergence
  • SST (State Snapshot Transfer) vs IST (Incremental State Transfer) — full copy vs catching up from gcache
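
To give a feel for certification before Post 3, here is a deliberately simplified model of it. It is not Galera's implementation: the `WriteSet` and `Certifier` names are invented, and the real certification index is more involved. The idea it illustrates is that each node checks an incoming write set against rows changed since the transaction started, and because every node sees write sets in the same order, every node reaches the same verdict without coordinating.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSet:
    keys: frozenset   # row keys modified by the transaction
    last_seen: int    # last global seqno applied when the transaction started


class Certifier:
    """Toy certification index: row key -> seqno of the last write set that changed it."""

    def __init__(self) -> None:
        self.index = {}
        self.seqno = 0

    def certify(self, ws: WriteSet) -> bool:
        self.seqno += 1  # every delivered write set takes a slot in the total order
        for key in ws.keys:
            if self.index.get(key, 0) > ws.last_seen:
                return False  # the row changed after this transaction started: conflict
        for key in ws.keys:
            self.index[key] = self.seqno
        return True


node = Certifier()
a = WriteSet(frozenset({"accounts:1"}), last_seen=0)
b = WriteSet(frozenset({"accounts:1"}), last_seen=0)  # concurrent with a, same row
c = WriteSet(frozenset({"accounts:2"}), last_seen=0)  # concurrent, different row
results = [node.certify(ws) for ws in (a, b, c)]
print(results)  # [True, False, True]
```

In this model the first writer wins and the second transaction on the same row is rejected, which the client would see as a failed commit it has to retry.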

Phase 3 — Percona XtraDB Cluster

Post 4: Percona XtraDB Cluster (PXC) — Galera Done Right

PXC is the recommended way to run Galera in production. This practical post covers:

  • PXC vs vanilla Galera on MariaDB — what's different and why Percona's builds are preferred for operational teams
  • Bootstrapping a 3-node cluster from scratch with Docker Compose
  • SST methods: xtrabackup-v2 (recommended, non-blocking donor), rsync (blocks the donor for the duration of the transfer), mariabackup (MariaDB alternative)
  • PXC Strict Mode — what operations are rejected and why (non-transactional tables, LOCK TABLES, explicit table locks)
  • Node operations: adding a node, removing a node gracefully, recovering a crashed node
  • PXC on Kubernetes with the Percona Operator
  • Configuration tunables: wsrep_cluster_address, wsrep_sst_method, wsrep_provider_options (gcache size, segment, EVS parameters)
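
As a preview of the configuration side, the sketch below renders the wsrep-related part of a node's config file. The variable names are real MySQL/PXC settings, but the provider path, cluster name, addresses and gcache size are assumptions chosen for illustration, and a real PXC node needs more than this fragment (for example, cluster traffic encryption settings).

```python
def render_wsrep_config(cluster_name, node_name, node_address, peers, gcache_size="1G"):
    """Render a [mysqld] fragment with the core wsrep settings for one node."""
    settings = {
        # Assumed library path: it differs between packages and distributions.
        "wsrep_provider": "/usr/lib/galera4/libgalera_smm.so",
        "wsrep_cluster_name": cluster_name,
        "wsrep_cluster_address": "gcomm://" + ",".join(peers),
        "wsrep_node_name": node_name,
        "wsrep_node_address": node_address,
        "wsrep_sst_method": "xtrabackup-v2",
        "wsrep_provider_options": f'"gcache.size={gcache_size}"',
        "pxc_strict_mode": "ENFORCING",
        "binlog_format": "ROW",
        "default_storage_engine": "InnoDB",
        "innodb_autoinc_lock_mode": "2",
    }
    lines = ["[mysqld]"] + [f"{key}={value}" for key, value in settings.items()]
    return "\n".join(lines)


# Hypothetical three-node cluster on a private network.
peers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
config = render_wsrep_config("pxc-demo", "node1", "10.0.0.1", peers)
print(config)
```

Generating the fragment from one function keeps the peer list identical on every node, which is one of the easier things to get wrong when editing three files by hand.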

Deep Dives

Post 5: ProxySQL — Smart Load Balancing & Query Routing

ProxySQL is the piece that makes your cluster invisible to applications. This deep dive covers:

  • ProxySQL architecture: multi-layer thread pool, in-memory SQLite config, runtime vs disk config
  • Hostgroups: writer_hostgroup, reader_hostgroup, offline_hostgroup, backup_writer_hostgroup
  • The Galera-aware ProxySQL scheduler (scheduler.sh that pings wsrep state and updates hostgroups automatically)
  • Query rules — routing reads to replicas, blocking dangerous queries, query rewriting
  • Connection pooling — transaction vs session vs multiplexed pooling modes
  • Query analytics in ProxySQL — slow query log, statistics tables, digests
  • Zero-downtime ProxySQL config changes via LOAD ... TO RUNTIME; SAVE ... TO DISK
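
Read/write splitting with query rules boils down to "first matching pattern decides the hostgroup". The sketch below imitates that idea in plain Python. It is not ProxySQL code: the hostgroup ids 10 and 20 are made up, and real rules live in ProxySQL's `mysql_query_rules` table rather than in a Python list.

```python
import re

WRITER_HG = 10  # hypothetical writer hostgroup id
READER_HG = 20  # hypothetical reader hostgroup id

# Ordered rules: the first pattern that matches decides the destination.
RULES = [
    (re.compile(r"^SELECT.*FOR UPDATE$", re.IGNORECASE), WRITER_HG),  # locking reads go to the writer
    (re.compile(r"^SELECT", re.IGNORECASE), READER_HG),               # plain reads go to readers
]


def route(query: str, default_hostgroup: int = WRITER_HG) -> int:
    """Return the hostgroup a query would be sent to under the rules above."""
    text = query.strip().rstrip(";")
    for pattern, hostgroup in RULES:
        if pattern.search(text):
            return hostgroup
    return default_hostgroup  # anything unmatched (INSERT, UPDATE, DDL) goes to the writer


print(route("SELECT name FROM users WHERE id = 1"))            # 20
print(route("SELECT * FROM accounts WHERE id = 1 FOR UPDATE")) # 10
print(route("UPDATE accounts SET balance = 0 WHERE id = 1"))   # 10
```

Rule order matters: if the plain `^SELECT` pattern came first, the `FOR UPDATE` query would be sent to a reader.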

Post 6: Backup & Recovery with Percona XtraBackup

The industry standard for production MySQL backup:

  • How physical backup (XtraBackup) differs from logical backup (mysqldump)
  • Full backup workflow — backup, prepare, restore
  • Incremental backup — LSN-based incremental strategy, how to chain incrementals
  • Streaming backups to S3: xbstream + xbcloud for remote backup storage
  • Point-in-time recovery (PITR) with XtraBackup + binlog position
  • Encryption at rest: --encrypt and --encrypt-key options
  • Backup verification: xtrabackup --prepare (innobackupex --apply-log on older 2.x releases) and validating restored data
  • Backup scheduling with cron, monitoring backup freshness
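
An incremental chain is only restorable if each incremental starts at the LSN where the previous backup ended. XtraBackup records these values in the `xtrabackup_checkpoints` file of each backup directory. The sketch below parses that `key = value` format and checks a chain; the LSN values are invented sample data, and the function names are illustrative.

```python
def parse_checkpoints(text: str) -> dict:
    """Parse the 'key = value' lines of an xtrabackup_checkpoints file."""
    fields = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def chain_is_valid(checkpoint_texts: list) -> bool:
    """True if the chain starts with a full backup and each step continues the previous one."""
    backups = [parse_checkpoints(text) for text in checkpoint_texts]
    if not backups or int(backups[0]["from_lsn"]) != 0:
        return False  # the base of the chain must be a full backup
    for previous, current in zip(backups, backups[1:]):
        if int(current["from_lsn"]) != int(previous["to_lsn"]):
            return False  # gap or overlap: this incremental does not follow the previous backup
    return True


# Invented sample contents of three xtrabackup_checkpoints files.
full = "backup_type = full-backuped\nfrom_lsn = 0\nto_lsn = 1626007\nlast_lsn = 1626007\n"
inc1 = "backup_type = incremental\nfrom_lsn = 1626007\nto_lsn = 4124244\nlast_lsn = 4124244\n"
inc2 = "backup_type = incremental\nfrom_lsn = 4124244\nto_lsn = 6938371\nlast_lsn = 6938371\n"

print(chain_is_valid([full, inc1, inc2]))  # True
print(chain_is_valid([full, inc2]))        # False
```

A check like this is cheap to run after every backup job and catches a missing incremental long before a restore is attempted.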

Post 7: Schema Changes in Galera: pt-osc, gh-ost & Percona Toolkit

Online DDL in Galera is the topic that catches most teams off-guard:

  • Why ALTER TABLE in Galera is dangerous — TOI (Total Order Isolation) blocks the entire cluster
  • RSU (Rolling Schema Upgrade) — run DDL node by node, desync from cluster first
  • pt-online-schema-change (pt-osc) — trigger-based online DDL that works with Galera
  • gh-ost — triggerless online DDL via binlog streaming (requires careful Galera config)
  • Other Percona Toolkit utilities: pt-archiver, pt-table-checksum, pt-table-sync
  • Decision matrix: when to use TOI vs RSU vs pt-osc vs gh-ost

Post 8: Tuning & Performance for Galera Clusters

Galera adds overhead — here's how to manage it:

  • Galera-specific InnoDB tunables: innodb_flush_log_at_trx_commit, sync_binlog, write-set buffer sizing
  • Flow control deep dive: wsrep_flow_control_paused, how to read it, how to reduce it
  • Workload patterns that hurt Galera — large transactions, bulk inserts, hot-row updates, DDL storms
  • Write scaling strategies — segment Galera (LAN vs WAN segments), sharding at application layer
  • gcache tuning — sizing the Galera write-set cache for faster IST joins
  • Benchmarking methodology: sysbench workloads for Galera, reading wsrep_* status variables
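
A common rule of thumb for gcache sizing is: write-set bytes per second, times the longest outage you want a node to survive without a full SST. The sketch below does that arithmetic. It assumes you sample the sum of the `wsrep_replicated_bytes` and `wsrep_received_bytes` status counters twice; the sample numbers and the 1.5x headroom factor are assumptions, not recommendations.

```python
def write_set_rate(bytes_start: int, bytes_end: int, seconds: float) -> float:
    """Average write-set traffic in bytes per second between two counter samples."""
    return (bytes_end - bytes_start) / seconds


def gcache_bytes_needed(rate_bytes_per_s: float, outage_seconds: float,
                        headroom: float = 1.5) -> int:
    """Bytes of gcache needed to cover an outage of the given length via IST."""
    return int(rate_bytes_per_s * outage_seconds * headroom)


# Invented samples: the two counters summed, taken 60 seconds apart.
rate = write_set_rate(1_000_000_000, 1_600_000_000, 60)
needed = gcache_bytes_needed(rate, outage_seconds=3600)

print(rate)    # 10000000.0 bytes/s
print(needed)  # 54000000000 bytes, about 54 GB for a one-hour outage
```

Measure the rate at peak write load rather than on average, otherwise the cache is too small exactly when a node is most likely to fall out.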

Post 9: Monitoring with PMM (Percona Monitoring and Management)

Visibility into your cluster is non-negotiable:

  • PMM architecture: pmm-server (Grafana + VictoriaMetrics + Alertmanager) + pmm-client agents
  • Query Analytics (QAN) — identify your slowest queries, see execution plans, track query trends
  • MySQL dashboards: InnoDB metrics, buffer pool hit rate, thread state, table I/O
  • Galera-specific dashboards: wsrep_cluster_size, flow control, replication lag, certification conflicts
  • Alerting rules — node down, replication lag, flow control %, disk I/O thresholds
  • PMM with Docker Compose and Kubernetes Helm chart
  • Custom Grafana panels for application-level SLOs
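
Whatever tool raises the alerts, the underlying checks come down to a handful of `wsrep_*` status variables. The sketch below evaluates one node's status, as you would get it from `SHOW GLOBAL STATUS LIKE 'wsrep_%'`, against simple rules. The status variable names are real; the thresholds, the function name and the alert wording are assumptions for illustration and are not PMM's built-in rules.

```python
def evaluate_node(status: dict, expected_size: int = 3, max_paused: float = 0.1) -> list:
    """Return a list of alert messages for one node; an empty list means healthy."""
    alerts = []
    if status.get("wsrep_cluster_status") != "Primary":
        alerts.append("node is not in the primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        alerts.append("node is not synced")
    if status.get("wsrep_ready") != "ON":
        alerts.append("node is not ready to accept queries")
    if int(status.get("wsrep_cluster_size", 0)) < expected_size:
        alerts.append("cluster is smaller than expected")
    if float(status.get("wsrep_flow_control_paused", 0)) > max_paused:
        alerts.append("flow control is pausing replication too often")
    return alerts


# Status values arrive as strings, exactly as the server reports them.
healthy = {
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_ready": "ON",
    "wsrep_cluster_size": "3",
    "wsrep_flow_control_paused": "0.02",
}
degraded = dict(healthy, wsrep_cluster_size="2", wsrep_flow_control_paused="0.35")

print(evaluate_node(healthy))   # []
print(evaluate_node(degraded))
```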

Post 10: InnoDB Cluster, Group Replication & Vitess — Alternatives Compared

The ecosystem doesn't end at Galera:

  • MySQL InnoDB Cluster (MySQL Shell + Group Replication + MySQL Router): Oracle's official HA stack, compared with PXC
  • Group Replication vs Galera — same goal, different protocol, different trade-offs
  • Vitess — horizontal sharding proxy, how YouTube runs MySQL at scale
  • Orchestrator — standalone MySQL topology manager for async replication setups
  • MaxScale — MariaDB's equivalent to ProxySQL
  • Decision matrix: which stack fits which use case

Learning Paths

Path 1 — Developer who needs to understand HA

You write application code and want to understand how the database layer handles failures. You're not the primary DBA.

| Week | Focus | Posts |
|------|-------|-------|
| 1 | Replication concepts | Post 2 |
| 2 | Galera architecture | Post 3 |
| 3 | ProxySQL from app perspective | Post 5 (connection pooling, query routing) |
| 4 | Monitoring: reading dashboards | Post 9 |

Path 2 — DevOps / Platform engineer setting up HA for the first time

You need to get a cluster running, understand operations, and keep it healthy.

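| Week | Focus | Posts |
|------|-------|-------|
| 1 | Replication fundamentals | Post 2 |
| 2 | Galera architecture | Post 3 |
| 3 | PXC setup (hands-on) | Post 4 |
| 4 | ProxySQL setup | Post 5 |
| 5 | Backup setup | Post 6 |
| 6 | Schema change strategy | Post 7 |
| 7 | Monitoring setup | Post 9 |
| 8 | Performance review | Post 8 |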

Path 3 — Evaluating HA options before committing

You have a write-heavy MySQL workload and need to decide whether Galera, InnoDB Cluster, or something else is right.

| Reading | Posts |
|---------|-------|
| Start here | Post 3 (Galera architecture) |
| Compare alternatives | Post 10 |
| Understand operational cost | Post 7 (schema changes are the hardest part) |
| Understand monitoring investment | Post 9 |

Prerequisites

Before starting this series you should be comfortable with:

  • MySQL basics — CREATE TABLE, SELECT, INSERT, UPDATE, EXPLAIN, user management
  • Docker / Docker Compose — all examples run in containers
  • Basic Linux operations: systemctl, log tailing, file editing
  • Networking fundamentals — ports, firewall rules, what a proxy is


Galera vs The Alternatives — Quick Reference

| Feature | Async Replication | Semi-Sync | Galera / PXC | InnoDB Cluster | Vitess |
|---------|-------------------|-----------|--------------|----------------|--------|
| Write destinations | 1 primary | 1 primary | All nodes | 1 primary | Varies (sharding) |
| Data loss on failover | Possible | Minimal | None | None (sync) | None |
| Failover automation | External tool | External tool | Built-in | MySQL Router | Built-in |
| Write throughput | Single-node limit | Single-node limit | Limited by wsrep overhead | Single-node limit | Horizontal scale |
| Read scale | Replicas (async) | Replicas | All nodes | Replicas | All nodes |
| Replication lag | Yes | Minimal | None | None | None |
| Operational complexity | Low | Medium | Medium-High | Medium | Very High |
| Best for | Read-heavy, DR | Mission-critical single-primary | Multi-master HA | Oracle ecosystem | Web-scale sharding |

Common Misconceptions

"Galera eliminates all latency" — No. Galera adds certification round-trip latency to every commit. WAN clusters feel this significantly. The trade-off is zero data loss, not zero latency.

"ProxySQL makes Galera truly multi-master" — Practically, ProxySQL routes writes to one node to avoid write conflict retries. Galera supports writes to any node, but your application needs to handle deadlock-like certification failures if you actually write to multiple nodes simultaneously.

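If you do write to several nodes, the application has to treat a failed commit as retryable. Galera reports a certification failure to the client as MySQL error 1213 (deadlock), so the usual pattern is to re-run the whole transaction a few times. The sketch below shows that pattern with a stand-in exception class: `DatabaseError` and `run_with_retry` are hypothetical names, and a real driver raises its own exception type carrying the error code.

```python
import time

ER_LOCK_DEADLOCK = 1213  # MySQL error code that Galera also uses for certification failures


class DatabaseError(Exception):
    """Stand-in for a driver exception that carries a MySQL error code (hypothetical)."""

    def __init__(self, errno: int, message: str = "") -> None:
        super().__init__(message)
        self.errno = errno


def run_with_retry(transaction, attempts: int = 3, base_delay: float = 0.05):
    """Run a callable that performs one whole transaction, retrying on error 1213."""
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except DatabaseError as exc:
            # Anything other than a deadlock, or the final attempt, is re-raised.
            if exc.errno != ER_LOCK_DEADLOCK or attempt == attempts:
                raise
            time.sleep(base_delay * attempt)  # small linear backoff before retrying


# Demo: a transaction that conflicts twice and then succeeds.
state = {"calls": 0}


def flaky_transaction():
    state["calls"] += 1
    if state["calls"] < 3:
        raise DatabaseError(ER_LOCK_DEADLOCK, "Deadlock found when trying to get lock")
    return "committed"


outcome = run_with_retry(flaky_transaction, base_delay=0)
print(outcome, state["calls"])  # committed 3
```

The callable must contain the entire transaction, including any reads it depends on, because the conflicting write may have changed the data those reads returned.
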
"You can use MyISAM tables in a Galera cluster" — PXC Strict Mode will reject this. Galera only replicates InnoDB (transactional) writes. MyISAM writes are local-only.

"SST is fast" — SST is a full data copy from a donor node. For a 500 GB dataset, SST takes time. Size your gcache appropriately so rejoining nodes can use IST (incremental) instead.

"Galera solves database scaling" — Galera solves availability, not throughput. All writes touch all nodes. For write-intensive workloads beyond what one node can handle, you need sharding (Vitess) or a different database.


Essential Tools in This Ecosystem

| Tool | Purpose | Where it runs |
|------|---------|---------------|
| Galera Cluster | Synchronous replication library | Bundled in PXC / MariaDB |
| Percona XtraDB Cluster | Production MySQL + Galera distribution | Database nodes |
| ProxySQL | MySQL proxy, query routing, pooling | Separate proxy layer |
| Percona XtraBackup | Non-blocking InnoDB backup | Database nodes / backup server |
| PMM Server | Monitoring, query analytics, dashboards | Dedicated monitoring server |
| PMM Client | Agent on each database node | Database nodes |
| Percona Toolkit | pt-osc, pt-table-checksum, pt-archiver | Ops workstation / cron |
| gh-ost | Triggerless online DDL | Ops workstation |
| Orchestrator | Topology management (for async setups) | Separate server |
| MySQL Shell | InnoDB Cluster management (alternative) | Ops workstation |

This series assumes MySQL as the database.


Let's Get Started

The next post dives into MySQL replication fundamentals — binlog formats, GTID, async vs semi-sync, and the practical failure modes that Galera was designed to solve.

If you already have solid replication knowledge, jump straight to Post 3: Galera Cluster Architecture — that's where the interesting stuff starts.

Continue to Post 2 → MySQL Replication Fundamentals
