The Challenge
Building high availability for PostgreSQL on AWS seems straightforward until you hit these real-world constraints:
- Multi-AZ with Different Subnets: Your instances are in different availability zones with different subnet ranges (10.0.1.0/24 vs 10.0.2.0/24)
- Private Network Only: Databases shouldn't be in public subnets (security best practice)
- Single Connection Point: Clients need one hostname, not IP management
- Zero Data Loss: Synchronous replication is non-negotiable
- Automatic Failover: No manual intervention during failures
Most solutions fail at the first constraint: Secondary Private IPs don't work across different subnets, Elastic IPs are public, and traditional virtual IPs (IPaddr2) only work within a single subnet.
The solution? Route53 Private Hosted Zone with Pacemaker automation.
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ VPC (10.0.0.0/16) │
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Subnet AZ-a │ │ Subnet AZ-b │ │
│ │ 10.0.1.0/24 │ │ 10.0.2.0/24 │ │
│ │ │ │ │ │
│ │ ┌────────────────┐ │ │ ┌────────────────┐ │ │
│ │ │ pg-primary │ │ │ │ pg-secondary │ │ │
│ │ │ 10.0.1.10 │◄─┼──┼─►│ 10.0.2.10 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ PostgreSQL ✓ │ │ │ │ (standby) │ │ │
│ │ └────────────────┘ │ │ └────────────────┘ │ │
│ └──────────────────────┘ └──────────────────────┘ │
│ │
│ Route53 Private: db.pgha.internal → 10.0.1.10 (active) │
│ DRBD: Synchronous replication (Protocol C) │
│ Pacemaker: Automatic failover + DNS updates │
└─────────────────────────────────────────────────────────────┘
Key Components:
- DRBD: Synchronous block-level replication (zero data loss)
- Pacemaker: Cluster resource manager (automatic failover)
- Route53 Private DNS: Single hostname that updates automatically
- PostgreSQL 17: Latest stable version
- Multi-AZ: True high availability across availability zones
Why This Stack?
DRBD (Distributed Replicated Block Device)
- Synchronous replication: Writes confirmed on both nodes before commit
- Zero data loss: No async lag like streaming replication
- Block-level: Works with any filesystem/database
- Battle-tested: Used in production for 20+ years
Pacemaker
- Industry standard: Used by Red Hat, SUSE, Ubuntu
- Automatic failover: No manual intervention
- Resource management: Handles DRBD, filesystem, PostgreSQL, DNS
- Constraint-based: Ensures correct startup order
Route53 Private Hosted Zone
- Multi-AZ support: Works across different subnets (unlike Secondary Private IP)
- Private: Not accessible from internet
- Automatic updates: Pacemaker updates DNS during failover
- Low cost: $0.50/month
- Single hostname: Clients connect to db.pgha.internal
Implementation Journey
The Problem I Hit
Initially, I tried using AWS Secondary Private IP (the common recommendation). It failed because:
AZ-a: 10.0.1.0/24 → VIP: 10.0.1.100 ✓
AZ-b: 10.0.2.0/24 → VIP: 10.0.1.100 ✗ (wrong subnet!)
When failover happened to AZ-b, the VIP (10.0.1.100) couldn't be assigned because it's in the wrong subnet range.
Lesson learned: Secondary Private IPs only work if both instances are in the same subnet or use the same IP range across AZs.
The Solution: Route53 Private DNS
Instead of moving an IP, update DNS:
Normal: db.pgha.internal → 10.0.1.10 (AZ-a)
Failover: db.pgha.internal → 10.0.2.10 (AZ-b)
The hostname stays the same, the IP changes automatically. This works across any subnet configuration.
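During the DNS switch there is a short window (up to the record TTL) where clients may still resolve the old address, so client scripts should retry on connection failure. A minimal sketch — the wrapper itself is illustrative; only `psql` and the `db.pgha.internal` hostname come from this setup:

```shell
#!/bin/bash
# retry_with_backoff CMD... : run CMD until it succeeds or MAX_ATTEMPTS is
# reached, sleeping DELAY seconds between attempts. During failover, DNS can
# point at the old primary for up to the 30s TTL, so transient connection
# failures are expected and should be retried, not treated as fatal.
retry_with_backoff() {
    local max_attempts=${MAX_ATTEMPTS:-6}
    local delay=${DELAY:-10}
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example usage against this cluster:
# retry_with_backoff psql -h db.pgha.internal -U postgres -d mydb -c 'SELECT 1;'
```

With a 30-second TTL and 10-second delay, six attempts comfortably cover the observed ~60-second failover window.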
Step-by-Step Implementation
1. Infrastructure Setup
VPC Configuration:
# Create VPC with 2 subnets in different AZs
VPC: 10.0.0.0/16
Subnet AZ-a: 10.0.1.0/24
Subnet AZ-b: 10.0.2.0/24
Security Group (Critical - Port 2224 often missed):
TCP 22 - SSH
TCP 5432 - PostgreSQL
TCP 7789 - DRBD replication
TCP 2224 - pcsd (Pacemaker) ← CRITICAL!
UDP 5404-5405 - Corosync
ICMP (all types) - Cluster heartbeat
Instances:
- 2x CentOS Stream 9 (t3.medium)
- 20GB EBS volumes for DRBD
- IAM instance profile with EC2 and Route53 permissions
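The instance profile needs the EC2 calls used by fence_aws and the Route53 calls used by the DNS resource agent below. A minimal policy sketch — the action list is my reading of what this post's commands require; scope the `Resource` fields to your own instances and hosted zone rather than `*`:

```shell
# Write a minimal IAM policy for the instance profile. The ec2:* actions are
# needed by fence_aws, the route53:* actions by the custom DNS resource agent.
cat > pgha-iam-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:StartInstances",
        "ec2:StopInstances",
        "ec2:RebootInstances"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "route53:ChangeResourceRecordSets",
        "route53:ListResourceRecordSets"
      ],
      "Resource": "*"
    }
  ]
}
EOF
# Attach with: aws iam put-role-policy --role-name <your-role> \
#   --policy-name pgha --policy-document file://pgha-iam-policy.json
```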
2. DRBD Installation
The tricky part: DRBD kernel modules must come from CentOS Stream kmod repository (not ELRepo).
# Download DRBD kernel module
curl -O https://mirror.stream.centos.org/SIGs/9-stream/kmods/x86_64/packages-main/DriverDiscs/kmod-drbd-5.14.0~688-9.3.1~1.el9s.x86_64.iso
# Mount and install
sudo mount -o loop kmod-drbd*.iso /mnt/drbd-iso
sudo dnf install -y /mnt/drbd-iso/rpms/x86_64/*.rpm
# Install utilities and Pacemaker resource agent
sudo dnf install -y epel-release
sudo dnf install -y drbd-utils drbd-pacemaker
# Load module
sudo modprobe drbd
Critical discovery: The drbd-pacemaker package provides the OCF resource agent that enables Pacemaker to automatically promote/demote DRBD. Without it, you're stuck with manual DRBD management.
3. DRBD Configuration
# /etc/drbd.d/pgdata.res
resource pgdata {
    protocol C;  # Synchronous replication
    on pg-primary {
        device    /dev/drbd0;
        disk      /dev/nvme1n1;
        address   10.0.1.10:7789;
        meta-disk internal;
    }
    on pg-secondary {
        device    /dev/drbd0;
        disk      /dev/nvme1n1;
        address   10.0.2.10:7789;
        meta-disk internal;
    }
}
Initialize DRBD:
# On both nodes
sudo drbdadm create-md pgdata --force
sudo drbdadm up pgdata
# On primary only
sudo drbdadm primary pgdata --force
# Initial sync (20GB took ~30 seconds)
sudo drbdadm status
4. PostgreSQL Setup
# Install PostgreSQL 17
sudo dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-9-x86_64/pgdg-redhat-repo-latest.noarch.rpm
sudo dnf -qy module disable postgresql
sudo dnf install -y postgresql17-server
# Create filesystem on DRBD (primary only)
sudo mkfs.ext4 /dev/drbd0
sudo mkdir -p /drbd-data
sudo mount /dev/drbd0 /drbd-data
# Initialize PostgreSQL on DRBD
sudo mkdir -p /drbd-data/postgresql/17/data
sudo chown -R postgres:postgres /drbd-data/postgresql
sudo -u postgres /usr/pgsql-17/bin/initdb -D /drbd-data/postgresql/17/data
# Configure network access (the server isn't running yet, so edit the config files directly)
echo "listen_addresses = '*'" | sudo tee -a /drbd-data/postgresql/17/data/postgresql.conf
echo "host all all 10.0.0.0/16 trust" | sudo tee -a /drbd-data/postgresql/17/data/pg_hba.conf
5. Pacemaker Cluster
# Install Pacemaker (both nodes)
sudo dnf config-manager --set-enabled highavailability
sudo dnf install -y pacemaker pcs corosync fence-agents-all resource-agents
sudo systemctl enable --now pcsd
echo 'hacluster:hacluster' | sudo chpasswd  # demo password - use a strong one in production
# Create cluster (one node)
sudo pcs host auth pg-primary pg-secondary -u hacluster -p hacluster
sudo pcs cluster setup pgcluster pg-primary pg-secondary
sudo pcs cluster start --all
sudo pcs cluster enable --all
# Configure cluster properties
sudo pcs property set stonith-enabled=false  # temporary; re-enabled in step 8 once fencing is configured
sudo pcs property set no-quorum-policy=ignore  # needed for a two-node cluster (no majority possible)
6. Configure Resources
# DRBD resource (promotable clone)
sudo pcs resource create pgdata_drbd ocf:linbit:drbd \
drbd_resource=pgdata \
op monitor interval=10s
sudo pcs resource promotable pgdata_drbd \
promoted-max=1 promoted-node-max=1 \
clone-max=2 clone-node-max=1 notify=true
# Filesystem resource
sudo pcs resource create pgdata_fs Filesystem \
device=/dev/drbd0 \
directory=/drbd-data \
fstype=ext4 \
op monitor interval=20s
# PostgreSQL resource
sudo pcs resource create postgresql pgsql \
pgctl=/usr/pgsql-17/bin/pg_ctl \
psql=/usr/pgsql-17/bin/psql \
pgdata=/drbd-data/postgresql/17/data \
op start timeout=60s \
op stop timeout=60s \
op monitor interval=10s
# Constraints (critical ordering)
sudo pcs constraint colocation add pgdata_fs with pgdata_drbd-clone INFINITY with-rsc-role=Promoted  # use Master on older pcs releases
sudo pcs constraint order promote pgdata_drbd-clone then start pgdata_fs
sudo pcs constraint colocation add postgresql with pgdata_fs INFINITY
sudo pcs constraint order pgdata_fs then postgresql
7. Route53 Private DNS
Create Private Hosted Zone:
aws route53 create-hosted-zone \
--name pgha.internal \
--vpc VPCRegion=ap-southeast-2,VPCId=vpc-xxx \
--caller-reference $(date +%s) \
--hosted-zone-config PrivateZone=true
Create Custom Resource Agent (save as /usr/lib/ocf/resource.d/heartbeat/route53-private and make it executable with chmod 755):
#!/bin/bash
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
get_local_ip() {
    # IMDSv2: fetch a session token first (works on instances that enforce IMDSv2)
    local token
    token=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
    curl -s -H "X-aws-ec2-metadata-token: $token" \
        http://169.254.169.254/latest/meta-data/local-ipv4
}
route53_start() {
    INSTANCE_IP=$(get_local_ip)
    aws route53 change-resource-record-sets \
        --hosted-zone-id ${OCF_RESKEY_hosted_zone_id} \
        --change-batch "{
            \"Changes\": [{
                \"Action\": \"UPSERT\",
                \"ResourceRecordSet\": {
                    \"Name\": \"${OCF_RESKEY_hostname}\",
                    \"Type\": \"A\",
                    \"TTL\": ${OCF_RESKEY_ttl:-30},
                    \"ResourceRecords\": [{\"Value\": \"$INSTANCE_IP\"}]
                }
            }]
        }" >/dev/null 2>&1
    return $?
}
route53_stop() {
    return $OCF_SUCCESS
}
route53_monitor() {
    INSTANCE_IP=$(get_local_ip)
    CURRENT_IP=$(aws route53 list-resource-record-sets \
        --hosted-zone-id ${OCF_RESKEY_hosted_zone_id} \
        --query "ResourceRecordSets[?Name=='${OCF_RESKEY_hostname}.'].ResourceRecords[0].Value" \
        --output text)
    [ "$CURRENT_IP" = "$INSTANCE_IP" ] && return $OCF_SUCCESS || return $OCF_NOT_RUNNING
}
# ... (full agent code in GitHub)
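Before wiring the agent into Pacemaker, the UPSERT payload can be sanity-checked on its own. This helper is illustrative — it just rebuilds the same change-batch JSON the agent sends, so you can eyeball it or run it through a JSON validator:

```shell
# build_change_batch HOSTNAME IP [TTL] : print the Route53 change-batch JSON
# that the resource agent submits via `aws route53 change-resource-record-sets`.
build_change_batch() {
    local hostname=$1 ip=$2 ttl=${3:-30}
    cat <<EOF
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "$hostname",
      "Type": "A",
      "TTL": $ttl,
      "ResourceRecords": [{"Value": "$ip"}]
    }
  }]
}
EOF
}

# Example:
# build_change_batch db.pgha.internal 10.0.2.10 | python3 -m json.tool
```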
Add Route53 Resource:
sudo pcs resource create cluster_dns ocf:heartbeat:route53-private \
hosted_zone_id=/hostedzone/Z0858847QEI548E3AXOM \
hostname=db.pgha.internal \
ttl=30 \
op start timeout=60s \
op stop timeout=60s \
op monitor interval=30s timeout=20s
# Constraints
sudo pcs constraint colocation add cluster_dns with postgresql INFINITY
sudo pcs constraint order postgresql then cluster_dns
8. STONITH Fencing (Required for Production)
Why STONITH is Critical: Prevents split-brain scenarios where both nodes think they're primary, which can cause data corruption. STONITH (Shoot The Other Node In The Head) forcibly stops a failed node via AWS API.
# Install fence agent (both nodes)
sudo dnf install -y fence-agents-aws
# Create fence devices
sudo pcs stonith create fence_primary fence_aws \
region=ap-southeast-2 \
pcmk_host_map='pg-primary:i-08ead9f4b36143bac' \
power_timeout=240 \
pcmk_reboot_timeout=480 \
pcmk_reboot_action=off \
op monitor interval=60s
sudo pcs stonith create fence_secondary fence_aws \
region=ap-southeast-2 \
pcmk_host_map='pg-secondary:i-008a5be659e720bd5' \
power_timeout=240 \
pcmk_reboot_timeout=480 \
pcmk_reboot_action=off \
op monitor interval=60s
# Prevent fence devices from running on the node they fence
sudo pcs constraint location fence_primary avoids pg-primary
sudo pcs constraint location fence_secondary avoids pg-secondary
# Enable STONITH
sudo pcs property set stonith-enabled=true
How it works: When a node becomes unresponsive, the surviving node calls AWS EC2 API to stop the failed instance, ensuring it can't cause split-brain.
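A typo in `pcmk_host_map` (wrong hostname or instance ID) breaks fencing silently until the day you need it. A quick format check — an illustrative helper, not part of pcs — catches the obvious mistakes:

```shell
# check_host_map 'host:instance-id' : verify a pcmk_host_map entry looks like
# a hostname mapped to an EC2 instance ID (i- followed by 8 or 17 hex chars).
check_host_map() {
    local entry=$1
    if echo "$entry" | grep -Eq '^[A-Za-z0-9.-]+:i-[0-9a-f]{8}([0-9a-f]{9})?$'; then
        echo "OK: $entry"
        return 0
    fi
    echo "BAD: $entry" >&2
    return 1
}

# Example:
# check_host_map 'pg-primary:i-08ead9f4b36143bac'   # OK
# check_host_map 'pg-primary:i-notanid'             # BAD
```

Pair this with a real fencing drill (`pcs stonith fence <node>`) in a maintenance window before trusting the cluster in production.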
Testing Results
Test 1: Automatic Failover (Instance Stop)
Scenario: Stop primary EC2 instance, verify automatic failover.
# Stop primary
aws ec2 stop-instances --instance-ids i-08ead9f4b36143bac
# Wait and observe...
Timeline:
| Time | Event |
|---|---|
| T+0s | Primary instance stopped |
| T+10s | Pacemaker detected node offline |
| T+20s | DRBD promoted on secondary |
| T+25s | Filesystem mounted on secondary |
| T+30s | PostgreSQL started on secondary |
| T+35s | Route53 DNS updated (10.0.1.10 → 10.0.2.10) |
| T+60s | Failover complete |
Result:
-- Before failover
psql -h db.pgha.internal -d testdb -c "SELECT inet_server_addr();"
inet_server_addr
------------------
10.0.1.10
-- After failover (automatic)
psql -h db.pgha.internal -d testdb -c "SELECT inet_server_addr();"
inet_server_addr
------------------
10.0.2.10
Data Integrity: ✅ Zero data loss (all rows preserved)
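The DNS flip in the timeline above can be timed from a client with a small poll loop. A generic sketch — the loop is illustrative; the `getent` example in the comment assumes the zone and IPs from this setup:

```shell
# wait_for_change EXPECTED_OLD CMD... : poll CMD until its output differs from
# EXPECTED_OLD, then print the new value and elapsed seconds. Handy for
# measuring how long the Route53 record takes to flip during a failover drill.
wait_for_change() {
    local old=$1; shift
    local start now cur
    start=$(date +%s)
    while :; do
        cur=$("$@")
        if [ "$cur" != "$old" ]; then
            now=$(date +%s)
            echo "changed to $cur after $((now - start))s"
            return 0
        fi
        sleep "${POLL_INTERVAL:-2}"
    done
}

# Example (run from a client in the VPC while stopping the primary):
# wait_for_change 10.0.1.10 sh -c "getent hosts db.pgha.internal | awk '{print \$1}'"
```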
Test 2: Primary Rejoin as Standby
Scenario: Start primary instance, verify it rejoins as standby.
# Start primary
aws ec2 start-instances --instance-ids i-08ead9f4b36143bac
# Wait 60 seconds...
Result:
$ sudo pcs status resources
* Clone Set: pgdata_drbd-clone [pgdata_drbd] (promotable):
* Promoted: [ pg-secondary ]
* Unpromoted: [ pg-primary ] ← Rejoined as standby
* pgdata_fs: Started pg-secondary
* postgresql: Started pg-secondary
* cluster_dns: Started pg-secondary
$ sudo drbdadm status
pgdata role:Primary (pg-secondary)
disk:UpToDate
pg-primary role:Secondary ← Synced as standby
peer-disk:UpToDate
Behavior: ✅ Primary correctly rejoined as standby without disrupting secondary
Performance Metrics
| Metric | Value | Notes |
|---|---|---|
| Failover Time | 60 seconds | Detection + migration + DNS |
| Data Loss | 0 bytes | DRBD synchronous replication |
| DRBD Sync | 30 seconds | 20GB initial sync |
| DNS TTL | 30 seconds | Configurable |
| Detection Time | 10 seconds | Pacemaker monitoring |
| Rejoin Time | 60 seconds | Automatic |
Key Learnings
What Worked Perfectly ✅
- Route53 for Multi-AZ Different Subnets: Only solution that works across different subnet ranges
- drbd-pacemaker Package: Critical for automatic DRBD promotion
- Port 2224: Often missed but required for pcsd communication
- Automatic Rejoin: Primary rejoins as standby without manual intervention
- Zero Data Loss: DRBD synchronous replication works flawlessly
Gotchas to Avoid ⚠️
- Port 2224: Cluster authentication fails without it
- DRBD Source: Must use CentOS Stream kmod repo, not ELRepo
- drbd-pacemaker: Without this package, no automatic DRBD management
- PostgreSQL Network Config: Must configure listen_addresses and pg_hba.conf
- IAM Permissions: Need EC2, Route53, and fencing permissions
- STONITH Required: Don't disable STONITH in production (split-brain risk)
Why Not Other Solutions?
Secondary Private IP:
- ❌ Doesn't work across different subnet ranges
- ✅ Would work if both instances in same subnet
Network Load Balancer:
- ✅ Works across any subnets
- ❌ Costs $16/month extra
- ❌ Adds connection tracking overhead
IPaddr2 + Gratuitous ARP:
- ❌ Only works in same subnet
- ❌ Not suitable for multi-AZ
Elastic IP:
- ❌ Public IP (security issue for databases)
- ❌ Not suitable for private subnets
Production Readiness
What's Tested ✅
- [x] Automatic failover (instance stop)
- [x] Automatic rejoin (instance start)
- [x] DNS automatic updates
- [x] Data integrity (zero loss)
- [x] Multi-AZ different subnets
- [x] DRBD synchronous replication
- [x] Resource ordering and constraints
- [x] STONITH fencing enabled and working
Before Production Deployment
Must Have:
- Test STONITH fencing - Verify fence_aws can stop instances
- Implement backup procedures (pg_dump + EBS snapshots)
- Replace trust authentication with md5/scram-sha-256
- Test under production load (pgbench)
- Document operational procedures
- Set up monitoring alerts (CloudWatch + SNS)
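The snapshot half of that backup routine also needs pruning, or old EBS snapshots accumulate forever. A retention helper sketch — illustrative, with the `describe-snapshots` query in the comment as the assumed data source; the selected IDs would be fed to `aws ec2 delete-snapshot`:

```shell
# list_expired RETENTION_DAYS : read "date snapshot-id" lines on stdin and
# print the ids whose date is older than the retention window.
list_expired() {
    local days=$1
    local cutoff
    cutoff=$(date -d "-${days} days" +%s)
    while read -r snap_date snap_id; do
        [ -z "$snap_date" ] && continue
        if [ "$(date -d "$snap_date" +%s)" -lt "$cutoff" ]; then
            echo "$snap_id"
        fi
    done
}

# Example (feed real data with:
#   aws ec2 describe-snapshots --owner-ids self \
#     --query 'Snapshots[].[StartTime,SnapshotId]' --output text):
# printf '2024-01-01 snap-old\n2099-01-01 snap-new\n' | list_expired 14
```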
Operational Commands
Daily Operations
# Check cluster status
sudo pcs status
# Check DRBD status
sudo drbdadm status
# Check DNS
aws route53 list-resource-record-sets \
--hosted-zone-id /hostedzone/Z0858847QEI548E3AXOM \
--query "ResourceRecordSets[?Name=='db.pgha.internal.']"
# Connect to database
psql -h db.pgha.internal -U postgres -d mydb
Manual Failover
# Move to secondary
sudo pcs node standby pg-primary
# Wait 60 seconds for DNS propagation
sleep 60
# Verify
psql -h db.pgha.internal -c "SELECT inet_server_addr();"
# Failback
sudo pcs node unstandby pg-primary
Troubleshooting
# View Pacemaker logs
sudo journalctl -u pacemaker -f
# View DRBD logs
sudo dmesg | grep drbd
# Cleanup failed resources
sudo pcs resource cleanup
# Force DRBD sync
sudo drbdadm invalidate pgdata # On secondary
Monitoring Setup
CloudWatch Metrics
# Install CloudWatch agent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/centos/amd64/latest/amazon-cloudwatch-agent.rpm
sudo rpm -U amazon-cloudwatch-agent.rpm
# Configure metrics: CPU, Memory, Disk, Pacemaker logs
# (Full config in GitHub)
CloudWatch Alarms
# CPU alarm
aws cloudwatch put-metric-alarm \
--alarm-name pacemaker-cpu-high \
--metric-name CPU_IDLE \
--threshold 20 \
--comparison-operator LessThanThreshold
# Disk alarm
aws cloudwatch put-metric-alarm \
--alarm-name pacemaker-disk-high \
--metric-name DISK_USED \
--threshold 80 \
--comparison-operator GreaterThanThreshold
Conclusion
Building PostgreSQL HA on AWS with Pacemaker, DRBD, and Route53 provides:
✅ True High Availability: Automatic failover in ~60 seconds
✅ Zero Data Loss: DRBD synchronous replication
✅ Multi-AZ Support: Works across different subnet ranges
✅ Cost Effective: ~$75/month vs $200+ for RDS
✅ Production Ready: Fully tested and documented
The key insight: Route53 Private Hosted Zone is the only solution that works across different subnet ranges in different AZs. While it adds 30-60 seconds to failover time (DNS propagation), it's the most reliable and cost-effective approach for true multi-AZ HA.
When to Use This Solution
Good Fit:
- Need true multi-AZ HA with different subnets
- Want zero data loss (synchronous replication)
- Need full control over PostgreSQL configuration
- Can tolerate 60-second failover time
Not a Good Fit:
- Need sub-second failover (use connection pooling + streaming replication)
- Want fully managed solution (use RDS Multi-AZ)
- Can't tolerate DNS propagation delay
- Don't have ops team to manage cluster
Documentation:
- DRBD User's Guide: https://linbit.com/drbd-user-guide/
- Pacemaker Documentation: https://clusterlabs.org/pacemaker/
- PostgreSQL HA Documentation: https://www.postgresql.org/docs/current/high-availability.html