Use Case: Manual Blue/Green setup for major version upgrades with logical replication
Tested: Aurora PostgreSQL 16.13 → 17.9
Overview
This guide shows how to manually replicate AWS RDS Blue/Green deployment functionality using PostgreSQL logical replication (publications and subscriptions). This is useful when:
- AWS Blue/Green deployments are not suitable for your setup due to limitations
- You need more control over the process
- You want to understand the underlying mechanism
- You need to customize the replication setup
Prerequisites
Blue Cluster (Source)
- Aurora PostgreSQL cluster (any version)
- Logical replication enabled:
rds.logical_replication = 1 - Sufficient WAL retention for replication lag
- Network access from Green cluster
Green Cluster (Target)
- Aurora PostgreSQL cluster (can be different version)
- Logical replication enabled:
rds.logical_replication = 1 - Same database structure as Blue (will be cloned)
- Network access to Blue cluster
Required Permissions
- Superuser or
rds_superuserrole rds_replicationrole- Ability to create publications, slots, and subscriptions
Step-by-Step Setup
Phase 1: Prepare Blue Cluster (Publisher)
Step 1.1: Enable Logical Replication on Blue
# Create parameter group with logical replication
aws rds create-db-cluster-parameter-group \
--db-cluster-parameter-group-name blue-logical-replication \
--db-parameter-group-family aurora-postgresql16 \
--description "Blue cluster with logical replication"
# Enable logical replication
aws rds modify-db-cluster-parameter-group \
--db-cluster-parameter-group-name blue-logical-replication \
--parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"
# Apply to cluster
aws rds modify-db-cluster \
--db-cluster-identifier blue-cluster \
--db-cluster-parameter-group-name blue-logical-replication \
--apply-immediately
# Reboot instance to apply
aws rds reboot-db-instance \
--db-instance-identifier blue-instance-1
Verify:
-- Connect to Blue
SHOW wal_level;
-- Expected: logical
SELECT name, setting
FROM pg_settings
WHERE name = 'rds.logical_replication';
-- Expected: on
Step 1.2: Create Publications on Blue
For each database you want to replicate:
-- Connect to Blue cluster
-- Switch to each database
-- Database: postgres
\c postgres
CREATE PUBLICATION blue_green_pub_postgres FOR ALL TABLES;
-- Database: testdb
\c testdb
CREATE PUBLICATION blue_green_pub_testdb FOR ALL TABLES;
-- Database: myapp
\c myapp
CREATE PUBLICATION blue_green_pub_myapp FOR ALL TABLES;
Naming Convention: blue_green_pub_<database_name>
Verify:
-- In each database
SELECT pubname, puballtables
FROM pg_publication
WHERE pubname LIKE 'blue_green_pub_%';
-- Expected output:
-- pubname | puballtables
-- blue_green_pub_postgres | t
-- blue_green_pub_testdb | t
Step 1.3: Create Replication Slots on Blue
For each database:
-- Connect to Blue cluster
-- Switch to each database
-- Database: postgres
\c postgres
SELECT pg_create_logical_replication_slot(
'blue_green_slot_postgres',
'pgoutput'
);
-- Database: testdb
\c testdb
SELECT pg_create_logical_replication_slot(
'blue_green_slot_testdb',
'pgoutput'
);
-- Database: myapp
\c myapp
SELECT pg_create_logical_replication_slot(
'blue_green_slot_myapp',
'pgoutput'
);
Naming Convention: blue_green_slot_<database_name>
Verify:
-- In each database
SELECT slot_name, plugin, slot_type, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';
-- Expected output:
-- slot_name | plugin | slot_type | active | restart_lsn
-- blue_green_slot_postgres | pgoutput | logical | f | 0/1234567
-- blue_green_slot_testdb | pgoutput | logical | f | 0/1234567
Note: active = f because no subscriber is connected yet.
Phase 2: Clone Green Cluster from Blue
Step 2.1: Clone Green from Blue (Aurora Fast Clone)
For Aurora PostgreSQL (Recommended - Fast Clone):
Aurora supports fast cloning using copy-on-write technology, which is much faster than snapshot restore:
# Clone Green cluster from Blue (copy-on-write)
aws rds restore-db-cluster-to-point-in-time \
--source-db-cluster-identifier blue-cluster \
--db-cluster-identifier green-cluster \
--restore-type copy-on-write \
--use-latest-restorable-time \
--db-cluster-parameter-group-name green-logical-replication \
--vpc-security-group-ids sg-xxxxx
# Wait for clone cluster to be available
aws rds wait db-cluster-available \
--db-cluster-identifier green-cluster
# Create writer instance in Green cluster
aws rds create-db-instance \
--db-instance-identifier green-instance-1 \
--db-cluster-identifier green-cluster \
--db-instance-class db.r6g.large \
--engine aurora-postgresql
# Wait for instance to be available
aws rds wait db-instance-available \
--db-instance-identifier green-instance-1
# (Optional) Create reader instance for high availability
aws rds create-db-instance \
--db-instance-identifier green-instance-2 \
--db-cluster-identifier green-cluster \
--db-instance-class db.r6g.large \
--engine aurora-postgresql
Key Parameters:
--restore-type copy-on-write: Creates a fast clone (not a full copy)--use-latest-restorable-time: Uses the latest data from Blue- Clone is created in seconds (not minutes like snapshot restore)
For RDS PostgreSQL (Alternative - Snapshot Restore):
RDS PostgreSQL doesn't support fast cloning, so use snapshot restore:
# Create manual snapshot
aws rds create-db-snapshot \
--db-instance-identifier blue-instance \
--db-snapshot-identifier blue-snapshot-for-green
# Wait for snapshot to complete
aws rds wait db-snapshot-available \
--db-snapshot-identifier blue-snapshot-for-green
# Restore Green from snapshot
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier green-instance \
--db-snapshot-identifier blue-snapshot-for-green \
--db-instance-class db.r6g.large \
--db-parameter-group-name green-logical-replication \
--vpc-security-group-ids sg-xxxxx
# Wait for Green to be available
aws rds wait db-instance-available \
--db-instance-identifier green-instance
Result: Green now has all data from Blue at the clone/snapshot time.
Step 2.2: Get Snapshot LSN
Critical: You need to know the LSN at snapshot time to advance slots correctly.
Method 1: For Aurora PostgreSQL (Recommended for Aurora):
Aurora provides a special function to get the exact LSN where the clone/snapshot was taken:
-- Connect to Green cluster (after it's available)
SELECT aurora_volume_logical_start_lsn();
-- Example output:
-- aurora_volume_logical_start_lsn
-- -------------------------------
-- 0/402E2F0
-- (1 row)
Save this LSN value - you'll use it to advance replication slots on Blue.
# Get snapshot LSN from Green
SNAPSHOT_LSN=$(psql -h green-endpoint -U postgres -d postgres -t -c "SELECT aurora_volume_logical_start_lsn();" | xargs)
echo "Snapshot LSN: $SNAPSHOT_LSN"
# Example: 0/402E2F0
Method 2: For RDS PostgreSQL (Recommended for RDS):
RDS PostgreSQL doesn't have aurora_volume_logical_start_lsn(), but you can get the exact LSN from the PostgreSQL error log during Green's first startup.
Step 2.2a: Check Green's PostgreSQL Log
After Green is restored and starts up, check the PostgreSQL log for the "invalid record" message:
# View recent PostgreSQL logs
aws rds download-db-log-file-portion \
--db-instance-identifier green-instance-1 \
--log-file-name error/postgresql.log.2026-05-02-10 \
--region ap-northeast-1 \
--output text | grep -A2 -B2 "invalid record"
Look for this pattern:
LOG: redo starts at 0/100000A0
LOG: invalid record length at 0/1000FB78: expected at least 24, got 0
DEBUG: Set Logical Seed LSN at 0/1000FB78
LOG: redo done at 0/1000FB40
The LSN you need is: 0/1000FB78 (from the "invalid record length" line)
What this means:
- PostgreSQL replayed all WAL up to
0/1000FB40(last valid record) - Tried to read next record at
0/1000FB78but found no data (end of WAL) - This
0/1000FB78is the exact boundary where the snapshot was taken - This is the LSN you'll use to advance replication slots on Blue
Why "invalid record" is GOOD:
- It's NOT an error - it's PostgreSQL's way of marking the end of available WAL
- It ensures no gaps - PostgreSQL validated every record up to this point
- This is the precise LSN where logical replication should start
# Extract the LSN from logs
SNAPSHOT_LSN=$(aws rds download-db-log-file-portion \
--db-instance-identifier green-instance-1 \
--log-file-name error/postgresql.log.2026-05-02-10 \
--region ap-northeast-1 \
--output text | \
grep "invalid record length" | \
grep -oE '0/[0-9A-F]+' | \
head -1)
echo "Snapshot LSN: $SNAPSHOT_LSN"
# Example: 0/1000FB78
Method 3: For Standard RDS PostgreSQL (Alternative - Less Accurate):
If you can't access logs, query Blue's LSN immediately after snapshot completes:
# After snapshot completes, immediately run on Blue:
SNAPSHOT_LSN=$(psql -h blue-endpoint -U postgres -d postgres -t -c "SELECT pg_current_wal_lsn();" | xargs)
echo "Snapshot LSN: $SNAPSHOT_LSN"
Note: This is less accurate because Blue continues writing after the snapshot is taken.
When this method is SAFE:
- If you can pause application writes to the database during snapshot
- Stop application → Take snapshot → Get LSN → Resume application
- This ensures the LSN matches the snapshot exactly
Procedure for safe Method 3:
# 1. Stop application writes
echo "Stopping application..."
# (stop your application or set database to read-only)
# 2. Take snapshot
aws rds create-db-snapshot \
--db-instance-identifier blue-instance \
--db-snapshot-identifier blue-snapshot-$(date +%s)
# 3. Immediately get LSN (while writes are still paused)
SNAPSHOT_LSN=$(psql -h blue-endpoint -U postgres -d postgres -t -c "SELECT pg_current_wal_lsn();" | xargs)
echo "Snapshot LSN: $SNAPSHOT_LSN"
# 4. Resume application writes
echo "Resuming application..."
# (start your application)
Without pausing writes: LSN will be approximate and may include some post-snapshot changes.
Comparison of Methods:
| Method | Accuracy | When to Use |
|---|---|---|
aurora_volume_logical_start_lsn() | Exact | Aurora PostgreSQL only |
| PostgreSQL log "invalid record" | Exact | RDS PostgreSQL (recommended) |
pg_current_wal_lsn() on Blue | Approximate | Fallback if logs unavailable |
Recommendation:
- Aurora: Use Method 1 (
aurora_volume_logical_start_lsn()) - RDS PostgreSQL: Use Method 2 (PostgreSQL log analysis)
- Fallback: Use Method 3 only if other methods unavailable
Step 2.3: Clean Up Replication Slots on Green (If Cloned)
If Green was cloned from Blue, it may have copies of Blue's replication slots. These must be dropped:
-- Connect to Green cluster
-- Check for any replication slots
SELECT slot_name, slot_type, active
FROM pg_replication_slots;
-- Drop any slots that were copied from Blue
-- (These are not needed on Green and will cause issues)
SELECT pg_drop_replication_slot('blue_green_slot_postgres');
SELECT pg_drop_replication_slot('blue_green_slot_testdb');
SELECT pg_drop_replication_slot('blue_green_slot_myapp');
-- Verify all slots are dropped
SELECT COUNT(*) FROM pg_replication_slots;
-- Expected: 0
Why This is Important:
- Cloned slots on Green are inactive and useless
- They can interfere with major version upgrades
- Green will create its own subscriptions (not slots)
Phase 3: Advance Replication Slots (Critical Step!)
Step 3.1: Advance Slots on Blue to Snapshot LSN
This is the most critical step - it prevents Green from replaying already-cloned data.
-- Connect to Blue cluster
-- Use the SNAPSHOT_LSN you recorded earlier
-- Database: postgres
\c postgres
SELECT pg_replication_slot_advance(
'blue_green_slot_postgres',
'0/5000000' -- Replace with your actual SNAPSHOT_LSN
);
-- Database: testdb
\c testdb
SELECT pg_replication_slot_advance(
'blue_green_slot_testdb',
'0/5000000' -- Replace with your actual SNAPSHOT_LSN
);
-- Database: myapp
\c myapp
SELECT pg_replication_slot_advance(
'blue_green_slot_myapp',
'0/5000000' -- Replace with your actual SNAPSHOT_LSN
);
Verify:
-- In each database on Blue
SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';
-- Expected: Both LSNs should match your SNAPSHOT_LSN
-- slot_name | restart_lsn | confirmed_flush_lsn
-- blue_green_slot_postgres | 0/5000000 | 0/5000000
-- blue_green_slot_testdb | 0/5000000 | 0/5000000
Why This Matters:
- Without this: Subscription replays ALL changes from slot creation → duplicate key errors
- With this: Subscription only replays changes AFTER snapshot → clean replication
Phase 4: Setup Green Cluster (Subscriber)
Step 4.1: Enable Logical Replication on Green
# Create parameter group for Green
aws rds create-db-cluster-parameter-group \
--db-cluster-parameter-group-name green-logical-replication \
--db-parameter-group-family aurora-postgresql17 \
--description "Green cluster with logical replication"
# Enable logical replication
aws rds modify-db-cluster-parameter-group \
--db-cluster-parameter-group-name green-logical-replication \
--parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"
# If not already applied during restore, apply now
aws rds modify-db-cluster \
--db-cluster-identifier green-cluster \
--db-cluster-parameter-group-name green-logical-replication \
--apply-immediately
# Reboot if needed
aws rds reboot-db-instance \
--db-instance-identifier green-instance-1
Step 4.2: Create Replication User on Blue (Optional but Recommended)
For better security, create a dedicated replication user instead of using postgres:
-- Connect to Blue cluster
CREATE USER replication_user WITH PASSWORD 'SecurePassword123!';
GRANT rds_replication TO replication_user;
-- Grant access to all databases
\c postgres
GRANT CONNECT ON DATABASE postgres TO replication_user;
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO replication_user;
\c testdb
GRANT CONNECT ON DATABASE testdb TO replication_user;
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO replication_user;
Step 4.3: Create Subscriptions on Green
For each database:
-- Connect to Green cluster
-- Get Blue endpoint
-- Blue endpoint: blue-cluster.cluster-xxxxx.region.rds.amazonaws.com
-- Database: postgres
\c postgres
CREATE SUBSCRIPTION blue_green_sub_postgres
CONNECTION 'host=blue-cluster.cluster-xxxxx.region.rds.amazonaws.com port=5432 dbname=postgres user=replication_user password=SecurePassword123! sslmode=require'
PUBLICATION blue_green_pub_postgres
WITH (
enabled = true,
create_slot = false,
slot_name = 'blue_green_slot_postgres',
copy_data = false
);
-- Database: testdb
\c testdb
CREATE SUBSCRIPTION blue_green_sub_testdb
CONNECTION 'host=blue-cluster.cluster-xxxxx.region.rds.amazonaws.com port=5432 dbname=testdb user=replication_user password=SecurePassword123! sslmode=require'
PUBLICATION blue_green_pub_testdb
WITH (
enabled = true,
create_slot = false,
slot_name = 'blue_green_slot_testdb',
copy_data = false
);
-- Database: myapp
\c myapp
CREATE SUBSCRIPTION blue_green_sub_myapp
CONNECTION 'host=blue-cluster.cluster-xxxxx.region.rds.amazonaws.com port=5432 dbname=myapp user=replication_user password=SecurePassword123! sslmode=require'
PUBLICATION blue_green_pub_myapp
WITH (
enabled = true,
create_slot = false,
slot_name = 'blue_green_slot_myapp',
copy_data = false
);
Critical Parameters:
create_slot = false: Slot already exists on Blue (don't create new one)slot_name: Must match the slot name on Bluecopy_data = false: Data already exists from snapshot (don't copy again)enabled = true: Start replication immediately
Naming Convention: blue_green_sub_<database_name>
Step 4.4: Verify Subscriptions Started
-- On Green, in each database
SELECT subname, subenabled, subconninfo
FROM pg_subscription
WHERE subname LIKE 'blue_green_sub_%';
-- Expected output:
-- subname | subenabled | subconninfo
-- blue_green_sub_postgres | t | host=blue-cluster...
-- blue_green_sub_testdb | t | host=blue-cluster...
-- Check apply worker status
SELECT subname, pid, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription
WHERE subname LIKE 'blue_green_sub_%';
-- Expected: pid should be non-NULL (worker running)
-- subname | pid | received_lsn | latest_end_lsn | last_msg_receipt_time
-- blue_green_sub_postgres | 1234 | 0/5000100 | 0/5000100 | 2026-05-02 08:30:00
Step 4.5: Verify Replication Active on Blue
-- On Blue, in each database
SELECT slot_name, active, active_pid, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';
-- Expected: active = t (subscriber connected)
-- slot_name | active | active_pid | confirmed_flush_lsn
-- blue_green_slot_postgres | t | 5678 | 0/5000100
-- blue_green_slot_testdb | t | 5679 | 0/5000100
Phase 5: Monitor and Verify Replication
Step 5.1: Test Data Replication
-- On Blue
\c testdb
INSERT INTO test_table (data) VALUES ('test_replication');
SELECT COUNT(*) FROM test_table;
-- Wait 5 seconds
-- On Green
\c testdb
SELECT COUNT(*) FROM test_table;
-- Should match Blue count
SELECT * FROM test_table WHERE data = 'test_replication';
-- Should see the new row
Step 5.2: Monitor Replication Lag
-- On Blue: Check how far behind Green is
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';
-- Expected: lag < 1 MB (typically a few KB)
-- slot_name | lag
-- blue_green_slot_postgres | 128 bytes
-- blue_green_slot_testdb | 256 bytes
Step 5.3: Monitor Subscription Health
-- On Green: Check subscription status
SELECT subname,
pid,
received_lsn,
latest_end_lsn,
last_msg_send_time,
last_msg_receipt_time,
latest_end_time
FROM pg_stat_subscription
WHERE subname LIKE 'blue_green_sub_%';
-- Check for errors
SELECT subname, last_error_message, last_error_time
FROM pg_stat_subscription_errors
WHERE subname LIKE 'blue_green_sub_%';
-- Should be empty (no errors)
Phase 6: Switchover (Promote Green to Primary)
When ready to switch from Blue to Green:
Step 6.1: Stop Application Writes to Blue
# Stop application or redirect traffic away from Blue
# Ensure no more writes are happening
Step 6.2: Wait for Green to Catch Up
-- On Blue: Check replication lag
SELECT slot_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';
-- Wait until lag_bytes = 0 for all slots
Step 6.3: Drop Subscriptions on Green
-- On Green, in each database
\c postgres
DROP SUBSCRIPTION blue_green_sub_postgres;
\c testdb
DROP SUBSCRIPTION blue_green_sub_testdb;
\c myapp
DROP SUBSCRIPTION blue_green_sub_myapp;
Effect: Green is now independent, no longer replicating from Blue.
Step 6.4: Drop Publications and Slots on Blue
-- On Blue, in each database
\c postgres
DROP PUBLICATION blue_green_pub_postgres;
SELECT pg_drop_replication_slot('blue_green_slot_postgres');
\c testdb
DROP PUBLICATION blue_green_pub_testdb;
SELECT pg_drop_replication_slot('blue_green_slot_testdb');
\c myapp
DROP PUBLICATION blue_green_pub_myapp;
SELECT pg_drop_replication_slot('blue_green_slot_myapp');
Step 6.5: Update Application to Point to Green
# Update application connection strings
# Old: blue-cluster.cluster-xxxxx.region.rds.amazonaws.com
# New: green-cluster.cluster-xxxxx.region.rds.amazonaws.com
# Or use DNS CNAME to switch
# Update CNAME: myapp-db.example.com → green-cluster.cluster-xxxxx...
Step 6.6: Verify Green is Primary
-- On Green
SELECT version();
-- Should show PostgreSQL 17.9 (if you upgraded)
-- Test writes
INSERT INTO test_table (data) VALUES ('green_is_primary');
-- Verify
SELECT * FROM test_table WHERE data = 'green_is_primary';
Phase 7: Cleanup (Optional)
Step 7.1: Keep Blue as Backup
You can keep Blue running for a few days as a backup before deleting.
Step 7.2: Delete Blue Cluster
# When confident Green is working well
aws rds delete-db-cluster \
--db-cluster-identifier blue-cluster \
--skip-final-snapshot
Automation Script
Complete Setup Script
#!/bin/bash
set -e
# Configuration
BLUE_CLUSTER="blue-cluster"
BLUE_ENDPOINT="blue-cluster.cluster-xxxxx.region.rds.amazonaws.com"
GREEN_CLUSTER="green-cluster"
REGION="ap-northeast-1"
DATABASES=("postgres" "testdb" "myapp")
echo "=== Phase 1: Setup Blue (Publisher) ==="
for DB in "${DATABASES[@]}"; do
echo "Setting up database: $DB"
# Create publication
psql -h $BLUE_ENDPOINT -U postgres -d $DB <<EOF
CREATE PUBLICATION blue_green_pub_$DB FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('blue_green_slot_$DB', 'pgoutput');
EOF
done
echo ""
echo "=== Phase 2: Take Snapshot ==="
SNAPSHOT_ID="blue-snapshot-$(date +%s)"
aws rds create-db-cluster-snapshot \
--db-cluster-identifier $BLUE_CLUSTER \
--db-cluster-snapshot-identifier $SNAPSHOT_ID \
--region $REGION
echo "Waiting for snapshot to complete..."
aws rds wait db-cluster-snapshot-available \
--db-cluster-snapshot-identifier $SNAPSHOT_ID \
--region $REGION
echo "Snapshot complete: $SNAPSHOT_ID"
echo ""
echo "=== Restore Green from snapshot ==="
echo "Run the following commands to restore Green:"
echo ""
echo "aws rds restore-db-cluster-from-snapshot \\"
echo " --db-cluster-identifier $GREEN_CLUSTER \\"
echo " --snapshot-identifier $SNAPSHOT_ID \\"
echo " --engine aurora-postgresql \\"
echo " --engine-version 17.9 \\"
echo " --region $REGION"
echo ""
echo "After Green is available, get the snapshot LSN:"
echo ""
# Get snapshot LSN from Green (Aurora-specific)
echo "SNAPSHOT_LSN=\$(psql -h green-endpoint -U postgres -d postgres -t -c \"SELECT aurora_volume_logical_start_lsn();\" | xargs)"
echo "echo \"Snapshot LSN: \$SNAPSHOT_LSN\""
echo ""
echo "Then continue with Phase 3..."
echo ""
read -p "Press Enter after Green is available and you have the SNAPSHOT_LSN..."
# Prompt for snapshot LSN
read -p "Enter the SNAPSHOT_LSN from Green: " SNAPSHOT_LSN
echo "Using SNAPSHOT_LSN: $SNAPSHOT_LSN"
echo ""
echo "=== Phase 3: Advance Slots ==="
for DB in "${DATABASES[@]}"; do
psql -h $BLUE_ENDPOINT -U postgres -d $DB <<EOF
SELECT pg_replication_slot_advance('blue_green_slot_$DB', '$SNAPSHOT_LSN');
EOF
done
echo ""
echo "=== Phase 4: Setup Green (Subscriber) ==="
echo "After Green is restored and available, run:"
echo ""
for DB in "${DATABASES[@]}"; do
cat <<EOF
psql -h green-endpoint -U postgres -d $DB <<SQL
CREATE SUBSCRIPTION blue_green_sub_$DB
CONNECTION 'host=$BLUE_ENDPOINT port=5432 dbname=$DB user=postgres password=xxx sslmode=require'
PUBLICATION blue_green_pub_$DB
WITH (enabled=true, create_slot=false, slot_name='blue_green_slot_$DB', copy_data=false);
SQL
EOF
done
echo "=== Setup Complete ==="
Monitoring Queries
Check Overall Replication Health
-- On Blue: Replication lag summary
SELECT
slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag,
confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%'
ORDER BY slot_name;
-- On Green: Subscription status summary
SELECT
subname,
pid IS NOT NULL AS worker_running,
received_lsn,
latest_end_lsn,
EXTRACT(EPOCH FROM (NOW() - last_msg_receipt_time)) AS seconds_since_last_msg
FROM pg_stat_subscription
WHERE subname LIKE 'blue_green_sub_%'
ORDER BY subname;
Troubleshooting
Issue: "Invalid Record Length" in PostgreSQL Logs
Symptoms: Seeing this in Green's PostgreSQL logs:
LOG: invalid record length at 0/1000FB78: expected at least 24, got 0
This is NOT an error! This is normal and expected.
Explanation:
- PostgreSQL replayed all available WAL sequentially
- When it reaches the end of available WAL, it tries to read the next record
- Finding no data (0 bytes) indicates the end of WAL - this is the snapshot boundary
- This LSN (
0/1000FB78) is the exact point where logical replication should start
What to do:
- Note the LSN from this message - you need it to advance slots on Blue
- Verify it's at the end of redo:
LOG: redo done at 0/1000FB40 (should appear right after) - Use this LSN when advancing replication slots on Blue
This message confirms:
- ✅ No gaps in WAL replay
- ✅ All data up to snapshot point is present
- ✅ Green is ready for logical replication
- ✅ Exact boundary is known
Issue: Subscription Not Starting
Symptoms: pid is NULL in pg_stat_subscription
Check:
-- On Green
SELECT subname, subenabled FROM pg_subscription WHERE subname LIKE 'blue_green_sub_%';
Solution:
-- Enable subscription
ALTER SUBSCRIPTION blue_green_sub_postgres ENABLE;
-- Check logs
SELECT * FROM pg_stat_subscription_errors WHERE subname LIKE 'blue_green_sub_%';
Issue: Duplicate Key Errors
Symptoms: Apply worker crashes with "duplicate key" errors
Cause: Slot was not advanced to snapshot LSN
Solution:
-- On Green: Drop subscription
DROP SUBSCRIPTION blue_green_sub_postgres;
-- On Blue: Advance slot to correct LSN
SELECT pg_replication_slot_advance('blue_green_slot_postgres', '<correct_snapshot_lsn>');
-- On Green: Recreate subscription
CREATE SUBSCRIPTION blue_green_sub_postgres ...;
Issue: Connection Refused
Symptoms: Subscription can't connect to Blue
Check:
- Security groups allow PostgreSQL (5432) from Green to Blue
- Blue endpoint is correct
- User credentials are correct
- SSL mode is compatible
Test:
# From Green instance
psql -h blue-endpoint -U replication_user -d postgres
Issue: High Replication Lag
Symptoms: Lag > 100 MB
Check:
-- On Blue
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';
Solutions:
- Scale up Green instance (more CPU/memory)
- Reduce write load on Blue temporarily
- Check network connectivity
- Check for long-running transactions on Green
Best Practices
- Always advance slots to snapshot LSN before creating subscriptions
- Test in non-production first
- Monitor replication lag continuously
- Use dedicated replication user for better security
- Document snapshot LSN for troubleshooting
- Keep Blue running for 24-48 hours after switchover
- Verify data consistency before deleting Blue
- Use SSL for replication connections
- Set up alerts for replication lag and subscription errors
- Plan rollback procedure before switchover
References
- PostgreSQL Logical Replication: https://www.postgresql.org/docs/current/logical-replication.html
- pg_replication_slot_advance: https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-REPLICATION
- Aurora PostgreSQL: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.AuroraPostgreSQL.html
- Aurora PostgreSQL Major Version Upgrade: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.MajorVersionUpgrade.html