Use Case: Manual Blue/Green setup for major version upgrades with logical replication
Tested: Aurora PostgreSQL 16.13 → 17.9
Overview
This guide shows how to manually replicate AWS RDS Blue/Green deployment functionality using PostgreSQL logical replication (publications and subscriptions). This is useful when:
- AWS Blue/Green deployments are not suitable for your setup due to limitations
- You need more control over the process
- You want to understand the underlying mechanism
- You need to customize the replication setup
Prerequisites
Blue Cluster (Source)
- Aurora PostgreSQL cluster (any version)
- Logical replication enabled:
rds.logical_replication = 1 - Sufficient WAL retention for replication lag
- Network access from Green cluster
Green Cluster (Target)
- Aurora PostgreSQL cluster (can be different version)
- Logical replication enabled:
rds.logical_replication = 1 - Same database structure as Blue (will be cloned)
- Network access to Blue cluster
Required Permissions
- Superuser or
rds_superuser role rds_replication role- Ability to create publications, slots, and subscriptions
Step-by-Step Setup
Phase 1: Prepare Blue Cluster (Publisher)
Step 1.1: Enable Logical Replication on Blue
aws rds create-db-cluster-parameter-group \
--db-cluster-parameter-group-name blue-logical-replication \
--db-parameter-group-family aurora-postgresql16 \
--description "Blue cluster with logical replication"
aws rds modify-db-cluster-parameter-group \
--db-cluster-parameter-group-name blue-logical-replication \
--parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"
aws rds modify-db-cluster \
--db-cluster-identifier blue-cluster \
--db-cluster-parameter-group-name blue-logical-replication \
--apply-immediately
aws rds reboot-db-instance \
--db-instance-identifier blue-instance-1
Verify:
SHOW wal_level;
SELECT name, setting
FROM pg_settings
WHERE name = 'rds.logical_replication';
Step 1.2: Create Publications on Blue
For each database you want to replicate:
\c postgres
CREATE PUBLICATION blue_green_pub_postgres FOR ALL TABLES;
\c testdb
CREATE PUBLICATION blue_green_pub_testdb FOR ALL TABLES;
\c myapp
CREATE PUBLICATION blue_green_pub_myapp FOR ALL TABLES;
Naming Convention: blue_green_pub_<database_name>
Verify:
SELECT pubname, puballtables
FROM pg_publication
WHERE pubname LIKE 'blue_green_pub_%';
Step 1.3: Create Replication Slots on Blue
For each database:
\c postgres
SELECT pg_create_logical_replication_slot(
'blue_green_slot_postgres',
'pgoutput'
);
\c testdb
SELECT pg_create_logical_replication_slot(
'blue_green_slot_testdb',
'pgoutput'
);
\c myapp
SELECT pg_create_logical_replication_slot(
'blue_green_slot_myapp',
'pgoutput'
);
Naming Convention: blue_green_slot_<database_name>
Verify:
SELECT slot_name, plugin, slot_type, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';
Note: active = f because no subscriber is connected yet.
Phase 2: Clone Green Cluster from Blue
Step 2.1: Clone Green from Blue (Aurora Fast Clone)
For Aurora PostgreSQL (Recommended - Fast Clone):
Aurora supports fast cloning using copy-on-write technology, which is much faster than snapshot restore:
aws rds restore-db-cluster-to-point-in-time \
--source-db-cluster-identifier blue-cluster \
--db-cluster-identifier green-cluster \
--restore-type copy-on-write \
--use-latest-restorable-time \
--db-cluster-parameter-group-name green-logical-replication \
--vpc-security-group-ids sg-xxxxx
aws rds wait db-cluster-available \
--db-cluster-identifier green-cluster
aws rds create-db-instance \
--db-instance-identifier green-instance-1 \
--db-cluster-identifier green-cluster \
--db-instance-class db.r6g.large \
--engine aurora-postgresql
aws rds wait db-instance-available \
--db-instance-identifier green-instance-1
aws rds create-db-instance \
--db-instance-identifier green-instance-2 \
--db-cluster-identifier green-cluster \
--db-instance-class db.r6g.large \
--engine aurora-postgresql
Key Parameters:
--restore-type copy-on-write: Creates a fast clone (not a full copy)--use-latest-restorable-time: Uses the latest data from Blue- Clone is created in seconds (not minutes like snapshot restore)
For RDS PostgreSQL (Alternative - Snapshot Restore):
RDS PostgreSQL doesn't support fast cloning, so use snapshot restore:
aws rds create-db-snapshot \
--db-instance-identifier blue-instance \
--db-snapshot-identifier blue-snapshot-for-green
aws rds wait db-snapshot-available \
--db-snapshot-identifier blue-snapshot-for-green
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier green-instance \
--db-snapshot-identifier blue-snapshot-for-green \
--db-instance-class db.r6g.large \
--db-parameter-group-name green-logical-replication \
--vpc-security-group-ids sg-xxxxx
aws rds wait db-instance-available \
--db-instance-identifier green-instance
Result: Green now has all data from Blue at the clone/snapshot time.
Step 2.2: Get Snapshot LSN
Critical: You need to know the LSN at snapshot time to advance slots correctly.
Method 1: For Aurora PostgreSQL (Recommended for Aurora):
Aurora provides a special function to get the exact LSN where the clone/snapshot was taken:
SELECT aurora_volume_logical_start_lsn();
Save this LSN value - you'll use it to advance replication slots on Blue.
SNAPSHOT_LSN=$(psql -h green-endpoint -U postgres -d postgres -t -c "SELECT aurora_volume_logical_start_lsn();" | xargs)
echo "Snapshot LSN: $SNAPSHOT_LSN"
Method 2: For RDS PostgreSQL (Recommended for RDS):
RDS PostgreSQL doesn't have aurora_volume_logical_start_lsn(), but you can get the exact LSN from the PostgreSQL error log during Green's first startup.
Step 2.2a: Check Green's PostgreSQL Log
After Green is restored and starts up, check the PostgreSQL log for the "invalid record" message:
aws rds download-db-log-file-portion \
--db-instance-identifier green-instance-1 \
--log-file-name error/postgresql.log.2026-05-02-10 \
--region ap-northeast-1 \
--output text | grep -A2 -B2 "invalid record"
Look for this pattern:
LOG: redo starts at 0/100000A0
LOG: invalid record length at 0/1000FB78: expected at least 24, got 0
DEBUG: Set Logical Seed LSN at 0/1000FB78
LOG: redo done at 0/1000FB40
The LSN you need is: 0/1000FB78 (from the "invalid record length" line)
What this means:
- PostgreSQL replayed all WAL up to
0/1000FB40 (last valid record) - Tried to read next record at
0/1000FB78 but found no data (end of WAL) - This
0/1000FB78 is the exact boundary where the snapshot was taken - This is the LSN you'll use to advance replication slots on Blue
Why "invalid record" is GOOD:
- It's NOT an error - it's PostgreSQL's way of marking the end of available WAL
- It ensures no gaps - PostgreSQL validated every record up to this point
- This is the precise LSN where logical replication should start
SNAPSHOT_LSN=$(aws rds download-db-log-file-portion \
--db-instance-identifier green-instance-1 \
--log-file-name error/postgresql.log.2026-05-02-10 \
--region ap-northeast-1 \
--output text | \
grep "invalid record length" | \
grep -oE '0/[0-9A-F]+' | \
head -1)
echo "Snapshot LSN: $SNAPSHOT_LSN"
Method 3: For Standard RDS PostgreSQL (Alternative - Less Accurate):
If you can't access logs, query Blue's LSN immediately after snapshot completes:
SNAPSHOT_LSN=$(psql -h blue-endpoint -U postgres -d postgres -t -c "SELECT pg_current_wal_lsn();" | xargs)
echo "Snapshot LSN: $SNAPSHOT_LSN"
Note: This is less accurate because Blue continues writing after the snapshot is taken.
When this method is SAFE:
- If you can pause application writes to the database during snapshot
- Stop application → Take snapshot → Get LSN → Resume application
- This ensures the LSN matches the snapshot exactly
Procedure for safe Method 3:
echo "Stopping application..."
aws rds create-db-snapshot \
--db-instance-identifier blue-instance \
--db-snapshot-identifier blue-snapshot-$(date +%s)
SNAPSHOT_LSN=$(psql -h blue-endpoint -U postgres -d postgres -t -c "SELECT pg_current_wal_lsn();" | xargs)
echo "Snapshot LSN: $SNAPSHOT_LSN"
echo "Resuming application..."
Without pausing writes: LSN will be approximate and may include some post-snapshot changes.
Comparison of Methods:
| Method | Accuracy | When to Use |
|---|
aurora_volume_logical_start_lsn() | Exact | Aurora PostgreSQL only |
| PostgreSQL log "invalid record" | Exact | RDS PostgreSQL (recommended) |
pg_current_wal_lsn() on Blue | Approximate | Fallback if logs unavailable |
Recommendation:
- Aurora: Use Method 1 (
aurora_volume_logical_start_lsn()) - RDS PostgreSQL: Use Method 2 (PostgreSQL log analysis)
- Fallback: Use Method 3 only if other methods unavailable
Step 2.3: Clean Up Replication Slots on Green (If Cloned)
If Green was cloned from Blue, it may have copies of Blue's replication slots. These must be dropped:
SELECT slot_name, slot_type, active
FROM pg_replication_slots;
SELECT pg_drop_replication_slot('blue_green_slot_postgres');
SELECT pg_drop_replication_slot('blue_green_slot_testdb');
SELECT pg_drop_replication_slot('blue_green_slot_myapp');
SELECT COUNT(*) FROM pg_replication_slots;
Why This is Important:
- Cloned slots on Green are inactive and useless
- They can interfere with major version upgrades
- Green will create its own subscriptions (not slots)
Phase 3: Advance Replication Slots (Critical Step!)
Step 3.1: Advance Slots on Blue to Snapshot LSN
This is the most critical step - it prevents Green from replaying already-cloned data.
\c postgres
SELECT pg_replication_slot_advance(
'blue_green_slot_postgres',
'0/5000000'
);
\c testdb
SELECT pg_replication_slot_advance(
'blue_green_slot_testdb',
'0/5000000'
);
\c myapp
SELECT pg_replication_slot_advance(
'blue_green_slot_myapp',
'0/5000000'
);
Verify:
SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';
Why This Matters:
- Without this: Subscription replays ALL changes from slot creation → duplicate key errors
- With this: Subscription only replays changes AFTER snapshot → clean replication
Phase 4: Setup Green Cluster (Subscriber)
Step 4.1: Enable Logical Replication on Green
aws rds create-db-cluster-parameter-group \
--db-cluster-parameter-group-name green-logical-replication \
--db-parameter-group-family aurora-postgresql17 \
--description "Green cluster with logical replication"
aws rds modify-db-cluster-parameter-group \
--db-cluster-parameter-group-name green-logical-replication \
--parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"
aws rds modify-db-cluster \
--db-cluster-identifier green-cluster \
--db-cluster-parameter-group-name green-logical-replication \
--apply-immediately
aws rds reboot-db-instance \
--db-instance-identifier green-instance-1
Step 4.2: Create Replication User on Blue (Optional but Recommended)
For better security, create a dedicated replication user instead of using postgres:
CREATE USER replication_user WITH PASSWORD 'SecurePassword123!';
GRANT rds_replication TO replication_user;
\c postgres
GRANT CONNECT ON DATABASE postgres TO replication_user;
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO replication_user;
\c testdb
GRANT CONNECT ON DATABASE testdb TO replication_user;
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO replication_user;
Step 4.3: Create Subscriptions on Green
For each database:
\c postgres
CREATE SUBSCRIPTION blue_green_sub_postgres
CONNECTION 'host=blue-cluster.cluster-xxxxx.region.rds.amazonaws.com port=5432 dbname=postgres user=replication_user password=SecurePassword123! sslmode=require'
PUBLICATION blue_green_pub_postgres
WITH (
enabled = true,
create_slot = false,
slot_name = 'blue_green_slot_postgres',
copy_data = false
);
\c testdb
CREATE SUBSCRIPTION blue_green_sub_testdb
CONNECTION 'host=blue-cluster.cluster-xxxxx.region.rds.amazonaws.com port=5432 dbname=testdb user=replication_user password=SecurePassword123! sslmode=require'
PUBLICATION blue_green_pub_testdb
WITH (
enabled = true,
create_slot = false,
slot_name = 'blue_green_slot_testdb',
copy_data = false
);
\c myapp
CREATE SUBSCRIPTION blue_green_sub_myapp
CONNECTION 'host=blue-cluster.cluster-xxxxx.region.rds.amazonaws.com port=5432 dbname=myapp user=replication_user password=SecurePassword123! sslmode=require'
PUBLICATION blue_green_pub_myapp
WITH (
enabled = true,
create_slot = false,
slot_name = 'blue_green_slot_myapp',
copy_data = false
);
Critical Parameters:
create_slot = false: Slot already exists on Blue (don't create new one)slot_name: Must match the slot name on Bluecopy_data = false: Data already exists from snapshot (don't copy again)enabled = true: Start replication immediately
Naming Convention: blue_green_sub_<database_name>
Step 4.4: Verify Subscriptions Started
SELECT subname, subenabled, subconninfo
FROM pg_subscription
WHERE subname LIKE 'blue_green_sub_%';
SELECT subname, pid, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription
WHERE subname LIKE 'blue_green_sub_%';
Step 4.5: Verify Replication Active on Blue
SELECT slot_name, active, active_pid, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';
Phase 5: Monitor and Verify Replication
Step 5.1: Test Data Replication
\c testdb
INSERT INTO test_table (data) VALUES ('test_replication');
SELECT COUNT(*) FROM test_table;
\c testdb
SELECT COUNT(*) FROM test_table;
SELECT * FROM test_table WHERE data = 'test_replication';
Step 5.2: Monitor Replication Lag
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';
Step 5.3: Monitor Subscription Health
SELECT subname,
pid,
received_lsn,
latest_end_lsn,
last_msg_send_time,
last_msg_receipt_time,
latest_end_time
FROM pg_stat_subscription
WHERE subname LIKE 'blue_green_sub_%';
SELECT subname, last_error_message, last_error_time
FROM pg_stat_subscription_errors
WHERE subname LIKE 'blue_green_sub_%';
When ready to switch from Blue to Green:
Step 6.1: Stop Application Writes to Blue
Step 6.2: Wait for Green to Catch Up
SELECT slot_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';
Step 6.3: Drop Subscriptions on Green
\c postgres
DROP SUBSCRIPTION blue_green_sub_postgres;
\c testdb
DROP SUBSCRIPTION blue_green_sub_testdb;
\c myapp
DROP SUBSCRIPTION blue_green_sub_myapp;
Effect: Green is now independent, no longer replicating from Blue.
Step 6.4: Drop Publications and Slots on Blue
\c postgres
DROP PUBLICATION blue_green_pub_postgres;
SELECT pg_drop_replication_slot('blue_green_slot_postgres');
\c testdb
DROP PUBLICATION blue_green_pub_testdb;
SELECT pg_drop_replication_slot('blue_green_slot_testdb');
\c myapp
DROP PUBLICATION blue_green_pub_myapp;
SELECT pg_drop_replication_slot('blue_green_slot_myapp');
Step 6.5: Update Application to Point to Green
Step 6.6: Verify Green is Primary
SELECT version();
INSERT INTO test_table (data) VALUES ('green_is_primary');
SELECT * FROM test_table WHERE data = 'green_is_primary';
Phase 7: Cleanup (Optional)
Step 7.1: Keep Blue as Backup
You can keep Blue running for a few days as a backup before deleting.
Step 7.2: Delete Blue Cluster
aws rds delete-db-cluster \
--db-cluster-identifier blue-cluster \
--skip-final-snapshot
Automation Script
Complete Setup Script
#!/bin/bash
set -e
BLUE_CLUSTER="blue-cluster"
BLUE_ENDPOINT="blue-cluster.cluster-xxxxx.region.rds.amazonaws.com"
GREEN_CLUSTER="green-cluster"
REGION="ap-northeast-1"
DATABASES=("postgres" "testdb" "myapp")
echo "=== Phase 1: Setup Blue (Publisher) ==="
for DB in "${DATABASES[@]}"; do
echo "Setting up database: $DB"
psql -h $BLUE_ENDPOINT -U postgres -d $DB <<EOF
CREATE PUBLICATION blue_green_pub_$DB FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('blue_green_slot_$DB', 'pgoutput');
EOF
done
echo ""
echo "=== Phase 2: Take Snapshot ==="
SNAPSHOT_ID="blue-snapshot-$(date +%s)"
aws rds create-db-cluster-snapshot \
--db-cluster-identifier $BLUE_CLUSTER \
--db-cluster-snapshot-identifier $SNAPSHOT_ID \
--region $REGION
echo "Waiting for snapshot to complete..."
aws rds wait db-cluster-snapshot-available \
--db-cluster-snapshot-identifier $SNAPSHOT_ID \
--region $REGION
echo "Snapshot complete: $SNAPSHOT_ID"
echo ""
echo "=== Restore Green from snapshot ==="
echo "Run the following commands to restore Green:"
echo ""
echo "aws rds restore-db-cluster-from-snapshot \\"
echo " --db-cluster-identifier $GREEN_CLUSTER \\"
echo " --snapshot-identifier $SNAPSHOT_ID \\"
echo " --engine aurora-postgresql \\"
echo " --engine-version 17.9 \\"
echo " --region $REGION"
echo ""
echo "After Green is available, get the snapshot LSN:"
echo ""
echo "SNAPSHOT_LSN=\$(psql -h green-endpoint -U postgres -d postgres -t -c \"SELECT aurora_volume_logical_start_lsn();\" | xargs)"
echo "echo \"Snapshot LSN: \$SNAPSHOT_LSN\""
echo ""
echo "Then continue with Phase 3..."
echo ""
read -p "Press Enter after Green is available and you have the SNAPSHOT_LSN..."
read -p "Enter the SNAPSHOT_LSN from Green: " SNAPSHOT_LSN
echo "Using SNAPSHOT_LSN: $SNAPSHOT_LSN"
echo ""
echo "=== Phase 3: Advance Slots ==="
for DB in "${DATABASES[@]}"; do
psql -h $BLUE_ENDPOINT -U postgres -d $DB <<EOF
SELECT pg_replication_slot_advance('blue_green_slot_$DB', '$SNAPSHOT_LSN');
EOF
done
echo ""
echo "=== Phase 4: Setup Green (Subscriber) ==="
echo "After Green is restored and available, run:"
echo ""
for DB in "${DATABASES[@]}"; do
cat <<EOF
psql -h green-endpoint -U postgres -d $DB <<SQL
CREATE SUBSCRIPTION blue_green_sub_$DB
CONNECTION 'host=$BLUE_ENDPOINT port=5432 dbname=$DB user=postgres password=xxx sslmode=require'
PUBLICATION blue_green_pub_$DB
WITH (enabled=true, create_slot=false, slot_name='blue_green_slot_$DB', copy_data=false);
SQL
EOF
done
echo "=== Setup Complete ==="
Monitoring Queries
Check Overall Replication Health
SELECT
slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag,
confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%'
ORDER BY slot_name;
SELECT
subname,
pid IS NOT NULL AS worker_running,
received_lsn,
latest_end_lsn,
EXTRACT(EPOCH FROM (NOW() - last_msg_receipt_time)) AS seconds_since_last_msg
FROM pg_stat_subscription
WHERE subname LIKE 'blue_green_sub_%'
ORDER BY subname;
Troubleshooting
Issue: "Invalid Record Length" in PostgreSQL Logs
Symptoms: Seeing this in Green's PostgreSQL logs:
LOG: invalid record length at 0/1000FB78: expected at least 24, got 0
This is NOT an error! This is normal and expected.
Explanation:
- PostgreSQL replayed all available WAL sequentially
- When it reaches the end of available WAL, it tries to read the next record
- Finding no data (0 bytes) indicates the end of WAL - this is the snapshot boundary
- This LSN (
0/1000FB78) is the exact point where logical replication should start
What to do:
- Note the LSN from this message - you need it to advance slots on Blue
- Verify it's at the end of redo:
LOG: redo done at 0/1000FB40 (should appear right after)
- Use this LSN when advancing replication slots on Blue
This message confirms:
- ✅ No gaps in WAL replay
- ✅ All data up to snapshot point is present
- ✅ Green is ready for logical replication
- ✅ Exact boundary is known
Issue: Subscription Not Starting
Symptoms: pid is NULL in pg_stat_subscription
Check:
SELECT subname, subenabled FROM pg_subscription WHERE subname LIKE 'blue_green_sub_%';
Solution:
ALTER SUBSCRIPTION blue_green_sub_postgres ENABLE;
SELECT * FROM pg_stat_subscription_errors WHERE subname LIKE 'blue_green_sub_%';
Issue: Duplicate Key Errors
Symptoms: Apply worker crashes with "duplicate key" errors
Cause: Slot was not advanced to snapshot LSN
Solution:
DROP SUBSCRIPTION blue_green_sub_postgres;
SELECT pg_replication_slot_advance('blue_green_slot_postgres', '<correct_snapshot_lsn>');
CREATE SUBSCRIPTION blue_green_sub_postgres ...;
Issue: Connection Refused
Symptoms: Subscription can't connect to Blue
Check:
- Security groups allow PostgreSQL (5432) from Green to Blue
- Blue endpoint is correct
- User credentials are correct
- SSL mode is compatible
Test:
psql -h blue-endpoint -U replication_user -d postgres
Issue: High Replication Lag
Symptoms: Lag > 100 MB
Check:
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';
Solutions:
- Scale up Green instance (more CPU/memory)
- Reduce write load on Blue temporarily
- Check network connectivity
- Check for long-running transactions on Green
Best Practices
- Always advance slots to snapshot LSN before creating subscriptions
- Test in non-production first
- Monitor replication lag continuously
- Use dedicated replication user for better security
- Document snapshot LSN for troubleshooting
- Keep Blue running for 24-48 hours after switchover
- Verify data consistency before deleting Blue
- Use SSL for replication connections
- Set up alerts for replication lag and subscription errors
- Plan rollback procedure before switchover
References