Saturday, May 2, 2026

Self-Managed Blue/Green Deployment for Aurora PostgreSQL

Use Case: Manual Blue/Green setup for major version upgrades with logical replication

Tested: Aurora PostgreSQL 16.13 → 17.9


Overview

This guide shows how to manually replicate AWS RDS Blue/Green deployment functionality using PostgreSQL logical replication (publications and subscriptions). This is useful when:

  • AWS Blue/Green deployments are not suitable for your setup due to limitations
  • You need more control over the process
  • You want to understand the underlying mechanism
  • You need to customize the replication setup

Prerequisites

Blue Cluster (Source)

  • Aurora PostgreSQL cluster (any version)
  • Logical replication enabled: rds.logical_replication = 1
  • Sufficient WAL retention for replication lag
  • Network access from Green cluster

Green Cluster (Target)

  • Aurora PostgreSQL cluster (can be different version)
  • Logical replication enabled: rds.logical_replication = 1
  • Same database structure as Blue (will be cloned)
  • Network access to Blue cluster

Required Permissions

  • Superuser or rds_superuser role
  • rds_replication role
  • Ability to create publications, slots, and subscriptions

Step-by-Step Setup

Phase 1: Prepare Blue Cluster (Publisher)

Step 1.1: Enable Logical Replication on Blue

# Create parameter group with logical replication
aws rds create-db-cluster-parameter-group \
  --db-cluster-parameter-group-name blue-logical-replication \
  --db-parameter-group-family aurora-postgresql16 \
  --description "Blue cluster with logical replication"

# Enable logical replication
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name blue-logical-replication \
  --parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"

# Apply to cluster
aws rds modify-db-cluster \
  --db-cluster-identifier blue-cluster \
  --db-cluster-parameter-group-name blue-logical-replication \
  --apply-immediately

# Reboot instance to apply
aws rds reboot-db-instance \
  --db-instance-identifier blue-instance-1

Verify:

-- Connect to Blue
SHOW wal_level;
-- Expected: logical

SELECT name, setting 
FROM pg_settings 
WHERE name = 'rds.logical_replication';
-- Expected: on

Step 1.2: Create Publications on Blue

For each database you want to replicate:

-- Connect to Blue cluster
-- Switch to each database

-- Database: postgres
\c postgres
CREATE PUBLICATION blue_green_pub_postgres FOR ALL TABLES;

-- Database: testdb
\c testdb
CREATE PUBLICATION blue_green_pub_testdb FOR ALL TABLES;

-- Database: myapp
\c myapp
CREATE PUBLICATION blue_green_pub_myapp FOR ALL TABLES;

Naming Convention: blue_green_pub_<database_name>

Verify:

-- In each database
SELECT pubname, puballtables 
FROM pg_publication 
WHERE pubname LIKE 'blue_green_pub_%';

-- Expected output:
-- pubname                    | puballtables
-- blue_green_pub_postgres    | t
-- blue_green_pub_testdb      | t

Step 1.3: Create Replication Slots on Blue

For each database:

-- Connect to Blue cluster
-- Switch to each database

-- Database: postgres
\c postgres
SELECT pg_create_logical_replication_slot(
  'blue_green_slot_postgres',
  'pgoutput'
);

-- Database: testdb
\c testdb
SELECT pg_create_logical_replication_slot(
  'blue_green_slot_testdb',
  'pgoutput'
);

-- Database: myapp
\c myapp
SELECT pg_create_logical_replication_slot(
  'blue_green_slot_myapp',
  'pgoutput'
);

Naming Convention: blue_green_slot_<database_name>

Verify:

-- In each database
SELECT slot_name, plugin, slot_type, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';

-- Expected output:
-- slot_name                | plugin   | slot_type | active | restart_lsn
-- blue_green_slot_postgres | pgoutput | logical   | f      | 0/1234567
-- blue_green_slot_testdb   | pgoutput | logical   | f      | 0/1234567

Note: active = f because no subscriber is connected yet.


Phase 2: Clone Green Cluster from Blue

Step 2.1: Clone Green from Blue (Aurora Fast Clone)

For Aurora PostgreSQL (Recommended - Fast Clone):

Aurora supports fast cloning using copy-on-write technology, which is much faster than snapshot restore:

# Clone Green cluster from Blue (copy-on-write)
aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier blue-cluster \
  --db-cluster-identifier green-cluster \
  --restore-type copy-on-write \
  --use-latest-restorable-time \
  --db-cluster-parameter-group-name green-logical-replication \
  --vpc-security-group-ids sg-xxxxx

# Wait for clone cluster to be available
aws rds wait db-cluster-available \
  --db-cluster-identifier green-cluster

# Create writer instance in Green cluster
aws rds create-db-instance \
  --db-instance-identifier green-instance-1 \
  --db-cluster-identifier green-cluster \
  --db-instance-class db.r6g.large \
  --engine aurora-postgresql

# Wait for instance to be available
aws rds wait db-instance-available \
  --db-instance-identifier green-instance-1

# (Optional) Create reader instance for high availability
aws rds create-db-instance \
  --db-instance-identifier green-instance-2 \
  --db-cluster-identifier green-cluster \
  --db-instance-class db.r6g.large \
  --engine aurora-postgresql

Key Parameters:

  • --restore-type copy-on-write: Creates a fast clone (not a full copy)
  • --use-latest-restorable-time: Uses the latest data from Blue
  • Clone is created in seconds (not minutes like snapshot restore)

For RDS PostgreSQL (Alternative - Snapshot Restore):

RDS PostgreSQL doesn't support fast cloning, so use snapshot restore:

# Create manual snapshot
aws rds create-db-snapshot \
  --db-instance-identifier blue-instance \
  --db-snapshot-identifier blue-snapshot-for-green

# Wait for snapshot to complete
aws rds wait db-snapshot-available \
  --db-snapshot-identifier blue-snapshot-for-green

# Restore Green from snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier green-instance \
  --db-snapshot-identifier blue-snapshot-for-green \
  --db-instance-class db.r6g.large \
  --db-parameter-group-name green-logical-replication \
  --vpc-security-group-ids sg-xxxxx

# Wait for Green to be available
aws rds wait db-instance-available \
  --db-instance-identifier green-instance

Result: Green now has all data from Blue at the clone/snapshot time.

Step 2.2: Get Snapshot LSN

Critical: You need to know the LSN at snapshot time to advance slots correctly.


Method 1: For Aurora PostgreSQL (Recommended for Aurora):

Aurora provides a special function to get the exact LSN where the clone/snapshot was taken:

-- Connect to Green cluster (after it's available)
SELECT aurora_volume_logical_start_lsn();

-- Example output:
-- aurora_volume_logical_start_lsn 
-- -------------------------------
-- 0/402E2F0
-- (1 row)

Save this LSN value - you'll use it to advance replication slots on Blue.

# Get snapshot LSN from Green
SNAPSHOT_LSN=$(psql -h green-endpoint -U postgres -d postgres -t -c "SELECT aurora_volume_logical_start_lsn();" | xargs)
echo "Snapshot LSN: $SNAPSHOT_LSN"
# Example: 0/402E2F0

Method 2: For RDS PostgreSQL (Recommended for RDS):

RDS PostgreSQL doesn't have aurora_volume_logical_start_lsn(), but you can get the exact LSN from the PostgreSQL error log during Green's first startup.

Step 2.2a: Check Green's PostgreSQL Log

After Green is restored and starts up, check the PostgreSQL log for the "invalid record" message:

# View recent PostgreSQL logs
aws rds download-db-log-file-portion \
  --db-instance-identifier green-instance-1 \
  --log-file-name error/postgresql.log.2026-05-02-10 \
  --region ap-northeast-1 \
  --output text | grep -A2 -B2 "invalid record"

Look for this pattern:

LOG: redo starts at 0/100000A0
LOG: invalid record length at 0/1000FB78: expected at least 24, got 0
DEBUG: Set Logical Seed LSN at 0/1000FB78
LOG: redo done at 0/1000FB40

The LSN you need is 0/1000FB78 (from the "invalid record length" line).

What this means:

  • PostgreSQL replayed all WAL up to 0/1000FB40 (last valid record)
  • Tried to read next record at 0/1000FB78 but found no data (end of WAL)
  • This 0/1000FB78 is the exact boundary where the snapshot was taken
  • This is the LSN you'll use to advance replication slots on Blue

Why "invalid record" is GOOD:

  • It's NOT an error - it's PostgreSQL's way of marking the end of available WAL
  • It ensures no gaps - PostgreSQL validated every record up to this point
  • This is the precise LSN where logical replication should start

# Extract the LSN from logs
SNAPSHOT_LSN=$(aws rds download-db-log-file-portion \
  --db-instance-identifier green-instance-1 \
  --log-file-name error/postgresql.log.2026-05-02-10 \
  --region ap-northeast-1 \
  --output text | \
  grep "invalid record length" | \
  grep -oE '[0-9A-Fa-f]+/[0-9A-Fa-f]+' | \
  head -1)

echo "Snapshot LSN: $SNAPSHOT_LSN"
# Example: 0/1000FB78

Method 3: For Standard RDS PostgreSQL (Alternative - Less Accurate):

If you can't access logs, query Blue's LSN immediately after snapshot completes:

# After snapshot completes, immediately run on Blue:
SNAPSHOT_LSN=$(psql -h blue-endpoint -U postgres -d postgres -t -c "SELECT pg_current_wal_lsn();" | xargs)
echo "Snapshot LSN: $SNAPSHOT_LSN"

Note: This is less accurate because Blue continues writing after the snapshot is taken.

When this method is SAFE:

  • If you can pause application writes to the database during snapshot
  • Stop application → Take snapshot → Get LSN → Resume application
  • This ensures the LSN matches the snapshot exactly

Procedure for safe Method 3:

# 1. Stop application writes
echo "Stopping application..."
# (stop your application or set database to read-only)

# 2. Take snapshot
aws rds create-db-snapshot \
  --db-instance-identifier blue-instance \
  --db-snapshot-identifier blue-snapshot-$(date +%s)

# 3. Immediately get LSN (while writes are still paused)
SNAPSHOT_LSN=$(psql -h blue-endpoint -U postgres -d postgres -t -c "SELECT pg_current_wal_lsn();" | xargs)
echo "Snapshot LSN: $SNAPSHOT_LSN"

# 4. Resume application writes
echo "Resuming application..."
# (start your application)

Without pausing writes: LSN will be approximate and may include some post-snapshot changes.


Comparison of Methods:

Method                            | Accuracy    | When to Use
aurora_volume_logical_start_lsn() | Exact       | Aurora PostgreSQL only
PostgreSQL log "invalid record"   | Exact       | RDS PostgreSQL (recommended)
pg_current_wal_lsn() on Blue      | Approximate | Fallback if logs unavailable

Recommendation:

  • Aurora: Use Method 1 (aurora_volume_logical_start_lsn())
  • RDS PostgreSQL: Use Method 2 (PostgreSQL log analysis)
  • Fallback: Use Method 3 only if other methods unavailable

Step 2.3: Clean Up Replication Slots on Green (If Cloned)

If Green was cloned from Blue, it may have copies of Blue's replication slots. These must be dropped:

-- Connect to Green cluster
-- Check for any replication slots
SELECT slot_name, slot_type, active 
FROM pg_replication_slots;

-- Drop any slots that were copied from Blue
-- (These are not needed on Green and will cause issues)
SELECT pg_drop_replication_slot('blue_green_slot_postgres');
SELECT pg_drop_replication_slot('blue_green_slot_testdb');
SELECT pg_drop_replication_slot('blue_green_slot_myapp');

-- Verify all slots are dropped
SELECT COUNT(*) FROM pg_replication_slots;
-- Expected: 0

Why This is Important:

  • Cloned slots on Green are inactive and useless
  • They can interfere with major version upgrades
  • Green will create its own subscriptions (not slots)

Phase 3: Advance Replication Slots (Critical Step!)

Step 3.1: Advance Slots on Blue to Snapshot LSN

This is the most critical step - it prevents Green from replaying already-cloned data.

-- Connect to Blue cluster
-- Use the SNAPSHOT_LSN you recorded earlier

-- Database: postgres
\c postgres
SELECT pg_replication_slot_advance(
  'blue_green_slot_postgres',
  '0/5000000'  -- Replace with your actual SNAPSHOT_LSN
);

-- Database: testdb
\c testdb
SELECT pg_replication_slot_advance(
  'blue_green_slot_testdb',
  '0/5000000'  -- Replace with your actual SNAPSHOT_LSN
);

-- Database: myapp
\c myapp
SELECT pg_replication_slot_advance(
  'blue_green_slot_myapp',
  '0/5000000'  -- Replace with your actual SNAPSHOT_LSN
);

Verify:

-- In each database on Blue
SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';

-- Expected: confirmed_flush_lsn should equal your SNAPSHOT_LSN
-- (restart_lsn may trail slightly behind it)
-- slot_name                | restart_lsn | confirmed_flush_lsn
-- blue_green_slot_postgres | 0/5000000   | 0/5000000
-- blue_green_slot_testdb   | 0/5000000   | 0/5000000

Why This Matters:

  • Without this: Subscription replays ALL changes from slot creation → duplicate key errors
  • With this: Subscription only replays changes AFTER snapshot → clean replication
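Before running the advances, it can help to sanity-check the recorded value, since a mangled LSN would silently misposition the slot. A minimal bash sketch (validate_lsn and advance_sql are hypothetical helper names; the database list and slot naming follow the convention above):

```shell
#!/bin/bash
# Validate the recorded LSN before advancing slots, so a bad value
# fails fast instead of failing inside psql.

validate_lsn() {
  # A PostgreSQL LSN is two hex numbers separated by a slash, e.g. 0/5000000
  [[ "$1" =~ ^[0-9A-Fa-f]+/[0-9A-Fa-f]+$ ]]
}

advance_sql() {
  # Print the advance statement for one database's slot
  local db="$1" lsn="$2"
  echo "SELECT pg_replication_slot_advance('blue_green_slot_${db}', '${lsn}');"
}

SNAPSHOT_LSN="${SNAPSHOT_LSN:-0/5000000}"  # placeholder; use your recorded LSN

if ! validate_lsn "$SNAPSHOT_LSN"; then
  echo "Invalid LSN: $SNAPSHOT_LSN" >&2
  exit 1
fi

for DB in postgres testdb myapp; do
  advance_sql "$DB" "$SNAPSHOT_LSN"
  # In a real run, pipe to psql against the Blue endpoint:
  # advance_sql "$DB" "$SNAPSHOT_LSN" | psql -h blue-endpoint -U postgres -d "$DB"
done
```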

Phase 4: Setup Green Cluster (Subscriber)

Step 4.1: Enable Logical Replication on Green

# Create parameter group for Green
aws rds create-db-cluster-parameter-group \
  --db-cluster-parameter-group-name green-logical-replication \
  --db-parameter-group-family aurora-postgresql17 \
  --description "Green cluster with logical replication"

# Enable logical replication
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name green-logical-replication \
  --parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"

# If not already applied during restore, apply now
aws rds modify-db-cluster \
  --db-cluster-identifier green-cluster \
  --db-cluster-parameter-group-name green-logical-replication \
  --apply-immediately

# Reboot if needed
aws rds reboot-db-instance \
  --db-instance-identifier green-instance-1

Step 4.2: Create a Dedicated Replication User on Blue

For better security, create a dedicated replication user instead of using postgres:

-- Connect to Blue cluster
CREATE USER replication_user WITH PASSWORD 'SecurePassword123!';
GRANT rds_replication TO replication_user;

-- Grant access to all databases
\c postgres
GRANT CONNECT ON DATABASE postgres TO replication_user;
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO replication_user;

\c testdb
GRANT CONNECT ON DATABASE testdb TO replication_user;
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO replication_user;

Step 4.3: Create Subscriptions on Green

For each database:

-- Connect to Green cluster
-- Get Blue endpoint
-- Blue endpoint: blue-cluster.cluster-xxxxx.region.rds.amazonaws.com

-- Database: postgres
\c postgres
CREATE SUBSCRIPTION blue_green_sub_postgres
  CONNECTION 'host=blue-cluster.cluster-xxxxx.region.rds.amazonaws.com port=5432 dbname=postgres user=replication_user password=SecurePassword123! sslmode=require'
  PUBLICATION blue_green_pub_postgres
  WITH (
    enabled = true,
    create_slot = false,
    slot_name = 'blue_green_slot_postgres',
    copy_data = false
  );

-- Database: testdb
\c testdb
CREATE SUBSCRIPTION blue_green_sub_testdb
  CONNECTION 'host=blue-cluster.cluster-xxxxx.region.rds.amazonaws.com port=5432 dbname=testdb user=replication_user password=SecurePassword123! sslmode=require'
  PUBLICATION blue_green_pub_testdb
  WITH (
    enabled = true,
    create_slot = false,
    slot_name = 'blue_green_slot_testdb',
    copy_data = false
  );

-- Database: myapp
\c myapp
CREATE SUBSCRIPTION blue_green_sub_myapp
  CONNECTION 'host=blue-cluster.cluster-xxxxx.region.rds.amazonaws.com port=5432 dbname=myapp user=replication_user password=SecurePassword123! sslmode=require'
  PUBLICATION blue_green_pub_myapp
  WITH (
    enabled = true,
    create_slot = false,
    slot_name = 'blue_green_slot_myapp',
    copy_data = false
  );

Critical Parameters:

  • create_slot = false: Slot already exists on Blue (don't create new one)
  • slot_name: Must match the slot name on Blue
  • copy_data = false: Data already exists from snapshot (don't copy again)
  • enabled = true: Start replication immediately

Naming Convention: blue_green_sub_<database_name>


Step 4.4: Verify Subscriptions Started

-- On Green, in each database
SELECT subname, subenabled, subconninfo 
FROM pg_subscription
WHERE subname LIKE 'blue_green_sub_%';

-- Expected output:
-- subname                  | subenabled | subconninfo
-- blue_green_sub_postgres  | t          | host=blue-cluster...
-- blue_green_sub_testdb    | t          | host=blue-cluster...

-- Check apply worker status
SELECT subname, pid, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription
WHERE subname LIKE 'blue_green_sub_%';

-- Expected: pid should be non-NULL (worker running)
-- subname                  | pid  | received_lsn | latest_end_lsn | last_msg_receipt_time
-- blue_green_sub_postgres  | 1234 | 0/5000100    | 0/5000100      | 2026-05-02 08:30:00

Step 4.5: Verify Replication Active on Blue

-- On Blue, in each database
SELECT slot_name, active, active_pid, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';

-- Expected: active = t (subscriber connected)
-- slot_name                | active | active_pid | confirmed_flush_lsn
-- blue_green_slot_postgres | t      | 5678       | 0/5000100
-- blue_green_slot_testdb   | t      | 5679       | 0/5000100

Phase 5: Monitor and Verify Replication

Step 5.1: Test Data Replication

-- On Blue
\c testdb
INSERT INTO test_table (data) VALUES ('test_replication');
SELECT COUNT(*) FROM test_table;

-- Wait 5 seconds

-- On Green
\c testdb
SELECT COUNT(*) FROM test_table;
-- Should match Blue count

SELECT * FROM test_table WHERE data = 'test_replication';
-- Should see the new row

Step 5.2: Monitor Replication Lag

-- On Blue: Check how far behind Green is
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';

-- Expected: lag < 1 MB (typically a few KB)
-- slot_name                | lag
-- blue_green_slot_postgres | 128 bytes
-- blue_green_slot_testdb   | 256 bytes
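The same lag number can be computed outside the database, which is handy for external alerting. A small bash sketch mirroring pg_wal_lsn_diff() arithmetic (an LSN "X/Y" encodes the 64-bit byte position (X << 32) | Y; lsn_to_bytes and lag_bytes are hypothetical helper names):

```shell
#!/bin/bash
# Compute replication lag in bytes from two LSN strings,
# the same arithmetic pg_wal_lsn_diff() performs.

lsn_to_bytes() {
  # Split "X/Y" and combine the two hex halves into one 64-bit value
  local high="${1%%/*}" low="${1##*/}"
  echo $(( (16#$high << 32) | 16#$low ))
}

lag_bytes() {
  # lag = current WAL position minus the slot's confirmed_flush_lsn
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Example: current LSN 0/5000100, confirmed 0/5000000
lag_bytes "0/5000100" "0/5000000"   # → 256
```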

Step 5.3: Monitor Subscription Health

-- On Green: Check subscription status
SELECT subname, 
       pid, 
       received_lsn, 
       latest_end_lsn,
       last_msg_send_time,
       last_msg_receipt_time,
       latest_end_time
FROM pg_stat_subscription
WHERE subname LIKE 'blue_green_sub_%';

-- Check for errors (PostgreSQL 15+)
SELECT subname, apply_error_count, sync_error_count
FROM pg_stat_subscription_stats
WHERE subname LIKE 'blue_green_sub_%';
-- Expected: both counts are 0 (no errors)

Phase 6: Switchover (Promote Green to Primary)

When ready to switch from Blue to Green:

Step 6.1: Stop Application Writes to Blue

# Stop application or redirect traffic away from Blue
# Ensure no more writes are happening

Step 6.2: Wait for Green to Catch Up

-- On Blue: Check replication lag
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';

-- Wait until lag_bytes = 0 for all slots
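The wait can be scripted as a polling loop. A hedged sketch: wait_for_catchup and check_lag are hypothetical names, and the commented psql call assumes the endpoint and credentials from earlier steps:

```shell
#!/bin/bash
# Poll until replication lag reaches zero, then return. The lag check is
# injected as a command name so the loop itself stays self-contained.

wait_for_catchup() {
  local check_cmd="$1" interval="${2:-5}" timeout="${3:-600}"
  local waited=0 lag
  while true; do
    lag=$("$check_cmd")
    if [ "$lag" -eq 0 ]; then
      echo "Green has caught up (lag = 0 bytes)"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out after ${timeout}s; lag is still ${lag} bytes" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}

# Real usage (hypothetical endpoint and credentials):
# check_lag() {
#   psql -h blue-endpoint -U postgres -d postgres -t -A -c \
#     "SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)), 0)
#      FROM pg_replication_slots WHERE slot_name LIKE 'blue_green_slot_%';"
# }
# wait_for_catchup check_lag 5 600
```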

Step 6.3: Drop Subscriptions on Green

-- On Green, in each database
-- Disable each subscription and detach its slot first, so DROP SUBSCRIPTION
-- does not try to drop the slot on Blue (the slots are removed in Step 6.4)
\c postgres
ALTER SUBSCRIPTION blue_green_sub_postgres DISABLE;
ALTER SUBSCRIPTION blue_green_sub_postgres SET (slot_name = NONE);
DROP SUBSCRIPTION blue_green_sub_postgres;

\c testdb
ALTER SUBSCRIPTION blue_green_sub_testdb DISABLE;
ALTER SUBSCRIPTION blue_green_sub_testdb SET (slot_name = NONE);
DROP SUBSCRIPTION blue_green_sub_testdb;

\c myapp
ALTER SUBSCRIPTION blue_green_sub_myapp DISABLE;
ALTER SUBSCRIPTION blue_green_sub_myapp SET (slot_name = NONE);
DROP SUBSCRIPTION blue_green_sub_myapp;

Effect: Green is now independent, no longer replicating from Blue.


Step 6.4: Drop Publications and Slots on Blue

-- On Blue, in each database
\c postgres
DROP PUBLICATION blue_green_pub_postgres;
SELECT pg_drop_replication_slot('blue_green_slot_postgres');

\c testdb
DROP PUBLICATION blue_green_pub_testdb;
SELECT pg_drop_replication_slot('blue_green_slot_testdb');

\c myapp
DROP PUBLICATION blue_green_pub_myapp;
SELECT pg_drop_replication_slot('blue_green_slot_myapp');

Step 6.5: Update Application to Point to Green

# Update application connection strings
# Old: blue-cluster.cluster-xxxxx.region.rds.amazonaws.com
# New: green-cluster.cluster-xxxxx.region.rds.amazonaws.com

# Or use DNS CNAME to switch
# Update CNAME: myapp-db.example.com → green-cluster.cluster-xxxxx...

Step 6.6: Verify Green is Primary

-- On Green
SELECT version();
-- Should show PostgreSQL 17.9 (if you upgraded)

-- Test writes
INSERT INTO test_table (data) VALUES ('green_is_primary');

-- Verify
SELECT * FROM test_table WHERE data = 'green_is_primary';

Phase 7: Cleanup (Optional)

Step 7.1: Keep Blue as Backup

You can keep Blue running for a few days as a backup before deleting.


Step 7.2: Delete Blue Cluster

# When confident Green is working well
# Delete all instances in the cluster first, then the cluster itself
aws rds delete-db-instance \
  --db-instance-identifier blue-instance-1

aws rds delete-db-cluster \
  --db-cluster-identifier blue-cluster \
  --skip-final-snapshot

Automation Script

Complete Setup Script

#!/bin/bash
set -e

# Configuration
BLUE_CLUSTER="blue-cluster"
BLUE_ENDPOINT="blue-cluster.cluster-xxxxx.region.rds.amazonaws.com"
GREEN_CLUSTER="green-cluster"
REGION="ap-northeast-1"
DATABASES=("postgres" "testdb" "myapp")

echo "=== Phase 1: Setup Blue (Publisher) ==="

for DB in "${DATABASES[@]}"; do
  echo "Setting up database: $DB"
  
  # Create publication
  psql -h $BLUE_ENDPOINT -U postgres -d $DB <<EOF
CREATE PUBLICATION blue_green_pub_$DB FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('blue_green_slot_$DB', 'pgoutput');
EOF
done

echo ""
echo "=== Phase 2: Take Snapshot ==="

SNAPSHOT_ID="blue-snapshot-$(date +%s)"

aws rds create-db-cluster-snapshot \
  --db-cluster-identifier $BLUE_CLUSTER \
  --db-cluster-snapshot-identifier $SNAPSHOT_ID \
  --region $REGION

echo "Waiting for snapshot to complete..."
aws rds wait db-cluster-snapshot-available \
  --db-cluster-snapshot-identifier $SNAPSHOT_ID \
  --region $REGION

echo "Snapshot complete: $SNAPSHOT_ID"
echo ""
echo "=== Restore Green from snapshot ==="
echo "Run the following commands to restore Green:"
echo ""
echo "aws rds restore-db-cluster-from-snapshot \\"
echo "  --db-cluster-identifier $GREEN_CLUSTER \\"
echo "  --snapshot-identifier $SNAPSHOT_ID \\"
echo "  --engine aurora-postgresql \\"
echo "  --engine-version 17.9 \\"
echo "  --region $REGION"
echo ""
echo "After Green is available, get the snapshot LSN:"
echo ""

# Get snapshot LSN from Green (Aurora-specific)
echo "SNAPSHOT_LSN=\$(psql -h green-endpoint -U postgres -d postgres -t -c \"SELECT aurora_volume_logical_start_lsn();\" | xargs)"
echo "echo \"Snapshot LSN: \$SNAPSHOT_LSN\""
echo ""
echo "Then continue with Phase 3..."
echo ""
read -p "Press Enter after Green is available and you have the SNAPSHOT_LSN..."

# Prompt for snapshot LSN
read -p "Enter the SNAPSHOT_LSN from Green: " SNAPSHOT_LSN
echo "Using SNAPSHOT_LSN: $SNAPSHOT_LSN"

echo ""
echo "=== Phase 3: Advance Slots ==="

for DB in "${DATABASES[@]}"; do
  psql -h $BLUE_ENDPOINT -U postgres -d $DB <<EOF
SELECT pg_replication_slot_advance('blue_green_slot_$DB', '$SNAPSHOT_LSN');
EOF
done

echo ""
echo "=== Phase 4: Setup Green (Subscriber) ==="
echo "After Green is restored and available, run:"
echo ""

for DB in "${DATABASES[@]}"; do
  cat <<EOF
psql -h green-endpoint -U postgres -d $DB <<SQL
CREATE SUBSCRIPTION blue_green_sub_$DB
  CONNECTION 'host=$BLUE_ENDPOINT port=5432 dbname=$DB user=postgres password=xxx sslmode=require'
  PUBLICATION blue_green_pub_$DB
  WITH (enabled=true, create_slot=false, slot_name='blue_green_slot_$DB', copy_data=false);
SQL

EOF
done

echo "=== Setup Complete ==="

Monitoring Queries

Check Overall Replication Health

-- On Blue: Replication lag summary
SELECT 
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag,
  confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%'
ORDER BY slot_name;

-- On Green: Subscription status summary
SELECT 
  subname,
  pid IS NOT NULL AS worker_running,
  received_lsn,
  latest_end_lsn,
  EXTRACT(EPOCH FROM (NOW() - last_msg_receipt_time)) AS seconds_since_last_msg
FROM pg_stat_subscription
WHERE subname LIKE 'blue_green_sub_%'
ORDER BY subname;

Troubleshooting

Issue: "Invalid Record Length" in PostgreSQL Logs

Symptoms: Seeing this in Green's PostgreSQL logs:

LOG: invalid record length at 0/1000FB78: expected at least 24, got 0

This is NOT an error! This is normal and expected.

Explanation:

  • PostgreSQL replayed all available WAL sequentially
  • When it reaches the end of available WAL, it tries to read the next record
  • Finding no data (0 bytes) indicates the end of WAL - this is the snapshot boundary
  • This LSN (0/1000FB78) is the exact point where logical replication should start

What to do:

  1. Note the LSN from this message - you need it to advance slots on Blue
  2. Verify it's at the end of redo:
    LOG: redo done at 0/1000FB40  (should appear right after)
    
  3. Use this LSN when advancing replication slots on Blue

This message confirms:

  • ✅ No gaps in WAL replay
  • ✅ All data up to snapshot point is present
  • ✅ Green is ready for logical replication
  • ✅ Exact boundary is known
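For repeatability, the extraction from Step 2.2a can be wrapped as a small function that reads log text on stdin (extract_seed_lsn is a hypothetical name):

```shell
#!/bin/bash
# Read PostgreSQL log text on stdin and print the seed LSN from the first
# "invalid record length" line.

extract_seed_lsn() {
  grep "invalid record length" | grep -oE '[0-9A-Fa-f]+/[0-9A-Fa-f]+' | head -1
}

# Example against the log excerpt shown above:
extract_seed_lsn <<'LOG'
LOG: redo starts at 0/100000A0
LOG: invalid record length at 0/1000FB78: expected at least 24, got 0
LOG: redo done at 0/1000FB40
LOG
# → 0/1000FB78
```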

Issue: Subscription Not Starting

Symptoms: pid is NULL in pg_stat_subscription

Check:

-- On Green
SELECT subname, subenabled FROM pg_subscription WHERE subname LIKE 'blue_green_sub_%';

Solution:

-- Enable subscription
ALTER SUBSCRIPTION blue_green_sub_postgres ENABLE;

-- Check error counts (PostgreSQL 15+)
SELECT subname, apply_error_count, sync_error_count
FROM pg_stat_subscription_stats
WHERE subname LIKE 'blue_green_sub_%';

Issue: Duplicate Key Errors

Symptoms: Apply worker crashes with "duplicate key" errors

Cause: Slot was not advanced to snapshot LSN

Solution:

-- On Green: Drop subscription
DROP SUBSCRIPTION blue_green_sub_postgres;

-- On Blue: Advance slot to correct LSN
SELECT pg_replication_slot_advance('blue_green_slot_postgres', '<correct_snapshot_lsn>');

-- On Green: Recreate subscription
CREATE SUBSCRIPTION blue_green_sub_postgres ...;

Issue: Connection Refused

Symptoms: Subscription can't connect to Blue

Check:

  1. Security groups allow PostgreSQL (5432) from Green to Blue
  2. Blue endpoint is correct
  3. User credentials are correct
  4. SSL mode is compatible

Test:

# From Green instance
psql -h blue-endpoint -U replication_user -d postgres

Issue: High Replication Lag

Symptoms: Lag > 100 MB

Check:

-- On Blue
SELECT slot_name, 
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name LIKE 'blue_green_slot_%';

Solutions:

  1. Scale up Green instance (more CPU/memory)
  2. Reduce write load on Blue temporarily
  3. Check network connectivity
  4. Check for long-running transactions on Green

Best Practices

  1. Always advance slots to snapshot LSN before creating subscriptions
  2. Test in non-production first
  3. Monitor replication lag continuously
  4. Use dedicated replication user for better security
  5. Document snapshot LSN for troubleshooting
  6. Keep Blue running for 24-48 hours after switchover
  7. Verify data consistency before deleting Blue
  8. Use SSL for replication connections
  9. Set up alerts for replication lag and subscription errors
  10. Plan rollback procedure before switchover
