This guide provides SQL scripts and procedures to diagnose and resolve replication lag issues in Aurora PostgreSQL Blue-Green deployments using logical replication.
Tested on: Aurora PostgreSQL 17.9
All SQL queries in this guide have been verified to work on Aurora PostgreSQL.
Quick Diagnosis Checklist
- Check replication lag on BLUE cluster (source)
- Check apply rate on GREEN cluster (target)
- Identify bottleneck type: CPU, I/O, Network, or Schema issues
- Apply appropriate fix
1. Check Replication Lag (Run on BLUE Cluster)
1.1 View Current Replication Lag
SELECT
slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag_size,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) as lag_bytes
FROM pg_replication_slots
ORDER BY lag_bytes DESC;
Example Output:
slot_name | active | lag_size | lag_bytes
-----------------------------------------------+--------+----------+-----------
rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_16412 | t | 268 GB | 288073302904
rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_5 | t | 735 MB | 770287240
What to look for:
- Lag > 1 GB indicates a problem
- Lag > 100 GB indicates a severe bottleneck
- Inactive slots mean replication is broken
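These thresholds can be scripted for alerting. A minimal sketch in Python, using the same cutoffs this guide applies in the Section 5 monitoring query (the function name is illustrative):

```python
def classify_lag(lag_bytes: int) -> str:
    """Map replication-slot lag in bytes to the severity levels used in this guide."""
    GB = 1024 ** 3
    if lag_bytes > 100 * GB:
        return "CRITICAL (>100GB)"
    if lag_bytes > 10 * GB:
        return "WARNING (>10GB)"
    if lag_bytes > GB:
        return "ATTENTION (>1GB)"
    return "OK"

# The two slots from the example output above:
print(classify_lag(288073302904))  # CRITICAL (>100GB)
print(classify_lag(770287240))     # OK (735 MB is under the 1 GB threshold)
```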
1.2 Check WAL Sender Status
SELECT
pid,
application_name,
client_addr,
state,
sync_state,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) as send_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn)) as write_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) as flush_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as replay_lag
FROM pg_stat_replication;
Example Output:
pid | application_name | state | replay_lag
-----+--------------------------------------------+-----------+-----------
1938 | rds_us_east_1_shbjw735qwo5dyejg5l2bt4e64_16412 | streaming | 257 GB
1865 | rds_us_east_1_shbjw735qwo5dyejg5l2bt4e64_16384 | streaming | 872 bytes
What to look for:
- state should be 'streaming'
- Large replay_lag means GREEN is slow to apply changes
- Large send_lag means network or GREEN receiver is slow
1.3 Check WAL Sender Wait Events
SELECT
pid,
application_name,
state,
wait_event_type,
wait_event,
backend_type,
EXTRACT(EPOCH FROM (now() - state_change)) as duration_seconds
FROM pg_stat_activity
WHERE backend_type = 'walsender'
ORDER BY duration_seconds DESC;
Example Output:
pid | state | wait_event_type | wait_event | duration_seconds
-----+--------+-----------------+----------------------+-----------------
1865 | active | Client | WalSenderWaitForWal | 109290.19
1938 | active | Client | WalSenderWriteData | 109289.79
Common wait events:
- WalSenderWaitForWal: Normal, waiting for new WAL to send
- WalSenderWriteData: Network bottleneck or GREEN is slow to receive
- ClientRead: Waiting for GREEN to acknowledge receipt
1.4 Check Write Activity on BLUE
SELECT
schemaname,
relname,
n_tup_ins as inserts,
n_tup_upd as updates,
n_tup_del as deletes,
n_tup_ins + n_tup_upd + n_tup_del as total_writes,
pg_size_pretty(pg_total_relation_size(relid)) as size
FROM pg_stat_user_tables
WHERE n_tup_ins + n_tup_upd + n_tup_del > 0
ORDER BY total_writes DESC
LIMIT 20;
Example Output:
relname | inserts | updates | deletes | total_writes | size
-------------------------------+----------+----------+----------+--------------+------
tbl_group1_insert_delete_01 | 50000000 | 25000000 | 25000000 | 100000000 | 37 GB
tbl_group2_append_only_01 | 75000000 | 0 | 0 | 75000000 | 41 GB
What to look for:
- Tables with high UPDATE/DELETE counts are most affected by missing primary keys
- Large tables with many writes will cause the most lag
2. Check Apply Rate (Run on GREEN Cluster)
2.1 Check Logical Replication Workers
SELECT
pid,
application_name,
state,
wait_event_type,
wait_event,
backend_type,
state_change
FROM pg_stat_activity
WHERE backend_type LIKE '%replication%'
ORDER BY pid;
Example Output:
pid | state | wait_event_type | wait_event | backend_type
------+--------+-----------------+--------------------+------------------------------
768 | NULL | Activity | LogicalLauncherMain| logical replication launcher
29139 | idle | Activity | LogicalApplyMain | logical replication apply worker
4900 | active | IO | DataFileRead | logical replication apply worker
What to look for:
- state = 'active': Worker is processing changes
- state = 'idle': Worker is waiting for more data (normal if caught up)
- wait_event = 'LogicalApplyMain': Idle, waiting for data
- wait_event = 'DataFileRead': I/O bottleneck
- wait_event = 'DataFileWrite': I/O bottleneck
2.2 Measure Apply Rate
Option 1: Single SQL with 60-second measurement
DO $$
DECLARE
lsn1 pg_lsn;
lsn2 pg_lsn;
time1 timestamptz;
time2 timestamptz;
bytes_diff bigint;
duration numeric;
mb_per_sec numeric;
BEGIN
lsn1 := pg_current_wal_lsn();
time1 := clock_timestamp();
PERFORM pg_sleep(60);
lsn2 := pg_current_wal_lsn();
time2 := clock_timestamp();
bytes_diff := pg_wal_lsn_diff(lsn2, lsn1);
duration := EXTRACT(EPOCH FROM (time2 - time1));
IF duration > 0 THEN
mb_per_sec := bytes_diff / duration / (1024*1024);
ELSE
mb_per_sec := 0;
END IF;
RAISE NOTICE 'Start LSN: %, End LSN: %', lsn1, lsn2;
RAISE NOTICE 'Start Time: %, End Time: %', time1, time2;
RAISE NOTICE 'Bytes Applied: % (%)', bytes_diff, pg_size_pretty(bytes_diff);
RAISE NOTICE 'Duration: % seconds', round(duration, 3);
RAISE NOTICE 'Apply Rate: % MB/sec', round(mb_per_sec, 2);
IF bytes_diff = 0 THEN
RAISE NOTICE 'WARNING: No WAL progress detected - replication may be idle or caught up';
END IF;
END $$;
Example Output:
NOTICE: Start LSN: 6B/2DE6BDE8, End LSN: 6B/2FFF5C00
NOTICE: Start Time: 2026-04-09 14:17:30+00, End Time: 2026-04-09 14:18:30+00
NOTICE: Bytes Applied: 35167768 (34 MB)
NOTICE: Duration: 60.002 seconds
NOTICE: Apply Rate: 0.56 MB/sec
Option 2: Manual two-step measurement
-- Step 1 (run on GREEN): record the starting LSN
SELECT pg_current_wal_lsn() as lsn1, now() as time1;
-- Step 2: wait about 60 seconds, then record the ending LSN
SELECT pg_current_wal_lsn() as lsn2, now() as time2;
-- Step 3: substitute your recorded LSNs (the values below are from the example output)
SELECT
pg_wal_lsn_diff('68/BA670598', '68/B9E387F0') as bytes_applied,
pg_size_pretty(pg_wal_lsn_diff('68/BA670598', '68/B9E387F0')) as size_applied,
pg_wal_lsn_diff('68/BA670598', '68/B9E387F0') / 60.0 / (1024*1024) as mb_per_sec;
Example Output:
bytes_applied | size_applied | mb_per_sec
--------------+--------------+-----------
8617384 | 8217 kB | 0.14
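pg_wal_lsn_diff is plain arithmetic: an LSN like '68/BA670598' is a 64-bit byte position with the part before the slash as the high 32 bits and the part after as the low 32 bits. A short Python sketch that reproduces the numbers above (the helper name is illustrative):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '68/BA670598' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

# Same computation as the pg_wal_lsn_diff call in Step 3:
bytes_applied = lsn_to_bytes("68/BA670598") - lsn_to_bytes("68/B9E387F0")
mb_per_sec = bytes_applied / 60 / (1024 * 1024)
print(bytes_applied)         # 8617384, matching the example output
print(round(mb_per_sec, 2))  # 0.14
```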
What to look for:
- Apply rate < 1 MB/sec: Severe bottleneck
- Apply rate 1-5 MB/sec: Moderate bottleneck
- Apply rate > 10 MB/sec: Good performance
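Combining a slot's lag with a measured apply rate gives a rough catch-up estimate, assuming BLUE keeps writing at a steady rate. A hedged sketch (the function is illustrative; the sample numbers come from the example outputs in Sections 1.1 and 2.2):

```python
def catchup_eta_hours(lag_bytes: int, apply_mb_per_sec: float,
                      blue_write_mb_per_sec: float = 0.0) -> float:
    """Estimate hours until GREEN catches up, given the net drain rate
    (apply rate minus the rate at which BLUE generates new WAL)."""
    net = apply_mb_per_sec - blue_write_mb_per_sec
    if net <= 0:
        return float("inf")  # lag never shrinks at this rate
    return lag_bytes / (net * 1024 * 1024) / 3600

# 268 GB of lag draining at 0.56 MB/sec takes roughly 136 hours (~5.7 days),
# and that is with zero new writes on BLUE:
print(round(catchup_eta_hours(288073302904, 0.56), 1))
```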
2.3 Check Subscription Status
SELECT
subname,
subenabled,
subslotname,
subpublications
FROM pg_subscription;
Example Output:
subname | subenabled | subslotname
-----------------------------------------------+------------+---------------------------------------------
rds_us_east_1_shbjw735qwo5dyejg5l2bt4e64_16384 | t | rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_16384
rds_us_east_1_shbjw735qwo5dyejg5l2bt4e64_5 | t | rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_5
rds_us_east_1_shbjw735qwo5dyejg5l2bt4e64_16412 | t | rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_16412
What to look for:
- subenabled = true: Subscription is active
- If false, replication is paused
3. Identify Root Cause
3.1 Check for Missing Primary Keys (Run on GREEN)
SELECT
n.nspname as schema,
c.relname as table,
pg_size_pretty(pg_total_relation_size(c.oid)) as size,
c.relreplident,
CASE c.relreplident
WHEN 'd' THEN 'DEFAULT (primary key)'
WHEN 'n' THEN 'NOTHING'
WHEN 'f' THEN 'FULL (all columns)'
WHEN 'i' THEN 'INDEX'
END as replica_identity_type
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_constraint con ON con.conrelid = c.oid AND con.contype = 'p'
WHERE c.relkind = 'r'
AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
AND con.conname IS NULL
ORDER BY pg_total_relation_size(c.oid) DESC;
Example Output:
schema | table | size | relreplident | replica_identity_type
-------+--------------------------------+-------+--------------+----------------------
public | tbl_group1_insert_delete_01 | 37 GB | d | DEFAULT (primary key)
public | tbl_group1_insert_delete_02 | 37 GB | d | DEFAULT (primary key)
Critical Issue:
- Tables without primary keys cause FULL TABLE SCANS on every UPDATE/DELETE
- Large tables (> 1 GB) without PKs will severely impact replication performance
3.2 Check for Sequential Scans (Run on GREEN)
SELECT
schemaname,
relname,
seq_scan,
seq_tup_read,
idx_scan,
n_live_tup,
CASE
WHEN seq_scan > 0 THEN seq_tup_read / seq_scan
ELSE 0
END as avg_tuples_per_scan,
pg_size_pretty(pg_total_relation_size(relid)) as size
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;
Example Output:
relname | seq_scan | seq_tup_read | avg_tuples_per_scan | size
----------------------------+----------+--------------+---------------------+------
tbl_group1_insert_delete_01 | 1000000 | 50000000000 | 50000 | 37 GB
What to look for:
- High seq_tup_read values indicate full table scans
- avg_tuples_per_scan > 10,000 on large tables is problematic
3.3 Check for Blocking/Locks (Run on GREEN)
SELECT
blocked.pid AS blocked_pid,
blocked.usename AS blocked_user,
blocking.pid AS blocking_pid,
blocking.usename AS blocking_user,
blocked.wait_event_type,
blocked.wait_event,
left(blocked.query, 100) AS blocked_query,
left(blocking.query, 100) AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.pid != blocking.pid;
What to look for:
- Replication workers blocked by long-running queries
- Lock contention on frequently updated tables
- Empty result means no blocking (good)
3.4 Check Table Structure (Run on BOTH clusters)
SELECT
conname AS constraint_name,
array_agg(a.attname ORDER BY array_position(con.conkey, a.attnum)) AS pk_columns
FROM pg_constraint con
JOIN pg_attribute a ON a.attrelid = con.conrelid AND a.attnum = ANY(con.conkey)
WHERE con.conrelid = 'public.tbl_group1_insert_delete_01'::regclass
AND con.contype = 'p'
GROUP BY conname;
SELECT
c.relname,
c.relreplident,
CASE c.relreplident
WHEN 'd' THEN 'DEFAULT (primary key)'
WHEN 'n' THEN 'NOTHING'
WHEN 'f' THEN 'FULL (all columns)'
WHEN 'i' THEN 'INDEX'
END as replica_identity
FROM pg_class c
WHERE c.oid = 'public.tbl_group1_insert_delete_01'::regclass;
Example Output:
constraint_name | pk_columns
-----------------------------------+-----------
tbl_group1_insert_delete_01_pkey | {id}
relname | relreplident | replica_identity
-------------------------------+--------------+------------------
tbl_group1_insert_delete_01 | d | DEFAULT (primary key)
4. Common Issues and Solutions
Issue 1: Missing Primary Keys
Symptoms:
- Replication lag growing continuously
- GREEN workers waiting on DataFileRead
- High sequential scan counts on GREEN
- Apply rate < 1 MB/sec
Diagnosis:
SELECT count(*) as tables_without_pk
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_constraint con ON con.conrelid = c.oid AND con.contype = 'p'
WHERE c.relkind = 'r'
AND n.nspname = 'public'
AND con.conname IS NULL;
Solution:
-- On GREEN first (GREEN is read-only by default; allow writes for this session)
SET default_transaction_read_only = off;
ALTER TABLE tbl_group1_insert_delete_01 ADD PRIMARY KEY (id);
-- Then on BLUE, so the schemas stay consistent
ALTER TABLE tbl_group1_insert_delete_01 ADD PRIMARY KEY (id);
Important Notes:
- Logical replication does NOT replicate DDL statements
- You must add primary keys on BOTH clusters manually
- Add to GREEN first to improve apply performance immediately
- Add to BLUE to prevent schema drift
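Because DDL is not replicated, every ALTER TABLE must be issued on both clusters. A small helper that generates the statement batch from a table-to-key mapping can keep the two runs identical (the table and column names below are illustrative, not from a real schema):

```python
def pk_statements(tables: dict) -> list:
    """Generate ADD PRIMARY KEY statements to run on both BLUE and GREEN.
    tables maps table name -> primary key column."""
    # The SET is only required on GREEN, which is read-only by default;
    # it is harmless on BLUE.
    stmts = ["SET default_transaction_read_only = off;"]
    for table, column in sorted(tables.items()):
        stmts.append(f"ALTER TABLE {table} ADD PRIMARY KEY ({column});")
    return stmts

for stmt in pk_statements({"tbl_group1_insert_delete_01": "id"}):
    print(stmt)
```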
Issue 2: I/O Bottleneck
Symptoms:
- Workers waiting on DataFileRead or DataFileWrite
- Apply rate improved after adding PKs but still slow
- CloudWatch shows high disk I/O
Diagnosis:
SELECT
wait_event_type,
wait_event,
count(*) as count
FROM pg_stat_activity
WHERE wait_event_type = 'IO'
GROUP BY wait_event_type, wait_event
ORDER BY count DESC;
Example Output:
wait_event_type | wait_event | count
----------------+-----------------+------
IO | DataFileRead | 3
IO | AuroraStorageLogAllocate | 1
Solutions:
- Increase provisioned IOPS on GREEN cluster
- Scale up to larger instance class with better I/O
- Reduce write workload on BLUE temporarily
- Enable Aurora I/O-Optimized configuration
Issue 3: Network Bottleneck
Symptoms:
- WAL senders on BLUE waiting on WalSenderWriteData
- Large send_lag but small replay_lag
- Network throughput at maximum
Diagnosis:
SELECT
application_name,
client_addr,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) as send_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as replay_lag
FROM pg_stat_replication;
Solutions:
- Scale up instance class for better network bandwidth
- Reduce write workload on BLUE
- Check for network issues between AZs
Issue 4: Long-Running Transactions
Symptoms:
- Replication workers blocked
- Locks visible in pg_stat_activity
- Lag increases during specific operations
Diagnosis:
SELECT
pid,
usename,
state,
wait_event,
now() - xact_start as transaction_duration,
left(query, 100) as query
FROM pg_stat_activity
WHERE state != 'idle'
AND xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
Solutions:
- Terminate blocking queries: SELECT pg_terminate_backend(pid);
- Optimize application queries to be shorter
- Use connection pooling with transaction timeouts
5. Monitoring Script
Complete Monitoring Query (Run on BLUE)
WITH slot_lag AS (
SELECT
slot_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) as lag_bytes
FROM pg_replication_slots
),
sender_stats AS (
SELECT
application_name,
state,
sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag_bytes
FROM pg_stat_replication
)
SELECT
sl.slot_name,
pg_size_pretty(sl.lag_bytes) as slot_lag,
ss.state as sender_state,
ss.sync_state,
pg_size_pretty(ss.replay_lag_bytes) as replay_lag,
CASE
WHEN sl.lag_bytes > 107374182400 THEN 'CRITICAL (>100GB)'
WHEN sl.lag_bytes > 10737418240 THEN 'WARNING (>10GB)'
WHEN sl.lag_bytes > 1073741824 THEN 'ATTENTION (>1GB)'
ELSE 'OK'
END as status
FROM slot_lag sl
LEFT JOIN sender_stats ss ON ss.application_name LIKE '%' || split_part(sl.slot_name, '_', 6) || '%'
ORDER BY sl.lag_bytes DESC;
Example Output:
slot_name | slot_lag | sender_state | status
-----------------------------------------------+----------+--------------+------------------
rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_16412 | 268 GB | streaming | CRITICAL (>100GB)
rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_5 | 735 MB | streaming | OK
Complete Monitoring Query (Run on GREEN)
SELECT
pid,
backend_type,
state,
wait_event_type,
wait_event,
CASE
WHEN state = 'active' AND wait_event IS NULL THEN 'PROCESSING'
WHEN state = 'active' AND wait_event_type = 'IO' THEN 'IO_BOTTLENECK'
WHEN state = 'idle' AND wait_event = 'LogicalApplyMain' THEN 'WAITING_FOR_DATA'
ELSE 'CHECK_REQUIRED'
END as worker_status,
state_change,
now() - state_change as time_in_state
FROM pg_stat_activity
WHERE backend_type LIKE '%replication%'
ORDER BY backend_type, pid;
Example Output:
pid | backend_type | state | wait_event | worker_status
------+--------------------------------------+--------+-----------------+------------------
768 | logical replication launcher | NULL | LogicalLauncherMain | CHECK_REQUIRED
29139 | logical replication apply worker | idle | LogicalApplyMain | WAITING_FOR_DATA
4900 | logical replication apply worker | active | DataFileRead | IO_BOTTLENECK
6. Check Replication Configuration
On GREEN Cluster
-- Note: logical_decoding_work_mem affects logical decoding on the source (BLUE);
-- change it through the cluster parameter group, not a client session
SHOW logical_decoding_work_mem;
SHOW max_logical_replication_workers;
SHOW max_parallel_apply_workers_per_subscription;
Important: Aurora PostgreSQL automatically manages most replication parameters. Contact AWS Support before modifying system-level parameters.
7. Emergency Procedures
Pause Replication (if lag is unmanageable)
ALTER SUBSCRIPTION subscription_name DISABLE;
Resume Replication
ALTER SUBSCRIPTION subscription_name ENABLE;
Check Subscription Names
SELECT subname FROM pg_subscription;
8. Best Practices
- Always ensure tables have primary keys before creating Blue-Green deployments
- Monitor replication lag continuously during the deployment
- Test schema changes on a non-production Blue-Green deployment first
- Reduce write workload during initial replication catch-up
- Size GREEN cluster appropriately - same or larger than BLUE
- Add primary keys to BOTH clusters to maintain schema consistency
- Use CloudWatch metrics to monitor I/O, CPU, and network utilization
- Set up alerts for replication lag > 1 GB