Friday, April 10, 2026

Troubleshooting a performance-related issue with nested SQL in a function/procedure

Verify that auto_explain works with ONLY shared_preload_libraries set to include auto_explain; all other auto_explain settings are applied at the session level below.
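
Before running the test, it's worth confirming the library is actually loaded (on Aurora, shared_preload_libraries is set via the cluster parameter group and requires a restart):

```sql
-- Confirm auto_explain is loaded; the value should include 'auto_explain'
SHOW shared_preload_libraries;
```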


SET client_min_messages = 'log';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;
SET auto_explain.log_nested_statements = true;
-- Test SQL Example
SELECT * FROM get_customer_summary(1, 'completed');

The log output shows all 4 nested queries, followed by the outer query:

LOG:  duration: 0.008 ms  plan:
Query Text: SELECT 1.1
Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.002..0.003 rows=1 loops=1)
  Output: 1.1
Query Identifier: -7531280629008474172

LOG:  duration: 1.193 ms  plan:
Query Text: SELECT COUNT(*) FROM orders o WHERE o.customer_id = p_customer_id
Aggregate  (cost=18.63..18.64 rows=1 width=8) (actual time=1.176..1.176 rows=1 loops=1)
  Output: count(*)
  Buffers: shared read=1
  I/O Timings: shared read=1.144
  ->  Seq Scan on public.orders o  (cost=0.00..18.62 rows=3 width=0)
        Filter: (o.customer_id = 1)
        Rows Removed by Filter: 3
Query Identifier: -2397021968848467034

LOG:  duration: 1507.592 ms  plan:
Query Text: SELECT COUNT(*), pg_sleep(1.5) FROM orders o WHERE o.customer_id = p_customer_id
Aggregate  (cost=18.63..18.65 rows=1 width=12) (actual time=1507.585..1507.586 rows=1 loops=1)
  Output: count(*), pg_sleep('1.5'::double precision)
  Buffers: shared hit=1
  ->  Seq Scan on public.orders o  (cost=0.00..18.62 rows=3 width=0)
        Filter: (o.customer_id = 1)
        Rows Removed by Filter: 3
Query Identifier: -5717035855382371184

LOG:  duration: 0.025 ms  plan:
Query Text: SELECT o.customer_id, COUNT(*)::BIGINT, ...
GroupAggregate  (cost=0.00..20.37 rows=1 width=108) (actual time=0.022..0.022 rows=1 loops=1)
  Output: customer_id, count(*), sum((amount * 1.1)), avg((amount * 1.1)), 'completed'::character varying
  Buffers: shared hit=1
  ->  Seq Scan on public.orders o  (cost=0.00..20.35 rows=1 width=20)
        Filter: ((o.customer_id = 1) AND ((o.status)::text = 'completed'::text))
        Rows Removed by Filter: 3
Query Identifier: 4827049925723211578

LOG:  duration: 1589.046 ms  plan:
Query Text: SELECT * FROM get_customer_summary(1, 'completed');
Function Scan on public.get_customer_summary  (cost=0.25..10.25 rows=1000 width=108)
  Output: customer_id, total_orders, total_amount, avg_amount, status_filter
  Function Call: get_customer_summary(1, 'completed'::character varying)
  Buffers: shared hit=470 read=60
Query Identifier: 2175165029487630477

 customer_id | total_orders | total_amount |     avg_amount      | status_filter 
-------------+--------------+--------------+---------------------+---------------
           1 |            2 |      193.325 | 96.6625000000000000 | completed

Time: 1839.588 ms (00:01.840)
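
For reference, the test function was not recorded in these notes, but from the logged query texts it would have looked roughly like the following sketch (hypothetical reconstruction; the real definition may differ):

```sql
-- Hypothetical reconstruction of get_customer_summary from the logged nested queries
CREATE OR REPLACE FUNCTION get_customer_summary(p_customer_id integer, p_status varchar)
RETURNS TABLE (customer_id integer, total_orders bigint, total_amount numeric,
               avg_amount numeric, status_filter varchar)
LANGUAGE plpgsql AS $$
DECLARE
    v_dummy numeric;
    v_count bigint;
BEGIN
    SELECT 1.1 INTO v_dummy;                                       -- nested query 1
    SELECT COUNT(*) INTO v_count
      FROM orders o WHERE o.customer_id = p_customer_id;           -- nested query 2
    PERFORM COUNT(*), pg_sleep(1.5)
      FROM orders o WHERE o.customer_id = p_customer_id;           -- nested query 3 (the slow one)
    RETURN QUERY                                                   -- nested query 4
    SELECT o.customer_id, COUNT(*)::bigint,
           SUM(o.amount * 1.1), AVG(o.amount * 1.1), p_status
      FROM orders o
     WHERE o.customer_id = p_customer_id AND o.status = p_status
     GROUP BY o.customer_id;
END;
$$;
```

With auto_explain.log_nested_statements = true, each of the four statements inside the function body gets its own logged plan, which is exactly what the output above shows.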


Blue-Green Deployment Replication Lag Troubleshooting Guide

 This guide provides SQL scripts and procedures to diagnose and resolve replication lag issues in Aurora PostgreSQL Blue-Green deployments using logical replication.

Tested on: Aurora PostgreSQL 17.9
All SQL queries in this guide have been verified to work on Aurora PostgreSQL.


Quick Diagnosis Checklist

  1. Check replication lag on BLUE cluster (source)
  2. Check apply rate on GREEN cluster (target)
  3. Identify bottleneck type: CPU, I/O, Network, or Schema issues
  4. Apply appropriate fix

1. Check Replication Lag (Run on BLUE Cluster)

1.1 View Current Replication Lag

-- Check replication slot lag
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    slot_name,
    active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag_size,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) as lag_bytes
FROM pg_replication_slots
ORDER BY lag_bytes DESC;

Example Output:

slot_name                                      | active | lag_size | lag_bytes
-----------------------------------------------+--------+----------+-----------
rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_16412 | t      | 268 GB   | 288073302904
rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_5     | t      | 735 MB   | 770287240

What to look for:

  • Lag > 1 GB indicates a problem
  • Lag > 100 GB indicates a severe bottleneck
  • Inactive slots mean replication is broken
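
Broken slots can be surfaced directly with a minimal check; any rows returned mean a consumer has disconnected and WAL is accumulating behind the slot:

```sql
-- List inactive replication slots (empty result is good)
SELECT slot_name, slot_type, restart_lsn
FROM pg_replication_slots
WHERE NOT active;
```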

1.2 Check WAL Sender Status

-- Check WAL sender processes and their lag
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    pid,
    application_name,
    client_addr,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) as send_lag,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn)) as write_lag,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) as flush_lag,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as replay_lag
FROM pg_stat_replication;

Example Output:

pid  | application_name                           | state     | replay_lag
-----+--------------------------------------------+-----------+-----------
1938 | rds_us_east_1_shbjw735qwo5dyejg5l2bt4e64_16412 | streaming | 257 GB
1865 | rds_us_east_1_shbjw735qwo5dyejg5l2bt4e64_16384 | streaming | 872 bytes

What to look for:

  • state should be 'streaming'
  • Large replay_lag means GREEN is slow to apply changes
  • Large send_lag means network or GREEN receiver is slow

1.3 Check WAL Sender Wait Events

-- Check what WAL senders are waiting on
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    pid,
    application_name,
    state,
    wait_event_type,
    wait_event,
    backend_type,
    EXTRACT(EPOCH FROM (now() - state_change)) as duration_seconds
FROM pg_stat_activity
WHERE backend_type = 'walsender'
ORDER BY duration_seconds DESC;

Example Output:

pid  | state  | wait_event_type | wait_event           | duration_seconds
-----+--------+-----------------+----------------------+-----------------
1865 | active | Client          | WalSenderWaitForWal  | 109290.19
1938 | active | Client          | WalSenderWriteData   | 109289.79

Common wait events:

  • WalSenderWaitForWal: Normal, waiting for new WAL to send
  • WalSenderWriteData: Network bottleneck or GREEN is slow to receive
  • ClientRead: Waiting for GREEN to acknowledge receipt

1.4 Check Write Activity on BLUE

-- Check which tables have the most write activity
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    schemaname,
    relname,
    n_tup_ins as inserts,
    n_tup_upd as updates,
    n_tup_del as deletes,
    n_tup_ins + n_tup_upd + n_tup_del as total_writes,
    pg_size_pretty(pg_total_relation_size(relid)) as size  -- relid avoids quoting issues with mixed-case names
FROM pg_stat_user_tables
WHERE n_tup_ins + n_tup_upd + n_tup_del > 0
ORDER BY total_writes DESC
LIMIT 20;

Example Output:

relname                        | inserts  | updates  | deletes  | total_writes | size
-------------------------------+----------+----------+----------+--------------+------
tbl_group1_insert_delete_01    | 50000000 | 25000000 | 25000000 | 100000000    | 37 GB
tbl_group2_append_only_01      | 75000000 | 0        | 0        | 75000000     | 41 GB

What to look for:

  • Tables with high UPDATE/DELETE counts are most affected by missing primary keys
  • Large tables with many writes will cause the most lag

2. Check Apply Rate (Run on GREEN Cluster)

2.1 Check Logical Replication Workers

-- Check replication worker status
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    pid,
    application_name,
    state,
    wait_event_type,
    wait_event,
    backend_type,
    state_change
FROM pg_stat_activity
WHERE backend_type LIKE '%replication%'
ORDER BY pid;

Example Output:

pid   | state  | wait_event_type | wait_event         | backend_type
------+--------+-----------------+--------------------+------------------------------
768   | NULL   | Activity        | LogicalLauncherMain| logical replication launcher
29139 | idle   | Activity        | LogicalApplyMain   | logical replication apply worker
4900  | active | IO              | DataFileRead       | logical replication apply worker

What to look for:

  • state = 'active': Worker is processing changes
  • state = 'idle': Worker is waiting for more data (normal if caught up)
  • wait_event = 'LogicalApplyMain': Idle, waiting for data
  • wait_event = 'DataFileRead': I/O bottleneck
  • wait_event = 'DataFileWrite': I/O bottleneck

2.2 Measure Apply Rate

Option 1: Single SQL with 60-second measurement

-- Measure apply rate over 60 seconds in a single query
-- ✓ Verified on Aurora PostgreSQL 17.9
DO $$
DECLARE
    lsn1 pg_lsn;
    lsn2 pg_lsn;
    time1 timestamptz;
    time2 timestamptz;
    bytes_diff bigint;
    duration numeric;
    mb_per_sec numeric;
BEGIN
    -- Snapshot 1
    lsn1 := pg_current_wal_lsn();
    time1 := clock_timestamp();
    
    -- Sleep 60 seconds
    PERFORM pg_sleep(60);
    
    -- Snapshot 2
    lsn2 := pg_current_wal_lsn();
    time2 := clock_timestamp();
    
    -- Calculate
    bytes_diff := pg_wal_lsn_diff(lsn2, lsn1);
    duration := EXTRACT(EPOCH FROM (time2 - time1));
    
    IF duration > 0 THEN
        mb_per_sec := bytes_diff / duration / (1024*1024);
    ELSE
        mb_per_sec := 0;
    END IF;
    
    -- Display results
    RAISE NOTICE 'Start LSN: %, End LSN: %', lsn1, lsn2;
    RAISE NOTICE 'Start Time: %, End Time: %', time1, time2;
    RAISE NOTICE 'Bytes Applied: % (%)', bytes_diff, pg_size_pretty(bytes_diff);
    RAISE NOTICE 'Duration: % seconds', round(duration, 3);
    RAISE NOTICE 'Apply Rate: % MB/sec', round(mb_per_sec, 2);
    
    IF bytes_diff = 0 THEN
        RAISE NOTICE 'WARNING: No WAL progress detected - replication may be idle or caught up';
    END IF;
END $$;

Example Output:

NOTICE:  Start LSN: 6B/2DE6BDE8, End LSN: 6B/2FFF5C00
NOTICE:  Start Time: 2026-04-09 14:17:30+00, End Time: 2026-04-09 14:18:30+00
NOTICE:  Bytes Applied: 35167768 (34 MB)
NOTICE:  Duration: 60.002 seconds
NOTICE:  Apply Rate: 0.56 MB/sec

Option 2: Manual two-step measurement

-- Take snapshot 1
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT pg_current_wal_lsn() as lsn1, now() as time1;

-- Wait 60 seconds (use \watch 60 in psql or sleep in script)

-- Take snapshot 2
SELECT pg_current_wal_lsn() as lsn2, now() as time2;

-- Calculate progress (replace LSN values from snapshots above)
SELECT 
    pg_wal_lsn_diff('68/BA670598', '68/B9E387F0') as bytes_applied,
    pg_size_pretty(pg_wal_lsn_diff('68/BA670598', '68/B9E387F0')) as size_applied,
    pg_wal_lsn_diff('68/BA670598', '68/B9E387F0') / 60.0 / (1024*1024) as mb_per_sec;

Example Output:

bytes_applied | size_applied | mb_per_sec
--------------+--------------+-----------
8617384       | 8415 kB      | 0.14

What to look for:

  • Apply rate < 1 MB/sec: Severe bottleneck
  • Apply rate 1-5 MB/sec: Moderate bottleneck
  • Apply rate > 10 MB/sec: Good performance

2.3 Check Subscription Status

-- Check subscription configuration
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    subname,
    subenabled,
    subslotname,
    subpublications
FROM pg_subscription;

Example Output:

subname                                        | subenabled | subslotname
-----------------------------------------------+------------+---------------------------------------------
rds_us_east_1_shbjw735qwo5dyejg5l2bt4e64_16384 | t          | rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_16384
rds_us_east_1_shbjw735qwo5dyejg5l2bt4e64_5     | t          | rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_5
rds_us_east_1_shbjw735qwo5dyejg5l2bt4e64_16412 | t          | rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_16412

What to look for:

  • subenabled = true: Subscription is active
  • If false, replication is paused

3. Identify Root Cause

3.1 Check for Missing Primary Keys (Run on GREEN)

-- Find tables without primary keys
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    n.nspname as schema,
    c.relname as table,
    pg_size_pretty(pg_total_relation_size(c.oid)) as size,
    c.relreplident,
    CASE c.relreplident
        WHEN 'd' THEN 'DEFAULT (primary key)'
        WHEN 'n' THEN 'NOTHING'
        WHEN 'f' THEN 'FULL (all columns)'
        WHEN 'i' THEN 'INDEX'
    END as replica_identity_type
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_constraint con ON con.conrelid = c.oid AND con.contype = 'p'
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
  AND con.conname IS NULL
ORDER BY pg_total_relation_size(c.oid) DESC;

Example Output:

schema | table                          | size  | relreplident | replica_identity_type
-------+--------------------------------+-------+--------------+----------------------
public | tbl_group1_insert_delete_01    | 37 GB | d            | DEFAULT (primary key)
public | tbl_group1_insert_delete_02    | 37 GB | d            | DEFAULT (primary key)

Critical Issue:

  • Tables without primary keys cause FULL TABLE SCANS on every UPDATE/DELETE
  • Large tables (> 1 GB) without PKs will severely impact replication performance
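
To decide which tables to fix first, the missing-PK check can be combined with write activity (a sketch joining the ideas from 1.4 and 3.1; run on BLUE, where the write counters reflect the source workload):

```sql
-- Tables without a primary key, ordered by UPDATE/DELETE volume (fix the top ones first)
SELECT
    s.schemaname,
    s.relname,
    s.n_tup_upd + s.n_tup_del AS updates_and_deletes,
    pg_size_pretty(pg_total_relation_size(s.relid)) AS size
FROM pg_stat_user_tables s
LEFT JOIN pg_constraint con
       ON con.conrelid = s.relid AND con.contype = 'p'
WHERE con.oid IS NULL
ORDER BY s.n_tup_upd + s.n_tup_del DESC
LIMIT 20;
```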

3.2 Check for Sequential Scans (Run on GREEN)

-- Find tables with excessive sequential scans
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    schemaname,
    relname,
    seq_scan,
    seq_tup_read,
    idx_scan,
    n_live_tup,
    CASE 
        WHEN seq_scan > 0 THEN seq_tup_read / seq_scan 
        ELSE 0 
    END as avg_tuples_per_scan,
    pg_size_pretty(pg_total_relation_size(relid)) as size  -- relid avoids quoting issues with mixed-case names
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;

Example Output:

relname                     | seq_scan | seq_tup_read | avg_tuples_per_scan | size
----------------------------+----------+--------------+---------------------+------
tbl_group1_insert_delete_01 | 1000000  | 50000000000  | 50000               | 37 GB

What to look for:

  • High seq_tup_read values indicate full table scans
  • avg_tuples_per_scan > 10,000 on large tables is problematic

3.3 Check for Blocking/Locks (Run on GREEN)

-- Find blocking queries
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    blocked.pid AS blocked_pid,
    blocked.usename AS blocked_user,
    blocking.pid AS blocking_pid,
    blocking.usename AS blocking_user,
    blocked.wait_event_type,
    blocked.wait_event,
    left(blocked.query, 100) AS blocked_query,
    left(blocking.query, 100) AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.pid != blocking.pid;

What to look for:

  • Replication workers blocked by long-running queries
  • Lock contention on frequently updated tables
  • Empty result means no blocking (good)

3.4 Check Table Structure (Run on BOTH clusters)

-- Verify table has primary key
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    conname AS constraint_name,
    array_agg(a.attname ORDER BY array_position(con.conkey, a.attnum)) AS pk_columns
FROM pg_constraint con
JOIN pg_attribute a ON a.attrelid = con.conrelid AND a.attnum = ANY(con.conkey)
WHERE con.conrelid = 'public.tbl_group1_insert_delete_01'::regclass
  AND con.contype = 'p'
GROUP BY conname;

-- Check replica identity setting
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    c.relname,
    c.relreplident,
    CASE c.relreplident
        WHEN 'd' THEN 'DEFAULT (primary key)'
        WHEN 'n' THEN 'NOTHING'
        WHEN 'f' THEN 'FULL (all columns)'
        WHEN 'i' THEN 'INDEX'
    END as replica_identity
FROM pg_class c
WHERE c.oid = 'public.tbl_group1_insert_delete_01'::regclass;

Example Output:

constraint_name                    | pk_columns
-----------------------------------+-----------
tbl_group1_insert_delete_01_pkey   | {id}

relname                        | relreplident | replica_identity
-------------------------------+--------------+------------------
tbl_group1_insert_delete_01    | d            | DEFAULT (primary key)

4. Common Issues and Solutions

Issue 1: Missing Primary Keys

Symptoms:

  • Replication lag growing continuously
  • GREEN workers waiting on DataFileRead
  • High sequential scan counts on GREEN
  • Apply rate < 1 MB/sec

Diagnosis:

-- Run on GREEN
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT count(*) as tables_without_pk
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_constraint con ON con.conrelid = c.oid AND con.contype = 'p'
WHERE c.relkind = 'r'
  AND n.nspname = 'public'
  AND con.conname IS NULL;

Solution:

-- Run on GREEN first (to fix immediate bottleneck)
-- ✓ Verified on Aurora PostgreSQL 17.9
SET default_transaction_read_only = off;
ALTER TABLE tbl_group1_insert_delete_01 ADD PRIMARY KEY (id);

-- Then run on BLUE (to maintain schema consistency)
-- ✓ Verified on Aurora PostgreSQL 17.9
ALTER TABLE tbl_group1_insert_delete_01 ADD PRIMARY KEY (id);

Important Notes:

  • Logical replication does NOT replicate DDL statements
  • You must add primary keys on BOTH clusters manually
  • Add to GREEN first to improve apply performance immediately
  • Add to BLUE to prevent schema drift
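
Note that a plain ADD PRIMARY KEY holds an ACCESS EXCLUSIVE lock for the entire index build, which blocks writes on a busy BLUE cluster. A lower-impact pattern (a sketch; the index name is assumed, and the column must already be NOT NULL) is to build the unique index concurrently first:

```sql
-- Build the unique index without blocking writes
-- (CREATE INDEX CONCURRENTLY cannot run inside a transaction block)
CREATE UNIQUE INDEX CONCURRENTLY tbl_group1_insert_delete_01_id_key
    ON tbl_group1_insert_delete_01 (id);

-- Promote the index to a primary key; this lock is brief compared to the build
ALTER TABLE tbl_group1_insert_delete_01
    ADD CONSTRAINT tbl_group1_insert_delete_01_pkey
    PRIMARY KEY USING INDEX tbl_group1_insert_delete_01_id_key;
```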

Issue 2: I/O Bottleneck

Symptoms:

  • Workers waiting on DataFileRead or DataFileWrite
  • Apply rate improved after adding PKs but still slow
  • CloudWatch shows high disk I/O

Diagnosis:

-- Run on GREEN - check I/O wait events
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    wait_event_type,
    wait_event,
    count(*) as count
FROM pg_stat_activity
WHERE wait_event_type = 'IO'
GROUP BY wait_event_type, wait_event
ORDER BY count DESC;

Example Output:

wait_event_type | wait_event      | count
----------------+-----------------+------
IO              | DataFileRead    | 3
IO              | AuroraStorageLogAllocate | 1

Solutions:

  1. Increase provisioned IOPS on GREEN cluster
  2. Scale up to larger instance class with better I/O
  3. Reduce write workload on BLUE temporarily
  4. Enable Aurora I/O-Optimized configuration

Issue 3: Network Bottleneck

Symptoms:

  • WAL senders on BLUE waiting on WalSenderWriteData
  • Large send_lag but small replay_lag
  • Network throughput at maximum

Diagnosis:

-- Run on BLUE
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    application_name,
    client_addr,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) as send_lag,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as replay_lag
FROM pg_stat_replication;

Solutions:

  1. Scale up instance class for better network bandwidth
  2. Reduce write workload on BLUE
  3. Check for network issues between AZs

Issue 4: Long-Running Transactions

Symptoms:

  • Replication workers blocked
  • Locks visible in pg_stat_activity
  • Lag increases during specific operations

Diagnosis:

-- Run on GREEN - find long-running transactions
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    pid,
    usename,
    state,
    wait_event,
    now() - xact_start as transaction_duration,
    left(query, 100) as query
FROM pg_stat_activity
WHERE state != 'idle'
  AND xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;

Solutions:

  1. Terminate blocking queries: SELECT pg_terminate_backend(pid);
  2. Optimize application queries to be shorter
  3. Use connection pooling with transaction timeouts
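
If the blockers are clearly identified, the sessions blocking the apply workers can be terminated in one statement (a sketch; this forcibly disconnects those sessions, so review what they are with the blocking query from 3.3 first):

```sql
-- Terminate every session that is blocking a logical replication apply worker
SELECT pg_terminate_backend(blocking_pid)
FROM pg_stat_activity blocked,
     LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocking_pid
WHERE blocked.backend_type = 'logical replication apply worker';
```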

5. Monitoring Script

Complete Monitoring Query (Run on BLUE)

-- Comprehensive replication monitoring
-- ✓ Verified on Aurora PostgreSQL 17.9
WITH slot_lag AS (
    SELECT 
        slot_name,
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) as lag_bytes
    FROM pg_replication_slots
),
sender_stats AS (
    SELECT 
        application_name,
        state,
        sync_state,
        pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag_bytes
    FROM pg_stat_replication
)
SELECT 
    sl.slot_name,
    pg_size_pretty(sl.lag_bytes) as slot_lag,
    ss.state as sender_state,
    ss.sync_state,
    pg_size_pretty(ss.replay_lag_bytes) as replay_lag,
    CASE 
        WHEN sl.lag_bytes > 107374182400 THEN 'CRITICAL (>100GB)'
        WHEN sl.lag_bytes > 10737418240 THEN 'WARNING (>10GB)'
        WHEN sl.lag_bytes > 1073741824 THEN 'ATTENTION (>1GB)'
        ELSE 'OK'
    END as status
FROM slot_lag sl
LEFT JOIN sender_stats ss ON ss.application_name LIKE '%' || split_part(sl.slot_name, '_', 6) || '%'
ORDER BY sl.lag_bytes DESC;

Example Output:

slot_name                                      | slot_lag | sender_state | status
-----------------------------------------------+----------+--------------+------------------
rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_16412 | 268 GB   | streaming    | CRITICAL (>100GB)
rds_us_east_1_7jc5pd3g7gitygeslb2ikyrzta_5     | 735 MB   | streaming    | OK

Complete Monitoring Query (Run on GREEN)

-- Comprehensive apply worker monitoring
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT 
    pid,
    backend_type,
    state,
    wait_event_type,
    wait_event,
    CASE 
        WHEN state = 'active' AND wait_event IS NULL THEN 'PROCESSING'
        WHEN state = 'active' AND wait_event_type = 'IO' THEN 'IO_BOTTLENECK'
        WHEN state = 'idle' AND wait_event = 'LogicalApplyMain' THEN 'WAITING_FOR_DATA'
        ELSE 'CHECK_REQUIRED'
    END as worker_status,
    state_change,
    now() - state_change as time_in_state
FROM pg_stat_activity
WHERE backend_type LIKE '%replication%'
ORDER BY backend_type, pid;

Example Output:

pid   | backend_type                         | state  | wait_event      | worker_status
------+--------------------------------------+--------+-----------------+------------------
768   | logical replication launcher         | NULL   | LogicalLauncherMain | CHECK_REQUIRED
29139 | logical replication apply worker     | idle   | LogicalApplyMain    | WAITING_FOR_DATA
4900  | logical replication apply worker     | active | DataFileRead        | IO_BOTTLENECK

6. Performance Tuning Parameters

On GREEN Cluster

-- Note: Aurora PostgreSQL manages most replication parameters automatically
-- These are for reference only - test before applying in production

-- Increase work memory for apply operations (session level)
-- ✓ Verified on Aurora PostgreSQL 17.9
SET logical_decoding_work_mem = '256MB';

-- Check current settings
SHOW max_logical_replication_workers;
SHOW max_parallel_apply_workers_per_subscription;

Important: Aurora PostgreSQL automatically manages most replication parameters. Contact AWS Support before modifying system-level parameters.
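
To see which replication-related parameters are in effect and how they can be changed, pg_settings is more informative than SHOW (the `context` column indicates whether a session-level SET, a parameter-group change, or a restart is required; the parameter list below is a sketch, not exhaustive):

```sql
-- Inspect replication-related settings and where they can be changed
SELECT name, setting, unit, context, source
FROM pg_settings
WHERE name IN ('logical_decoding_work_mem',
               'max_logical_replication_workers',
               'max_parallel_apply_workers_per_subscription',
               'wal_receiver_timeout');
```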


7. Emergency Procedures

Pause Replication (if lag is unmanageable)

-- Run on GREEN
-- Note: For Aurora Blue-Green deployments, subscriptions are managed by AWS
ALTER SUBSCRIPTION subscription_name DISABLE;

Resume Replication

-- Run on GREEN
ALTER SUBSCRIPTION subscription_name ENABLE;

Check Subscription Names

-- Run on GREEN to get actual subscription names
-- ✓ Verified on Aurora PostgreSQL 17.9
SELECT subname FROM pg_subscription;

8. Best Practices

  1. Always ensure tables have primary keys before creating Blue-Green deployments
  2. Monitor replication lag continuously during the deployment
  3. Test schema changes on a non-production Blue-Green deployment first
  4. Reduce write workload during initial replication catch-up
  5. Size GREEN cluster appropriately - same or larger than BLUE
  6. Add primary keys to BOTH clusters to maintain schema consistency
  7. Use CloudWatch metrics to monitor I/O, CPU, and network utilization
  8. Set up alerts for replication lag > 1 GB