Friday, March 20, 2026

Preventing PostgreSQL Replication Lag with Debezium Heartbeat

 

Introduction

A critical challenge in Change Data Capture (CDC) systems is replication lag that occurs when replicated tables receive minimal or no writes while non-replicated tables are heavily updated. In this scenario PostgreSQL retains Write-Ahead Log (WAL) segments for the stalled replication slot, which can exhaust disk space and ultimately take the database down. This guide demonstrates how to implement Debezium's heartbeat mechanism to keep replication slots advancing and prevent unbounded lag growth.

The Problem: Replication Lag with Idle Tables

Understanding the Issue

When using Debezium for PostgreSQL CDC:

  1. Debezium only processes changes to tables in table.include.list
  2. Replication slot doesn't advance when there are no changes to replicated tables
  3. WAL accumulates from writes to non-replicated tables
  4. Disk space fills up as PostgreSQL retains WAL files

Real-World Scenario

Consider this common setup:

  • Replicated tables: orders, customers (business-critical, low write volume)
  • Non-replicated tables: audit_logs, metrics, sessions (high write volume)

Result: The replication slot stays at the same LSN (Log Sequence Number) while WAL grows from audit log writes.
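The lag figures throughout this post come from LSN arithmetic. As a minimal sketch (helper names are ours, not a PostgreSQL API), this mirrors what pg_wal_lsn_diff() computes: an LSN like 9/ED2BFA78 is just a 64-bit WAL byte position written as two hex halves.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '9/ED2BFA78' into an absolute byte position.

    An LSN is two hex numbers: the high 32 bits and the low 32 bits
    of a 64-bit WAL byte offset.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Byte distance between the server's WAL position and the slot,
    mirroring pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)."""
    return parse_lsn(current_wal_lsn) - parse_lsn(confirmed_flush_lsn)


# A slot frozen at 9/ED2BFA78 while the server writes on (the second
# LSN here is an illustrative value, not from the test):
print(lag_bytes("A/2D2BFA78", "9/ED2BFA78"))  # 1073741824 bytes = 1 GiB
```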

Impact Demonstrated

In our testing with a 4x heavy workload (800,000 inserts per batch to non-replicated tables):

Without Heartbeat (35 minutes):

  • Initial lag: 16 GB
  • Final lag: 23 GB
  • Growth rate: ~200 MB/minute
  • Replication slot LSN: Frozen at 9/ED2BFA78

With Heartbeat (after implementation):

  • Lag dropped to: 344 bytes
  • Replication slot: Actively advancing
  • WAL cleanup: Resumed normally

Solution Overview

Debezium provides two heartbeat approaches:

  1. Heartbeat Table: Creates a dedicated table that Debezium updates periodically
  2. WAL Logical Messages: Uses PostgreSQL's pg_logical_emit_message() to write directly to WAL

Both approaches ensure the replication slot advances regularly, even when replicated tables are idle.

Approach 1: Heartbeat Table (Traditional)

When to Use

  • You want visibility into heartbeat activity via SQL queries
  • You need an audit trail of heartbeat events
  • You're comfortable managing an additional table

Step 1: Create Heartbeat Table

-- Create heartbeat table
CREATE TABLE IF NOT EXISTS public.debezium_heartbeat (
    id SERIAL PRIMARY KEY,
    ts TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);

-- Grant privileges to Debezium user
GRANT INSERT, UPDATE ON public.debezium_heartbeat TO postgres;
GRANT USAGE, SELECT ON SEQUENCE debezium_heartbeat_id_seq TO postgres;

Step 2: Add Table to Publication

-- Add heartbeat table to existing publication
ALTER PUBLICATION debezium_publication ADD TABLE public.debezium_heartbeat;

-- Verify
SELECT schemaname, tablename 
FROM pg_publication_tables 
WHERE pubname = 'debezium_publication';

Expected output:

 schemaname |      tablename       
------------+---------------------
 public     | orders
 public     | customers
 public     | debezium_heartbeat

Step 3: Create Kafka Topic

If your MSK cluster has auto.create.topics.enable=false, manually create the heartbeat topic:

# Get bootstrap servers
BOOTSTRAP_SERVERS=$(aws kafka get-bootstrap-brokers \
  --cluster-arn <your-msk-cluster-arn> \
  --region us-east-1 \
  --query 'BootstrapBrokerString' \
  --output text)

# Create heartbeat topic
kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVERS \
  --create \
  --topic cdc.public.debezium_heartbeat \
  --partitions 1 \
  --replication-factor 2

Step 4: Update Connector Configuration

Update your connector to include heartbeat settings:

{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "your-rds-endpoint.rds.amazonaws.com",
  "database.port": "5432",
  "database.user": "postgres",
  "database.password": "your_secure_password",
  "database.dbname": "your_database",
  "database.server.name": "rds_pg",
  "database.sslmode": "require",
  "table.include.list": "public.orders,public.customers,public.debezium_heartbeat",
  "plugin.name": "pgoutput",
  "slot.name": "debezium_slot",
  "publication.name": "debezium_publication",
  "topic.prefix": "cdc",
  "tasks.max": "1",
  "snapshot.mode": "initial",
  "heartbeat.interval.ms": "30000",
  "heartbeat.action.query": "INSERT INTO public.debezium_heartbeat (ts) VALUES (now()) ON CONFLICT (id) DO UPDATE SET ts = now()"
}

Key Parameters:

 Parameter              | Value                        | Description
------------------------+------------------------------+----------------------------------------
 heartbeat.interval.ms  | 30000                        | Heartbeat fires every 30 seconds
 heartbeat.action.query | SQL INSERT/UPDATE            | Writes to the heartbeat table
 table.include.list     | must include heartbeat table | Ensures heartbeat changes are captured

Step 5: Apply Configuration

# Get current connector version
CURRENT_VERSION=$(aws kafkaconnect describe-connector \
  --connector-arn <connector-arn> \
  --region us-east-1 \
  --query 'currentVersion' \
  --output text)

# Update connector
aws kafkaconnect update-connector \
  --connector-arn <connector-arn> \
  --current-version "$CURRENT_VERSION" \
  --connector-configuration file://connector-config-with-heartbeat.json \
  --region us-east-1

Wait for connector to restart (typically 1-2 minutes).

Step 6: Verify Heartbeat Activity

Check heartbeat table:

SELECT COUNT(*) as total_heartbeats, 
       MAX(ts) as last_heartbeat, 
       NOW() - MAX(ts) as age
FROM public.debezium_heartbeat;

Expected output:

 total_heartbeats |        last_heartbeat         |      age       
------------------+-------------------------------+----------------
              150 | 2026-03-20 08:45:30.123456+00 | 00:00:15.234

Check Kafka messages:

kafka-console-consumer.sh \
  --bootstrap-server $BOOTSTRAP_SERVERS \
  --topic cdc.public.debezium_heartbeat \
  --from-beginning \
  --max-messages 3

Expected message structure:

{
  "before": null,
  "after": {
    "id": 150,
    "ts": "2026-03-20T08:45:30.123456Z"
  },
  "source": {
    "version": "3.2.6.Final",
    "connector": "postgresql",
    "name": "cdc",
    "ts_ms": 1773991530123,
    "snapshot": "false",
    "db": "testdb",
    "schema": "public",
    "table": "debezium_heartbeat",
    "txId": 14261,
    "lsn": 67071821280
  },
  "op": "c",
  "ts_ms": 1773992743390
}
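These heartbeat records carry no business meaning, so downstream consumers should recognize and skip them rather than process them as data changes. A minimal filter sketch over a decoded event payload (function name is ours):

```python
def is_heartbeat(record: dict) -> bool:
    """True when a decoded Debezium event comes from the heartbeat table."""
    source = record.get("source") or {}
    return source.get("table") == "debezium_heartbeat"


heartbeat_event = {"op": "c", "source": {"schema": "public", "table": "debezium_heartbeat"}}
order_event = {"op": "c", "source": {"schema": "public", "table": "orders"}}

print(is_heartbeat(heartbeat_event), is_heartbeat(order_event))  # True False
```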

Verify LSN advancement:

SELECT slot_name,
       confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name = 'debezium_slot';

The confirmed_flush_lsn should advance every ~30 seconds.

Approach 2: WAL Logical Messages

When to Use

  • You want a cleaner implementation without additional tables
  • You don't need to query heartbeat history
  • You want minimal overhead

Advantages

✅ No table creation required
✅ No publication management needed
✅ Minimal database overhead
✅ Cleaner architecture

Step 1: Create Kafka Topic

Create the topic for logical messages:

kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVERS \
  --create \
  --topic cdc.message \
  --partitions 1 \
  --replication-factor 2

Step 2: Update Connector Configuration

{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "your-rds-endpoint.rds.amazonaws.com",
  "database.port": "5432",
  "database.user": "postgres",
  "database.password": "your_secure_password",
  "database.dbname": "your_database",
  "database.server.name": "rds_pg",
  "database.sslmode": "require",
  "table.include.list": "public.orders,public.customers",
  "plugin.name": "pgoutput",
  "slot.name": "debezium_slot",
  "publication.name": "debezium_publication",
  "topic.prefix": "cdc",
  "tasks.max": "1",
  "snapshot.mode": "initial",
  "heartbeat.interval.ms": "30000",
  "heartbeat.action.query": "SELECT pg_logical_emit_message(false, 'heartbeat', now()::varchar);"
}

Key Differences:

  • ❌ No heartbeat table in table.include.list
  • ✅ Uses pg_logical_emit_message() function
  • ✅ Writes directly to WAL without table overhead

Step 3: Apply Configuration

aws kafkaconnect update-connector \
  --connector-arn <connector-arn> \
  --current-version "$CURRENT_VERSION" \
  --connector-configuration file://connector-config-wal-message.json \
  --region us-east-1

Step 4: Verify Heartbeat Messages

Check Kafka messages:

kafka-console-consumer.sh \
  --bootstrap-server $BOOTSTRAP_SERVERS \
  --topic cdc.message \
  --from-beginning \
  --max-messages 3

Expected message structure:

{
  "op": "m",
  "ts_ms": 1773993777771,
  "source": {
    "version": "3.2.6.Final",
    "connector": "postgresql",
    "name": "cdc",
    "ts_ms": 1773993732522,
    "snapshot": "false",
    "db": "testdb",
    "sequence": "[\"68249716520\",\"68249716520\"]",
    "schema": "",
    "table": "",
    "txId": null,
    "lsn": 68249716520
  },
  "transaction": null,
  "message": {
    "prefix": "heartbeat",
    "content": "MjAyNi0wMy0yMCAwODowMjo1Ny4yNTk4ODUrMDA="
  }
}

Key indicators:

  • "op": "m" - Message operation (not create/update/delete)
  • "prefix": "heartbeat" - Message identifier
  • "lsn" - Advancing LSN value
  • content - Base64-encoded timestamp
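You can decode the content field to confirm it is the timestamp emitted by the heartbeat query. Decoding the sample payload above:

```python
import base64

content = "MjAyNi0wMy0yMCAwODowMjo1Ny4yNTk4ODUrMDA="
print(base64.b64decode(content).decode())  # 2026-03-20 08:02:57.259885+00
```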

Comparison: Table vs WAL Messages

 Aspect             | Heartbeat Table               | WAL Logical Messages
--------------------+-------------------------------+------------------------------
 Setup complexity   | Medium (table + publication)  | Low (configuration only)
 Database overhead  | Table writes + indexes        | WAL writes only
 Visibility         | Query via SQL                 | Kafka messages only
 Audit trail        | Persistent in table           | Transient in WAL
 Maintenance        | Table cleanup needed          | None required
 Kafka topic        | cdc.public.debezium_heartbeat | cdc.message
 Recommended for    | Audit requirements            | Production systems

Test Results: Effectiveness Demonstrated

Test Setup

  • Workload: 800,000 inserts/batch (400K audit_logs + 400K metrics)
  • Replicated tables: orders, customers (zero writes)
  • Test duration: 1 hour

Phase 1: Without Heartbeat (35 minutes)

Time: 15:00 - 15:35
Initial lag: 16 GB
Final lag: 23 GB
Growth: 7 GB in 35 minutes (~200 MB/min)
Replication slot LSN: Frozen at 9/ED2BFA78
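The quoted ~200 MB/minute growth rate follows directly from these numbers:

```python
growth_gb = 23 - 16  # GB accumulated during the 35-minute window
minutes = 35

rate_mb_per_min = growth_gb * 1024 / minutes
print(rate_mb_per_min)  # 204.8 -- roughly 200 MB/minute
```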

Phase 2: With Heartbeat Table (25 minutes)

Time: 15:38 - 16:01
Configuration: heartbeat.interval.ms=30000
Result: Lag dropped from 23 GB to 344 bytes
LSN: Advancing every 30 seconds
Heartbeat messages: Flowing to Kafka

Phase 3: With WAL Messages (Continuous)

Time: 16:02 onwards
Configuration: pg_logical_emit_message()
Workload: Continuous 4x heavy
Result: Lag maintained at 137-191 MB (vs 23+ GB without)
LSN: Advancing regularly

Key Findings

✅ Heartbeat prevents unbounded lag growth
✅ Both approaches equally effective
✅ WAL messages cleaner for production
✅ 30-second interval optimal for most workloads

AWS DMS Alternative

If you're using AWS DMS instead of Debezium, use the built-in heartbeat feature:

Endpoint Settings:

 Parameter          | Value  | Description
--------------------+--------+---------------------------------
 heartbeatEnable    | true   | Enables the heartbeat feature
 heartbeatSchema    | public | Schema for the heartbeat table
 heartbeatFrequency | 5      | Update frequency in minutes

Extra Connection Attributes:

heartbeatEnable=true;heartbeatSchema=public;heartbeatFrequency=5

DMS automatically creates and manages the heartbeat table.

Troubleshooting

Issue 1: Heartbeat Not Firing

Symptoms: No heartbeat messages in Kafka, lag still growing

Diagnosis:

aws logs tail /aws/mskconnect/rds-postgres-debezium \
  --since 10m | grep -i heartbeat

Common Causes:

  1. Heartbeat table not in table.include.list
  2. Heartbeat table not in publication
  3. Kafka topic doesn't exist

Solution:

-- Verify table in publication
SELECT * FROM pg_publication_tables WHERE pubname = 'debezium_publication';

-- Add if missing
ALTER PUBLICATION debezium_publication ADD TABLE public.debezium_heartbeat;

Issue 2: Kafka Topic Error

Symptoms: Logs show UNKNOWN_TOPIC_OR_PARTITION

Solution: Manually create the Kafka topic:

kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVERS \
  --create --topic cdc.public.debezium_heartbeat \
  --partitions 1 --replication-factor 2

Issue 3: Heartbeat Firing But Lag Still Growing

Symptoms: Heartbeat messages in Kafka but replication lag increasing

Diagnosis:

-- Check if slot is advancing
SELECT slot_name, confirmed_flush_lsn 
FROM pg_replication_slots 
WHERE slot_name = 'debezium_slot';

-- Wait 1 minute, check again - LSN should have changed

Solution: Check the connector's state in MSK Connect and restart it if it is unhealthy.

Best Practices

1. Choose the Right Interval

  • High-volume systems: 30 seconds (default)
  • Low-volume systems: 60-120 seconds
  • Critical systems: 15-30 seconds
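Whichever interval you choose, the slot can only fall behind by roughly one interval's worth of WAL before the next heartbeat lets it advance. A back-of-the-envelope sketch, assuming the ~200 MB/minute rate measured earlier and a steady write rate (function name is ours):

```python
def worst_case_lag_mb(wal_rate_mb_per_min: float, interval_seconds: float) -> float:
    """Upper bound on replication lag between heartbeats, assuming the
    slot advances on every heartbeat and WAL is produced at a steady rate."""
    return wal_rate_mb_per_min * interval_seconds / 60


print(worst_case_lag_mb(200, 30))   # 100.0 MB at the 30-second default
print(worst_case_lag_mb(200, 120))  # 400.0 MB at a 2-minute interval
```

This is consistent with the Phase 3 observation of lag hovering in the hundreds of megabytes under heavy load, rather than growing without bound.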

2. Monitor Continuously

Set up CloudWatch alarms for:

  • Replication lag > 100 MB
  • Heartbeat age > 2 minutes
  • Connector state changes
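The first two thresholds are easy to encode in whatever check you run against pg_replication_slots and the heartbeat table. A hedged sketch using the thresholds from the list above (function name is ours):

```python
def alarms(lag_bytes: int, heartbeat_age_seconds: float) -> list[str]:
    """Return alarm names for the suggested thresholds:
    lag over 100 MB, heartbeat older than 2 minutes."""
    fired = []
    if lag_bytes > 100 * 1024 * 1024:
        fired.append("replication_lag")
    if heartbeat_age_seconds > 120:
        fired.append("heartbeat_stale")
    return fired


print(alarms(344, 15))            # [] -- the healthy post-heartbeat state
print(alarms(23 * 1024**3, 300))  # ['replication_lag', 'heartbeat_stale']
```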

3. Use WAL Messages for Production

Unless you need audit trails, prefer pg_logical_emit_message() for:

  • Cleaner architecture
  • Lower overhead
  • Easier maintenance

4. Test Before Production

Always test heartbeat configuration in non-production:

  1. Simulate idle replicated tables
  2. Generate load on non-replicated tables
  3. Verify lag stays minimal
  4. Monitor for 24 hours

Conclusion

Implementing Debezium heartbeat is essential for production CDC systems where replicated tables may be idle while other database activity continues. Our testing demonstrated dramatic lag reduction (23 GB → 344 bytes) with heartbeat enabled. Choose the WAL logical messages approach for cleaner implementation, or use the heartbeat table if you need audit trails.

Next Steps

  • Implement Monitoring: Set up comprehensive monitoring and alerting (Blog Post #4)
  • Optimize Performance: Tune heartbeat interval based on your workload
  • Plan for Failures: Document recovery procedures
