Friday, March 20, 2026

Preventing PostgreSQL Replication Lag with Debezium Heartbeat

 

Introduction

A critical challenge in Change Data Capture (CDC) systems is replication lag that occurs when replicated tables receive minimal or no writes while non-replicated tables are heavily updated. In this scenario PostgreSQL retains Write-Ahead Log (WAL) segments for the stalled replication slot, which can exhaust disk space and ultimately take the database down. This guide demonstrates how to implement Debezium's heartbeat mechanism to keep replication slots advancing and prevent unbounded lag growth.

The Problem: Replication Lag with Idle Tables

Understanding the Issue

When using Debezium for PostgreSQL CDC:

  1. Debezium only processes changes to tables in table.include.list
  2. Replication slot doesn't advance when there are no changes to replicated tables
  3. WAL accumulates from writes to non-replicated tables
  4. Disk space fills up as PostgreSQL retains WAL files

Real-World Scenario

Consider this common setup:

  • Replicated tables: orders, customers (business-critical, low write volume)
  • Non-replicated tables: audit_logs, metrics, sessions (high write volume)

Result: The replication slot stays at the same LSN (Log Sequence Number) while WAL grows from audit log writes.
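The lag figures throughout this post come from LSN arithmetic. As a minimal sketch (helper names are ours, not a PostgreSQL API), this mirrors what pg_wal_lsn_diff() computes: an LSN like 9/ED2BFA78 is just a 64-bit WAL byte position written as two hex halves.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '9/ED2BFA78' into an absolute byte position.

    An LSN is two hex numbers: the high 32 bits and the low 32 bits
    of a 64-bit WAL byte offset.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Byte distance between the server's WAL position and the slot,
    mirroring pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)."""
    return parse_lsn(current_wal_lsn) - parse_lsn(confirmed_flush_lsn)


# A slot frozen at 9/ED2BFA78 while the server writes on (the second
# LSN here is an illustrative value, not from the test):
print(lag_bytes("A/2D2BFA78", "9/ED2BFA78"))  # 1073741824 bytes = 1 GiB
```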

Impact Demonstrated

In our testing with a 4x heavy workload (800,000 inserts per batch to non-replicated tables):

Without Heartbeat (35 minutes):

  • Initial lag: 16 GB
  • Final lag: 23 GB
  • Growth rate: ~200 MB/minute
  • Replication slot LSN: Frozen at 9/ED2BFA78

With Heartbeat (after implementation):

  • Lag dropped to: 344 bytes
  • Replication slot: Actively advancing
  • WAL cleanup: Resumed normally

Solution Overview

Debezium provides two heartbeat approaches:

  1. Heartbeat Table: Creates a dedicated table that Debezium updates periodically
  2. WAL Logical Messages: Uses PostgreSQL's pg_logical_emit_message() to write directly to WAL

Both approaches ensure the replication slot advances regularly, even when replicated tables are idle.

Approach 1: Heartbeat Table (Traditional)

When to Use

  • You want visibility into heartbeat activity via SQL queries
  • You need an audit trail of heartbeat events
  • You're comfortable managing an additional table

Step 1: Create Heartbeat Table

-- Create heartbeat table
CREATE TABLE IF NOT EXISTS public.debezium_heartbeat (
    id SERIAL PRIMARY KEY,
    ts TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);

-- Grant privileges to Debezium user
GRANT INSERT, UPDATE ON public.debezium_heartbeat TO postgres;
GRANT USAGE, SELECT ON SEQUENCE debezium_heartbeat_id_seq TO postgres;

Step 2: Add Table to Publication

-- Add heartbeat table to existing publication
ALTER PUBLICATION debezium_publication ADD TABLE public.debezium_heartbeat;

-- Verify
SELECT schemaname, tablename 
FROM pg_publication_tables 
WHERE pubname = 'debezium_publication';

Expected output:

 schemaname |      tablename       
------------+---------------------
 public     | orders
 public     | customers
 public     | debezium_heartbeat

Step 3: Create Kafka Topic

If your MSK cluster has auto.create.topics.enable=false, manually create the heartbeat topic:

# Get bootstrap servers
BOOTSTRAP_SERVERS=$(aws kafka get-bootstrap-brokers \
  --cluster-arn <your-msk-cluster-arn> \
  --region us-east-1 \
  --query 'BootstrapBrokerString' \
  --output text)

# Create heartbeat topic
kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVERS \
  --create \
  --topic cdc.public.debezium_heartbeat \
  --partitions 1 \
  --replication-factor 2

Step 4: Update Connector Configuration

Update your connector to include heartbeat settings:

{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "your-rds-endpoint.rds.amazonaws.com",
  "database.port": "5432",
  "database.user": "postgres",
  "database.password": "your_secure_password",
  "database.dbname": "your_database",
  "database.server.name": "rds_pg",
  "database.sslmode": "require",
  "table.include.list": "public.orders,public.customers,public.debezium_heartbeat",
  "plugin.name": "pgoutput",
  "slot.name": "debezium_slot",
  "publication.name": "debezium_publication",
  "topic.prefix": "cdc",
  "tasks.max": "1",
  "snapshot.mode": "initial",
  "heartbeat.interval.ms": "30000",
  "heartbeat.action.query": "INSERT INTO public.debezium_heartbeat (ts) VALUES (now()) ON CONFLICT (id) DO UPDATE SET ts = now()"
}

Key Parameters:

 Parameter              | Value                        | Description
------------------------+------------------------------+----------------------------------------
 heartbeat.interval.ms  | 30000                        | Heartbeat fires every 30 seconds
 heartbeat.action.query | SQL INSERT/UPDATE            | Writes to the heartbeat table
 table.include.list     | must include heartbeat table | Ensures heartbeat changes are captured

Step 5: Apply Configuration

# Get current connector version
CURRENT_VERSION=$(aws kafkaconnect describe-connector \
  --connector-arn <connector-arn> \
  --region us-east-1 \
  --query 'currentVersion' \
  --output text)

# Update connector
aws kafkaconnect update-connector \
  --connector-arn <connector-arn> \
  --current-version "$CURRENT_VERSION" \
  --connector-configuration file://connector-config-with-heartbeat.json \
  --region us-east-1

Wait for connector to restart (typically 1-2 minutes).

Step 6: Verify Heartbeat Activity

Check heartbeat table:

SELECT COUNT(*) as total_heartbeats, 
       MAX(ts) as last_heartbeat, 
       NOW() - MAX(ts) as age
FROM public.debezium_heartbeat;

Expected output:

 total_heartbeats |        last_heartbeat         |      age       
------------------+-------------------------------+----------------
              150 | 2026-03-20 08:45:30.123456+00 | 00:00:15.234

Check Kafka messages:

kafka-console-consumer.sh \
  --bootstrap-server $BOOTSTRAP_SERVERS \
  --topic cdc.public.debezium_heartbeat \
  --from-beginning \
  --max-messages 3

Expected message structure:

{
  "before": null,
  "after": {
    "id": 150,
    "ts": "2026-03-20T08:45:30.123456Z"
  },
  "source": {
    "version": "3.2.6.Final",
    "connector": "postgresql",
    "name": "cdc",
    "ts_ms": 1773991530123,
    "snapshot": "false",
    "db": "testdb",
    "schema": "public",
    "table": "debezium_heartbeat",
    "txId": 14261,
    "lsn": 67071821280
  },
  "op": "c",
  "ts_ms": 1773992743390
}
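These heartbeat records carry no business meaning, so downstream consumers should recognize and skip them rather than process them as data changes. A minimal filter sketch over a decoded event payload (function name is ours):

```python
def is_heartbeat(record: dict) -> bool:
    """True when a decoded Debezium event comes from the heartbeat table."""
    source = record.get("source") or {}
    return source.get("table") == "debezium_heartbeat"


heartbeat_event = {"op": "c", "source": {"schema": "public", "table": "debezium_heartbeat"}}
order_event = {"op": "c", "source": {"schema": "public", "table": "orders"}}

print(is_heartbeat(heartbeat_event), is_heartbeat(order_event))  # True False
```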

Verify LSN advancement:

SELECT slot_name,
       confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name = 'debezium_slot';

The confirmed_flush_lsn should advance every ~30 seconds.

Approach 2: WAL Logical Messages

When to Use

  • You want a cleaner implementation without additional tables
  • You don't need to query heartbeat history
  • You want minimal overhead

Advantages

✅ No table creation required
✅ No publication management needed
✅ Minimal database overhead
✅ Cleaner architecture

Step 1: Create Kafka Topic

Create the topic for logical messages:

kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVERS \
  --create \
  --topic cdc.message \
  --partitions 1 \
  --replication-factor 2

Step 2: Update Connector Configuration

{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "your-rds-endpoint.rds.amazonaws.com",
  "database.port": "5432",
  "database.user": "postgres",
  "database.password": "your_secure_password",
  "database.dbname": "your_database",
  "database.server.name": "rds_pg",
  "database.sslmode": "require",
  "table.include.list": "public.orders,public.customers",
  "plugin.name": "pgoutput",
  "slot.name": "debezium_slot",
  "publication.name": "debezium_publication",
  "topic.prefix": "cdc",
  "tasks.max": "1",
  "snapshot.mode": "initial",
  "heartbeat.interval.ms": "30000",
  "heartbeat.action.query": "SELECT pg_logical_emit_message(false, 'heartbeat', now()::varchar);"
}

Key Differences:

  • ❌ No heartbeat table in table.include.list
  • ✅ Uses pg_logical_emit_message() function
  • ✅ Writes directly to WAL without table overhead

Step 3: Apply Configuration

aws kafkaconnect update-connector \
  --connector-arn <connector-arn> \
  --current-version "$CURRENT_VERSION" \
  --connector-configuration file://connector-config-wal-message.json \
  --region us-east-1

Step 4: Verify Heartbeat Messages

Check Kafka messages:

kafka-console-consumer.sh \
  --bootstrap-server $BOOTSTRAP_SERVERS \
  --topic cdc.message \
  --from-beginning \
  --max-messages 3

Expected message structure:

{
  "op": "m",
  "ts_ms": 1773993777771,
  "source": {
    "version": "3.2.6.Final",
    "connector": "postgresql",
    "name": "cdc",
    "ts_ms": 1773993732522,
    "snapshot": "false",
    "db": "testdb",
    "sequence": "[\"68249716520\",\"68249716520\"]",
    "schema": "",
    "table": "",
    "txId": null,
    "lsn": 68249716520
  },
  "transaction": null,
  "message": {
    "prefix": "heartbeat",
    "content": "MjAyNi0wMy0yMCAwODowMjo1Ny4yNTk4ODUrMDA="
  }
}

Key indicators:

  • "op": "m" - Message operation (not create/update/delete)
  • "prefix": "heartbeat" - Message identifier
  • "lsn" - Advancing LSN value
  • content - Base64-encoded timestamp
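You can decode the content field to confirm it is the timestamp emitted by the heartbeat query. Decoding the sample payload above:

```python
import base64

content = "MjAyNi0wMy0yMCAwODowMjo1Ny4yNTk4ODUrMDA="
print(base64.b64decode(content).decode())  # 2026-03-20 08:02:57.259885+00
```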

Comparison: Table vs WAL Messages

 Aspect             | Heartbeat Table               | WAL Logical Messages
--------------------+-------------------------------+------------------------------
 Setup complexity   | Medium (table + publication)  | Low (configuration only)
 Database overhead  | Table writes + indexes        | WAL writes only
 Visibility         | Query via SQL                 | Kafka messages only
 Audit trail        | Persistent in table           | Transient in WAL
 Maintenance        | Table cleanup needed          | None required
 Kafka topic        | cdc.public.debezium_heartbeat | cdc.message
 Recommended for    | Audit requirements            | Production systems

Test Results: Effectiveness Demonstrated

Test Setup

  • Workload: 800,000 inserts/batch (400K audit_logs + 400K metrics)
  • Replicated tables: orders, customers (zero writes)
  • Test duration: 1 hour

Phase 1: Without Heartbeat (35 minutes)

Time: 15:00 - 15:35
Initial lag: 16 GB
Final lag: 23 GB
Growth: 7 GB in 35 minutes (~200 MB/min)
Replication slot LSN: Frozen at 9/ED2BFA78
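The quoted ~200 MB/minute growth rate follows directly from these numbers:

```python
growth_gb = 23 - 16  # GB accumulated during the 35-minute window
minutes = 35

rate_mb_per_min = growth_gb * 1024 / minutes
print(rate_mb_per_min)  # 204.8 -- roughly 200 MB/minute
```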

Phase 2: With Heartbeat Table (25 minutes)

Time: 15:38 - 16:01
Configuration: heartbeat.interval.ms=30000
Result: Lag dropped from 23 GB to 344 bytes
LSN: Advancing every 30 seconds
Heartbeat messages: Flowing to Kafka

Phase 3: With WAL Messages (Continuous)

Time: 16:02 onwards
Configuration: pg_logical_emit_message()
Workload: Continuous 4x heavy
Result: Lag maintained at 137-191 MB (vs 23+ GB without)
LSN: Advancing regularly

Key Findings

✅ Heartbeat prevents unbounded lag growth
✅ Both approaches equally effective
✅ WAL messages cleaner for production
✅ 30-second interval optimal for most workloads

AWS DMS Alternative

If you're using AWS DMS instead of Debezium, use the built-in heartbeat feature:

Endpoint Settings:

 Parameter          | Value  | Description
--------------------+--------+---------------------------------
 heartbeatEnable    | true   | Enables the heartbeat feature
 heartbeatSchema    | public | Schema for the heartbeat table
 heartbeatFrequency | 5      | Update frequency in minutes

Extra Connection Attributes:

heartbeatEnable=true;heartbeatSchema=public;heartbeatFrequency=5

DMS automatically creates and manages the heartbeat table.

Troubleshooting

Issue 1: Heartbeat Not Firing

Symptoms: No heartbeat messages in Kafka, lag still growing

Diagnosis:

aws logs tail /aws/mskconnect/rds-postgres-debezium \
  --since 10m | grep -i heartbeat

Common Causes:

  1. Heartbeat table not in table.include.list
  2. Heartbeat table not in publication
  3. Kafka topic doesn't exist

Solution:

-- Verify table in publication
SELECT * FROM pg_publication_tables WHERE pubname = 'debezium_publication';

-- Add if missing
ALTER PUBLICATION debezium_publication ADD TABLE public.debezium_heartbeat;

Issue 2: Kafka Topic Error

Symptoms: Logs show UNKNOWN_TOPIC_OR_PARTITION

Solution: Manually create the Kafka topic:

kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVERS \
  --create --topic cdc.public.debezium_heartbeat \
  --partitions 1 --replication-factor 2

Issue 3: Heartbeat Firing But Lag Still Growing

Symptoms: Heartbeat messages in Kafka but replication lag increasing

Diagnosis:

-- Check if slot is advancing
SELECT slot_name, confirmed_flush_lsn 
FROM pg_replication_slots 
WHERE slot_name = 'debezium_slot';

-- Wait 1 minute, check again - LSN should have changed

Solution: Check the connector's state in MSK Connect and restart it if it is unhealthy.

Best Practices

1. Choose the Right Interval

  • High-volume systems: 30 seconds (default)
  • Low-volume systems: 60-120 seconds
  • Critical systems: 15-30 seconds
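Whichever interval you choose, the slot can only fall behind by roughly one interval's worth of WAL before the next heartbeat lets it advance. A back-of-the-envelope sketch, assuming the ~200 MB/minute rate measured earlier and a steady write rate (function name is ours):

```python
def worst_case_lag_mb(wal_rate_mb_per_min: float, interval_seconds: float) -> float:
    """Upper bound on replication lag between heartbeats, assuming the
    slot advances on every heartbeat and WAL is produced at a steady rate."""
    return wal_rate_mb_per_min * interval_seconds / 60


print(worst_case_lag_mb(200, 30))   # 100.0 MB at the 30-second default
print(worst_case_lag_mb(200, 120))  # 400.0 MB at a 2-minute interval
```

This is consistent with the Phase 3 observation of lag hovering in the hundreds of megabytes under heavy load, rather than growing without bound.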

2. Monitor Continuously

Set up CloudWatch alarms for:

  • Replication lag > 100 MB
  • Heartbeat age > 2 minutes
  • Connector state changes
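The first two thresholds are easy to encode in whatever check you run against pg_replication_slots and the heartbeat table. A hedged sketch using the thresholds from the list above (function name is ours):

```python
def alarms(lag_bytes: int, heartbeat_age_seconds: float) -> list[str]:
    """Return alarm names for the suggested thresholds:
    lag over 100 MB, heartbeat older than 2 minutes."""
    fired = []
    if lag_bytes > 100 * 1024 * 1024:
        fired.append("replication_lag")
    if heartbeat_age_seconds > 120:
        fired.append("heartbeat_stale")
    return fired


print(alarms(344, 15))            # [] -- the healthy post-heartbeat state
print(alarms(23 * 1024**3, 300))  # ['replication_lag', 'heartbeat_stale']
```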

3. Use WAL Messages for Production

Unless you need audit trails, prefer pg_logical_emit_message() for:

  • Cleaner architecture
  • Lower overhead
  • Easier maintenance

4. Test Before Production

Always test heartbeat configuration in non-production:

  1. Simulate idle replicated tables
  2. Generate load on non-replicated tables
  3. Verify lag stays minimal
  4. Monitor for 24 hours

Conclusion

Implementing Debezium heartbeat is essential for production CDC systems where replicated tables may be idle while other database activity continues. Our testing demonstrated dramatic lag reduction (23 GB → 344 bytes) with heartbeat enabled. Choose the WAL logical messages approach for cleaner implementation, or use the heartbeat table if you need audit trails.

Next Steps

  • Implement Monitoring: Set up comprehensive monitoring and alerting (Blog Post #4)
  • Optimize Performance: Tune heartbeat interval based on your workload
  • Plan for Failures: Document recovery procedures
