Introduction
A critical challenge in Change Data Capture (CDC) systems is replication lag that builds up when replicated tables receive few or no writes while non-replicated tables are heavily updated. In this scenario PostgreSQL's Write-Ahead Log (WAL) accumulates on disk, eventually exhausting storage and threatening a database outage. This guide demonstrates how to implement Debezium's heartbeat mechanism to keep replication slots advancing and prevent unbounded lag growth.
The Problem: Replication Lag with Idle Tables
Understanding the Issue
When using Debezium for PostgreSQL CDC:
- Debezium only processes changes to tables in `table.include.list`
- The replication slot doesn't advance when there are no changes to replicated tables
- WAL accumulates from writes to non-replicated tables
- Disk space fills up because PostgreSQL retains the unconsumed WAL files
Real-World Scenario
Consider this common setup:
- Replicated tables: `orders`, `customers` (business-critical, low write volume)
- Non-replicated tables: `audit_logs`, `metrics`, `sessions` (high write volume)
Result: The replication slot stays at the same LSN (Log Sequence Number) while WAL grows from audit log writes.
Impact Demonstrated
In our testing with a 4x heavy workload (800,000 inserts per batch to non-replicated tables):
Without Heartbeat (35 minutes):
- Initial lag: 16 GB
- Final lag: 23 GB
- Growth rate: ~200 MB/minute
- Replication slot LSN: Frozen at `9/ED2BFA78`
With Heartbeat (after implementation):
- Lag dropped to: 344 bytes
- Replication slot: Actively advancing
- WAL cleanup: Resumed normally
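These figures are easy to sanity-check. A few lines of Python (values taken from the test run above) reproduce the quoted growth rate:

```python
# Sanity-check the lag growth rate from the no-heartbeat test run above.
MIB_PER_GIB = 1024  # MiB per GiB

initial_lag_gb = 16
final_lag_gb = 23
duration_min = 35

growth_mb_per_min = (final_lag_gb - initial_lag_gb) * MIB_PER_GIB / duration_min
print(f"Growth rate: ~{growth_mb_per_min:.0f} MB/minute")  # Growth rate: ~205 MB/minute
```

At that rate, a replication slot frozen for a weekend would pin roughly 850 GB of WAL, which is why the heartbeat matters even when replicated tables are genuinely idle.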
Solution Overview
Debezium provides two heartbeat approaches:
- Heartbeat Table: Creates a dedicated table that Debezium updates periodically
- WAL Logical Messages: Uses PostgreSQL's `pg_logical_emit_message()` to write directly to WAL
Both approaches ensure the replication slot advances regularly, even when replicated tables are idle.
Approach 1: Heartbeat Table (Traditional)
When to Use
- You want visibility into heartbeat activity via SQL queries
- You need an audit trail of heartbeat events
- You're comfortable managing an additional table
Step 1: Create Heartbeat Table
-- Create heartbeat table
CREATE TABLE IF NOT EXISTS public.debezium_heartbeat (
id SERIAL PRIMARY KEY,
ts TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);
-- Grant privileges to Debezium user
GRANT INSERT, UPDATE ON public.debezium_heartbeat TO postgres;
GRANT USAGE, SELECT ON SEQUENCE debezium_heartbeat_id_seq TO postgres;
Step 2: Add Table to Publication
-- Add heartbeat table to existing publication
ALTER PUBLICATION debezium_publication ADD TABLE public.debezium_heartbeat;
-- Verify
SELECT schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'debezium_publication';
Expected output:
schemaname | tablename
------------+---------------------
public | orders
public | customers
public | debezium_heartbeat
Step 3: Create Kafka Topic
If your MSK cluster has auto.create.topics.enable=false, manually create the heartbeat topic:
# Get bootstrap servers
BOOTSTRAP_SERVERS=$(aws kafka get-bootstrap-brokers \
--cluster-arn <your-msk-cluster-arn> \
--region us-east-1 \
--query 'BootstrapBrokerString' \
--output text)
# Create heartbeat topic
kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVERS \
--create \
--topic cdc.public.debezium_heartbeat \
--partitions 1 \
--replication-factor 2
Step 4: Update Connector Configuration
Update your connector to include heartbeat settings:
{
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "your-rds-endpoint.rds.amazonaws.com",
"database.port": "5432",
"database.user": "postgres",
"database.password": "your_secure_password",
"database.dbname": "your_database",
"database.server.name": "rds_pg",
"database.sslmode": "require",
"table.include.list": "public.orders,public.customers,public.debezium_heartbeat",
"plugin.name": "pgoutput",
"slot.name": "debezium_slot",
"publication.name": "debezium_publication",
"topic.prefix": "cdc",
"tasks.max": "1",
"snapshot.mode": "initial",
"heartbeat.interval.ms": "30000",
"heartbeat.action.query": "INSERT INTO public.debezium_heartbeat (ts) VALUES (now()) ON CONFLICT (id) DO UPDATE SET ts = now()"
}
Key Parameters:
| Parameter | Value | Description |
|---|---|---|
| `heartbeat.interval.ms` | `30000` | Heartbeat fires every 30 seconds |
| `heartbeat.action.query` | SQL INSERT/UPDATE | Writes to the heartbeat table |
| `table.include.list` | Must include heartbeat table | Ensures heartbeat changes are captured |
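A common mistake (see Troubleshooting below) is referencing the heartbeat table in `heartbeat.action.query` without also listing it in `table.include.list`. A pre-deployment sanity check can catch this; the following is a sketch in Python, and the function name is mine:

```python
import json

def heartbeat_table_included(config: dict, table: str = "public.debezium_heartbeat") -> bool:
    """Return True if `table` appears in the connector's table.include.list."""
    included = config.get("table.include.list", "")
    return table in (t.strip() for t in included.split(","))

# Example: a trimmed-down connector configuration
config = json.loads("""{
  "table.include.list": "public.orders,public.customers,public.debezium_heartbeat",
  "heartbeat.interval.ms": "30000"
}""")
print(heartbeat_table_included(config))  # True
```

Run this against the JSON file you are about to submit; if it prints `False`, Debezium will never capture the heartbeat rows.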
Step 5: Apply Configuration
# Get current connector version
CURRENT_VERSION=$(aws kafkaconnect describe-connector \
--connector-arn <connector-arn> \
--region us-east-1 \
--query 'currentVersion' \
--output text)
# Update connector
aws kafkaconnect update-connector \
--connector-arn <connector-arn> \
--current-version "$CURRENT_VERSION" \
--connector-configuration file://connector-config-with-heartbeat.json \
--region us-east-1
Wait for the connector to restart (typically 1-2 minutes).
Step 6: Verify Heartbeat Activity
Check heartbeat table:
SELECT COUNT(*) as total_heartbeats,
MAX(ts) as last_heartbeat,
NOW() - MAX(ts) as age
FROM public.debezium_heartbeat;
Expected output:
total_heartbeats | last_heartbeat | age
------------------+-------------------------------+----------------
150 | 2026-03-20 08:45:30.123456+00 | 00:00:15.234
Check Kafka messages:
kafka-console-consumer.sh \
--bootstrap-server $BOOTSTRAP_SERVERS \
--topic cdc.public.debezium_heartbeat \
--from-beginning \
--max-messages 3
Expected message structure:
{
"before": null,
"after": {
"id": 150,
"ts": "2026-03-20T08:45:30.123456Z"
},
"source": {
"version": "3.2.6.Final",
"connector": "postgresql",
"name": "cdc",
"ts_ms": 1773991530123,
"snapshot": "false",
"db": "testdb",
"schema": "public",
"table": "debezium_heartbeat",
"txId": 14261,
"lsn": 67071821280
},
"op": "c",
"ts_ms": 1773992743390
}
Verify LSN advancement:
SELECT slot_name,
confirmed_flush_lsn,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name = 'debezium_slot';
The confirmed_flush_lsn should advance every ~30 seconds.
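PostgreSQL prints an LSN as two 32-bit halves in hex, separated by a slash (e.g. `9/ED2BFA78`). If you want to compute lag outside the database, a Python sketch of `pg_wal_lsn_diff()` looks like this (helper names are mine):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN string ('high/low' in hex) to a byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def lsn_diff_bytes(current: str, confirmed: str) -> int:
    """Python equivalent of pg_wal_lsn_diff(current, confirmed)."""
    return lsn_to_int(current) - lsn_to_int(confirmed)

# Lag between a current WAL position and a frozen slot position:
print(lsn_diff_bytes("A/0D2BFA78", "9/ED2BFA78"))  # 536870912 (exactly 512 MiB)
```

This is handy when you only have slot positions from logs or metrics exports and no live database connection.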
Approach 2: WAL Logical Messages (Recommended)
When to Use
- You want a cleaner implementation without additional tables
- You don't need to query heartbeat history
- You want minimal overhead
Advantages
✅ No table creation required
✅ No publication management needed
✅ Minimal database overhead
✅ Cleaner architecture
Step 1: Create Kafka Topic
Create the topic for logical messages:
kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVERS \
--create \
--topic cdc.message \
--partitions 1 \
--replication-factor 2
Step 2: Update Connector Configuration
{
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "your-rds-endpoint.rds.amazonaws.com",
"database.port": "5432",
"database.user": "postgres",
"database.password": "your_secure_password",
"database.dbname": "your_database",
"database.server.name": "rds_pg",
"database.sslmode": "require",
"table.include.list": "public.orders,public.customers",
"plugin.name": "pgoutput",
"slot.name": "debezium_slot",
"publication.name": "debezium_publication",
"topic.prefix": "cdc",
"tasks.max": "1",
"snapshot.mode": "initial",
"heartbeat.interval.ms": "30000",
"heartbeat.action.query": "SELECT pg_logical_emit_message(false, 'heartbeat', now()::varchar);"
}
Key Differences:
- ❌ No heartbeat table in `table.include.list`
- ✅ Uses the `pg_logical_emit_message()` function
- ✅ Writes directly to WAL without table overhead
Step 3: Apply Configuration
aws kafkaconnect update-connector \
--connector-arn <connector-arn> \
--current-version "$CURRENT_VERSION" \
--connector-configuration file://connector-config-wal-message.json \
--region us-east-1
Step 4: Verify Heartbeat Messages
Check Kafka messages:
kafka-console-consumer.sh \
--bootstrap-server $BOOTSTRAP_SERVERS \
--topic cdc.message \
--from-beginning \
--max-messages 3
Expected message structure:
{
"op": "m",
"ts_ms": 1773993777771,
"source": {
"version": "3.2.6.Final",
"connector": "postgresql",
"name": "cdc",
"ts_ms": 1773993732522,
"snapshot": "false",
"db": "testdb",
"sequence": "[\"68249716520\",\"68249716520\"]",
"schema": "",
"table": "",
"txId": null,
"lsn": 68249716520
},
"transaction": null,
"message": {
"prefix": "heartbeat",
"content": "MjAyNi0wMy0yMCAwODowMjo1Ny4yNTk4ODUrMDA="
}
}
Key indicators:
"op": "m"- Message operation (not create/update/delete)"prefix": "heartbeat"- Message identifier"lsn"- Advancing LSN valuecontent- Base64-encoded timestamp
Comparison: Table vs WAL Messages
| Aspect | Heartbeat Table | WAL Logical Messages |
|---|---|---|
| Setup Complexity | Medium (table + publication) | Low (just configuration) |
| Database Overhead | Table writes + indexes | WAL writes only |
| Visibility | Query via SQL | Kafka messages only |
| Audit Trail | Persistent in table | Transient in WAL |
| Maintenance | Table cleanup needed | No maintenance |
| Kafka Topics | cdc.public.debezium_heartbeat | cdc.message |
| Recommended For | Audit requirements | Production systems |
Test Results: Effectiveness Demonstrated
Test Setup
- Workload: 800,000 inserts/batch (400K audit_logs + 400K metrics)
- Replicated tables: orders, customers (zero writes)
- Test duration: 1 hour
Phase 1: Without Heartbeat (35 minutes)
Time: 15:00 - 15:35
Initial lag: 16 GB
Final lag: 23 GB
Growth: 7 GB in 35 minutes (~200 MB/min)
Replication slot LSN: Frozen at 9/ED2BFA78
Phase 2: With Heartbeat Table (25 minutes)
Time: 15:38 - 16:01
Configuration: heartbeat.interval.ms=30000
Result: Lag dropped from 23 GB to 344 bytes
LSN: Advancing every 30 seconds
Heartbeat messages: Flowing to Kafka
Phase 3: With WAL Messages (Continuous)
Time: 16:02 onwards
Configuration: pg_logical_emit_message()
Workload: Continuous 4x heavy
Result: Lag maintained at 137-191 MB (vs 23+ GB without)
LSN: Advancing regularly
Key Findings
✅ Heartbeat prevents unbounded lag growth
✅ Both approaches equally effective
✅ WAL messages cleaner for production
✅ 30-second interval optimal for most workloads
AWS DMS Alternative
If you're using AWS DMS instead of Debezium, use the built-in heartbeat feature:
Endpoint Settings:
| Parameter | Value | Description |
|---|---|---|
| `heartbeatEnable` | `true` | Enables heartbeat feature |
| `heartbeatSchema` | `public` | Schema for heartbeat table |
| `heartbeatFrequency` | `5` | Update frequency in minutes |
Extra Connection Attributes:
heartbeatEnable=true;heartbeatSchema=public;heartbeatFrequency=5
DMS automatically creates and manages the heartbeat table.
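For reference, the extra connection attributes string is just a semicolon-separated list of `key=value` pairs. A small Python sketch (the function name is mine) shows how it decomposes:

```python
def parse_eca(eca: str) -> dict[str, str]:
    """Parse a DMS extra-connection-attributes string into a dict."""
    return dict(pair.split("=", 1) for pair in eca.split(";") if pair)

attrs = parse_eca("heartbeatEnable=true;heartbeatSchema=public;heartbeatFrequency=5")
print(attrs)  # {'heartbeatEnable': 'true', 'heartbeatSchema': 'public', 'heartbeatFrequency': '5'}
```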
Troubleshooting
Issue 1: Heartbeat Not Firing
Symptoms: No heartbeat messages in Kafka, lag still growing
Diagnosis:
aws logs tail /aws/mskconnect/rds-postgres-debezium \
--since 10m | grep -i heartbeat
Common Causes:
- Heartbeat table not in `table.include.list`
- Heartbeat table not in publication
- Kafka topic doesn't exist
Solution:
-- Verify table in publication
SELECT * FROM pg_publication_tables WHERE pubname = 'debezium_publication';
-- Add if missing
ALTER PUBLICATION debezium_publication ADD TABLE public.debezium_heartbeat;
Issue 2: Kafka Topic Error
Symptoms: Logs show UNKNOWN_TOPIC_OR_PARTITION
Solution: Manually create the Kafka topic:
kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVERS \
--create --topic cdc.public.debezium_heartbeat \
--partitions 1 --replication-factor 2
Issue 3: Heartbeat Firing But Lag Still Growing
Symptoms: Heartbeat messages in Kafka but replication lag increasing
Diagnosis:
-- Check if slot is advancing
SELECT slot_name, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'debezium_slot';
-- Wait 1 minute, check again - LSN should have changed
Solution: Check connector state and restart if needed
Best Practices
1. Choose the Right Interval
- High-volume systems: 30 seconds (default)
- Low-volume systems: 60-120 seconds
- Critical systems: 15-30 seconds
2. Monitor Continuously
Set up CloudWatch alarms for:
- Replication lag > 100 MB
- Heartbeat age > 2 minutes
- Connector state changes
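The thresholds above can be expressed as a simple check function and wired into whatever alerting pipeline you use. The following is a Python sketch with hypothetical names, not a CloudWatch integration:

```python
def cdc_alerts(lag_bytes: int, heartbeat_age_s: float, connector_state: str) -> list[str]:
    """Evaluate the alert conditions suggested above; returns the triggered alert names."""
    alerts = []
    if lag_bytes > 100 * 1024 * 1024:   # replication lag > 100 MB
        alerts.append("replication_lag_high")
    if heartbeat_age_s > 120:           # heartbeat older than 2 minutes
        alerts.append("heartbeat_stale")
    if connector_state != "RUNNING":    # connector left the RUNNING state
        alerts.append("connector_not_running")
    return alerts

# A healthy system (the 344-byte lag from the test results) raises no alerts:
print(cdc_alerts(lag_bytes=344, heartbeat_age_s=15, connector_state="RUNNING"))  # []
```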
3. Use WAL Messages for Production
Unless you need audit trails, prefer pg_logical_emit_message() for:
- Cleaner architecture
- Lower overhead
- Easier maintenance
4. Test Before Production
Always test heartbeat configuration in non-production:
- Simulate idle replicated tables
- Generate load on non-replicated tables
- Verify lag stays minimal
- Monitor for 24 hours
Conclusion
Implementing Debezium heartbeat is essential for production CDC systems where replicated tables may be idle while other database activity continues. Our testing demonstrated dramatic lag reduction (23 GB → 344 bytes) with heartbeat enabled. Choose the WAL logical messages approach for cleaner implementation, or use the heartbeat table if you need audit trails.
Next Steps
- Implement Monitoring: Set up comprehensive monitoring and alerting (Blog Post #4)
- Optimize Performance: Tune heartbeat interval based on your workload
- Plan for Failures: Document recovery procedures