Introduction
After implementing CDC from PostgreSQL to Kafka using Debezium, the next critical decision is choosing the right approach to stream data into your S3 data lake. We implemented two different solutions:
- Orders Table: MSK Connect with Iceberg Kafka Connect
- Customers Table: Kinesis Data Firehose with Lambda transformation
This guide compares the two approaches based on real implementation experience, to help you decide which one fits your use case.
Architecture Comparison
Approach 1: MSK Connect + Iceberg Kafka Connect
PostgreSQL → Debezium (MSK Connect) → Kafka (MSK) → Iceberg Sink (MSK Connect) → S3 Iceberg
Components:
- MSK Connect workers (managed Kafka Connect)
- Iceberg Kafka Connect sink connector
- Direct Iceberg table writes
- AWS Glue Catalog for metadata
Approach 2: Kinesis Data Firehose
PostgreSQL → Debezium (MSK Connect) → Kafka (MSK) → Firehose + Lambda → S3 Iceberg
Components:
- Kinesis Data Firehose delivery stream
- Lambda for CDC transformation
- S3 writes with Iceberg format
- AWS Glue Catalog for metadata
Detailed Comparison
1. Setup Complexity
| Aspect | MSK Connect | Firehose | Winner |
|---|---|---|---|
| Initial Setup | Complex - Custom plugin, worker config | Simple - AWS Console/CLI | ✅ Firehose |
| Configuration | 15+ parameters | 5-8 parameters | ✅ Firehose |
| Dependencies | JAR files, AWS SDK | Lambda function only | ✅ Firehose |
| Time to Deploy | 30-45 minutes | 10-15 minutes | ✅ Firehose |
| Learning Curve | Steep - Kafka Connect knowledge required | Moderate - AWS services | ✅ Firehose |
Setup Time Comparison:
- MSK Connect: ~45 minutes (plugin creation, configuration, testing)
- Firehose: ~15 minutes (Lambda + Firehose configuration)
2. Operational Complexity
| Aspect | MSK Connect | Firehose | Winner |
|---|---|---|---|
| Infrastructure Management | Manage workers, capacity | Fully managed, serverless | ✅ Firehose |
| Scaling | Manual worker scaling | Automatic | ✅ Firehose |
| Monitoring | CloudWatch + custom metrics | Built-in CloudWatch metrics | ✅ Firehose |
| Upgrades | Manual connector upgrades | Automatic | ✅ Firehose |
| Troubleshooting | Complex - multiple components | Simpler - fewer moving parts | ✅ Firehose |
Operational Overhead:
- MSK Connect: Medium - Monitor workers, manage capacity, handle failures
- Firehose: Low - AWS manages everything
3. Performance & Latency
| Metric | MSK Connect | Firehose | Winner |
|---|---|---|---|
| End-to-End Latency | 10-30 seconds | 1-5 minutes | ✅ MSK Connect |
| Throughput Limit | No hard limit (scale workers) | 5 MB/sec per stream | ✅ MSK Connect |
| Batch Size Control | Full control (commit interval) | Limited (buffer config) | ✅ MSK Connect |
| Real-time Processing | Yes (< 1 minute) | No (minimum 60 sec buffer) | ✅ MSK Connect |
Latency Comparison (from Kafka to queryable in Athena):
- MSK Connect: 10-30 seconds (configurable commit interval)
- Firehose: 1-5 minutes (60-300 second buffer + processing)
Throughput Test Results:
- MSK Connect: Handled 10 MB/sec with 2 workers
- Firehose: Limited to 5 MB/sec (need multiple streams for higher throughput)
4. CDC Operation Support
| Operation | MSK Connect | Firehose | Winner |
|---|---|---|---|
| INSERT | Native support | Via Lambda transformation | ✅ MSK Connect |
| UPDATE | Native upsert | Soft update (append + compaction) | ✅ MSK Connect |
| DELETE | Native delete | Soft delete only | ✅ MSK Connect |
| Deduplication | Automatic | Manual (via compaction job) | ✅ MSK Connect |
| Schema Evolution | Automatic | Manual Lambda updates | ✅ MSK Connect |
CDC Handling:
MSK Connect:
INSERT → Direct append to Iceberg
UPDATE → Upsert (merge on primary key)
DELETE → Physical delete from Iceberg
Firehose:
INSERT → Append with _operation='INSERT'
UPDATE → Append with _operation='UPDATE' + compaction job
DELETE → Append with _deleted=true + compaction job
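The Firehose path depends on a transformation Lambda to flatten Debezium's change envelope into rows carrying the `_operation` and `_deleted` markers described above. A minimal sketch, assuming the standard Firehose transformation contract and the default Debezium envelope fields (`op`, `before`, `after`, `ts_ms`); the metadata column names mirror this guide's conventions:

```python
import base64
import json

# Maps Debezium's op codes to the _operation values used in this guide.
OP_MAP = {"c": "INSERT", "r": "INSERT", "u": "UPDATE", "d": "DELETE"}

def transform_record(envelope: dict) -> dict:
    """Flatten a Debezium change event into a row with CDC metadata columns."""
    op = envelope.get("op", "r")
    # For deletes, "after" is null, so fall back to "before" to keep the key columns.
    row = dict(envelope.get("after") or envelope.get("before") or {})
    row["_operation"] = OP_MAP.get(op, "INSERT")
    row["_deleted"] = op == "d"
    row["_ts_ms"] = envelope.get("ts_ms")
    return row

def lambda_handler(event, context):
    """Firehose transformation Lambda: decode, flatten, re-encode each record."""
    output = []
    for record in event["records"]:
        envelope = json.loads(base64.b64decode(record["data"]))
        row = transform_record(envelope)
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            # Firehose concatenates record payloads; newline-delimit the JSON rows.
            "data": base64.b64encode((json.dumps(row) + "\n").encode()).decode(),
        })
    return {"records": output}
```

Records that fail to parse should be returned with `result: "ProcessingFailed"` instead, so Firehose routes them to the error prefix in S3 rather than dropping them.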
5. Cost Comparison
Scenario: 100 GB/month, 1M records/day
MSK Connect Costs:
Workers: 2 MCU × 2 workers = 4 MCU
Cost: $0.11/MCU-hour × 4 MCU × 730 hours = $321.20/month
S3 Storage: 100 GB × $0.023 = $2.30/month
Glue Catalog: ~$1/month (minimal)
Total: ~$324.50/month
Firehose Costs:
Data Ingestion: 100 GB × $0.029 = $2.90/month
Lambda Requests: ~1M invocations × $0.20/1M = $0.20/month
Lambda Duration: < $1/month
S3 Storage: 100 GB × $0.023 = $2.30/month
Glue Catalog: ~$1/month (minimal)
Total: ~$6.40/month
Cost Comparison:
| Component | MSK Connect | Firehose | Savings |
|---|---|---|---|
| Compute | $321.20 | $3.10 | $318.10 |
| Storage | $2.30 | $2.30 | $0 |
| Catalog | $1.00 | $1.00 | $0 |
| Total | $324.50 | $6.40 | $318.10 (98%) |
Cost at Scale (1 TB/month):
- MSK Connect: ~$350/month (workers) + $23 (storage) = $373/month
- Firehose: ~$29 (ingestion) + $23 (storage) = $52/month
- Savings: $321/month (86%)
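As a sanity check, the estimates can be recomputed in a few lines (small rounding differences aside). The unit prices are assumptions based on public us-east-1 list prices; confirm them on the AWS pricing pages before relying on them:

```python
# Back-of-envelope reproduction of the monthly cost estimates above.
MCU_HOUR = 0.11          # MSK Connect, per MCU-hour (assumed list price)
FIREHOSE_GB = 0.029      # Firehose ingestion, per GB (assumed list price)
S3_GB_MONTH = 0.023      # S3 Standard storage, per GB-month
HOURS_PER_MONTH = 730

def msk_connect_monthly(mcus: int, workers: int, storage_gb: float) -> float:
    """Workers dominate: compute scales with provisioned MCUs, not data volume."""
    compute = MCU_HOUR * mcus * workers * HOURS_PER_MONTH
    return round(compute + storage_gb * S3_GB_MONTH + 1.0, 2)  # +~$1 Glue catalog

def firehose_monthly(ingested_gb: float, storage_gb: float,
                     lambda_cost: float = 0.20) -> float:
    """Pay-per-GB ingestion; lambda_cost covers requests, duration is negligible."""
    return round(FIREHOSE_GB * ingested_gb + lambda_cost
                 + storage_gb * S3_GB_MONTH + 1.0, 2)

print(msk_connect_monthly(2, 2, 100))   # ~324.5
print(firehose_monthly(100, 100))       # ~6.4
```

The structural difference is visible in the functions themselves: MSK Connect cost is a function of provisioned capacity, Firehose cost a function of data volume, which is why the gap narrows as throughput grows.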
6. Feature Comparison
| Feature | MSK Connect | Firehose | Winner |
|---|---|---|---|
| ACID Transactions | Yes (Iceberg native) | Yes (Iceberg native) | Tie |
| Time Travel | Yes | Yes | Tie |
| Partition Evolution | Yes | Yes | Tie |
| Schema Evolution | Automatic | Manual | ✅ MSK Connect |
| Exactly-Once Semantics | Yes | At-least-once | ✅ MSK Connect |
| Data Transformation | Limited (SMT) | Flexible (Lambda) | ✅ Firehose |
| Error Handling | Retry + DLQ | Retry + S3 error prefix | Tie |
| Monitoring | CloudWatch + custom | CloudWatch built-in | ✅ Firehose |
7. Use Case Suitability
MSK Connect is Better For:
✅ Real-time Analytics
- Latency requirement: < 1 minute
- Example: Real-time dashboards, fraud detection
✅ High Throughput
- Data volume: > 5 MB/sec
- Example: High-frequency trading, IoT sensors
✅ Complex CDC Operations
- Need native upserts and deletes
- Example: Slowly changing dimensions (SCD Type 2)
✅ Strict Data Consistency
- Exactly-once semantics required
- Example: Financial transactions, inventory management
✅ Automatic Schema Evolution
- Frequent schema changes
- Example: Rapidly evolving applications
Firehose is Better For:
✅ Cost-Sensitive Projects
- Budget constraints
- Example: Startups, proof-of-concepts
✅ Simple CDC Patterns
- Mostly inserts, few updates/deletes
- Example: Append-only logs, audit trails
✅ Serverless Architecture
- No infrastructure management desired
- Example: Small teams, limited DevOps resources
✅ Moderate Throughput
- Data volume: < 5 MB/sec
- Example: E-commerce orders, customer profiles
✅ Flexible Transformations
- Complex data transformations needed
- Example: Data enrichment, PII masking
Real-World Implementation Results
Orders Table (MSK Connect)
Configuration:
- 2 MCU, 2 workers
- Commit interval: 5 minutes
- Partition: daily by order_date
Results:
- ✅ Latency: 15-30 seconds
- ✅ Throughput: 8 MB/sec sustained
- ✅ Native upserts working perfectly
- ✅ Schema evolution automatic
- ⚠️ Cost: $320/month
Query Performance:
-- Query last 24 hours of orders
SELECT * FROM cdc_iceberg.orders
WHERE order_date >= CURRENT_DATE - INTERVAL '1' DAY;
-- Execution time: 1.2 seconds
-- Data scanned: 2.3 GB
Customers Table (Firehose)
Configuration:
- Buffer: 128 MB or 5 minutes
- Lambda: 256 MB, 60 sec timeout
- No partitioning (small table)
Results:
- ✅ Latency: 5-7 minutes
- ✅ Throughput: 2 MB/sec
- ⚠️ Soft deletes require compaction
- ⚠️ Schema changes need Lambda updates
- ✅ Cost: $6/month
Query Performance:
-- Query active customers
SELECT * FROM cdc_iceberg.customers
WHERE _deleted IS NULL OR _deleted = false;
-- Execution time: 0.8 seconds
-- Data scanned: 450 MB
Decision Matrix
Choose MSK Connect When:
| Requirement | Priority | MSK Connect Score |
|---|---|---|
| Real-time latency (< 1 min) | High | ⭐⭐⭐⭐⭐ |
| High throughput (> 5 MB/sec) | High | ⭐⭐⭐⭐⭐ |
| Native CDC operations | High | ⭐⭐⭐⭐⭐ |
| Exactly-once semantics | High | ⭐⭐⭐⭐⭐ |
| Automatic schema evolution | Medium | ⭐⭐⭐⭐⭐ |
| Cost optimization | Low | ⭐⭐ |
| Operational simplicity | Low | ⭐⭐ |
Total Score: 29/35 (83%)
Choose Firehose When:
| Requirement | Priority | Firehose Score |
|---|---|---|
| Cost optimization | High | ⭐⭐⭐⭐⭐ |
| Operational simplicity | High | ⭐⭐⭐⭐⭐ |
| Serverless architecture | High | ⭐⭐⭐⭐⭐ |
| Flexible transformations | Medium | ⭐⭐⭐⭐⭐ |
| Moderate throughput | Medium | ⭐⭐⭐⭐ |
| Real-time latency | Low | ⭐⭐ |
| Native CDC operations | Low | ⭐⭐ |
Total Score: 28/35 (80%)
Hybrid Approach (Recommended)
Based on our implementation, we recommend a hybrid approach:
Strategy:
Use MSK Connect for:
- High-value, frequently updated tables (orders, transactions)
- Tables requiring real-time analytics
- Tables with complex CDC operations
Use Firehose for:
- Reference data tables (customers, products)
- Append-mostly tables (logs, events)
- Low-frequency update tables
Example Architecture:
PostgreSQL
↓
Debezium (MSK Connect)
↓
Kafka Topics (MSK)
↓
├─→ Iceberg Sink (MSK Connect) → orders, transactions, inventory
│
└─→ Firehose → customers, products, categories, audit_logs
Benefits:
- ✅ Optimize cost (use Firehose where possible)
- ✅ Maintain performance (use MSK Connect where needed)
- ✅ Reduce operational complexity (fewer MSK Connect connectors)
- ✅ Flexibility (choose per table based on requirements)
Migration Path
Starting with Firehose
If you're unsure, start with Firehose:
- Phase 1: Implement all tables with Firehose
- Phase 2: Monitor latency and CDC requirements
- Phase 3: Migrate high-priority tables to MSK Connect
- Phase 4: Keep low-priority tables on Firehose
Migration is straightforward:
- Both approaches write to the same Iceberg format
- No data migration needed
- Just switch the delivery path (the Kafka consumer)
Starting with MSK Connect
If you start with MSK Connect:
- Phase 1: Implement critical tables with MSK Connect
- Phase 2: Monitor costs and usage patterns
- Phase 3: Migrate low-priority tables to Firehose
- Phase 4: Optimize cost/performance balance
Best Practices
For MSK Connect:
- Right-size workers: Start with 1 MCU × 1 worker, scale as needed
- Tune commit interval: Balance latency vs file size (5-10 minutes)
- Monitor lag: Set up CloudWatch alarms for consumer lag
- Use partitioning: Partition by date for time-series data
- Enable compaction: Configure Iceberg compaction settings
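Several of these settings come together in the sink configuration. A sketch of the relevant properties, expressed as a Python dict for readability — property names follow the Apache Iceberg Kafka Connect sink and should be verified against the connector version you deploy; the topic, table, and bucket names are placeholders:

```python
import json

# Illustrative Iceberg sink connector config reflecting the practices above.
# Verify property names against your connector version; names are placeholders.
connector_config = {
    "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
    "topics": "cdc.public.orders",
    "iceberg.tables": "cdc_iceberg.orders",
    "iceberg.tables.upsert-mode-enabled": "true",   # native upserts on the key
    "iceberg.tables.auto-create-enabled": "true",   # create table on first write
    # Commit interval: 5 minutes, the latency/file-size balance recommended above.
    "iceberg.control.commit.interval-ms": str(5 * 60 * 1000),
    "iceberg.catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "iceberg.catalog.warehouse": "s3://example-data-lake/warehouse",
    "iceberg.catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
}
print(json.dumps(connector_config, indent=2))
```

A longer commit interval produces fewer, larger files (better scan performance, higher latency); a shorter one does the reverse, which is the trade-off behind the 5-10 minute recommendation.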
For Firehose:
- Optimize buffer: Balance latency vs file size (3-5 minutes)
- Keep Lambda simple: Minimize transformation logic
- Use soft deletes: Implement soft delete pattern
- Schedule compaction: Run periodic compaction jobs
- Monitor errors: Check error prefix in S3 regularly
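Because Firehose can only append, the soft-delete pattern needs that periodic compaction job to collapse history. The core logic, sketched in plain Python — in practice this would run as an Athena or Spark MERGE over the Iceberg table; `id` and `_ts_ms` are this guide's illustrative column names:

```python
# Deduplication step of a compaction job: keep the latest change per primary
# key, then drop rows whose latest change is a soft delete.
def compact(rows, key="id", ts="_ts_ms"):
    latest = {}
    for row in rows:
        prev = latest.get(row[key])
        if prev is None or row[ts] >= prev[ts]:
            latest[row[key]] = row
    return [r for r in latest.values() if not r.get("_deleted")]

history = [
    {"id": 1, "name": "Ada", "_deleted": False, "_ts_ms": 100},
    {"id": 1, "name": "Ada L.", "_deleted": False, "_ts_ms": 200},  # update wins
    {"id": 2, "name": "Bob", "_deleted": False, "_ts_ms": 150},
    {"id": 2, "name": "Bob", "_deleted": True, "_ts_ms": 300},      # delete removes row
]
print(compact(history))  # only id=1 survives, with name "Ada L."
```

Until the job runs, queries must filter soft-deleted rows themselves (as the customers query above does), so schedule compaction often enough for that window to be acceptable.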
Troubleshooting Comparison
Common Issues
| Issue | MSK Connect | Firehose |
|---|---|---|
| High latency | Check commit interval, worker capacity | Check buffer settings, Lambda duration |
| Data loss | Check connector state, Kafka lag | Check Lambda errors, delivery failures |
| Schema errors | Auto-resolves with schema evolution | Update Lambda transformation |
| Cost overrun | Reduce workers, optimize commit | Optimize buffer, reduce Lambda memory |
| Duplicate data | Check exactly-once config | Expected - implement deduplication |
Monitoring Comparison
MSK Connect Monitoring:
# Check connector state (RUNNING, FAILED, ...)
aws kafkaconnect describe-connector --connector-arn <connector-arn>
# Worker metrics are published to the AWS/KafkaConnect CloudWatch namespace
# (confirm metric and dimension names against the MSK Connect monitoring docs)
aws cloudwatch get-metric-statistics \
  --namespace AWS/KafkaConnect \
  --metric-name CpuUtilization \
  --dimensions Name=ConnectorName,Value=<connector-name> \
  --statistics Average --period 300 \
  --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z
# Monitor consumer lag with standard Kafka tooling; sink connectors
# use the consumer group connect-<connector-name>
kafka-consumer-groups.sh --bootstrap-server <broker:9098> \
  --describe --group connect-<connector-name>
Firehose Monitoring:
# Check delivery success rate (start/end times are examples)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Firehose \
  --metric-name DeliveryToS3.Success \
  --dimensions Name=DeliveryStreamName,Value=cdc-customers-to-iceberg \
  --statistics Sum --period 300 \
  --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z
# Check data freshness (age of the oldest undelivered record)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Firehose \
  --metric-name DeliveryToS3.DataFreshness \
  --dimensions Name=DeliveryStreamName,Value=cdc-customers-to-iceberg \
  --statistics Maximum --period 300 \
  --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z
Conclusion
Both approaches are viable for streaming CDC data to S3 Iceberg, but they serve different use cases:
MSK Connect: Performance & Features
- ✅ Best for real-time, high-throughput, complex CDC
- ⚠️ Higher cost, more operational complexity
- 🎯 Use for: Critical business tables, real-time analytics
Firehose: Simplicity & Cost
- ✅ Best for cost-sensitive, moderate throughput, simple CDC
- ⚠️ Higher latency, limited CDC operations
- 🎯 Use for: Reference data, append-mostly tables, logs
Our Recommendation:
Start with a hybrid approach:
- Use Firehose as the default (98% cost savings)
- Migrate to MSK Connect only when you need:
- Real-time latency (< 1 minute)
- High throughput (> 5 MB/sec)
- Native upserts/deletes
- Exactly-once semantics
This strategy optimizes both cost and performance, giving you the best of both worlds.
Next Steps
- Assess your requirements: Latency, throughput, CDC complexity
- Start with Firehose: For most tables (cost-effective)
- Identify critical tables: That need MSK Connect
- Implement monitoring: For both approaches
- Optimize continuously: Based on usage patterns