Friday, March 20, 2026

Kafka to S3 Iceberg: MSK Connect vs Kinesis Firehose - A Comprehensive Comparison

 

Introduction

After implementing CDC from PostgreSQL to Kafka using Debezium, the next critical decision is choosing the right approach to stream data into your S3 data lake. We implemented two different solutions:

  • Orders Table: MSK Connect with Iceberg Kafka Connect
  • Customers Table: Kinesis Data Firehose with Lambda transformation

This guide compares the two approaches based on that implementation, to help you choose the right one for your use case.

Architecture Comparison

Approach 1: MSK Connect + Iceberg Kafka Connect

PostgreSQL → Debezium (MSK Connect) → Kafka (MSK) → Iceberg Sink (MSK Connect) → S3 Iceberg

Components:

  • MSK Connect workers (managed Kafka Connect)
  • Iceberg Kafka Connect sink connector
  • Direct Iceberg table writes
  • AWS Glue Catalog for metadata

Approach 2: Kinesis Data Firehose

PostgreSQL → Debezium (MSK Connect) → Kafka (MSK) → Firehose + Lambda → S3 Iceberg

Components:

  • Kinesis Data Firehose delivery stream
  • Lambda for CDC transformation
  • S3 writes with Iceberg format
  • AWS Glue Catalog for metadata

Detailed Comparison

1. Setup Complexity

Aspect | MSK Connect | Firehose | Winner
Initial Setup | Complex - Custom plugin, worker config | Simple - AWS Console/CLI | ✅ Firehose
Configuration | 15+ parameters | 5-8 parameters | ✅ Firehose
Dependencies | JAR files, AWS SDK | Lambda function only | ✅ Firehose
Time to Deploy | 30-45 minutes | 10-15 minutes | ✅ Firehose
Learning Curve | Steep - Kafka Connect knowledge required | Moderate - AWS services | ✅ Firehose

Setup Time Comparison:

  • MSK Connect: ~45 minutes (plugin creation, configuration, testing)
  • Firehose: ~15 minutes (Lambda + Firehose configuration)

2. Operational Complexity

Aspect | MSK Connect | Firehose | Winner
Infrastructure Management | Manage workers, capacity | Fully managed, serverless | ✅ Firehose
Scaling | Manual worker scaling | Automatic | ✅ Firehose
Monitoring | CloudWatch + custom metrics | Built-in CloudWatch metrics | ✅ Firehose
Upgrades | Manual connector upgrades | Automatic | ✅ Firehose
Troubleshooting | Complex - multiple components | Simpler - fewer moving parts | ✅ Firehose

Operational Overhead:

  • MSK Connect: Medium - Monitor workers, manage capacity, handle failures
  • Firehose: Low - AWS manages delivery and scaling; you still own the Lambda and the compaction jobs

3. Performance & Latency

Metric | MSK Connect | Firehose | Winner
End-to-End Latency | 10-30 seconds | 1-5 minutes | ✅ MSK Connect
Throughput Limit | No hard limit (scale workers) | 5 MB/sec per stream | ✅ MSK Connect
Batch Size Control | Full control (commit interval) | Limited (buffer config) | ✅ MSK Connect
Real-time Processing | Yes (< 1 minute) | No (minimum 60 sec buffer) | ✅ MSK Connect

Latency Comparison (from Kafka to queryable in Athena):

  • MSK Connect: 10-30 seconds (configurable commit interval)
  • Firehose: 1-5 minutes (60-300 second buffer + processing)
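
To make the buffer's effect on latency concrete, here is a rough sketch of the size-or-interval flush rule: a buffer is delivered when it fills up or when its timer expires, whichever happens first. The function name and the sample numbers are our own illustrative assumptions, and the estimate ignores Lambda processing and commit overhead.

```python
def seconds_until_flush(buffer_mb: float, interval_s: float, throughput_mb_s: float) -> float:
    """Estimate how long a buffer waits before it is flushed.

    The buffer flushes when it reaches its size limit or its time limit,
    whichever comes first.
    """
    if throughput_mb_s <= 0:
        # No incoming data: only the timer can trigger a flush.
        return interval_s
    time_to_fill = buffer_mb / throughput_mb_s
    return min(interval_s, time_to_fill)


# Low traffic: the 300-second timer fires long before 128 MB accumulates.
print(seconds_until_flush(buffer_mb=128, interval_s=300, throughput_mb_s=0.1))  # 300
```

For low-volume tables the interval dominates, so shortening it is the main lever for reducing latency, at the cost of smaller files.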

Throughput Test Results:

  • MSK Connect: Handled 10 MB/sec with 2 workers
  • Firehose: Limited to 5 MB/sec (need multiple streams for higher throughput)

4. CDC Operation Support

Operation | MSK Connect | Firehose | Winner
INSERT | Native support | Via Lambda transformation | ✅ MSK Connect
UPDATE | Native upsert | Soft update (append + compaction) | ✅ MSK Connect
DELETE | Native delete | Soft delete only | ✅ MSK Connect
Deduplication | Automatic | Manual (via compaction job) | ✅ MSK Connect
Schema Evolution | Automatic | Manual Lambda updates | ✅ MSK Connect

CDC Handling:

MSK Connect:

INSERT → Direct append to Iceberg
UPDATE → Upsert (merge on primary key)
DELETE → Physical delete from Iceberg

Firehose:

INSERT → Append with _operation='INSERT'
UPDATE → Append with _operation='UPDATE' + compaction job
DELETE → Append with _deleted=true + compaction job
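
The Firehose mapping above can be sketched as a small pure function that flattens one Debezium change event into an append-only row. This is a minimal sketch, not our production Lambda: the function name and the `_ts_ms` ordering column are hypothetical, treating snapshot reads (`r`) as inserts is an assumption, and the Firehose record envelope (base64 decoding, record IDs, result status) is deliberately left out.

```python
import json
from typing import Any, Dict, Optional

# Debezium's single-letter op codes mapped to the _operation values used above.
# Mapping snapshot reads ("r") to INSERT is an assumption of this sketch.
OP_NAMES = {"c": "INSERT", "r": "INSERT", "u": "UPDATE", "d": "DELETE"}


def to_soft_delete_row(envelope: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Flatten one Debezium change event into an append-only row."""
    # With schemas enabled, the JSON converter wraps the event in "payload".
    event = envelope.get("payload", envelope)
    op = event.get("op")
    if op not in OP_NAMES:
        return None  # truncate or unknown events are skipped in this sketch
    # Deletes carry the last row image in "before"; everything else uses "after".
    image = event.get("before") if op == "d" else event.get("after")
    if image is None:
        return None
    row = dict(image)
    row["_operation"] = OP_NAMES[op]
    row["_deleted"] = op == "d"
    row["_ts_ms"] = event.get("ts_ms")  # hypothetical column, useful for ordering
    return row


event = {"op": "d", "before": {"id": 7, "email": "a@example.com"}, "after": None, "ts_ms": 1700000000000}
print(json.dumps(to_soft_delete_row(event)))
```

Keeping the mapping free of Firehose-specific plumbing makes it easy to unit test and keeps the Lambda handler itself thin.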

5. Cost Comparison

Scenario: 100 GB/month, 1M records/day

MSK Connect Costs:

Workers: 2 MCU × 2 workers = 4 MCU
Cost: $0.11/MCU-hour × 4 MCU × 730 hours = $321.20/month

S3 Storage: 100 GB × $0.023 = $2.30/month
Glue Catalog: ~$1/month (minimal)

Total: ~$324.50/month

Firehose Costs:

Data Ingestion: 100 GB × $0.029 = $2.90/month
Lambda: ~1M invocations × $0.20/1M = $0.20/month
Lambda Duration: < $1/month

S3 Storage: 100 GB × $0.023 = $2.30/month
Glue Catalog: ~$1/month (minimal)

Total: ~$6.40/month (excluding Lambda duration)

Cost Comparison:

Component | MSK Connect | Firehose | Savings
Compute | $321.20 | $3.10 | $318.10
Storage | $2.30 | $2.30 | $0
Catalog | $1.00 | $1.00 | $0
Total | $324.50 | $6.40 | $318.10 (98%)

Cost at Scale (1 TB/month):

  • MSK Connect: ~$350/month (workers) + $23 (storage) = $373/month
  • Firehose: ~$29 (ingestion) + $23 (storage) = $52/month
  • Savings: $321/month (86%)
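
The compute arithmetic for the 100 GB/month scenario can be reproduced with a few lines of Python, which makes it easy to plug in your own volumes. The unit prices are the ones quoted in this section; the function names are ours, and actual prices vary by region and over time, so treat this as a sketch rather than a pricing tool.

```python
HOURS_PER_MONTH = 730


def msk_connect_compute(mcu_per_worker: int, workers: int, usd_per_mcu_hour: float = 0.11) -> float:
    """Monthly MSK Connect compute: every MCU is billed for every hour."""
    return mcu_per_worker * workers * usd_per_mcu_hour * HOURS_PER_MONTH


def firehose_compute(
    gb_per_month: float,
    lambda_invocations_millions: float,
    usd_per_gb: float = 0.029,
    usd_per_million_invocations: float = 0.20,
) -> float:
    """Monthly Firehose ingestion plus Lambda request charges (duration excluded)."""
    return gb_per_month * usd_per_gb + lambda_invocations_millions * usd_per_million_invocations


msk = msk_connect_compute(mcu_per_worker=2, workers=2)
firehose = firehose_compute(gb_per_month=100, lambda_invocations_millions=1)
print(f"MSK Connect compute: ${msk:.2f}/month")
print(f"Firehose compute:    ${firehose:.2f}/month")
```

The key structural difference is visible in the signatures: MSK Connect cost depends only on provisioned capacity, while Firehose cost scales with the data actually delivered.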

6. Feature Comparison

Feature | MSK Connect | Firehose | Winner
ACID Transactions | Yes (Iceberg native) | Yes (Iceberg native) | Tie
Time Travel | Yes | Yes | Tie
Partition Evolution | Yes | Yes | Tie
Schema Evolution | Automatic | Manual | ✅ MSK Connect
Exactly-Once Semantics | Yes | At-least-once | ✅ MSK Connect
Data Transformation | Limited (SMT) | Flexible (Lambda) | ✅ Firehose
Error Handling | Retry + DLQ | Retry + S3 error prefix | Tie
Monitoring | CloudWatch + custom | CloudWatch built-in | ✅ Firehose

7. Use Case Suitability

MSK Connect is Better For:

✅ Real-time Analytics

  • Latency requirement: < 1 minute
  • Example: Real-time dashboards, fraud detection

✅ High Throughput

  • Data volume: > 5 MB/sec
  • Example: High-frequency trading, IoT sensors

✅ Complex CDC Operations

  • Need native upserts and deletes
  • Example: Current-state mirrors of mutable tables (SCD Type 1 style overwrites)

✅ Strict Data Consistency

  • Exactly-once semantics required
  • Example: Financial transactions, inventory management

✅ Automatic Schema Evolution

  • Frequent schema changes
  • Example: Rapidly evolving applications

Firehose is Better For:

✅ Cost-Sensitive Projects

  • Budget constraints
  • Example: Startups, proof-of-concepts

✅ Simple CDC Patterns

  • Mostly inserts, few updates/deletes
  • Example: Append-only logs, audit trails

✅ Serverless Architecture

  • No infrastructure management desired
  • Example: Small teams, limited DevOps resources

✅ Moderate Throughput

  • Data volume: < 5 MB/sec
  • Example: E-commerce orders, customer profiles

✅ Flexible Transformations

  • Complex data transformations needed
  • Example: Data enrichment, PII masking

Real-World Implementation Results

Orders Table (MSK Connect)

Configuration:

  • 2 MCU, 2 workers
  • Commit interval: 5 minutes
  • Partition: daily by order_date

Results:

  • ✅ Latency: 15-30 seconds
  • ✅ Throughput: 8 MB/sec sustained
  • ✅ Native upserts working perfectly
  • ✅ Schema evolution automatic
  • ⚠️ Cost: $320/month

Query Performance:

-- Query last 24 hours of orders
SELECT * FROM cdc_iceberg.orders
WHERE order_date >= CURRENT_DATE - INTERVAL '1' DAY;

-- Execution time: 1.2 seconds
-- Data scanned: 2.3 GB

Customers Table (Firehose)

Configuration:

  • Buffer: 128 MB or 5 minutes
  • Lambda: 256 MB, 60 sec timeout
  • No partitioning (small table)

Results:

  • ✅ Latency: 5-7 minutes
  • ✅ Throughput: 2 MB/sec
  • ⚠️ Soft deletes require compaction
  • ⚠️ Schema changes need Lambda updates
  • ✅ Cost: $6/month

Query Performance:

-- Query active customers
SELECT * FROM cdc_iceberg.customers
WHERE _deleted IS NULL OR _deleted = false;

-- Execution time: 0.8 seconds
-- Data scanned: 450 MB

Decision Matrix

Choose MSK Connect When:

Requirement | Priority | MSK Connect Score
Real-time latency (< 1 min) | High | ⭐⭐⭐⭐⭐
High throughput (> 5 MB/sec) | High | ⭐⭐⭐⭐⭐
Native CDC operations | High | ⭐⭐⭐⭐⭐
Exactly-once semantics | High | ⭐⭐⭐⭐⭐
Automatic schema evolution | Medium | ⭐⭐⭐⭐⭐
Cost optimization | Low | ⭐⭐
Operational simplicity | Low | ⭐⭐

Total Score: 29/35 (83%)

Choose Firehose When:

Requirement | Priority | Firehose Score
Cost optimization | High | ⭐⭐⭐⭐⭐
Operational simplicity | High | ⭐⭐⭐⭐⭐
Serverless architecture | High | ⭐⭐⭐⭐⭐
Flexible transformations | Medium | ⭐⭐⭐⭐⭐
Moderate throughput | Medium | ⭐⭐⭐⭐
Real-time latency | Low | ⭐⭐
Native CDC operations | Low | ⭐⭐
Native CDC operationsLow⭐⭐

Total Score: 28/35 (80%)
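
Each total is the sum of the star ratings against a maximum of five stars per row. A small sketch of that calculation follows; the dictionaries simply transcribe the two tables, and the helper name is ours.

```python
from typing import Dict, Tuple

msk_connect_stars = {
    "Real-time latency (< 1 min)": 5,
    "High throughput (> 5 MB/sec)": 5,
    "Native CDC operations": 5,
    "Exactly-once semantics": 5,
    "Automatic schema evolution": 5,
    "Cost optimization": 2,
    "Operational simplicity": 2,
}

firehose_stars = {
    "Cost optimization": 5,
    "Operational simplicity": 5,
    "Serverless architecture": 5,
    "Flexible transformations": 5,
    "Moderate throughput": 4,
    "Real-time latency": 2,
    "Native CDC operations": 2,
}


def total_score(stars: Dict[str, int], max_per_row: int = 5) -> Tuple[int, int, int]:
    """Return (earned stars, maximum stars, rounded percentage)."""
    earned = sum(stars.values())
    maximum = max_per_row * len(stars)
    return earned, maximum, round(100 * earned / maximum)


for name, stars in (("MSK Connect", msk_connect_stars), ("Firehose", firehose_stars)):
    earned, maximum, pct = total_score(stars)
    print(f"{name}: {earned}/{maximum} ({pct}%)")
```

Because the two matrices rate different requirement lists, the totals describe how well each option fits its own ideal use case and are not directly comparable to each other.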

Based on our implementation, we recommend a hybrid approach:

Strategy:

  1. Use MSK Connect for:

    • High-value, frequently updated tables (orders, transactions)
    • Tables requiring real-time analytics
    • Tables with complex CDC operations
  2. Use Firehose for:

    • Reference data tables (customers, products)
    • Append-mostly tables (logs, events)
    • Low-frequency update tables
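
The per-table choice can be written down as a simple rule. The sketch below encodes the thresholds used throughout this post (sub-minute latency, more than 5 MB/sec, native upserts, exactly-once); the class and function names are hypothetical, and real decisions will usually weigh cost as well.

```python
from dataclasses import dataclass


@dataclass
class TableRequirements:
    max_latency_s: float           # how stale the table may be, in seconds
    peak_throughput_mb_s: float    # expected peak change volume
    needs_native_upserts: bool = False
    needs_exactly_once: bool = False


def choose_sink(req: TableRequirements) -> str:
    """Route a table to MSK Connect only when a hard requirement demands it."""
    if (
        req.max_latency_s < 60
        or req.peak_throughput_mb_s > 5
        or req.needs_native_upserts
        or req.needs_exactly_once
    ):
        return "msk-connect"
    return "firehose"  # cheaper default for everything else


print(choose_sink(TableRequirements(max_latency_s=30, peak_throughput_mb_s=8, needs_native_upserts=True)))
print(choose_sink(TableRequirements(max_latency_s=600, peak_throughput_mb_s=2)))
```

With these inputs the first table (an orders-like profile) is routed to MSK Connect and the second (a customers-like profile) to Firehose.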

Example Architecture:

PostgreSQL
    ↓
Debezium (MSK Connect)
    ↓
Kafka Topics (MSK)
    ↓
    ├─→ Iceberg Sink (MSK Connect) → orders, transactions, inventory
    │
    └─→ Firehose → customers, products, categories, audit_logs

Benefits:

  • ✅ Optimize cost (use Firehose where possible)
  • ✅ Maintain performance (use MSK Connect where needed)
  • ✅ Reduce operational complexity (fewer MSK Connect connectors)
  • ✅ Flexibility (choose per table based on requirements)

Migration Path

Starting with Firehose

If you're unsure, start with Firehose:

  1. Phase 1: Implement all tables with Firehose
  2. Phase 2: Monitor latency and CDC requirements
  3. Phase 3: Migrate high-priority tables to MSK Connect
  4. Phase 4: Keep low-priority tables on Firehose

Migration is straightforward:

  • Both write to the same Iceberg table format
  • No data migration needed
  • Just switch the consumer (compact any soft-deleted or superseded rows written by Firehose first)
Starting with MSK Connect

If you start with MSK Connect:

  1. Phase 1: Implement critical tables with MSK Connect
  2. Phase 2: Monitor costs and usage patterns
  3. Phase 3: Migrate low-priority tables to Firehose
  4. Phase 4: Optimize cost/performance balance

Best Practices

For MSK Connect:

  1. Right-size workers: Start with 1 MCU × 1 worker, scale as needed
  2. Tune commit interval: Balance latency vs file size (5-10 minutes)
  3. Monitor lag: Set up CloudWatch alarms for consumer lag
  4. Use partitioning: Partition by date for time-series data
  5. Enable compaction: Configure Iceberg compaction settings
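
As an illustration of the commit-interval tuning above, here is a hedged sketch that builds a sink configuration as a plain string-to-string map. The property names follow the Apache Iceberg Kafka Connect sink documentation as we understand it, so verify them against the connector version you deploy; the topic name, bucket, and helper function are hypothetical.

```python
from typing import Dict


def build_iceberg_sink_config(
    topic: str,
    table: str,
    warehouse: str,
    commit_interval_minutes: int = 5,
) -> Dict[str, str]:
    """Build a Kafka Connect style config map (all values are strings)."""
    if commit_interval_minutes <= 0:
        raise ValueError("commit_interval_minutes must be positive")
    return {
        "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
        "tasks.max": "2",
        "topics": topic,
        "iceberg.tables": table,
        # Glue-backed catalog with S3 storage.
        "iceberg.catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "iceberg.catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "iceberg.catalog.warehouse": warehouse,
        # Longer intervals produce fewer, larger files at the cost of latency.
        "iceberg.control.commit.interval-ms": str(commit_interval_minutes * 60 * 1000),
    }


config = build_iceberg_sink_config(
    topic="cdc.public.orders",                  # hypothetical Debezium topic
    table="cdc_iceberg.orders",
    warehouse="s3://example-bucket/warehouse",  # hypothetical bucket
)
print(config["iceberg.control.commit.interval-ms"])  # 300000
```

Generating the map in code keeps the commit interval in one place, which helps when different tables need different latency and file-size trade-offs.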

For Firehose:

  1. Optimize buffer: Balance latency vs file size (3-5 minutes)
  2. Keep Lambda simple: Minimize transformation logic
  3. Use soft deletes: Implement soft delete pattern
  4. Schedule compaction: Run periodic compaction jobs
  5. Monitor errors: Check error prefix in S3 regularly
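
To show what a compaction job has to do with soft-deleted data, here is a minimal in-memory sketch: keep only the latest change per primary key, then drop keys whose latest change is a delete. The `id` key and the `_ts_ms` ordering column are hypothetical, and a real job would run as an engine-side rewrite or merge over the Iceberg table instead of a Python loop.

```python
from typing import Any, Dict, Iterable, List


def latest_active_rows(
    rows: Iterable[Dict[str, Any]],
    key: str = "id",            # hypothetical primary-key column
    order_by: str = "_ts_ms",   # hypothetical change-timestamp column
) -> List[Dict[str, Any]]:
    """Collapse appended change rows to the current state of each key."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        current = latest.get(row[key])
        # Later (or equally timed but later-arriving) changes win.
        if current is None or row[order_by] >= current[order_by]:
            latest[row[key]] = row
    # A key whose most recent change is a soft delete disappears entirely.
    return [row for row in latest.values() if not row.get("_deleted")]


changes = [
    {"id": 1, "name": "Ada", "_operation": "INSERT", "_deleted": False, "_ts_ms": 1},
    {"id": 2, "name": "Bob", "_operation": "INSERT", "_deleted": False, "_ts_ms": 1},
    {"id": 1, "name": "Ada L.", "_operation": "UPDATE", "_deleted": False, "_ts_ms": 2},
    {"id": 2, "name": "Bob", "_operation": "DELETE", "_deleted": True, "_ts_ms": 3},
]
print(latest_active_rows(changes))
```

Here only the updated row for `id` 1 survives, which is the same result a native upsert and delete would have produced directly.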

Troubleshooting Comparison

Common Issues

Issue | MSK Connect | Firehose
High latency | Check commit interval, worker capacity | Check buffer settings, Lambda duration
Data loss | Check connector state, Kafka lag | Check Lambda errors, delivery failures
Schema errors | Auto-resolves with schema evolution | Update Lambda transformation
Cost overrun | Reduce workers, optimize commit | Optimize buffer, reduce Lambda memory
Duplicate data | Check exactly-once config | Expected - implement deduplication

Monitoring Comparison

MSK Connect Monitoring:

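-- Check connector health
-- (msk_connect_metrics is assumed to be a custom table populated from
-- exported connector metrics; it is not a built-in AWS table)
SELECT
  connector_name,
  state,
  worker_count,
  last_commit_time
FROM msk_connect_metrics;

-- Monitor lag
-- (kafka_consumer_lag is likewise assumed to be a custom table)
SELECT
  topic,
  partition,
  current_offset,
  log_end_offset,
  lag
FROM kafka_consumer_lag;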
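
The lag column in the query above is simply the distance between the log end offset and the consumer's committed offset. A small sketch of that calculation and of an alarm-style threshold check follows; the function names, sample offsets, and threshold are illustrative assumptions.

```python
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]


def partition_lag(current_offset: int, log_end_offset: int) -> int:
    """Lag is how far the committed offset trails the end of the log."""
    return max(log_end_offset - current_offset, 0)


def lagging_partitions(
    offsets: Dict[TopicPartition, Tuple[int, int]], threshold: int
) -> List[TopicPartition]:
    """Return partitions whose lag exceeds the threshold, sorted for stable output."""
    return sorted(
        tp
        for tp, (current, end) in offsets.items()
        if partition_lag(current, end) > threshold
    )


offsets = {
    ("cdc.public.orders", 0): (1_000, 1_050),    # (current_offset, log_end_offset)
    ("cdc.public.orders", 1): (1_000, 25_000),
}
print(lagging_partitions(offsets, threshold=10_000))  # [('cdc.public.orders', 1)]
```

A sustained non-zero result from a check like this is usually the first sign that the connector needs more capacity or a shorter commit interval.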

Firehose Monitoring:

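# Check delivery success rate (example one-hour window; adjust the timestamps)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Firehose \
  --metric-name DeliveryToS3.Success \
  --dimensions Name=DeliveryStreamName,Value=cdc-customers-to-iceberg \
  --start-time 2026-03-20T00:00:00Z --end-time 2026-03-20T01:00:00Z \
  --period 300 --statistics Average

# Check data freshness (age in seconds of the oldest undelivered record)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Firehose \
  --metric-name DeliveryToS3.DataFreshness \
  --dimensions Name=DeliveryStreamName,Value=cdc-customers-to-iceberg \
  --start-time 2026-03-20T00:00:00Z --end-time 2026-03-20T01:00:00Z \
  --period 300 --statistics Maximum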

Conclusion

Both approaches are viable for streaming CDC data to S3 Iceberg, but they serve different use cases:

MSK Connect: Performance & Features

  • ✅ Best for real-time, high-throughput, complex CDC
  • ⚠️ Higher cost, more operational complexity
  • 🎯 Use for: Critical business tables, real-time analytics

Firehose: Simplicity & Cost

  • ✅ Best for cost-sensitive, moderate throughput, simple CDC
  • ⚠️ Higher latency, limited CDC operations
  • 🎯 Use for: Reference data, append-mostly tables, logs

Our Recommendation:

Start with a hybrid approach:

  1. Use Firehose as the default (86-98% cost savings in the scenarios above)
  2. Migrate to MSK Connect only when you need:
    • Real-time latency (< 1 minute)
    • High throughput (> 5 MB/sec)
    • Native upserts/deletes
    • Exactly-once semantics

This strategy optimizes both cost and performance, giving you the best of both worlds.

Next Steps

  1. Assess your requirements: Latency, throughput, CDC complexity
  2. Start with Firehose: For most tables (cost-effective)
  3. Identify critical tables: That need MSK Connect
  4. Implement monitoring: For both approaches
  5. Optimize continuously: Based on usage patterns
