Friday, March 20, 2026

Blog Post Series: Kafka to S3 Iceberg Data Lake

 

Overview

This series explains how to stream CDC data from Kafka into Iceberg tables on S3. It compares two approaches: MSK Connect running the Iceberg Kafka Connect sink, and Amazon Kinesis Data Firehose.

Blog Posts Created

Blog Post #5: Building a Real-Time Data Lake with MSK Connect + Iceberg

Link: https://www.dbaglobe.com/2026/03/building-real-time-data-lake-kafka-to.html

Implementation: Orders table using MSK Connect

Topics Covered:

  • Apache Iceberg benefits and features
  • Creating Iceberg Kafka Connect custom plugin
  • AWS Glue Catalog setup
  • IAM permissions configuration
  • Connector configuration and deployment
  • Verification and monitoring
  • Advanced features (time travel, schema evolution, partition evolution)
  • Performance optimization
  • Troubleshooting
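
For orientation, the connector configuration step might look roughly like the sketch below. It is not copied from Post #5. The property names are assumed from the Apache Iceberg Kafka Connect sink and the Glue catalog integration, and the topic, table, and bucket names are hypothetical placeholders, so verify every key against the documentation for the plugin version you deploy.

```python
def build_iceberg_sink_config(topic, table, warehouse, commit_interval_ms=300_000):
    """Build a connector configuration map; Kafka Connect expects string values."""
    return {
        # Assumed class name for the Apache Iceberg sink; other builds use a different package.
        "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
        "tasks.max": "2",
        "topics": topic,
        "iceberg.tables": table,
        # Glue Data Catalog as the Iceberg catalog, with data files on S3.
        "iceberg.catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "iceberg.catalog.warehouse": warehouse,
        "iceberg.catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # Let the sink add new columns when the source schema changes.
        "iceberg.tables.evolve-schema-enabled": "true",
        # How often the sink commits a new Iceberg snapshot.
        "iceberg.control.commit.interval-ms": str(commit_interval_ms),
    }


# Hypothetical names, for illustration only.
config = build_iceberg_sink_config(
    topic="cdc.public.orders",
    table="datalake.orders",
    warehouse="s3://example-bucket/warehouse",
)
```

Keeping the configuration in a small function like this makes it easy to generate one connector definition per table while changing only the topic and table names.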

Key Results:

  • ✅ Latency: 15-30 seconds
  • ✅ Throughput: 8 MB/sec sustained
  • ✅ Native CDC operations (INSERT/UPDATE/DELETE)
  • ✅ Automatic schema evolution
  • ⚠️ Cost: ~$320/month

Best For:

  • Real-time analytics (< 1 minute latency)
  • High throughput (> 5 MB/sec)
  • Complex CDC operations
  • Exactly-once semantics

Blog Post #6: Streaming CDC Data to S3 Iceberg with Kinesis Firehose

Link: https://www.dbaglobe.com/2026/03/streaming-kafka-cdc-to-s3-iceberg-with.html

Implementation: Customers table using Firehose

Topics Covered:

  • Kinesis Data Firehose benefits
  • Lambda transformation for CDC events
  • IAM permissions setup
  • Firehose delivery stream configuration
  • Soft delete pattern implementation
  • Periodic compaction strategy
  • Monitoring and optimization
  • Cost analysis
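
To make the Lambda transformation and soft delete topics concrete, here is a minimal sketch of such a function. It assumes Debezium's standard envelope (`op`, `before`, `after`, `ts_ms`). The output column names `cdc_op`, `is_deleted`, and `cdc_ts_ms` are hypothetical. Echoing the payload back under the same field name it arrived in (`kafkaRecordValue` for MSK sources, `data` otherwise) is an assumption to check against the Firehose documentation. It is not the code from Post #6.

```python
import base64
import json

OP_NAMES = {"c": "INSERT", "r": "INSERT", "u": "UPDATE", "d": "DELETE"}


def flatten_cdc_event(event):
    """Turn one Debezium change event into a flat row with soft-delete metadata."""
    # Unwrap the envelope when the JSON converter embeds schemas.
    envelope = event["payload"] if isinstance(event.get("payload"), dict) else event
    op = envelope.get("op")
    if op not in OP_NAMES:
        return None
    # A delete has no "after" image, so keep the last known values from "before".
    image = envelope.get("before") if op == "d" else envelope.get("after")
    if image is None:
        return None
    row = dict(image)
    row["cdc_op"] = OP_NAMES[op]
    row["is_deleted"] = op == "d"
    row["cdc_ts_ms"] = envelope.get("ts_ms")
    return row


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Assumption: MSK-sourced streams use "kafkaRecordValue"; others use "data".
        key = "kafkaRecordValue" if "kafkaRecordValue" in record else "data"
        try:
            change = json.loads(base64.b64decode(record[key]))
        except (KeyError, TypeError, ValueError):
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed", key: record.get(key)})
            continue
        row = flatten_cdc_event(change) if isinstance(change, dict) else None
        if row is None:
            # Tombstones and unknown operations are dropped, not failed.
            output.append({"recordId": record["recordId"],
                           "result": "Dropped", key: record[key]})
            continue
        encoded = base64.b64encode((json.dumps(row) + "\n").encode("utf-8"))
        output.append({"recordId": record["recordId"],
                       "result": "Ok", key: encoded.decode("utf-8")})
    return {"records": output}
```

Because a delete becomes an ordinary row flagged `is_deleted`, the table keeps growing until a compaction job removes those rows, which is why soft deletes and periodic compaction appear together in the topic list.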

Key Results:

  • ✅ Latency: 5-7 minutes
  • ✅ Throughput: 2 MB/sec
  • ✅ Cost: ~$6/month (98% cheaper than MSK Connect)
  • ⚠️ Soft deletes require compaction
  • ⚠️ Schema changes need Lambda updates

Best For:

  • Cost-sensitive projects
  • Serverless architecture
  • Moderate throughput (< 5 MB/sec)
  • Simple CDC patterns (mostly inserts)

Blog Post #7: MSK Connect vs Firehose - Comprehensive Comparison

Link: https://www.dbaglobe.com/2026/03/kafka-to-s3-iceberg-msk-connect-vs.html

Topics Covered:

  • Detailed feature comparison
  • Setup complexity analysis
  • Performance benchmarks
  • Cost comparison (100 GB and 1 TB scenarios)
  • CDC operation support
  • Use case suitability
  • Real-world implementation results
  • Decision matrix
  • Hybrid approach recommendation
  • Migration paths
  • Best practices for both approaches

Key Findings:

Cost Comparison (100 GB/month):

  • MSK Connect: $324/month
  • Firehose: $6/month
  • Savings: 98%

Latency Comparison:

  • MSK Connect: 10-30 seconds
  • Firehose: 1-5 minutes

Recommendation: Hybrid approach

  • Use MSK Connect for critical, high-frequency tables
  • Use Firehose for reference data and append-mostly tables

Complete Architecture

┌─────────────────────┐
│   PostgreSQL        │
│   (RDS/Aurora)      │
└──────────┬──────────┘
           │
           │ Logical Replication
           ▼
┌─────────────────────┐
│   Debezium CDC      │
│   (MSK Connect)     │
└──────────┬──────────┘
           │
           │ CDC Events
           ▼
┌─────────────────────┐
│   Kafka Topics      │
│   (Amazon MSK)      │
└──────────┬──────────┘
           │
           ├─────────────────────┐
           │                     │
           ▼                     ▼
┌──────────────────┐   ┌──────────────────┐
│  Iceberg Sink    │   │  Kinesis         │
│  (MSK Connect)   │   │  Firehose        │
│                  │   │  + Lambda        │
│  • orders        │   │  • customers     │
│  • transactions  │   │  • products      │
└────────┬─────────┘   └────────┬─────────┘
         │                      │
         └──────────┬───────────┘
                    │
                    ▼
         ┌─────────────────────┐
         │  S3 Iceberg Tables  │
         │  (AWS Glue Catalog) │
         └──────────┬──────────┘
                    │
                    ▼
         ┌─────────────────────┐
         │  Query Engines      │
         │  • Athena           │
         │  • Spark            │
         │  • Trino            │
         └─────────────────────┘

Comparison Summary

Setup Complexity

Aspect         | MSK Connect | Firehose   | Winner
Setup Time     | 45 minutes  | 15 minutes | ✅ Firehose
Configuration  | Complex     | Simple     | ✅ Firehose
Learning Curve | Steep       | Moderate   | ✅ Firehose

Performance

Metric     | MSK Connect | Firehose | Winner
Latency    | 10-30 sec   | 1-5 min  | ✅ MSK Connect
Throughput | Unlimited   | 5 MB/sec | ✅ MSK Connect
Real-time  | Yes         | No       | ✅ MSK Connect

Cost (100 GB/month)

Component | MSK Connect | Firehose | Savings
Total     | $324        | $6       | 98%

CDC Operations

Operation | MSK Connect   | Firehose    | Winner
INSERT    | Native        | Via Lambda  | ✅ MSK Connect
UPDATE    | Native upsert | Soft update | ✅ MSK Connect
DELETE    | Native delete | Soft delete | ✅ MSK Connect

Decision Framework

Use MSK Connect When:

✅ Real-time latency required (< 1 minute)
✅ High throughput needed (> 5 MB/sec)
✅ Complex CDC operations (native upserts/deletes)
✅ Exactly-once semantics required
✅ Automatic schema evolution needed

Example Tables: orders, transactions, inventory, real-time events

Use Firehose When:

✅ Cost is primary concern
✅ Serverless architecture preferred
✅ Moderate throughput (< 5 MB/sec)
✅ Simple CDC patterns (mostly inserts)
✅ Latency of 1-5 minutes acceptable

Example Tables: customers, products, categories, audit_logs

Hybrid Approach (Recommended)

Strategy: Use both approaches, choosing per table based on its characteristics

Implementation:

  1. Default to Firehose for cost savings (98% cheaper)
  2. Migrate to MSK Connect only when needed:
    • Real-time requirements
    • High throughput
    • Complex CDC operations

Benefits:

  • Optimize cost (Firehose where possible)
  • Maintain performance (MSK Connect where needed)
  • Reduce operational complexity
  • Flexibility per table
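
The per-table choice described in this framework can be sketched as a small function. The thresholds (1 minute, 5 MB/sec) come from the criteria above. The `TableProfile` type and its field names are hypothetical, and real decisions will weigh more factors than these four.

```python
from dataclasses import dataclass


@dataclass
class TableProfile:
    name: str
    max_latency_seconds: float      # how stale the lake copy may be
    throughput_mb_per_sec: float    # sustained change volume
    needs_native_upserts_or_deletes: bool = False
    needs_exactly_once: bool = False


def choose_pipeline(table: TableProfile) -> str:
    """Default to Firehose; move to MSK Connect only when a requirement demands it."""
    if (
        table.max_latency_seconds < 60
        or table.throughput_mb_per_sec > 5
        or table.needs_native_upserts_or_deletes
        or table.needs_exactly_once
    ):
        return "MSK Connect"
    return "Firehose"


orders = TableProfile("orders", max_latency_seconds=30, throughput_mb_per_sec=8,
                      needs_native_upserts_or_deletes=True)
customers = TableProfile("customers", max_latency_seconds=300, throughput_mb_per_sec=2)

for table in (orders, customers):
    print(f"{table.name}: {choose_pipeline(table)}")
```

Writing the rule down this way keeps the default (Firehose) explicit and forces each exception to be justified by a named requirement.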

Real-World Results

Orders Table (MSK Connect)

Configuration: 2 MCU, 2 workers, 5-min commit
Results:
  - Latency: 15-30 seconds
  - Throughput: 8 MB/sec
  - Native CDC: ✅
  - Cost: $320/month
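
As a rough sanity check on the monthly cost, the sketch below multiplies out the worker capacity. It assumes that "2 MCU, 2 workers" means 2 MCUs per worker, and it assumes a price of about $0.11 per MCU-hour. That price is not stated in this series, so check current MSK Connect pricing for your region.

```python
HOURS_PER_MONTH = 730            # average number of hours in a month
ASSUMED_USD_PER_MCU_HOUR = 0.11  # assumption, not a figure from the posts


def msk_connect_monthly_cost(mcu_per_worker, workers,
                             usd_per_mcu_hour=ASSUMED_USD_PER_MCU_HOUR,
                             hours=HOURS_PER_MONTH):
    """Estimate the monthly compute cost of an always-on connector."""
    return mcu_per_worker * workers * usd_per_mcu_hour * hours


estimate = msk_connect_monthly_cost(mcu_per_worker=2, workers=2)
print(f"Estimated MSK Connect compute cost: ${estimate:.0f}/month")
```

Under those assumptions the estimate lands close to the figure reported above. That suggests the cost is dominated by provisioned capacity rather than data volume.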

Customers Table (Firehose)

Configuration: 128 MB buffer, 5-min interval, Lambda 256 MB
Results:
  - Latency: 5-7 minutes
  - Throughput: 2 MB/sec
  - Soft deletes: ⚠️ (requires compaction)
  - Cost: $6/month

Key Takeaways

  1. Cost vs Performance Trade-off:

    • Firehose: 98% cheaper at 100 GB/month, but roughly 10x higher latency
    • MSK Connect: Near real-time, but about 54x more expensive ($324 vs $6 at 100 GB/month)
  2. CDC Operation Support:

    • MSK Connect: Native upserts/deletes
    • Firehose: Soft deletes + compaction jobs
  3. Operational Complexity:

    • Firehose: Fully managed, serverless
    • MSK Connect: Requires worker management
  4. Hybrid Approach Best:

    • Use Firehose as default
    • Use MSK Connect for critical tables
    • Optimize cost/performance balance

Migration Paths

Starting with Firehose:

  1. Implement all tables with Firehose
  2. Monitor latency and CDC requirements
  3. Migrate high-priority tables to MSK Connect
  4. Keep low-priority tables on Firehose

Starting with MSK Connect:

  1. Implement critical tables with MSK Connect
  2. Monitor costs and usage patterns
  3. Migrate low-priority tables to Firehose
  4. Optimize cost/performance balance

Best Practices

For MSK Connect:

  • Right-size workers (start with 1 MCU × 1 worker)
  • Tune commit interval (5-10 minutes)
  • Monitor consumer lag
  • Use date partitioning
  • Enable Iceberg compaction

For Firehose:

  • Optimize buffer (3-5 minutes)
  • Keep Lambda transformation simple
  • Implement soft delete pattern
  • Schedule periodic compaction
  • Monitor error prefix in S3
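
To illustrate what a periodic compaction job has to compute for soft-deleted data, here is a minimal in-memory sketch. It assumes that updates and deletes arrive as appended row versions carrying the hypothetical columns `id`, `cdc_ts_ms`, and `is_deleted`. A production job would express the same logic in SQL or Spark against the Iceberg table rather than in plain Python.

```python
def compact(rows, key="id", ts="cdc_ts_ms"):
    """Keep the newest version of each key and drop keys whose newest version is deleted."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # Later timestamps win; ties go to the row seen last.
        if current is None or row[ts] >= current[ts]:
            latest[row[key]] = row
    return [row for row in latest.values() if not row.get("is_deleted", False)]


history = [
    {"id": 1, "name": "a", "cdc_ts_ms": 1, "is_deleted": False},
    {"id": 1, "name": "b", "cdc_ts_ms": 2, "is_deleted": False},  # soft update
    {"id": 2, "name": "c", "cdc_ts_ms": 1, "is_deleted": False},
    {"id": 2, "name": "c", "cdc_ts_ms": 3, "is_deleted": True},   # soft delete
]
print(compact(history))
```

Until such a job runs, queries must apply the same "latest version, not deleted" filter themselves, which is the main operational cost of the soft delete pattern.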

Cost Optimization

At Different Scales:

100 GB/month:

  • MSK Connect: $324
  • Firehose: $6
  • Savings: $318 (98%)

1 TB/month:

  • MSK Connect: $373
  • Firehose: $52
  • Savings: $321 (86%)

10 TB/month:

  • MSK Connect: $500 (scale workers)
  • Firehose: $290
  • Savings: $210 (42%)

Recommendation: Firehose's cost advantage narrows as volume grows, from 98% savings at 100 GB/month to 42% at 10 TB/month. Re-evaluate the choice for workloads above 10 TB/month.
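
The savings figures above follow directly from the monthly totals, as this short sketch shows. It only recomputes the numbers listed in this section and introduces no new pricing data.

```python
# Monthly totals in USD from the scenarios above: (MSK Connect, Firehose).
SCENARIOS = {
    "100 GB/month": (324, 6),
    "1 TB/month": (373, 52),
    "10 TB/month": (500, 290),
}


def savings(msk_connect_cost, firehose_cost):
    """Return absolute savings and savings as a whole-number percentage."""
    saved = msk_connect_cost - firehose_cost
    return saved, round(100 * saved / msk_connect_cost)


for scale, (msk_cost, firehose_cost) in SCENARIOS.items():
    saved, percent = savings(msk_cost, firehose_cost)
    print(f"{scale}: save ${saved} ({percent}%)")
```

The absolute savings shrink from $318 to $210 while the Firehose bill itself grows with volume, so the percentage falls much faster than the dollar amount.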

Target Audience

  • Data Engineers: Building data lake pipelines
  • Solution Architects: Designing CDC architectures
  • DevOps Engineers: Operating data infrastructure
  • Data Platform Teams: Choosing technologies

Prerequisites

Readers should have:

  • Completed Blog Posts #1-4 (Debezium CDC setup)
  • An understanding of Kafka and CDC concepts
  • AWS experience (MSK, S3, Glue, Lambda)
  • Basic SQL and Python knowledge

Conclusion

This blog post series provides a complete, production-ready guide to streaming CDC data from Kafka to S3 Iceberg tables. By implementing both MSK Connect and Firehose approaches, we provide real-world comparison data to help readers make informed decisions.

Key Insight: There's no one-size-fits-all solution. The hybrid approach—using Firehose as the default and MSK Connect for critical tables—provides the optimal balance of cost, performance, and operational simplicity.

These posts will help readers:

  • Understand both approaches deeply
  • Make data-driven technology choices
  • Implement production-ready solutions
  • Optimize cost and performance
  • Avoid common pitfalls
