Overview
This series covers streaming CDC data from Kafka to Iceberg tables on S3 and compares two approaches: MSK Connect with the Iceberg Kafka Connect sink, and Amazon Kinesis Data Firehose.
Blog Posts Created
Blog Post #5: Building a Real-Time Data Lake with MSK Connect + Iceberg
Link: https://www.dbaglobe.com/2026/03/building-real-time-data-lake-kafka-to.html
Implementation: Orders table using MSK Connect
Topics Covered:
- Apache Iceberg benefits and features
- Creating Iceberg Kafka Connect custom plugin
- AWS Glue Catalog setup
- IAM permissions configuration
- Connector configuration and deployment
- Verification and monitoring
- Advanced features (time travel, schema evolution, partition evolution)
- Performance optimization
- Troubleshooting
Key Results:
- ✅ Latency: 15-30 seconds
- ✅ Throughput: 8 MB/sec sustained
- ✅ Native CDC operations (INSERT/UPDATE/DELETE)
- ✅ Automatic schema evolution
- ⚠️ Cost: ~$320/month
Best For:
- Real-time analytics (< 1 minute latency)
- High throughput (> 5 MB/sec)
- Complex CDC operations
- Exactly-once semantics
Blog Post #6: Streaming CDC Data to S3 Iceberg with Kinesis Firehose
Link: https://www.dbaglobe.com/2026/03/streaming-kafka-cdc-to-s3-iceberg-with.html
Implementation: Customers table using Firehose
Topics Covered:
- Kinesis Data Firehose benefits
- Lambda transformation for CDC events
- IAM permissions setup
- Firehose delivery stream configuration
- Soft delete pattern implementation
- Periodic compaction strategy
- Monitoring and optimization
- Cost analysis
Key Results:
- ✅ Latency: 5-7 minutes
- ✅ Throughput: 2 MB/sec
- ✅ Cost: ~$6/month (98% cheaper than MSK Connect)
- ⚠️ Soft deletes require compaction
- ⚠️ Schema changes need Lambda updates
Best For:
- Cost-sensitive projects
- Serverless architecture
- Moderate throughput (< 5 MB/sec)
- Simple CDC patterns (mostly inserts)
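The Lambda transformation and soft delete pattern listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the post: the output field names (`cdc_operation`, `is_deleted`, `cdc_ts_ms`) are hypothetical, the input is assumed to be a Debezium JSON envelope (`op`, `before`, `after`, `ts_ms`), and the record value is assumed to arrive base64-encoded under `data` or `kafkaRecordValue`.

```python
import base64
import json
from typing import Optional

# Debezium op codes: c = create, u = update, d = delete, r = snapshot read.
OP_NAMES = {"c": "INSERT", "u": "UPDATE", "d": "DELETE", "r": "INSERT"}


def flatten_cdc_event(event) -> Optional[dict]:
    """Turn one Debezium change event into a flat row with a soft-delete flag."""
    if not isinstance(event, dict):
        return None  # e.g. a null tombstone value
    # Unwrap "payload" when the JSON converter was configured to include schemas.
    payload = event.get("payload") or event
    op = payload.get("op")
    if op not in OP_NAMES:
        return None  # not a row change (heartbeat, schema change, ...)
    # A delete has no "after" image, so keep the last known row from "before".
    row = payload.get("before") if op == "d" else payload.get("after")
    if row is None:
        return None
    return {
        **row,
        "cdc_operation": OP_NAMES[op],   # hypothetical column name
        "is_deleted": op == "d",         # hypothetical soft-delete flag
        "cdc_ts_ms": payload.get("ts_ms"),
    }


def lambda_handler(event, context):
    """Firehose transformation entry point: one output record per input recordId."""
    output = []
    for record in event["records"]:
        # Assumption: the value is base64-encoded under "data"
        # (or "kafkaRecordValue" when the stream reads from MSK).
        raw = record.get("data") or record.get("kafkaRecordValue")
        try:
            flat = flatten_cdc_event(json.loads(base64.b64decode(raw)))
        except (TypeError, ValueError):
            # Undecodable input: let Firehose route it to the error output.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed", "data": raw})
            continue
        if flat is None:
            output.append({"recordId": record["recordId"],
                           "result": "Dropped", "data": raw})
            continue
        line = (json.dumps(flat) + "\n").encode("utf-8")
        output.append({"recordId": record["recordId"], "result": "Ok",
                       "data": base64.b64encode(line).decode("utf-8")})
    return {"records": output}
```

Because a delete is written as a row with `is_deleted` set to true rather than removing anything, the table keeps growing until a compaction job removes those rows, which is the trade-off noted in the Key Results.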
Blog Post #7: MSK Connect vs Firehose - Comprehensive Comparison
Link: https://www.dbaglobe.com/2026/03/kafka-to-s3-iceberg-msk-connect-vs.html
Topics Covered:
- Detailed feature comparison
- Setup complexity analysis
- Performance benchmarks
- Cost comparison (100 GB and 1 TB scenarios)
- CDC operation support
- Use case suitability
- Real-world implementation results
- Decision matrix
- Hybrid approach recommendation
- Migration paths
- Best practices for both approaches
Key Findings:
Cost Comparison (100 GB/month):
- MSK Connect: $324/month
- Firehose: $6/month
- Savings: 98%
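Latency Comparison (general ranges; the Orders and Customers tables in Real-World Results measured 15-30 seconds and 5-7 minutes):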
- MSK Connect: 10-30 seconds
- Firehose: 1-5 minutes
Recommendation: Hybrid approach
- Use MSK Connect for critical, high-frequency tables
- Use Firehose for reference data and append-mostly tables
Complete Architecture
┌─────────────────────┐
│     PostgreSQL      │
│    (RDS/Aurora)     │
└──────────┬──────────┘
           │
           │ Logical Replication
           ▼
┌─────────────────────┐
│    Debezium CDC     │
│    (MSK Connect)    │
└──────────┬──────────┘
           │
           │ CDC Events
           ▼
┌─────────────────────┐
│    Kafka Topics     │
│    (Amazon MSK)     │
└──────────┬──────────┘
           │
           ├─────────────────────┐
           │                     │
           ▼                     ▼
  ┌──────────────────┐  ┌──────────────────┐
  │  Iceberg Sink    │  │  Kinesis         │
  │  (MSK Connect)   │  │  Firehose        │
  │                  │  │  + Lambda        │
  │  • orders        │  │  • customers     │
  │  • transactions  │  │  • products      │
  └────────┬─────────┘  └────────┬─────────┘
           │                     │
           └──────────┬──────────┘
                      │
                      ▼
           ┌─────────────────────┐
           │  S3 Iceberg Tables  │
           │  (AWS Glue Catalog) │
           └──────────┬──────────┘
                      │
                      ▼
           ┌─────────────────────┐
           │  Query Engines      │
           │  • Athena           │
           │  • Spark            │
           │  • Trino            │
           └─────────────────────┘
Comparison Summary
Setup Complexity
| Aspect | MSK Connect | Firehose | Winner |
|---|---|---|---|
| Setup Time | 45 minutes | 15 minutes | ✅ Firehose |
| Configuration | Complex | Simple | ✅ Firehose |
| Learning Curve | Steep | Moderate | ✅ Firehose |
Performance
| Metric | MSK Connect | Firehose | Winner |
|---|---|---|---|
| Latency | 10-30 sec | 1-5 min | ✅ MSK Connect |
| Throughput | Scales with workers | 5 MB/sec | ✅ MSK Connect |
| Real-time | Yes | No | ✅ MSK Connect |
Cost (100 GB/month)
| Component | MSK Connect | Firehose | Savings |
|---|---|---|---|
| Total | $324 | $6 | 98% |
CDC Operations
| Operation | MSK Connect | Firehose | Winner |
|---|---|---|---|
| INSERT | Native | Via Lambda | ✅ MSK Connect |
| UPDATE | Native upsert | Soft update | ✅ MSK Connect |
| DELETE | Native delete | Soft delete | ✅ MSK Connect |
Decision Framework
Use MSK Connect When:
✅ Real-time latency required (< 1 minute)
✅ High throughput needed (> 5 MB/sec)
✅ Complex CDC operations (native upserts/deletes)
✅ Exactly-once semantics required
✅ Automatic schema evolution needed
Example Tables: orders, transactions, inventory, real-time events
Use Firehose When:
✅ Cost is primary concern
✅ Serverless architecture preferred
✅ Moderate throughput (< 5 MB/sec)
✅ Simple CDC patterns (mostly inserts)
✅ Latency of 1-5 minutes acceptable
Example Tables: customers, products, categories, audit_logs
Hybrid Approach (Recommended)
Strategy: Use both approaches based on table characteristics
Implementation:
- Default to Firehose for cost savings (98% cheaper)
- Migrate a table to MSK Connect only when it needs real-time latency, high throughput, or complex CDC operations
Benefits:
- Optimize cost (Firehose where possible)
- Maintain performance (MSK Connect where needed)
- Reduce operational complexity
- Flexibility per table
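The per-table decision described above can be expressed as a small rule. The sketch below is only an illustration: `TableProfile` and `choose_pipeline` are hypothetical names, and the thresholds (60 seconds, 5 MB/sec) are taken from the decision framework in this summary rather than from any AWS limit.

```python
from dataclasses import dataclass


@dataclass
class TableProfile:
    """Hypothetical description of one CDC table's requirements."""
    name: str
    max_latency_seconds: float          # how fresh consumers need the data
    peak_throughput_mb_per_sec: float
    needs_native_upsert_delete: bool = False
    needs_exactly_once: bool = False


def choose_pipeline(table: TableProfile) -> str:
    """Default to Firehose; pick MSK Connect only when a criterion demands it."""
    needs_msk_connect = (
        table.max_latency_seconds < 60            # real-time latency required
        or table.peak_throughput_mb_per_sec > 5   # high throughput
        or table.needs_native_upsert_delete       # complex CDC operations
        or table.needs_exactly_once
    )
    return "msk-connect" if needs_msk_connect else "firehose"


if __name__ == "__main__":
    tables = [
        TableProfile("orders", 30, 8, needs_native_upsert_delete=True),
        TableProfile("customers", 600, 2),
    ]
    for t in tables:
        print(f"{t.name}: {choose_pipeline(t)}")
```

Keeping the rule in one function makes it easy to re-run when a table's requirements change, which is when a migration between the two pipelines would be considered.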
Real-World Results
Orders Table (MSK Connect)
Configuration: 2 MCU, 2 workers, 5-min commit
Results:
- Latency: 15-30 seconds
- Throughput: 8 MB/sec
- Native CDC: ✅
- Cost: $320/month
Customers Table (Firehose)
Configuration: 128 MB buffer, 5-min interval, Lambda 256 MB
Results:
- Latency: 5-7 minutes
- Throughput: 2 MB/sec
- Soft deletes: ⚠️ (requires compaction)
- Cost: $6/month
Key Takeaways
Cost vs Performance Trade-off:
- Firehose: about 98% cheaper, but with roughly 10x higher latency (minutes instead of seconds)
- MSK Connect: real-time, but roughly 50x more expensive ($324 vs $6 per month at 100 GB)
CDC Operation Support:
- MSK Connect: Native upserts/deletes
- Firehose: Soft deletes + compaction jobs
Operational Complexity:
- Firehose: Fully managed, serverless
- MSK Connect: Requires worker management
Hybrid Approach Best:
- Use Firehose as default
- Use MSK Connect for critical tables
- Optimize cost/performance balance
Migration Paths
Starting with Firehose:
- Implement all tables with Firehose
- Monitor latency and CDC requirements
- Migrate high-priority tables to MSK Connect
- Keep low-priority tables on Firehose
Starting with MSK Connect:
- Implement critical tables with MSK Connect
- Monitor costs and usage patterns
- Migrate low-priority tables to Firehose
- Optimize cost/performance balance
Best Practices
For MSK Connect:
- Right-size workers (start with 1 MCU × 1 worker)
- Tune commit interval (5-10 minutes)
- Monitor consumer lag
- Use date partitioning
- Enable Iceberg compaction
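For the consumer lag item, the quantity being monitored is the gap between the newest offset in each partition and the offset the sink has committed. A minimal sketch of that arithmetic follows; `consumer_lag` is a hypothetical helper, and in practice the two offset maps would come from Kafka tooling or MSK metrics rather than hand-written dictionaries.

```python
def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: latest offset in the log minus the committed offset.

    A partition with no committed offset is treated as fully unread,
    which is a simplification (it ignores the log start offset).
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


if __name__ == "__main__":
    lag = consumer_lag({0: 100, 1: 50}, {0: 90})
    print(lag, "total:", sum(lag.values()))
```

Lag that keeps growing between commits suggests the connector needs more workers or a different commit interval.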
For Firehose:
- Tune the buffer interval (3-5 minutes)
- Keep Lambda transformation simple
- Implement soft delete pattern
- Schedule periodic compaction
- Monitor error prefix in S3
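One way to read the soft delete and periodic compaction items together: the table accumulates every change as a row, and a scheduled job collapses it to the latest version of each key and drops keys whose latest version is a delete. The pure-Python sketch below shows only that logic. The column names (`id`, `cdc_ts_ms`, `is_deleted`) are assumptions, and a real job would run as a query or Spark job against the Iceberg table rather than over an in-memory list.

```python
def compact_rows(rows, key="id", ts_field="cdc_ts_ms", deleted_field="is_deleted"):
    """Keep the newest row per key, then drop keys whose newest row is a soft delete."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # On equal timestamps the later row in the input wins.
        if current is None or row[ts_field] >= current[ts_field]:
            latest[row[key]] = row
    return [row for row in latest.values() if not row[deleted_field]]


if __name__ == "__main__":
    changes = [
        {"id": 1, "name": "alice", "cdc_ts_ms": 1, "is_deleted": False},
        {"id": 1, "name": "alice b.", "cdc_ts_ms": 2, "is_deleted": False},
        {"id": 2, "name": "bob", "cdc_ts_ms": 1, "is_deleted": False},
        {"id": 2, "name": "bob", "cdc_ts_ms": 3, "is_deleted": True},
    ]
    print(compact_rows(changes))
```

Until such a job runs, queries need to apply the same "latest row, not deleted" filter themselves to see a correct view of the table.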
Cost Optimization
At Different Scales:
100 GB/month:
- MSK Connect: $324
- Firehose: $6
- Savings: $318 (98%)
1 TB/month:
- MSK Connect: $373
- Firehose: $52
- Savings: $321 (86%)
10 TB/month:
- MSK Connect: $500 (scale workers)
- Firehose: $290
- Savings: $210 (42%)
Recommendation: Firehose's cost advantage narrows as volume grows, from 98% savings at 100 GB/month to 42% at 10 TB/month, so re-evaluate the choice at very high scale (> 10 TB/month)
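The savings figures at each scale follow directly from the monthly costs listed above. The short sketch below reproduces that arithmetic; the dollar amounts are the ones quoted in this summary, and `savings` is just an illustrative helper.

```python
# Monthly cost in USD at each data volume: (MSK Connect, Firehose),
# copied from the figures quoted in this summary.
MONTHLY_COST_USD = {
    "100 GB": (324, 6),
    "1 TB": (373, 52),
    "10 TB": (500, 290),
}


def savings(msk_connect_cost, firehose_cost):
    """Return (absolute saving, saving as a whole-number percent of the MSK Connect cost)."""
    saved = msk_connect_cost - firehose_cost
    return saved, round(100 * saved / msk_connect_cost)


if __name__ == "__main__":
    for volume, (msk, firehose) in MONTHLY_COST_USD.items():
        saved, pct = savings(msk, firehose)
        print(f"{volume}/month: save ${saved} ({pct}%), "
              f"MSK Connect costs {msk / firehose:.1f}x Firehose")
```

The cost ratio drops from about 54x at 100 GB/month to under 2x at 10 TB/month, which is why the gap matters less as volume grows.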
Target Audience
- Data Engineers: Building data lake pipelines
- Solution Architects: Designing CDC architectures
- DevOps Engineers: Operating data infrastructure
- Data Platform Teams: Choosing technologies
Prerequisites
Readers should have:
- Completed Blog Posts #1-4 (Debezium CDC setup)
- An understanding of Kafka and CDC concepts
- AWS experience (MSK, S3, Glue, Lambda)
- Basic SQL and Python knowledge
Conclusion
This blog post series provides a complete, production-ready guide to streaming CDC data from Kafka to S3 Iceberg tables. By implementing both MSK Connect and Firehose approaches, we provide real-world comparison data to help readers make informed decisions.
Key Insight: There's no one-size-fits-all solution. The hybrid approach, with Firehose as the default and MSK Connect for critical tables, balances cost, performance, and operational simplicity.
These posts will help readers:
- Understand both approaches deeply
- Make data-driven technology choices
- Implement production-ready solutions
- Optimize cost and performance
- Avoid common pitfalls