- Educational Mission
- Why This Repository?
- Learning Approach
- Architecture
- Core Stack
- Lab Structure
- Sample Database
- Quick Start
- Requirements
- Configuration
- Documentation
- Vendor Independence
- Contributing
- Community and Learning
- Related Practice Repositories
- License
A comprehensive, vendor-independent Apache Iceberg learning environment designed for developers, data engineers, and students who want to master modern data lakehouse concepts through hands-on practice.
12 progressive labs with 100+ exercises. Completely free and open source. Built for learners, by learners.
This educational resource fills the gap between theoretical knowledge and practical skills in Apache Iceberg and data lakehouse technologies:
- Learn by Doing: Progressive hands-on labs build real skills
- Vendor Independent: Master concepts that apply across all platforms
- Production Patterns: Learn best practices used in real data engineering
- Multi-Engine Experience: Work with Spark, Trino, DuckDB, and more
- Community Driven: Built and improved by the data engineering community
Our labs are designed to build knowledge progressively:
- Beginner (Labs 0-2): Foundation and basic operations
- Intermediate (Labs 3-5): Advanced features and optimization
- Advanced (Labs 6-11): Production patterns and multi-engine architecture
Each lab includes:
- Clear Learning Objectives: Know what you'll achieve
- Step-by-Step Instructions: Guided exercises
- Real-World Scenarios: Practical use cases
- Solution Notebooks: Reference implementations
- Conceptual Guides: Deep-dive explanations
Gain experience with different query engines:
- Apache Spark: Data processing and ETL
- Trino: Interactive SQL analytics
- DuckDB: Local analytics and testing
- Kafka + Debezium: Real-time streaming and CDC
┌─────────────────────────────────────────────────────────────┐
│                    Iceberg Code Practice                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Apache Polaris (Iceberg REST Catalog)               │   │
│  │  Vendor-independent catalog service                  │   │
│  └──────────────────────────────────────────────────────┘   │
│                             ↓                               │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Query Engines (Multi-Engine Lakehouse)              │   │
│  │  - Apache Spark (OSS) with Iceberg                   │   │
│  │  - Trino (Interactive SQL)                           │   │
│  │  - DuckDB (Local Analytics)                          │   │
│  │  - Spark History Server (port 18080)                 │   │
│  └──────────────────────────────────────────────────────┘   │
│                             ↓                               │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Streaming & CDC Infrastructure                      │   │
│  │  - Apache Kafka (Event Streaming)                    │   │
│  │  - Debezium (Change Data Capture)                    │   │
│  │  - MySQL (CDC Source Database)                       │   │
│  │  - ZooKeeper (Kafka Coordination)                    │   │
│  └──────────────────────────────────────────────────────┘   │
│                             ↓                               │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Storage Layer (S3-Compatible)                       │   │
│  │  - ObjectScale CE (default)                          │   │
│  │  - MinIO (optional alternative)                      │   │
│  │  - s3a://spark-logs/ for History Server              │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
- Apache Polaris (Incubating): Iceberg REST catalog
- Vendor-independent catalog service
- REST API for metadata management
- ObjectScale Community Edition (Default)
- MinIO (Alternative)
- Both provide S3-compatible APIs
- Apache Spark (OSS): Data processing engine
- Trino: Interactive SQL query engine
- DuckDB: Local analytics database
- Spark History Server: UI for viewing completed jobs
- Iceberg Spark Runtime: Iceberg table operations
- Apache Kafka: Distributed event streaming platform
- Debezium: Change Data Capture for database synchronization
- MySQL: Source database for CDC
- ZooKeeper: Kafka coordination service
- k3s: Lightweight Kubernetes distribution
- Docker Compose: Alternative non-K8s setup
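To make the relationship between the catalog, the engines, and the storage layer concrete, here is a minimal sketch of the Spark properties that typically point a session at an Iceberg REST catalog backed by S3-compatible storage. The property keys are standard Iceberg settings. The catalog name `polaris`, the URI, the warehouse name, and the endpoint are placeholders chosen for illustration, not values taken from this repository, and authentication settings are omitted.

```python
def iceberg_rest_catalog_conf(catalog, uri, warehouse, s3_endpoint):
    """Return Spark properties for an Iceberg REST catalog on S3-compatible storage."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg SQL extensions such as MERGE INTO and CALL procedures
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
        # S3FileIO talks to any S3-compatible endpoint (ObjectScale, MinIO, ...)
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }


# Hypothetical values; replace them with the ones from your own setup.
conf = iceberg_rest_catalog_conf(
    catalog="polaris",
    uri="http://localhost:8181/api/catalog",
    warehouse="practice",
    s3_endpoint="http://localhost:9000",
)

for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

The printed lines can be passed to `spark-sql` or `spark-submit`. Because only the endpoint changes between storage backends, the same catalog definition works for either one.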
| Level | Labs | Time per Lab | What It Covers |
|---|---|---|---|
| Beginner | Labs 0-2 | 30-45 min | Basic setup, table operations, fundamental concepts |
| Intermediate | Labs 3-5 | 45-60 min | Advanced features, optimization patterns, real-world scenarios |
| Advanced | Labs 6-11 | 60-90 min | Performance analysis, CDC, streaming, multi-engine architecture |
Lab 0: Sample Database Setup
- Generate and load realistic business data
- Explore sample database schema and relationships
- Practice queries on sample data
- Prerequisite for all subsequent labs
Lab 1: Environment Setup
- Verify all components are running
- Test catalog connectivity
- Validate storage access
Lab 2: Basic Operations
- Create Iceberg tables
- Insert and query data
- Understand schema evolution
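Schema evolution is easier to reason about once you see that Iceberg identifies columns by a numeric field ID rather than by name. The sketch below is a toy model in plain Python, not the Iceberg API: the `Field` class and helper functions are hypothetical and only illustrate why adding or renaming a column does not require rewriting existing data files.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    type: str


def add_column(schema, name, type_):
    # Simplification: real Iceberg tracks the last assigned ID in table
    # metadata so that IDs are never reused, even after a column is dropped.
    next_id = max((f.field_id for f in schema), default=0) + 1
    return schema + [Field(next_id, name, type_)]


def rename_column(schema, old, new):
    # Only the name changes; the field ID (and the stored data) stays put.
    return [replace(f, name=new) if f.name == old else f for f in schema]


def read_row(schema, stored):
    """Project a stored row (keyed by field ID) through the current schema."""
    return {f.name: stored.get(f.field_id) for f in schema}


v1 = [Field(1, "id", "long"), Field(2, "email", "string")]
old_row = {1: 42, 2: "a@example.com"}  # written while v1 was current

v2 = rename_column(add_column(v1, "segment", "string"), "email", "contact_email")
print(read_row(v2, old_row))
# {'id': 42, 'contact_email': 'a@example.com', 'segment': None}
```

The old row is still readable under the new schema: the renamed column resolves through its unchanged ID, and the added column reads as null.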
Lab 3: Advanced Features
- Partitioning strategies
- Time travel queries
- Schema evolution with migrations
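For the partitioning topic, it helps to see what a partition transform actually computes. The sketch below reproduces the arithmetic of Iceberg's `day` and `month` transforms (days and months since 1970-01-01) in plain Python and uses it to prune a hypothetical set of data files. The file names and the `prune` helper are made up for illustration, and naive datetimes are assumed to be UTC.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def day_transform(ts):
    """Days since 1970-01-01, matching the definition of Iceberg's `day` transform."""
    return (ts.date() - EPOCH).days


def month_transform(ts):
    """Months since 1970-01, matching the definition of Iceberg's `month` transform."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


# Hypothetical data files grouped by their day partition value.
files_by_day = {
    day_transform(datetime(2024, 3, 1, 9, 0)): ["f1.parquet"],
    day_transform(datetime(2024, 3, 2, 18, 30)): ["f2.parquet", "f3.parquet"],
    day_transform(datetime(2024, 3, 5, 7, 15)): ["f4.parquet"],
}


def prune(files_by_day, start, end):
    """Return only the files whose partition can contain rows in [start, end]."""
    lo, hi = day_transform(start), day_transform(end)
    return [
        f
        for day, files in sorted(files_by_day.items())
        if lo <= day <= hi
        for f in files
    ]


print(prune(files_by_day, datetime(2024, 3, 2), datetime(2024, 3, 3, 23, 59)))
# ['f2.parquet', 'f3.parquet']
```

Because the query filters on the timestamp itself and the engine derives the partition range, users never have to reference a separate partition column. This is the idea behind hidden partitioning.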
Lab 4: Spark Optimizations
- File compaction
- Snapshot management
- Query planning optimization
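File compaction comes down to combining many small files into fewer files near a target size. The following is a conceptual sketch of bin-packing using a simple first-fit-decreasing heuristic. It is not Iceberg's implementation, and the 128 MB target and the file sizes are illustrative assumptions.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group small files into bins of at most target_mb (first-fit decreasing)."""
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already at or above the target; leave it alone
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            bins.append([size])  # no existing bin had room
    # Rewriting a single file on its own gains nothing, so drop those bins.
    return [group for group in bins if len(group) > 1]


small_files = [8, 16, 120, 64, 200, 32, 4, 60]
for group in plan_compaction(small_files):
    print(group, "->", sum(group), "MB")
# [120, 8] -> 128 MB
# [64, 60, 4] -> 128 MB
# [32, 16] -> 48 MB
```

Seven small files become three, and the 200 MB file is untouched. Fewer files mean fewer file opens and less metadata for the planner to read.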
Lab 5: Real-World Patterns
- Slowly Changing Dimensions (SCD)
- Upsert operations
- Batch and streaming patterns
- Complex Iceberg join operations
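As an illustration of the SCD pattern, here is a minimal Type 2 sketch in plain Python. On Iceberg tables this logic is usually expressed as a `MERGE INTO` statement; the version below only models the bookkeeping. The `CustomerVersion` class and its column names are hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CustomerVersion:
    customer_id: int
    segment: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(history, customer_id, new_segment, as_of):
    """Close the current version if the tracked attribute changed, then append a new one."""
    current = next(
        (v for v in history if v.customer_id == customer_id and v.valid_to is None),
        None,
    )
    if current is not None and current.segment == new_segment:
        return history  # nothing changed, so no new version is recorded
    updated = [replace(v, valid_to=as_of) if v is current else v for v in history]
    updated.append(CustomerVersion(customer_id, new_segment, as_of))
    return updated


history = [CustomerVersion(1, "bronze", date(2024, 1, 1))]
history = apply_scd2(history, 1, "gold", date(2024, 6, 1))
history = apply_scd2(history, 1, "gold", date(2024, 7, 1))  # no-op: same segment

for row in history:
    print(row)
```

After the two calls the history holds a closed `bronze` row ending on 2024-06-01 and an open `gold` row starting on the same date, so a point-in-time query can pick the version that was valid on any given day.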
Lab 6: Performance & UI
- Spark History Server UI exploration
- DAG inspection and metadata-only filtering
- Performance analysis and optimization
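Metadata-only filtering works because Iceberg manifests record per-file statistics such as row counts and lower and upper column bounds. The sketch below is a simplified model of that idea, with a hypothetical `DataFile` class and made-up file names. It shows how a range predicate can skip files without opening them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    lower: int      # minimum order_id stored in the file
    upper: int      # maximum order_id stored in the file
    row_count: int


def files_to_scan(files, lo, hi):
    """Keep only files whose [lower, upper] range overlaps the predicate [lo, hi]."""
    return [f for f in files if f.upper >= lo and f.lower <= hi]


manifest = [
    DataFile("a.parquet", 1, 1000, 1000),
    DataFile("b.parquet", 1001, 2000, 1000),
    DataFile("c.parquet", 2001, 3000, 1000),
]

selected = files_to_scan(manifest, 1500, 1600)
print([f.path for f in selected])          # ['b.parquet']

# An unfiltered row count can be answered from the statistics alone.
print(sum(f.row_count for f in manifest))  # 3000
```

When inspecting a query in the Spark History Server, the effect shows up as a scan that reads far fewer files than the table contains.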
Lab 7: Table Maintenance
- File compaction and optimization strategies
- Snapshot management and expiration
- Orphan file cleanup and storage reclamation
- Table statistics collection and analysis
- Metadata optimization
- Table migration and rollback
- Backup and restore strategies
- Monitoring and alerting setup
- Automated maintenance procedures
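A routine maintenance run can be sketched as a short list of Iceberg Spark procedures. The procedure names below (`rewrite_data_files`, `rewrite_manifests`, `expire_snapshots`, `remove_orphan_files`) are part of Iceberg's Spark integration. The catalog name `polaris`, the table name, and the retention values are assumptions for illustration, and `run_maintenance` expects a SparkSession that you create yourself.

```python
from datetime import datetime, timedelta, timezone


def maintenance_statements(catalog, table, now, keep_days=7, retain_last=5):
    """Build CALL statements for a basic maintenance pass, in a sensible order."""
    cutoff = (now - timedelta(days=keep_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # 1. Compact small data files.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # 2. Compact manifests so planning reads less metadata.
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
        # 3. Drop old snapshots, but always keep the most recent ones.
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', retain_last => {retain_last})",
        # 4. Delete files that no snapshot references any more.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}')",
    ]


def run_maintenance(spark, statements):
    """Execute each statement on an existing SparkSession (not created here)."""
    for sql in statements:
        spark.sql(sql).show(truncate=False)


now = datetime(2024, 6, 8, 12, 0, 0, tzinfo=timezone.utc)
statements = maintenance_statements("polaris", "practice.sample_orders", now)
for stmt in statements:
    print(stmt)
```

Expiring snapshots removes the ability to time travel to them, so the retention window should match how far back you need to query or roll back.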
Lab 8: Kafka Integration
- Set up Apache Kafka for real-time data streaming
- Produce and consume events with Kafka
- Integrate Spark Structured Streaming with Iceberg
- Implement real-time analytics on streaming data
- Handle exactly-once processing semantics
- Implement data quality validation
- Handle schema evolution in streaming pipelines
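Exactly-once processing is commonly built from at-least-once delivery plus a sink that ignores work it has already committed. The toy model below illustrates that idea with Kafka-style `(partition, offset)` pairs. The `IdempotentSink` class is hypothetical and stands in for the combination of checkpointed offsets and atomic table commits that a real pipeline would rely on.

```python
class IdempotentSink:
    """Commits each (partition, offset) at most once, so replays are harmless."""

    def __init__(self):
        self.rows = []
        self.committed = {}  # partition -> highest committed offset

    def write_batch(self, records):
        """records: (partition, offset, value) tuples, ordered by offset per partition."""
        accepted = 0
        for partition, offset, value in records:
            if offset <= self.committed.get(partition, -1):
                continue  # already committed by an earlier attempt
            self.rows.append(value)
            self.committed[partition] = offset
            accepted += 1
        return accepted


sink = IdempotentSink()
batch = [(0, 0, "order-1"), (0, 1, "order-2"), (1, 0, "order-3")]

print(sink.write_batch(batch))  # 3: first delivery
print(sink.write_batch(batch))  # 0: replay after a simulated failure
print(sink.rows)                # ['order-1', 'order-2', 'order-3']
```

The replayed batch changes nothing, which is the property that lets a streaming job restart from its last checkpoint without producing duplicates.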
Lab 9: Real CDC with Debezium
- Configure Debezium for MySQL CDC
- Set up MySQL for change data capture
- Create and manage Debezium connectors
- Stream CDC events to Kafka topics
- Consume CDC events with Spark Structured Streaming
- Apply CDC changes to Iceberg tables (inserts, updates, deletes)
- Handle schema evolution and data type conversions
- Monitor and troubleshoot CDC pipelines
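Applying CDC changes means interpreting each Debezium event's `op` code together with its `before` and `after` images. The sketch below applies a few hand-written events to an in-memory dictionary. The event payloads are simplified (real events carry additional `source` and timestamp fields), the `id` key is an assumed primary key, and in the lab the same logic would target an Iceberg table rather than a dict.

```python
def apply_change(table, event, key="id"):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = event["after"]
        table[row[key]] = row
    elif op == "d":            # a delete carries the old row in `before`
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return table


events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

orders = {}
for event in events:
    apply_change(orders, event)

print(orders)  # {1: {'id': 1, 'status': 'paid'}}
```

Treating inserts and updates identically makes the apply step an upsert, which keeps it safe to replay events that were already applied.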
Lab 10: Spring Boot with Iceberg
- Create Spring Boot applications with Iceberg integration
- Configure Iceberg catalog and table access
- Implement CRUD operations on Iceberg tables
- Build REST APIs for Iceberg data access
- Implement transaction handling and error management
- Optimize performance with caching and connection pooling
- Implement data validation and business logic
- Add monitoring and logging to applications
Lab 11: Multi-Engine Lakehouse
- Configure multiple query engines (Spark, Trino, DuckDB)
- Ensure schema consistency across engines
- Implement engine-specific optimizations
- Handle data type conversions between engines
- Monitor and optimize multi-engine workloads
- Implement workload isolation and resource management
- Build cross-engine ETL pipelines
- Monitor multi-engine lakehouse operations
The environment includes a comprehensive sample database with realistic e-commerce data for hands-on learning:
- sample_customers (1,000 records): Customer dimension with segmentation
- sample_products (200 records): Product catalog with categories
- sample_orders (5,000 records): Order fact table with status tracking
- sample_transactions (10,000 records): Transaction details with payment methods
- sample_events (20,000 records): Web events for user engagement analysis
# Generate and load sample data
python3 scripts/generate_sample_data.py
./scripts/load_sample_data.sh

- Sample Database Guide - Complete schema and usage documentation
- Lab 0: Sample Database Setup - Step-by-step loading and exploration
Follow our recommended learning path:
- Start with Fundamentals: Read the Iceberg Fundamentals wiki page
- Set Up Environment: Follow the Getting Started Guide
- Begin with Lab 0: Load the sample data
- Progress Through Labs: Follow the Learning Path
Kubernetes (k3s) setup:

cd iceberg-code-practice
./scripts/setup.sh
kubectl apply -f k8s/

Docker Compose setup:

cd iceberg-code-practice
cp .env.example .env
# Edit .env with your credentials
docker-compose up -d

- Docker or Podman
- k3s (for K8s setup) OR Docker Compose (for lightweight setup)
- 16GB RAM minimum (increased for multi-engine and streaming workloads)
- 40GB disk space (increased for additional components)
# Use ObjectScale (default)
export STORAGE_BACKEND=objectscale
# Use MinIO
export STORAGE_BACKEND=minio

# Spark History Server port
export SPARK_HISTORY_PORT=18080
# Event logs location
export SPARK_EVENT_LOGS=s3a://spark-logs/

Wiki Guides (Comprehensive learning materials):
- Wiki Home - Main wiki page with all guides
- Getting Started Guide - Complete setup and first steps
- Iceberg Fundamentals - Core concepts and architecture
- Lab Guides - Detailed lab walkthroughs
- Learning Path - Recommended learning sequence
- Best Practices - Production-ready patterns
- Troubleshooting - Common issues and solutions
- Setup Guide - Detailed setup instructions for K8s and Docker Compose
- Architecture Overview - System architecture and component details
- Lab Guide - Complete lab sequence and learning path
- Troubleshooting - Common issues and solutions (including ObjectScale-specific issues)
- GitHub Pages Setup - Documentation deployment guide
- Wiki Setup - Wiki contribution and maintenance guide
Deep-dive tutorials explaining the "Why" behind the "How":
- Conceptual Guide 1: Environment Architecture - Understanding the Iceberg environment architecture
- Conceptual Guide 2: Table Operations & Schema Evolution - How Iceberg handles table operations and schema evolution
- Conceptual Guide 3: Advanced Features & Performance - Partitioning, Z-ordering, compaction, and metadata-only filtering
- Conceptual Guide 4: Spark + Iceberg Optimization - Spark-Iceberg integration and optimization techniques
- Conceptual Guide 5: Real-World Data Patterns - SCD, upsert, CDC, and batch/streaming patterns
- Conceptual Guide 6: Performance Analysis & DAG Inspection - Understanding query execution and performance analysis
- Conceptual Guide 7: Table Maintenance & Operations - Compaction, snapshots, monitoring, and automation
- Conceptual Guide 8: Real-Time Data Pipelines with Kafka and CDC - Kafka integration, CDC patterns, and streaming architectures
- Conceptual Guide 9: Application Integration with Iceberg - Building applications with Iceberg, repository patterns, and transaction management
- Conceptual Guide 10: Multi-Engine Lakehouse Architecture - Multi-engine design patterns, engine selection, and resource management
- Lab 0: Sample Database Setup - Generate and load sample data
- Lab 1: Environment Setup - Component verification and first Iceberg query
- Lab 2: Basic Operations - Tables, queries, schema evolution
- Lab 3: Advanced Features - Partitioning, compaction, metadata filtering
- Lab 4: Spark Optimizations - File management, query planning
- Lab 5: Real-World Patterns - SCD, upsert, CDC, star schema
- Lab 6: Performance & UI - DAG inspection, metadata-only filtering analysis
- Lab 7: Table Maintenance - Compaction, snapshots, monitoring, automation
- Lab 8: Kafka Integration - Real-time streaming with Kafka and Iceberg
- Lab 9: Real CDC with Debezium - Change data capture with Debezium
- Lab 10: Spring Boot with Iceberg - Building applications with Iceberg
- Lab 11: Multi-Engine Lakehouse - Multi-engine architecture and optimization
Interactive Jupyter notebooks for hands-on learning:
- Lab Notebooks - Student notebooks with exercises
- Solution Helper - How to use the solution helper when stuck
Complete solution notebooks for reference and validation:
- Lab 1 Solution - Environment setup solution
- Lab 2 Solution - Basic operations solution
- Lab 3 Solution - Advanced features solution
- Lab 4 Solution - Optimizations solution
- Lab 5 Solution - Real-world patterns solution
- Lab 6 Solution - Performance & UI solution
- Lab 7 Solution - Table maintenance solution
- Lab 8 Solution - Kafka integration solution
- Lab 9 Solution - CDC with Debezium solution
- Lab 10 Solution - Spring Boot with Iceberg solution
- Lab 11 Solution - Multi-engine lakehouse solution
- Solution Helper - Python helper for accessing solutions and hints
- Validate Solutions - CI/CD validation script for solution notebooks
- Convert Labs to Notebooks - Convert Markdown labs to Jupyter notebooks
- Generate Sample Data - Generate realistic business data
- Load Sample Data - Load sample data into Iceberg
This environment is built on freely available tools, most of them under the Apache 2.0 license:
- Apache Spark (Apache 2.0)
- Apache Iceberg (Apache 2.0)
- Apache Polaris (Apache 2.0)
- Apache Kafka (Apache 2.0)
- Trino (Apache 2.0)
- DuckDB (MIT)
- Debezium (Apache 2.0)
- MySQL Community Server (GPL)
- k3s (Apache 2.0)
- MinIO (AGPL)
- ObjectScale CE (Apache 2.0)
No proprietary cloud services or consoles required.
This is a practice environment for learning. Feel free to extend labs, add examples, or improve the setup process.
Disclaimer: This is an independent educational resource for learning Apache Iceberg and data lakehouse concepts. It is not affiliated with, endorsed by, or sponsored by Apache Iceberg or any vendor.
This repository is an open educational resource built for the data engineering community. We believe in learning together and sharing knowledge.
- Comprehensive Wiki: Detailed guides and tutorials for all skill levels
- GitHub Discussions: Ask questions and share insights with fellow learners
- Issue Tracking: Report bugs and suggest improvements
- Pull Requests: Contribute labs, fixes, and enhancements
- Star the Repo: Show your support and help others discover this resource
We welcome contributions that improve the educational value:
- New Labs: Suggest new lab topics and exercises
- Better Explanations: Improve clarity of existing content
- Additional Examples: Add more practical examples
- Translation: Help translate content for global learners
- Bug Fixes: Report and fix issues in labs or documentation
See CONTRIBUTING.md for detailed contribution guidelines.
- Official Apache Iceberg Documentation: https://iceberg.apache.org/
- Apache Iceberg Slack Community: Join the conversation
- Iceberg Blog: Latest updates and articles
- Conference Talks: Learn from industry experts
Continue your learning journey with these related repositories:
- DSPy Code Practice - Declarative LLM programming
- LLM Fine-Tuning Practice - Model fine-tuning techniques
- DuckDB Code Practice - Analytics & SQL optimization
- Apache Spark Code Practice - Big data processing
- Apache Beam Code Practice - Data pipelines
- Scala Data Analysis Practice - Functional programming
- Awesome My Notes - Comprehensive technical notes and learning resources
Apache License 2.0