Apache Iceberg Code Practice

📖 Table of Contents

🎯 Educational Mission
🎓 Why This Repository?
🎓 Learning Approach
🏗️ Architecture
🛠️ Core Stack
🎓 Lab Structure
💾 Sample Database
🚀 Quick Start
📋 Requirements
🔧 Configuration
📚 Documentation
🆘 Vendor Independence
🤝 Contributing
👥 Community and Learning
🔗 Related Practice Repositories
📄 License

🎯 Educational Mission

A comprehensive, vendor-independent Apache Iceberg learning environment designed for developers, data engineers, and students who want to master modern data lakehouse concepts through hands-on practice.

12 progressive labs with 100+ exercises. Completely free and open source. Built for learners, by learners.

🎓 Why This Repository?

This educational resource fills the gap between theoretical knowledge and practical skills in Apache Iceberg and data lakehouse technologies:

Learn by Doing: Progressive hands-on labs build real skills
Vendor Independent: Master concepts that apply across all platforms
Production Patterns: Learn best practices used in real data engineering
Multi-Engine Experience: Work with Spark, Trino, DuckDB, and more
Community Driven: Built and improved by the data engineering community

🎓 Learning Approach

Progressive Complexity

Our labs are designed to build knowledge progressively:

Beginner (Labs 0-2): Foundation and basic operations
Intermediate (Labs 3-5): Advanced features and optimization
Advanced (Labs 6-11): Production patterns and multi-engine architecture

Hands-On Learning

Each lab includes:

Clear Learning Objectives: Know what you'll achieve
Step-by-Step Instructions: Guided exercises
Real-World Scenarios: Practical use cases
Solution Notebooks: Reference implementations
Conceptual Guides: Deep-dive explanations

Multi-Engine Experience

Gain experience with different query engines and streaming tools:

Apache Spark: Data processing and ETL
Trino: Interactive SQL analytics
DuckDB: Local analytics and testing
Kafka + Debezium: Real-time streaming and CDC

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Iceberg Code Practice                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │         Apache Polaris (Iceberg REST Catalog)        │  │
│  │         Vendor-independent catalog service          │  │
│  └──────────────────────────────────────────────────────┘  │
│                              ↓                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │         Query Engines (Multi-Engine Lakehouse)      │  │
│  │         - Apache Spark (OSS) with Iceberg          │  │
│  │         - Trino (Interactive SQL)                   │  │
│  │         - DuckDB (Local Analytics)                  │  │
│  │         - Spark History Server (port 18080)         │  │
│  └──────────────────────────────────────────────────────┘  │
│                              ↓                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │         Streaming & CDC Infrastructure               │  │
│  │         - Apache Kafka (Event Streaming)            │  │
│  │         - Debezium (Change Data Capture)            │  │
│  │         - MySQL (CDC Source Database)               │  │
│  │         - ZooKeeper (Kafka Coordination)             │  │
│  └──────────────────────────────────────────────────────┘  │
│                              ↓                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │         Storage Layer (S3-Compatible)              │  │
│  │         - ObjectScale CE (default)            │  │
│  │         - MinIO (optional alternative)             │  │
│  │         - s3a://spark-logs/ for History Server     │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘

🛠️ Core Stack

Catalog

Apache Polaris (Incubating): Iceberg REST catalog
Vendor-independent catalog service
REST API for metadata management
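
As a rough sketch of how a Spark session might be pointed at an Iceberg REST catalog such as Polaris, the helper below assembles the relevant Spark properties. The property keys are standard Iceberg Spark settings; the catalog name `polaris`, the URI, and the warehouse name `practice` are assumptions for illustration and may differ from what this environment actually uses.

```python
ICEBERG_EXTENSIONS = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"


def rest_catalog_conf(catalog: str, uri: str, warehouse: str) -> dict:
    """Build Spark properties that register an Iceberg REST catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg SQL extensions such as CALL procedures.
        "spark.sql.extensions": ICEBERG_EXTENSIONS,
        # Registers the catalog name and selects the REST implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }


# Hypothetical values: adjust the name, URI, and warehouse to your setup.
conf = rest_catalog_conf("polaris", "http://localhost:8181/api/catalog", "practice")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The printed flags can be passed to `spark-submit` or `spark-sql`; the same key/value pairs can also be set on a `SparkSession` builder.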

Storage Options

ObjectScale Community Edition (Default)
MinIO (Alternative)
Both provide S3-compatible APIs

Compute Engines

Apache Spark (OSS): Data processing engine
Trino: Interactive SQL query engine
DuckDB: Local analytics database
Spark History Server: UI for viewing completed jobs
Iceberg Spark Runtime: Iceberg table operations

Streaming & CDC

Apache Kafka: Distributed event streaming platform
Debezium: Change Data Capture for database synchronization
MySQL: Source database for CDC
ZooKeeper: Kafka coordination service

Orchestration

k3s: Lightweight Kubernetes distribution
Docker Compose: Alternative non-K8s setup

🎓 Lab Structure

Lab Difficulty & Time Estimates

Level	Labs	Time per Lab	What It Tests
Beginner	Labs 0-2	30-45 min	Basic setup, table operations, fundamental concepts
Intermediate	Labs 3-5	45-60 min	Advanced features, optimization patterns, real-world scenarios
Advanced	Labs 6-11	60-90 min	Performance analysis, CDC, streaming, multi-engine architecture

Lab 0: Sample Database Setup (NEW)

Generate and load realistic business data
Explore sample database schema and relationships
Practice queries on sample data
Prerequisite for all subsequent labs

Lab 1: Environment Setup

Verify all components are running
Test catalog connectivity
Validate storage access

Lab 2: Basic Iceberg Operations

Create Iceberg tables
Insert and query data
Understand schema evolution
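
The operations in this lab boil down to a handful of Spark SQL statements. The sketch below only builds the SQL text so it can be read without a running cluster; the table name `polaris.demo.customers` and its columns are hypothetical, and the statements are meant to be run through `spark.sql(...)` in a session with an Iceberg catalog configured.

```python
from typing import Optional


def create_table_sql(table: str, columns: dict, partition_by: Optional[str] = None) -> str:
    """Return a CREATE TABLE statement for an Iceberg table."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    sql = f"CREATE TABLE {table} ({cols}) USING iceberg"
    if partition_by:
        # Iceberg partition transforms such as days(ts) go here.
        sql += f" PARTITIONED BY ({partition_by})"
    return sql


def add_column_sql(table: str, name: str, dtype: str) -> str:
    """Return a schema-evolution statement that adds one column."""
    return f"ALTER TABLE {table} ADD COLUMN {name} {dtype}"


# Hypothetical table and columns, for illustration only.
table = "polaris.demo.customers"
statements = [
    create_table_sql(table, {"id": "BIGINT", "name": "STRING", "signup_ts": "TIMESTAMP"},
                     partition_by="days(signup_ts)"),
    f"INSERT INTO {table} VALUES (1, 'Ada', TIMESTAMP '2024-01-01 00:00:00')",
    add_column_sql(table, "segment", "STRING"),
    f"SELECT * FROM {table}",
]
for stmt in statements:
    print(stmt)
```

Adding a column this way is a metadata-only change in Iceberg: existing data files are not rewritten, and older rows read the new column as NULL.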

Lab 3: Advanced Iceberg Features

Partitioning strategies
Time travel queries
Schema evolution with migrations
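
Time travel in Spark SQL (3.3 and later) reads a table as of a snapshot ID or a timestamp. The helper below is a minimal sketch that builds either form; the table name and snapshot ID shown are made up, and real snapshot IDs can be looked up in the table's `snapshots` metadata table.

```python
def time_travel_sql(table: str, *, snapshot_id=None, as_of=None) -> str:
    """Build a time travel query by snapshot ID or by timestamp string."""
    if (snapshot_id is None) == (as_of is None):
        raise ValueError("pass exactly one of snapshot_id or as_of")
    if snapshot_id is not None:
        return f"SELECT * FROM {table} VERSION AS OF {snapshot_id}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{as_of}'"


# Hypothetical table name and snapshot ID.
table = "polaris.demo.orders"
print(f"SELECT snapshot_id, committed_at FROM {table}.snapshots")
print(time_travel_sql(table, snapshot_id=1234567890123456789))
print(time_travel_sql(table, as_of="2024-01-01 00:00:00"))
```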

Lab 4: Iceberg + Spark Optimizations

File compaction
Snapshot management
Query planning optimization

Lab 5: Real-world Data Patterns

Slowly Changing Dimensions (SCD)
Upsert operations
Batch and streaming patterns
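
Upserts on Iceberg tables are typically expressed with `MERGE INTO`. The following sketch builds such a statement from a list of key columns; the target and source names are hypothetical, and `UPDATE SET *` / `INSERT *` assume the source has the same columns as the target.

```python
def merge_upsert_sql(target: str, source: str, keys: list) -> str:
    """Build a MERGE INTO statement that updates matches and inserts the rest."""
    if not keys:
        raise ValueError("at least one key column is required")
    on = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    return (
        f"MERGE INTO {target} t USING {source} s ON {on} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical table and staging view names.
print(merge_upsert_sql("polaris.demo.customers", "customer_updates", ["id"]))
```

This overwrites matched rows in place, which corresponds to an SCD Type 1 pattern. A Type 2 dimension would instead close the current row and insert a new version, which needs additional columns (for example validity dates) not shown here.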

Lab 6: Performance & UI ⭐ (NEW)

Complex Iceberg join operations
Spark History Server UI exploration
DAG inspection and metadata-only filtering
Performance analysis and optimization

Lab 7: Table Maintenance and Operations (NEW)

File compaction and optimization strategies
Snapshot management and expiration
Orphan file cleanup and storage reclamation
Table statistics collection and analysis
Metadata optimization
Table migration and rollback
Backup and restore strategies
Monitoring and alerting setup
Automated maintenance procedures
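
Several of these maintenance tasks map onto Iceberg's built-in Spark procedures. The sketch below generates the `CALL` statements for compaction, snapshot expiration, and orphan file cleanup; the catalog name, table name, and seven-day retention are assumptions chosen for illustration, not settings taken from this repository.

```python
from datetime import datetime, timedelta, timezone


def maintenance_sql(catalog: str, table: str, retain_days: int = 7, now=None) -> list:
    """Return CALL statements for routine Iceberg table maintenance."""
    now = now or datetime.now(timezone.utc)
    # Snapshots older than this cutoff become eligible for expiration.
    cutoff = (now - timedelta(days=retain_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]


# Hypothetical catalog and table names.
for stmt in maintenance_sql("polaris", "demo.orders"):
    print(stmt)
```

Expiring snapshots removes the ability to time travel to them, so the retention window should be at least as long as any rollback or audit requirement.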

Lab 8: Kafka Integration with Iceberg (NEW)

Set up Apache Kafka for real-time data streaming
Produce and consume events with Kafka
Integrate Spark Structured Streaming with Iceberg
Implement real-time analytics on streaming data
Handle exactly-once processing semantics
Implement data quality validation
Handle schema evolution in streaming pipelines
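
Exactly-once results usually depend on a replayable source, checkpointed progress, and a sink that tolerates retries. The toy class below illustrates only the last part, skipping a micro-batch whose ID was already committed. It is a conceptual sketch in plain Python, not how Spark or the Iceberg sink is implemented, and the class and method names are hypothetical.

```python
class IdempotentSink:
    """In-memory sink that ignores batches it has already committed."""

    def __init__(self):
        self.rows = []
        self.committed_batches = set()

    def write_batch(self, batch_id: int, rows: list) -> bool:
        """Append rows once per batch ID; return False for a replayed batch."""
        if batch_id in self.committed_batches:
            return False
        self.rows.extend(rows)
        self.committed_batches.add(batch_id)
        return True


sink = IdempotentSink()
sink.write_batch(0, [{"event": "click"}, {"event": "view"}])
sink.write_batch(1, [{"event": "purchase"}])
# A retry after a failure replays batch 1; the sink leaves its rows unchanged.
replayed = sink.write_batch(1, [{"event": "purchase"}])
print(len(sink.rows), replayed)
```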

Lab 9: Real CDC with Debezium (NEW)

Configure Debezium for MySQL CDC
Set up MySQL for change data capture
Create and manage Debezium connectors
Stream CDC events to Kafka topics
Consume CDC events with Spark Structured Streaming
Apply CDC changes to Iceberg tables (inserts, updates, deletes)
Handle schema evolution and data type conversions
Monitor and troubleshoot CDC pipelines
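
Debezium change events carry an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read) along with `before` and `after` row images. As a minimal sketch of what applying them means, the function below replays events onto an in-memory dictionary keyed by primary key. It assumes each event has already been unwrapped to its payload and that the key column is named `id`; in the lab, the equivalent logic would run against an Iceberg table instead.

```python
def apply_cdc_events(state: dict, events: list, key: str = "id") -> dict:
    """Apply Debezium-style change events to a dict keyed by primary key."""
    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):
            # Inserts, snapshot reads, and updates all upsert the "after" image.
            row = event["after"]
            state[row[key]] = row
        elif op == "d":
            # Deletes carry the removed row in "before".
            state.pop(event["before"][key], None)
        else:
            raise ValueError(f"unsupported op: {op!r}")
    return state


# Hand-written sample events, for illustration only.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "u", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada L."}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Bob"}},
    {"op": "d", "before": {"id": 2, "name": "Bob"}, "after": None},
]
table = apply_cdc_events({}, events)
print(table)
```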

Lab 10: Spring Boot with Iceberg (NEW)

Create Spring Boot applications with Iceberg integration
Configure Iceberg catalog and table access
Implement CRUD operations on Iceberg tables
Build REST APIs for Iceberg data access
Implement transaction handling and error management
Optimize performance with caching and connection pooling
Implement data validation and business logic
Add monitoring and logging to applications

Lab 11: Multi-Engine Lakehouse (NEW)

Configure multiple query engines (Spark, Trino, DuckDB)
Ensure schema consistency across engines
Implement engine-specific optimizations
Handle data type conversions between engines
Monitor and optimize multi-engine workloads
Implement workload isolation and resource management
Build cross-engine ETL pipelines
Monitor multi-engine lakehouse operations
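
One simple way to check schema consistency is to compare the column types each engine reports for the same table after normalizing type names. The sketch below does this for plain dictionaries; the alias map is deliberately tiny and illustrative, and the sample schemas are invented rather than taken from the lab tables.

```python
# Illustrative alias map only; real engines expose many more type names.
TYPE_ALIASES = {"string": "varchar", "long": "bigint", "int": "integer"}


def normalize(schema: dict) -> dict:
    """Lower-case column and type names and collapse known aliases."""
    return {
        col.lower(): TYPE_ALIASES.get(dtype.lower(), dtype.lower())
        for col, dtype in schema.items()
    }


def schema_diff(reference: dict, other: dict) -> dict:
    """Return {column: (reference_type, other_type)} for every mismatch."""
    ref, oth = normalize(reference), normalize(other)
    return {
        col: (ref.get(col), oth.get(col))
        for col in sorted(ref.keys() | oth.keys())
        if ref.get(col) != oth.get(col)
    }


# Invented schemas as they might be reported by three engines.
spark_schema = {"id": "bigint", "name": "string"}
trino_schema = {"id": "bigint", "name": "varchar"}
duckdb_schema = {"id": "INTEGER", "name": "VARCHAR"}
print(schema_diff(spark_schema, trino_schema))
print(schema_diff(spark_schema, duckdb_schema))
```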

💾 Sample Database

The environment includes a comprehensive sample database with realistic e-commerce data for hands-on learning:

Sample Tables

sample_customers (1,000 records): Customer dimension with segmentation
sample_products (200 records): Product catalog with categories
sample_orders (5,000 records): Order fact table with status tracking
sample_transactions (10,000 records): Transaction details with payment methods
sample_events (20,000 records): Web events for user engagement analysis

Loading Sample Data

# Generate and load sample data
python3 scripts/generate_sample_data.py
./scripts/load_sample_data.sh

Sample Data Documentation

Sample Database Guide - Complete schema and usage documentation
Lab 0: Sample Database Setup - Step-by-step loading and exploration

🚀 Quick Start

🎓 New to Apache Iceberg?

Follow our recommended learning path:

Start with Fundamentals: Read the Iceberg Fundamentals wiki page
Set Up Environment: Follow the Getting Started Guide
Begin Lab 0: Load sample data with Lab 0
Progress Through Labs: Follow the Learning Path

📋 Setup Options

Option 1: Kubernetes with k3s (Recommended)

cd iceberg-code-practice
./scripts/setup.sh
kubectl apply -f k8s/

Option 2: Docker Compose (Lightweight)

cd iceberg-code-practice
cp .env.example .env
# Edit .env with your credentials
docker-compose up -d

📋 Requirements

Docker or Podman
k3s (for K8s setup) OR Docker Compose (for lightweight setup)
16 GB RAM minimum (to accommodate multi-engine and streaming workloads)
40 GB disk space (to accommodate the additional components)

🔧 Configuration

Storage Backend Selection

# Use ObjectScale (default)
export STORAGE_BACKEND=objectscale

# Use MinIO
export STORAGE_BACKEND=minio

Spark Configuration

# Spark History Server port
export SPARK_HISTORY_PORT=18080

# Event logs location
export SPARK_EVENT_LOGS=s3a://spark-logs/
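
For orientation, the sketch below shows how these two variables could translate into Spark properties for event logging and the History Server. The property names are standard Spark settings; how this repository's scripts actually consume the variables is an assumption.

```python
import os


def history_server_conf(env=os.environ) -> dict:
    """Map the environment variables above onto Spark event log properties."""
    log_dir = env.get("SPARK_EVENT_LOGS", "s3a://spark-logs/")
    port = env.get("SPARK_HISTORY_PORT", "18080")
    return {
        # Applications write their event logs here...
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,
        # ...and the History Server reads them from the same location.
        "spark.history.fs.logDirectory": log_dir,
        "spark.history.ui.port": port,
    }


for key, value in history_server_conf().items():
    print(f"{key}={value}")
```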

📚 Documentation

🎓 Educational Resources

Wiki Guides (Comprehensive learning materials):

Wiki Home - Main wiki page with all guides
Getting Started Guide - Complete setup and first steps
Iceberg Fundamentals - Core concepts and architecture
Lab Guides - Detailed lab walkthroughs
Learning Path - Recommended learning sequence
Best Practices - Production-ready patterns
Troubleshooting - Common issues and solutions

Core Documentation

Setup Guide - Detailed setup instructions for K8s and Docker Compose
Architecture Overview - System architecture and component details
Lab Guide - Complete lab sequence and learning path
Troubleshooting - Common issues and solutions (including ObjectScale-specific issues)
GitHub Pages Setup - Documentation deployment guide
Wiki Setup - Wiki contribution and maintenance guide

🎓 Conceptual Guides (Tutorials)

Deep-dive tutorials explaining the "Why" behind the "How":

Conceptual Guide 1: Environment Architecture - Understanding the Iceberg environment architecture
Conceptual Guide 2: Table Operations & Schema Evolution - How Iceberg handles table operations and schema evolution
Conceptual Guide 3: Advanced Features & Performance - Partitioning, Z-ordering, compaction, and metadata-only filtering
Conceptual Guide 4: Spark + Iceberg Optimization - Spark-Iceberg integration and optimization techniques
Conceptual Guide 5: Real-World Data Patterns - SCD, upsert, CDC, and batch/streaming patterns
Conceptual Guide 6: Performance Analysis & DAG Inspection - Understanding query execution and performance analysis
Conceptual Guide 7: Table Maintenance & Operations - Compaction, snapshots, monitoring, and automation
Conceptual Guide 8: Real-Time Data Pipelines with Kafka and CDC - Kafka integration, CDC patterns, and streaming architectures
Conceptual Guide 9: Application Integration with Iceberg - Building applications with Iceberg, repository patterns, and transaction management
Conceptual Guide 10: Multi-Engine Lakehouse Architecture - Multi-engine design patterns, engine selection, and resource management

Lab Materials

Lab 0: Sample Database Setup - Generate and load sample data
Lab 1: Environment Setup - Component verification and first Iceberg query
Lab 2: Basic Operations - Tables, queries, schema evolution
Lab 3: Advanced Features - Partitioning, compaction, metadata filtering
Lab 4: Spark Optimizations - File management, query planning
Lab 5: Real-World Patterns - SCD, upsert, CDC, star schema
Lab 6: Performance & UI - DAG inspection, metadata-only filtering analysis
Lab 7: Table Maintenance - Compaction, snapshots, monitoring, automation
Lab 8: Kafka Integration - Real-time streaming with Kafka and Iceberg
Lab 9: Real CDC with Debezium - Change data capture with Debezium
Lab 10: Spring Boot with Iceberg - Building applications with Iceberg
Lab 11: Multi-Engine Lakehouse - Multi-engine architecture and optimization

💡 Jupyter Notebooks

Interactive Jupyter notebooks for hands-on learning:

Lab Notebooks - Student notebooks with exercises
Solution Helper - How to use the solution helper when stuck

🔧 Solutions Framework

Complete solution notebooks for reference and validation:

Lab 1 Solution - Environment setup solution
Lab 2 Solution - Basic operations solution
Lab 3 Solution - Advanced features solution
Lab 4 Solution - Optimizations solution
Lab 5 Solution - Real-world patterns solution
Lab 6 Solution - Performance & UI solution
Lab 7 Solution - Table maintenance solution
Lab 8 Solution - Kafka integration solution
Lab 9 Solution - CDC with Debezium solution
Lab 10 Solution - Spring Boot with Iceberg solution
Lab 11 Solution - Multi-engine lakehouse solution

🤖 Automation Scripts

Solution Helper - Python helper for accessing solutions and hints
Validate Solutions - CI/CD validation script for solution notebooks
Convert Labs to Notebooks - Convert Markdown labs to Jupyter notebooks
Generate Sample Data - Generate realistic business data
Load Sample Data - Load sample data into Iceberg

🆘 Vendor Independence

This environment uses only open source tools. Most are Apache-licensed; the license for each is listed below:

Apache Spark (Apache 2.0)
Apache Iceberg (Apache 2.0)
Apache Polaris (Apache 2.0)
Apache Kafka (Apache 2.0)
Trino (Apache 2.0)
DuckDB (MIT)
Debezium (Apache 2.0)
MySQL Community Server (GPL)
k3s (Apache 2.0)
MinIO (AGPL)
ObjectScale CE (Apache 2.0)

No proprietary cloud services or consoles required.

🤝 Contributing

This is a practice environment for learning. Feel free to extend labs, add examples, or improve the setup process.

Disclaimer: This is an independent educational resource for learning Apache Iceberg and data lakehouse concepts. It is not affiliated with, endorsed by, or sponsored by Apache Iceberg or any vendor.

👥 Community and Learning

This repository is an open educational resource built for the data engineering community. We believe in learning together and sharing knowledge.

🤝 Learning Together

📖 Comprehensive Wiki: Detailed guides and tutorials for all skill levels
💬 GitHub Discussions: Ask questions and share insights with fellow learners
🐛 Issue Tracking: Report bugs and suggest improvements
🔄 Pull Requests: Contribute labs, fixes, and enhancements
⭐ Star the Repo: Show your support and help others discover this resource

🎓 Contributing to Learning

We welcome contributions that improve the educational value:

New Labs: Suggest new lab topics and exercises
Better Explanations: Improve clarity of existing content
Additional Examples: Add more practical examples
Translation: Help translate content for global learners
Bug Fixes: Report and fix issues in labs or documentation

See CONTRIBUTING.md for detailed contribution guidelines.

📚 Additional Learning Resources

Official Apache Iceberg Documentation: https://iceberg.apache.org/
Apache Iceberg Slack Community: Join the conversation
Iceberg Blog: Latest updates and articles
Conference Talks: Learn from industry experts

🔗 Related Practice Repositories

Continue your learning journey with these related repositories:

AI/ML Practice

🤖 DSPy Code Practice - Declarative LLM programming
🧠 LLM Fine-Tuning Practice - Model fine-tuning techniques

Data Engineering Practice

🦆 DuckDB Code Practice - Analytics & SQL optimization
⚡ Apache Spark Code Practice - Big data processing
🔧 Apache Beam Code Practice - Data pipelines

Programming Practice

⚙️ Scala Data Analysis Practice - Functional programming

Resource Hub

📚 Awesome My Notes - Comprehensive technical notes and learning resources

📄 License

Apache License 2.0
