nellaivijay/iceberg-code-practice

Apache Iceberg Code Practice

🎯 Educational Mission

A comprehensive, vendor-independent Apache Iceberg learning environment designed for developers, data engineers, and students who want to master modern data lakehouse concepts through hands-on practice.

12 progressive labs with 100+ exercises. Completely free and open source. Built for learners, by learners.

🎓 Why This Repository?

This educational resource fills the gap between theoretical knowledge and practical skills in Apache Iceberg and data lakehouse technologies:

  • Learn by Doing: Progressive hands-on labs build real skills
  • Vendor Independent: Master concepts that apply across all platforms
  • Production Patterns: Learn best practices used in real data engineering
  • Multi-Engine Experience: Work with Spark, Trino, DuckDB, and more
  • Community Driven: Built and improved by the data engineering community

🎓 Learning Approach

Progressive Complexity

Our labs are designed to build knowledge progressively:

  • Beginner (Labs 0-2): Foundation and basic operations
  • Intermediate (Labs 3-5): Advanced features and optimization
  • Advanced (Labs 6-11): Production patterns and multi-engine architecture

Hands-On Learning

Each lab includes:

  • Clear Learning Objectives: Know what you'll achieve
  • Step-by-Step Instructions: Guided exercises
  • Real-World Scenarios: Practical use cases
  • Solution Notebooks: Reference implementations
  • Conceptual Guides: Deep-dive explanations

Multi-Engine Experience

Gain experience with different query engines:

  • Apache Spark: Data processing and ETL
  • Trino: Interactive SQL analytics
  • DuckDB: Local analytics and testing
  • Kafka + Debezium: Real-time streaming and CDC

🏗️ Architecture

┌────────────────────────────────────────────────────────────┐
│                   Iceberg Code Practice                    │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Apache Polaris (Iceberg REST Catalog)               │  │
│  │  Vendor-independent catalog service                  │  │
│  └──────────────────────────────────────────────────────┘  │
│                             ↓                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Query Engines (Multi-Engine Lakehouse)              │  │
│  │  - Apache Spark (OSS) with Iceberg                   │  │
│  │  - Trino (Interactive SQL)                           │  │
│  │  - DuckDB (Local Analytics)                          │  │
│  │  - Spark History Server (port 18080)                 │  │
│  └──────────────────────────────────────────────────────┘  │
│                             ↓                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Streaming & CDC Infrastructure                      │  │
│  │  - Apache Kafka (Event Streaming)                    │  │
│  │  - Debezium (Change Data Capture)                    │  │
│  │  - MySQL (CDC Source Database)                       │  │
│  │  - Zookeeper (Kafka Coordination)                    │  │
│  └──────────────────────────────────────────────────────┘  │
│                             ↓                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Storage Layer (S3-Compatible)                       │  │
│  │  - ObjectScale CE (default)                          │  │
│  │  - MinIO (optional alternative)                      │  │
│  │  - s3a://spark-logs/ for History Server              │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                            │
└────────────────────────────────────────────────────────────┘

🛠️ Core Stack

Catalog

  • Apache Polaris (Incubating): Iceberg REST catalog
  • Vendor-independent catalog service
  • REST API for metadata management

Storage Options

  • ObjectScale Community Edition (Default)
  • MinIO (Alternative)
  • Both provide S3-compatible APIs

Compute Engines

  • Apache Spark (OSS): Data processing engine
  • Trino: Interactive SQL query engine
  • DuckDB: Local analytics database
  • Spark History Server: UI for viewing completed jobs
  • Iceberg Spark Runtime: Iceberg table operations
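
To show how Spark, the Iceberg runtime, and a REST catalog such as Polaris fit together, here is a minimal sketch of the Spark settings involved. The catalog name `polaris`, the URI, and the warehouse name are assumptions for illustration, not values taken from this repository. Authentication settings are omitted, and the Iceberg Spark runtime jar must already be on the classpath.

```python
def iceberg_rest_catalog_conf(catalog_name: str, uri: str, warehouse: str) -> dict:
    """Build the Spark settings that register an Iceberg REST catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg SQL extensions (MERGE INTO, CALL procedures, ...).
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }


def build_session(conf: dict):
    """Create a SparkSession from the settings above (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the helper above stays dependency-free

    builder = SparkSession.builder.appName("iceberg-code-practice")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Assumed values; replace them with the ones from your own setup.
conf = iceberg_rest_catalog_conf(
    catalog_name="polaris",
    uri="http://localhost:8181/api/catalog",
    warehouse="practice",
)
```

With a session built this way, tables would be addressed as `polaris.<namespace>.<table>` in Spark SQL.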

Streaming & CDC

  • Apache Kafka: Distributed event streaming platform
  • Debezium: Change Data Capture for database synchronization
  • MySQL: Source database for CDC
  • Zookeeper: Kafka coordination service

Orchestration

  • k3s: Lightweight Kubernetes distribution
  • Docker Compose: Alternative non-K8s setup

🎓 Lab Structure

Lab Difficulty & Time Estimates

| Level | Labs | Time per Lab | What It Tests |
|---|---|---|---|
| Beginner | Labs 0-2 | 30-45 min | Basic setup, table operations, fundamental concepts |
| Intermediate | Labs 3-5 | 45-60 min | Advanced features, optimization patterns, real-world scenarios |
| Advanced | Labs 6-11 | 60-90 min | Performance analysis, CDC, streaming, multi-engine architecture |

Lab 0: Sample Database Setup (NEW)

  • Generate and load realistic business data
  • Explore sample database schema and relationships
  • Practice queries on sample data
  • Prerequisite for all subsequent labs

Lab 1: Environment Setup

  • Verify all components are running
  • Test catalog connectivity
  • Validate storage access
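
A quick way to verify that components are reachable is a TCP port check. The sketch below uses only the Python standard library. Port 18080 for the Spark History Server comes from this README; any other host or port you add to the map is an assumption about your own setup.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical service map; extend it with the endpoints of your environment.
SERVICES = {
    "spark-history-server": ("localhost", 18080),
}


def check_services(services: dict) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, ok in check_services(SERVICES).items():
        print(f"{name}: {'up' if ok else 'unreachable'}")
```

An open port only shows that something is listening. Catalog connectivity and storage access still need a real request, such as listing namespaces or writing a test object.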

Lab 2: Basic Iceberg Operations

  • Create Iceberg tables
  • Insert and query data
  • Understand schema evolution
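
The operations in this lab can be sketched as a short sequence of Spark SQL statements. The catalog, namespace, table, and column names below are hypothetical, and `run_all` expects any object with a `sql` method, such as a SparkSession configured for Iceberg.

```python
# Hypothetical names: catalog "polaris", namespace "practice", table "events".
TABLE = "polaris.practice.events"

STATEMENTS = [
    "CREATE NAMESPACE IF NOT EXISTS polaris.practice",
    f"CREATE TABLE IF NOT EXISTS {TABLE} (id BIGINT, event_type STRING) USING iceberg",
    f"INSERT INTO {TABLE} VALUES (1, 'signup'), (2, 'login')",
    # Schema evolution: adding a column is a metadata change,
    # so existing data files are not rewritten.
    f"ALTER TABLE {TABLE} ADD COLUMNS (country STRING)",
    f"SELECT id, event_type, country FROM {TABLE} ORDER BY id",
]


def run_all(spark, statements=STATEMENTS):
    """Run each statement through spark.sql and return the last result."""
    result = None
    for statement in statements:
        result = spark.sql(statement)
    return result
```

Rows written before the `ALTER TABLE` read back with `country` as NULL.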

Lab 3: Advanced Iceberg Features

  • Partitioning strategies
  • Time travel queries
  • Schema evolution with migrations
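
As a rough illustration of partitioning and time travel, the helpers below build the Spark SQL involved. Table and column names are hypothetical. The `VERSION AS OF` and `TIMESTAMP AS OF` clauses assume Spark 3.3 or later.

```python
from datetime import datetime


def create_partitioned_table_sql(table: str) -> str:
    """Hidden partitioning: Iceberg derives the partition from event_ts itself."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(id BIGINT, event_ts TIMESTAMP, category STRING) "
        "USING iceberg PARTITIONED BY (days(event_ts), category)"
    )


def list_snapshots_sql(table: str) -> str:
    """Query the snapshots metadata table to find time-travel targets."""
    return (
        f"SELECT snapshot_id, committed_at, operation "
        f"FROM {table}.snapshots ORDER BY committed_at"
    )


def as_of_snapshot_sql(table: str, snapshot_id: int) -> str:
    """Read the table as it was at a specific snapshot."""
    return f"SELECT * FROM {table} VERSION AS OF {int(snapshot_id)}"


def as_of_timestamp_sql(table: str, ts: datetime) -> str:
    """Read the table as it was at a point in time."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{ts:%Y-%m-%d %H:%M:%S}'"
```

A typical flow is to list snapshots first, pick a `snapshot_id`, and pass it to `as_of_snapshot_sql`.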

Lab 4: Iceberg + Spark Optimizations

  • File compaction
  • Snapshot management
  • Query planning optimization

Lab 5: Real-world Data Patterns

  • Slowly Changing Dimensions (SCD)
  • Upsert operations
  • Batch and streaming patterns
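
To make the Slowly Changing Dimension idea concrete, here is a small pure-Python sketch of Type 2 logic: a changed attribute closes the current row and opens a new one. The `segment` attribute and all field names are hypothetical. On an Iceberg table the same effect would typically come from a `MERGE INTO` statement rather than Python code.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DimRow:
    customer_id: int
    segment: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(dim: list, updates: dict, as_of: date) -> list:
    """Apply {customer_id: new_segment} updates using SCD Type 2 semantics."""
    result = []
    current_ids = set()
    for row in dim:
        if row.valid_to is None:
            current_ids.add(row.customer_id)
            new_segment = updates.get(row.customer_id)
            if new_segment is not None and new_segment != row.segment:
                # Close the old version and open a new current version.
                result.append(replace(row, valid_to=as_of))
                result.append(DimRow(row.customer_id, new_segment, as_of))
                continue
        result.append(row)
    # Customers with no current row are plain inserts.
    for customer_id, segment in updates.items():
        if customer_id not in current_ids:
            result.append(DimRow(customer_id, segment, as_of))
    return result
```

History is preserved: earlier versions stay in the table with a populated `valid_to`, so point-in-time questions remain answerable.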

Lab 6: Performance & UI ⭐ (NEW)

  • Complex Iceberg join operations
  • Spark History Server UI exploration
  • DAG inspection and metadata-only filtering
  • Performance analysis and optimization

Lab 7: Table Maintenance and Operations (NEW)

  • File compaction and optimization strategies
  • Snapshot management and expiration
  • Orphan file cleanup and storage reclamation
  • Table statistics collection and analysis
  • Metadata optimization
  • Table migration and rollback
  • Backup and restore strategies
  • Monitoring and alerting setup
  • Automated maintenance procedures
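
Several of these maintenance tasks map onto Iceberg's built-in Spark procedures. The sketch below only builds the `CALL` statements. The catalog and table names are placeholders, and running the statements requires a Spark session with the Iceberg SQL extensions enabled.

```python
from datetime import datetime


def compaction_sql(catalog: str, table: str) -> str:
    """Compact small data files into larger ones."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"


def expire_snapshots_sql(catalog: str, table: str,
                         older_than: datetime, retain_last: int = 1) -> str:
    """Expire snapshots older than a cutoff while keeping the newest N."""
    cutoff = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {int(retain_last)})"
    )


def remove_orphans_sql(catalog: str, table: str) -> str:
    """Delete files no longer referenced by any table metadata."""
    return f"CALL {catalog}.system.remove_orphan_files(table => '{table}')"


def rollback_sql(catalog: str, table: str, snapshot_id: int) -> str:
    """Roll the table back to an earlier snapshot."""
    return (
        f"CALL {catalog}.system.rollback_to_snapshot("
        f"table => '{table}', snapshot_id => {int(snapshot_id)})"
    )
```

Order matters in practice: expiring snapshots removes time-travel targets, so a rollback is only possible to a snapshot that has not been expired.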

Lab 8: Kafka Integration with Iceberg (NEW)

  • Set up Apache Kafka for real-time data streaming
  • Produce and consume events with Kafka
  • Integrate Spark Structured Streaming with Iceberg
  • Implement real-time analytics on streaming data
  • Handle exactly-once processing semantics
  • Implement data quality validation
  • Handle schema evolution in streaming pipelines
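
A minimal sketch of the Kafka-to-Iceberg path with Spark Structured Streaming follows. The broker address, topic, table, and checkpoint path are assumptions supplied by the caller. The function expects a SparkSession that has both the Kafka connector and the Iceberg runtime available.

```python
def start_kafka_to_iceberg(spark, bootstrap_servers: str, topic: str,
                           table: str, checkpoint: str):
    """Start a streaming query that appends raw Kafka records to an Iceberg table."""
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .load()
        # Kafka delivers key and value as binary; cast them for readability.
        .selectExpr(
            "CAST(key AS STRING) AS key",
            "CAST(value AS STRING) AS value",
            "timestamp",
        )
    )
    return (
        events.writeStream.format("iceberg")
        .outputMode("append")
        # The checkpoint records consumed offsets so a restarted query
        # resumes where it left off instead of re-reading the topic.
        .option("checkpointLocation", checkpoint)
        .toTable(table)
    )
```

The checkpoint location is what ties restarts to already-committed offsets, so it should live on durable storage and stay stable across restarts of the same query.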

Lab 9: Real CDC with Debezium (NEW)

  • Configure Debezium for MySQL CDC
  • Set up MySQL for change data capture
  • Create and manage Debezium connectors
  • Stream CDC events to Kafka topics
  • Consume CDC events with Spark Structured Streaming
  • Apply CDC changes to Iceberg tables (inserts, updates, deletes)
  • Handle schema evolution and data type conversions
  • Monitor and troubleshoot CDC pipelines
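
Debezium change events carry an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read) along with `before` and `after` row images. The sketch below applies such events to an in-memory dictionary to show the logic. The key column `id` and the sample rows are hypothetical. Against an Iceberg table the equivalent step would normally be a `MERGE INTO` per micro-batch.

```python
def apply_change(table: dict, event: dict, key: str = "id") -> None:
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Creates, snapshot reads, and updates all upsert the "after" image.
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # Deletes carry the removed row in "before".
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


def apply_changes(table: dict, events: list, key: str = "id") -> dict:
    """Apply events in order; ordering matters for correctness."""
    for event in events:
        apply_change(table, event, key)
    return table


events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]
state = apply_changes({}, events)
```

After the four events, only row 1 remains, with status `paid`.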

Lab 10: Spring Boot with Iceberg (NEW)

  • Create Spring Boot applications with Iceberg integration
  • Configure Iceberg catalog and table access
  • Implement CRUD operations on Iceberg tables
  • Build REST APIs for Iceberg data access
  • Implement transaction handling and error management
  • Optimize performance with caching and connection pooling
  • Implement data validation and business logic
  • Add monitoring and logging to applications

Lab 11: Multi-Engine Lakehouse (NEW)

  • Configure multiple query engines (Spark, Trino, DuckDB)
  • Ensure schema consistency across engines
  • Implement engine-specific optimizations
  • Handle data type conversions between engines
  • Monitor and optimize multi-engine workloads
  • Implement workload isolation and resource management
  • Build cross-engine ETL pipelines
  • Monitor multi-engine lakehouse operations

💾 Sample Database

The environment includes a comprehensive sample database with realistic e-commerce data for hands-on learning:

Sample Tables

  • sample_customers (1,000 records): Customer dimension with segmentation
  • sample_products (200 records): Product catalog with categories
  • sample_orders (5,000 records): Order fact table with status tracking
  • sample_transactions (10,000 records): Transaction details with payment methods
  • sample_events (20,000 records): Web events for user engagement analysis

Loading Sample Data

# Generate and load sample data
python3 scripts/generate_sample_data.py
./scripts/load_sample_data.sh
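
For a sense of what a generator for these tables might look like, here is a small standard-library sketch that produces deterministic customer rows. It is not the repository's `generate_sample_data.py`. The column names and segment values are invented for illustration.

```python
import csv
import io
import random

# Hypothetical segment values; the real script may use different ones.
SEGMENTS = ["consumer", "corporate", "small_business"]


def generate_customers(n: int, seed: int = 42) -> list:
    """Generate n customer rows; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)
    return [
        {"customer_id": i, "name": f"Customer {i}", "segment": rng.choice(SEGMENTS)}
        for i in range(1, n + 1)
    ]


def to_csv(rows: list) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


customers = generate_customers(1000)
```

Seeding the generator means every learner gets identical data, which keeps query results comparable across lab write-ups.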

Sample Data Documentation

🚀 Quick Start

🎓 New to Apache Iceberg?

Follow our recommended learning path:

  1. Start with Fundamentals: Read the Iceberg Fundamentals wiki page
  2. Set Up Environment: Follow the Getting Started Guide
  3. Begin with Lab 0: Load the sample data
  4. Progress Through Labs: Follow the Learning Path

📋 Setup Options

Option 1: Kubernetes with k3s (Recommended)

cd iceberg-code-practice
./scripts/setup.sh
kubectl apply -f k8s/

Option 2: Docker Compose (Lightweight)

cd iceberg-code-practice
cp .env.example .env
# Edit .env with your credentials
docker-compose up -d

📋 Requirements

  • Docker or Podman
  • k3s (for K8s setup) OR Docker Compose (for lightweight setup)
  • 16GB RAM minimum (increased for multi-engine and streaming workloads)
  • 40GB disk space (increased for additional components)

🔧 Configuration

Storage Backend Selection

# Use ObjectScale (default)
export STORAGE_BACKEND=objectscale

# Use MinIO
export STORAGE_BACKEND=minio

Spark Configuration

# Spark History Server port
export SPARK_HISTORY_PORT=18080

# Event logs location
export SPARK_EVENT_LOGS=s3a://spark-logs/
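
As a sketch of how these environment variables could be turned into Spark settings, the helpers below read them with the defaults shown above. How the repository's own scripts consume the variables is not shown here, so treat the mapping as an assumption. The `spark.eventLog.*` and `spark.history.*` keys are standard Spark configuration properties.

```python
import os

VALID_BACKENDS = {"objectscale", "minio"}


def storage_backend(env=None) -> str:
    """Return the selected storage backend, defaulting to ObjectScale."""
    env = os.environ if env is None else env
    backend = env.get("STORAGE_BACKEND", "objectscale").lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"STORAGE_BACKEND must be one of {sorted(VALID_BACKENDS)}")
    return backend


def spark_event_log_conf(env=None) -> dict:
    """Build event-log settings shared by Spark jobs and the History Server."""
    env = os.environ if env is None else env
    log_dir = env.get("SPARK_EVENT_LOGS", "s3a://spark-logs/")
    return {
        "spark.eventLog.enabled": "true",
        # Jobs write event logs here ...
        "spark.eventLog.dir": log_dir,
        # ... and the History Server reads them from the same location.
        "spark.history.fs.logDirectory": log_dir,
        "spark.history.ui.port": env.get("SPARK_HISTORY_PORT", "18080"),
    }
```

Passing a plain dictionary as `env` makes the helpers easy to exercise without touching the real environment.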

📚 Documentation

🎓 Educational Resources

Wiki Guides (Comprehensive learning materials):

Core Documentation

🎓 Conceptual Guides (Tutorials)

Deep-dive tutorials explaining the "Why" behind the "How":

Lab Materials

💡 Jupyter Notebooks

Interactive Jupyter notebooks for hands-on learning:

🔧 Solutions Framework

Complete solution notebooks for reference and validation:

🤖 Automation Scripts

🆘 Vendor Independence

This environment is built from freely available tools, most of them Apache-licensed:

  • Apache Spark (Apache 2.0)
  • Apache Iceberg (Apache 2.0)
  • Apache Polaris (Apache 2.0)
  • Apache Kafka (Apache 2.0)
  • Trino (Apache 2.0)
  • DuckDB (MIT)
  • Debezium (Apache 2.0)
  • MySQL Community Server (GPL)
  • k3s (Apache 2.0)
  • MinIO (AGPL)
  • ObjectScale CE (Apache 2.0)

No proprietary cloud services or consoles required.

🤝 Contributing

This is a practice environment for learning. Feel free to extend labs, add examples, or improve the setup process.

Disclaimer: This is an independent educational resource for learning Apache Iceberg and data lakehouse concepts. It is not affiliated with, endorsed by, or sponsored by Apache Iceberg or any vendor.

👥 Community and Learning

This repository is an open educational resource built for the data engineering community. We believe in learning together and sharing knowledge.

🤝 Learning Together

  • 📖 Comprehensive Wiki: Detailed guides and tutorials for all skill levels
  • 💬 GitHub Discussions: Ask questions and share insights with fellow learners
  • 🐛 Issue Tracking: Report bugs and suggest improvements
  • 🔄 Pull Requests: Contribute labs, fixes, and enhancements
  • ⭐ Star the Repo: Show your support and help others discover this resource

🎓 Contributing to Learning

We welcome contributions that improve the educational value:

  • New Labs: Suggest new lab topics and exercises
  • Better Explanations: Improve clarity of existing content
  • Additional Examples: Add more practical examples
  • Translation: Help translate content for global learners
  • Bug Fixes: Report and fix issues in labs or documentation

See CONTRIBUTING.md for detailed contribution guidelines.

📚 Additional Learning Resources

🔗 Related Practice Repositories

Continue your learning journey with these related repositories:

AI/ML Practice

Data Engineering Practice

Programming Practice

Resource Hub

📄 License

Apache License 2.0
