A high-performance, fault-tolerant streaming data ingestion platform built in Rust. Designed for processing millions of events per second with sub-second latency, supporting multiple data sources and lakehouse formats.
The Stream Data Platform is a production-ready, enterprise-grade data ingestion system that provides:
- Extreme Performance: Optimized for high-throughput, low-latency data processing
- Fault Tolerance: Built-in recovery mechanisms and data durability guarantees
- Scalability: Horizontal scaling with distributed architecture
- Flexibility: Support for multiple data sources, formats, and sinks
- Observability: Comprehensive monitoring and metrics
- High-throughput Processing: Millions of events per second
- Sub-second Latency: Optimized for real-time data processing
- Fault Tolerance: Write-ahead logging and recovery mechanisms
- Backpressure Management: Intelligent flow control and batching
- Multi-format Support: JSON, CSV, Parquet, and custom formats
- API-first Design: HTTP and gRPC endpoints for data ingestion
- HTTP/gRPC APIs: RESTful and streaming interfaces
- Apache Kafka: High-throughput message queue integration
- Change Data Capture (CDC): Database change streams
- File Sources: Batch and streaming file processing
- Apache Iceberg: Lakehouse table format
- Delta Lake: ACID-compliant lakehouse format
- File Systems: Local and distributed file storage
- Object Storage: S3, GCS, Azure Blob Storage
- Modular Design: Separate crates for core functionality
- Async/Non-blocking: Built on Tokio for high concurrency
- Zero-copy Optimizations: Minimized memory allocations
- Production Ready: Comprehensive monitoring and metrics
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Kafka    │    │     CDC     │    │  HTTP/gRPC  │
│   Source    │    │   Source    │    │    APIs     │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
               ┌──────────▼──────────┐
               │  Ingestion Engine   │
               │     (Batching,      │
               │    Backpressure,    │
               │        WAL)         │
               └──────────┬──────────┘
                          │
               ┌──────────▼──────────┐
               │     Sinks Layer     │
               │   Apache Iceberg    │
               │     Delta Lake      │
               └─────────────────────┘
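
The batching and backpressure stages in the diagram can be illustrated with a bounded queue between sources and the engine. The following is a minimal sketch using only the standard library, not the platform's actual implementation: the `Event` struct, the `Submit` enum, and the functions `try_submit` and `run_batcher` are hypothetical names introduced for illustration, and the real engine is built on Tokio rather than `std::sync::mpsc`.

```rust
use std::sync::mpsc::{Receiver, SyncSender, TrySendError};

/// Hypothetical event type; the platform's real event struct is not shown in this README.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub payload: Vec<u8>,
}

/// Outcome of a non-blocking submit.
#[derive(Debug, PartialEq)]
pub enum Submit {
    Accepted,
    /// The queue is full: the source should slow down or retry (backpressure signal).
    Busy,
    /// The batching side has shut down.
    Closed,
}

/// Try to enqueue without blocking, so a source can react to a full queue.
pub fn try_submit(tx: &SyncSender<Event>, event: Event) -> Submit {
    match tx.try_send(event) {
        Ok(()) => Submit::Accepted,
        Err(TrySendError::Full(_)) => Submit::Busy,
        Err(TrySendError::Disconnected(_)) => Submit::Closed,
    }
}

/// Drain the queue into batches of at most `max_batch` events and pass each batch to `sink`.
/// Runs until every sender is dropped, then flushes the remainder and returns the batch count.
pub fn run_batcher<F: FnMut(Vec<Event>)>(rx: Receiver<Event>, max_batch: usize, mut sink: F) -> usize {
    assert!(max_batch > 0, "max_batch must be at least 1");
    let mut batches = 0;
    let mut batch = Vec::with_capacity(max_batch);
    for event in rx {
        batch.push(event);
        if batch.len() == max_batch {
            // Hand off the full batch and start a fresh one.
            sink(std::mem::replace(&mut batch, Vec::with_capacity(max_batch)));
            batches += 1;
        }
    }
    if !batch.is_empty() {
        sink(batch);
        batches += 1;
    }
    batches
}
```

The queue capacity passed to `std::sync::mpsc::sync_channel` bounds how many events can be in flight; once it is reached, `try_submit` reports `Busy` instead of letting memory grow without limit. A production batcher would typically also flush on a timer so that small batches are not held indefinitely.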
# Clone the repository
git clone https://github.com/revitalyr/stream-data-platform.git
cd stream-data-platform
# Build the project
cargo build --release
# Create default configuration
./target/release/ingestion-server validate

Edit config.toml to configure sources, sinks, and other settings.
- POST /events: Ingest a single event
- POST /events/batch: Ingest multiple events
- GET /health: Health check
- GET /status: Server status
- GET /metrics: Performance metrics
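
As an illustration of calling the single-event endpoint, here is a minimal client sketch using only the standard library. It rests on several assumptions that this README does not confirm: that `POST /events` accepts a JSON body, that the server speaks plain HTTP/1.1, and that it listens on whatever address is set in config.toml (the `127.0.0.1:8080` mentioned in the comment is a placeholder). The function names `build_ingest_request` and `post_event` are hypothetical.

```rust
use std::io::{self, Read, Write};
use std::net::TcpStream;

/// Build a raw HTTP/1.1 request for `POST /events`.
/// ASSUMPTION: the endpoint accepts a JSON body; the event schema is not documented here.
pub fn build_ingest_request(host: &str, json_body: &str) -> String {
    format!(
        "POST /events HTTP/1.1\r\nHost: {}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        host,
        json_body.len(), // byte length, as Content-Length requires
        json_body
    )
}

/// Send one event and return the raw response text (status line, headers, body).
/// `addr` is a placeholder such as "127.0.0.1:8080"; use the address from your config.toml.
pub fn post_event(addr: &str, json_body: &str) -> io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(build_ingest_request(addr, json_body).as_bytes())?;
    let mut response = String::new();
    // `Connection: close` asks the server to close the socket, which ends this read.
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

In practice an HTTP client crate would replace the hand-written request; the raw form is used here only to keep the sketch dependency-free.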
The server exposes a gRPC API defined in crates/api/proto/ingestion.proto with methods:
- IngestEvent
- IngestBatch
- GetStatus
- GetMetrics
The system is designed for:
- Throughput: Millions of events per second
- Latency: Sub-second processing
- Durability: WAL ensures no data loss
- Scalability: Horizontal scaling with multiple instances
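
To make the durability goal concrete, the idea behind a write-ahead log can be sketched as follows: each record is appended and synced to disk before it is treated as accepted, and on restart the log is replayed. This is an illustrative sketch under stated assumptions, not the format used by the `wal` crate: the length-prefixed record layout and the function names `wal_append` and `wal_replay` are hypothetical.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, BufReader, Read, Write};
use std::path::Path;

/// Append one record as [u32 little-endian length][payload] and sync it to disk.
/// ASSUMPTION: this framing is illustrative; the real WAL format may differ.
pub fn wal_append(path: &Path, payload: &[u8]) -> io::Result<()> {
    if payload.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "record too large"));
    }
    // Build the whole record first so it is written with a single call.
    let mut record = Vec::with_capacity(4 + payload.len());
    record.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    record.extend_from_slice(payload);

    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(&record)?;
    // Only after this returns should the event be acknowledged to the producer.
    file.sync_data()
}

/// Replay every complete record. A truncated tail (for example after a crash
/// in the middle of a write) is ignored rather than treated as an error.
pub fn wal_replay(path: &Path) -> io::Result<Vec<Vec<u8>>> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut records = Vec::new();
    loop {
        let mut len_buf = [0u8; 4];
        match reader.read_exact(&mut len_buf) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
        let mut payload = vec![0u8; u32::from_le_bytes(len_buf) as usize];
        match reader.read_exact(&mut payload) {
            Ok(()) => records.push(payload),
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
    }
    Ok(records)
}
```

The sketch omits checksums, segment rotation, and truncation after a successful sink write, all of which a production WAL would normally need.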
# Development build
cargo build
# Release build with optimizations
cargo build --release
# Run tests
cargo test
# Run benchmarks
cargo bench

stream-data-platform/
├── crates/
│ ├── ingestion-core/ # Core ingestion engine
│ ├── wal/ # Write-ahead log implementation
│ └── ... # Additional crates
├── bin/
│ └── ingestion-server/ # Main binary application
├── benches/ # Performance benchmarks
├── tests/ # Integration tests
├── docs/ # Documentation
└── examples/ # Usage examples
# Run unit tests
cargo test
# Run integration tests
cargo test --test integration
# Run benchmarks
cargo bench
# Validate configuration
./target/release/ingestion-server validate

The system exposes Prometheus metrics on port 9090 (configurable):
- events_received_total: Total events received
- events_processed_total: Total events processed
- batches_processed_total: Total batches processed
- batch_size_bytes: Batch size distribution
- sink_write_duration_seconds: Sink write latency
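
The counters above can be read from a scrape of the metrics endpoint. Below is a minimal sketch that parses unlabeled samples from Prometheus text exposition output and derives a rough in-flight estimate. The function names `parse_simple_metrics` and `backlog` are hypothetical, and treating received minus processed as a backlog is an assumption: it ignores events that were rejected or dropped.

```rust
use std::collections::HashMap;

/// Parse unlabeled samples ("name value") from Prometheus text exposition output.
/// Comment lines (# HELP, # TYPE) and labeled series such as histogram buckets are skipped.
pub fn parse_simple_metrics(body: &str) -> HashMap<String, f64> {
    let mut out = HashMap::new();
    for line in body.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') || line.contains('{') {
            continue;
        }
        let mut parts = line.split_whitespace();
        if let (Some(name), Some(value)) = (parts.next(), parts.next()) {
            if let Ok(v) = value.parse::<f64>() {
                out.insert(name.to_string(), v);
            }
        }
    }
    out
}

/// Rough estimate of events received but not yet processed.
/// ASSUMPTION: every received event is eventually counted as processed.
pub fn backlog(metrics: &HashMap<String, f64>) -> Option<f64> {
    Some(metrics.get("events_received_total")? - metrics.get("events_processed_total")?)
}
```

`backlog` returns `None` when either counter is missing from the scrape, which lets a caller distinguish "no data" from a backlog of zero.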
streaming, data-ingestion, rust, high-performance, fault-tolerant, real-time, kafka, lakehouse, apache-iceberg, delta-lake, grpc, http-api, write-ahead-log, backpressure, batch-processing, enterprise-grade, production-ready, scalable, observability, metrics, monitoring
- Repository: https://github.com/revitalyr/stream-data-platform
- Language: Rust
- License: MIT OR Apache-2.0
- Version: 0.1.0
- Authors: Streaming Data Team
We welcome contributions! Please see our Contributing Guide for details.
MIT License - see LICENSE file for details.
For questions and support:
- Create an issue on GitHub
- Check our Documentation
- Join our community discussions