Stream Data Platform

A high-performance, fault-tolerant streaming data ingestion platform built in Rust. Designed for processing millions of events per second with sub-second latency, supporting multiple data sources and lakehouse formats.

Overview

The Stream Data Platform is a production-ready, enterprise-grade data ingestion system that provides:

  • Extreme Performance: Optimized for high-throughput, low-latency data processing
  • Fault Tolerance: Built-in recovery mechanisms and data durability guarantees
  • Scalability: Horizontal scaling with distributed architecture
  • Flexibility: Support for multiple data sources, formats, and sinks
  • Observability: Comprehensive monitoring and metrics

Key Features

Core Capabilities

  • High-throughput Processing: Millions of events per second
  • Sub-second Latency: Optimized for real-time data processing
  • Fault Tolerance: Write-ahead logging and recovery mechanisms
  • Backpressure Management: Intelligent flow control and batching
  • Multi-format Support: JSON, CSV, Parquet, and custom formats
  • API-first Design: HTTP and gRPC endpoints for data ingestion

Data Sources

  • HTTP/gRPC APIs: RESTful and streaming interfaces
  • Apache Kafka: High-throughput message queue integration
  • Change Data Capture (CDC): Database change streams
  • File Sources: Batch and streaming file processing

Data Sinks

  • Apache Iceberg: Lakehouse table format
  • Delta Lake: ACID-compliant lakehouse format
  • File Systems: Local and distributed file storage
  • Object Storage: S3, GCS, Azure Blob Storage

Architecture Highlights

  • Modular Design: Separate crates for core functionality
  • Async/Non-blocking: Built on Tokio for high concurrency
  • Zero-copy Optimizations: Minimized memory allocations
  • Production Ready: Comprehensive monitoring and metrics

Architecture

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Kafka     │    │   CDC       │    │  HTTP/gRPC  │
│   Source    │    │   Source    │    │   APIs      │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
               ┌──────────▼──────────┐
               │  Ingestion Engine   │
               │  (Batching,         │
               │   Backpressure,     │
               │   WAL)              │
               └──────────┬──────────┘
                          │
               ┌──────────▼──────────┐
               │     Sinks Layer     │
               │  Apache Iceberg     │
               │  Delta Lake         │
               └─────────────────────┘

Quick Start

Installation

# Clone the repository
git clone https://github.com/revitalyr/stream-data-platform.git
cd stream-data-platform

# Build the project
cargo build --release

# Validate the configuration
./target/release/ingestion-server validate
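
# Start the server. Assumption: the binary reads config.toml from the
# working directory; the repository does not document the exact invocation,
# so adjust this if the server expects a flag or a different path.
./target/release/ingestion-server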

Configuration

Edit config.toml to configure sources, sinks, and other settings:
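
The exact schema is defined by the crates in this repository, so the sketch below is only a hypothetical starting point: every section and key name is an assumption, not a documented default. Write it out, adjust it, and check it with the validate subcommand:

# Hypothetical config sketch -- all section and key names are assumptions;
# consult the crate documentation for the real schema.
cat > config.toml <<'EOF'
[server]
http_port = 8080        # HTTP ingestion API
grpc_port = 50051       # gRPC ingestion API

[sources.kafka]
brokers = ["localhost:9092"]
topics = ["events"]

[sinks.iceberg]
warehouse = "s3://my-bucket/warehouse"

[wal]
dir = "./wal"

[metrics]
port = 9090             # Prometheus metrics endpoint
EOF

# Check the file
./target/release/ingestion-server validate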

API Endpoints

HTTP API

  • POST /events - Ingest a single event
  • POST /events/batch - Ingest multiple events
  • GET /health - Health check
  • GET /status - Server status
  • GET /metrics - Performance metrics
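
A few example requests are shown below. The listen port (8080) and the JSON event shape are assumptions, not documented values; adjust them to match your config.toml:

# Hypothetical requests -- port 8080 and the JSON payload shape are assumptions.
# Ingest a single event
curl -X POST http://localhost:8080/events \
  -H 'Content-Type: application/json' \
  -d '{"source": "demo", "payload": {"user_id": 42, "action": "click"}}'

# Ingest a batch of events
curl -X POST http://localhost:8080/events/batch \
  -H 'Content-Type: application/json' \
  -d '[{"source": "demo", "payload": {"n": 1}},
       {"source": "demo", "payload": {"n": 2}}]'

# Liveness check
curl http://localhost:8080/health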

gRPC API

The server exposes a gRPC API defined in crates/api/proto/ingestion.proto with methods:

  • IngestEvent
  • IngestBatch
  • GetStatus
  • GetMetrics
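
With a generic client such as grpcurl you can exercise these methods directly. The port, the fully qualified service name, and the request fields below are assumptions; the authoritative definitions live in crates/api/proto/ingestion.proto:

# Hypothetical calls -- port 50051, the service name, and the request fields
# are assumptions; only the method names come from the proto file.
grpcurl -plaintext localhost:50051 ingestion.IngestionService/GetStatus

grpcurl -plaintext \
  -d '{"source": "demo", "payload": "aGVsbG8="}' \
  localhost:50051 ingestion.IngestionService/IngestEvent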

Performance

The system is designed for:

  • Throughput: Millions of events per second
  • Latency: Sub-second processing
  • Durability: Write-ahead logging protects acknowledged events across crashes and restarts
  • Scalability: Horizontal scaling with multiple instances

Development

Building

# Development build
cargo build

# Release build with optimizations
cargo build --release

# Run tests
cargo test

# Run benchmarks
cargo bench

Project Structure

stream-data-platform/
├── crates/
│   ├── ingestion-core/    # Core ingestion engine
│   ├── wal/               # Write-ahead log implementation
│   └── ...                # Additional crates
├── bin/
│   └── ingestion-server/  # Main binary application
├── benches/               # Performance benchmarks
├── tests/                 # Integration tests
├── docs/                  # Documentation
└── examples/              # Usage examples

Testing

# Run unit tests
cargo test

# Run integration tests
cargo test --test integration

# Run benchmarks
cargo bench

# Validate configuration
./target/release/ingestion-server validate

Monitoring

The system exposes Prometheus metrics on port 9090 (configurable):

  • events_received_total - Total events received
  • events_processed_total - Total events processed
  • batches_processed_total - Total batches processed
  • batch_size_bytes - Batch size distribution
  • sink_write_duration_seconds - Sink write latency
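
The endpoint serves the standard Prometheus text format, so any HTTP client can scrape it. The /metrics path is the usual Prometheus convention and an assumption here:

# Scrape the metrics endpoint (path is an assumption; port per config.toml)
curl -s http://localhost:9090/metrics | grep events_received_total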

Keywords

streaming, data-ingestion, rust, high-performance, fault-tolerant, real-time, kafka, lakehouse, apache-iceberg, delta-lake, grpc, http-api, write-ahead-log, backpressure, batch-processing, enterprise-grade, production-ready, scalable, observability, metrics, monitoring

Repository Information

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

MIT License - see LICENSE file for details.

Support

For questions and support:

  • Create an issue on GitHub
  • Check our Documentation
  • Join our community discussions
