Stream Data Platform

A high-performance, fault-tolerant streaming data ingestion platform built in Rust. Designed for processing millions of events per second with sub-second latency, supporting multiple data sources and lakehouse formats.

Overview

The Stream Data Platform is a production-ready, enterprise-grade data ingestion system that provides:

  • Extreme Performance: Optimized for high-throughput, low-latency data processing
  • Fault Tolerance: Built-in recovery mechanisms and data durability guarantees
  • Scalability: Horizontal scaling with distributed architecture
  • Flexibility: Support for multiple data sources, formats, and sinks
  • Observability: Comprehensive monitoring and metrics

Key Features

Core Capabilities

  • High-throughput Processing: Millions of events per second
  • Sub-second Latency: Optimized for real-time data processing
  • Fault Tolerance: Write-ahead logging and recovery mechanisms
  • Backpressure Management: Intelligent flow control and batching
  • Multi-format Support: JSON, CSV, Parquet, and custom formats
  • API-first Design: HTTP and gRPC endpoints for data ingestion

Data Sources

  • HTTP/gRPC APIs: RESTful and streaming interfaces
  • Apache Kafka: High-throughput message queue integration
  • Change Data Capture (CDC): Database change streams
  • File Sources: Batch and streaming file processing

Data Sinks

  • Apache Iceberg: Lakehouse table format
  • Delta Lake: ACID-compliant lakehouse format
  • File Systems: Local and distributed file storage
  • Object Storage: S3, GCS, Azure Blob Storage

Architecture Highlights

  • Modular Design: Separate crates for core functionality
  • Async/Non-blocking: Built on Tokio for high concurrency
  • Zero-copy Optimizations: Minimized memory allocations
  • Production Ready: Comprehensive monitoring and metrics

Architecture

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Kafka     │    │   CDC       │    │  HTTP/gRPC  │
│   Source    │    │   Source    │    │   APIs      │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
               ┌──────────▼──────────┐
               │  Ingestion Engine   │
               │  (Batching,         │
               │   Backpressure,     │
               │   WAL)              │
               └──────────┬──────────┘
                          │
               ┌──────────▼──────────┐
               │     Sinks Layer     │
               │  Apache Iceberg     │
               │  Delta Lake         │
               └─────────────────────┘

Quick Start

Installation

# Clone the repository
git clone https://github.com/revitalyr/stream-data-platform.git
cd stream-data-platform

# Build the project
cargo build --release

# Validate the configuration
./target/release/ingestion-server validate
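
# Start the server. Assumption: the binary reads config.toml from the
# working directory; the repository does not document the exact invocation,
# so adjust this if the server expects a flag or a different path.
./target/release/ingestion-server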

Configuration

Edit config.toml to configure sources, sinks, and other settings:
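
The exact schema is defined by the crates in this repository, so the sketch below is only a hypothetical starting point: every section and key name is an assumption, not a documented default. Write it out, adjust it, and check it with the validate subcommand:

# Hypothetical config sketch -- all section and key names are assumptions;
# consult the crate documentation for the real schema.
cat > config.toml <<'EOF'
[server]
http_port = 8080        # HTTP ingestion API
grpc_port = 50051       # gRPC ingestion API

[sources.kafka]
brokers = ["localhost:9092"]
topics = ["events"]

[sinks.iceberg]
warehouse = "s3://my-bucket/warehouse"

[wal]
dir = "./wal"

[metrics]
port = 9090             # Prometheus metrics endpoint
EOF

# Check the file
./target/release/ingestion-server validate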

API Endpoints

HTTP API

  • POST /events - Ingest a single event
  • POST /events/batch - Ingest multiple events
  • GET /health - Health check
  • GET /status - Server status
  • GET /metrics - Performance metrics
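
A few example requests are shown below. The listen port (8080) and the JSON event shape are assumptions, not documented values; adjust them to match your config.toml:

# Hypothetical requests -- port 8080 and the JSON payload shape are assumptions.
# Ingest a single event
curl -X POST http://localhost:8080/events \
  -H 'Content-Type: application/json' \
  -d '{"source": "demo", "payload": {"user_id": 42, "action": "click"}}'

# Ingest a batch of events
curl -X POST http://localhost:8080/events/batch \
  -H 'Content-Type: application/json' \
  -d '[{"source": "demo", "payload": {"n": 1}},
       {"source": "demo", "payload": {"n": 2}}]'

# Liveness check
curl http://localhost:8080/health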

gRPC API

The server exposes a gRPC API defined in crates/api/proto/ingestion.proto with methods:

  • IngestEvent
  • IngestBatch
  • GetStatus
  • GetMetrics
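
With a generic client such as grpcurl you can exercise these methods directly. The port, the fully qualified service name, and the request fields below are assumptions; the authoritative definitions live in crates/api/proto/ingestion.proto:

# Hypothetical calls -- port 50051, the service name, and the request fields
# are assumptions; only the method names come from the proto file.
grpcurl -plaintext localhost:50051 ingestion.IngestionService/GetStatus

grpcurl -plaintext \
  -d '{"source": "demo", "payload": "aGVsbG8="}' \
  localhost:50051 ingestion.IngestionService/IngestEvent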

Performance

The system is designed for:

  • Throughput: Millions of events per second
  • Latency: Sub-second processing
  • Durability: Write-ahead logging protects acknowledged events across crashes and restarts
  • Scalability: Horizontal scaling with multiple instances

Development

Building

# Development build
cargo build

# Release build with optimizations
cargo build --release

# Run tests
cargo test

# Run benchmarks
cargo bench

Project Structure

stream-data-platform/
├── crates/
│   ├── ingestion-core/    # Core ingestion engine
│   ├── wal/               # Write-ahead log implementation
│   └── ...                # Additional crates
├── bin/
│   └── ingestion-server/  # Main binary application
├── benches/               # Performance benchmarks
├── tests/                 # Integration tests
├── docs/                  # Documentation
└── examples/              # Usage examples

Testing

# Run unit tests
cargo test

# Run integration tests
cargo test --test integration

# Run benchmarks
cargo bench

# Validate configuration
./target/release/ingestion-server validate

Monitoring

The system exposes Prometheus metrics on port 9090 (configurable):

  • events_received_total - Total events received
  • events_processed_total - Total events processed
  • batches_processed_total - Total batches processed
  • batch_size_bytes - Batch size distribution
  • sink_write_duration_seconds - Sink write latency
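
The endpoint serves the standard Prometheus text format, so any HTTP client can scrape it. The /metrics path is the usual Prometheus convention and an assumption here:

# Scrape the metrics endpoint (path is an assumption; port per config.toml)
curl -s http://localhost:9090/metrics | grep events_received_total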

Keywords

streaming, data-ingestion, rust, high-performance, fault-tolerant, real-time, kafka, lakehouse, apache-iceberg, delta-lake, grpc, http-api, write-ahead-log, backpressure, batch-processing, enterprise-grade, production-ready, scalable, observability, metrics, monitoring

Repository Information

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

MIT License - see LICENSE file for details.

Support

For questions and support:

  • Create an issue on GitHub
  • Check our Documentation
  • Join our community discussions
