
Commit 05ea0bb

docs: update README and ARCHITECTURE to reflect current codebase (#97)
1 parent 3b2edf8 commit 05ea0bb

4 files changed: 168 additions & 79 deletions


ARCHITECTURE.md

Lines changed: 74 additions & 35 deletions
@@ -27,10 +27,11 @@ Roboflow is a distributed data transformation pipeline that converts robotics ba
 |-------|---------|-----------|
 | `roboflow-core` | Foundation types, error handling, registry | `RoboflowError`, `CodecValue`, `TypeRegistry` |
 | `roboflow-storage` | Storage abstraction layer | `Storage`, `LocalStorage`, `S3Storage`, `StorageFactory` |
-| `roboflow-dataset` | Dataset format writers | `LerobotWriter`, `DatasetWriter`, `ImageData` |
-| `roboflow-distributed` | Distributed coordination via TiKV | `TiKVClient`, `BatchController`, `Worker`, `Catalog` |
-| `roboflow-sources` | Data source implementations | `BagSource`, `McapSource`, `RrdSource` |
-| `roboflow-sinks` | Data sink implementations | `LerobotSink`, `ZarrSink`, `DatasetFrame` |
+| `roboflow-executor` | Stage-based task executor | `StageExecutor`, `Pipeline`, `ExecutionPolicy`, `SlotPool` |
+| `roboflow-media` | Image and video encoding/decoding | `ImageData`, `VideoEncoder`, `ConcurrentVideoEncoder` |
+| `roboflow-dataset` | Dataset format writers and sources | `LerobotWriter`, `DatasetWriter`, `Source`, `BagSource`, `McapSource` |
+| `roboflow-pipeline` | Pipeline execution and stages | `DatasetPipelineExecutor`, `DiscoverStage`, `ConvertStage`, `MergeStage` |
+| `roboflow-distributed` | Distributed coordination via TiKV | `TiKVClient`, `BatchController`, `Worker`, `Scanner`, `Finalizer` |
 
 ## Core Abstractions
 
@@ -55,21 +56,33 @@ trait SeekableStorage: Storage {
 - **S3**: AWS S3-compatible storage
 - **OSS**: Alibaba Cloud Object Storage
 
-### Pipeline Stages
+### Source/Sink Pattern
 
 ```rust
 trait Source: Send + Sync {
     async fn initialize(&mut self, config: &SourceConfig) -> SourceResult<SourceMetadata>;
     async fn read_batch(&mut self, size: usize) -> SourceResult<Option<Vec<TimestampedMessage>>>;
     async fn finalize(&mut self) -> SourceResult<SourceStats>;
 }
+```
+
+**Supported sources:**
+- **MCAP**: Streaming and memory-mapped reads
+- **ROS1 Bag**: Legacy bag format support
+- **RRD**: Rerun data format
+
+### Pipeline Stages
 
-trait Sink: Send + Sync {
-    async fn initialize(&mut self, config: &SinkConfig) -> SinkResult<()>;
-    async fn write_frame(&mut self, frame: DatasetFrame) -> SinkResult<()>;
-    async fn flush(&mut self) -> SinkResult<()>;
-    async fn finalize(&mut self) -> SinkResult<SinkStats>;
-    fn supports_checkpointing(&self) -> bool;
+```rust
+// Stage-based execution inspired by Spark
+pub struct DiscoverStage;
+pub struct ConvertStage;
+pub struct MergeStage;
+
+// Pipeline executor
+pub struct DatasetPipelineExecutor {
+    writer: Box<dyn DatasetWriter>,
+    config: DatasetPipelineConfig,
 }
 ```
 
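The `Source` lifecycle in the hunk above (initialize, then `read_batch` until it returns `None`, then finalize) can be illustrated with a simplified synchronous stand-in. Everything below is a sketch: the blocking trait, the `VecSource` type, and the `drain` loop are illustrative stand-ins, not the crate's actual async API.

```rust
// Illustrative stand-in for the async `Source` trait: same
// initialize -> read_batch -> finalize shape, but synchronous,
// with `TimestampedMessage` reduced to a (timestamp, payload) pair.
type TimestampedMessage = (u64, Vec<u8>);

trait BlockingSource {
    fn initialize(&mut self) -> Result<(), String>;
    /// Returns `None` once the source is exhausted.
    fn read_batch(&mut self, size: usize) -> Result<Option<Vec<TimestampedMessage>>, String>;
    /// Returns the number of messages consumed.
    fn finalize(&mut self) -> Result<usize, String>;
}

/// In-memory source used here purely for illustration.
struct VecSource {
    messages: Vec<TimestampedMessage>,
    cursor: usize,
}

impl BlockingSource for VecSource {
    fn initialize(&mut self) -> Result<(), String> {
        self.cursor = 0;
        Ok(())
    }

    fn read_batch(&mut self, size: usize) -> Result<Option<Vec<TimestampedMessage>>, String> {
        if self.cursor >= self.messages.len() {
            return Ok(None); // exhausted
        }
        let end = (self.cursor + size).min(self.messages.len());
        let batch = self.messages[self.cursor..end].to_vec();
        self.cursor = end;
        Ok(Some(batch))
    }

    fn finalize(&mut self) -> Result<usize, String> {
        Ok(self.cursor)
    }
}

// Drain a source batch-by-batch, the way a pipeline stage might.
fn drain(source: &mut dyn BlockingSource, batch_size: usize) -> Result<usize, String> {
    source.initialize()?;
    let mut total = 0;
    while let Some(batch) = source.read_batch(batch_size)? {
        total += batch.len();
    }
    source.finalize()?;
    Ok(total)
}
```

The `Option<Vec<_>>` return lets one method signal both "here is a batch" and "end of stream" without a sentinel value, which is the shape the real trait uses as well.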
@@ -117,18 +130,19 @@ The distributed system uses a Kubernetes-inspired design with TiKV as the contro
 | kubelet heartbeat | HeartbeatManager | Worker liveness |
 | Finalizers | Finalizer controller | Cleanup handling |
 | Job/CronJob | BatchSpec, WorkUnit | Work scheduling |
+| Scheduler | Scanner | File discovery and job creation |
 
 ### Batch State Machine
 
 ```
 ┌──────────┐    ┌─────────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
 │ Pending  │───▶│ Discovering │───▶│ Running  │───▶│ Merging  │───▶│ Complete │
 └──────────┘    └─────────────┘    └──────────┘    └──────────┘    └──────────┘
-
-
-┌──────────┐
-│  Failed  │
-└──────────┘
+
+
+┌──────────┐
+│  Failed  │
+└──────────┘
 ```
 
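A hedged Rust sketch of the state machine diagrammed above: the enum variants mirror the diagram, but the `can_transition_to` method and the rule that any non-terminal state may drop to `Failed` are assumptions for illustration, not the crate's actual API.

```rust
// Batch states from the diagram; transition rules are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BatchState {
    Pending,
    Discovering,
    Running,
    Merging,
    Complete,
    Failed,
}

impl BatchState {
    /// Whether `next` is a legal transition from `self`.
    /// Assumption: any non-terminal state may also fail.
    pub fn can_transition_to(self, next: BatchState) -> bool {
        use BatchState::*;
        match (self, next) {
            // the happy path, left to right in the diagram
            (Pending, Discovering)
            | (Discovering, Running)
            | (Running, Merging)
            | (Merging, Complete) => true,
            // terminal states never transition
            (Complete, _) | (Failed, _) => false,
            // everything else may drop to Failed
            (_, Failed) => true,
            _ => false,
        }
    }
}
```

Encoding the transitions in one `match` makes illegal jumps (say, `Pending` straight to `Merging`) unrepresentable at the controller level rather than a runtime surprise.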
 ### TiKV Key Structure
@@ -142,6 +156,27 @@ roboflow/worker/{pod_id}/lock → LockRecord
 roboflow/worker/{pod_id}/checkpoint → CheckpointState
 ```
 
+## CLI Commands
+
+The unified `roboflow` binary provides all operations:
+
+```bash
+# Run unified service (default: all roles)
+roboflow run
+
+# Run specific roles
+roboflow run --role worker
+roboflow run --role finalizer
+
+# Job management
+roboflow submit s3://bucket/file.bag --output s3://bucket/out/
+roboflow jobs list
+roboflow batch list
+
+# Health check
+roboflow health
+```
+
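The TiKV key layout shown at the top of this hunk (`roboflow/worker/{pod_id}/lock`, `roboflow/worker/{pod_id}/checkpoint`) amounts to simple string templates. A minimal sketch, with hypothetical helper names that the crate may not actually expose:

```rust
// Hypothetical key builders matching the documented layout.
fn worker_lock_key(pod_id: &str) -> String {
    format!("roboflow/worker/{}/lock", pod_id)
}

fn worker_checkpoint_key(pod_id: &str) -> String {
    format!("roboflow/worker/{}/checkpoint", pod_id)
}
```

Sharing a `roboflow/worker/{pod_id}/` prefix means all of one worker's records can be fetched with a single TiKV range scan over that prefix.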
 ## Dataset Writing
 
 ### LeRobot Format
@@ -151,23 +186,30 @@ struct LerobotConfig {
     pub dataset: DatasetConfig,
     pub mappings: Vec<Mapping>,
     pub video: VideoConfig,
-    pub flushing: FlushingConfig, // Incremental flushing
+    pub flushing: FlushingConfig,
+    pub streaming: StreamingConfig,
 }
 
-struct FlushingConfig {
-    pub max_frames_per_chunk: usize, // Default: 1000
-    pub max_memory_bytes: usize, // Default: 2GB
-    pub incremental_video_encoding: bool,
+struct StreamingConfig {
+    pub finalize_metadata_in_coordinator: bool,
 }
 ```
 
-### Incremental Flushing
+### Video Encoding
 
-To prevent OOM on long recordings, the writer processes data in chunks:
+```rust
+// Concurrent video encoder for parallel chunk encoding
+pub struct ConcurrentVideoEncoder {
+    config: ConcurrentEncoderConfig,
+}
 
-1. **Frame-based**: Flush after N frames (configurable, default 1000)
-2. **Memory-based**: Flush when memory exceeds threshold (default 2GB)
-3. **Output structure**: `data/chunk-000/`, `data/chunk-001/`, etc.
+pub struct ConcurrentEncoderConfig {
+    pub storage: Arc<dyn Storage>,
+    pub key_prefix: String,
+    pub codec: VideoCodec,
+    pub crf: u8,
+}
+```
 
 ### Upload Coordinator
 
@@ -176,7 +218,6 @@ struct EpisodeUploadCoordinator {
     pub storage: Arc<dyn Storage>,
     pub config: UploadConfig,
     pub progress: Option<UploadProgress>,
-    // Worker pool for parallel uploads
 }
 
 struct UploadConfig {
@@ -213,7 +254,7 @@ let data = arena.alloc_vec::<u8>(size);
 
 ```toml
 [source]
-type = "mcap"  # or "bag", "rrd", "hdf5"
+type = "mcap"  # or "bag", "rrd"
 path = "s3://bucket/path/to/data.mcap"
 
 # Optional: topic filtering
@@ -325,17 +366,15 @@ enum CircuitState {
 
 | Flag | Purpose |
 |------|---------|
-| `distributed` | TiKV distributed coordination (always enabled) |
-| `dataset-hdf5` | HDF5 dataset format support |
-| `dataset-parquet` | Parquet dataset format support |
-| `cloud-storage` | S3/OSS cloud storage support |
-| `gpu` | GPU compression (Linux only) |
 | `jemalloc` | jemalloc allocator (Linux only) |
 | `cli` | CLI support for binaries |
+| `profiling` | Profiling support for profiler binary |
+| `cpuid` | CPU-aware WindowLog detection (x86_64 only) |
+| `io-uring-io` | io_uring support for Linux 5.6+ |
 
 ## See Also
 
 - `CLAUDE.md` - Developer guidelines and conventions
-- `tests/s3_pipeline_tests.rs` - Integration tests
-- `crates/roboflow-dataset/src/lerobot/` - Dataset writer implementation
+- `tests/` - Integration and E2E tests
+- `crates/roboflow-dataset/src/` - Dataset writer and source implementations
 - `crates/roboflow-distributed/src/` - Distributed coordination

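The `enum CircuitState` named in the hunk header above suggests the usual Closed → Open → HalfOpen breaker around TiKV calls. A minimal sketch under that assumption; the threshold, field names, and methods here are illustrative, not the crate's actual API:

```rust
// Assumed breaker states; only the enum name appears in the doc.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

struct CircuitBreaker {
    state: CircuitState,
    consecutive_failures: u32,
    failure_threshold: u32,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32) -> Self {
        Self {
            state: CircuitState::Closed,
            consecutive_failures: 0,
            failure_threshold,
        }
    }

    /// Any success closes the breaker and resets the failure count.
    fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.state = CircuitState::Closed;
    }

    /// Trip to Open once failures reach the threshold.
    fn record_failure(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.failure_threshold {
            self.state = CircuitState::Open;
        }
    }

    /// After a cool-down, let one probe request through.
    fn begin_probe(&mut self) {
        if self.state == CircuitState::Open {
            self.state = CircuitState::HalfOpen;
        }
    }

    fn allows_request(&self) -> bool {
        self.state != CircuitState::Open
    }
}
```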
CLAUDE.md

Lines changed: 6 additions & 4 deletions
@@ -14,15 +14,17 @@ Roboflow: Distributed data transformation pipeline converting robotics bag/MCAP
 
 ## Workspace Structure
 
-The project uses a Cargo workspace with 5 crates:
+The project uses a Cargo workspace with 7 crates:
 
 | Crate | Purpose |
 |-------|---------|
 | `roboflow-core` | Error types, registry, values |
 | `roboflow-storage` | S3, OSS, Local storage (always available) |
-| `roboflow-dataset` | KPS, LeRobot, streaming converters |
-| `roboflow-distributed` | TiKV client, catalog, circuit breaker |
-| `roboflow-hdf5` | Optional HDF5 format support |
+| `roboflow-executor` | Stage-based task executor for distributed pipelines |
+| `roboflow-media` | Image and video encoding/decoding |
+| `roboflow-dataset` | Dataset writers, sources (MCAP, bag), streaming converters |
+| `roboflow-pipeline` | Pipeline execution and stages for dataset processing |
+| `roboflow-distributed` | TiKV client, catalog, circuit breaker, worker coordination |
 
 **Import patterns:**
 - Use facade re-exports from `roboflow`: `use roboflow::{Robocodec, DatasetWriter, ...}`

README.md

Lines changed: 44 additions & 20 deletions
@@ -2,6 +2,7 @@
 
 [![License: MulanPSL-2.0](https://img.shields.io/badge/License-MulanPSL--2.0-blue.svg)](http://license.coscl.org.cn/MulanPSL2)
 [![Rust](https://img.shields.io/badge/rust-1.80%2B-orange.svg)](https://www.rust-lang.org)
+[![codecov](https://codecov.io/gh/archebase/roboflow/branch/main/graph/badge.svg)](https://codecov.io/gh/archebase/roboflow)
 
 [English](README.md) | [简体中文](README_zh.md)
 
@@ -73,46 +74,52 @@ Roboflow uses a **Kubernetes-inspired distributed control plane** for fault-tole
 |-------|---------|
 | `roboflow-core` | Error types, registry, values |
 | `roboflow-storage` | S3, OSS, Local storage (always available) |
-| `roboflow-dataset` | KPS, LeRobot, streaming converters |
-| `roboflow-distributed` | TiKV client, catalog, circuit breaker |
-| `roboflow-hdf5` | Optional HDF5 format support |
+| `roboflow-executor` | Stage-based task executor for distributed pipelines |
+| `roboflow-media` | Image and video encoding/decoding for robotics datasets |
+| `roboflow-dataset` | KPS, LeRobot, streaming converters, data sources |
+| `roboflow-pipeline` | Pipeline execution and stages for dataset processing |
+| `roboflow-distributed` | TiKV client, catalog, circuit breaker, worker coordination |
 
 ## Quick Start
 
-### Submit a Conversion Job
-
-```bash
-roboflow submit \
-  --input s3://bucket/input.bag \
-  --output s3://bucket/output/ \
-  --config lerobot_config.toml
-```
-
-### Run a Worker
+### Run the Unified Service
 
 ```bash
+# Set environment variables
 export TIKV_PD_ENDPOINTS="127.0.0.1:2379"
 export AWS_ACCESS_KEY_ID="your-key"
 export AWS_SECRET_ACCESS_KEY="your-secret"
 
-roboflow worker
+# Run unified service (scanner + worker + finalizer + reaper)
+roboflow run
 ```
 
-### Run a Scanner
+### Run Specific Roles
 
 ```bash
-export SCANNER_INPUT_PREFIX="s3://bucket/input/"
-export SCANNER_OUTPUT_PREFIX="s3://bucket/jobs/"
+# Worker only - processes work units
+roboflow run --role worker
 
-roboflow scanner
+# Finalizer only - merges completed batches
+roboflow run --role finalizer
+
+# With custom pod ID
+roboflow run --pod-id my-pod-1
 ```
 
-### List Jobs
+### Submit a Conversion Job
+
+```bash
+roboflow submit s3://bucket/input.bag --output s3://bucket/output/
+```
+
+### Manage Jobs
 
 ```bash
 roboflow jobs list
 roboflow jobs get <job-id>
-roboflow jobs retry <job-id>
+roboflow batch list
+roboflow batch get <batch-id>
 ```
 
 ## Installation
@@ -166,6 +173,7 @@ encoding = "cdr"
 | `WORKER_POLL_INTERVAL_SECS` | Job poll interval | `5` |
 | `WORKER_MAX_CONCURRENT_JOBS` | Max concurrent jobs | `1` |
 | `SCANNER_SCAN_INTERVAL_SECS` | Scan interval | `60` |
+| `FINALIZER_POLL_INTERVAL_SECS` | Finalizer poll interval | `30` |
 
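Services presumably read these variables at startup and fall back to the documented defaults. A small sketch of defaulted parsing; the helper name and fallback-on-invalid behavior are assumptions, only the variable names and defaults come from the table:

```rust
use std::env;

// Read a numeric env var, falling back to `default` when the
// variable is unset or does not parse as an integer.
fn env_u64(name: &str, default: u64) -> u64 {
    env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

// Example: WORKER_POLL_INTERVAL_SECS would default to 5 when unset.
```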
 ## Development
 
@@ -188,6 +196,22 @@ cargo fmt
 cargo clippy --all-targets -- -D warnings
 ```
 
+### Development Infrastructure
+
+Start required services with Docker Compose:
+
+```bash
+docker compose up -d    # Start all services (MinIO, TiKV, PD)
+docker compose down     # Stop all services
+```
+
+**Services:**
+| Service | Purpose | Ports |
+|---------|---------|-------|
+| MinIO | S3-compatible object storage | 9000 (API), 9001 (Console) |
+| TiKV | Distributed KV storage | 20160 |
+| PD | TiKV placement driver | 2379, 2380 |
+
 ## Contributing
 
 See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.
