Commit 89fd977

chore(release): bump version to 0.2.3
feat(pipeline): add .fork() method for deep copying pipeline state
refactor(engine): enforce Singleton pattern and restrict Pipeline instantiation via factory only

- Implemented `.fork()` to allow pipeline branching without state mutation.
- Refactored `Engine` to act as a strict Singleton.
- `Pipeline()` constructor is now restricted; must use `Engine.ingest()`.
- Raises PermissionError on direct Pipeline instantiation.
1 parent 5065cef commit 89fd977
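
The three behaviors named in the commit message (a singleton engine, factory-only pipeline construction, and deep-copy forking) can be sketched in plain Python. This is only an illustration: the classes below are hypothetical stand-ins, the token-based guard and the `steps` list are assumptions, and none of it reflects how Phaeton's Rust-backed implementation is actually written.

```python
import copy
import threading


class Pipeline:
    """Hypothetical pipeline: an ordered list of (step, args) tuples."""

    # Private token that only the engine passes in; direct callers lack it.
    _FACTORY_TOKEN = object()

    def __init__(self, source, _token=None):
        if _token is not Pipeline._FACTORY_TOKEN:
            raise PermissionError("Use Engine.ingest() to create a Pipeline")
        self.source = source
        self.steps = []

    def prune(self, col):
        self.steps.append(("prune", {"col": col}))
        return self  # return self so calls can be chained

    def fork(self):
        # Deep copy so the branch's step list is independent of the parent.
        return copy.deepcopy(self)


class Engine:
    """Hypothetical singleton: every Engine() call returns one shared instance."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
        return cls._instance

    def ingest(self, source):
        # The only sanctioned way to build a Pipeline in this sketch.
        return Pipeline(source, _token=Pipeline._FACTORY_TOKEN)
```

In this sketch, `copy.deepcopy` rebuilds the object without calling `__init__`, so forking works even though direct construction is blocked. Steps added to a fork never appear on the pipeline it was forked from.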

4 files changed

Lines changed: 57 additions & 25 deletions

File tree

Cargo.lock

Lines changed: 1 addition & 1 deletion

Cargo.toml

Lines changed: 2 additions & 2 deletions
@@ -1,9 +1,9 @@
 [package]
 name = "phaeton"
-version = "0.2.2"
+version = "0.2.3"
 edition = "2021"
 authors = ["Zahraan Dzakii Tsaqiif <zahraandzakiits@gmail.com>"]
-description = "A high-performance Python library for preprocessing and sanitizing raw data streams, accelerated by Rust."
+description = "A high-performance preprocessing and ETL engine for sanitizing raw data streams, accelerated by Rust."
 license = "MIT"

 [lib]

README.md

Lines changed: 52 additions & 20 deletions
@@ -5,11 +5,11 @@
 [![Rust](https://img.shields.io/badge/built%20with-Rust-orange)](https://www.rust-lang.org/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

-> ⚠️ **Project Status:** Phaeton is currently in **Experimental Beta (v0.2.2)**.
-> The core streaming engine is functional, but the library is currently under limited maintenance due to the author's personal schedule.
+> ⚠️ **Project Status:** Phaeton is currently in **Experimental Beta (v0.2.3)**.
+> The core streaming engine is functional, but the library is currently under limited maintenance due to the author's personal schedule. As a result, some methods are not yet functional or are only placeholder stubs.


-**Phaeton** is a specialized, Rust-powered preprocessing engine designed to sanitize raw data streams before they reach your analytical environment.
+**Phaeton** is a specialized, Rust-powered preprocessing and ETL engine designed to sanitize raw data streams before they reach your analytical environment.

 It acts as the strictly typed **"Gatekeeper"** of your data pipeline. Unlike traditional DataFrame libraries that attempt to load entire datasets into RAM, Phaeton employs a **zero-copy streaming architecture**. It processes data chunk by chunk (filtering noise, fixing encodings, and standardizing formats), ensuring **O(1) memory complexity** relative to file size.
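
The chunk-by-chunk idea described in the README paragraph above can be illustrated with a short pure-Python sketch. It assumes a CSV input with an `email` column, and the function names are hypothetical; it demonstrates only the bounded-memory pattern (at most `chunk_size` rows held at once), not Phaeton's Rust engine or its zero-copy behavior.

```python
import csv
from itertools import islice


def stream_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def clean_stream(src, dst, chunk_size=25_000):
    """Copy rows from src to dst, dropping rows with an empty 'email' field.

    src and dst are open text file objects; returns the number of rows kept.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for chunk in stream_chunks(reader, chunk_size):
        # Only this chunk is in memory; earlier chunks are already written out.
        good = [row for row in chunk if row["email"].strip()]
        writer.writerows(good)
        kept += len(good)
    return kept
```

Because each chunk is written out before the next one is read, peak memory depends on `chunk_size` rather than on the size of the input file.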

@@ -50,29 +50,52 @@ We generated a **Chaos Dataset** containing **1 Million Rows** of mixed dirty da

 ---
 ## Usage Example
+Based on the features available in the current version.

 ```python
 import phaeton

 # 1. Initialize Engine (Auto-detect cores)
-engine = phaeton.Engine()
+eng = phaeton.Engine(workers=0, batch_size=25_000)

 # 2. Define Pipeline
-pipeline = (
-    engine.ingest("dirty_data.csv")
-    .prune(col="email")                          # Drop rows if email is empty
-    .discard("status", "BANNED", mode="exact")   # Filter specific values
-    .scrub("username", "trim")                   # Clean whitespace
-    .scrub("salary", "currency")                 # Parse "Rp 5.000" to number
-    .cast("salary", "int", clean=True)           # Safely cast to Integer
-    .fuzzyalign("city", ref=["Jakarta", "Bandung"], threshold=0.85)  # Fix typos
-    .quarantine("quarantine.csv")                # Save bad data here
-    .dump("clean_data.csv")                      # Save good data here
+
+# Base Pipeline
+base = (
+    eng.ingest("dirty_data.csv")
+    .prune(col="email")                  # Drop rows if email is empty
+    .prune(col="salary")                 # Drop rows if salary is empty
+    .scrub("username", "trim")           # Clean whitespace
+    .scrub("salary", "currency")         # Parse "Rp 5.000" to number
+    .cast("salary", "int", clean=True)   # Safely cast to Integer
+    .fuzzyalign("city",
+        ref=["Jakarta", "Bandung"],
+        threshold=0.85
+    )                                    # Fix typos
+)
+
+# 3. Pipeline branching using .fork() (Optional)
+
+# Pipeline 1: Keep all rows except those with status 'BANNED'
+p1 = (
+    base.fork()
+    .discard("status", "BANNED", mode="exact")    # Filter specific values (BANNED)
+    .quarantine("quarantine_1.csv")               # Save bad data here
+    .dump("clean_data_1.csv")                     # Save good data here
+)
+
+# Pipeline 2: Only rows with 'ACTIVE' status are kept
+p2 = (
+    base.fork()
+    .keep("status", "ACTIVE", mode="exact")       # Keep specific values (ACTIVE)
+    .quarantine("quarantine_output_2.csv")        # Save bad data here
+    .dump("cleaned_output_2.csv", format="csv")   # Save good data here
 )

-# 3. Execute
-stats = engine.exec(pipeline)
-print(f"Processed: {stats.processed}, Saved: {stats.saved}")
+# 4. Execute Both Pipelines in Parallel
+stats = eng.exec([p1, p2])
+print(f"Pipeline 1 = Processed: {stats[0].processed}, Saved: {stats[0].saved}")
+print(f"Pipeline 2 = Processed: {stats[1].processed}, Saved: {stats[1].saved}")
 ```

 ---
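
To make the `threshold=0.85` argument of `.fuzzyalign()` in the usage example more concrete, here is a rough standard-library sketch of threshold-based alignment against a reference list. It swaps in `difflib`'s similarity ratio, which is a different metric from the Jaro-Winkler noted in the roadmap, so scores will not match Phaeton's. The function name `fuzzy_align` is hypothetical.

```python
from difflib import get_close_matches


def fuzzy_align(value, ref, threshold=0.85):
    """Return the closest entry in ref, or value unchanged if none is close enough.

    Uses difflib's ratio as a stand-in similarity metric (not Jaro-Winkler).
    """
    matches = get_close_matches(value, ref, n=1, cutoff=threshold)
    return matches[0] if matches else value


cities = ["Jakarta", "Bandung"]
print(fuzzy_align("Jakrta", cities))    # typo close enough to "Jakarta"
print(fuzzy_align("Surabaya", cities))  # no close reference, left unchanged
```

Values that fall below the threshold are returned untouched here; a real pipeline might instead route them to a quarantine file.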
@@ -138,11 +161,19 @@ Methods to save the final results or handle rejected data.
 | `.quarantine(path)` | Saves rejected rows (with reasons) to a separate CSV file. |
 | `.dump(path, format)` | Saves clean data to `.csv`, `.parquet`, or `.json` formats. |

+#### Utility & Workflow
+Methods for branching a pipeline and previewing its data.
+
+| Method | Description |
+| :--- | :--- |
+| `.fork()` | Creates a deep copy of the current pipeline branch. Useful for splitting logic (e.g., saving to multiple formats or creating different clean levels) without rewriting steps. |
+| `.peek(n)` | Previews the first `n` rows. |
+
 ---

 ## Roadmap

-Phaeton is currently in **Beta (v0.2.2)**. Here is the status of our development pipeline:
+Phaeton is currently in **Beta (v0.2.3)**. Here is the status of our development:

 | Feature | Status | Implementation Notes |
 | :--- | :---: | :--- |
@@ -152,8 +183,9 @@ Phaeton is currently in **Beta (v0.2.2)**. Here is the status of our development
 | **Fuzzy Alignment** | ✅ Ready | Jaro-Winkler for typo correction |
 | **Quarantine System** | ✅ Ready | Full audit trail for rejected rows |
 | **Basic Text Scrubbing** | ✅ Ready | Trim, HTML strip, Case conversion |
-| **Header Normalization** | 🚧 In Progress | `snake_case`, `camelCase` conversions |
-| **Date Normalization** | 🚧 In Progress | Auto-detect & reformat dates |
+| **Inspector Engine** | 📝 Planned | Dedicated stream for data profiling (Read-Only) |
+| **Header Normalization** | 📝 Planned | `snake_case`, `camelCase` conversions |
+| **Date Normalization** | 📝 Planned | Auto-detect & reformat dates |
 | **Deduplication** | 📝 Planned | Row-level & Column-level dedupe |
 | **Hashing & Anonymization** | 📝 Planned | SHA-256 for PII data |
 | **Parquet/Arrow Support** | 📝 Planned | Native output integration |
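
The roadmap lists SHA-256 hashing for PII as planned, so no Phaeton API exists for it yet. As a rough idea of what such a step computes, here is a standard-library sketch; the function name `anonymize` and the optional `salt` parameter are assumptions made for illustration.

```python
import hashlib


def anonymize(value, salt=""):
    """Return the hex SHA-256 digest of salt + value (UTF-8 encoded)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# The same input always maps to the same 64-character hex string,
# so hashed columns can still be joined or deduplicated.
print(anonymize("user@example.com", salt="per-project-secret"))
```

An unsalted hash of a low-entropy field such as an email address can be reversed by brute force, which is why the sketch accepts a salt.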

pyproject.toml

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@ build-backend = "maturin"

 [project]
 name = "phaeton"
-version = "0.2.2"
-description = "A high-performance Python library for preprocessing and sanitizing raw data streams, accelerated by Rust."
+version = "0.2.3"
+description = "A high-performance preprocessing and ETL engine for sanitizing raw data streams, accelerated by Rust."
 readme = "README.md"
 license = {file = "LICENSE"}
 authors = [
