feat(pipeline): add .fork() method for deep copying pipeline state
refactor(engine): enforce Singleton pattern and restrict Pipeline instantiation via factory only
- Implemented `.fork()` to allow pipeline branching without state mutation.
- Refactored `Engine` to act as a strict Singleton.
- `Pipeline()` constructor is now restricted; must use `Engine.ingest()`.
- Raises PermissionError on direct Pipeline instantiation.
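
The restriction described above can be illustrated with a minimal pure-Python sketch. This is not Phaeton's actual implementation (the engine is Rust-powered and its internals are not shown in this commit). The `_FACTORY_TOKEN` guard, the `steps` attribute, and the lock are assumptions made for illustration. Only the names `Engine`, `Pipeline`, `Engine.ingest()`, and the `PermissionError` come from the commit message.

```python
import threading


class Pipeline:
    """Hypothetical stand-in for a factory-only Pipeline (illustration only)."""

    # Assumed mechanism: a private token that only Engine.ingest() passes in.
    _FACTORY_TOKEN = object()

    def __init__(self, source, _token=None):
        if _token is not Pipeline._FACTORY_TOKEN:
            raise PermissionError(
                "Pipeline cannot be instantiated directly; use Engine.ingest()"
            )
        self.source = source
        self.steps = []


class Engine:
    """Hypothetical Singleton: every Engine() call returns the same object."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # The lock keeps two threads from creating two instances at once.
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
        return cls._instance

    def ingest(self, source):
        # The only sanctioned way to obtain a Pipeline in this sketch.
        return Pipeline(source, _token=Pipeline._FACTORY_TOKEN)
```

With this shape, `Engine() is Engine()` holds, `Engine().ingest("dirty_data.csv")` returns a `Pipeline`, and calling `Pipeline("dirty_data.csv")` directly raises `PermissionError`.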
-> ⚠️ **Project Status:** Phaeton is currently in **Experimental Beta (v0.2.2)**.
-> The core streaming engine is functional, but the library is currently under limited maintenance due to the author's personal schedule.
+> ⚠️ **Project Status:** Phaeton is currently in **Experimental Beta (v0.2.3)**.
+> The core streaming engine is functional, but the library is currently under limited maintenance due to the author's personal schedule. So, some methods are still not working or are only dummy or mockup methods.


-**Phaeton** is a specialized, Rust-powered preprocessing engine designed to sanitize raw data streams before they reach your analytical environment.
+**Phaeton** is a specialized, Rust-powered preprocessing and ETL engine designed to sanitize raw data streams before they reach your analytical environment.

 It acts as the strictly typed **"Gatekeeper"** of your data pipeline. Unlike traditional DataFrame libraries that attempt to load entire datasets into RAM, Phaeton employs a **zero-copy streaming architecture**. It processes data chunk-by-chunk filtering noise, fixing encodings, and standardizing formats ensuring **O(1) memory complexity** relative to file size.

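
The chunk-by-chunk idea in the README paragraph above can be sketched in plain Python. This only illustrates bounded memory per chunk. It is not zero-copy and it is not how Phaeton's Rust engine works. The function names `stream_chunks` and `clean_stream`, the cleaning rules, and the sample data are all hypothetical.

```python
import csv
import io
from itertools import islice


def stream_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows; the full input is never held."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def clean_stream(source, sink, chunk_size=2):
    """Trim whitespace in every field and drop rows whose first field is empty."""
    reader = csv.reader(source)
    writer = csv.writer(sink, lineterminator="\n")
    kept = 0
    for chunk in stream_chunks(reader, chunk_size):
        cleaned = [[field.strip() for field in row] for row in chunk]
        cleaned = [row for row in cleaned if row and row[0]]
        writer.writerows(cleaned)  # each chunk is written out, then discarded
        kept += len(cleaned)
    return kept


# In-memory streams stand in for files so the sketch is self-contained.
raw = io.StringIO("email,name\n a@x.com , Ann \n,Bob\nc@x.com,Cy\n")
out = io.StringIO()
kept = clean_stream(raw, out)
```

Peak memory here depends on `chunk_size`, not on the length of the input, which is the property the README describes as O(1) relative to file size.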
@@ -50,29 +50,52 @@ We generated a **Chaos Dataset** containing **1 Million Rows** of mixed dirty da

 ---
 ## Usage Example
+Based on the features available in the current version.

 ```python
 import phaeton

 # 1. Initialize Engine (Auto-detect cores)
-engine= phaeton.Engine()
+eng= phaeton.Engine(workers=0, batch_size=25_000)

 # 2. Define Pipeline
-pipeline = (
-engine.ingest("dirty_data.csv")
-.prune(col="email") # Drop rows if email is empty
-.discard("status", "BANNED", mode="exact") # Filter specific values
-.scrub("username", "trim") # Clean whitespace
-.scrub("salary", "currency") # Parse "Rp 5.000" to number
-.cast("salary", "int", clean=True) # Safely cast to Integer
@@ -138,11 +161,19 @@ Methods to save the final results or handle rejected data.
 |`.quarantine(path)`| Saves rejected rows (with reasons) to a separate CSV file. |
 |`.dump(path, format)`| Saves clean data to `.csv`, `.parquet`, or `.json` formats. |

+#### Utility & Workflow
+Methods to save the final results or handle rejected data.
+
+| Method | Description |
+| :--- | :--- |
+|`.fork()`| Creates a deep copy of the current pipeline branch. Useful for splitting logic (e.g., saving to multiple formats or creating different clean levels) without rewriting steps. |
+|`.peek(n)`| Previews the first n rows. |
+
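
The branching behaviour that the `.fork()` and `.peek(n)` rows describe can be sketched as follows. `SketchPipeline`, `keep`, `apply`, and the sample rows are hypothetical and exist only for this illustration. Phaeton's real pipeline state and step methods are not shown in this diff, so the sketch assumes a pipeline is simply a source plus an ordered list of steps.

```python
import copy
from itertools import islice


class SketchPipeline:
    """Hypothetical pipeline that records steps and applies them lazily."""

    def __init__(self, rows):
        self.rows = rows   # source rows (a list of dicts in this sketch)
        self.steps = []    # ordered (kind, function) pairs

    def keep(self, predicate):
        self.steps.append(("filter", predicate))
        return self

    def apply(self, func):
        self.steps.append(("map", func))
        return self

    def fork(self):
        # Deep copy: the branch gets its own step list, so steps added to
        # the branch never appear on the original pipeline.
        return copy.deepcopy(self)

    def _run(self):
        for row in self.rows:
            row = dict(row)
            dropped = False
            for kind, func in self.steps:
                if kind == "filter":
                    if not func(row):
                        dropped = True
                        break
                else:
                    row = func(row)
            if not dropped:
                yield row

    def peek(self, n):
        # Only the first n surviving rows are computed.
        return list(islice(self._run(), n))


rows = [{"name": " Ann ", "status": "OK"}, {"name": "Bob", "status": "BANNED"}]
base = SketchPipeline(rows).keep(lambda r: r["status"] != "BANNED")
branch = base.fork().apply(lambda r: {**r, "name": r["name"].strip()})
```

After this runs, `base` still has one step and `branch` has two, so each branch can be dumped or cleaned further without affecting the other.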
 ---

 ## Roadmap

-Phaeton is currently in **Beta (v0.2.2)**. Here is the status of our development pipeline:
+Phaeton is currently in **Beta (v0.2.3)**. Here is the status of our development:

 | Feature | Status | Implementation Notes |
 | :--- | :---: | :--- |
@@ -152,8 +183,9 @@ Phaeton is currently in **Beta (v0.2.2)**. Here is the status of our development
 |**Fuzzy Alignment**| ✅ Ready | Jaro-Winkler for typo correction |
 |**Quarantine System**| ✅ Ready | Full audit trail for rejected rows |
 |**Basic Text Scrubbing**| ✅ Ready | Trim, HTML strip, Case conversion |
-|**Header Normalization**| 🚧 In Progress |`snake_case`, `camelCase` conversions |