Commit ae334f9

Author: Ricardo Decal
Commit message: add convert_arrow_to_parquet_streaming.py to tools
1 parent efb0ec9 commit ae334f9

10 files changed

Lines changed: 233 additions & 315 deletions

.github/copilot-instructions.md

Lines changed: 0 additions & 160 deletions
This file was deleted.

README.md

Lines changed: 36 additions & 0 deletions
@@ -13,6 +13,42 @@ Inspired by [Simon Willison's tools collection](https://github.com/simonw/tools)
 
 ## Available Tools
 
+### [convert_arrow_to_parquet_streaming.py](python/convert_arrow_to_parquet_streaming.py)
+
+Output of `uv run https://tools.ricardodecal.com/python/convert_arrow_to_parquet_streaming.py --help`:
+
+```text
+Usage: convert_arrow_to_parquet_streaming.py [OPTIONS]
+
+Convert Arrow shards to Parquet.
+
+- Discovers all .arrow files under a given source directory - Converts each
+file to Parquet - Uses streaming in order to keep memory bounded and convert
+files larger than available RAM - Handles both Arrow IPC File and Stream
+formats (tries file, falls back to stream)
+
+Notes: - Use `--preserve-subdirs` to mirror the input directory tree under
+the output dir. - Use `--overwrite` to re-create files; otherwise existing
+outputs are skipped.
+
+Arguments:
+SOURCE_DIR: Directory containing .arrow files.
+OUTPUT_DIR: Directory to write .parquet files.
+
+Examples:
+
+uv run https://tools.ricardodecal.com/python/convert_arrow_to_parquet_streaming.py --source-dir ./arrow_data --output-dir ./parquet_data
+uv run https://tools.ricardodecal.com/python/convert_arrow_to_parquet_streaming.py --source-dir ./arrow_data --output-dir ./parquet_data --preserve-subdirs
+
+Options:
+--source-dir DIRECTORY Directory containing .arrow files [required]
+--output-dir DIRECTORY Directory to write .parquet files
+--overwrite Overwrite existing parquet files
+--preserve-subdirs Preserve input subdirectory structure inside output
+dir
+--help Show this message and exit.
+```
+
 ### [yt_transcript.py](python/yt_transcript.py)
 
 Output of `uv run https://tools.ricardodecal.com/python/yt_transcript.py --help`:
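
The help text above describes three path-handling behaviours: discovering `.arrow` files under the source directory, mirroring the input tree with `--preserve-subdirs`, and skipping existing outputs unless `--overwrite` is given. The script itself is not shown in this diff, so the following is only a minimal sketch of how that planning step might look. The function name `plan_conversions`, the recursive search, and the flattening of outputs when subdirectories are not preserved are all assumptions, not the tool's confirmed implementation.

```python
from pathlib import Path


def plan_conversions(
    source_dir: Path,
    output_dir: Path,
    preserve_subdirs: bool = False,
    overwrite: bool = False,
) -> list[tuple[Path, Path]]:
    """Pair each .arrow file under source_dir with the .parquet path to write.

    Hypothetical helper: the name and exact behaviour are assumptions made
    for illustration, based only on the --help text.
    """
    jobs: list[tuple[Path, Path]] = []
    # Assumption: discovery is recursive. Sorting keeps the order deterministic.
    for src in sorted(source_dir.rglob("*.arrow")):
        if preserve_subdirs:
            # Mirror the input tree: a/b/x.arrow -> <output_dir>/a/b/x.parquet
            relative = src.relative_to(source_dir).with_suffix(".parquet")
        else:
            # Assumption: without the flag, outputs land directly in output_dir.
            relative = Path(src.with_suffix(".parquet").name)
        dst = output_dir / relative
        if dst.exists() and not overwrite:
            continue  # existing outputs are skipped unless overwrite is set
        jobs.append((src, dst))
    return jobs
```

Under the flattening assumption, two shards with the same file name in different subdirectories would map to the same output path, which is one reason a mirroring option is useful.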

index.html

Lines changed: 32 additions & 0 deletions
@@ -170,6 +170,38 @@ <h1>Ricardo Decal's Tools</h1>
 <p>This is an experiment in low-stakes vibe coding. The code lives in <a href="https://github.com/crypdick/tools"><code>crypdick/tools</code></a>.</p>
 <p>Inspired by <a href="https://github.com/simonw/tools">Simon Willison's tools collection</a>.</p>
 <h2>Available Tools</h2>
+<h3><a href="python/convert_arrow_to_parquet_streaming.py">convert_arrow_to_parquet_streaming.py</a></h3>
+<p>Output of <code>uv run https://tools.ricardodecal.com/python/convert_arrow_to_parquet_streaming.py --help</code>:</p>
+<pre><code class="language-text">Usage: convert_arrow_to_parquet_streaming.py [OPTIONS]
+
+Convert Arrow shards to Parquet.
+
+- Discovers all .arrow files under a given source directory - Converts each
+file to Parquet - Uses streaming in order to keep memory bounded and convert
+files larger than available RAM - Handles both Arrow IPC File and Stream
+formats (tries file, falls back to stream)
+
+Notes: - Use `--preserve-subdirs` to mirror the input directory tree under
+the output dir. - Use `--overwrite` to re-create files; otherwise existing
+outputs are skipped.
+
+Arguments:
+SOURCE_DIR: Directory containing .arrow files.
+OUTPUT_DIR: Directory to write .parquet files.
+
+Examples:
+
+uv run https://tools.ricardodecal.com/python/convert_arrow_to_parquet_streaming.py --source-dir ./arrow_data --output-dir ./parquet_data
+uv run https://tools.ricardodecal.com/python/convert_arrow_to_parquet_streaming.py --source-dir ./arrow_data --output-dir ./parquet_data --preserve-subdirs
+
+Options:
+--source-dir DIRECTORY Directory containing .arrow files [required]
+--output-dir DIRECTORY Directory to write .parquet files
+--overwrite Overwrite existing parquet files
+--preserve-subdirs Preserve input subdirectory structure inside output
+dir
+--help Show this message and exit.
+</code></pre>
 <h3><a href="python/yt_transcript.py">yt_transcript.py</a></h3>
 <p>Output of <code>uv run https://tools.ricardodecal.com/python/yt_transcript.py --help</code>:</p>
 <pre><code class="language-text">Usage: yt_transcript.py [OPTIONS] URL [OUTPUT_FILE]

pyproject.toml

Lines changed: 0 additions & 14 deletions
@@ -35,17 +35,6 @@ no_implicit_optional = true
 show_error_codes = true
 show_column_numbers = true
 
-# Exclude test files from strict checking
-[[tool.mypy.overrides]]
-module = "tests.*"
-ignore_errors = true
-
-[tool.pytest.ini_options]
-testpaths = ["tests"]
-python_files = "test_*.py"
-python_functions = "test_*"
-addopts = "-v --strict-markers"
-
 [tool.ruff]
 target-version = "py312"
 
@@ -64,8 +53,5 @@ select = [
 ]
 ignore = ["E501"]
 
-[tool.ruff.lint.per-file-ignores]
-"tests/*" = ["S101"] # Allow assert in tests
-
 [tool.ruff.lint.isort]
 known-first-party = ["tools"]

python/README.md

Lines changed: 9 additions & 46 deletions
@@ -12,21 +12,16 @@ uv run https://tools.ricardodecal.com/python/foo.py [args]
 uv run python/foo.py [args]
 ```
 
-## Available Tools
-
-- `yt_transcript.py`: Fetch YouTube transcripts for a single video or a whole playlist into a single file.
-
----
-
 ## Creating Tools
 
 ### Required Structure
 
 **CRITICAL:** The tool's `--help` output is used to **auto-generate the main README.md and the website**.
 
 - Ensure your tool supports `--help` (standard with `click`).
-- The docstring of your main command function (decorated with `@click.command`) provides the help description.
-- Keep the first line as a clear, concise summary.
+- **Module Docstring**: Keep the top-level module docstring to a single sentence summary.
+- **Main Docstring**: The docstring of your main command function (decorated with `@click.command`) provides the full help description.
+- Keep the first line of the main docstring as a clear, concise summary.
 - Manually document `@click.argument` arguments in an `Arguments:` section (Click does not auto-document them).
 - Include concrete `Examples:` section showing exactly how to run it.
 - Do not use generic `Usage:` placeholders.
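
The docstring conventions in this hunk are mechanical enough to check automatically. The sketch below uses the standard-library `ast` module to do so. It is not part of this commit: `check_docstrings` is a hypothetical helper, the default command name `main` is an assumption taken from the template shown later in this file, and "single sentence" is approximated as "single line".

```python
import ast


def check_docstrings(source: str, command_name: str = "main") -> list[str]:
    """Return the convention problems found in a tool's source text.

    Hypothetical helper for illustration; not part of the repository.
    """
    problems: list[str] = []
    tree = ast.parse(source)

    # Module docstring: approximated here as "exactly one line".
    module_doc = ast.get_docstring(tree)
    if not module_doc:
        problems.append("missing module docstring")
    elif len(module_doc.strip().splitlines()) > 1:
        problems.append("module docstring should be a one-line summary")

    # Main docstring: look for a top-level function with the expected name.
    func = next(
        (
            node
            for node in tree.body
            if isinstance(node, ast.FunctionDef) and node.name == command_name
        ),
        None,
    )
    main_doc = ast.get_docstring(func) if func else None
    if not main_doc:
        problems.append(f"missing docstring on {command_name}()")
        return problems
    if "Examples:" not in main_doc:
        problems.append("main docstring has no Examples: section")
    if "Usage:" in main_doc:
        problems.append("main docstring contains a generic Usage: placeholder")
    return problems
```

A check like this could be run over each file in `python/` before regenerating the README, for example with `check_docstrings(Path("python/foo.py").read_text())`.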
@@ -52,6 +47,9 @@ def main(name: str, output: str | None) -> None:
 """
 Tool description that will be used to auto-generate the main README.md and the website.
 
+This docstring is the source of truth for the tool's documentation.
+It should include a detailed description, arguments list, and usage examples.
+
 Arguments:
 NAME: The name of the person to greet.
 
@@ -138,40 +136,6 @@ for item in track(items, description="Processing..."):
 
 ---
 
-## Testing
-
-Create `tests/test_your_tool.py`:
-
-```python
-from pathlib import Path
-import subprocess
-
-TOOL = Path(__file__).parent.parent / "python" / "foo.py"
-
-def run_tool(*args):
-    return subprocess.run(
-        ["uv", "run", str(TOOL), *args],
-        capture_output=True,
-        text=True,
-        check=False,
-    )
-
-def test_help():
-    result = run_tool("--help")
-    assert result.returncode == 0
-    assert "Usage:" in result.stdout
-
-def test_basic():
-    result = run_tool("arg")
-    assert result.returncode == 0
-```
-
-Run: `uv run pytest tests/test_your_tool.py -v`
-
-See [tests/README.md](../tests/README.md) for details.
-
----
-
 ## Common Patterns
 
 ### File I/O
@@ -258,12 +222,11 @@ uv add --script python/foo.py click requests
 # Test
 uv run python/foo.py --help
 
-# Add tests
-touch tests/test_my_tool.py
-uv run pytest tests/test_my_tool.py -v
+# Bump dependencies to latest versions
+uv remove --script python/foo.py click requests && uv add --script python/foo.py click requests
 
 # Commit
-git add python/foo.py tests/test_my_tool.py
+git add python/foo.py
 git commit -m "Add foo: description"
 
 # Test from URL after push
