a saving machine for the modern web.
paste a link, get the file. transcribe it. edit the transcript like a doc. all local, no accounts, no upload limits, no telemetry. self-hosted on your machine, powered by yt-dlp, ffmpeg, and whisper.cpp — works on YouTube, TikTok, Instagram, Vimeo, and ~1000 other sites.
- save — one paste box, MP4/MP3 toggle, quality picker. live progress with %, MB, fragment count, ETA.
- bulk paste — drop a wall of URLs, batch them all at once.
- pause / resume — pause a download mid-flight; closing the tab auto-pauses (resumes on next visit).
- transcribe — local whisper.cpp on every saved file. word-level timestamps. exports .txt/.srt/.vtt.
- edit the transcript like a doc — contenteditable paragraphs, click-to-edit words, autosave, paragraph split/merge, find + replace, undo/redo, bookmarks, highlights, notes.
- speakers — optional automatic speaker labels via local diarization (no HuggingFace login). rename inline; the rename propagates everywhere.
- video stays in view — side-rail with the video, compact player (play/scrub/speed), speakers list, bookmarks. single sticky topbar.
- CLI + MCP (both alpha — see notes below) — drive everything from your terminal, or expose Trove as MCP tools to Claude Desktop / Cursor / Replit Agent.
- single Python process, single Docker container, no Node. mobile-friendly. light only — riso paper is the brand.
| ![]() | ![]() |
|---|---|
| mid-download | saved |
```sh
brew install yt-dlp ffmpeg   # macOS — or: apt install ffmpeg && pip install yt-dlp
git clone https://github.com/afk1997/trove.git
cd trove
./trove.sh
```

open http://localhost:8899 and paste something.
or with Docker:
```sh
docker build -t trove .
docker run -p 8899:8899 -e HOST=0.0.0.0 -e TROVE_ALLOW_UNAUTH_PUBLIC=1 trove
```

the `-e HOST=0.0.0.0` is required for Docker port-forwarding. trove refuses to start on a non-loopback bind without auth — `TROVE_ALLOW_UNAUTH_PUBLIC=1` is the explicit "I know, I'm only exposing this to my host" opt-in. for LAN/internet exposure, drop the opt-in and set `TROVE_TOKEN` instead — see below.
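the refuse-to-start rule amounts to a small startup guard. a minimal sketch of the idea (function name and exit message are illustrative, not Trove's actual code):

```python
import ipaddress
import os
import sys

def check_bind_policy(host: str) -> None:
    """Refuse a non-loopback bind unless a token or the explicit
    unauth opt-in is set (sketch of the rule described above)."""
    try:
        loopback = ipaddress.ip_address(host).is_loopback
    except ValueError:
        loopback = host == "localhost"
    if loopback:
        return  # 127.0.0.1 etc. — always allowed, no auth needed
    if os.environ.get("TROVE_TOKEN"):
        return  # public bind is fine when every request must carry a token
    if os.environ.get("TROVE_ALLOW_UNAUTH_PUBLIC") == "1":
        return  # explicit "I know what I'm doing" opt-in
    sys.exit("refusing non-loopback bind without TROVE_TOKEN "
             "(or TROVE_ALLOW_UNAUTH_PUBLIC=1)")
```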
| variable | default | what it does |
|---|---|---|
| `HOST` | `127.0.0.1` | bind address. set to 0.0.0.0 only with a token. |
| `PORT` | `8899` | TCP port. |
| `TROVE_TOKEN` | (unset) | when set, every `/api/*` request must send `Authorization: Bearer <token>`. |
| `TROVE_ALLOW_UNAUTH_PUBLIC` | (unset) | set to 1 to allow HOST=0.0.0.0 with no token (Docker port-forward, trusted LAN). without this opt-in, trove refuses to start on a non-loopback bind unless TROVE_TOKEN is set. |
| `TROVE_COOKIES_FROM_BROWSER` | (unset) | one of `safari`, `chrome`, `firefox`, `brave`, `edge`. required for YouTube right now (Google blocks cookieless yt-dlp). |
| `TROVE_CONCURRENT_FRAGMENTS` | `4` | parallel fragment downloads for HLS streams (YouTube etc.). clamped 1–32. |
| `TROVE_JOB_TTL_SECONDS` | `3600` | how long completed jobs (and their files) linger before being swept. |
| `TROVE_MAX_WORKERS` | `4` | concurrent downloads. excess returns HTTP 503. |
| `TROVE_RATE_LIMIT` | `30` | requests per minute per IP. set to 0 to disable. |
| `TROVE_BATCH_MAX_URLS` | `50` | hard cap on URLs accepted per `/api/batch-download` request. |
| `TROVE_DIARIZATION` | `off` | `on` enables speaker labelling on transcribe (requires extra deps — see below). |
| `TROVE_EXTRACT_AUDIO_TIMEOUT` | `14400` | max seconds for the ffmpeg audio extract step (4 h covers any practical input). |
Note on `TROVE_TOKEN` + tab-close auto-pause: when a token is set, the browser's `navigator.sendBeacon` cannot attach the `Authorization` header, so closing the tab mid-download will not POST to `/api/job/<id>/pause`. The download continues running on the server until it finishes naturally — or, if you stop the server first, it is downgraded to `paused` on next restart and reappears in the queue. No work is lost either way; only the live "pause indicator" UX is deferred. Local (`HOST=127.0.0.1`, no token) deployments are unaffected.
the defaults assume localhost only. to expose trove safely:
- set a token: `export TROVE_TOKEN=$(openssl rand -hex 32)`
- set host: `export HOST=0.0.0.0`
- run behind a reverse proxy that adds HTTPS (Caddy, nginx, fly.io, etc.).
without TROVE_TOKEN, anyone who can reach the port can download.
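with a token set, any scripted client has to attach the header itself. a minimal stdlib sketch of the bearer scheme (the `/api/v1/jobs` path here is illustrative — use the real endpoints from the API):

```python
import os
import urllib.request

def trove_request(path: str) -> urllib.request.Request:
    """Build a request against a Trove server, attaching the bearer
    token when TROVE_TOKEN is set, per the auth rule above."""
    base = os.environ.get("TROVE_URL", "http://127.0.0.1:8899")
    req = urllib.request.Request(base + path)
    token = os.environ.get("TROVE_TOKEN")
    if token:
        # every /api/* request must carry Authorization: Bearer <token>
        req.add_header("Authorization", f"Bearer {token}")
    return req

# send with urllib.request.urlopen(trove_request("/api/v1/jobs"))
# against a running server.
```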
cookies are recommended for YouTube. short, public, non-monetized videos often work without them, but YouTube will eventually serve a sign-in wall for age-restricted content, certain regions, or longer/monetized uploads. to use cookies from your browser:
```sh
export TROVE_COOKIES_FROM_BROWSER=safari   # or chrome / firefox / brave / edge
./trove.sh
```

the browser must be installed on the host and have an active YouTube session.
trove transcribes any saved audio or video locally using whisper.cpp. no api keys, no cloud, no telemetry.
first time:
- save a media file (the existing flow)
- on the saved card, click `▸ transcribe` — you'll see a one-time consent dialog
- click `set it up ↗` — you'll land on `/transcribe/setup`
- trove auto-detects your machine (Metal on M-series Mac, CUDA on NVIDIA Linux, AVX/CPU otherwise) and shows four model options with realistic speed estimates for your machine
- pick one. trove downloads it from `huggingface.co/ggerganov/whisper.cpp` (one-time, ~140 MB for `base`)
- you're done. transcription works offline forever after.
after first setup:
- click `▸ transcribe` on any saved card → progress bar → `▸ view transcript ↗` opens the transcript editor in a new tab
- check `auto-transcribe` on the paste form to start transcription automatically when each download finishes
model storage:
models live at `<trove>/models/ggml-*.bin`. swap or remove via the same setup page in settings mode (footer link `transcribe settings ↗`).
Docker: the model directory is auto-persisted via a Docker volume. To make it visible/mountable on the host, run:
```sh
docker run -v ./models:/app/models -v ./downloads:/app/downloads -p 8899:8899 trove
```

network policy: the only outbound calls trove makes are (1) yt-dlp fetching the original media, and (2) the model download from huggingface during the setup wizard. transcription itself is 100% local.
the transcript page is a real document editor, not a passive viewer.
layout — single sticky topbar (title, saving indicator, undo / redo, search, export, more). document on the left (max 720 px). right rail with video, compact player (play / scrub / time / speed), speakers panel, bookmarks panel.
editing
- click any word → place your cursor and type to fix it. all edits autosave.
- `Enter` inside a paragraph splits it at the cursor. `Backspace` at the start of a paragraph merges it with the one above. `Cmd+Z` / `Cmd+Shift+Z` undo and redo.
- right-click a selection for: copy · highlight · bookmark · note · export selection · revert paragraph.
playback + sync
- double-click a word → seek the video to that timestamp.
- click a `[00:14]` time pill on a paragraph → seek to that paragraph's start (paused).
- `Alt + click` a word → seek without moving the cursor.
- the active word underlines as audio plays. the active paragraph gets a faint highlight. a `↓ jump to current` pill appears bottom-center if the active paragraph scrolls offscreen.
- `Space` play/pause, `J`/`L` skip ±5 s, `,`/`.` adjust speed, `Cmd+B` bookmark current time.
word-timestamp realignment — whisper.cpp without DTW places the first word after each silence ~300–500 ms early, and the error compounds. trove uses silero-vad's speech regions as ground truth and snaps drifted single-word timestamps forward to the next speech region's start. word durations are also clamped at 1.5 s so the active-word highlight never lingers across silences. (run `TROVE_DIARIZATION=on` to enable; the realignment piggy-backs on the same vad pass that diarization uses.)
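the snap-forward-and-clamp idea can be sketched in a few lines (illustrative only, not Trove's actual implementation; `words` stands in for whisper word timings, `speech_regions` for the VAD output):

```python
MAX_WORD_DURATION = 1.5  # seconds, the clamp described above

def realign_words(words, speech_regions):
    """Snap drifted word starts forward to the next speech region.

    words: list of dicts with 'start' and 'end' in seconds.
    speech_regions: sorted list of (start, end) tuples from a VAD pass.
    """
    out = []
    for w in words:
        start, end = w["start"], w["end"]
        # word starts in silence -> snap to the next region's start
        if not any(rs <= start < re for rs, re in speech_regions):
            later = [rs for rs, _ in speech_regions if rs > start]
            if later:
                start = later[0]
        # clamp duration so the highlight never lingers across silences
        end = max(start, min(end, start + MAX_WORD_DURATION))
        out.append({**w, "start": start, "end": end})
    return out
```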
speakers
- with `TROVE_DIARIZATION=on`, segments come back already labeled `Speaker 1`, `Speaker 2`, etc. (see next section).
- click any speaker label to rename — the new name propagates to every occurrence.
- the speakers panel in the side rail lights up the currently-talking row with an orange dot.
- without diarization, segments are split on speech pauses and speakers stay unlabeled; you can apply names manually from the rail.
bookmarks · highlights · notes
- bookmarks panel: `+ add bookmark` captures the current playback time. click a time pill to seek. edit the note inline.
- select any text and right-click → highlight to mark a passage; → add note to attach a comment to a word.
- highlights and notes persist in `.words.json` and survive transcribe-page reloads.
find + replace
- `Cmd+F` opens the search popover (find tab). `Cmd+Shift+F` opens find + replace.
- replace operates on every visible word; deleted words are skipped. the `case` toggle controls case-sensitivity.
export
- `.txt` / `.srt` / `.vtt` from the export menu. all three are regenerated from the current edits — your fixes ship.
- the `export selection` action on a right-click writes a stand-alone `.txt` of just the selection with timestamp markers.
autosave detail — every word edit, segment op, speaker rename, bookmark CRUD, etc. goes through a per-transcript txn lock on the server, writes the `.words.json` atomically (tempfile + `os.replace`), and regenerates the `.txt` / `.srt` / `.vtt` exports. you can close the tab any time; the document is the source of truth.
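the tempfile + `os.replace` pattern looks roughly like this (a sketch of the pattern, not Trove's code):

```python
import json
import os
import tempfile

def write_words_json_atomic(path: str, doc: dict) -> None:
    """Write the transcript doc so a reader never sees a half-written file."""
    dir_ = os.path.dirname(os.path.abspath(path))
    # tempfile must live on the same filesystem for the rename to be atomic
    fd, tmp = tempfile.mkstemp(dir=dir_, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(doc, f)
            f.flush()
            os.fsync(f.fileno())  # data on disk before the swap
        os.replace(tmp, path)     # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)            # clean up the orphan on failure
        raise
```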
trove can auto-label speakers without any HuggingFace login or API key. The pipeline runs entirely on your machine:
- silero-vad finds speech regions in the audio.
- resemblyzer computes a 256-d voice embedding for overlapping 1.6 s windows inside each region.
- sklearn agglomerative clustering (Ward + Euclidean) groups embeddings into speakers.
- a 9-window median filter smooths single-window flips so a brief interjection doesn't create a phantom speaker.
- each whisper word is assigned the speaker of the cluster covering its timestamp; consecutive same-speaker words become one paragraph.
Realistic accuracy on clean two-person audio is ~70 %. Sub-1.6 s turns (say one-word interjections) are below resemblyzer's resolution floor and get absorbed into the longer neighbour's chunk — rename or split the resulting paragraph by hand if it matters.
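the median smoothing in step 4 can be sketched in pure Python (illustrative; Trove's actual filter may differ in edge handling):

```python
from statistics import median_low

def median_smooth(labels, window=9):
    """Median-filter per-window speaker ids so a single-window flip
    doesn't create a phantom speaker. Edges use a truncated window."""
    half = window // 2
    n = len(labels)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        # median_low keeps the result an existing cluster id
        out.append(median_low(labels[lo:hi]))
    return out
```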
To enable:
```sh
pip install resemblyzer silero-vad scikit-learn   # ~800 MB (PyTorch is the bulk)
export TROVE_DIARIZATION=on
```

Without those deps installed, or with `TROVE_DIARIZATION=off` (the default), transcription behaves exactly as before — segments split on speech pauses and speakers stay unlabeled.
Status: unstable alpha.
`trove` (the CLI) and `trove-mcp` (the MCP server below) ship as functional but largely untested. Surfaces and command names may change before they're promoted to stable. Local use only — don't script around them in production yet.
trove talks to a running Trove server through the stable /api/v1 JSON API. Stdlib-only — no extra deps.
```sh
# in the trove directory, with the venv activated
python cli.py serve                  # boot the server (alias for ./trove.sh)
python cli.py fetch URL [URL...]     # queue one or many downloads
python cli.py list                   # show jobs (id, status, %, title)
python cli.py get <id>               # full job detail
python cli.py wait <id>              # block until job is done / error / cancelled
python cli.py transcribe <id>        # kick off a transcribe
python cli.py transcript <tid>       # fetch transcript (.txt by default)
python cli.py export <tid> txt|srt|vtt
python cli.py search <tid> "query"   # find inside a transcript
python cli.py replace <tid> "old" "new"
python cli.py models list            # whisper models on disk
python cli.py models pull <name>     # download a model
```

Configure with env vars:

```sh
export TROVE_URL=http://127.0.0.1:8899   # server base URL
export TROVE_TOKEN=...                   # bearer token if the server has one
```

Run `python cli.py --help` for the full command list.
Status: unstable alpha. Same caveats as the CLI. Tested only against the contract; not yet exercised in real agent workflows.
trove-mcp exposes Trove's HTTP surface as Model Context Protocol tools so a coding agent (Claude Desktop, Cursor, Replit Agent, etc.) can drive Trove end-to-end — queue downloads, poll status, transcribe, edit transcripts, search, replace, manage models. Transport: stdio.
Wire it into your MCP client config:
```json
{
  "mcpServers": {
    "trove": {
      "command": "python",
      "args": ["/abs/path/to/trove/mcp_server.py"],
      "env": {
        "TROVE_URL": "http://127.0.0.1:8899",
        "TROVE_TOKEN": ""
      }
    }
  }
}
```

The Trove server itself must already be running on the URL the MCP client points at. Each tool returns a clear error if the server is unreachable so the agent knows to prompt the user to start it (`python cli.py serve`).
Tool surface mirrors the CLI 1:1 (intentional — there's a parity check in the test suite that fails if either side adds a tool the other doesn't have).
- backend: Python 3.12 + Flask
- frontend: htmx 2 + vanilla JS + Tailwind CSS (standalone CLI, no Node at runtime)
- engine: yt-dlp + ffmpeg + whisper.cpp (via pywhispercpp)
- diarization (optional): silero-vad + resemblyzer + scikit-learn
- typography: Fraunces (display, with the WONK + opsz variable axes), Inter (UI), IBM Plex Mono (stamps), Source Serif 4 (transcript body)
this tool is for personal use. respect copyright laws and the terms of service of platforms you download from.
MIT. see LICENSE.
inspired by averygan/reclip (MIT).



