
feat: add WebVTT converter #14

Open
bjesuiter wants to merge 2 commits into Michaelliv:main from bjesuiter:feat/vtt-support

bjesuiter commented May 7, 2026

Hey there, human here!

I wanted my openclaw to build itself a skill to download YouTube video transcripts, and it did so by downloading a VTT file from YouTube and adding a small Python script to convert it to Markdown.

I thought it would be a great addition to markit to support converting VTT files to Markdown, with deduplication of the rolling subtitles that YouTube uses.

So I vibe coded this PR. If you have any remarks or complaints, send them my way and I'll fix them! :)

Example

Input sample.vtt:

WEBVTT

00:00:00.000 --> 00:00:02.000
Hello world.

00:00:02.000 --> 00:00:04.000
This is a caption test.

Run:

markit sample.vtt -q

Output:

# Transcript

## Text

Hello world. This is a caption test.

## Timestamped Transcript

- [00:00:00.000] Hello world.
- [00:00:02.000] This is a caption test.

YouTube-style rolling captions are deduplicated, so cumulative cue fragments become one readable transcript instead of repeated text.
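
To illustrate what deduplicating rolling captions can look like, here is a minimal TypeScript sketch. It is not the code from this PR: the function name `dedupeRollingCaptions` and both heuristics (skip a line already shown in the previous cue, and let a line that extends the last emitted line replace it) are assumptions made for illustration.

```typescript
// Hypothetical helper, not taken from the PR. Input: the text of each cue in order.
// Output: transcript lines with rolling/cumulative repeats removed.
function dedupeRollingCaptions(cueTexts: string[]): string[] {
  const out: string[] = [];
  let previousLines: string[] = [];

  for (const text of cueTexts) {
    const lines = text
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.length > 0);

    for (const line of lines) {
      // Rolling captions repeat the previous cue's line(s); skip those.
      if (previousLines.includes(line)) continue;

      const last = out[out.length - 1];
      if (last !== undefined && line.startsWith(last)) {
        // Cumulative fragment: the new line extends the last one, so replace it.
        out[out.length - 1] = line;
      } else {
        out.push(line);
      }
    }
    previousLines = lines;
  }
  return out;
}

const rolling = dedupeRollingCaptions([
  "hello everyone",
  "hello everyone\nwelcome back",
  "welcome back",
  "welcome back\nto the channel",
]);
console.log(rolling.join(" ")); // hello everyone welcome back to the channel
```

The prefix heuristic is deliberately simple and can merge lines that were genuinely spoken twice, which is one way deduplication ends up changing the relationship between text and original cues.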

bjesuiter (Author) commented

Refinement pass pushed in 46971b9 (refactor: refine WebVTT parsing).

Summary:

  • Made .vtt extension matching case-insensitive (.VTT now works).
  • Made WebVTT MIME matching case-insensitive.
  • Simplified skipped WebVTT block handling with a shared prefix list.
  • Consolidated cue timestamp-tag stripping into one regex.
  • Improved HTML entity decoding, including numeric and hex entities such as &#x1F41F; (🐟).
  • Added test coverage for the new edge cases.
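
The items above can be sketched roughly as follows. This is an illustrative TypeScript sketch, not the code in 46971b9: the helper names `isVttPath`, `isVttMime`, `stripCueTags`, and `decodeEntities` are hypothetical, and the named-entity table is a small assumed subset.

```typescript
// Hypothetical helpers for illustration only; names are not from the PR.

// Case-insensitive extension match, so "talk.VTT" is accepted.
function isVttPath(path: string): boolean {
  return /\.vtt$/i.test(path);
}

// Case-insensitive MIME match; parameters such as "; charset=utf-8" are ignored.
function isVttMime(mime: string): boolean {
  const essence = mime.toLowerCase().split(";", 1)[0];
  return essence !== undefined && essence.trim() === "text/vtt";
}

// One regex removes inline timestamps (<00:00:01.500>) and cue tags (<c>, </c>, <v Name>).
function stripCueTags(text: string): string {
  return text.replace(/<[^>]*>/g, "");
}

// Assumed subset of named entities; a Map avoids accidental prototype lookups.
const NAMED_ENTITIES = new Map<string, string>([
  ["amp", "&"],
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
  ["apos", "'"],
  ["nbsp", "\u00A0"],
]);

// Decodes named, decimal (&#128031;) and hex (&#x1F41F;) character references.
function decodeEntities(text: string): string {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body: string) => {
    if (body.startsWith("#")) {
      const isHex = body.charAt(1).toLowerCase() === "x";
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      if (codePoint > 0x10ffff) return match;
      return String.fromCodePoint(codePoint);
    }
    return NAMED_ENTITIES.get(body) ?? match;
  });
}

// Strip tags first, then decode, so "&lt;b&gt;" survives as literal text.
console.log(decodeEntities(stripCueTags("<c.yellow>Fish</c> &amp; chips &#x1F41F;")));
```

Unknown named entities are returned unchanged, which keeps the conversion from silently dropping text it does not recognize.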

Verified locally:

  • bun run check
  • bun test (123 pass)
  • bun run build

Michaelliv (Owner) commented May 24, 2026

Thanks for the PR! I like the goal of making transcripts easier to read, but I’m not fully convinced this should land in core as-is.

My main concern is that WebVTT is already a plain-text format, and the most important semantic information in captions is the timing. This converter seems to turn it into a nicer transcript, but in doing so it drops or weakens some of that structure:

  • cue end timestamps are not preserved
  • cue settings/metadata are discarded
  • the main text section flattens the transcript
  • the rolling-caption deduplication may make the text more readable, but it can also change the timing relationship of the original cues

Since users/agents can already read or parse raw VTT on demand with normal text tooling, I’m trying to understand the core value of converting it to Markdown if the conversion is lossy.

Could you explain the intended use case a bit more? In particular, do you see this as:

  1. a readability/transcript extraction feature, where losing some caption structure is acceptable, or
  2. a caption-preserving conversion, where we should keep start/end timestamps and avoid losing timing information?

If we keep this in core, I’d lean toward preserving the caption timing more explicitly, for example, including both start and end timestamps in the timestamped section, and treating any deduped plain transcript as a secondary convenience rather than the canonical output.
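
A timing-preserving variant of the timestamped section could look roughly like the sketch below. It is an assumption-laden illustration, not a proposal for markit's actual API: the `ParsedCue` shape, the `parseCues` and `renderTimestamped` names, and the `[start --> end]` output format are all hypothetical.

```typescript
// Hypothetical types and helpers for illustration only.
interface ParsedCue {
  start: string;
  end: string;
  settings: string; // cue settings such as "align:start", kept instead of discarded
  text: string;
}

// Timing line: optional hours, then mm:ss.ttt, an arrow, a second timestamp, optional settings.
const TIMING =
  /^((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})(?:\s+(.*))?$/;

function parseCues(vtt: string): ParsedCue[] {
  const cues: ParsedCue[] = [];
  // Normalize line endings, then split into blank-line-separated blocks.
  const blocks = vtt.replace(/\r\n?/g, "\n").split(/\n{2,}/);

  for (const block of blocks) {
    const lines = block.split("\n");
    const timingIndex = lines.findIndex((line) => TIMING.test(line));
    if (timingIndex === -1) continue; // header, NOTE, STYLE, or REGION block

    const match = TIMING.exec(lines[timingIndex] ?? "");
    if (!match) continue;

    cues.push({
      start: match[1] ?? "",
      end: match[2] ?? "",
      settings: (match[3] ?? "").trim(),
      text: lines.slice(timingIndex + 1).join("\n").trim(),
    });
  }
  return cues;
}

// Renders one Markdown list item per cue with both start and end timestamps.
function renderTimestamped(cues: ParsedCue[]): string {
  return cues
    .map((cue) => `- [${cue.start} --> ${cue.end}] ${cue.text.replace(/\n/g, " ")}`)
    .join("\n");
}

const sample =
  "WEBVTT\n\n00:00:00.000 --> 00:00:02.000\nHello world.\n\n" +
  "00:00:02.000 --> 00:00:04.000\nThis is a caption test.\n";
console.log(renderTimestamped(parseCues(sample)));
// - [00:00:00.000 --> 00:00:02.000] Hello world.
// - [00:00:02.000 --> 00:00:04.000] This is a caption test.
```

Keeping `settings` on each cue means a later renderer can still decide whether to surface it, while the deduplicated plain transcript can be derived separately as a convenience view.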

bjesuiter (Author) commented

I'm using this to "archive" YouTube videos I find interesting.
So this is not something like "I want to know something specific from the VTT" but more like "I want to store the transcript in my notes, like a blog post, in case the video goes down for some reason."

I could probably store the VTT, but since my openclaw memory is already in Markdown, I'd like to preserve the coherence.
Also, a clean Markdown transcript feels better to me for this use case, similar to using Word for editing (a "working format") and PDF for export/archive.
