Support .vtt / WebVTT subtitle files #1682

@bertclaws

Summary

Please add first-class support for .vtt / WebVTT subtitle files.

Why this would help

MarkItDown is already useful as a bridge from many document/media formats into Markdown for LLM workflows. WebVTT is a common interchange format for:

  • meeting transcripts
  • video subtitles/captions
  • downloaded YouTube or conference transcripts
  • course/video note pipelines

Right now, .vtt does not appear to be explicitly supported. It may sometimes fall through as generic text depending on MIME/charset detection, but that is fragile and does not produce a clean Markdown result.

Expected behavior

Given a .vtt file, MarkItDown should parse it intentionally and produce readable Markdown instead of raw subtitle syntax.

For example, it should handle:

  • WEBVTT header
  • cue timestamps (00:00:01.000 --> 00:00:03.000)
  • cue identifiers
  • multiline subtitle blocks
  • optional speaker prefixes / metadata when present
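To make the list above concrete, here is a minimal parsing sketch in plain Python. It is not MarkItDown code: `Cue` and `parse_vtt` are hypothetical names, and the sketch assumes blank-line-separated blocks, skips `NOTE`/`STYLE`/`REGION` blocks, and ignores cue settings.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Timing line: the hours field is optional in WebVTT, and cue settings
# (e.g. "align:start") may follow the end time; they are ignored here.
_STAMP = r"(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
_TIMING = re.compile(rf"^\s*({_STAMP})\s+-->\s+({_STAMP})")


@dataclass
class Cue:  # hypothetical container, not an existing MarkItDown type
    identifier: Optional[str]
    start: float  # seconds
    end: float    # seconds
    text: str     # payload lines joined with "\n", tags left intact


def _to_seconds(stamp: str) -> float:
    parts = stamp.split(":")
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return hours * 3600 + int(parts[-2]) * 60 + float(parts[-1])


def parse_vtt(source: str) -> List[Cue]:
    """Parse WebVTT text into cues (a sketch, not a full spec implementation)."""
    text = source.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    blocks = re.split(r"\n\s*\n", text.strip())
    if not blocks[0].startswith("WEBVTT"):
        raise ValueError("not a WebVTT file: missing WEBVTT header")

    cues: List[Cue] = []
    for block in blocks[1:]:  # blocks[0] is the header block
        lines = block.split("\n")
        identifier = None
        if "-->" not in lines[0]:
            if len(lines) < 2 or "-->" not in lines[1]:
                continue  # NOTE, STYLE, REGION, or a malformed block
            identifier = lines[0].strip()  # optional cue identifier
            lines = lines[1:]
        match = _TIMING.match(lines[0])
        if match is None:
            continue  # timing line did not parse; skip the block
        payload = "\n".join(lines[1:]).strip()  # multiline cue text
        cues.append(Cue(identifier, _to_seconds(match.group(1)),
                        _to_seconds(match.group(2)), payload))
    return cues
```

Keeping the payload untouched at this stage leaves the choice of output shape (plain transcript, timestamped, speaker-aware) to a later rendering step.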

Possible output shapes

Any of these would be better than raw passthrough:

  1. Clean transcript mode
    • strips timestamps/cue metadata
    • preserves text paragraphs
  2. Timestamp-preserving markdown mode
    • keeps timestamps in a readable markdown form, e.g. [00:01] Speaker: text...
  3. Metadata-aware transcript
    • preserves speaker labels when present
    • drops formatting noise
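As a rough illustration of these output shapes, the sketch below renders already-parsed cues, given as `(start_seconds, payload)` tuples. All function names are hypothetical. It assumes speakers are marked with WebVTT voice tags (`<v Name>`), and collapsing consecutive duplicate lines in the clean transcript is my own assumption rather than something this issue asks for.

```python
import html
import re
from typing import Iterable, Tuple

_VOICE = re.compile(r"<v(?:\.[^\s>]+)*\s+([^>]+)>")  # <v Speaker> voice tag
_TAG = re.compile(r"</?[^>]+>")                      # any other cue tag


def clean_payload(payload: str) -> str:
    """Turn voice tags into 'Speaker: ' prefixes and drop other markup."""
    text = _VOICE.sub(lambda m: m.group(1).strip() + ": ", payload)
    text = _TAG.sub("", text)
    text = html.unescape(text)       # &amp; -> &, &lt; -> <, ...
    return " ".join(text.split())    # fold multiline cues into one line


def short_stamp(seconds: float) -> str:
    """Format seconds as MM:SS, or H:MM:SS once past an hour."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def to_transcript(cues: Iterable[Tuple[float, str]]) -> str:
    """Clean transcript mode: text only, one paragraph per cue."""
    paragraphs = []
    for _, payload in cues:
        text = clean_payload(payload)
        # Assumption: skip empty cues and consecutive exact repeats.
        if text and (not paragraphs or paragraphs[-1] != text):
            paragraphs.append(text)
    return "\n\n".join(paragraphs)


def to_timestamped_markdown(cues: Iterable[Tuple[float, str]]) -> str:
    """Timestamp-preserving mode: one '- [MM:SS] text' bullet per cue."""
    lines = []
    for start, payload in cues:
        text = clean_payload(payload)
        if text:
            lines.append(f"- [{short_stamp(start)}] {text}")
    return "\n".join(lines)
```

For example, a cue starting at 1.0 s with the payload `<v Ana>Hello` would render as `- [00:01] Ana: Hello` in the timestamped mode and as `Ana: Hello` in the clean transcript.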

Why this seems aligned with MarkItDown

MarkItDown already supports text-oriented and transcription-oriented inputs (including audio transcription and YouTube transcript workflows). WebVTT feels like a natural input format for the same family of use cases.

Minimal ask

Even a first step would be great:

  • explicitly recognize the .vtt extension and the text/vtt MIME type
  • convert it through a dedicated parser or lightweight cleaner
  • output readable Markdown/text content rather than raw subtitle markup
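The recognition step could look roughly like the following. This is a standalone sketch, not MarkItDown's converter interface: `looks_like_webvtt` and its parameters are hypothetical. Besides the extension and MIME type, it checks the `WEBVTT` signature at the start of the file, which is less fragile than relying on charset or MIME guessing alone.

```python
from pathlib import Path
from typing import Optional


def looks_like_webvtt(
    filename: Optional[str] = None,
    mimetype: Optional[str] = None,
    head: bytes = b"",
) -> bool:
    """Best-effort check that an input is WebVTT (hypothetical helper)."""
    # 1. File extension, case-insensitive.
    if filename and Path(filename).suffix.lower() == ".vtt":
        return True

    # 2. MIME type, ignoring parameters such as "; charset=utf-8".
    if mimetype and mimetype.split(";")[0].strip().lower() == "text/vtt":
        return True

    # 3. File signature: optional UTF-8 BOM, then "WEBVTT" followed by
    #    a space, tab, line terminator, or end of input.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    if head.startswith(b"WEBVTT"):
        return len(head) == 6 or head[6:7] in (b" ", b"\t", b"\n", b"\r")

    return False
```

A caller would pass whatever it has available, for example `looks_like_webvtt(filename="talk.vtt")` or `looks_like_webvtt(head=first_bytes_of_stream)`, and route matching inputs to the dedicated parser instead of the generic text path.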

Thanks — this would make MarkItDown much more useful in note-taking / transcript-to-markdown pipelines.
