
Video (DataKind)

Use DATA_KIND_VIDEO when your input is a video file (MP4/MKV/MOV, etc.).

In this mode, the pipeline treats the input as time-based media and typically produces image-like chunks (frames) that can be embedded for visual search and retrieval.


What this resembles

Choosing Video tells the pipeline, in effect:

“This input is a video. Please break it into meaningful moments (usually frames), embed them, and keep references so I can jump back to the right timestamp.”

Common examples:

  • Product demos and tutorials
  • Recorded meetings and webinars
  • Security footage
  • Training videos / explainers

How to send Video inputs

Video inputs typically use one of these locators:

  • UrlLocator (video downloadable via HTTP/HTTPS)
  • S3Locator (video stored in object storage)

And (optionally) set:

  • Input.kind_hint = DATA_KIND_VIDEO

Why the hint helps:

  • Faster routing (no type detection)
  • More predictable behavior

If you are sending bytes, setting either kind_hint or Meta.mime_type (e.g. video/mp4) is strongly recommended.
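
As a rough illustration of the advice above, the sketch below builds a video input that carries both the kind hint and a MIME type. The dataclasses (UrlLocator, Meta, Input) are hypothetical stand-ins that only mirror the names used in this page; the real client types and field layout may differ. The MIME type is guessed from the file extension with Python's standard mimetypes module.

```python
import mimetypes
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-ins for the request types named in the text.
# The real client library may shape these differently.
DATA_KIND_VIDEO = "DATA_KIND_VIDEO"


@dataclass
class UrlLocator:
    url: str


@dataclass
class Meta:
    mime_type: Optional[str] = None


@dataclass
class Input:
    locator: UrlLocator
    kind_hint: Optional[str] = None
    meta: Meta = field(default_factory=Meta)


def video_input_from_url(url: str) -> Input:
    """Build a video Input with the kind hint and a guessed MIME type."""
    # mimetypes guesses from the file extension only; it never fetches the URL.
    mime_type, _ = mimetypes.guess_type(url)
    return Input(
        locator=UrlLocator(url=url),
        kind_hint=DATA_KIND_VIDEO,
        meta=Meta(mime_type=mime_type),
    )


video = video_input_from_url("https://example.com/demos/onboarding.mp4")
```

If the extension is unknown, mimetypes.guess_type returns None for the type, in which case the kind hint alone still tells the pipeline to treat the input as video.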


Video chunking technique

The chunking technique determines how a video is sampled into chunks for embedding.

Most commonly, the system extracts frames from the video and embeds them as image chunks.


Supported chunking techniques

Frames (VIDEO_CHUNK_TECHNIQUE_FRAMES)

The system extracts frames from the video and embeds them.

There are two common strategies (implementation may vary by deployment):

  • Keyframes: samples frames at major scene changes (efficient, and usually enough for “what is this video about?” search).
  • Interval sampling: samples frames every N seconds (better coverage for long videos with continuous changes).

Key options (typical):

  • strategy: KEYFRAMES or INTERVAL
  • interval_ms: sampling interval in milliseconds, used with the INTERVAL strategy
  • max_frames: optional cap to control cost

What you get:

  • multiple embedded chunks (one per extracted frame)
  • each chunk includes references back to the original video (time span) and the frame region (rect span)
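
To make the interaction between interval_ms and max_frames concrete, here is a minimal sketch of which timestamps interval sampling could pick. It assumes sampling starts at 0 ms and that max_frames simply keeps the earliest frames; a real deployment may instead spread the capped frames across the whole video. The function name interval_timestamps_ms is hypothetical.

```python
from typing import List, Optional


def interval_timestamps_ms(
    duration_ms: int,
    interval_ms: int,
    max_frames: Optional[int] = None,
) -> List[int]:
    """Return the timestamps (in ms) at which frames would be sampled."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    if max_frames is not None and max_frames < 0:
        raise ValueError("max_frames must not be negative")

    # One sample every interval_ms, starting at the first frame.
    timestamps = list(range(0, duration_ms, interval_ms))

    # Assumption: the cap keeps the earliest samples and drops the rest.
    if max_frames is not None:
        timestamps = timestamps[:max_frames]
    return timestamps


# A 60-second video sampled every 10 seconds yields 6 frames.
uncapped = interval_timestamps_ms(60_000, 10_000)
# The same video capped at 4 frames yields 4 chunks, so cost is bounded.
capped = interval_timestamps_ms(60_000, 10_000, max_frames=4)
```

Each returned timestamp corresponds to one embedded chunk, so the length of the list is a direct estimate of how many embeddings a video will produce.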

Time ranges (VIDEO_CHUNK_TECHNIQUE_TIME_RANGES)

Some configurations support chunking the video into time windows (e.g., 0–10s, 10–20s). The system may then:

  • represent each window using one or more representative frames, and/or
  • attach transcript segments if audio transcription is enabled.

Use time ranges when:

  • you want results aligned to defined moments (chapters)
  • you plan to pair video search with transcript-based retrieval
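
The windowing idea can be sketched as follows. This is only an illustration, assuming fixed-size, non-overlapping windows with a shorter final window; the function name time_windows_ms and the window_ms parameter are hypothetical and not confirmed options of the system.

```python
from typing import List, Tuple


def time_windows_ms(duration_ms: int, window_ms: int) -> List[Tuple[int, int]]:
    """Split a video into consecutive (start_ms, end_ms) windows."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")

    windows = []
    start = 0
    while start < duration_ms:
        # The last window is clipped to the end of the video.
        end = min(start + window_ms, duration_ms)
        windows.append((start, end))
        start = end
    return windows


# A 25-second video in 10-second windows: 0-10s, 10-20s, 20-25s.
windows = time_windows_ms(25_000, 10_000)
```

Transcript segments, when available, could then be attached to whichever window contains their start time.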

What you get after chunking

Each produced video chunk includes:

  • the chunk content (usually a frame image)
  • a time span that points to the moment in the original video
  • a rect span that points to the region within that frame (often the full frame)
  • metadata you can use for filtering and display (name, source id, mime type)

This allows vector search to return a frame (or a moment) and still let you navigate back to the source video.
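
A small sketch of how a search result might be turned back into a "jump to this moment" label. The VideoChunk, TimeSpan, and RectSpan classes and their field names are assumptions modeled on the list above, not the system's actual schema.

```python
from dataclasses import dataclass


@dataclass
class TimeSpan:
    start_ms: int
    end_ms: int


@dataclass
class RectSpan:
    x: int
    y: int
    width: int
    height: int


@dataclass
class VideoChunk:
    # Hypothetical field names; the real chunk schema may differ.
    source_id: str
    name: str
    mime_type: str
    time_span: TimeSpan
    rect_span: RectSpan


def format_timestamp(ms: int) -> str:
    """Format milliseconds as HH:MM:SS, truncating fractional seconds."""
    total_seconds = ms // 1000
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def jump_label(chunk: VideoChunk) -> str:
    """Build a display label pointing at the chunk's moment in the source."""
    return f"{chunk.name} @ {format_timestamp(chunk.time_span.start_ms)}"


hit = VideoChunk(
    source_id="video-123",
    name="demo.mp4",
    mime_type="image/jpeg",  # assumed: the chunk content is a frame image
    time_span=TimeSpan(start_ms=83_500, end_ms=83_500),
    rect_span=RectSpan(x=0, y=0, width=1920, height=1080),  # full frame
)
label = jump_label(hit)
```

A player UI could seek to time_span.start_ms and, if the rect span is smaller than the frame, highlight that region.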


Practical guidance

For most video knowledge bases:

  • Start with Frames + Keyframes (good quality, efficient)
  • Use Interval sampling for:
    • long videos where content changes gradually
    • cases where keyframes miss important moments
  • Set max_frames if you need predictable cost

Video vs Image vs File

Use Video when:

  • the input is a video file (MP4/MKV/MOV, etc.)
  • you want time-based references (timestamps) in results

Use Image when:

  • you only need to index a single image (screenshot/poster)

Use File when:

  • the input is a document container (PDF/DOCX/PPTX)
  • you want text extraction + text chunking

Mental model

  • DataKind answers: “What is this input?” → video
  • Chunking answers: “How do we sample it?” → frames or time windows
  • Time spans answer: “When did this occur?” → a timestamp/range in the original video
  • Rect spans answer: “Where in the frame?” → an (x, y, width, height) region