Video (DataKind)
Use DATA_KIND_VIDEO when your input is a video file (MP4/MKV/MOV, etc.).
In this mode, the pipeline treats the input as time-based media and typically produces image-like chunks (frames) that can be embedded for visual search and retrieval.
What this resembles
Choosing Video tells the pipeline:
“This input is a video. Please break it into meaningful moments (usually frames), embed them, and keep references so I can jump back to the right timestamp.”
Common examples:
- Product demos and tutorials
- Recorded meetings and webinars
- Security footage
- Training videos / explainers
How to send Video inputs
Video inputs typically use one of these locators:
- UrlLocator (video downloadable via HTTP/HTTPS)
- S3Locator (video stored in object storage)
And (optionally) set:
Input.kind_hint = DATA_KIND_VIDEO
Why the hint helps:
- Faster routing (no type detection)
- More predictable behavior
If you are sending bytes, setting either kind_hint or Meta.mime_type (e.g. video/mp4) is strongly recommended.
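As a minimal sketch of the hinting rules above: the dataclasses below are hypothetical stand-ins that mirror the names used on this page (Input, UrlLocator, Meta, kind_hint, mime_type). They are not the SDK's actual types, and the real constructors and field names may differ. The helper needs_type_detection is also invented here, to show when the pipeline would have to fall back to detecting the type itself.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the types named on this page.
# The real SDK's classes, fields, and enum values may differ.
DATA_KIND_VIDEO = "DATA_KIND_VIDEO"


@dataclass
class UrlLocator:
    url: str


@dataclass
class Meta:
    mime_type: Optional[str] = None


@dataclass
class Input:
    locator: Optional[UrlLocator] = None
    data: Optional[bytes] = None
    kind_hint: Optional[str] = None
    meta: Optional[Meta] = None


def needs_type_detection(inp: Input) -> bool:
    """Return True when neither kind_hint nor a MIME type is set."""
    has_mime = inp.meta is not None and bool(inp.meta.mime_type)
    return inp.kind_hint is None and not has_mime


# URL input with an explicit kind hint.
url_input = Input(
    locator=UrlLocator("https://example.com/demo.mp4"),
    kind_hint=DATA_KIND_VIDEO,
)

# Raw bytes with a MIME type instead of a kind hint.
bytes_input = Input(data=b"\x00\x00", meta=Meta(mime_type="video/mp4"))

# Raw bytes with no hint at all: the type would have to be detected.
bare_bytes = Input(data=b"\x00\x00")
```

Only the last input leaves the pipeline without any indication of the data kind, which is the case the recommendation above is meant to avoid.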
Video chunking technique
The chunking technique determines how the video is sampled into chunks for embedding.
Most commonly, the system extracts frames from the video and embeds them as image chunks.
Supported chunking techniques
Frames (VIDEO_CHUNK_TECHNIQUE_FRAMES)
The system extracts frames from the video and embeds them.
There are two common strategies (implementation may vary by deployment):
- Keyframes: samples the most important scene-change frames (efficient and usually enough for “what is this video about?” search).
- Interval sampling: samples frames every N seconds (better coverage for long videos with continuous changes).
Key options (typical):
- strategy: KEYFRAMES or INTERVAL
- interval_ms: sampling interval for interval mode
- max_frames: optional cap to control cost
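To illustrate how interval_ms and max_frames interact in interval mode, here is a small sketch that computes the timestamps at which frames would be sampled. The function name and the choice to keep the earliest frames when the cap is hit are assumptions made for illustration. A deployment may apply the cap differently, for example by thinning frames evenly across the video.

```python
from typing import List, Optional


def interval_sample_times_ms(
    duration_ms: int,
    interval_ms: int,
    max_frames: Optional[int] = None,
) -> List[int]:
    """Return sample timestamps (ms) every interval_ms, starting at 0."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    times = list(range(0, duration_ms, interval_ms))
    # Assumption: the cap keeps the earliest frames and drops the rest.
    if max_frames is not None:
        times = times[:max_frames]
    return times


# A 10-second video sampled every 2 seconds, without and with a cap.
uncapped = interval_sample_times_ms(10_000, 2_000)
capped = interval_sample_times_ms(10_000, 2_000, max_frames=3)
```

Since each sampled frame becomes one embedded chunk, the length of the returned list is a direct estimate of embedding cost.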
What you get:
- multiple embedded chunks (one per extracted frame)
- each chunk includes references back to the original video (time span) and the frame region (rect span)
Time ranges (VIDEO_CHUNK_TECHNIQUE_TIME_RANGES)
Some configurations support chunking the video into time windows (e.g., 0–10s, 10–20s). The system may then:
- represent each window using one or more representative frames, and/or
- attach transcript segments if audio transcription is enabled.
Use time ranges when:
- you want results aligned to defined moments (chapters)
- you plan to pair video search with transcript-based retrieval
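Fixed-length time windows such as 0 to 10 s and 10 to 20 s can be derived from the video duration as sketched below. The function name and the handling of a shorter final window are assumptions. The page does not specify how a deployment treats the remainder.

```python
from typing import List, Tuple


def time_windows_ms(duration_ms: int, window_ms: int) -> List[Tuple[int, int]]:
    """Split [0, duration_ms) into consecutive (start, end) windows."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    windows = []
    start = 0
    while start < duration_ms:
        # Assumption: the last window is shortened to fit the video.
        end = min(start + window_ms, duration_ms)
        windows.append((start, end))
        start = end
    return windows


# A 25-second video split into 10-second windows.
windows = time_windows_ms(25_000, 10_000)
```

Each window can then be paired with representative frames or with the transcript segments that fall inside its start and end times.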
What you get after chunking
Each produced video chunk includes:
- the chunk content (usually a frame image)
- a time span that points to the moment in the original video
- a rect span that points to the region within that frame (often the full frame)
- metadata you can use for filtering and display (name, source id, mime type)
This lets vector search return a frame (or a moment) while still letting you navigate back to the source video.
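The sketch below shows one way a client might use a returned chunk to jump back into the source video. The VideoChunk, TimeSpan, and RectSpan classes and their field names are hypothetical, as the actual response schema is not shown on this page. The `#t=` suffix is the temporal form of W3C Media Fragments URIs, which many browser video players honor.

```python
from dataclasses import dataclass


# Hypothetical shapes for the spans described above.
@dataclass
class TimeSpan:
    start_ms: int
    end_ms: int


@dataclass
class RectSpan:
    x: int
    y: int
    width: int
    height: int


@dataclass
class VideoChunk:
    source_id: str
    time_span: TimeSpan
    rect_span: RectSpan


def format_timestamp(ms: int) -> str:
    """Format milliseconds as HH:MM:SS for display (fractions dropped)."""
    hours, rem = divmod(ms // 1000, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def seek_url(video_url: str, chunk: VideoChunk) -> str:
    """Append a Media Fragments time offset (seconds) to the video URL."""
    start_s = chunk.time_span.start_ms / 1000
    offset = f"{start_s:.3f}".rstrip("0").rstrip(".")
    return f"{video_url}#t={offset}"


chunk = VideoChunk(
    source_id="demo-video",
    time_span=TimeSpan(start_ms=83_500, end_ms=83_500),
    rect_span=RectSpan(x=0, y=0, width=1920, height=1080),  # full frame
)
label = format_timestamp(chunk.time_span.start_ms)
link = seek_url("https://example.com/demo.mp4", chunk)
```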
Practical guidance
Recommended defaults
For most video knowledge bases:
- Start with Frames + Keyframes (good quality, efficient)
- Use Interval sampling for:
  - long videos where content changes gradually
  - cases where keyframes miss important moments
- Set max_frames if you need predictable cost
Video vs Image vs File
Use Video when:
- the input is a video file (MP4/MKV/MOV)
- you want time-based references (timestamps) in results
Use Image when:
- you only need to index a single image (screenshot/poster)
Use File when:
- the input is a document container (PDF/DOCX/PPTX)
- you want text extraction + text chunking
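The decision above can be approximated on the client side by file extension, as in this rough sketch. The video and document extensions come from this page. The image extensions and the DATA_KIND_IMAGE and DATA_KIND_FILE names are assumptions made by analogy with DATA_KIND_VIDEO and may not match the actual enum.

```python
from pathlib import PurePosixPath
from typing import Optional

DATA_KIND_VIDEO = "DATA_KIND_VIDEO"
DATA_KIND_IMAGE = "DATA_KIND_IMAGE"  # assumed name, not confirmed here
DATA_KIND_FILE = "DATA_KIND_FILE"    # assumed name, not confirmed here

VIDEO_EXTENSIONS = {".mp4", ".mkv", ".mov"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}  # illustrative assumption
DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".pptx"}


def suggest_kind_hint(filename: str) -> Optional[str]:
    """Suggest a kind hint from the extension, or None if unknown."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in VIDEO_EXTENSIONS:
        return DATA_KIND_VIDEO
    if suffix in IMAGE_EXTENSIONS:
        return DATA_KIND_IMAGE
    if suffix in DOCUMENT_EXTENSIONS:
        return DATA_KIND_FILE
    return None
```

Returning None for unknown extensions leaves the choice to the pipeline's own type detection instead of guessing.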
Mental model
- DataKind answers: “What is this input?” → video
- Chunking answers: “How do we sample it?” → frames or time windows
- Time spans answer: “When did this occur?” → a timestamp/range in the original video
- Rect spans answer: “Where in the frame?” → an (x, y, width, height) region