DataKind

DataKind describes the type of input content you are ingesting, so the pipeline can support multimodal embeddings.

It helps the pipeline pick the right extraction and chunking strategy (text parsing vs. media decoding), and it improves performance by letting the pipeline skip expensive content-type detection.

You can provide it as an optional hint via Input.kind_hint.
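To show where the hint lives, here is a minimal sketch using local Python stand-ins for the request types. Only `DataKind`, its values, `TextLocator`, `UrlLocator`, and `Input.kind_hint` come from this page; the `text`, `url`, and `locator` field names are hypothetical, and this is not the real SDK or generated message API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class DataKind(Enum):
    # Local stand-in for the documented enum (not the real generated type).
    DATA_KIND_UNSPECIFIED = "DATA_KIND_UNSPECIFIED"
    DATA_KIND_TEXT = "DATA_KIND_TEXT"
    DATA_KIND_FILE = "DATA_KIND_FILE"
    DATA_KIND_IMAGE = "DATA_KIND_IMAGE"
    DATA_KIND_AUDIO = "DATA_KIND_AUDIO"
    DATA_KIND_VIDEO = "DATA_KIND_VIDEO"


@dataclass
class TextLocator:
    text: str  # hypothetical field name


@dataclass
class UrlLocator:
    url: str  # hypothetical field name


@dataclass
class Input:
    locator: Union[TextLocator, UrlLocator]  # hypothetical field name
    # Optional hint; UNSPECIFIED means "let the backend detect the kind".
    kind_hint: DataKind = DataKind.DATA_KIND_UNSPECIFIED


# Text you already have: embed it as-is, no extraction.
note = Input(TextLocator("Customer asked about the refund policy."), DataKind.DATA_KIND_TEXT)

# A remote PDF: the hint routes it to document extraction.
report = Input(UrlLocator("https://example.com/report.pdf"), DataKind.DATA_KIND_FILE)

# No hint: the backend falls back to detection.
unknown = Input(UrlLocator("https://example.com/blob"))
```

Leaving `kind_hint` at its default is equivalent to sending `DATA_KIND_UNSPECIFIED`.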

Supported values

`DATA_KIND_TEXT`

Meaning: Plain text provided directly in the request.

Resembles: “I already have the text, just embed it.”

Typical locator: TextLocator

Examples:

Chat transcripts
Notes
Small JSON/text blobs you want embedded as-is

`DATA_KIND_FILE`

Meaning: A general file whose content must be extracted before chunking/embedding.

Resembles: “Fetch this file and extract its text/content.”

Typical locators: FileLocator, UrlLocator, S3Locator, BytesLocator

Examples:

PDF, DOCX
HTML files
CSV and other document-like formats

Notes:

Use this when the input is a document/container format and you expect text extraction.
If you already know it’s an image, audio, or video file, prefer DATA_KIND_IMAGE, DATA_KIND_AUDIO, or DATA_KIND_VIDEO instead.

`DATA_KIND_IMAGE`

Meaning: An image input.

Resembles: “Embed what is visible in this image.”

Typical locators: FileLocator, UrlLocator, S3Locator, BytesLocator

Examples:

PNG/JPEG/WebP
Scanned pages (image-based PDFs may still be DATA_KIND_FILE depending on your extractor)

Notes:

Image inputs may be processed via tiling (RectSpan) and vision/OCR depending on your pipeline configuration.
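As an illustration of tiling, the sketch below covers an image with fixed-size square tiles and records each one as a RectSpan. The `x`/`y`/`width`/`height` fields and the `tile_image` helper are hypothetical; the actual RectSpan schema and tile size depend on your pipeline configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RectSpan:
    # Hypothetical fields: top-left pixel offset plus size.
    x: int
    y: int
    width: int
    height: int


def tile_image(img_width: int, img_height: int, tile: int) -> List[RectSpan]:
    """Cover the image with tile x tile rectangles, row by row.

    Edge tiles are clipped so no span extends past the image bounds.
    """
    if img_width <= 0 or img_height <= 0 or tile <= 0:
        raise ValueError("dimensions and tile size must be positive")
    spans = []
    for y in range(0, img_height, tile):
        for x in range(0, img_width, tile):
            spans.append(RectSpan(x, y, min(tile, img_width - x), min(tile, img_height - y)))
    return spans


# A 1000x600 image with 512-px tiles yields a 2x2 grid; right and bottom tiles are clipped.
spans = tile_image(1000, 600, 512)
```

Clipping edge tiles (rather than padding them) keeps every span a valid region of the original image, which makes the references easy to map back.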

`DATA_KIND_AUDIO`

Meaning: An audio input.

Resembles: “Embed or index this audio content (often via transcription first).”

Typical locators: FileLocator, UrlLocator, S3Locator, BytesLocator

Examples:

MP3/WAV/M4A
Call recordings
Podcast clips

Notes:

Audio chunking typically produces TimeSpan references back to the original media.
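A common way to produce those references is fixed-length windows with a small overlap, so speech that crosses a boundary still appears whole in one chunk. This is a minimal sketch: the millisecond `start_ms`/`end_ms` fields and the `window_audio` helper are hypothetical, and the real chunker may split on transcript or silence boundaries instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TimeSpan:
    # Hypothetical fields, in milliseconds from the start of the media.
    start_ms: int
    end_ms: int


def window_audio(duration_ms: int, window_ms: int, overlap_ms: int = 0) -> List[TimeSpan]:
    """Split [0, duration_ms) into fixed windows that overlap by overlap_ms."""
    if window_ms <= 0 or not 0 <= overlap_ms < window_ms:
        raise ValueError("need window_ms > 0 and 0 <= overlap_ms < window_ms")
    step = window_ms - overlap_ms
    spans = []
    start = 0
    while start < duration_ms:
        end = min(start + window_ms, duration_ms)
        spans.append(TimeSpan(start, end))
        if end == duration_ms:
            break  # the last window reached the end of the audio
        start += step
    return spans


# 70 s of audio, 30 s windows, 5 s overlap.
spans = window_audio(70_000, 30_000, 5_000)
```

The last window is shortened to end exactly at the media duration, so every span points at real audio.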

`DATA_KIND_VIDEO`

Meaning: A video input.

Resembles: “Embed the video (often via frames + optional audio transcript).”

Typical locators: FileLocator, UrlLocator, S3Locator, BytesLocator

Examples:

MP4/MKV
Product demos
Recorded meetings

Notes:

Video chunking can reference frames and/or time ranges using RectSpan and TimeSpan.
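To make the combination concrete, the sketch below samples one frame per fixed interval and gives each chunk both a RectSpan (here, the whole frame) and the TimeSpan it represents. The field names, the `VideoChunk` shape, and the midpoint-sampling rule are all hypothetical choices for illustration, not the pipeline's documented behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TimeSpan:
    start_ms: int  # hypothetical field names
    end_ms: int


@dataclass(frozen=True)
class RectSpan:
    x: int  # hypothetical field names
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class VideoChunk:
    # Hypothetical chunk shape: one sampled frame, the region of that frame
    # it covers, and the time range of the video it stands for.
    frame_ms: int
    region: RectSpan
    time: TimeSpan


def sample_video(duration_ms: int, width: int, height: int, every_ms: int) -> List[VideoChunk]:
    """Take one frame at the midpoint of each every_ms interval (whole-frame regions)."""
    if every_ms <= 0:
        raise ValueError("every_ms must be positive")
    chunks = []
    for start in range(0, duration_ms, every_ms):
        end = min(start + every_ms, duration_ms)
        chunks.append(
            VideoChunk(
                frame_ms=(start + end) // 2,
                region=RectSpan(0, 0, width, height),
                time=TimeSpan(start, end),
            )
        )
    return chunks


# 25 s of 1080p video, one frame per 10 s interval.
chunks = sample_video(25_000, 1920, 1080, every_ms=10_000)
```

A transcript-based pipeline could attach audio chunks to the same TimeSpans, so frame and speech embeddings line up on a shared timeline.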

`DATA_KIND_UNSPECIFIED`

Meaning: No hint provided.

When no hint is provided, the system detects the kind from available signals (e.g., MIME type, file extension, magic bytes) and falls back to conservative defaults.
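Magic-byte detection inspects the first few bytes of the payload for well-known file signatures. The sketch below is illustrative only: the `sniff_kind` helper is hypothetical, it is not the backend's detector, and treating any ZIP container (such as DOCX/PPTX) as a document is an assumption.

```python
from enum import Enum
from typing import Optional


class DataKind(Enum):
    # Local stand-in for the documented enum values used here.
    DATA_KIND_FILE = "DATA_KIND_FILE"
    DATA_KIND_IMAGE = "DATA_KIND_IMAGE"
    DATA_KIND_AUDIO = "DATA_KIND_AUDIO"
    DATA_KIND_VIDEO = "DATA_KIND_VIDEO"


def sniff_kind(head: bytes) -> Optional[DataKind]:
    """Guess a DataKind from the first bytes of a payload; None if unknown."""
    if head.startswith(b"%PDF-") or head.startswith(b"PK\x03\x04"):
        # PDF, or a ZIP container such as DOCX/PPTX (assumed to be a document).
        return DataKind.DATA_KIND_FILE
    if head.startswith(b"\x89PNG\r\n\x1a\n") or head.startswith(b"\xff\xd8\xff"):
        return DataKind.DATA_KIND_IMAGE  # PNG or JPEG
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return DataKind.DATA_KIND_IMAGE
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return DataKind.DATA_KIND_AUDIO
    if head.startswith(b"ID3"):
        return DataKind.DATA_KIND_AUDIO  # MP3 with an ID3v2 tag
    if head[4:8] == b"ftyp":
        # ISO base media (MP4/M4A): the major brand separates audio-only M4A.
        return DataKind.DATA_KIND_AUDIO if head[8:12] == b"M4A " else DataKind.DATA_KIND_VIDEO
    if head.startswith(b"\x1a\x45\xdf\xa3"):
        return DataKind.DATA_KIND_VIDEO  # EBML header (Matroska/WebM)
    return None  # unknown: fall back to other signals or defaults
```

Sniffing only needs the first 12 bytes, which is why it is cheap, but it cannot tell apart formats that share a container (for example, a ZIP that is not an Office document).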

Choosing the right DataKind

Use this quick guide:

You have the text in your request → TEXT
You have a document (PDF/DOCX/PPTX/HTML/CSV) → FILE
You have an image (PNG/JPEG/WebP) → IMAGE
You have audio (MP3/WAV/M4A) → AUDIO
You have video (MP4/MKV) → VIDEO
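If your client already knows each input's MIME type, the guide above can be applied mechanically. This is a sketch under stated assumptions: `choose_kind` and its `inline_text` flag are hypothetical, and treating other `text/*` files as FILE is a guess beyond the formats listed here. Unknown types return UNSPECIFIED so the backend can detect them.

```python
from enum import Enum


class DataKind(Enum):
    # Local stand-in for the documented enum values.
    DATA_KIND_UNSPECIFIED = "DATA_KIND_UNSPECIFIED"
    DATA_KIND_TEXT = "DATA_KIND_TEXT"
    DATA_KIND_FILE = "DATA_KIND_FILE"
    DATA_KIND_IMAGE = "DATA_KIND_IMAGE"
    DATA_KIND_AUDIO = "DATA_KIND_AUDIO"
    DATA_KIND_VIDEO = "DATA_KIND_VIDEO"


# Media families map directly from the top-level MIME type.
MEDIA_KINDS = {
    "image": DataKind.DATA_KIND_IMAGE,
    "audio": DataKind.DATA_KIND_AUDIO,
    "video": DataKind.DATA_KIND_VIDEO,
}

# Document formats from the guide that need text extraction.
DOCUMENT_TYPES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # DOCX
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",  # PPTX
    "text/html",
    "text/csv",
}


def choose_kind(mime_type: str, inline_text: bool = False) -> DataKind:
    """Apply the quick guide: inline text, then media families, then documents."""
    if inline_text:
        # The text is already in the request: nothing to extract.
        return DataKind.DATA_KIND_TEXT
    # Drop parameters such as "; charset=utf-8" and normalize case.
    essence = mime_type.split(";", 1)[0].strip().lower()
    family = essence.split("/", 1)[0]
    if family in MEDIA_KINDS:
        return MEDIA_KINDS[family]
    if essence in DOCUMENT_TYPES or family == "text":
        # Assumption: other text/* files are treated as documents too.
        return DataKind.DATA_KIND_FILE
    # Unknown: send no hint and let the backend detect it.
    return DataKind.DATA_KIND_UNSPECIFIED
```

Returning UNSPECIFIED for unrecognized types is deliberate: a wrong hint is worse than no hint, because it can route the input to the wrong extractor.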

Practical tips

If you can, set Meta.mime_type (e.g. application/pdf, image/png). It makes detection cheaper and more accurate.
If your input uses a BytesLocator, setting kind_hint or Meta.mime_type (or both) is strongly recommended.
DataKind is a hint: the backend may still override it if the content clearly doesn’t match.
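One way to picture how a hint, the declared MIME type, and content sniffing might combine is the precedence sketch below. It is an illustrative policy, not the backend's documented algorithm: the `resolve_kind` helper, the ordering of MIME type before sniffing, and the rule that only a positive, conflicting sniff overrides a hint are all assumptions.

```python
from enum import Enum
from typing import Optional


class DataKind(Enum):
    # Local stand-in for the documented enum values.
    DATA_KIND_UNSPECIFIED = "DATA_KIND_UNSPECIFIED"
    DATA_KIND_TEXT = "DATA_KIND_TEXT"
    DATA_KIND_FILE = "DATA_KIND_FILE"
    DATA_KIND_IMAGE = "DATA_KIND_IMAGE"
    DATA_KIND_AUDIO = "DATA_KIND_AUDIO"
    DATA_KIND_VIDEO = "DATA_KIND_VIDEO"


def resolve_kind(
    hint: DataKind,
    mime_kind: Optional[DataKind],
    sniffed: Optional[DataKind],
) -> DataKind:
    """Pick an effective kind: hint first, then MIME type, then sniffed content.

    mime_kind / sniffed are None when that signal is missing or inconclusive.
    """
    if hint is not DataKind.DATA_KIND_UNSPECIFIED:
        # Override the hint only when the content clearly says otherwise.
        if sniffed is not None and sniffed is not hint:
            return sniffed
        return hint
    if mime_kind is not None:
        return mime_kind
    if sniffed is not None:
        return sniffed
    # Nothing conclusive: leave it to the backend's own defaults.
    return DataKind.DATA_KIND_UNSPECIFIED
```

An inconclusive sniff (`None`) never overrides a hint, which matches the idea that the hint is trusted unless the content clearly doesn’t match.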