File (DataKind)
Use DATA_KIND_FILE when your input is a document or file container that should be read and extracted before embedding.
This is the most common mode for ingesting real-world files such as PDFs and Office documents.
What this resembles
Choosing File effectively tells the system:
“This input is a file. Please load it, extract any useful content (text and/or images when supported), then chunk and embed the result.”
Common examples:
- PDF documents
- DOCX / PPTX
- HTML files
- CSV and line-delimited JSON (NDJSON)
- ZIP files and other archives (when supported)
How to send File inputs
File inputs typically use one of these locators:
- UrlLocator (file downloadable via HTTP/HTTPS)
- S3Locator (file stored in object storage)
- BytesLocator (small file bytes provided inline)
Optionally, also set:
Input.kind_hint = DATA_KIND_FILE
Why the hint helps:
- Faster routing (no type detection)
- More predictable behavior
If you are sending bytes, setting either kind_hint or Meta.mime_type (e.g. application/pdf) is strongly recommended.
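As a rough illustration of the recommendation above, the sketch below models an input with plain Python dataclasses. The class and field names mirror the identifiers used on this page (UrlLocator, S3Locator, BytesLocator, Input.kind_hint, Meta.mime_type), but the constructors, the remaining fields, and the is_underspecified helper are assumptions made for the example, not the actual client API.

```python
from dataclasses import dataclass, field
from typing import Optional, Union

# Hypothetical stand-in for the DataKind enum value named on this page.
DATA_KIND_FILE = "DATA_KIND_FILE"


@dataclass
class UrlLocator:
    url: str  # file downloadable via HTTP/HTTPS


@dataclass
class S3Locator:
    bucket: str  # assumed fields; the real locator may differ
    key: str


@dataclass
class BytesLocator:
    data: bytes  # small file bytes provided inline


@dataclass
class Meta:
    name: Optional[str] = None
    mime_type: Optional[str] = None


@dataclass
class Input:
    locator: Union[UrlLocator, S3Locator, BytesLocator]
    kind_hint: Optional[str] = None
    meta: Meta = field(default_factory=Meta)


def is_underspecified(inp: Input) -> bool:
    """Flag inline-bytes inputs that carry neither a kind hint nor a mime type."""
    if not isinstance(inp.locator, BytesLocator):
        return False
    return inp.kind_hint is None and inp.meta.mime_type is None


# A URL input with an explicit hint.
pdf_from_url = Input(
    locator=UrlLocator("https://example.com/handbook.pdf"),
    kind_hint=DATA_KIND_FILE,
)

# Inline bytes with a mime type, as recommended above.
pdf_inline = Input(
    locator=BytesLocator(b"%PDF-1.7 ..."),
    meta=Meta(name="handbook.pdf", mime_type="application/pdf"),
)

# Inline bytes with no hint at all: the case the guidance warns about.
bare_bytes = Input(locator=BytesLocator(b"..."))
```

A check like is_underspecified could run on the client before sending, so that bare byte payloads are caught early rather than left to type detection.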
What happens in File mode
In File mode, the system typically performs these steps in order:
- Load the file from the locator
- Inspect it (mainly using mime type and basic detection)
- Extract content (e.g., text from PDFs/DOCX, records from CSV/NDJSON)
- Chunk the extracted content
- Embed each produced chunk
Not all file types support every extraction method. Extraction is best-effort and depends on the file format.
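The inspection step is described here only as "mainly using mime type and basic detection", so the following is one plausible way to picture it rather than the system's actual logic. The sketch prefers a declared mime type, then guesses from the file name with Python's standard mimetypes module, then falls back to a couple of well-known leading byte signatures. The detect_mime function and its fallback order are assumptions for illustration.

```python
import mimetypes
from typing import Optional

# Leading byte signatures for two common container formats.
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # ZIP; DOCX/PPTX files are ZIP containers too
]


def detect_mime(declared: Optional[str], filename: Optional[str], head: bytes) -> str:
    """Return a mime type using: declared value, then file name, then byte sniffing."""
    if declared:
        return declared
    if filename:
        guessed, _encoding = mimetypes.guess_type(filename)
        if guessed:
            return guessed
    for signature, mime in _SIGNATURES:
        if head.startswith(signature):
            return mime
    # Nothing matched: report a generic binary type.
    return "application/octet-stream"
```

This also shows why a declared Meta.mime_type makes behavior more predictable: it short-circuits every guess that follows.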
Common extraction paths
Text-like documents
Some file types are naturally text-first and are chunked like text:
- text/plain
- text/html
- application/json (treated as text unless you enable structured JSON handling)
These typically result in text chunks.
Record files (CSV / NDJSON)
Some file types are naturally record-based and are chunked by records:
- text/csv
- application/x-ndjson
These typically result in record-oriented chunks, which helps keep rows/records intact during retrieval.
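To make "keeping rows/records intact" concrete, here is a minimal sketch of record-oriented chunking using only the Python standard library. It parses CSV or NDJSON into whole records first and then groups them, so a chunk boundary can never fall inside a record. The function names and the max_records parameter are hypothetical; the real chunk policy may use different limits or boundaries.

```python
import csv
import io
import json
from typing import Dict, Iterator, List


def read_csv(text: str) -> List[Dict]:
    """Parse CSV text into one dict per row, keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))


def read_ndjson(text: str) -> List[Dict]:
    """Parse line-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def chunk_records(records: List[Dict], max_records: int) -> Iterator[List[Dict]]:
    """Yield groups of whole records; a record is never split across chunks."""
    if max_records < 1:
        raise ValueError("max_records must be at least 1")
    for start in range(0, len(records), max_records):
        yield records[start:start + max_records]


csv_text = "id,name\n1,Ada\n2,Grace\n3,Linus\n"
csv_chunks = list(chunk_records(read_csv(csv_text), max_records=2))

ndjson_text = '{"id": 1}\n{"id": 2}\n'
ndjson_chunks = list(chunk_records(read_ndjson(ndjson_text), max_records=1))
```

With three CSV rows and max_records=2, the result is two chunks: the first holds two complete rows and the second holds the remaining one.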
Documents with embedded media (PDF / DOCX)
Some file types can contain a mix of content.
For example:
- DOCX: extracted text + embedded images (when available)
- PDF: extracted text + embedded images (best-effort)
Depending on configuration, this can produce a mix of:
- text chunks (for extracted text)
- image chunks (for extracted images)
Chunking techniques in File mode
File mode can produce different chunk types depending on what gets extracted.
- Extracted text is chunked using the text chunk policy (see Text docs).
- Extracted images (when supported) are chunked using the image chunk policy (whole/tiles).
- Record files are chunked using record-friendly boundaries to keep records intact.
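The mapping from file format to chunk types can be summarized as a small lookup. The sketch below uses only the mime types named on this page; the chunk_kinds_for function and its extract_images flag are hypothetical stand-ins for whatever configuration the system actually exposes.

```python
from typing import Tuple

TEXT_TYPES = {"text/plain", "text/html", "application/json"}
RECORD_TYPES = {"text/csv", "application/x-ndjson"}
MIXED_TYPES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}


def chunk_kinds_for(mime_type: str, extract_images: bool = True) -> Tuple[str, ...]:
    """Return the chunk kinds a file of this mime type would be expected to produce."""
    if mime_type in RECORD_TYPES:
        return ("record",)
    if mime_type in MIXED_TYPES:
        # Image extraction is best-effort and assumed here to be configurable.
        return ("text", "image") if extract_images else ("text",)
    if mime_type in TEXT_TYPES:
        return ("text",)
    # Formats not listed on this page: this sketch makes no claim.
    return ()
```

For example, chunk_kinds_for("text/csv") yields only record chunks, while a PDF can yield both text and image chunks depending on configuration.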
What you get after chunking
Each produced chunk includes:
- the chunk content (text, image tile, or record block)
- a reference back to where it came from (a “span”)
- metadata you can use for filtering and display (name, source id, mime type)
This allows vector search to return precise results while still linking back to the original file.
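One way to picture a produced chunk is as a small record carrying content, a span, and metadata. The shape below is an assumption for illustration: the page does not specify how a span is encoded, so character offsets are used here, and the Span, Chunk, filter_by_mime, and cite names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    source_id: str  # which original file the chunk came from
    start: int      # assumed: character offset where the chunk begins
    end: int        # assumed: character offset where the chunk ends


@dataclass(frozen=True)
class Chunk:
    content: str
    span: Span
    name: str
    mime_type: str


def filter_by_mime(chunks: List[Chunk], mime_type: str) -> List[Chunk]:
    """Keep only chunks whose source file has the given mime type."""
    return [c for c in chunks if c.mime_type == mime_type]


def cite(chunk: Chunk) -> str:
    """Build a display label that points back into the original file."""
    return f"{chunk.name} [{chunk.span.start}:{chunk.span.end}]"


results = [
    Chunk("Refunds are issued within 30 days.", Span("doc-42", 0, 34),
          "handbook.pdf", "application/pdf"),
    Chunk("1,Ada", Span("doc-43", 8, 13), "people.csv", "text/csv"),
]

pdf_hits = filter_by_mime(results, "application/pdf")
label = cite(pdf_hits[0])
```

Here the metadata drives the filter and the span drives the citation label, which is the division of roles the list above describes.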
Practical guidance
Recommended defaults
For general knowledge base ingestion:
- Always set Meta.mime_type if you know it (e.g., application/pdf, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
- Use the default text chunk policy for extracted text
- Use Image Whole by default for extracted images, and switch to tiles only for large/detail-heavy images
File vs Text vs Image
Use File when:
- your input is a document/container (PDF/DOCX/PPTX)
- you want extraction + chunking handled automatically
Use Text when:
- you already have the exact text and don’t want extraction
Use Image when:
- your input is a true image and you want visual chunking
Mental model
- DataKind answers: “What is this input?” → a file/document container
- Extraction answers: “What content can we get out of it?” → text, records, images (format-dependent)
- Chunking answers: “How do we split the extracted content?” → text policy / image policy / record boundaries
- Spans answer: “Where did this chunk come from?” → a reference back to the original file