
File (DataKind)

Use DATA_KIND_FILE when your input is a document or file container that should be read and extracted before embedding.

This is the most common mode for ingesting real-world files such as PDFs and Office documents.


What this means

Choosing File tells the system:

“This input is a file. Please load it, extract any useful content (text and/or images when supported), then chunk and embed the result.”

Common examples:

  • PDF documents
  • DOCX / PPTX
  • HTML files
  • CSV and line-delimited JSON (NDJSON)
  • ZIP files and other archives (when supported)

How to send File inputs

File inputs typically use one of these locators:

  • UrlLocator (file downloadable via HTTP/HTTPS)
  • S3Locator (file stored in object storage)
  • BytesLocator (small file bytes provided inline)

Optionally, also set:

  • Input.kind_hint = DATA_KIND_FILE

Why the hint helps:

  • Faster routing (no type detection)
  • More predictable behavior

If you are sending bytes, setting either kind_hint or Meta.mime_type (e.g. application/pdf) is strongly recommended.
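As a rough illustration, a File input built from inline bytes might look like the sketch below. The names Input, Meta, BytesLocator, UrlLocator, kind_hint, mime_type, and DATA_KIND_FILE come from this page, but the Python dataclasses are hypothetical stand-ins rather than the real client types, and every other field name (url, data, name, locator, meta) is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Union


class DataKind(Enum):
    # Only the value named on this page; the real enum may have more members.
    DATA_KIND_FILE = "DATA_KIND_FILE"


@dataclass
class UrlLocator:
    url: str  # hypothetical field name


@dataclass
class BytesLocator:
    data: bytes  # hypothetical field name


@dataclass
class Meta:
    mime_type: Optional[str] = None
    name: Optional[str] = None  # hypothetical field name


@dataclass
class Input:
    locator: Union[UrlLocator, BytesLocator]
    kind_hint: Optional[DataKind] = None
    meta: Meta = field(default_factory=Meta)


def file_input_from_bytes(data: bytes, mime_type: str, name: str) -> Input:
    """Build a File input for inline bytes, setting both the hint and the MIME type."""
    return Input(
        locator=BytesLocator(data=data),
        kind_hint=DataKind.DATA_KIND_FILE,
        meta=Meta(mime_type=mime_type, name=name),
    )


pdf_input = file_input_from_bytes(b"%PDF-1.7 ...", "application/pdf", "report.pdf")
```

Setting both values is redundant by design in this sketch: with inline bytes there is no URL or object key to infer a type from, so the hint and the MIME type are the only signals available.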


What happens in File mode

In File mode, the system typically performs:

  1. Load the file from the locator
  2. Inspect it (mainly using MIME type and basic detection)
  3. Extract content (e.g., text from PDFs/DOCX, records from CSV/NDJSON)
  4. Chunk the extracted content
  5. Embed each produced chunk

Not all file types support every extraction method. Extraction is best-effort and depends on the file format.
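The five steps can be pictured as a small pipeline. The following is a simplified sketch, not the service's implementation: all function names are hypothetical, the loader only accepts inline bytes, only text-like types are extracted, chunking is a naive fixed-size split, and fake_embed is a placeholder for a real embedding model.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def load(locator: bytes) -> bytes:
    """Step 1: this sketch only supports inline bytes."""
    return locator


def inspect_type(data: bytes, mime_type: Optional[str]) -> str:
    """Step 2: trust a declared MIME type, otherwise fall back to basic sniffing."""
    if mime_type:
        return mime_type
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    return "text/plain"


def extract(data: bytes, mime_type: str) -> str:
    """Step 3: extract text; only text-like types are handled here."""
    if mime_type.startswith("text/") or mime_type == "application/json":
        return data.decode("utf-8", errors="replace")
    raise ValueError(f"no extractor in this sketch for {mime_type}")


def chunk(text: str, max_chars: int = 200) -> List[str]:
    """Step 4: naive fixed-size split (real chunk policies are more careful)."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def fake_embed(chunk_text: str) -> Vector:
    """Step 5 placeholder: a real system would call an embedding model."""
    return [float(len(chunk_text))]


def run_file_pipeline(
    locator: bytes,
    mime_type: Optional[str],
    embed: Callable[[str], Vector],
) -> List[Tuple[str, Vector]]:
    data = load(locator)
    detected = inspect_type(data, mime_type)
    text = extract(data, detected)
    return [(c, embed(c)) for c in chunk(text)]


result = run_file_pipeline(b"a" * 450, None, fake_embed)
```

Here 450 bytes of plain text with no declared MIME type fall through to the text/plain default and come back as three chunks, each paired with its placeholder vector.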


Common extraction paths

Text-like documents

Some file types are naturally text-first and are chunked like text:

  • text/plain
  • text/html
  • application/json (treated as text unless you enable structured JSON handling)

These typically result in text chunks.


Record files (CSV / NDJSON)

Some file types are naturally record-based and are chunked by records:

  • text/csv
  • application/x-ndjson

These typically result in record-oriented chunks, which helps keep rows/records intact during retrieval.
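To illustrate what keeping records intact means, here is a minimal sketch that groups CSV rows into chunks without ever splitting a row. The function name and the records_per_chunk parameter are assumptions made for illustration; the actual record boundaries used by the service are not specified on this page.

```python
import csv
import io
from typing import Dict, List


def chunk_csv_records(
    csv_text: str, records_per_chunk: int = 2
) -> List[List[Dict[str, str]]]:
    """Group parsed CSV rows into chunks; a row is never split across chunks."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return [
        rows[i:i + records_per_chunk]
        for i in range(0, len(rows), records_per_chunk)
    ]


sample = "id,name\n1,Ada\n2,Grace\n3,Linus\n"
chunks = chunk_csv_records(sample)
```

Because each row is parsed with its header before grouping, every record in a chunk carries its own field names, so a retrieved chunk stays readable without the rest of the file.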


Documents with embedded media (PDF / DOCX)

Some file types can contain a mix of content.

For example:

  • DOCX: extracted text + embedded images (when available)
  • PDF: extracted text + embedded images (best-effort)

Depending on configuration, this can produce a mix of:

  • text chunks (for extracted text)
  • image chunks (for extracted images)

Chunking techniques in File mode

File mode can produce different chunk types depending on what gets extracted.

  • Extracted text is chunked using the text chunk policy (see Text docs).
  • Extracted images (when supported) are chunked using the image chunk policy (whole/tiles).
  • Record files are chunked using record-friendly boundaries to keep records intact.

What you get after chunking

Each produced chunk includes:

  • the chunk content (text, image tile, or record block)
  • a reference back to where it came from (a “span”)
  • metadata you can use for filtering and display (name, source id, MIME type)

This allows vector search to return precise results while still linking back to the original file.
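One way to picture a chunk and its span is the sketch below. The Span and Chunk classes are hypothetical: this page does not define the span format, so character offsets into the extracted text are assumed here purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    start: int  # assumed: character offset into the extracted text
    end: int    # assumed: exclusive end offset


@dataclass(frozen=True)
class Chunk:
    content: str
    span: Span
    name: str
    source_id: str
    mime_type: str


def chunk_with_spans(
    text: str, size: int, name: str, source_id: str, mime_type: str
) -> List[Chunk]:
    """Split text into fixed-size chunks, recording where each one came from."""
    return [
        Chunk(
            content=text[i:i + size],
            span=Span(i, min(i + size, len(text))),
            name=name,
            source_id=source_id,
            mime_type=mime_type,
        )
        for i in range(0, len(text), size)
    ]


text = "File mode extracts content, then chunks and embeds it."
chunks = chunk_with_spans(text, 20, "guide.txt", "src-1", "text/plain")

# A search hit can be mapped back to its exact location in the source text.
hit = chunks[1]
recovered = text[hit.span.start:hit.span.end]
```

The metadata fields mirror the ones listed above (name, source id, MIME type), so a result can be filtered or displayed without reopening the original file.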


Practical guidance

For general knowledge base ingestion:

  • Always set Meta.mime_type if you know it (e.g., application/pdf, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
  • Use the default text chunk policy for extracted text
  • Use Image Whole by default for extracted images, and switch to tiles only for large/detail-heavy images

File vs Text vs Image

Use File when:

  • your input is a document/container (PDF/DOCX/PPTX)
  • you want extraction + chunking handled automatically

Use Text when:

  • you already have the exact text and don’t want extraction

Use Image when:

  • your input is a true image and you want visual chunking
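The decision above can be expressed as a small rule of thumb. This is a hypothetical helper for client code, not part of any API described here, and it returns plain labels rather than real enum values.

```python
def choose_kind(mime_type: str, have_exact_text: bool = False) -> str:
    """Pick File, Text, or Image following the guidance in this section."""
    if have_exact_text:
        # The text is already final, so extraction would add nothing.
        return "Text"
    if mime_type.startswith("image/"):
        # A true image: let visual chunking handle it.
        return "Image"
    # Documents and containers (PDF, DOCX, PPTX, ...) go through extraction.
    return "File"


kind_for_pdf = choose_kind("application/pdf")
kind_for_png = choose_kind("image/png")
kind_for_notes = choose_kind("text/plain", have_exact_text=True)
```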

Mental model

  • DataKind answers: “What is this input?” → a file/document container
  • Extraction answers: “What content can we get out of it?” → text, records, images (format-dependent)
  • Chunking answers: “How do we split the extracted content?” → text policy / image policy / record boundaries
  • Spans answer: “Where did this chunk come from?” → a reference back to the original file