Text (DataKind)

Use DATA_KIND_TEXT when your input is already plain text.

This mode skips file extraction (PDF parsing, OCR, media decoding) and goes straight to chunking and embedding, which makes it the fastest ingestion path.

What this means

Choosing Text tells the service:

“Here is the exact text I want to index or search. Please split it into chunks and embed it.”

Common examples:

Customer support chat transcripts
Notes / wiki paragraphs
Markdown content
Logs or JSON-as-text
Search queries (often paired with EMBED_TASK_QUERY)

How to send Text inputs

Text inputs typically use TextLocator and (optionally) set:

Input.kind_hint = DATA_KIND_TEXT

Why the hint helps:

Faster routing (no type detection)
More predictable behavior

If your content is already a string, prefer TextLocator over BytesLocator.
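As an illustration, the sketch below shows what building a Text input might look like. It is not the real client SDK: the classes are local stand-ins, and the field names `locator`, `text`, and `name`, the `DATA_KIND_UNSPECIFIED` member, and the `make_text_input` helper are all assumptions made for the example. Only `TextLocator`, `Input.kind_hint`, and `DATA_KIND_TEXT` come from this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataKind(Enum):
    # DATA_KIND_UNSPECIFIED is a hypothetical "no hint" value.
    DATA_KIND_UNSPECIFIED = 0
    DATA_KIND_TEXT = 1


@dataclass
class TextLocator:
    # Hypothetical field name: the plain text to ingest.
    text: str


@dataclass
class Input:
    # Hypothetical layout; check your SDK for the real field names.
    locator: TextLocator
    kind_hint: DataKind = DataKind.DATA_KIND_UNSPECIFIED
    name: Optional[str] = None


def make_text_input(text: str, name: Optional[str] = None) -> Input:
    """Wrap an in-memory string as a Text input with an explicit kind hint."""
    return Input(
        locator=TextLocator(text=text),
        kind_hint=DataKind.DATA_KIND_TEXT,
        name=name,
    )


transcript = "Customer: my order is late.\nAgent: let me check that for you."
text_input = make_text_input(transcript, name="support-chat")
print(text_input.kind_hint.name)
```

Setting the hint explicitly means the service does not have to detect the input type, which is the routing benefit described above.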

Chunking technique for Text

Chunking splits a long piece of text into smaller parts so that:

each chunk fits within model limits
search results point to the right region of the original text
meaning is preserved near chunk boundaries

For text, VNG uses a window-based chunking approach with optional overlap.

Key chunking options

These options are part of your text chunk policy (names may differ depending on the client SDK):

`max_chars`

Maximum size of each text chunk (in characters).

Larger chunks preserve more context.
Smaller chunks improve precision but can fragment meaning.

`overlap_chars`

How much content is repeated between consecutive chunks.

Overlap helps prevent losing meaning when an important sentence sits at the boundary between two chunks.
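To get a feel for how overlap affects the number of chunks, the sketch below estimates the chunk count for fixed-size windows. It assumes each new window starts `max_chars - overlap_chars` characters after the previous one and ignores boundary preferences, so treat the result as a rough estimate rather than the service's exact behavior. The function name `estimate_chunk_count` is hypothetical.

```python
def estimate_chunk_count(text_len: int, max_chars: int, overlap_chars: int) -> int:
    """Estimate how many fixed-size windows cover a text of text_len characters.

    Assumption: each window advances by (max_chars - overlap_chars) characters
    and no boundary preference shortens any window.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be in [0, max_chars)")
    if text_len <= 0:
        return 0
    if text_len <= max_chars:
        return 1
    stride = max_chars - overlap_chars
    remaining = text_len - max_chars
    # Integer ceiling division: one extra window per stride still uncovered.
    return 1 + (remaining + stride - 1) // stride


print(estimate_chunk_count(1000, 400, 0))   # 3
print(estimate_chunk_count(1000, 400, 50))  # 3
print(estimate_chunk_count(1000, 400, 200)) # 4
```

Larger overlap shrinks the stride, so the same text produces more chunks and therefore more embeddings to compute and store.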

`boundary`

A preference for where chunk boundaries should occur (best-effort):

Any: split wherever needed
Newline: prefer splitting at line breaks
Paragraph: prefer splitting at blank lines

Boundary preferences are applied when possible. If no suitable boundary exists within the chunk size limit, the system still splits at the limit, so no chunk exceeds `max_chars`.
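The three options can be pictured together with a small window-based chunker. This is a minimal sketch of the general technique, not VNG's actual implementation: the function name `chunk_spans`, the lowercase boundary values, and the choice to split right after the separator are assumptions for illustration.

```python
from typing import List, Tuple


def chunk_spans(
    text: str,
    max_chars: int,
    overlap_chars: int = 0,
    boundary: str = "any",
) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for each chunk, end exclusive."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be in [0, max_chars)")
    separators = {"any": None, "newline": "\n", "paragraph": "\n\n"}
    if boundary not in separators:
        raise ValueError(f"unknown boundary: {boundary!r}")
    sep = separators[boundary]

    spans: List[Tuple[int, int]] = []
    n = len(text)
    start = 0
    while start < n:
        # Hard limit: a window never exceeds max_chars.
        end = min(start + max_chars, n)
        if end < n and sep is not None:
            # Best effort: prefer the last separator inside the window.
            cut = text.rfind(sep, start, end)
            # Accept it only if the next window would still move forward.
            if cut != -1 and (cut + len(sep) - start) > overlap_chars:
                end = cut + len(sep)
        spans.append((start, end))
        if end == n:
            break
        # Step back by the overlap so consecutive chunks share content.
        start = end - overlap_chars
    return spans


doc = "aaaa\n\nbbbb\n\ncccc"
for s, e in chunk_spans(doc, max_chars=8, boundary="paragraph"):
    print((s, e), repr(doc[s:e]))
```

When the text contains no separator inside a window, the loop falls back to cutting at `max_chars`, which mirrors the best-effort behavior described above.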

What you get after chunking

Each chunk is treated as an independent unit for embedding and retrieval.

Conceptually, every produced chunk includes:

the chunk text
a reference back to where it came from in the original text (a “span”)
metadata you can use for filtering and display (name, source id, mime type)

This allows you to retrieve a chunk from vector search and still know exactly which part of the original text it corresponds to.
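The sketch below illustrates the idea of a chunk record that carries its span and metadata. The `Span` and `Chunk` classes, the use of character offsets, and the metadata keys `name`, `source_id`, and `mime_type` are assumptions chosen for the example; the real response shape may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Span:
    start: int  # inclusive character offset into the original text
    end: int    # exclusive character offset


@dataclass
class Chunk:
    text: str
    span: Span
    metadata: Dict[str, str] = field(default_factory=dict)


def build_chunks(
    source: str,
    spans: List[Tuple[int, int]],
    metadata: Dict[str, str],
) -> List[Chunk]:
    """Turn (start, end) offsets into chunk records that remember their origin."""
    return [
        Chunk(text=source[s:e], span=Span(s, e), metadata=dict(metadata))
        for s, e in spans
    ]


def resolve(source: str, chunk: Chunk) -> str:
    """Recover the original region a retrieved chunk points to."""
    return source[chunk.span.start:chunk.span.end]


source = "first para\n\nsecond para"
chunks = build_chunks(
    source,
    spans=[(0, 12), (12, 23)],
    metadata={"name": "notes", "source_id": "doc-1", "mime_type": "text/plain"},
)
hit = chunks[1]  # pretend this chunk came back from vector search
print(hit.span, repr(resolve(source, hit)))
```

Because the span is stored with the chunk, a search hit can be highlighted or expanded in the original text without re-running the chunker.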

Practical guidance

Recommended defaults

For a general knowledge base:

Use a moderately large `max_chars` (roughly paragraph-sized)
Use a small but non-zero `overlap_chars`
Use Paragraph for documents, Newline for logs/markdown

When to use Text vs File

Use Text when:

you already have the content as a string
you want exact control over what is embedded

Use File when:

your input is a document container (PDF/DOCX/PPTX)
you need extraction (parsing/OCR)

Mental model

DataKind answers: “What is this input?” → plain text
Chunking answers: “How do we split it?” → windows + overlap + boundary preference
Spans answer: “Where did this chunk come from?” → a reference back to the original text