Text (DataKind)
Use DATA_KIND_TEXT when your input is already plain text.
This mode skips file extraction (PDF parsing, OCR, media decoding) and goes straight to chunking + embedding, which makes it the fastest ingestion path.
What this resembles
Choosing Text tells the service:
“Here is the exact text I want to index or search. Please split it into chunks and embed it.”
Common examples:
- Customer support chat transcripts
- Notes / wiki paragraphs
- Markdown content
- Logs or JSON-as-text
- Search queries (often paired with EMBED_TASK_QUERY)
How to send Text inputs
Text inputs typically use TextLocator and (optionally) set:
Input.kind_hint = DATA_KIND_TEXT
Why the hint helps:
- Faster routing (no type detection)
- More predictable behavior
If your content is already a string, prefer TextLocator over BytesLocator.
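As a rough illustration of the shape described above, the sketch below models a text input with local stand-in classes. It does not use the real client SDK: the class definitions, the `text` and `locator` field names, the enum value, and the `make_text_input` helper are all assumptions made for this example. Only the names TextLocator, Input, kind_hint, and DATA_KIND_TEXT come from this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataKind(Enum):
    # Stand-in enum; the string value is an assumption for illustration.
    DATA_KIND_TEXT = "DATA_KIND_TEXT"


@dataclass
class TextLocator:
    # Hypothetical field name: the plain text to chunk and embed.
    text: str


@dataclass
class Input:
    # Hypothetical field name for the locator; kind_hint is optional.
    locator: TextLocator
    kind_hint: Optional[DataKind] = None


def make_text_input(text: str) -> Input:
    """Wrap an in-memory string and set the hint so no type detection is needed."""
    return Input(locator=TextLocator(text=text), kind_hint=DataKind.DATA_KIND_TEXT)


request_input = make_text_input("How do I reset my password?")
```

Check your SDK's reference for the actual constructor and field names before adapting this.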
Chunking technique for Text
Chunking splits a long piece of text into smaller parts so that:
- each chunk fits within model limits
- search results point to the right region of the original text
- meaning is preserved near chunk boundaries
For text, VNG uses a window-based chunking approach with optional overlap.
Key chunking options
These options are part of your text chunk policy (names may differ depending on the client SDK):
max_chars
Maximum size of each text chunk (in characters).
- Larger chunks preserve more context.
- Smaller chunks improve precision but can fragment meaning.
overlap_chars
How much content is repeated between consecutive chunks.
Overlap helps prevent losing meaning when an important sentence sits at the boundary between two chunks.
boundary
A preference for where chunk boundaries should occur (best-effort):
- Any: split wherever needed
- Newline: prefer splitting at line breaks
- Paragraph: prefer splitting at blank lines
Boundary preferences are applied when possible. If no good boundary exists within the chunk size limit, the system will still split safely.
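To make the interaction between the three options concrete, here is a minimal sketch of window-based chunking with overlap and a best-effort boundary preference. It is not VNG's implementation. The `chunk_text` function, the `Chunk` type, and the lowercase boundary names are hypothetical, and falling back to a hard cut at max_chars is an assumption about what "split safely" means.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    text: str
    start: int  # inclusive character offset into the original text
    end: int    # exclusive character offset into the original text


# Separator searched for under each boundary preference (None = no preference).
_SEPARATORS = {"any": None, "newline": "\n", "paragraph": "\n\n"}


def chunk_text(text: str, max_chars: int, overlap_chars: int = 0,
               boundary: str = "any") -> List[Chunk]:
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be in [0, max_chars)")
    sep = _SEPARATORS[boundary]

    chunks: List[Chunk] = []
    start = 0
    while start < len(text):
        # Default: a hard cut at the window size.
        end = min(start + max_chars, len(text))
        if sep is not None and end < len(text):
            # Best effort: cut just after the last separator in the window.
            # The search skips the overlapped prefix so that every chunk
            # contributes at least some new text.
            cut = text.rfind(sep, start + overlap_chars, end)
            if cut != -1:
                end = cut + len(sep)
        chunks.append(Chunk(text[start:end], start, end))
        if end == len(text):
            break
        # The next window re-reads the last overlap_chars characters.
        start = end - overlap_chars
    return chunks
```

For example, `chunk_text("abcdefghij", max_chars=4, overlap_chars=1)` yields the texts `"abcd"`, `"defg"`, `"ghij"`: each chunk repeats the last character of the previous one. With `boundary="paragraph"`, a window that contains a blank line is cut right after it instead of at the size limit.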
What you get after chunking
Each chunk is treated as an independent unit for embedding and retrieval.
Conceptually, every produced chunk includes:
- the chunk text
- a reference back to where it came from in the original text (a “span”)
- metadata you can use for filtering and display (name, source id, mime type)
This allows you to retrieve a chunk from vector search and still know exactly which part of the original text it corresponds to.
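The sketch below illustrates why the span matters: given a retrieved chunk, you can slice the original text to recover the chunk itself or a little surrounding context for display. The `Span` and `ChunkRecord` types, the half-open character-offset convention, the metadata key names, and the `context_around` helper are assumptions for illustration, not the service's actual response schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into the original text
    end: int    # exclusive character offset into the original text


@dataclass(frozen=True)
class ChunkRecord:
    text: str
    span: Span
    metadata: Dict[str, str] = field(default_factory=dict)


def context_around(original: str, span: Span, margin: int = 20) -> str:
    """Return the chunk plus up to `margin` characters on either side."""
    lo = max(0, span.start - margin)
    hi = min(len(original), span.end + margin)
    return original[lo:hi]


original = "Reset your password from the login page. Then check your email for the link."
hit_text = "check your email"
hit_start = original.index(hit_text)

record = ChunkRecord(
    text=hit_text,
    span=Span(hit_start, hit_start + len(hit_text)),
    # Hypothetical key names for the name, source id, and mime type.
    metadata={"name": "faq.txt", "source_id": "doc-1", "mime_type": "text/plain"},
)

# The span maps the chunk back to its exact location in the original text.
assert original[record.span.start:record.span.end] == record.text
snippet = context_around(original, record.span, margin=10)
```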
Practical guidance
Recommended defaults
For a general knowledge base:
- Use a moderately large max_chars (paragraph-sized)
- Use a small but non-zero overlap_chars
- Use Paragraph for documents, Newline for logs/markdown
When to use Text vs File
Use Text when:
- you already have the content as a string
- you want exact control over what is embedded
Use File when:
- your input is a document container (PDF/DOCX/PPTX)
- you need extraction (parsing/OCR)
Mental model
- DataKind answers: “What is this input?” → plain text
- Chunking answers: “How do we split it?” → windows + overlap + boundary preference
- Spans answer: “Where did this chunk come from?” → a reference back to the original text