Text (DataKind)

Use DATA_KIND_TEXT when your input is already plain text.

This mode skips file extraction (PDF parsing, OCR, media decoding) and goes straight to chunking and embedding, which makes it the fastest ingestion path.


What this resembles

Choosing Text is equivalent to telling the service:

“Here is the exact text I want to index or search. Please split it into chunks and embed it.”

Common examples:

  • Customer support chat transcripts
  • Notes / wiki paragraphs
  • Markdown content
  • Logs or JSON-as-text
  • Search queries (often paired with EMBED_TASK_QUERY)

How to send Text inputs

Text inputs typically use TextLocator and (optionally) set:

  • Input.kind_hint = DATA_KIND_TEXT

Why the hint helps:

  • Faster routing (no type detection)
  • More predictable behavior (the input is always handled as text rather than guessed from its content)

If your content is already a string, prefer TextLocator over BytesLocator.
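
As a rough illustration of the shape described above, the sketch below models a text input in plain Python. The `TextLocator` and `Input` classes here are local stand-ins that only borrow the names used on this page; their fields (`text`, `locator`) and the `make_text_input` helper are assumptions for illustration, not the actual SDK types.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for the concepts named on this page.
# The real SDK types and field names may differ.
@dataclass
class TextLocator:
    text: str  # the exact string to chunk and embed


@dataclass
class Input:
    locator: TextLocator
    kind_hint: Optional[str] = None  # e.g. "DATA_KIND_TEXT"


def make_text_input(text: str, with_hint: bool = True) -> Input:
    """Wrap a string as a text input, optionally setting the kind hint."""
    hint = "DATA_KIND_TEXT" if with_hint else None
    return Input(locator=TextLocator(text=text), kind_hint=hint)


if __name__ == "__main__":
    item = make_text_input("Customer: my invoice is missing.")
    print(item.kind_hint)
```

Setting the hint is optional in this sketch, mirroring the "(optionally)" above; leaving it unset would presumably fall back to type detection.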


Chunking technique for Text

Chunking splits a long piece of text into smaller parts so that:

  • each chunk fits within model limits
  • search results point to the right region of the original text
  • meaning is preserved near chunk boundaries

For text, VNG uses a window-based chunking approach with optional overlap.


Key chunking options

These options are part of your text chunk policy (names may differ depending on the client SDK):

max_chars

Maximum size of each text chunk (in characters).

  • Larger chunks preserve more context.
  • Smaller chunks improve precision but can fragment meaning.

overlap_chars

How much content (in characters) is repeated between consecutive chunks.

Overlap helps prevent losing meaning when an important sentence sits at the boundary between two chunks.
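
To make the relationship between max_chars and overlap_chars concrete, here is a small sketch of fixed-size windows. It assumes each window starts `max_chars - overlap_chars` characters after the previous one and ignores boundary preferences; the `window_starts` helper is hypothetical and not part of any SDK.

```python
from typing import List


def window_starts(text_len: int, max_chars: int, overlap_chars: int) -> List[int]:
    """Start offsets of fixed-size windows that overlap by overlap_chars."""
    if max_chars <= 0 or not 0 <= overlap_chars < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap_chars < max_chars")
    if text_len <= 0:
        return []
    stride = max_chars - overlap_chars  # how far each new window advances
    starts = []
    start = 0
    while True:
        starts.append(start)
        if start + max_chars >= text_len:  # this window reaches the end
            break
        start += stride
    return starts


if __name__ == "__main__":
    # 2,500 characters, 1,000-character chunks, 200 characters of overlap
    print(window_starts(2500, 1000, 200))  # [0, 800, 1600]
```

Under these assumptions, a larger overlap shrinks the stride, so the same text produces more chunks and more embedding work.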

boundary

A preference for where chunk boundaries should occur (best-effort):

  • Any: split wherever needed
  • Newline: prefer splitting at line breaks
  • Paragraph: prefer splitting at blank lines

Boundary preferences are applied when possible. If no suitable boundary exists within the chunk size limit, the system still splits the text so that no chunk exceeds the limit.
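
The three options can be combined in a short sketch of window-based chunking with a best-effort boundary. This is one plausible reading of the behavior described above, not the service's actual algorithm: the `chunk_spans` function, the lowercase boundary names, and the choice to cut just after the last separator in the window are all assumptions.

```python
from typing import List, Tuple

# Assumed mapping from boundary preference to the separator it looks for.
SEPARATORS = {"any": None, "newline": "\n", "paragraph": "\n\n"}


def chunk_spans(text: str, max_chars: int, overlap_chars: int = 0,
                boundary: str = "any") -> List[Tuple[int, int]]:
    """Return (start, end) character spans of the chunks, end exclusive."""
    if max_chars <= 0 or not 0 <= overlap_chars < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap_chars < max_chars")
    sep = SEPARATORS[boundary]
    spans = []
    start, n = 0, len(text)
    while start < n:
        end = min(start + max_chars, n)
        if end < n and sep is not None:
            # Prefer to cut right after the last separator in the window.
            # Searching past the overlap region guarantees forward progress.
            cut = text.rfind(sep, start + overlap_chars + 1, end)
            if cut != -1:
                end = cut + len(sep)
            # Otherwise fall back to a hard cut at max_chars.
        spans.append((start, end))
        if end == n:
            break
        start = end - overlap_chars  # repeat the tail of the previous chunk
    return spans


if __name__ == "__main__":
    doc = "aaaa\n\nbbbb\n\ncccc"
    print(chunk_spans(doc, max_chars=8, boundary="paragraph"))
    # [(0, 6), (6, 12), (12, 16)]
```

With `boundary="any"` the same function degrades to plain fixed windows, which matches the "split wherever needed" description.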


What you get after chunking

Each chunk is treated as an independent unit for embedding and retrieval.

Conceptually, every produced chunk includes:

  • the chunk text
  • a reference back to where it came from in the original text (a “span”)
  • metadata you can use for filtering and display (name, source id, mime type)

This allows you to retrieve a chunk from vector search and still know exactly which part of the original text it corresponds to.
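
A minimal sketch of that idea: if a span is stored as a pair of character offsets, slicing the original text with the span reproduces the chunk exactly. The `Chunk` record and its field names are hypothetical, and the splitter uses simple fixed windows with no overlap to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical chunk record; the real chunk schema may differ.
@dataclass
class Chunk:
    text: str
    start: int  # span start offset in the original text (inclusive)
    end: int    # span end offset in the original text (exclusive)
    metadata: Dict[str, str] = field(default_factory=dict)


def make_chunks(source: str, max_chars: int,
                metadata: Dict[str, str]) -> List[Chunk]:
    """Split source into fixed windows, recording each chunk's span."""
    chunks = []
    for start in range(0, len(source), max_chars):
        end = min(start + max_chars, len(source))
        chunks.append(Chunk(source[start:end], start, end, dict(metadata)))
    return chunks


if __name__ == "__main__":
    source = "Reset your password from the account settings page."
    meta = {"name": "faq.md", "mime_type": "text/markdown"}
    for chunk in make_chunks(source, 20, meta):
        # The span always maps back to the exact region of the original.
        assert source[chunk.start:chunk.end] == chunk.text
        print(chunk.start, chunk.end, repr(chunk.text))
```

Storing offsets rather than only the chunk text is what lets a search result be highlighted in, or linked back to, the original document.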


Practical guidance

For a general knowledge base:

  • Use moderately large max_chars (paragraph-sized)
  • Use a small but non-zero overlap_chars
  • Set boundary to Paragraph for documents and to Newline for logs or Markdown

When to use Text vs File

Use Text when:

  • you already have the content as a string
  • you want exact control over what is embedded

Use File when:

  • your input is a document container (PDF/DOCX/PPTX)
  • you need extraction (parsing/OCR)

Mental model

  • DataKind answers: “What is this input?” → plain text
  • Chunking answers: “How do we split it?” → windows + overlap + boundary preference
  • Spans answer: “Where did this chunk come from?” → a reference back to the original text