
Audio (DataKind) & Chunking

Use DATA_KIND_AUDIO when your input is an audio file (MP3/WAV/M4A, etc.).

Currently, VNG supports only audio-based ingestion for audio inputs.

Audio-based means the audio is embedded as audio, not as text. In practice, VNG converts the audio into an audio representation (for example, time-windowed audio features) and embeds that representation, so you can search and match sounds even when there is no speech.

Planned (not yet supported):

  • Transcript-based: convert speech to text, then embed the text chunks
  • Hybrid: combine transcript chunks with audio-derived chunks

What this resembles

Choosing Audio amounts to telling VNG:

“This input is audio. Please break it into time segments, embed them, and keep references so I can jump back to the exact timestamp.”

Common examples:

  • Call recordings
  • Podcasts and interviews
  • Voice notes
  • Meeting audio

How to send Audio inputs

Audio inputs typically use one of these locators:

  • UrlLocator (audio downloadable via HTTP/HTTPS)
  • S3Locator (audio stored in object storage)
  • BytesLocator (small audio bytes provided inline)

Optionally, also set:

  • Input.kind_hint = DATA_KIND_AUDIO

Why the hint helps:

  • Faster routing (no type detection)
  • More predictable behavior

If you are sending bytes, setting either kind_hint or Meta.mime_type (e.g. audio/mpeg, audio/wav) is strongly recommended.
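
As an illustration of the recommendation above, here is a minimal Python sketch that builds an inline-bytes audio input with both the kind hint and a MIME type set. The names UrlLocator-style classes, kind_hint, and mime_type come from this page, but the field layout, the string form of DATA_KIND_AUDIO, and the helper audio_input_from_bytes are assumptions made for illustration, not the actual VNG client API.

```python
import os
from dataclasses import dataclass
from typing import Optional

# Enum name taken from the docs; representing it as a string is an assumption.
DATA_KIND_AUDIO = "DATA_KIND_AUDIO"

# Hypothetical extension-to-MIME table for the formats mentioned on this page.
AUDIO_MIME_BY_EXT = {".mp3": "audio/mpeg", ".wav": "audio/wav", ".m4a": "audio/mp4"}


@dataclass
class BytesLocator:
    data: bytes  # small audio payload provided inline


@dataclass
class Meta:
    name: str
    mime_type: Optional[str] = None


@dataclass
class Input:
    locator: BytesLocator
    meta: Meta
    kind_hint: Optional[str] = None


def audio_input_from_bytes(name: str, data: bytes) -> Input:
    """Build an inline audio input with kind_hint and, if known, mime_type."""
    ext = os.path.splitext(name)[1].lower()
    mime = AUDIO_MIME_BY_EXT.get(ext)  # None when the extension is unknown
    return Input(
        locator=BytesLocator(data),
        meta=Meta(name=name, mime_type=mime),
        kind_hint=DATA_KIND_AUDIO,
    )
```

Setting both values is redundant on purpose: if the file extension is unknown and mime_type stays empty, the kind hint still tells the service to treat the bytes as audio.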


Audio chunking technique

Chunking for audio determines how the audio is split into segments for embedding.

Most commonly, the system uses time windows (e.g., 0–15s, 15–30s), optionally with overlap.


Supported chunking techniques

Time windows (AUDIO_CHUNK_TECHNIQUE_TIME_WINDOWS)

Splits audio into fixed-length time segments.

Key options (typical):

  • window_ms: length of each segment
  • overlap_ms: optional overlap between segments
  • max_windows: optional cap to control cost
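
To make the interaction of these options concrete, the sketch below computes the (start, end) spans that fixed-length windowing would produce. It is an illustration only: the option names match the ones listed above, but the function time_windows, the truncation of the final window, and the rule that overlap must be smaller than the window are assumptions, not a description of VNG's internal behavior.

```python
from typing import List, Optional, Tuple


def time_windows(
    duration_ms: int,
    window_ms: int,
    overlap_ms: int = 0,
    max_windows: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans covering an audio of duration_ms."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    if not 0 <= overlap_ms < window_ms:
        raise ValueError("overlap_ms must be in [0, window_ms)")

    step = window_ms - overlap_ms  # distance between consecutive window starts
    spans: List[Tuple[int, int]] = []
    start = 0
    while start < duration_ms:
        if max_windows is not None and len(spans) >= max_windows:
            break  # cap reached: the rest of the audio is not chunked
        end = min(start + window_ms, duration_ms)  # last window may be shorter
        spans.append((start, end))
        if end >= duration_ms:
            break  # the audio is fully covered
        start += step
    return spans
```

For a 40-second file with window_ms=15000 and no overlap, this yields three spans (0 to 15 s, 15 to 30 s, 30 to 40 s). Adding overlap_ms=5000 shortens the step to 10 s and yields four spans for the same file.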

Use time windows when:

  • you want predictable chunk sizes
  • you plan to show search results with timestamps

What you get after chunking

Each produced audio chunk includes:

  • the chunk content (audio-derived representation)
  • a time span (start and end) that locates the segment in the original audio
  • metadata you can use for filtering and display (name, source id, mime type)

This allows vector search to return a segment and still let you navigate back to the source audio.
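
A small sketch of how a search hit could be turned into a display label with a jump-to timestamp. The AudioChunk record and its field names (source_id, start_ms, end_ms) are hypothetical stand-ins for whatever the real chunk schema exposes; only the idea of a time span plus name, source id, and MIME type metadata comes from this page.

```python
from dataclasses import dataclass


@dataclass
class AudioChunk:
    # Hypothetical field names; the real schema may differ.
    source_id: str
    name: str
    mime_type: str
    start_ms: int
    end_ms: int


def format_timestamp(ms: int) -> str:
    """Format a millisecond offset as HH:MM:SS (sub-second part dropped)."""
    total_seconds = ms // 1000
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def describe_hit(chunk: AudioChunk) -> str:
    """Build a label such as 'standup.mp3 [00:01:30 - 00:01:45]'."""
    start = format_timestamp(chunk.start_ms)
    end = format_timestamp(chunk.end_ms)
    return f"{chunk.name} [{start} - {end}]"
```

A player can then seek to start_ms in the source audio identified by source_id.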


Practical guidance

For most knowledge bases:

  • Start with Time windows for predictable coverage
  • Use a small overlap if important context is often split at boundaries
  • Set max_windows for predictable cost on long recordings
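
To see how these settings affect cost before ingesting a long recording, the number of windows can be estimated up front. The sketch below assumes that the final partial window counts as one chunk and that cost scales with the number of windows; both are assumptions, and estimate_windows is a hypothetical helper, not part of VNG.

```python
from typing import Optional


def estimate_windows(
    duration_ms: int,
    window_ms: int,
    overlap_ms: int = 0,
    max_windows: Optional[int] = None,
) -> int:
    """Estimate how many time windows a recording would produce."""
    if window_ms <= 0 or not 0 <= overlap_ms < window_ms:
        raise ValueError("need window_ms > 0 and 0 <= overlap_ms < window_ms")
    if duration_ms <= 0:
        return 0
    if duration_ms <= window_ms:
        count = 1
    else:
        step = window_ms - overlap_ms
        remaining = duration_ms - window_ms
        # One full first window, then one more per step (ceiling division).
        count = 1 + -(-remaining // step)
    if max_windows is not None:
        count = min(count, max_windows)
    return count


# One-hour recording, 15 s windows, 1 s overlap.
one_hour_ms = 60 * 60 * 1000
uncapped = estimate_windows(one_hour_ms, 15_000, overlap_ms=1_000)
capped = estimate_windows(one_hour_ms, 15_000, overlap_ms=1_000, max_windows=100)
print(uncapped, capped)  # prints "258 100"
```

Note the trade-off: with the cap of 100 windows in this example, only roughly the first 23 minutes of the hour would be chunked, so a cap buys predictable cost at the expense of coverage.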

Audio vs Video vs File

Use Audio when:

  • the input is an audio-only file (MP3/WAV/M4A)
  • you want results aligned to timestamps

Use Video when:

  • the input is a video file and you want frame + timestamp retrieval

Use File when:

  • the input is a document container (PDF/DOCX/PPTX)
  • you want text extraction + text chunking

Mental model

  • DataKind answers: “What is this input?” → audio
  • Chunking answers: “How do we split it?” → time windows
  • Time spans answer: “When did this occur?” → a timestamp/range in the original audio