Audio (DataKind) & Chunking
Use DATA_KIND_AUDIO when your input is an audio file (MP3/WAV/M4A, etc.).
Currently, VNG supports only audio-based ingestion for audio inputs.
Audio-based means VNG embeds the audio as audio, not as text. In practice, VNG converts audio into an audio representation (for example, time-windowed audio features) and embeds that representation, so you can search and match sounds even when there is no speech.
Planned (not yet supported):
- Transcript-based: convert speech to text, then embed the text chunks
- Hybrid: combine transcript chunks with audio-derived chunks
What this resembles
Choosing Audio basically means:
“This input is audio. Please break it into time segments, embed them, and keep references so I can jump back to the exact timestamp.”
Common examples:
- Call recordings
- Podcasts and interviews
- Voice notes
- Meeting audio
How to send Audio inputs
Audio inputs typically use one of these locators:
- UrlLocator (audio downloadable via HTTP/HTTPS)
- S3Locator (audio stored in object storage)
- BytesLocator (small audio bytes provided inline)
And (optionally) set:
Input.kind_hint = DATA_KIND_AUDIO
Why the hint helps:
- Faster routing (no type detection)
- More predictable behavior
If you are sending bytes, setting either kind_hint or Meta.mime_type (e.g. audio/mpeg, audio/wav) is strongly recommended.
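As a rough illustration of the guidance above, the sketch below builds two audio inputs: one from a URL and one from inline bytes. The dataclasses are hypothetical stand-ins that only mirror the names used on this page (UrlLocator, BytesLocator, Input.kind_hint, Meta.mime_type); they are not the real VNG client types, and the field layout is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical stand-ins modeled on the names in this page.
# The real VNG client types and field layout may differ.
DATA_KIND_AUDIO = "DATA_KIND_AUDIO"


@dataclass
class UrlLocator:
    url: str


@dataclass
class BytesLocator:
    data: bytes


@dataclass
class Meta:
    name: Optional[str] = None
    mime_type: Optional[str] = None


@dataclass
class Input:
    locator: Union[UrlLocator, BytesLocator]
    kind_hint: Optional[str] = None
    meta: Optional[Meta] = None


def audio_from_url(url: str) -> Input:
    """Audio reachable over HTTP/HTTPS, with the kind hint set explicitly."""
    return Input(locator=UrlLocator(url=url), kind_hint=DATA_KIND_AUDIO)


def audio_from_bytes(data: bytes, mime_type: str, name: str) -> Input:
    """Inline audio bytes; sets both the hint and the MIME type."""
    if not mime_type.startswith("audio/"):
        raise ValueError(f"expected an audio/* MIME type, got {mime_type!r}")
    return Input(
        locator=BytesLocator(data=data),
        kind_hint=DATA_KIND_AUDIO,
        meta=Meta(name=name, mime_type=mime_type),
    )
```

Setting both the hint and the MIME type for inline bytes is more than the page requires (either one is recommended), but it leaves nothing for type detection to guess.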
Audio chunking technique
Chunking for audio determines how the audio is split into segments for embedding.
Most commonly, the system uses time windows (e.g., 0–15s, 15–30s), optionally with overlap.
Supported chunking techniques
Time windows (AUDIO_CHUNK_TECHNIQUE_TIME_WINDOWS)
Splits audio into fixed-length time segments.
Key options (typical):
- window_ms: length of each segment
- overlap_ms: optional overlap between segments
- max_windows: optional cap to control cost
Use time windows when:
- you want predictable chunk sizes
- you plan to show search results with timestamps
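To make the time-window options concrete, here is a minimal sketch of how window_ms, overlap_ms, and max_windows could translate into time spans. It is an illustration only: the function name is hypothetical, and details such as how the final partial window is handled are assumptions, not a description of how VNG splits audio internally.

```python
from typing import List, Optional, Tuple


def time_windows(
    duration_ms: int,
    window_ms: int,
    overlap_ms: int = 0,
    max_windows: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans covering a recording.

    Assumption: the last window is truncated at the end of the audio
    rather than padded or dropped.
    """
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    if not 0 <= overlap_ms < window_ms:
        raise ValueError("overlap_ms must be in [0, window_ms)")

    step = window_ms - overlap_ms  # distance between consecutive window starts
    spans: List[Tuple[int, int]] = []
    start = 0
    while start < duration_ms:
        if max_windows is not None and len(spans) >= max_windows:
            break  # cap reached: the remaining audio is not chunked
        end = min(start + window_ms, duration_ms)
        spans.append((start, end))
        if end >= duration_ms:
            break  # this window already reaches the end of the audio
        start += step
    return spans
```

For a 30-second clip with 15-second windows and no overlap, this yields the two spans 0 to 15 s and 15 to 30 s. With an overlap, each window starts window_ms minus overlap_ms after the previous one, so the same audio produces more windows.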
What you get after chunking
Each produced audio chunk includes:
- the chunk content (audio-derived representation)
- a time span that points to the moment in the original audio
- metadata you can use for filtering and display (name, source id, mime type)
This allows vector search to return a segment and still let you navigate back to the source audio.
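The sketch below shows one way a search result could be turned into a timestamp label for display. The AudioChunkHit record and its field names are hypothetical; they only mirror the items listed above (time span plus name, source id, and mime type), and the actual result shape returned by VNG may differ.

```python
from dataclasses import dataclass


@dataclass
class AudioChunkHit:
    # Hypothetical result record; field names are assumptions.
    source_id: str
    name: str
    mime_type: str
    start_ms: int
    end_ms: int
    score: float


def format_timestamp(ms: int) -> str:
    """Format a millisecond offset as HH:MM:SS (sub-second part dropped)."""
    total_seconds = ms // 1000
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def describe_hit(hit: AudioChunkHit) -> str:
    """Build a display label that points back into the source audio."""
    span = f"{format_timestamp(hit.start_ms)} - {format_timestamp(hit.end_ms)}"
    return f"{hit.name} [{span}]"
```

A player UI can then seek to start_ms in the original file identified by source_id.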
Practical guidance
Recommended defaults
For most knowledge bases:
- Start with Time windows for predictable coverage
- Use a small overlap if important context is often split at boundaries
- Set max_windows for predictable cost on long recordings
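When picking a value for max_windows, it can help to estimate how many windows a recording would produce without a cap. The helper below is a back-of-the-envelope sketch; its name is hypothetical, and it assumes windows start every window_ms minus overlap_ms and that a final partial window still counts as one window.

```python
def estimate_window_count(duration_ms: int, window_ms: int, overlap_ms: int = 0) -> int:
    """Estimate the number of time windows for a recording (no cap applied)."""
    if window_ms <= 0 or not 0 <= overlap_ms < window_ms:
        raise ValueError("need window_ms > 0 and 0 <= overlap_ms < window_ms")
    if duration_ms <= 0:
        return 0
    if duration_ms <= window_ms:
        return 1
    step = window_ms - overlap_ms      # distance between window starts
    remaining = duration_ms - window_ms  # audio left after the first window
    # Integer ceiling division: one extra window per started step.
    return (remaining + step - 1) // step + 1


# Example: a 1-hour recording, 15 s windows, 1 s overlap.
one_hour_windows = estimate_window_count(3_600_000, 15_000, 1_000)
```

Under these assumptions, a one-hour recording with 15-second windows and a 1-second overlap comes to 258 windows, so a max_windows value below that would leave the end of the recording unchunked.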
Audio vs Video vs File
Use Audio when:
- the input is an audio-only file (MP3/WAV/M4A)
- you want results aligned to timestamps
Use Video when:
- the input is a video file and you want frame + timestamp retrieval
Use File when:
- the input is a document container (PDF/DOCX/PPTX)
- you want text extraction + text chunking
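If you route uploads in your own code, the choice above can be sketched as a lookup on the file extension. Only DATA_KIND_AUDIO appears on this page; the DATA_KIND_VIDEO and DATA_KIND_FILE names, the video extensions, and the helper itself are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional

# Audio and document extensions come from this page; the video
# extensions and the VIDEO/FILE constant names are assumptions.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a"}
VIDEO_EXTENSIONS = {".mp4", ".mov"}
FILE_EXTENSIONS = {".pdf", ".docx", ".pptx"}


def kind_hint_for(filename: str) -> Optional[str]:
    """Pick a kind hint from the file extension, or None if unknown."""
    ext = PurePosixPath(filename).suffix.lower()
    if ext in AUDIO_EXTENSIONS:
        return "DATA_KIND_AUDIO"
    if ext in VIDEO_EXTENSIONS:
        return "DATA_KIND_VIDEO"  # assumed name
    if ext in FILE_EXTENSIONS:
        return "DATA_KIND_FILE"  # assumed name
    return None  # no hint: leave the decision to type detection
```

Returning None for unknown extensions keeps the hint optional, which matches the page: the hint speeds up routing but is not required.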
Mental model
- DataKind answers: “What is this input?” → audio
- Chunking answers: “How do we split it?” → time windows
- Time spans answer: “When did this occur?” → a timestamp/range in the original audio