Hey, I'm Sagar, co-founder of VideoSDK.

I'm beyond excited to share what we've been building: VideoSDK Real-Time AI Agents. Today, voice is becoming the new UI.
We expect agents to feel human: to understand us, respond instantly, and work seamlessly across web, mobile, and even telephony. But to achieve this, developers have to stitch together STT, LLM, and TTS, glued with HTTP endpoints and a prayer.
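For the curious, here is roughly what that hand-stitched cascade looks like. Every endpoint URL and payload shape below is a placeholder, not any specific vendor's API; this is a minimal sketch of the pattern, not working integration code.

```python
# A deliberately naive STT -> LLM -> TTS cascade, the kind of glue code
# described above. All endpoints and payload shapes are hypothetical.
import requests

STT_URL = "https://stt.example.com/transcribe"   # placeholder
LLM_URL = "https://llm.example.com/complete"     # placeholder
TTS_URL = "https://tts.example.com/synthesize"   # placeholder

def handle_utterance(audio_bytes: bytes) -> bytes:
    # 1. Speech-to-text: one blocking HTTP round trip.
    text = requests.post(STT_URL, data=audio_bytes, timeout=10).json()["text"]
    # 2. LLM completion: a second round trip; latency stacks.
    reply = requests.post(LLM_URL, json={"prompt": text}, timeout=30).json()["reply"]
    # 3. Text-to-speech: a third round trip before the user hears anything.
    return requests.post(TTS_URL, json={"text": reply}, timeout=30).content
```

Each stage blocks on the one before it, so response latency is the sum of three network hops plus model time, with no streaming, barge-in, or turn detection anywhere in the loop.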
This most often results in agents that sound robotic, hallucinate, and fail in production environments without observability. So we built something to solve that.
Now, we are open-sourcing it!
Here’s what it offers:
- Global WebRTC infra with <80ms latency
- Native turn detection, VAD, and noise suppression
- Modular pipelines for STT, LLM, TTS, and avatars, with real-time model switching (see the sketch after this list)
- Built-in RAG + memory for grounding and hallucination resistance (a toy illustration appears further below)
- SDKs for web, mobile, Unity, IoT, and telephony — no glue code needed
- Agent Cloud to scale infinitely with one-click deployments — or self-host with full control
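The modular-pipelines bullet is easiest to picture in code. Below is a minimal sketch of the idea of swappable stages; the class and method names are my own illustrative assumptions, not the actual VideoSDK API, so check the repo for the real interfaces.

```python
# Sketch of a modular voice pipeline with swappable stages. Names are
# illustrative assumptions, NOT the actual VideoSDK API.
from typing import Protocol

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class VoicePipeline:
    """Composes pluggable STT/LLM/TTS stages behind one interface."""

    def __init__(self, stt: STT, llm: LLM, tts: TTS) -> None:
        self.stt, self.llm, self.tts = stt, llm, tts

    def swap_llm(self, llm: LLM) -> None:
        # "Real-time model switching" reduces to replacing one stage
        # without tearing down the transport or losing session state.
        self.llm = llm

    def respond(self, audio: bytes) -> bytes:
        return self.tts.synthesize(self.llm.complete(self.stt.transcribe(audio)))
```

Because each stage only depends on a small interface, you can swap an LLM mid-session (for cost, latency, or capability) while the transport and conversation state stay put.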
Think of it like moving from a walkie-talkie to a modern cell tower that handles thousands of calls.
VideoSDK gives you the infrastructure to build voice agents that actually work in the real world, at scale.
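As for the RAG + memory bullet above, here is a toy illustration of the grounding idea: retrieve relevant facts, fold in recent conversation memory, and constrain the model to that context. The string-similarity retrieval stands in for a real embedding search, and none of this is the framework's actual implementation.

```python
# Toy grounding sketch: retrieval + short-term memory -> constrained prompt.
# String similarity stands in for real embedding search; illustrative only.
from difflib import SequenceMatcher

KNOWLEDGE = [
    "VideoSDK Agents is open source.",
    "The SDK targets web, mobile, Unity, IoT, and telephony.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    scored = sorted(
        KNOWLEDGE,
        key=lambda doc: SequenceMatcher(None, query, doc).ratio(),
        reverse=True,
    )
    return scored[:k]

def grounded_prompt(user_text: str, memory: list[str]) -> str:
    # Facts from retrieval plus the last few conversation turns.
    context = "\n".join(retrieve(user_text) + memory[-3:])
    return f"Answer using only this context:\n{context}\n\nUser: {user_text}"
```

An agent that answers from retrieved context plus recent turns has far less room to invent facts than one prompted with the user's words alone.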
I'd love your thoughts and questions! Happy to dive deep into architecture, use cases, or crazy edge cases you've been struggling with.
Why would I use this vs @openai/openai-agents-python (or openai-agents-ts) - the new realtime agents SDKs?
There are so many AI frameworks out there that live & die so quickly that I am generally hard pressed to use any of these unless there is some killer feature I absolutely need.
Good! Is there a way to prompt the TTS output tone, like ElevenLabs? https://elevenlabs.io/docs/best-practices/prompting/eleven-v...
We are building AI companions, so tone prompting would be great.
Got to the HN front page and ignored the comments on the post...
and made three accounts to add more praise lol. This should be removed.
Is this running in production at any site/company?
Do you watermark the output to enable fraud detection?
No demo? No demo video? Nothing?
How does it compare to Chatterbox TTS? https://github.com/resemble-ai/chatterbox/