
Making Voice AI Agents More Human with TEN VAD and Turn Detection

As GPT-4o showed, conversational AI is turning the voice interfaces we imagined in movies like Her into reality. Voice interactions with AI are becoming richer, faster, and easier to use, making them a key part of building multimodal AI agents.

But there’s still a gap: voice agents still don’t behave quite like real humans.
In natural conversations, things like interruptions, pauses, and overlapping speech happen all the time. The user experience feels off when AI responses come too early, too late, or not at all.

In real-world conversations, how you pause or interrupt carries a lot of meaning, whether it signals politeness, hesitation, or confidence. It’s not just about what is said but how it is said. For voice agents to feel truly human, they need to do more than “hear” and “reply” correctly: they need to listen, understand, and respond naturally, with full awareness of context.

To make voice interaction with AI more human, we built two new state-of-the-art models: TEN Voice Activity Detection (VAD) and TEN Turn Detection. Both are built to make voice agents feel much more natural, drawing on Agora’s more than 10 years of deep research in real-time voice communication and ultra-low-latency streaming. Supported by Agora and the community and available for anyone to use, the new models are key parts of the open-source conversational AI TEN ecosystem.

  • TEN VAD is a lightweight, pre-trained, deep-learning-based Voice Activity Detection (VAD) model with low latency and high accuracy. It detects whether a human voice is present in a given audio frame.
  • TEN Turn Detection is an intelligent turn detection model designed specifically for full-duplex voice communication (allowing for overlapping speech, like human conversation) between humans and AI agents. It can detect natural turn-taking cues and enable contextually aware interruptions in conversations.

Developers can use TEN VAD and TEN Turn Detection separately or combine both models to build a voice agent with a human-like conversational experience.

TEN VAD: Handle speech with higher accuracy and lower cost

TEN VAD is a lightweight, low-latency, deep-learning-based VAD model. It runs in front of the Speech-to-Text (STT) stage, before voice input reaches the large language model, detecting frames that contain human speech and filtering out the rest. What it does is simple but powerful:

  • It accurately detects human speech in audio frames.
  • It filters out non-human sounds (background noise, silence, etc.).

By doing this, it not only makes downstream Speech-to-Text (STT) results more accurate but also cuts STT costs significantly, because voiceless audio never enters expensive processing pipelines.
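Conceptually, the gating step works like this: a per-frame speech check decides which frames ever reach STT, and everything else is dropped. The sketch below is illustrative only; `energy_vad` is a crude stand-in for TEN VAD's per-frame inference, and all names and thresholds are assumptions rather than the library's actual API.

```python
from typing import Callable, List, Tuple

Frame = List[float]  # one hop of PCM samples, e.g. 16 ms of audio at 16 kHz

def gate_frames(frames: List[Frame],
                is_speech: Callable[[Frame], bool]) -> Tuple[List[Frame], float]:
    """Keep only frames the VAD flags as speech; report the fraction of traffic saved."""
    kept = [f for f in frames if is_speech(f)]
    saved = 1.0 - len(kept) / len(frames) if frames else 0.0
    return kept, saved

def energy_vad(frame: Frame, threshold: float = 0.01) -> bool:
    """Stand-in VAD: a crude mean-energy threshold. In a real agent this
    callable would wrap TEN VAD's per-frame model inference instead."""
    return sum(x * x for x in frame) / len(frame) > threshold

speech = [[0.5] * 256] * 3   # loud frames: flagged as speech
silence = [[0.0] * 256] * 7  # silent frames: filtered out before STT
kept, saved = gate_frames(speech + silence, energy_vad)
print(len(kept), round(saved, 2))  # 3 frames kept; 70% of audio never reaches STT
```

Only the `kept` frames are forwarded downstream, which is where both the accuracy gain and the cost reduction come from.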

VAD isn’t optional if you care about turn-taking. Accurate turn detection relies heavily on reliable VAD as a foundation.

Performance Comparison:

Compared against popular VADs such as WebRTC Pitch VAD and Silero VAD, TEN VAD outperforms both on the TEN VAD Test Sample, an open dataset collected from diverse scenarios with frame-by-frame manual annotations as ground truth.

TEN VAD also leads on latency: it rapidly detects speech-to-non-speech transitions, while Silero VAD suffers a delay of several hundred milliseconds, increasing end-to-end latency in human-agent interaction systems.
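One simple way to quantify that transition delay in a benchmark, assuming frame-level ground-truth labels (as in the TEN VAD Test Sample) and a fixed hop size, is to measure how many frames after the true end of speech the VAD keeps reporting speech. This is a sketch of the methodology, not the project's benchmarking code:

```python
def transition_delay(truth, pred, hop_ms=16.0):
    """Delay (ms) between the true end of speech and the VAD's first
    non-speech flag after it. truth/pred are per-frame booleans."""
    end = max(i for i, t in enumerate(truth) if t)  # last truly-speech frame
    for i in range(end + 1, len(pred)):
        if not pred[i]:
            return (i - end - 1) * hop_ms
    return float("inf")  # VAD never released within the clip

truth = [True] * 10 + [False] * 10
fast  = [True] * 10 + [False] * 10  # releases immediately: 0 ms delay
slow  = [True] * 15 + [False] * 5   # releases 5 frames late: 80 ms delay
print(transition_delay(truth, fast), transition_delay(truth, slow))  # 0.0 80.0
```

A delay of several hundred milliseconds here translates directly into the agent waiting that much longer before it can begin responding.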

The TEN VAD Test Sample, with manually labeled frame-level VAD annotations, is available for integration and testing with just one click, so community developers can build and benchmark VAD models easily.

Real-world problem solving:
In one real-world user case, measurements show that using TEN VAD reduced audio traffic by 62%.

Try out TEN VAD and start building on Hugging Face and GitHub

TEN Turn Detection: Empower agents with natural turn-taking and interruption handling

TEN Turn Detection tackles one of the trickiest parts of human-AI conversation: figuring out when someone is done speaking. Built specifically for dynamic, real-time conversations between humans and AI agents, it allows the AI to distinguish between a mid-sentence pause and the end of a question or statement. If an agent jumps in too early or waits too long, the conversation feels unnatural.

TEN Turn Detection enables full-duplex interaction with AI agents, making conversations more natural and human by detecting turn-taking signals in real time.

How it works:

  • It looks at conversational context (what’s being said)
  • It picks up linguistic patterns that hint at whether a user is still thinking or has finished speaking

The goal is to enable voice agents to understand when to listen and when to speak, so conversation can flow more naturally.
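In practice this amounts to classifying each user utterance into a turn state and mapping that state to an agent action. The sketch below uses three state labels that, to our understanding, follow the TEN Turn Detection model card ("finished", "unfinished", "wait"); treat the labels, the stub classifier, and all function names as illustrative assumptions, since the real model is a neural classifier rather than these keyword heuristics.

```python
# Agent policy: map each detected turn state to an action.
ACTIONS = {
    "finished":   "respond",         # the user has completed their turn
    "unfinished": "keep_listening",  # mid-sentence pause: do not barge in
    "wait":       "stay_silent",     # the user asked the agent to hold off
}

def agent_action(state: str) -> str:
    return ACTIONS.get(state, "keep_listening")

def classify_turn(utterance: str) -> str:
    """Stub classifier standing in for the real model: a couple of
    linguistic-pattern heuristics, for illustration only."""
    text = utterance.strip().lower()
    if text.endswith(("?", ".", "!")):
        return "finished"      # sentence-final punctuation suggests a complete turn
    if text.endswith(("and", "but", "so", "um", ",")):
        return "unfinished"    # trailing connectives or fillers suggest more to come
    return "unfinished"

print(agent_action(classify_turn("What's the weather today?")))      # respond
print(agent_action(classify_turn("I was thinking that maybe, um")))  # keep_listening
```

The value of a dedicated model is precisely that real speech rarely comes with clean punctuation; context and phrasing have to carry the signal instead.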

The TEN Turn Detection model is open source and available to all voice agent builders in the community, with support for both English and Chinese.

Performance Comparison:

We benchmarked TEN Turn Detection with other models on a multi-scenario dataset. Here are the results:

Multi-scenario dataset results

Try out TEN Turn Detection and start building on Hugging Face and GitHub

Why TEN VAD and TEN Turn Detection?

When developers combine TEN VAD and TEN Turn Detection, they unlock a better way to build voice agents:

  • High quality: Both models deliver ultra-low latency and high accuracy, built on more than 10 years of deep research and industry know-how.
  • More natural conversations: TEN Turn Detection and TEN VAD let voice agents respond like a real human: waiting when they should, speaking when it is their turn, and handling interruptions gracefully, all with ultra-low latency.
  • Lower cost: With TEN VAD filtering out non-speech audio, far less data passes through expensive Speech-to-Text services. A real-world user case shows a significant cut in total cost when both models are used together.
  • Easy to use: Both models can be used as extensions or plugins for the TEN Framework, one of the most widely adopted voice agent frameworks. For those already using the TEN Framework, the new models are plug-and-play. For those looking for a framework with stronger speech and interruption handling, one with best-in-class VAD and turn detection built in is worth considering.
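How the two models fit together can be sketched as a single loop: VAD gates the audio, STT transcribes only the kept frames, and turn detection decides whether the agent should reply yet. Everything below is hypothetical glue code with stub components, not the TEN Framework's actual extension API.

```python
from typing import Callable, List, Optional

def run_pipeline(frames: List[List[float]],
                 is_speech: Callable,   # TEN VAD's role: per-frame speech check
                 stt: Callable,         # any STT backend over the kept frames
                 turn_state: Callable,  # TEN Turn Detection's role on the transcript
                 ) -> Optional[str]:
    speech_frames = [f for f in frames if is_speech(f)]  # gate out non-speech
    if not speech_frames:
        return None                      # nothing worth transcribing
    transcript = stt(speech_frames)
    if turn_state(transcript) == "finished":
        return transcript                # hand off to the LLM for a response
    return None                          # user is still talking; keep buffering

# Stub components for illustration:
out = run_pipeline(
    frames=[[0.4] * 256, [0.0] * 256],
    is_speech=lambda f: max(f) > 0.1,
    stt=lambda fs: "how are you?",
    turn_state=lambda t: "finished" if t.endswith("?") else "unfinished",
)
print(out)  # how are you?
```

The design point is that each stage cheapens the next: VAD shrinks what STT sees, and turn detection keeps the LLM from being invoked on half-finished thoughts.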

Both TEN VAD and TEN Turn Detection are designed to integrate seamlessly with the TEN Framework. Check out the demo video below to see the before-and-after differences of using TEN Turn Detection in TEN Agent (a conversational voice AI agent powered by TEN Framework).

You can run TEN VAD and TEN Turn Detection with the TEN Agent either on Hugging Face Spaces or locally on your own GPU.

Running on Hugging Face (Recommended for quick start)

  • Log into your Hugging Face account.
  • Visit our demo space: TEN Agent Demo on Hugging Face
  • In the top-right corner, open the Settings dropdown and select “Duplicate this Space” to deploy the full experience using your own Hugging Face-provided GPU.

Running Locally with Your Own GPU

  • Log into your Hugging Face account.
  • In the top-right corner of the demo space, open the Settings dropdown and select “Run locally”.
  • Follow the instructions in the Run TEN Framework Locally guide to get the full TEN stack running locally.

Now, as conversational AI takes off, make your voice agent truly human-like!

Stay tuned for any future TEN family changes or releases on

