There are already more than a million streamers worldwide. But with differing accents, speaking speeds, and specialized jargon, viewers can struggle to follow the content. Subtitles also become essential when streamers want to reach a global audience. LiveCap, an application created by developer Hakase Shojo, was built to solve exactly this problem: it generates subtitles in real time, supporting both Japanese and English, to help viewers follow along and bridge language gaps.
While real-time subtitling might look straightforward, it comes with significant technical challenges in practice:
- Subtitles often lag behind the streamer's voice.
- Speech-to-text accuracy is easily affected by the environment and speaking style.
- Silent segments get fed into the speech recognition model, wasting compute on audio that contains no speech.
To deliver a smoother, more reliable experience, LiveCap needed a better way to handle audio.

Evolution From Silence to Speech
Early versions of LiveCap relied on "silence detection", a technique that identified a pause in speech to determine the end of a sentence before sending the audio to the recognition model. The problem was the delay: subtitles could only begin generating after a pause, creating a frustrating gap between what the streamer said and what appeared on screen.
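To make the latency problem concrete, here is a minimal, illustrative sketch of energy-based silence detection (not LiveCap's actual code; the threshold and pause length are assumed values, and the audio is assumed to be a float NumPy array). Notice that a segment can only be emitted after `min_pause_ms` of silence has elapsed, which is exactly the delay viewers experienced:

```python
import numpy as np

def segment_on_silence(audio, sample_rate=16000, frame_ms=30,
                       energy_threshold=0.01, min_pause_ms=500):
    """Split audio at pauses: a segment is emitted only after a long
    enough run of low-energy frames, so transcription cannot start
    until the speaker has already stopped talking."""
    frame_len = int(sample_rate * frame_ms / 1000)
    min_pause_frames = min_pause_ms // frame_ms
    segments, start, silent_run = [], None, 0
    for i in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[i:i + frame_len]
        is_loud = np.sqrt(np.mean(frame ** 2)) > energy_threshold  # RMS energy
        if is_loud:
            start = i if start is None else start
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:  # pause confirmed: flush segment
                segments.append(audio[start:i])
                start, silent_run = None, 0
    if start is not None:
        segments.append(audio[start:])
    return segments
```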
The solution was Voice Activity Detection (VAD), a technology that continuously identifies human speech as it occurs, a far more efficient approach. But even here, not all tools are created equal.
Initially, LiveCap used Silero VAD, but it would often cut off the ends of sentences and produce unnatural, confusing transcripts. After multiple tests and comparisons, Hakase Shojo switched to the open-source project TEN VAD.
The results were remarkable. TEN VAD offered faster, more accurate detection and proved incredibly stable in Japanese environments. LiveCap fully replaced Silero VAD with TEN VAD, and the false detections dropped from a staggering 67% to under 5%.
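For developers curious what the switch looks like in practice, the sketch below follows the streaming usage shown in the TEN VAD project's README; the hop size and threshold are the README's defaults, the WAV file name is illustrative, and the input is assumed to be 16 kHz, 16-bit mono audio:

```python
import scipy.io.wavfile as wavfile
from ten_vad import TenVad  # pip install ten-vad

HOP_SIZE = 256    # 256 samples = 16 ms per frame at 16 kHz
THRESHOLD = 0.5   # frames scoring above this are flagged as speech

sr, pcm = wavfile.read("stream_chunk.wav")  # illustrative input file
vad = TenVad(HOP_SIZE, THRESHOLD)

# Streaming inference: process one hop-sized frame at a time
for i in range(pcm.shape[0] // HOP_SIZE):
    frame = pcm[i * HOP_SIZE:(i + 1) * HOP_SIZE]
    probability, is_speech = vad.process(frame)  # per-frame score and 0/1 flag
    print(f"frame {i}: p={probability:.2f} speech={is_speech}")
```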
How TEN VAD Empowered LiveCap
- More accurate speech detection: TEN VAD consistently and accurately detects human voice, even in challenging scenarios like the inflections of Japanese sentence endings, dramatically reducing false detections.
- Ultra-low latency: With its rapid response time, TEN VAD is a perfect fit for real-time applications. It sharply identifies speech start and end points, keeping subtitles almost perfectly in sync with the streamer’s voice and improving the viewing experience.
- Lightweight and resource-efficient: The model is compact and consumes minimal CPU and memory. By detecting and discarding silence and noise before transcription, it avoids wasting resources on irrelevant audio.
- Foundation for downstream tasks: LiveCap’s speech recognition model requires audio chunks under five seconds. TEN VAD helps by splitting longer speech into precise sub-segments, enabling more stable and accurate transcription in real time (a sketch of this chunking logic follows this list).
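The following sketch shows one way per-frame VAD flags can drive that sub-segmentation; the function name and inputs are hypothetical, assuming 0/1 speech flags like those TEN VAD produces:

```python
import numpy as np

def chunk_for_asr(frames, flags, hop_size=256, sample_rate=16000,
                  max_seconds=5.0):
    """Group consecutive speech frames into ASR-ready chunks, cutting at
    VAD-detected speech boundaries and never exceeding max_seconds.
    `frames`: list of hop-sized int16 arrays; `flags`: matching 0/1
    speech flags from the VAD (both hypothetical inputs)."""
    max_frames = int(max_seconds * sample_rate / hop_size)
    chunks, current = [], []
    for frame, is_speech in zip(frames, flags):
        if is_speech:
            current.append(frame)
            if len(current) >= max_frames:     # hard cap: flush before 5 s
                chunks.append(np.concatenate(current))
                current = []
        elif current:                          # speech ended: natural cut point
            chunks.append(np.concatenate(current))
            current = []
    if current:
        chunks.append(np.concatenate(current))
    return chunks
```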
"By integrating TEN VAD, LiveCap achieved much more natural transcripts in Japanese, reducing user frustration and increasing trust in the product during live usage. " Hakase shojo commented. He also noted that VAD-related technical details are rarely discussed.
But as this story shows, such seemingly simple technical details often hold the key to a product's performance. By openly sharing these behind-the-scenes stories, Hakase Shojo is not only providing streamers with a powerful tool but also offering fellow voice AI developers a valuable insight: choosing the right tool for the scenario is often the fastest way to solve a technical challenge.
Beyond Subtitles: A Foundational Technology
The power of TEN VAD extends far beyond live subtitling. Its benefits are applicable to a wide range of real-time voice scenarios:
- In AI customer service, it enables faster responses to customer inquiries.
- In AI tutoring, it can accurately detect even the briefest, most hesitant utterances from a user.
In short, TEN VAD is a foundational capability for building real-time voice applications—whether for live streaming, conversational AI, or voice agents.