Breaking Language Barriers in Real-Time with Voice AI

I’ve spent the last few months talking to teams building in the conversational AI space, and honestly, most conversations blur together after a while. Another ASR optimization here, another LLM wrapper there. But my conversation with Artem Kukharenko and Ivan Kuzin from Palabra hit different.

Maybe it’s because they started with a problem I recognize immediately — that helpless feeling when you’re in a country where your English does you no good, where hand gestures and Google Translate screenshots become your only lifeline. Or maybe it’s because they’re tackling something that feels less like an incremental improvement and more like they’re trying to reverse-engineer the Tower of Babel.

Real-time speech-to-speech translation. Not the kind where you speak, wait three seconds, and hear a robotic voice stumble through your sentence. The kind where latency drops low enough that two people can actually have a conversation across languages, where your voice — your actual voice with its cadence and tone — crosses linguistic boundaries in real-time.

The Problem That Started Everything

Artem put it simply: “We lived as digital nomads in different countries and faced problems with languages ourselves.” He’d built real-time computer vision systems before Palabra, had the ML chops, understood the stack. But understanding distributed systems architecture doesn’t help you order food in Finnish or negotiate an apartment lease in Vietnamese.

“It’s much easier to learn a programming language than a foreign language, as you can hear,” he told me, and I caught myself laughing at the self-deprecating honesty. We’ve all been there — fluent in five programming languages, struggling to ask for directions in one foreign language.

That personal frustration became their north star. They weren’t building translation tech because it was the hot AI domain of 2024. They built it because they needed it to exist.

The Misconception Nobody Talks About

Ivan brought up something that’s been bothering me too: “The most common misconception here is that people think that AI is something like a common translator tool, like we used to have before in the pre-AI era.”

He’s right. When most people think “translation,” they’re still mentally anchored to those clunky phrase books or early Google Translate, where you’d paste in a sentence and get back something grammatically questionable. They’re not thinking about systems that can predict what you’re about to say before you finish saying it, that understand context across an entire conversation, that preserve the emotional weight of your words as they cross language boundaries.

The gap between what people imagine and what’s actually possible now is staggering. We’re not just swapping words anymore. We’re conveying meaning, preserving intent, maintaining emotional context — all in real-time. It’s the difference between reading a translated novel and having someone whisper the translation in your ear as the author speaks, capturing every pause and inflection.

The Latency Problem Nobody’s Solved (Yet)

Let me get technical for a minute because this is where it gets interesting.

When you say “hello,” the word itself takes about 200 milliseconds to leave your mouth. The system then needs at least another 300 milliseconds to work out how that word connects to whatever comes next. But here’s where it gets complicated: a word that opens a sentence in one language may belong at the end of the sentence in another. The system has to produce the translation in the target language’s word order without waiting for the entire sentence to complete.

Palabra’s approach? They predict what’s coming next.

“The system does try to predict the words the speaker will say and thus decreases the latency,” Ivan explained. They’ve built everything in-house — full stack control means they’re not chaining together third-party APIs and hoping for the best. They use sentence splitters, prediction algorithms, their own data pipelines.
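To make that concrete, here’s a rough sketch of the “commit early” decision a streaming pipeline has to make: keep buffering while the target word order might still change, and flush a chunk to the translation model as soon as waiting longer would only add latency. This is my illustration, not Palabra’s code; the clause-boundary rule, the buffer cap, and the class name are all invented for the example.

```python
# Illustrative sketch of a streaming sentence splitter. The thresholds and the
# clause-boundary heuristic are invented for this example, not Palabra's logic.
from dataclasses import dataclass, field
from typing import List, Optional

CLAUSE_BOUNDARIES = {",", ".", "?", "!", ";"}
MAX_BUFFERED_WORDS = 6  # cap how long latency can grow while waiting for context


@dataclass
class StreamingSplitter:
    buffer: List[str] = field(default_factory=list)

    def push(self, token: str) -> Optional[str]:
        """Add one recognized word; return a chunk to translate now, or None to keep waiting."""
        self.buffer.append(token)
        at_boundary = token[-1] in CLAUSE_BOUNDARIES
        too_long = len(self.buffer) >= MAX_BUFFERED_WORDS
        if at_boundary or too_long:
            chunk = " ".join(self.buffer)
            self.buffer.clear()
            return chunk  # hand off to the translation model immediately
        return None  # keep waiting: the target word order may depend on what comes next


# Feed it partial ASR output word by word, as if it were arriving live.
splitter = StreamingSplitter()
for word in "well, I think we should probably order the salmon".split():
    chunk = splitter.push(word)
    if chunk:
        print("translate now:", chunk)
print("translate at end of turn:", " ".join(splitter.buffer))
```

The real system replaces that crude boundary rule with learned prediction of what the speaker is about to say, which is exactly the part Ivan is describing.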

Artem was refreshingly honest about where they stand: “We did great work optimizing it, but we still have room for improvement because we want our solution to work with zero latency on every language pair.”

Zero latency. On every language pair. That’s the kind of ambitious goal that makes you either roll your eyes or lean forward in your chair. I leaned forward.

Voice Cloning Across Languages Is Harder Than You Think

The technical challenge that really got my attention was voice cloning between languages. It’s relatively straightforward to clone a voice within a single language — English to English voice cloning is practically a solved problem at this point. But Chinese to English? In real-time? That’s a different beast entirely.

“It’s especially more challenging for languages with different meaning of intonations,” Artem explained. “Because in one language one intonation could mean excitement and in another language the same intonation will mean something else.”

Think about that for a second. Your voice carries so much more than just words — it carries emotion, emphasis, urgency, humor. A raised voice might signal anger in one culture and enthusiasm in another. The same tonal pattern that reads as a question in English might sound like a statement in Mandarin.

Most teams would solve this by breaking the problem into discrete steps: speech-to-text, text translation, text-to-speech. Use the best API for each component, chain them together, call it a day. Palabra deliberately chose the harder path.

“We decided to build our system in a more difficult way,” Artem said. “We have to train all the components and build different adapters between these components. But it gives us much more control over the whole pipeline.”

That control matters because the text-to-speech model needs audio features from the speech-to-text model — information that gets lost when you’re just passing text strings between disconnected APIs. The emotion, the prosody, the speaker’s identity — all of that has to flow through the entire pipeline or you end up with technically accurate translations that feel completely lifeless.
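Here’s a toy version of what that control buys you. The field names and the idea of a single prosody vector are my simplification for illustration, not Palabra’s actual interfaces, but the contrast is the point: in a chained-API setup only the text string survives each hop, while an integrated pipeline can pass the acoustic context all the way to the voice that speaks the translation.

```python
# Toy illustration of passing acoustic context between pipeline stages.
# Field names and values are invented; only the structure of the idea matters.
from dataclasses import dataclass, replace
from typing import List


@dataclass
class SpeechSegment:
    text: str
    speaker_embedding: List[float]  # identity features the TTS needs for voice cloning
    prosody: List[float]            # e.g. pitch/energy contour, pace, emphasis
    emotion: str                    # e.g. "excited", "neutral"


def translate_segment(segment: SpeechSegment, target_lang: str) -> SpeechSegment:
    """Pretend translation stage: the text changes, the acoustic context rides along."""
    fake_translations = {"es": "hola a todos"}
    return replace(segment, text=fake_translations.get(target_lang, segment.text))


# In a chained-API pipeline, only `text` would survive this hop. Here the
# downstream TTS still knows who was speaking and how they said it.
source = SpeechSegment(
    text="hello everyone",
    speaker_embedding=[0.12, -0.40, 0.07],
    prosody=[0.8, 0.6, 0.3],
    emotion="excited",
)
translated = translate_segment(source, "es")
print(translated.text, "|", translated.emotion, "|", translated.speaker_embedding)
```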

The Code-Switching Advantage

Here’s something I hadn’t considered before talking to Artem: in some ways, AI has a massive advantage over human interpreters when it comes to multilingual conversations.

Imagine you’re running a conference with ten different languages. You’d need at least ten human interpreters, and they’d all be translating through an intermediate language. Finnish to English, then English to Japanese. Every intermediary step loses a bit of meaning, adds a bit of latency, introduces potential errors.

“But for AI algorithm, it knows all different languages at the same time,” Artem explained. “So it can translate from any language to any language simultaneously.”

No intermediate steps. No telephone game where meaning degrades with each handoff. Just direct translation from source to target, regardless of the language pair.
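The back-of-the-envelope numbers make the point. For a ten-language event, direct human coverage means staffing dozens of language pairs, which is exactly why conferences pivot through one language instead; a model that translates any-to-any sidesteps the combinatorial problem entirely. The figures below are just arithmetic on the scenario Artem described, nothing more.

```python
# Quick arithmetic on the ten-language conference scenario.
languages = 10

directions = languages * (languages - 1)  # every source -> target direction
pairs = directions // 2                   # bilingual pairs needed for direct human coverage

print(f"{directions} translation directions, {pairs} language pairs to cover directly")
print("pivot model: roughly one interpreter per language, but every translation makes two hops")
```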

I pushed back a bit on this. “It’s not perfect right now and it makes mistakes,” Artem admitted. “And in some cases, human interpreters work with much better quality.”

“But not always available,” I countered. Exactly. You can’t always find a Finnish-to-Japanese interpreter. You definitely can’t afford to have ten interpreters on standby for every possible language combination at your conference. The AI doesn’t sleep, doesn’t need breaks, doesn’t charge by the hour.

Where This Actually Matters

Theory is great. Technical challenges are intellectually stimulating. But I always want to know: who’s using this in production, and what are they using it for?

Ivan rattled off the current use cases: live events (they had two just that day, one in Taiwan), broadcasting, social commerce platforms like Whatnot and TikTok’s shopping features. “Live streaming overall are the most popular use case,” he said.

The economics make sense. These are all scenarios where real-time communication across language barriers directly impacts revenue. A streamer who can speak to English and Chinese audiences simultaneously potentially doubles their addressable market. An auction platform that translates bids in real-time opens up to international buyers.

But the story Ivan shared about a sales pitch hit me harder than the business metrics.

A seller was trying to close a deal with a buyer who spoke a different language. They used Palabra’s real-time translation. “The deal happened when they started speaking on an emotional level, and that won them the deal,” Ivan said. “Previously they couldn’t do that because all the communication was lacking this emotional component.”

That’s what this technology is really about. Not replacing human connection but enabling it across boundaries that previously made it impossible.

Standing Out in a Crowded Market

Let’s be honest: the AI translation space is getting crowded. Google, Microsoft, Meta — all the big players have translation products. So how does a startup differentiate?

“We’re staying focused on one big problem, which is simultaneous interpretation,” Ivan explained. They’re not trying to be a general-purpose translation platform. They’re not building a consumer app for translating restaurant menus. They’re laser-focused on the specific, gnarly problem of real-time speech-to-speech translation where latency, emotion, and accuracy all matter equally.

That focus extends to their deployment model. They offer on-premise options for enterprises worried about privacy and data sovereignty. “It’s like healthcare, finance, education,” Ivan said. “Some of them just want to work on premise because there is an element of trust.”

Not every use case needs cloud deployment. Sometimes a bank wants your translation engine running on their infrastructure, processing their data, with zero external API calls. Palabra built for that from day one.

Benchmarking With Humans, Not Just Metrics

Most AI companies benchmark against other AI systems. WER scores, BLEU scores, academic datasets. Palabra does something different: they benchmark against human interpreters.

“That’s the quality level we want to aim and that’s why we’re hiring in-house linguists or we work with companies who provide us with linguists,” Ivan explained. “We see how their speech is structured and make sure that we are on the same level.”

I appreciate this approach. It’s easy to optimize for metrics that don’t actually correlate with user satisfaction. You can have a technically impressive WER score on clean audio and still produce translations that native speakers find awkward or unnatural. Testing against how actual professional interpreters structure their speech keeps you honest.
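For anyone who hasn’t computed one: WER is just word-level edit distance divided by the length of the reference, which is exactly why it says nothing about whether the output sounds natural. A minimal, textbook implementation (nothing vendor-specific here):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One dropped article: WER says 0.2, a native speaker says it sounds off.
print(word_error_rate("we closed the deal yesterday", "we closed deal yesterday"))
```

A single dropped word barely moves the metric, yet it’s precisely the kind of unnaturalness a professional interpreter would never produce, which is the gap human benchmarking is meant to catch.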

The Developer Experience

As someone who lives in the developer relations world, I had to ask about the integration experience. What does it actually look like to build Palabra into your application?

“We provide good documentation, SDKs, RESTful API, WebRTC implementation,” Ivan said. They support the major use cases — browser-based apps, mobile apps, the works. Custom implementations for specific needs. Vocabulary tuning for industry-specific terminology.

Artem emphasized the importance of customization: “For clients who build an international platform, we can create a custom model tailored to customers’ needs. We could adapt vocabulary, and so on.”

This makes sense when you think about it. Medical terminology doesn’t translate the same way as casual conversation. Legal language has its own requirements. An e-commerce platform needs product-specific vocabulary. Off-the-shelf models won’t cut it for specialized domains.
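For a sense of what integrating this kind of service tends to look like in code, here’s a hypothetical streaming client. The endpoint, auth-free handshake, message fields, and glossary parameter below are placeholders I made up to show the shape of the loop; they are not Palabra’s documented API, so treat this purely as a sketch of the pattern: configure a session, stream audio frames up, read translated output back.

```python
# Hypothetical streaming-translation client. The URL and message schema are
# placeholders for illustration only, not Palabra's documented API.
import json

import websockets  # pip install websockets


async def stream_translation(frames, source_lang="en", target_lang="zh", glossary=None):
    uri = "wss://api.example-translation-service.com/v1/stream"  # placeholder endpoint
    async with websockets.connect(uri) as ws:
        # Configure the session: language pair plus any domain-specific vocabulary.
        await ws.send(json.dumps({
            "type": "start",
            "source": source_lang,
            "target": target_lang,
            "glossary": glossary or {},  # e.g. product names for an e-commerce stream
        }))
        for frame in frames:             # frames: an iterable of raw PCM audio chunks
            await ws.send(frame)
            reply = await ws.recv()      # translated audio bytes or a partial transcript
            yield json.loads(reply) if isinstance(reply, str) else reply


# Consumed with `async for chunk in stream_translation(...)` inside an event loop,
# with frames typically captured from a WebRTC track or a microphone callback.
```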

The Five-Year Vision

I always like to end these conversations by pushing people to think bigger. Where is this all going?

“In five years, I guess, we’ll be seeing really great advancements,” Ivan said. “All the translations will be seamlessly integrated into all the phone calls, all the communication on the OS level.”

He’s describing a world where you start a video call and language selection is as natural as choosing dark mode. Where every piece of audio content on your phone is automatically available in your preferred language. Where the question isn’t “does this support translation?” but rather “wait, why doesn’t this translate?”

“We are moving humanity past this biblical issue where everyone will be able to speak to anyone,” Ivan continued. The Tower of Babel reference again, only this time the goal is to undo the confusion of tongues rather than repeat it.

Artem focused on the technical evolution: “In two years already, in five years it’ll be crazy. But I think we’ll see seamless translation of all the content you have on your phone.”

The timeline feels aggressive but not impossible. We’ve seen how fast things move in this space. GPT-3 to GPT-4 happened faster than anyone expected. Voice synthesis went from robotic to indistinguishable from humans in about 18 months. Translation could follow a similar trajectory.

The Unsupervised Learning Foundation

Since I can’t resist going deep on the ML fundamentals, I asked about unsupervised learning’s role in all this.

“Unsupervised learning already plays a huge, maybe the biggest role right now,” Artem said. “All the large models like GPT-like models are pre-trained with unsupervised learning. It’s true for text models, it’s true for speech models.”

The challenge isn’t the unsupervised learning itself — it’s the data pipeline. You can’t just scrape the internet and feed it to your models. “You need to preprocess it and you have to clean it,” Artem explained. “You need to clean it without human laborers because they are time-limited.”

This is the unglamorous part of AI that nobody talks about at conferences. Data cleaning. Pipeline engineering. Building systems that can filter out the garbage and keep the signal, all at scale, all automated. It’s not as sexy as talking about transformer architectures, but it’s just as critical to making these systems work in production.
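As a small illustration of what “cleaning without human laborers” means in practice, here’s a toy filter over clip metadata. The field names and thresholds are invented for the example; the real work is in the automated detectors (language ID, noise estimation, transcript alignment) that produce signals like these at scale, so that every filtering rule runs with no human in the loop.

```python
# Toy automated data-cleaning filter. Field names and thresholds are invented
# for illustration; the point is that every rule runs without human review.
def keep_clip(meta: dict, target_lang: str = "en") -> bool:
    return (
        1.0 <= meta["duration_sec"] <= 30.0      # drop fragments and monologues
        and meta["snr_db"] >= 15.0               # drop noisy recordings
        and meta["lang"] == target_lang
        and meta["lang_id_confidence"] >= 0.9    # drop clips the language detector isn't sure about
        and not meta["overlapping_speech"]       # drop crosstalk
    )


clips = [
    {"duration_sec": 7.2, "snr_db": 22.0, "lang": "en", "lang_id_confidence": 0.97, "overlapping_speech": False},
    {"duration_sec": 0.4, "snr_db": 30.0, "lang": "en", "lang_id_confidence": 0.99, "overlapping_speech": False},
    {"duration_sec": 12.0, "snr_db": 9.0, "lang": "en", "lang_id_confidence": 0.95, "overlapping_speech": True},
]
cleaned = [c for c in clips if keep_clip(c)]
print(f"kept {len(cleaned)} of {len(clips)} clips")
```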

What’s Next If Not Translation?

I threw them a curveball at the end: if you weren’t working on real-time translation, what voice AI domain would you bet on?

Ivan went with dubbing. “This is an existing and ever growing business and the amount of content grows exponentially. You know that Netflix films a lot and Amazon does as well.”

Artem chose dialog systems. “I see huge applications for them. The accuracy of dialog systems is very impressive right now, but there is still room for improvement.”

Both answers make sense. Dubbing has clear commercial value and scales naturally as content production explodes. Dialog systems unlock new interaction paradigms — voice interfaces that actually work, AI assistants that understand context, customer service that doesn’t make you want to throw your phone.

But my favorite moment from the whole conversation came when Artem mentioned one of their early clients: a virtual reality company doing conferences in VR. You could attend in VR, choose your language for conversations with other attendees, select a different language for watching the main presentation.

“The craziest thing is that the conference was about agriculture,” Artem added. “So it was something about cows, but in virtual reality and with a translation.”

Ivan immediately asked: “Were there any cows wearing VR glasses?”

That exchange captured something important about this whole space. We’re building technology that sounds like science fiction — real-time translation, virtual reality conferences, AI-powered voice cloning — and applying it to the most mundane, practical problems. Agriculture conferences. Sales calls. Live shopping streams.

Why This Matters For Developers

If you’re building anything with real-time communication, this matters to you. Not because you necessarily need to implement translation today, but because your users will expect it tomorrow.

Think about what Ivan said: it’ll be strange if you don’t provide it. Just like it’s strange now when a website doesn’t have a mobile version, or when a video platform doesn’t support multiple resolutions. These capabilities move from “nice to have” to “expected” faster than we anticipate.

For Agora’s developer community, this represents both a challenge and an opportunity. The challenge is that user expectations around real-time communication keep rising. The opportunity is that we’re still early enough that building these capabilities in gives you a legitimate competitive advantage.

Palabra is showing us what’s possible when you control the full stack, when you’re willing to tackle the hard problems instead of chaining together APIs, when you benchmark against humans instead of just metrics. They’re not the only team pushing these boundaries, but they’re pushing them in interesting directions.

The technical challenges they’re working on — sub-300ms latency translation, cross-language voice cloning, emotion preservation across linguistic boundaries — those aren’t solved problems. They’re active areas of research with massive commercial implications.

And somewhere out there, there’s probably a cow wearing VR glasses at an agriculture conference, being seamlessly translated into fifteen different languages.

The future is weird. But at least we’ll all be able to talk to each other while it’s happening.

Check out the full episode:

Want to learn more about building real-time communication features into your applications? Check out Agora’s documentation and join our developer community.
