
Real-Time Audio Chat and Streaming: Developer Basics

By Author: Yaniv Elmadawi In Developer

Yaniv Elmadawi is the Agora VP of Solutions and Technology Services, focused on helping customers bring their ideas to life. His team of Solution Architects are audio and video technology experts who are constantly pushing the boundaries of chat and streaming experiences.


Real-time, interactive audio is an increasingly popular feature for a wide range of applications. Whether it’s live social audio or the addition of audio chat to existing gaming apps or enterprise tools, live audio is on the rise. This means that a lot more developers are coming to us with questions about how to add high-quality real-time audio to their apps.

Because real-time audio is still new, many developers haven’t worked with it and may not be aware of what’s important when integrating audio into an app. The good news is that you don’t need the knowledge of an audio engineer to add live audio to your app. In this article, you’ll learn all you need to know to achieve high-quality interactive audio at scale.

Delivering High-Quality Real-Time, Interactive Audio

If you’re taking the time to add audio to an app, you want users to be fully engaged and immersed in the experience. Unfortunately, when you have multiple people communicating remotely with audio, issues with sound quality, background noise, echo, and more can easily distract or even ruin the experience.

It goes without saying that when delivering interactive audio experiences, we want everyone to enjoy the highest-quality experience possible. So how can you deliver a high-quality, distraction-free experience? From a technical perspective, there are two places where we can make an impact on the live audio experience. One is the front end, which is largely about preserving clarity and providing enhancement. The other is the network that carries the audio, which is covered later in this article.

Sound Clarity

Sound clarity is a term that relates to the ability to deliver the original audio input clearly and without distortion—as close to “being there” as possible. Clarity is a function of:

  • Sample rate
  • Gain control
  • Noise suppression
  • Echo cancellation

Audio is analog by nature and must be converted into a digital stream. Just like scanning an image, you can choose how much detail is preserved in that process. A higher sample rate means you get more of the detail that contributes to clarity, but also requires greater bandwidth. Gain is essentially the input volume and it is important that all participants be heard at more or less the same level. Noise suppression helps to block out unwanted background sounds—everything from air conditioning hum to the sound of traffic outside. If you’ve ever been on a call where someone sounds like they are yelling into a canyon with bad feedback, you’ve experienced echo. Echo cancellation prevents this by stopping the audio output from your speakers from getting picked up and put back in the chat as input.
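To make the sample rate and gain tradeoffs concrete, here is a minimal Python sketch. It assumes uncompressed 16-bit PCM and samples normalized to the range -1.0 to 1.0; the function names and the target level are illustrative choices, not part of any SDK. Real codecs compress audio well below these raw bitrates, and production gain control adapts over time rather than scaling a whole buffer at once.

```python
import math
from typing import List


def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int = 16, channels: int = 1) -> float:
    """Raw (uncompressed) PCM bitrate in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


def rms(samples: List[float]) -> float:
    """Root-mean-square level of samples in the range [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_gain(samples: List[float], target_rms: float = 0.1, max_gain: float = 10.0) -> List[float]:
    """Scale a buffer toward a target RMS level (a crude, one-shot gain control).

    max_gain caps the boost so near-silence is not amplified into loud noise,
    and the result is clipped to [-1.0, 1.0].
    """
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    gain = min(target_rms / level, max_gain)
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


# Higher sample rates keep more detail but cost proportionally more bandwidth.
for rate in (8_000, 16_000, 48_000):
    print(f"{rate} Hz mono 16-bit: {pcm_bitrate_kbps(rate):.0f} kbps")
```

Moving from 8 kHz to 48 kHz multiplies the raw bitrate by six, which is why the sample rate is usually chosen per use case rather than always set to the maximum.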

Audio Enhancements

Audio enhancements are things that we can do to improve the sound post-input. These might include:

  • Voice effects
  • Spatial effects
  • Scenes and profiles

Voice effects involve altering the quality of a voice in some way—in pitch or timbre for example. Or there might be a reason to change someone’s voice altogether—to disguise it. Spatial effects refer to the ability to simulate the 3D qualities of a physical space—making it sound as if different participants are standing in different places. It is this attention to detail that helps make audio interaction a more immersive experience. Audio scenes and profiles can be used to apply specific sound settings or filters for specific use cases. For example, a profile could be used to select the audio quality like “high-fidelity stereo audio” while a scene could be selected for a gaming chat to optimize for reducing gaming noise.
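As a rough illustration of the idea behind spatial effects, the sketch below places a mono voice in a stereo field using constant-power panning. This is a deliberately simple stand-in: real spatial audio typically also models distance, room acoustics, and head-related filtering, and the function names here are hypothetical.

```python
import math
from typing import List, Tuple


def pan_gains(azimuth_deg: float) -> Tuple[float, float]:
    """Left/right gains for a source at an azimuth in [-90, 90] degrees.

    -90 is hard left, 0 is center, 90 is hard right. The gains satisfy
    left**2 + right**2 == 1, so perceived loudness stays roughly constant.
    """
    azimuth = max(-90.0, min(90.0, azimuth_deg))
    theta = (azimuth + 90.0) / 180.0 * (math.pi / 2)  # map to [0, pi/2]
    return math.cos(theta), math.sin(theta)


def spatialize(mono: List[float], azimuth_deg: float) -> List[Tuple[float, float]]:
    """Turn mono samples into (left, right) pairs for the given position."""
    left_gain, right_gain = pan_gains(azimuth_deg)
    return [(s * left_gain, s * right_gain) for s in mono]


# Two participants "standing" on opposite sides of the listener.
alice = spatialize([0.5, 0.25], azimuth_deg=-45.0)
bob = spatialize([0.5, 0.25], azimuth_deg=45.0)
print(alice[0], bob[0])
```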

In a perfect world, everyone, everywhere, has a blazing fast internet connection and bandwidth is free. That would allow us to always use the highest audio quality and do anything and everything to preserve and enhance the signal. But it is not a perfect world. Sound quality comes at a cost and there are tradeoffs to be made.

Challenges of the Real-Time User Experience

Now let’s talk about the user experience requirements for real-time, live communication. The biggest challenge to any real-time, interactive experience is the network. Network-related issues such as latency and loss of synchronicity are capable of completely ruining live audio communication. These considerations become increasingly important when you have a global audience or any users in areas with less-robust internet infrastructure.

Everything sent across a digital network, including audio, is broken into small pieces (packets) that find their way to the other end, often by separate routes, where they must be reassembled. Packets are routinely lost or delivered out of order. While there are a number of strategies for dealing with lost and out-of-order packets, they can introduce delays that are detrimental to the real-time experience.
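One common strategy for out-of-order packets is a jitter buffer, which holds a few packets so they can be released in sequence. The Python sketch below is a simplified illustration with a hypothetical class name, and it assumes each packet carries a sequence number starting at 0. It also shows where the delay comes from: a deeper buffer tolerates more reordering but makes the listener wait longer.

```python
import heapq
from typing import List, Optional, Tuple


class JitterBuffer:
    """Reorders packets by sequence number, giving up on a gap after `depth` packets."""

    def __init__(self, depth: int = 3):
        self.depth = depth  # how many packets may wait behind a missing one
        self._heap: List[Tuple[int, bytes]] = []
        self._next_seq = 0  # next sequence number owed to the player

    def push(self, seq: int, payload: bytes) -> None:
        if seq >= self._next_seq:  # packets that arrive too late are dropped
            heapq.heappush(self._heap, (seq, payload))

    def pop_ready(self) -> List[Tuple[int, Optional[bytes]]]:
        """Return packets that can be played now; a None payload marks a loss."""
        out: List[Tuple[int, Optional[bytes]]] = []
        while self._heap:
            seq, payload = self._heap[0]
            if seq < self._next_seq:  # duplicate of something already played
                heapq.heappop(self._heap)
            elif seq == self._next_seq:
                heapq.heappop(self._heap)
                out.append((seq, payload))
                self._next_seq += 1
            elif len(self._heap) > self.depth:
                # Waited long enough: declare the missing packet lost and move on.
                out.append((self._next_seq, None))
                self._next_seq += 1
            else:
                break  # keep waiting for the missing packet
        return out


buf = JitterBuffer(depth=2)
for seq in (0, 2, 1, 4, 5, 6):  # packet 3 never arrives
    buf.push(seq, f"frame-{seq}".encode())
    print(seq, "->", buf.pop_ready())
```

In a real player the `None` entries would be filled by loss concealment, for example by repeating or interpolating the previous frame.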

When people are communicating with each other in real time, there are two absolute requirements:

  • Low Latency (very little perceived delay)
  • Synchronicity—when there are more than two parties, everyone needs to hear the same things—at the same time

If you’ve ever been on a conference call where these requirements were not being met, you know first-hand what a frustrating mess this can be, oftentimes grinding communication to a complete halt. Achieving low latency and synchronicity across multiple parties, in multiple locations, over the public internet can be next to impossible. This is especially true when communicating internationally. The only reasonable solution is a managed real-time network.

Enabling Real-Time Voice Engagement

When two or more people are communicating in real-time over the internet, there is a lot going on behind the scenes. An analog signal is captured, converted into a digital stream, encrypted, packetized, transmitted, routed, received, depacketized, decrypted, converted back to analog, and then output.
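The packetize and depacketize steps in that chain can be sketched in a few lines of Python. The 6-byte header used here (a sequence number plus a payload length) is an invented layout for illustration only, not a real protocol such as RTP, and the encoding and encryption stages are left out entirely.

```python
import struct
from typing import List

# Hypothetical header: sequence number (uint32) + payload length (uint16), network byte order.
HEADER = struct.Struct("!IH")


def packetize(pcm: bytes, sample_rate_hz: int = 16_000, frame_ms: int = 20,
              bytes_per_sample: int = 2) -> List[bytes]:
    """Split raw mono PCM into fixed-duration frames, each prefixed with a header."""
    frame_bytes = sample_rate_hz * frame_ms // 1000 * bytes_per_sample
    packets = []
    for seq, start in enumerate(range(0, len(pcm), frame_bytes)):
        payload = pcm[start:start + frame_bytes]
        packets.append(HEADER.pack(seq, len(payload)) + payload)
    return packets


def depacketize(packets: List[bytes]) -> bytes:
    """Rebuild the PCM stream from packets that may have arrived in any order."""
    frames = []
    for packet in packets:
        seq, length = HEADER.unpack_from(packet)
        frames.append((seq, packet[HEADER.size:HEADER.size + length]))
    frames.sort(key=lambda frame: frame[0])
    return b"".join(payload for _, payload in frames)


pcm = bytes(range(256)) * 10           # 2,560 bytes of stand-in audio data
packets = packetize(pcm)               # 20 ms at 16 kHz, 16-bit = 640 bytes per frame
shuffled = list(reversed(packets))     # simulate out-of-order delivery
print(len(packets), depacketize(shuffled) == pcm)
```

Short frames keep latency low because the sender never waits long to fill a packet, at the cost of more header overhead per second of audio.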

The core technologies that are required for high quality real-time audio experiences are:

  • Codec (coder and decoder) to provide sound clarity and enhancements
  • Low-latency network to prevent delays and enable communication in real time

Ultimately, the best way to provide a high-quality real-time audio experience is to work with an experienced partner, like Agora, that can provide both of these essentials while making it easy to integrate into your app.

How Agora Provides High-Quality Real-Time Audio

Agora provides both of these audio essentials:

  • Proprietary and standard codecs optimized for real-time audio engagement
  • A managed, ultra-low-latency global network

Agora’s Audio Codec

Agora’s proprietary audio codec with 3A (Acoustic Echo Cancellation, Automatic Gain Control, and Adaptive Noise Suppression) delivers superior quality audio while eliminating unwanted disruptions. It includes a robust set of audio enhancement options and pre-engineered option sets that allow you to address a variety of unique situations without having to understand the nuances of audio engineering. Agora also provides predefined profiles and scenes to help target the audio processing based on the use case. For example, a profile for chat will focus on enhancing the participants’ voice signals while suppressing background noise.

Agora’s Software-Defined Real-Time Network (SD-RTN)

Agora’s intelligent network offers an innovative alternative to the public internet by optimizing routing for the best user experience. We have more than 200 of our own data centers positioned around the world and we have co-located our own servers within all of the major ISPs. Whenever possible, we establish peer-to-peer connections from the participants to Agora servers and then we connect these using a software-defined, real-time network (SD-RTN). This ensures the best possible routing at any given time. Through our proprietary network, we are able to consistently achieve extremely low average latency of 400 ms globally. And as a bonus, the Agora network is built to scale and has been tested with up to 1 million participants.

The Quick and Easy Way to Add Live Audio to Your App

Delivering immersive, real-time audio engagement is all about finding the appropriate balance between sound quality and deliverability. But ultimately what matters is that you can add high-quality audio quickly and easily without the need for an audio engineer. Agora makes adding audio to your app simple by providing:

  • Quick time to market
  • Great docs and support
  • Exceptional Quality of Experience (QoE)
  • Global scalability

Want to learn more about Agora’s audio solutions? Check out Agora’s live audio streaming page.