Jim is CEO of Synervoz Communications, a Toronto-based software development company focused on building apps and SDKs to enhance voice and video calls with music, movies, TV, games, live streams, and other interactive components. Synervoz helps developers solve challenging audio problems, including noise, echo, mixing audio streams, cross-platform issues, voice-controlled user interfaces, Bluetooth, and more. Customers include well-established brands like Bose and Unity, as well as startups building the next generation of virtual hangouts. Join Jim on September 1-2 at the RTE2021 Virtual Conference to hear his panel session on Addressing and Solving the Audio Challenges of Remote Lifestyles.
Noise. Is. Aggravating. Especially on voice and video calls. It’s certainly an unwelcome party to most conversations, but it has always been around. That is, until recently. Machine learning and artificial intelligence have led to highly effective noise-suppression and cancellation techniques, many of which are now being used at various levels of technology stacks: in hardware, middleware, apps, and more recently, SDKs that will soon be available to a broad range of developers.
The industry developing these new techniques and technologies has mostly focused on obvious use cases such as call centers, telephony, and simple voice and video calls that power your daily business meetings, calls with friends and family, and so on. But calls are no longer just calls—increasingly, they are becoming online rooms or meeting spaces in which you can interact in new ways or participate in other activities together. Until recently, voice and video calls only needed to consider a single layer of audio: the Voice over IP layer. But as calls transform into rooms and spaces, opportunities abound for additional audio layers, and so too do the associated challenges.
As someone who has been innovating in audio technologies for the last several years, I can attest that noise suppression now works well enough to unlock some groundbreaking use cases that were previously infeasible—particularly in use cases that require multiple layers of simultaneous audio combination.
The challenges of combining voice chat and media
Let’s say you want to add a media player like Spotify or YouTube to a voice or video call. Now imagine that you’re hanging out in this voice or video call while listening to music as a group or watching something together, and one of you speaks:
- Can you hear the other person over the volume of the music or the show?
- Do you have to constantly mute and unmute yourself to avoid sending noise over the wire?
- Are you hearing the audio twice since it’s coming out of the other person’s speaker and leaking into the mic?
These situations are likely familiar to people who have tried watching something together over FaceTime or a Zoom call. Even in the many watch party apps that have sprung into existence, text-based chat still dominates. The above problems don’t exist when you’re hanging out together in person, because your brain’s capacity to process spatial audio helps to separate audio sources. To some extent, new spatial audio technology will help in the digital realm as well. Nonetheless, the ability to cancel noise and better separate audio sources is key to unlocking more digital hangout use cases.
In the media player + voice call example described above, you probably want a way to reduce the volume of the music or video when someone speaks. Technically, it would require a voice activity detector (VAD). However, if you feed the VAD with a noisy signal, you’ll end up ducking (lowering the volume) the media player in response to noise. Or you’re likely to overcompensate and not detect voice when someone speaks—causing the ducker to fail, or worse, causing the voice signal to be ignored and not sent over the wire. An inaccurate VAD thus degrades the user experience. So much so that it’s still rare to find use cases where people leave a microphone open when watching or listening to something together.
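To make the VAD-plus-ducking idea concrete, here is a minimal sketch in plain Python. It assumes the voice signal has already been denoised upstream; the VAD here is just a per-frame energy threshold, and the gain is smoothed with separate attack and release rates so the media ducks quickly when someone speaks but recovers slowly enough to avoid audible "pumping." All the names and parameter values are illustrative, not from any particular SDK.

```python
import math

# Illustrative parameters -- real systems tune these per device and use case.
VAD_THRESHOLD_DB = -40.0   # energy above this (on the *denoised* voice) counts as speech
DUCK_GAIN = 0.2            # media volume while someone is speaking
ATTACK = 0.5               # how fast we duck (fraction of the gap closed per frame)
RELEASE = 0.05             # how slowly we recover, to avoid pumping

def frame_db(samples):
    """Root-mean-square level of one audio frame, in dBFS."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) or 1e-12
    return 20.0 * math.log10(rms)

def duck_media(denoised_voice_frames, media_frames):
    """Yield media frames with a smoothed gain: duck while voice is active."""
    gain = 1.0
    for voice, media in zip(denoised_voice_frames, media_frames):
        voice_active = frame_db(voice) > VAD_THRESHOLD_DB
        target = DUCK_GAIN if voice_active else 1.0
        rate = ATTACK if target < gain else RELEASE
        gain += (target - gain) * rate  # one-pole smoothing toward the target
        yield [s * gain for s in media]
```

If the input to `frame_db` were the raw, noisy microphone signal instead of the denoised one, background noise would cross the threshold and duck the media exactly as the article describes, which is why accurate noise suppression is the enabling piece here.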
Gaming is a counterexample where people do leave the microphone open, but the audio challenges in gaming are often easier than the aforementioned use case. This is because spatial audio is often a feature of the game and the masking of the Voice over IP channel by in-game sounds may be less frequent than is the case with music or video streams. And yet audio issues are still a frequent complaint among gamers. Discord, one of the industry leaders in this space, has demonstrated the importance of noise reduction via its partnership with Krisp. As Discord is increasingly used for nongaming use cases, noise reduction will be an important enabling technology. Consider a group of cyclists keeping a voice call open during their ride (some cyclists use Discord for this). Wind, traffic, and ambient noise present UX issues, especially in combination with listening to music simultaneously. It’s even harder to solve these issues on motorcycles, with added wind and engine noise.
It’s difficult to find a single application that addresses all aspects of all use cases simultaneously. While Discord may be great for gamers and many other use cases, it’s unlikely to be the optimal solution for cyclists and motorcyclists, whose interfaces will need to be optimized for things like hands-free operation and offline functionality. For similar reasons, there’s little question that many applications will continue to be built and targeted at specific use cases. So, what do we think developers will build? How will they get access to this state-of-the-art noise-cancellation technology?
One platform to keep an eye on is Agora, a platform many development teams have turned to when building interactive voice, video, and live-streaming apps. A lot of unique use cases are already being built on top of Agora by developers, and even more will be unlocked with reliable noise suppression.
Online audio use cases
Noise suppression is used in hardware and software for many reasons, but a primary focus at the moment is improving internet-based voice and video calls. Let’s call these “online use cases.” As meeting online has become a mainstream behavior, there has been a demand for more interactivity on calls. Now that a vastly improved audio experience is possible, you’re likely to see more:
- Apps that add a media player to voice and video rooms (e.g., for hangouts with music or watching videos together, as discussed earlier). You will likely need some combination of an accurate VAD + denoised signal + real-time mixing and ducking capabilities to make these use cases work well.
- Gaming apps with new ways to play together, like Bunch, Kosmi, or Piepacker.
- Apps that use spatial audio to position multiple layers of audio together, where each layer may represent a different person’s voice, an instrument, one of many ambient sounds, or otherwise.
- Virtual spaces like Second Life, Altspace, or Mozilla Hubs—but with audio-centric experiences such as virtual nightclubs and virtual concerts.
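Several of the use cases above come down to combining multiple audio layers, each positioned somewhere in space. A common building block for this is constant-power panning, which keeps perceived loudness roughly steady as a source moves between left and right. The sketch below is a toy stereo mixer along those lines; the function names and the `azimuth` convention (−1 for hard left, +1 for hard right) are my own, not from any particular spatial audio API.

```python
import math

def pan_gains(azimuth):
    """Constant-power stereo gains for a source at `azimuth` in [-1, 1]."""
    theta = (azimuth + 1.0) * math.pi / 4.0  # map [-1, 1] -> [0, pi/2]
    return math.cos(theta), math.sin(theta)  # (left, right)

def mix_layers(layers):
    """Mix several (samples, azimuth) layers into one stereo buffer.

    Each layer might be a voice, an instrument, or an ambient sound,
    as in the use cases described above.
    """
    n = max(len(samples) for samples, _ in layers)
    left, right = [0.0] * n, [0.0] * n
    for samples, azimuth in layers:
        gl, gr = pan_gains(azimuth)
        for i, s in enumerate(samples):
            left[i] += s * gl
            right[i] += s * gr
    return left, right
```

Real spatial audio involves head-related transfer functions, distance cues, and room modeling, but even this simple panning illustrates how separating sources in space helps listeners tell them apart, much as the brain does in person.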
The integration of content and activities into calls is a movement that has started gathering momentum in recent months. At its recent Worldwide Developers Conference, Apple announced updates to FaceTime and SharePlay. A raft of announcements from the likes of Spotify, Twitter, Discord, Facebook, Reddit, and a host of others signifies that the shift toward audio-centric online hangouts is already happening. It will all lead to new ways to hang out together online, and much of it will be possible only with modern noise-suppression techniques.
Offline audio use cases
Offline use cases are also being made feasible. That is, use cases where an internet connection is not required, such as those where communication happens directly between devices (peer to peer) or within a single device (e.g., embedded in headphones, to alter the audio in your environment). Consider the following use cases:
- The aforementioned online use cases combined with an off-grid option for when there is no cellular connection, as when you’re cycling, motorcycling, hiking, skiing, or running off-grid. In many of these use cases, users would like to listen to music and still be able to talk simultaneously, even without a data connection.
- Smartwatches or headphones that might connect offline, via Bluetooth, or via mesh networks, enabling communication underground or in large crowds with limited connectivity.
- Hearing aids or even regular headphones and apps that can help isolate a particular speaker, helping to solve the cocktail party problem. That would make the person you’re focused on sound clearer and louder while drowning out the people talking nearby.
- Recording apps that can be used anywhere to create memes, news, podcasts, or music at any passing moment (e.g., on the subway or a noisy street), without the need for planning or high-end recording equipment.
The offline world gets even more interesting when you consider layers of audio that could be incorporated from the ambient environment. “Transparency” is a feature you may be familiar with on headphones, allowing you to hear what’s going on around you while listening to your music or podcast. But the technologies that help to separate voice from background noise can also be used to distinguish different sounds in one’s environment (audio source separation). Some sounds can be enhanced while others are suppressed.
Any of the following could be made louder, quieter, or positioned in space, depending on the use case: sirens, birds chirping, people talking nearby, mechanical equipment sounds, various construction sounds, and so on. And the ability to identify specific sounds and isolate them will likely spawn many new utilities (such as apps that listen for mechanical issues or other specific events), entertainment apps (like ways to listen to music together while walking or commuting), and creative works (like the bygone RjDj app in which environmental sounds were incorporated into the music or audio output).
Hybrid online / offline audio use cases
Another group of use cases combines elements of online and offline, such as:
- Use cases that involve running, cycling, motorcycling, and other activities but with a seamless transition from online to offline when you move between areas with different connectivity. Or perhaps connecting these use cases into applications like Zwift, Rouvy, Peloton, Forte, and so forth in order to encompass more of a sport in a single app (e.g., using the same app to ride with your cycling community, indoors and outdoors, with or without connectivity).
- Silent disco-style experiences where users can talk between headsets in a nightclub or event venue (offline) as well as allowing users to communicate across affiliated nightclubs or events (online).
- Tourism apps like Gesso that provide location-based audio (GPS + recorded audio) to describe what you’re currently looking at. Now add a layer of voice chat between tourists who are walking or on bikes (this layer could be online or offline), a layer for a guide (who could be physically present or remote), a layer of music to keep the vibe going between stops, and a layer of transparency for safety.
- Connecting physical rooms together with big screens. For example, imagine having two house parties or office parties occurring in two separate rooms, but you could simply walk up to the TV to talk to your counterparts in the other house or office. Smart speakers and other devices offer a similar opportunity. But these use cases are not feasible if there’s constant noise leaking through the devices. We need a precise way to determine what’s signal and what’s noise, and that technology is now being commercialized.
The future of audio
Audio is experiencing a renaissance; however, the industry still has lots of room to grow. We anticipate many new use cases will be built with solutions like Agora combined with innovative audio technology. We have positioned our own company, Synervoz, accordingly. We are a software development team focused on helping other companies build use cases like those discussed above. We have in-house SDKs, expertise, and partnerships with trusted brands like Bose and Agora and more to help with rapid prototyping and to minimize time to market with production-ready applications.
Ready to integrate audio technology into your application? Let us know your use case or head over to https://www.synervoz.com/ to learn more. To learn more about real-time engagement use cases, join Jim and other thought leaders on September 1-2 at the RTE2021 Virtual Conference.