What it takes to build real-time voice and video infrastructure

In this series, WebRTC expert Tsahi Levent-Levi of BlogGeek.me provides an overview of the essential parts of a real-time voice and video infrastructure—from network to software updates. Check out his informative videos and read how Agora’s platform solves the challenges so you can focus on your innovation.

4.2 Group Calls

Category: Chapter 4: Media Processing

Compare group RTE session architectures and understand the strengths and weaknesses of each.

Dive deeper: Check out Agora’s Developer Resources for tutorials on how to create audio and video chats for one-to-one or group calls.

Transcript

We’re going to look today at media processing, and we’re going to focus on group calls with three or more participants joining a single discussion. We’re going to do that by going over the different architectures that are available for group sessions. We’re going to understand the differences between them, where each of them excels, and where each of them struggles.


What we are going to focus on today is the media, not the signaling. When a user connects to your RTE infrastructure, it connects with signaling in order to say what it wants to do, and then media traffic commences. What we’re interested in now is the media part.

Let’s start with the first, most naive architecture, which is the mesh architecture, also known as peer-to-peer. We’ve seen that in one-to-one calls, but we can scale it up as well. In this case, what you see here on the screen is that each participant in the session sends media to and receives media directly from each and every other participant within the session. So, the person here at the top has four separate connections to all other four users in the session. Each user has four connections, and there are 10 connections in total. If we’re assuming one megabit per user, just a number that I threw out, then the uplink is going to require four megabits per second, because we’re sending to four users, and the downlink is going to be four megabits per second as well. This brings us to a total of 20 megabits per second across the network. Now, as nice as this may be, it doesn’t scale well. Mesh is very cheap to deploy, because we don’t need media servers: we don’t route or serve the media, so we don’t need to pay for the networking or the CPU for that. You need to understand, though, that more users in the session means worse media quality. That’s a rule of thumb, because the uplink is going to be strained with a higher bitrate and the CPU is going to be engaged with more encoding and decoding than with any other alternative that we have. A mesh solution usually will not scale beyond four users, especially not for video calls. For most use cases, even three is going to be a stretch to run in production properly.
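To make that arithmetic concrete, here is a minimal sketch in TypeScript (not tied to any particular SDK) that reproduces the mesh numbers above: connections per user, total connections, and uplink/downlink for a given participant count and per-stream bitrate.

```typescript
// Mesh (peer-to-peer) estimate for n participants at a given per-stream bitrate.
// With n = 5 and 1 Mbps per stream this matches the numbers in the transcript.

function meshEstimate(n: number, bitrateMbps: number) {
  const linksPerUser = n - 1;                        // each user connects to everyone else
  const totalLinks = (n * (n - 1)) / 2;              // unique peer-to-peer connections
  const uplinkMbps = linksPerUser * bitrateMbps;     // one copy sent to each peer
  const downlinkMbps = linksPerUser * bitrateMbps;   // one stream received from each peer
  const totalMbps = totalLinks * bitrateMbps * 2;    // every link carries media both ways
  return { linksPerUser, totalLinks, uplinkMbps, downlinkMbps, totalMbps };
}

console.log(meshEstimate(5, 1));
// { linksPerUser: 4, totalLinks: 10, uplinkMbps: 4, downlinkMbps: 4, totalMbps: 20 }
```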

The second approach, which was very popular up to about 10 years ago, is called an MCU, which is mixing media. MCU stands for multipoint conferencing unit. This is a server in the middle here. All of the participants that we have send their media directly to the server. They also receive a single stream from the server. What the server does is take all of the inputs, all of the video inputs, decode them, combine them into a composite view of everyone, and mix the audio. It then encodes a single stream and sends it back to everyone. So, each person has a single connection, for a total of five connections. The uplink is one megabit, the downlink is one megabit, and the total bitrate going over the network is then five megabits in each direction, from the MCU and towards the MCU.
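The same kind of back-of-the-envelope estimate works for the MCU, this time with the focus on the server, since that is where both the bandwidth and the codec work concentrate. The sketch below assumes a single shared composite layout, which is the simplest case.

```typescript
// MCU (mixing) estimate: each client keeps one connection, but the server must
// decode every incoming stream and encode the composite it sends back.

function mcuEstimate(n: number, bitrateMbps: number) {
  const clientUplinkMbps = bitrateMbps;         // one stream up to the MCU
  const clientDownlinkMbps = bitrateMbps;       // one mixed stream back down
  const serverInboundMbps = n * bitrateMbps;    // the MCU receives from everyone
  const serverOutboundMbps = n * bitrateMbps;   // and sends the composite to everyone
  const serverDecodes = n;                      // decode every participant's stream
  const serverEncodes = 1;                      // encode one shared composite layout
  return {
    clientUplinkMbps, clientDownlinkMbps,
    serverInboundMbps, serverOutboundMbps,
    serverDecodes, serverEncodes,
  };
}

console.log(mcuEstimate(5, 1));
// clients stay at 1 Mbps up / 1 Mbps down; the server carries 5 Mbps in each direction
```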

In the mixing solution, what we end up doing is eat up a lot of server CPU, because the server needs to decode and encode media, and codecs are resource intensive. This is going to be the biggest part of our costs within the system. It is usually associated with classic, legacy deployment environments, and you won’t see these deployed very much today. Where you do see it today, it’s more common in audio sessions or recording sessions, where creating or generating a single stream out of a group call makes a lot of sense. The good thing about mixing is that while more users in the session means more strain on the MCU, which we don’t like, the devices aren’t affected. The device doesn’t care in this scenario if there are two participants, 10 participants, 100 participants or a million participants.

The third and most popular approach is to use an SFU, or selective forwarding unit. Here, what we’re going to do is route the traffic around. Each user sends his media directly to the server, and then the server selectively decides where to send that media: which of the users that are listening in it needs to route the media to.

There are different approaches to doing that; this is the most basic one. So here, if we have five participants, each participant has five connections, or five media streams: one outgoing and four incoming, for a total of 25 across the network. The uplink is one megabit and the downlink, naively, is four megabits. We can, again, optimize these things if we want to.
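And here is the corresponding sketch for the SFU numbers, under the same naive assumption that every stream is forwarded to everyone at the full bitrate.

```typescript
// SFU (routing) estimate: one stream up, the server forwards everyone else's
// stream down without decoding or re-encoding anything.

function sfuEstimate(n: number, bitrateMbps: number) {
  const streamsPerUser = n;                             // 1 outgoing + (n - 1) incoming
  const totalStreams = n * n;                           // counted at the client edge
  const uplinkMbps = bitrateMbps;                       // a single stream to the SFU
  const downlinkMbps = (n - 1) * bitrateMbps;           // naive: receive everyone at full rate
  const serverEgressMbps = n * (n - 1) * bitrateMbps;   // what the SFU pays in bandwidth
  return { streamsPerUser, totalStreams, uplinkMbps, downlinkMbps, serverEgressMbps };
}

console.log(sfuEstimate(5, 1));
// { streamsPerUser: 5, totalStreams: 25, uplinkMbps: 1, downlinkMbps: 4, serverEgressMbps: 20 }
```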

So what happens with the routing approach? The routing approach eats up bandwidth, especially on the server side. This is why deploying SFUs means that you need to focus on networking costs in your back end. It is the most modern and common approach today for group sessions and it is very popular for video services. The hard part with an SFU is that the scaling and optimization part is quite tricky. The bigger the session is, the more optimizations you will need to apply in order to make it work. Doing a three-way call with an SFU? That’s easy. Doing a 10-way group call with an SFU? That’s a bit harder. Going to 50 and above would require you to sweat, to optimize, and to fine-tune the SFUs to work perfectly for your use case. It is possible, but it’s challenging. So, the more users we have in a session, the more strain there is on the devices, because they need to handle and process more media, and the more optimizations we will need to apply in order to reduce that strain on the users.
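The transcript doesn’t name specific optimizations, but one widely used technique in routed sessions is simulcast: each sender uploads a few quality layers, and the SFU forwards to each subscriber the best layer that fits that subscriber’s estimated downlink. Below is a minimal sketch of that selection step; the layer names, bitrates, and types are made up for illustration and don’t come from any particular SFU.

```typescript
// Sketch of the "selective" part of selective forwarding with simulcast.
// Hypothetical types and numbers, for illustration only.

interface SimulcastLayer {
  rid: string;          // layer id, e.g. "low" | "mid" | "high"
  bitrateKbps: number;  // approximate bitrate of this encoding
}

function pickLayer(layers: SimulcastLayer[], availableKbps: number): SimulcastLayer {
  // Walk the layers from cheapest to most expensive and keep the best one that fits.
  const sorted = [...layers].sort((a, b) => a.bitrateKbps - b.bitrateKbps);
  let chosen = sorted[0]; // always forward at least the lowest layer
  for (const layer of sorted) {
    if (layer.bitrateKbps <= availableKbps) chosen = layer;
  }
  return chosen;
}

const layers: SimulcastLayer[] = [
  { rid: "low", bitrateKbps: 150 },
  { rid: "mid", bitrateKbps: 500 },
  { rid: "high", bitrateKbps: 1500 },
];

console.log(pickLayer(layers, 600).rid); // "mid": this subscriber can't afford the high layer
```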

There are also a lot of hybrid approaches out there. You can, for example, run a mesh network for peer-to-peer sessions of up to two participants, and then, if you have three users or more, switch to the SFU routing approach. So, if we’re starting a call and it’s just the two of us, we’re on a direct peer-to-peer connection. A third person joins in, and we’re all routed through the server instead. We can also mix audio and route video; there are services that do that. You can also do mesh for the active participants, and then mix or broadcast for the passive participants. There are a lot of different approaches. Each architecture, hybrid or classic, fits different types of use cases and scenarios, and you need to figure out what works best for you.
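Here is a minimal sketch of the first hybrid decision just described: stay peer-to-peer while it’s only two participants, and re-route through an SFU when a third joins. The topology names and the threshold are illustrative, not taken from any specific product.

```typescript
// Hybrid topology decision: direct mesh for 1:1, routed media for groups.

type Topology = "p2p-mesh" | "sfu-routed";

function chooseTopology(participantCount: number): Topology {
  return participantCount <= 2 ? "p2p-mesh" : "sfu-routed";
}

console.log(chooseTopology(2)); // "p2p-mesh"  : just the two of us, connect directly
console.log(chooseTopology(3)); // "sfu-routed": a third person joined, route via the server
```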

So, if I had to summarize these three approaches, or these three architectures: for me, mesh is simply useless in large groups; I would almost never use it. Mixing is very expensive, so you’d have to explain to me why you’re using mixing. What’s the compelling event that requires mixing to be running for you? With routing, you are going to need optimizations, so there is a lot of care and attention that you need to put into the optimization part. Thank you.