What it takes to build real-time voice and video infrastructure

In this series, WebRTC expert Tsahi Levent-Levi provides an overview of the essential parts of a real-time voice and video infrastructure, from the network to software updates. Check out his informative videos and read how Agora’s platform solves the challenges so you can focus on your innovation.

5.2 Mixed Device Capabilities

Category: Chapter 5: Media Servers

Learn why the wide variety of user devices creates performance challenges and explore potential solutions.


One of the headaches with media servers is handling mixed device capabilities. Here’s what we’re going to do in this lesson: We’re going to understand the challenges in these mixed capabilities and what that means. We’re also going to review potential solutions for this problem. I want to start with something classic and that’s Moore’s law. 


One of these graphs, available on Wikipedia, shows the number of transistors in microchips and how it doubles every two years, along with the different chips released throughout the years. What we can see is that every year and a half to two years (pick the number you want), the number of transistors increased by a factor of two, and the performance capabilities of our devices increased dramatically. A device that is three or four years old usually cannot perform as well as a new device today. This means that it is quite common today to have devices of different generations within the exact same real-time engagement session.

Say I’m going to have a group call with five people: some of them are going to join from high-end MacBooks, others from low-end Android smartphones. This, again, is out of my control, and I need to be able to accommodate it. The things I need to think about here are, first, the different performance capabilities of different devices. We’re also going to have different display resolutions on each of these devices. Using an 80-inch display mounted on the wall in a room is different from using a smartphone with a four- or five-inch display in front of you, even if the resolution is 1080p or 4K on both of them; the size of the display also matters. Then there is different network behavior. On some networks I might have good connectivity, fiber to the home over Ethernet, versus using, to take the extreme case, a 3G network inside an elevator.

Why is this going to be a real problem for us? If we have large video group sessions, think about four people or more, that’s when things start to become a headache. The same goes for live streaming and broadcasting to a large audience: each person in the audience is going to have a different type of device and network. There are three approaches to a solution here: transcoding, simulcast, and SVC (Scalable Video Coding). Let’s review them one by one.

If we look at transcoding, the concept is that for each user, client, or device we have, we’re going to create a separate transcoder in our cloud resources. What does that mean? Let’s say we’ve got a group call with three participants. Our mixer in the cloud, an MCU (multipoint conferencing unit), which is the media server we use for transcoding, receives all of the inputs, all of the media from the three participants. The first thing it does is decode each stream separately, which takes a lot of CPU; this is why it’s colored in red here. After decoding, we need to mix and combine the streams. If these are three video streams, we need to rescale them: I’m going to change the resolutions of the incoming video to fit what I want to show in the layout that I’m going to stream out. After rescaling, I’m going to compose them together into a single frame, a single video stream. Once I’ve got that single video stream, I need to re-encode it. So I’m going to do something that is CPU-consuming and resource-intensive. I can either have a single encoder for everyone, an encoder per viewer or participant, or an encoder for a different set of participants: an encoder for all mobile devices, separate encoders for low-end versus high-end devices, an encoder for PCs, or an encoder for bad networks, whatever grouping of users makes sense. Once I have these encoded streams, I can send them out. The challenge with this approach is that it is expensive.

That’s it. 
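To make the rescale-and-compose step concrete, here is a minimal sketch of the compositing math an MCU might use. This is a hypothetical illustration, not any platform’s actual implementation: given the number of participants and the output frame size, it computes the tile each decoded stream gets rescaled into before being composed into the single output frame.

```javascript
// Hypothetical MCU layout helper: pick a square-ish grid that fits all
// participants, then assign each decoded stream a tile in the output frame.
function gridLayout(participantCount, outWidth, outHeight) {
  // Smallest grid that fits everyone: 2x2 for 3-4 participants, 3x3 for 5-9, etc.
  const cols = Math.ceil(Math.sqrt(participantCount));
  const rows = Math.ceil(participantCount / cols);
  const tileW = Math.floor(outWidth / cols);
  const tileH = Math.floor(outHeight / rows);
  const tiles = [];
  for (let i = 0; i < participantCount; i++) {
    tiles.push({
      x: (i % cols) * tileW,
      y: Math.floor(i / cols) * tileH,
      width: tileW,   // each decoded stream is rescaled to this size
      height: tileH,  // before being composed into the single frame
    });
  }
  return tiles;
}
```

For a three-person call composed into a 1280x720 output, this yields a 2x2 grid of 640x360 tiles with one empty slot; the real cost in the MCU is not this math, but the per-stream decode and the re-encode around it.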

The second approach we have is simulcast. Here, the client itself, the one that creates the media streams it wants to send out, generates multiple encoded bitstreams, usually two or three. These streams are sent to the media server, in this case an SFU, or selective forwarding unit. This media server routes the media and forwards it to the viewers as it sees fit. So in this case, I’ve got separate bitstreams at different bitrates: low, medium, and high, from 360p to 1080p. These are different resolutions; the higher the number, the bigger the resolution and the bitrate. If I want to send the data to a low-end smartphone on a bad network, I would go for the lowest bitrate available. If I want to connect to a room system or a high-end machine on a good network, I would send the high-end, high-bitrate stream. This way the media server can selectively decide what to send to whom. The processing cost here shifts to the broadcaster, the user on the left, who needs more CPU because he needs to generate more than one stream. There is also an overhead on the total bitrate we need available on the uplink from the sender toward the cloud where the media servers reside. Simulcast is the most common approach today for real-time communication and real-time engagement platforms.
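The SFU’s “decide what to send to whom” step can be sketched as a simple selection over the available simulcast rungs. The layer names, resolutions, and bitrate thresholds below are illustrative assumptions, not values from Agora or any specific platform:

```javascript
// Illustrative simulcast rungs; real platforms tune these numbers.
const SIMULCAST_LAYERS = [
  { rid: 'low',    resolution: '360p',  maxBitrate: 400_000 },
  { rid: 'medium', resolution: '720p',  maxBitrate: 1_200_000 },
  { rid: 'high',   resolution: '1080p', maxBitrate: 2_500_000 },
];

// Hypothetical SFU logic: forward the highest layer the viewer's estimated
// downlink can sustain, falling back to the lowest layer on bad networks.
function selectLayer(viewerBandwidthBps) {
  let chosen = SIMULCAST_LAYERS[0];
  for (const layer of SIMULCAST_LAYERS) {
    if (layer.maxBitrate <= viewerBandwidthBps) chosen = layer;
  }
  return chosen;
}
```

On the sending side, browser WebRTC lets the client declare such rungs through the `sendEncodings` option (with per-encoding `rid` and `scaleResolutionDownBy`) when adding the video track, which is how the two or three bitstreams come into existence in the first place.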

Another approach, which is coming up and will probably be popular within four to five years, is SVC, or Scalable Video Coding. Here the concept is slightly different but quite similar to simulcast. The broadcaster sends a single bitstream, but that single bitstream includes multiple layers, however many layers we want. The media server can then decide which layers to send to whom. A low-end device will get only the lowest layer, and a high-end device will get more layers; the decoders within these devices are capable of decoding the layers independently. The more layers you have, the higher the quality is going to be. This approach, as I said, is similar to simulcast. It’s a bit more advanced, and it requires the codecs, and the implementations of the codecs, to support SVC. This is not that common today, but it is gaining popularity.
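The SVC routing decision looks much like the simulcast one, except the server trims layers out of one bitstream instead of choosing between separate streams. Again a hypothetical sketch with made-up bitrate numbers; real SVC also has temporal layers, which this simplification ignores:

```javascript
// Illustrative spatial layers of a single SVC bitstream. Each entry's
// cumulativeBitrate is the cost of forwarding it plus everything below it.
const SVC_LAYERS = [
  { spatialId: 0, resolution: '360p',  cumulativeBitrate: 500_000 },   // base layer
  { spatialId: 1, resolution: '720p',  cumulativeBitrate: 1_400_000 }, // base + 1 enhancement
  { spatialId: 2, resolution: '1080p', cumulativeBitrate: 2_800_000 }, // all layers
];

// Hypothetical server logic: forward every layer up to the highest the
// viewer can afford; a decoder can always decode a prefix of the layers.
function layersToForward(viewerBandwidthBps) {
  const kept = SVC_LAYERS.filter(l => l.cumulativeBitrate <= viewerBandwidthBps);
  // Always keep at least the base layer so the viewer gets something.
  return kept.length > 0 ? kept : [SVC_LAYERS[0]];
}
```

The key difference from simulcast is on the sender: one encoder produces one layered bitstream, so the uplink overhead of encoding the same content two or three times largely goes away, at the price of needing SVC-capable codec implementations end to end.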

To summarize, what we need to do is cater to different devices. The problem is that different devices have different capabilities. What we need is the flexibility to cater to all of them within the exact same session, either a group call or a live broadcast. For that, we need to pick the technology that is most suitable for us, be it transcoding, simulcast, or SVC. And again, that will depend on the business model we have and the type of solution we want to offer. In most cases, simulcast is going to be your best friend, at least today. Thank you.