What it takes to build real-time voice and video infrastructure

In this series, WebRTC expert Tsahi Levent-Levi provides an overview of the essential parts of a real-time voice and video infrastructure, from network to software updates. Check out his informative videos and read how Agora’s platform solves the challenges so you can focus on your innovation.

4.1 One on One

Category: Chapter 4: Media Processing

Explore the mechanisms used to facilitate one-to-one RTE sessions and review potential pitfalls.

Dive deeper: Check out Agora’s Developer Resources for tutorials on how to create audio and video chats for one-to-one or group calls.


Let’s talk a bit about media processing, and I want to start off with the simplest scenario: one-on-one sessions. What we’re going to do is first list the different mechanisms and architectures that are used to conduct one-to-one sessions, and then understand the pitfalls.


One-to-one sessions are the sessions where two people talk to each other remotely. The most common approach, and the most popular one, is probably the peer-to-peer approach. Here, we’ve got the signaling server that connects both peers, both users, attending to the signaling path between them with messages one, two, three, and four here. These messages, on the green line, are the ones that are conveyed in order to start the session. Once we’ve exchanged these messages, the actual media is going to flow directly between the two users. This is denoted by line five. This is what we call peer-to-peer. When the media is flowing directly between the devices of both users, not going through media servers in the middle, it means that we’re running the media peer-to-peer. What do we know about this scenario exactly?

So if we are running such a product, and our calls are done peer-to-peer, then this is what we know. We know the signaling part, because in order for user A to reach user B, they went through the service. To get connected, they have to go through the signaling path that we have. We know a bit about the connectivity, but not much. We know if they’re connected, and maybe if they disconnected. If we add it into the signaling, we’ll know when someone is muted, for example, but that’s about it.
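To make that concrete, here is a minimal sketch of what a signaling service can observe in a peer-to-peer call. The `SignalingServer` class, its method names, and the message kinds are hypothetical and exist only for illustration; a real service would sit behind WebSockets or a similar transport.

```python
class SignalingServer:
    """Hypothetical in-memory signaling relay for a one-to-one call."""

    def __init__(self):
        self.inboxes = {}  # user id -> list of messages waiting for that user
        self.events = []   # everything the service is able to observe

    def connect(self, user):
        self.inboxes[user] = []
        self.events.append((user, "connected"))

    def disconnect(self, user):
        self.inboxes.pop(user, None)
        self.events.append((user, "disconnected"))

    def send(self, sender, recipient, kind, payload=None):
        # Forward a signaling message and note that it happened.
        if recipient not in self.inboxes:
            raise KeyError(f"{recipient} is not connected")
        self.inboxes[recipient].append(
            {"from": sender, "kind": kind, "payload": payload}
        )
        self.events.append((sender, kind))


server = SignalingServer()
server.connect("A")
server.connect("B")
server.send("A", "B", "offer", "<session description>")
server.send("B", "A", "answer", "<session description>")
server.send("A", "B", "mute", {"audio": True})  # visible only because we signal it
server.disconnect("B")
print(server.events)
```

The event log holds connects, disconnects, and whatever we chose to signal, such as mute. Nothing in it describes the media itself, because the media never passes through this service.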

Here’s what we don’t know about these sessions. We don’t know the tough parts of connectivity. How exactly are they connected? Are they really connected peer-to-peer or through a TURN relay? And we don’t know the media quality and performance. So we are giving a solution, an RTE solution (real-time experience solution), but we don’t know what the user experience is. We don’t know if there is any packet loss, or if there are issues with the bitrate. We don’t know if the users are happy with the audio or video that they are receiving or sending. That’s because of the media path in P2P.
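One way to close that gap is to have each client upload its own statistics. The sketch below assumes the client serializes a WebRTC stats report into a list of dictionaries and sends it to the service. The field names follow the W3C WebRTC statistics identifiers (`transport`, `candidate-pair`, `candidateType`, `inbound-rtp`), while the function names and the sample report are made up for illustration.

```python
def selected_candidate_types(report):
    """Return (local_type, remote_type) for the selected ICE pair, or None."""
    by_id = {stat["id"]: stat for stat in report}
    for stat in report:
        pair_id = stat.get("selectedCandidatePairId")
        if stat["type"] == "transport" and pair_id in by_id:
            pair = by_id[pair_id]
            local = by_id[pair["localCandidateId"]]
            remote = by_id[pair["remoteCandidateId"]]
            return local["candidateType"], remote["candidateType"]
    return None


def is_relayed(report):
    """True when either end of the selected pair is a TURN relay candidate."""
    types = selected_candidate_types(report)
    return types is not None and "relay" in types


def inbound_loss_fraction(report):
    """Fraction of inbound RTP packets reported as lost (0.0 if no data)."""
    inbound = [stat for stat in report if stat["type"] == "inbound-rtp"]
    lost = sum(stat.get("packetsLost", 0) for stat in inbound)
    received = sum(stat.get("packetsReceived", 0) for stat in inbound)
    total = lost + received
    return lost / total if total else 0.0


# Illustrative report as a client might upload it (values are invented).
sample_report = [
    {"id": "T1", "type": "transport", "selectedCandidatePairId": "CP1"},
    {"id": "CP1", "type": "candidate-pair",
     "localCandidateId": "L1", "remoteCandidateId": "R1"},
    {"id": "L1", "type": "local-candidate", "candidateType": "relay"},
    {"id": "R1", "type": "remote-candidate", "candidateType": "srflx"},
    {"id": "IN1", "type": "inbound-rtp", "packetsLost": 5, "packetsReceived": 95},
]

print(is_relayed(sample_report))
print(inbound_loss_fraction(sample_report))
```

With reports like this arriving from both users, the service can answer the questions above: whether the call is direct or relayed, and how much packet loss each side sees.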

Within a peer-to-peer system, P2P, we are going to try to go direct from one user to another with the assumption that this will give us the best quality. We’re going to question that simple assumption later on. Sometimes going direct is impossible. I’ve got user A, I’ve got user B, and let’s say there’s a firewall in the middle that is blocking traffic between the users. In such a case, we may relay the data that we have through a TURN server. We’re going to have a server that we use to relay the media because we can’t find any direct route. In both cases, we don’t know the experience unless we collect the data from the client devices directly. So I know that the data is flowing through a TURN server, but that’s about it. I don’t know more than that. Now, we need to understand that TURN servers are required in every RTE platform. They help us get sessions connected; usually between 10% and 20% of your sessions will require a TURN server to let them through. That’s because of the way the networks are built; with some types of users on their specific networks, it might happen more. If you’re focused on enterprises, then you might see the TURN numbers going above 20%. If your target audience is professional gamers, it might be a lot lower than 20%. It’s up to you to understand and find out what percentage of your sessions will go through a TURN server.
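Finding your own TURN percentage is a matter of counting. A minimal sketch, assuming each finished session is logged as a record with a hypothetical `segment` label and a `relayed` flag (the sample records are invented and are not measurements):

```python
from collections import defaultdict


def turn_share_by_segment(sessions):
    """Percentage of sessions that used a TURN relay, per audience segment."""
    totals = defaultdict(int)
    relayed = defaultdict(int)
    for session in sessions:
        totals[session["segment"]] += 1
        if session["relayed"]:
            relayed[session["segment"]] += 1
    return {seg: 100.0 * relayed[seg] / totals[seg] for seg in totals}


# Invented session log, only to show the shape of the data.
session_log = [
    {"segment": "enterprise", "relayed": True},
    {"segment": "enterprise", "relayed": False},
    {"segment": "enterprise", "relayed": False},
    {"segment": "enterprise", "relayed": False},
    {"segment": "gaming", "relayed": False},
    {"segment": "gaming", "relayed": False},
]

print(turn_share_by_segment(session_log))
```

Splitting the number by segment matters because, as noted above, the share can differ a lot between audiences, and it drives how much TURN capacity you need to provision.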

So what happens when we relay media is that the signaling goes through a signaling server, but instead of the media going directly between the users’ devices, it is relayed through a TURN server. In this case, you see the media going between the users and the server, back and forth. This can happen in one of two ways: either because there is no direct path, or because we decided that TURN is the better approach for us for certain reasons. We can also relay through a media server. When we relay through a media server, we have greater control over the media path: greater control than a peer-to-peer session that goes directly, and also greater control than a peer-to-peer session that is routed only through a TURN server.
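The "we decided" case is usually a client configuration choice. The sketch below builds the configuration a service might hand to a browser client. The keys `iceServers` and `iceTransportPolicy` are the standard WebRTC `RTCConfiguration` fields, where `"relay"` restricts ICE to relay candidates; the helper name, the server URL, and the credentials are placeholders.

```python
import json


def build_ice_config(turn_url, username, credential, force_relay=False):
    """Build an RTCConfiguration-shaped dict for the client.

    With force_relay=True the client only uses relay candidates, so media
    always goes through the TURN server even if a direct path exists.
    """
    return {
        "iceServers": [
            {"urls": [turn_url], "username": username, "credential": credential}
        ],
        "iceTransportPolicy": "relay" if force_relay else "all",
    }


# Placeholder server and credentials, for illustration only.
config = build_ice_config(
    "turn:turn.example.com:3478", "demo-user", "demo-secret", force_relay=True
)
print(json.dumps(config, indent=2))
```

Leaving the policy at `"all"` covers the first case, where TURN is only a fallback when no direct path is found.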

The other thing that we are going to use media servers for is probably recording the session. We can record the session on the client side, but in many cases, recording it on the server side makes a lot more sense. Going through a media server adds one more benefit: the opportunity to optimize the quality that we have for certain scenarios. If we know what is going on for the users, if we understand what their bitrates are, what their frame rates are, how much packet loss there is, and what the latency is, we can try to do more and optimize it further. Let’s take a small example. We have two users, one in India, the other in South America, and we can go directly over the public Internet with a session between them. If we’re going direct, then we rely on the public Internet to get the best route possible. But the public Internet doesn’t really work like that. It will go through the route that was decided by the carriers on both ends, the service providers that provide Internet access to the users in the session. Sometimes it might be great, sometimes not that much. It can go through weird routes that actually hinder the quality instead of improving it. Now, if we go through a managed route, it means that we’re going to put media servers or forced TURN relays close to the end users. When we do that, we actually manage the data traffic that goes between these two servers. We orchestrate these routes, we force that route and reshape that route based on our needs. This is how SDN works today. This is how the RTE-SDN for Agora works as well.
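The direct-versus-managed decision can be sketched as a simple latency comparison. This is an illustration only, not a description of how Agora's network or any particular SDN chooses routes; the function name and the latency figures are invented.

```python
def pick_route(direct_rtt_ms, managed_legs_ms):
    """Choose between the direct path and a managed path.

    managed_legs_ms lists the latency of each leg of the managed route:
    user A to its nearby server, server to server, and the far server to user B.
    """
    managed_rtt_ms = sum(managed_legs_ms)
    if managed_rtt_ms < direct_rtt_ms:
        return "managed", managed_rtt_ms
    return "direct", direct_rtt_ms


# Invented numbers: a poor direct route versus three managed legs.
print(pick_route(320.0, [15.0, 210.0, 20.0]))
```

A real system would weigh packet loss and jitter as well as latency, and would keep re-measuring, since routes on the public Internet change over time.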

In summary, even the simplest form of one-to-one sessions can be hard. It’s hard not on the implementation part, but on the troubleshooting in production. Users are going to come and complain about quality, not because your service is bad, but because they have issues. The issues might be within their home network, within their devices, or anything else, and you need to be able to assist them in troubleshooting these issues. If you don’t know what they’re experiencing, you won’t be able to troubleshoot that. One-on-one, at least in the peer-to-peer form, also offers less control, which means fewer optimization opportunities. With an RTE platform, a real-time engagement platform, what we want is to have that kind of control, because we want the flexibility to provide the best possible solution for the user in any given scenario and at any given time. Thank you.