What it takes to build real-time voice and video infrastructure

In this series, WebRTC expert Tsahi Levent-Levi provides an overview of the essential parts of a real-time voice and video infrastructure, from network to software updates. Check out his informative videos and read how Agora’s platform solves the challenges so you can focus on your innovation.

1.3 Components of the RTE infrastructure

Category: Chapter 1: Introduction

In real-time engagement (RTE), there is a lot more going on than meets the eye. Get a high-level view of RTE platform infrastructure.

Dive deeper: Learn how Agora approaches RTE.


Let’s start with an introduction. I want to begin by talking about the components of an RTE infrastructure, or real-time engagement platform. What I want to do in this lesson is to list the components that are required inside an RTE environment and also to review the purpose of these components.


We’ll begin by making a stone soup. It’s kind of how I think about RTE, real-time engagement. When you look at the platform and start to break it down into building blocks, it ends up being a stone soup. Stone soup is an old story of a guy who came into town; nobody knows him. He’s hungry. So he comes to one of the houses, knocks on the door and says that he wants to eat, but nobody wants to deal with him. So he says he can make a soup out of a stone. This intrigues the people, so they let him in. He asks for a large pot and water, he boils the water, puts the stone inside and starts stirring and stirring and stirring. And people get a bit agitated. So they ask him, “Okay, how much time is it going to take you to make a soup out of a stone?” And he says, “Well, it’s going to take about a day.”

They don’t have the time; they don’t want to wait. So they start asking, “Can you speed up the process a bit?” And he says, “Sure, if you bring me an onion, we can cut a few hours off the time it will take to make the soup.” Obviously, they bring in the onion, and he continues by adding more and more vegetables, and then they have a soup.

So what happens here is that we have a soup that started with a stone but ended up with a lot of different ingredients. When we talk about real-time engagement, we need to understand what our stone is to begin with. In a way, our stone is the devices that we are going to use. What are these devices, exactly? These devices can be smartphones, as we’ve just seen in the image. There are also tablets, laptops, desktops and other kinds of devices that we use on a daily basis as computing devices. But they can also be embedded devices and sensors, and also browsers. With browsers, we mean WebRTC (web real-time communications). It’s a specification that now exists in all browsers. A browser is a kind of device, but not exactly. You don’t need to install or download anything; it’s just there. So you can send a link to someone and they can start using an RTE platform just by interacting with their browser. So we’ve seen these devices.

Now let’s go and look at the infrastructure, all of the hidden parts of our stone soup. The first part would be signaling and messaging. I’ve got two users: this red hat guy over here wants to speak to another guy there. He needs to send a message: he wants to connect to him and interact with him. How does he do that? How does he find that other person and become able to interact with him? That’s done through a signaling server or a messaging server.
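To make the signaling idea concrete, here is a minimal sketch of the core job of such a server: keeping track of registered users and relaying messages from one to another. It is a toy, in-memory illustration only. The class name, the method names and the message shapes (a chat message and an "offer"/"answer" pair for session negotiation) are assumptions made for this example, not the API of any particular platform.

```python
from collections import deque


class SignalingServer:
    """Toy in-memory signaling server (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps a user id to the queue of (sender, message) pairs waiting for them.
        self._inboxes = {}

    def register(self, user_id):
        """Make a user known to the server so that others can reach them."""
        self._inboxes.setdefault(user_id, deque())

    def send(self, sender, recipient, message):
        """Relay a message; return False if either side is not registered."""
        if sender not in self._inboxes or recipient not in self._inboxes:
            return False
        self._inboxes[recipient].append((sender, message))
        return True

    def receive(self, user_id):
        """Return and clear all messages waiting for a registered user."""
        inbox = self._inboxes[user_id]
        messages = list(inbox)
        inbox.clear()
        return messages


server = SignalingServer()
server.register("red_hat_guy")
server.register("other_guy")

# A plain chat message and a session negotiation travel over the same channel.
server.send("red_hat_guy", "other_guy", {"type": "chat", "text": "Got a minute?"})
server.send("red_hat_guy", "other_guy", {"type": "offer", "session": "<description>"})
server.send("other_guy", "red_hat_guy", {"type": "answer", "session": "<description>"})
```

A real signaling server would do the same thing over a network connection and would also handle authentication and presence; the point here is only that the voice and video themselves never pass through it, just the messages needed to set the session up.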

Their purpose is to connect these users and register them somehow, so that the servers know they exist, and then to facilitate the messages that go between them. That includes chat, messages and images, but also the negotiation process of the sessions themselves. Sessions are where voice and video interactivity takes place. So you’ve got the signaling and messaging servers.

Now we need to be able to connect the media. To connect the media, we’ve got these two users, and there’s a kind of a firewall or a NAT device in the middle. NAT stands for network address translation. These are devices that are built into the fabric of the internet today. They help with security, negotiation, IP addressing and other things that are needed within modern networks. What they also do, though, is block media messages from going from our guy to the other person. To that end, we’ve got NAT traversal solutions in the form of STUN and TURN servers; we’ll discuss them later. These servers are used to negotiate a path for the data through firewalls and NATs where possible, and otherwise to relay the data through the TURN servers. So their purpose in life is to make sure that the media that you’re trying to send from one user to another, from one device to another, actually finds its way to the other side.
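As a small taste of what NAT traversal looks like on the wire, the sketch below builds the request a client sends to a STUN server to ask "what address do you see me coming from?" and decodes the address field of the reply. The transcript does not go into this level of detail; the layout used here follows the STUN specification (RFC 5389), and the function names are made up for this example. No network traffic is sent, and only IPv4 is handled.

```python
import os
import socket
import struct

STUN_BINDING_REQUEST = 0x0001   # message type for a Binding request
STUN_MAGIC_COOKIE = 0x2112A442  # fixed value defined by RFC 5389


def build_binding_request(transaction_id=None):
    """Return the 20-byte STUN header of a Binding request with no attributes."""
    if transaction_id is None:
        transaction_id = os.urandom(12)
    if len(transaction_id) != 12:
        raise ValueError("transaction id must be exactly 12 bytes")
    # Type (2 bytes), attribute length (2 bytes, zero here), magic cookie (4 bytes).
    header = struct.pack("!HHI", STUN_BINDING_REQUEST, 0, STUN_MAGIC_COOKIE)
    return header + transaction_id


def decode_xor_mapped_address(value):
    """Decode the value of an IPv4 XOR-MAPPED-ADDRESS attribute to (ip, port)."""
    _reserved, family, xor_port = struct.unpack("!BBH", value[:4])
    if family != 0x01:
        raise ValueError("this sketch only handles IPv4")
    # The port is XORed with the top 16 bits of the cookie, the address with all 32.
    port = xor_port ^ (STUN_MAGIC_COOKIE >> 16)
    (xor_addr,) = struct.unpack("!I", value[4:8])
    ip = socket.inet_ntoa(struct.pack("!I", xor_addr ^ STUN_MAGIC_COOKIE))
    return ip, port


request = build_binding_request()
print(len(request))  # 20

# Attribute value taken from the RFC 5769 sample response.
print(decode_xor_mapped_address(bytes.fromhex("0001a147e112a643")))
# ('192.0.2.1', 32853)
```

The address that comes back is the public address the NAT assigned to the client, which it can then share with the other side through signaling. When a direct path cannot be found this way, a TURN server relays the media instead.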

Another type of server that we have is media servers. When you want to do a one-on-one session, that’s easy. When you want to do large group sessions or broadcasts, or to process media somewhere in the server to do AI (artificial intelligence) or to record stuff, then we deploy media servers. These deal with the use cases and scenarios where you can’t do the things that you want to do directly on the client device.
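One way to see why large groups push you toward media servers is to count media streams per participant. The sketch below compares two setups: every participant sending directly to every other one, and every participant sending once to a media server that forwards the stream to the others. The forwarding behavior is an assumption made for this example (media servers can also mix or transcode), and the function names are hypothetical.

```python
def direct_streams(participants):
    """Everyone connects to everyone: send to, and receive from, each other person."""
    others = max(participants - 1, 0)
    return {"uplink": others, "downlink": others}


def media_server_streams(participants):
    """Everyone sends one stream to a forwarding server and receives the rest from it."""
    others = max(participants - 1, 0)
    return {"uplink": 1 if others else 0, "downlink": others}


for size in (2, 5, 20):
    print(size, direct_streams(size), media_server_streams(size))
```

For a one-on-one session the two setups cost the same, which matches the "that's easy" case. At 20 participants, each device would have to encode and upload 19 streams without a server but only one with it, which is work a phone or a browser usually cannot take on.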

So what did we have until now? We’ve seen our stone soup, and in that stone soup are the devices and the browsers. We’ve seen the vegetables, the other infrastructure and servers that are usually hidden from the eye (when you start working with RTE), and we have the application server itself. This is the application that you are writing, the logic of your application, and next to it you place a signaling and messaging server. We had the NAT traversal servers, such as TURN, and then we also had media servers. That comprises our stone soup.

So what have we done so far? We’ve seen that there is more to real-time engagement than just the devices that the clients (the users) interact with directly. There are many different moving parts and different tasks that are needed in order to build our environment.