What it takes to build real-time voice and video infrastructure

In this series, WebRTC expert Tsahi Levent-Levi provides an overview of the essential parts of a real-time voice and video infrastructure, from network to software updates. Check out his informative videos and read how Agora’s platform solves the challenges so you can focus on your innovation.

2.4 Server Costs

Category: Chapter 2: Challenges

One of the biggest business challenges with RTE is the cost of real-time servers. Learn about components, costs, and deployment alternatives.

Dive deeper: Agora provides everything you need to embed real-time interactive voice and video into your app, with an affordable pay-as-you-go pricing model and 10,000 free minutes every month.


One of the biggest challenges that we’re going to have in our real-time engagement platform is the server costs of the real-time communication components. Here’s what we’re going to do in this lesson: We’re going to understand several requirements and the costs associated with them and we’re going to review potential deployment alternatives that we’ve got for these servers.


The big three components that are going to make up most of our costs are, first and foremost, the network: we’re going to pay a lot for network traffic. Then CPU, the actual server machines that we’re going to deploy and run. And to some extent storage, at least if we’re recording sessions.

To remind you, this is what we’ve got in our architecture: we’ve got devices and browsers on the client end, and in the server infrastructure we’ve got application servers, signaling servers, NAT traversal servers, and media servers.

Let’s go one by one on these servers, to understand exactly what they are and what resources they need. And let’s start with the application server. The application server is where the application logic resides. This is where all of the logic that goes towards the user and the way our application works is being implemented.

In many ways, what we have here are the normal application development requirements in terms of resources. A lot of it will go towards memory, because we’ll need to store information about users and then use it in order to run the different scenarios that we have. CPU, storage, and network are not that big of a deal.

Then we’ve got the signaling server, which is used to connect users for session creation purposes. These are the messages that users communicate to one another in order to connect and disconnect sessions. The requirements here are almost the same as those of the application server: mostly memory, and a bit less when it comes to CPU, storage, and network. We want to deploy these signaling servers as close as possible to our application and to the database that we’ve got. In most cases, these are going to be stateful machines: they need to track what state the user and the session are in.

We’ve got the NAT traversal servers. These are STUN and TURN servers, mainly TURN, and they are used to traverse firewalls. The purpose of TURN is to relay media through servers, and when they relay that media, it means that they use a lot of network. This is why the requirements here are mainly on network resources, a lot less on CPU and storage, and somewhat more on memory, because of the traffic that they relay all the time. We deploy these machines closer to the users in most cases. These machines are stateless in their nature; they simply pass messages or data around the network.

When it comes to media servers, it depends. What resources they use depends a lot on what we’re trying to achieve.
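Going back to the TURN servers for a moment, their network load can be put into rough numbers. The sketch below is a back-of-the-envelope estimate, not anything from the video: the function name and the example bitrate are made up, and the only assumption is that every relayed stream crosses the server twice, once inbound and once outbound.

```python
def turn_relay_bandwidth_mbps(relayed_streams: int, bitrate_mbps: float) -> float:
    """Rough server bandwidth needed to relay media through TURN.

    Each relayed stream passes through the server twice: once inbound
    from the sender and once outbound to the receiver, so it costs
    roughly 2x its bitrate in server network capacity.
    """
    return relayed_streams * bitrate_mbps * 2


# 500 relayed video streams at 1.5 Mbps each is roughly 1.5 Gbps of capacity
print(turn_relay_bandwidth_mbps(500, 1.5))  # 1500.0
```

This is why network is the dominant cost line for TURN: the servers do almost no processing, they just pay for every relayed bit twice.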

If we’re going to use an SFU, a media router (Selective Forwarding Unit), then our focus is going to be group calls and broadcast sessions. An SFU receives traffic and then routes it to other machines, so the resource it will need most is network. After that, memory, then CPU, and storage. Like TURN servers, or NAT traversal servers, we deploy SFUs as close as possible to the users. And these machines are somewhat stateful: they need to understand the session, but they can be replaced in the middle of a session using the signaling.
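To see why network dominates for an SFU, consider a simple back-of-the-envelope model; the function name and numbers below are illustrative, not from the video. With N participants each sending one stream, the server receives N streams and forwards N−1 streams to each participant.

```python
def sfu_bandwidth_mbps(participants: int, bitrate_mbps: float) -> tuple[float, float]:
    """Estimate SFU server bandwidth for one group call.

    The SFU receives one stream per participant (ingress) and forwards
    each participant the streams of everyone else (egress), so egress
    grows roughly quadratically with call size.
    """
    ingress = participants * bitrate_mbps
    egress = participants * (participants - 1) * bitrate_mbps
    return ingress, egress


# A 10-person call at 1 Mbps per stream: 10 Mbps in, 90 Mbps out
print(sfu_bandwidth_mbps(10, 1.0))  # (10.0, 90.0)
```

In practice SFUs trim this with simulcast and subscription logic, but the shape of the curve is why egress bandwidth is the first thing to budget for.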

If we’re using media mixers, an MCU (Multipoint Conferencing Unit), then what we’re doing is receiving media from multiple users in a session, which can be a group call or a broadcast session, and mixing it together into a combined stream that is sent to everyone. These machines require a lot of CPU. In order to mix the data, they need to decode the inputs, mix them, and encode the result, and that requires a lot of CPU and a lot of memory, with less network and storage. We deploy these close to the users, and they are very stateful, because they need to mix that data.
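The contrast with the SFU can be sketched the same way. This is again an illustrative model with made-up names, and it assumes a single shared layout, meaning one encode fanned out to every participant: the MCU’s egress stays linear in the number of participants, but it pays for that with one decode per input plus the encoding work.

```python
def mcu_resources(participants: int, bitrate_mbps: float) -> dict:
    """Estimate MCU load for one mixed group call.

    Bandwidth stays linear (one mixed stream out per participant), but
    the server must decode every input and encode the mixed output,
    which is where the heavy CPU cost comes from.
    """
    return {
        "ingress_mbps": participants * bitrate_mbps,
        "egress_mbps": participants * bitrate_mbps,
        "decodes": participants,
        "encodes": 1,  # assumption: the same mixed layout is sent to everyone
    }


# A 10-person call at 1 Mbps per stream
print(mcu_resources(10, 1.0))
```

Compare this with the SFU model above: the MCU moves far fewer bits for the same call, but every one of those decode and encode steps is CPU the SFU never spends.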

Another type of media server is the recording server, used when we want to record sessions. These ones take a lot of storage and network, and then CPU. We deploy them close to the users, and they are somewhat stateful.

In terms of our deployment approaches, we can run everything on a single cloud vendor (Google Cloud, AWS, Azure), or we can go to tier-two clouds (DigitalOcean, Linode, and a few others). Then we can go for wholesale, or colocation and carriers, and put the servers in data centers. Or we can go multi-cloud, where we pick more than a single cloud vendor.

What we pick will depend a lot on the things that we’re trying to achieve. If we look at it through the prism of costs, meaning the cost of bandwidth and compute, the top two of the big three we discussed, then cloud is going to be the most expensive and wholesale the cheapest. Looking at the big three cloud vendors: Amazon has 24 regions where they’re running, Google Cloud 23, Microsoft Azure 60+. There are more countries in the world than that. So if we deploy on a single cloud vendor, even one of the big three, then the end result is going to be that our deployment is lacking, not available in certain countries. These numbers are growing all the time, but they might not catch up, at least not today, with where our users actually are, and as we said, we want our media servers as close as possible to the users.

So, from a price perspective, wholesale is going to be the cheapest and cloud the most expensive. From a footprint perspective, tier twos are going to be the worst, then cloud, and then wholesale. Now, wholesale isn’t the best here because any single wholesale vendor has such a big footprint, but because with this strategy, much like the multi-cloud one, we go to different wholesale vendors, each with its own data center in a specific location, and work with multiple such vendors and data centers to get the footprint that we need.
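To make the price gap concrete, here is a hypothetical comparison. The per-GB egress rates below are placeholders made up for illustration, not actual quotes from any vendor; only the relative ordering (cloud most expensive, wholesale cheapest) reflects the point above.

```python
# Hypothetical egress prices in USD per GB -- illustrative placeholders only.
RATES_USD_PER_GB = {
    "cloud": 0.09,       # big-three public cloud
    "tier_two": 0.01,    # smaller cloud vendors
    "wholesale": 0.002,  # colocation / carrier bandwidth
}


def monthly_egress_cost_usd(terabytes: float) -> dict:
    """Monthly bandwidth bill for the same traffic under each option."""
    gigabytes = terabytes * 1000
    return {name: round(gigabytes * rate, 2) for name, rate in RATES_USD_PER_GB.items()}


# 100 TB of media egress per month under each deployment approach
print(monthly_egress_cost_usd(100))
```

Even with made-up rates, the point holds: because media traffic is the biggest cost line, an order-of-magnitude difference in per-GB price dominates the comparison at scale.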

In terms of velocity, the speed at which we can develop new features, cloud is the best approach because it has the best tooling; wholesale is the worst, since you get only the basics there. In terms of hypergrowth, the ability to add new machines fast and dynamically allocate or reduce them as needed, the cloud is again the best solution, while wholesale will be the slowest to answer these demands.

To summarize, there are different types of servers required in real-time communication and engagement, and they require different types of resources, so we need to deploy them on different types of machines. Where exactly we deploy them around the globe, on which type of servers, and how we deploy them will affect our costs, but also the quality that we give our users. Thank you.