What it takes to build real-time voice and video infrastructure

In this series, WebRTC expert Tsahi Levent-Levi provides an overview of the essential parts of a real-time voice and video infrastructure, from the network to software updates. Check out his informative videos and read how Agora’s platform solves the challenges so you can focus on your innovation.

4.4 Recording

Category: Chapter 4: Media Processing


Dive deeper: Agora allows you to easily capture and record browser, audio, and video streams with our on-premise and cloud-based Recording extension.


When dealing with media processing, we also need to discuss and think about recording. Here’s what we’re going to do in this lesson: we’re going to understand how media recording works, and we’re going to list the alternatives in front of us when we’re trying to record.


Our focus is going to be on recording multi-party sessions, sessions where there are multiple participants. Remember that we’re using an SFU architecture, a selective forwarding unit, where each participant sends its media toward the media server, and the media server doesn’t mix or combine these inputs but rather routes them and forwards them to the other participants. So each user is going to receive multiple video streams. Our goal is to take these streams from all participants and send them toward a storage device or some kind of recording component.
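The routing described above can be sketched in a few lines. This is a minimal illustration, not Agora’s actual API; the class and method names are invented for the example. The key point is that the SFU forwards each incoming packet unmixed to every other participant, and a recording component is simply one more destination for the same copies.

```python
# Minimal sketch of SFU-style forwarding with a recording "tap".
# All names here are illustrative, not a real media-server API.

class SFU:
    def __init__(self):
        self.participants = {}   # participant id -> list of packets delivered to them
        self.recorder = []       # packets forwarded to the recording component

    def join(self, participant_id):
        self.participants[participant_id] = []

    def on_media(self, sender_id, packet):
        """Route a packet from one participant to all others, unmixed,
        and send a copy toward storage/recording."""
        for pid, inbox in self.participants.items():
            if pid != sender_id:
                inbox.append((sender_id, packet))
        self.recorder.append((sender_id, packet))

sfu = SFU()
for pid in ("alice", "bob", "carol"):
    sfu.join(pid)
sfu.on_media("alice", "frame-1")   # bob and carol each get a copy; so does the recorder
```

Note that the server never decodes or combines anything here, which is exactly why this architecture is cheap to run.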

In general, there are three different server-side recording architectures or approaches that we can discuss. There is multi-stream recording, similar to what you’ve just seen in the diagram. Then there’s combined recording, where you actually mix all of the inputs into a single video file. And then there’s recording metadata as well. Let’s go one by one and see what they mean.

When we talk about multi-stream recording, I’m going to record each stream separately. The selective forwarding unit, the media server, is going to send the media it received the same way it would to any other participant. If we record each stream separately, we’re not going to mix the streams. We might mix the audio only; that depends on what we want to do. The result is a low-cost recording: there is no mixing, no processing happening in the backend besides storing the data as the media server receives it. So the recording is going to be cheaper than anything else.

The only problem is that we’re going to have a rather poor playback experience, because we will need to either combine the streams before we play them back, or build a player of our own that is capable of taking these multiple streams and playing them back together automatically. The challenge there is to synchronize them in real time and handle all the things that give us headaches: available bitrate, packet loss, and the other network issues. This type of solution is good for governance needs. It means we’re not going to use the recordings much, only when we need to go and look at a specific recording in a large list of archived files; say, one in 1,000 files that were recorded is actually going to be played back. So we’re not looking at massive playback scenarios here.
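The synchronization problem mentioned above can be illustrated with a small sketch. Assuming each stream file carries the wall-clock time (in milliseconds) at which its recording started, a player can compute per-stream playback delays so that all streams line up on a common timeline. The file names and timestamp convention here are hypothetical.

```python
# Sketch: aligning separately recorded streams for playback.
# Assumes each stream is tagged with the wall-clock start time (ms)
# of its recording; names and values are illustrative.

def playback_offsets(start_times_ms):
    """Return per-stream delays (ms) so all streams share one timeline.
    The earliest stream plays immediately; each later stream is delayed
    by the difference between its start and the earliest start."""
    earliest = min(start_times_ms.values())
    return {stream: t - earliest for stream, t in start_times_ms.items()}

offsets = playback_offsets({
    "alice.webm": 1_000,
    "bob.webm":   1_250,   # joined 250 ms after alice
    "carol.webm": 1_040,
})
```

A real player would additionally have to cope with clock drift, gaps from packet loss, and mid-call bitrate changes, which is why building such a player is the hard part of this approach.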

The second approach is combined recording. If we look at the SFU, it receives the data and then sends the data to someone else. If we have two participants, it receives the data from each one and sends it to the other participant. What we’re going to do now is add the following components to our media server: we’re going to decode the data and mix the data. These are denoted here in red because they take more CPU power and more memory than just receiving and sending. After we mix the data, we’ve got a single stream with all the data we need. We need to encode it again and then store it somewhere. So this process is going to be more time-consuming, or CPU-consuming; it’s going to eat up more resources, and it’s going to be more expensive for us to deploy.
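As a concrete illustration of the decode-mix-encode pipeline, one common way to implement the offline variant is with ffmpeg: its real `hstack` and `amix` filters tile the videos side by side and mix the audio tracks into one. The helper below only builds the command line as a sketch; the file names and layout are illustrative, not a prescribed setup.

```python
# Sketch: an ffmpeg command that decodes, mixes, and re-encodes several
# recorded streams into one file -- the "combined recording" pipeline.
# hstack/amix are real ffmpeg filters; paths and layout are illustrative.

def combine_cmd(inputs, output):
    n = len(inputs)
    cmd = ["ffmpeg"]
    for path in inputs:
        cmd += ["-i", path]                                  # one decoder per stream
    video = "".join(f"[{i}:v]" for i in range(n)) + f"hstack=inputs={n}[v]"
    audio = "".join(f"[{i}:a]" for i in range(n)) + f"amix=inputs={n}[a]"
    cmd += ["-filter_complex", f"{video};{audio}",           # mix video and audio
            "-map", "[v]", "-map", "[a]", output]            # re-encode to one file
    return cmd

cmd = combine_cmd(["alice.webm", "bob.webm"], "session.mp4")
```

Every input stream is decoded, composited, and re-encoded, which is exactly where the extra CPU and cost of this approach come from.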

We can do this in real time or offline. Doing it in real time means that the moment I receive the data, I’m creating the mixed video and storing it somewhere. Offline means that I’m going to store the streams themselves first, as we would in the first approach. Later on, there’s going to be an offline process that takes all of these separate streams, mixes them, and combines them into a single file. Real time is really good if what we need is immediate playback. For example, I want to take the data and push it into YouTube Live.
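The real-time versus offline split boils down to a per-packet routing decision, which can be sketched as follows. The names are illustrative; in a real system the "live mixer" would be the decode-mix-encode pipeline and the "raw store" would be stream files on disk or in object storage.

```python
# Sketch: the same incoming packets can feed a real-time mixer (for
# immediate playback, e.g. restreaming to a live platform) or be stored
# raw for a deferred mixing job. Names are illustrative.

def handle_packet(packet, mode, live_mixer, raw_store):
    if mode == "realtime":
        live_mixer.append(packet)   # mixed and encoded as it arrives
    else:                           # "offline": store now, mix later
        raw_store.append(packet)

live, raw = [], []
handle_packet("frame-1", "realtime", live, raw)
handle_packet("frame-2", "offline", live, raw)
```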

Offline is good for infrequent access to the recording. It’s a combination of the first approach and the second one: we store multiple streams, and then we combine them when someone wants to play them back. Offline implementations are slightly tricky in how you synchronize the streams properly. It is possible to do, but it takes a bit more work to get done properly.

The third approach that we have is recording metadata as well. This is a session where we have voice and video, but we’ve also got other types of media, other types of information, that we want to capture and record. This might be the text chat that is going on, the emojis that are being sent around, a presentation that is being displayed, and other data that we want to have as part of the recording.
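One simple way to capture such non-media events is to log them with timestamps on the same clock as the media recording, so a custom player can replay them in sync. The sketch below is one possible shape for that log; the class, field names, and event kinds are all illustrative, not a real recording format.

```python
# Sketch: capturing non-media events (chat, emojis, slide changes)
# alongside the media recording, timestamped relative to recording
# start so they can be replayed in sync. Names are illustrative.

import json

class MetadataRecorder:
    def __init__(self, start_ms):
        self.start_ms = start_ms   # wall-clock time the media recording began
        self.events = []

    def log(self, now_ms, kind, payload):
        self.events.append({
            "t": now_ms - self.start_ms,  # ms since recording start
            "kind": kind,
            "payload": payload,
        })

    def dump(self):
        return json.dumps(self.events)   # stored next to the media files

rec = MetadataRecorder(start_ms=10_000)
rec.log(12_500, "chat", {"from": "alice", "text": "hi"})
rec.log(13_000, "emoji", {"from": "bob", "emoji": "👍"})
```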

So, we can either put that data aside, next to the recording, and someone will build a player that uses it. Or we can use an approach that relies on a web browser. When we use a web browser, we’re saying the following: we’re going to have a kind of silent participant. That silent participant is going to be our browser in the cloud. It joins the session; nobody knows it’s there, but they know there is a component that is recording something. The silent participant receives all of the streams as if it were a user, and it can record whatever is on the screen. So if we’re rendering chat messages, emojis, or videos on the screen, all of that will go into the recording itself.

This solution is expensive, but it’s also very flexible. If you think about it, I can modify the screen itself, my web application for this recording component, to show the exact layout that I want to appear in the final recording. I can add layers on top of that, like the names of the participants next to their videos, logos, banners, or branding, or whatever else I want within the recording itself. This is also quite a common approach to recording large-scale video calls today.
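A silent participant of this kind is typically a headless browser running in the cloud. The sketch below only assembles a launch command; the join URL, query parameters, and layout name are hypothetical placeholders, not a real API, while `--headless=new` and `--mute-audio` are real Chromium flags. A real deployment would also capture the browser’s rendered output, for example via a virtual display, to a file.

```python
# Sketch: launching a "silent participant" that joins the call in a
# headless browser and records whatever the page renders.
# The URL and query parameters are hypothetical; the Chromium flags are real.

def recording_bot_cmd(join_url, layout="grid"):
    return [
        "chromium",
        "--headless=new",                          # run without a visible window
        "--mute-audio",                            # the bot listens, never speaks
        f"{join_url}?silent=1&layout={layout}",    # hypothetical app parameters
    ]

cmd = recording_bot_cmd("https://example.com/room/42")
```

Because the recorded output is just the rendered page, changing the recording layout or branding is a web-development task rather than a media-pipeline change, which is the flexibility described above.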

So, what did we have? We’ve seen multiple recording alternatives. We’ve seen how we can record split streams, how we can combine them, and how we can also record metadata by using a browser as a silent participant. What you need to do is understand your requirements and then fit the specific alternative to them. Thank you.