What it takes to build real-time voice and video infrastructure

In this series, WebRTC expert Tsahi Levent-Levi of BlogGeek.me provides an overview of the essential parts of a real-time voice and video infrastructure—from network to software updates. Check out his informative videos and read how Agora’s platform solves the challenges so you can focus on your innovation.

2.2 Codecs

Category: Chapter 2: Challenges

The codec is an essential component of real-time engagement (RTE), and choosing the right one is important. Learn about the key considerations when selecting a codec.

Transcript

Let’s talk about codecs and where they fit into the challenges we’ve got with real-time communication and engagement. Here’s what we’re going to do in this lesson: we’re going to review what codecs are exactly, and we’re going to frame the conversation in the context of real-time engagement.


Codecs, at the end of the day, take data from the analog world and shift it towards the digital. We’re going to capture information with the camera, the images that we see, and convert them to a digital representation: a set of frames, and in each frame, a set of pixels. We’re going to do the same for voice by using microphones. What codecs do is take that data and then compress it.

There are different types of codecs in the world. There are lossless codecs. These are codecs that take the data that they see and then compress it in a way that doesn’t lose any information. And then there are lossy codecs. With lossy codecs, what we do is take the data that we captured and then throw away all the parts that are not perceived by the human ear or the human eye, the things that we simply don’t see, so why use them? So we lose that data. Once we’ve lost that data, we’re going to compress what is left as much as we can.
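As a rough illustration of that lossy idea, here is a toy quantizer in TypeScript. It is not a real codec and not part of the lesson material, just a sketch of the principle: throw away precision a listener would not notice, so that whatever compression runs afterwards has less data left to encode.

```typescript
// Toy illustration of lossy coding (not a real codec): quantize captured
// audio samples, discarding small differences a listener would not perceive.
// The discarded precision is the "lost" data; it can never be recovered.
function quantize(samples: Float32Array, levels = 256): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Map each sample in [-1, 1] onto a small, fixed set of values.
    out[i] = Math.round(samples[i] * (levels / 2));
  }
  return out;
}
```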

With real-time engagement, we’re going to do one more thing. We’re going to focus on low latency for both encoding and decoding. We want the time that it takes us to compress the data to be as short as possible. If we look at the media subsystem that we have in RTE, what we have is the camera on one end, and then the device or the display on the other end. We’re going to capture the data from that camera or microphone, and we’re going to use an encoder to compress that data. Once we have that data compressed, we’re going to send it over the network.
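Here is a minimal sketch of the capture-and-encode half of that pipeline, assuming a Chromium-based browser with the WebCodecs and MediaStreamTrackProcessor APIs. The sendOverNetwork function is a hypothetical placeholder for your packetizer and transport, and the receiving side mirrors this with a VideoDecoder.

```typescript
// Hypothetical transport hook: packetize and send each compressed chunk.
declare function sendOverNetwork(chunk: EncodedVideoChunk): void;

async function startSending(): Promise<void> {
  // 1. Capture: grab raw frames from the camera.
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const [track] = stream.getVideoTracks();
  // MediaStreamTrackProcessor is currently Chromium-only.
  const frames = new MediaStreamTrackProcessor({ track }).readable;

  // 2. Encode: compress each raw frame into a much smaller bitstream chunk.
  const encoder = new VideoEncoder({
    output: (chunk) => sendOverNetwork(chunk), // 3. Send it over the network.
    error: (e) => console.error('encoder error', e),
  });
  encoder.configure({
    codec: 'vp8',            // which bitstream format to produce
    width: 1280,
    height: 720,
    bitrate: 1_000_000,      // target bits per second
    framerate: 30,
    latencyMode: 'realtime', // favor low encode latency over best compression
  });

  // Feed raw frames to the encoder as they arrive from the camera.
  const reader = frames.getReader();
  for (;;) {
    const { value: frame, done } = await reader.read();
    if (done || !frame) break;
    encoder.encode(frame);
    frame.close(); // raw frames are large; release them promptly
  }
}
```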

On the other end, we’re going to receive the data from the network and pass it to the decoder, and the decoder is going to decompress that information. Then we’re going to play it back either to a monitor or to the speakers of the machine.

There are some general codec characteristics when we look at different codecs. There’s complexity: how much CPU and memory does this codec need in order to operate? The higher the complexity, the more CPU and memory we need. There is latency: how many milliseconds is it going to take you to encode and decode? With real-time engagement, we want this to be as low as possible. There is resiliency: what’s the resiliency of the codec itself to packet loss, without any additions that we’re going to put on top of it?

Usually, the more complicated the codec, or the more compression that the codec brings, the less resilient it is to packet loss. With more modern codecs, by the way, this might not be true anymore, because codec architects are building resiliency into the codec itself as part of the features and requirements they are looking for. One other important aspect of codecs is IPR (royalties and patents). Codecs are usually heavily patented, which means that you will need to pay someone for using the codec.

Another important aspect is popularity. What’s the ecosystem around that codec? How many developers can you find that understand it and will be able to support it? How many vendors are out there that you can outsource work to? Then there is support. Do we have only software support for this codec, or does common hardware include encoding and decoding capabilities? Codecs are complicated, which means that if I have hardware support for them, it might be easier to have that within my application and use less CPU, for example.
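As a sketch of how you might probe for hardware support at runtime, assuming a browser with WebCodecs: the hardwareAcceleration field is defined as a preference hint, so treat the answer as a best-effort signal rather than a guarantee.

```typescript
// Ask the browser whether a given encoder configuration can be satisfied
// when hardware acceleration is requested. hardwareAcceleration is a hint
// in the WebCodecs spec, so this is a best-effort check only.
async function hasHardwareEncoder(codec: string): Promise<boolean> {
  const base: VideoEncoderConfig = {
    codec, // e.g. 'vp8', 'vp09.00.10.08', 'avc1.42E01F'
    width: 1280,
    height: 720,
    bitrate: 1_000_000,
  };
  const { supported } = await VideoEncoder.isConfigSupported({
    ...base,
    hardwareAcceleration: 'prefer-hardware',
  });
  return supported === true;
}

// Usage sketch: prefer a codec with a hardware encoder to save CPU and battery.
hasHardwareEncoder('avc1.42E01F').then((hw) =>
  console.log(hw ? 'hardware H.264 encoder available' : 'software only'));
```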

One thing to remember is that the codec is defined by its decoder, by the way we take a bitstream and unravel it. That means that if I look at the codec specification, what I will see is an explanation of what to do with the bitstream: given this bitstream that was received over the network, this is what you do in order to play it back.

What does that mean? It means that it simply dictates a set of tools that are available on the encoder side. Okay, you can use A, B, C, and D if you want to encode, but it doesn’t say exactly which of these tools to use at any given point in time. This means that the encoder’s complexity and efficiency are going to depend on the implementer. The implementer has many tools on the encoder side, and they will need to pick and choose what to use for each scenario and each new input that they receive in order to give better compression.
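A minimal sketch of that asymmetry, again assuming WebCodecs: the decoder only needs to know which bitstream it is unravelling, while the encoder configuration is where the implementer’s choices live.

```typescript
// The decoder side: "given this bitstream, play it back."
const decoder = new VideoDecoder({
  output: (frame) => { /* hand the decoded frame to the renderer */ frame.close(); },
  error: (e) => console.error('decoder error', e),
});
decoder.configure({ codec: 'vp8' }); // only the bitstream format is required

// The encoder side: same codec string, but how hard to compress, at what
// bitrate, and with which latency/quality trade-off is up to the implementer.
const encoder = new VideoEncoder({
  output: (chunk) => { /* packetize and send */ },
  error: (e) => console.error('encoder error', e),
});
encoder.configure({
  codec: 'vp8',
  width: 1280,
  height: 720,
  bitrate: 600_000,
  framerate: 30,
  latencyMode: 'realtime',
});
```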

We’re going to look at two different codecs, voice and video. When we’re looking at voice or speech codecs (as they’re sometimes called, depending on what they do), we’re going to rate them in two ways: first of all, the quality that they offer, and then the compression rates that we get from them. When you look at quality in audio codecs, what we’re talking about usually is the frequency at which the original speech, music, or sound was captured and then compressed. If we’re talking about, you know, normal or old-time telephony systems, these were narrowband; they captured frequencies between 300 and 3,400 hertz. If we went higher, towards wideband, also known as HD voice (high-definition voice), we’re up to around 7,000 hertz. On top of that, we’ve got super wideband and then fullband. Both of these are good for music. What we want to achieve is wideband for voice to get high-quality audio, and super wideband or fullband if we’re looking at music.
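For reference, here are those bands as a small lookup. The frequency ranges and the sample rates typically used to capture them are approximate, conventional values; the exact edges vary by codec and deployment.

```typescript
// Approximate audio bands and the sample rates conventionally used for them.
const audioBands = {
  narrowband:    { speechHz: [300, 3_400], typicalSampleRateHz: 8_000 },  // classic telephony
  wideband:      { speechHz: [50, 7_000],  typicalSampleRateHz: 16_000 }, // "HD voice"
  superWideband: { speechHz: [50, 14_000], typicalSampleRateHz: 32_000 }, // good for music
  fullband:      { speechHz: [20, 20_000], typicalSampleRateHz: 48_000 }, // full audible range
} as const;
```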

When we talk about video codecs, the question is how do we rate a video codec and what is best? And this is a bigger problem than with voice, because video has a lot more data in it. The more I have to compress, the more CPU and memory I need to invest. If I take the same input and put it through different codecs, let’s say H.261, H.263, H.264, or HEVC, these are different codecs. We know the rightmost codec here, HEVC, is the best in terms of bitrate. It will compress the data the most while keeping the quality. But on the other hand, if you look at the complexity, the codecs that provide us better compression are more complex. They are going to take a lot more CPU. What I mean by all that is that compression is not the only measurement. I cannot say, “Let’s take the codec that gives me the best bitrate and use that,” because this might not even work on my device, or it might take too much CPU, which will kill the battery life, for example, of a cellular phone.

There are different video codecs that we can select from. Today, we’re usually talking about H.264 versus VP8, or HEVC versus VP9. In the near future, we’re going to have AV1, which was only released in the last two years. H.264 and HEVC are royalty bearing, which means that for using them we will need to pay royalties to the MPEG LA patent pool, and maybe to others as well. VP8, VP9, and AV1 are royalty-free codecs; we can use them as we see fit. Which one to use will depend on the use case that we have and the type of solution that we’re looking for.
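As a sketch of steering that choice in plain browser WebRTC (an SDK such as Agora’s will normally make this decision for you), you can reorder the codecs the browser advertises so that a royalty-free codec is negotiated first when the remote side also supports it. The preferCodec helper below is hypothetical.

```typescript
// Reorder the browser's advertised video codecs so the preferred one
// (e.g. 'video/VP9' or 'video/AV1') is negotiated first when both sides support it.
function preferCodec(pc: RTCPeerConnection, mimeType: string): void {
  const caps = RTCRtpReceiver.getCapabilities('video');
  if (!caps) return; // capability query not supported in this browser

  const preferred = caps.codecs.filter((c) => c.mimeType === mimeType);
  const others = caps.codecs.filter((c) => c.mimeType !== mimeType);

  for (const transceiver of pc.getTransceivers()) {
    if (transceiver.receiver.track.kind === 'video') {
      transceiver.setCodecPreferences([...preferred, ...others]);
    }
  }
}

// Usage sketch: call before creating the offer.
// preferCodec(peerConnection, 'video/VP9');
```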

So what have we seen? We’ve seen that codecs are an essential part of an RTE solution. If we’re going to use real-time engagement, we will need codecs in order to send our voice and video. Codecs and their implementations are going to vary in the performance and the quality that they give us. We’ll need to choose the codecs carefully in order to fit the use case and scenario that we want, and we’ll need to find a good implementation that, again, will fit the scenario that we have. Thank you.