7.4 Updates, Upgrades, Security Patches

Watch time: 06:01

Category: Chapter 7: Maintenance

Understand the required ongoing maintenance and update procedures for an RTE platform.

Dive deeper: Read how Agora Secures Real-Time Experiences.

Transcript

We can’t talk about maintenance without reviewing updates, upgrades and security patches. What we are going to do in this lesson is understand the minimal ongoing maintenance effort that we will need to put into a real-time engagement platform. We’re going to go over each of the components and figure out what we need to do there. We’ll start on the client side.

We’ve talked about browsers already, but what happens with your native solutions? There, we need to update to the latest WebRTC libraries, assuming we use the WebRTC open-source code in our native platforms, and we probably do. If we’re not going to upgrade the WebRTC libraries themselves, we still need to deal with the security patches that come with that code.

My suggestion: Don’t upgrade automatically. That being said, don’t wait more than three to six months between upgrades of the WebRTC library. That means you need to plan for two to four upgrade cycles a year for the WebRTC libraries in your native clients. On top of that, you need to think about security patches: if there are security patches that need to be applied, you need to make time for those as well.
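The arithmetic behind that planning number can be written down as a minimal Python sketch. The function name `upgrades_per_year` is hypothetical and only illustrates the calculation; it is not part of any WebRTC tooling.

```python
import math


def upgrades_per_year(months_between_upgrades: int) -> int:
    """Number of upgrade cycles to plan for in a 12-month period."""
    if months_between_upgrades <= 0:
        raise ValueError("interval must be a positive number of months")
    # Round up so a partial cycle still counts as work to plan for.
    return math.ceil(12 / months_between_upgrades)


# The lesson's bounds: waiting six months or three months between upgrades.
for interval in (6, 3):
    print(f"every {interval} months -> {upgrades_per_year(interval)} upgrades a year")
```

Security patches come on top of this count, since they arrive on their own schedule.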

On the NAT traversal side, with STUN and TURN servers, the work is going to be mostly around security patches and the configuration changes that come with them. These are stateless machines, so they’re quite easy to upgrade. You can simply reboot one of them and run the next release, or even just take down the TURN server process and start it again. Be sure to follow the CVEs (Common Vulnerabilities and Exposures) for your TURN server software. CVEs are the official security announcements that come out. In recent years, we’ve seen more and more of these for TURN servers and their configuration. So follow them in order to understand when you need to patch security issues in your TURN servers.

Let’s move on to media servers. With media servers, what we’re looking for is to improve performance and optimize it over time. This is going to be our focus. The second thing that we need to do here is to keep up with interoperability with WebRTC in browsers, which is another headache. We said that every month there is going to be a new release of the Chrome browser, for example, so we might want to check interoperability with our media servers when these come out.

We need to have a process in place for how we’re going to upgrade or update the media servers in production. Here’s a suggestion: The way most companies do that, especially as they grow and become big, is a rolling upgrade. In the rolling upgrade diagram, each box is a media server or signaling server. We are going to introduce a new release, but we cannot just take down the whole system and bring it back up again, because we have users with running sessions. That also means two separate releases are running at the same time: the old release and the new release. We’ve got another marking for draining machines. Draining machines are machines that are running the old release, and we’re waiting for their current sessions to drain until we can load the new release on them and start them off as fresh machines. We’re going to do a rolling upgrade and each time drain some of the machines and replace them, instead of doing everything at the same time, because we cannot increase the number of machines in our back end to twice the capacity that we have. That would be just too much.
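One way to picture "some of the machines each time" is to split the fleet into fixed-size batches. The sketch below is a minimal illustration under assumptions: `plan_batches`, the `percent` parameter and the `media-N` machine names are hypothetical and do not come from any particular orchestration tool.

```python
from typing import List


def plan_batches(machines: List[str], percent: int) -> List[List[str]]:
    """Split the fleet into batches of roughly `percent` of the machines."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    # Integer ceiling division; always upgrade at least one machine per batch.
    batch_size = max(1, -(-len(machines) * percent // 100))
    return [machines[i:i + batch_size] for i in range(0, len(machines), batch_size)]


# Hypothetical fleet of 20 media servers, upgraded 10% at a time.
fleet = [f"media-{i}" for i in range(20)]
batches = plan_batches(fleet, 10)
print(len(batches), "batches, first batch:", batches[0])
```

Only one batch is out of the active pool at any moment, so the rest of the fleet keeps serving users and capacity never has to double.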

So how do we drain and upgrade a single machine? We first mark the machines that we want to upgrade. We’re not going to mark all of them at the same time. Let’s say we do it for 1% or 10% of the machines every time. Once we mark the machines, we drain them. Draining means that we’re not going to allow any new sessions to be allocated to these machines. Now we wait 10 minutes, half an hour, an hour, as long as it takes. In some cases, we’re going to have sessions that might last eight or 10 or 20 or more hours. In these cases, we’re not going to wait. What we will do is time out after a reasonable amount of time, let’s say an hour or two. I mark a machine for draining, and when it reaches zero sessions, I can upgrade it and put it back in the active pool as a new machine. But if it didn’t get drained within the timeout that I gave, like an hour, I’m going to just reboot that machine and relocate the sessions that were there to other machines.
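The drain-with-timeout logic for a single machine can be sketched in Python. This is a minimal simulation under assumptions, not a production implementation: the `Machine` class, `drain_and_upgrade`, and the `wait` and `relocate` callbacks are hypothetical names, and a real deployment would read session counts from the media server and move sessions through its own signaling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Machine:
    name: str
    release: str
    sessions: int            # active sessions currently hosted
    draining: bool = False   # True means: allocate no new sessions here


def drain_and_upgrade(
    machine: Machine,
    new_release: str,
    timeout_s: int,
    poll_s: int,
    wait: Callable[[Machine, int], None],
    relocate: Callable[[Machine], None],
) -> bool:
    """Drain one machine, then upgrade it. Returns True if the timeout was hit."""
    machine.draining = True  # the session allocator must skip this machine
    waited = 0
    while machine.sessions > 0 and waited < timeout_s:
        wait(machine, poll_s)  # in production: sleep, then re-read the count
        waited += poll_s
    forced = machine.sessions > 0
    if forced:
        relocate(machine)      # move leftover sessions to other machines
        machine.sessions = 0
    machine.release = new_release  # stands in for the upgrade and reboot
    machine.draining = False       # back in the active pool as a new machine
    return forced


# Simulated clock for the demo: one session ends per poll interval.
def one_session_ends(machine: Machine, seconds: int) -> None:
    machine.sessions = max(0, machine.sessions - 1)


server = Machine(name="media-1", release="old", sessions=3)
forced = drain_and_upgrade(server, "new", timeout_s=3600, poll_s=60,
                           wait=one_session_ends, relocate=lambda m: None)
print(server.release, "forced by timeout:", forced)
```

Passing `wait` and `relocate` in as callbacks keeps the sketch testable without real sleeping; the one-hour timeout mirrors the example value given in the lesson.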

Now that the machine is empty or we’ve reached the timeout, I’m going to upgrade the machine and bring it back. So this is going to be the process: Mark machines for draining, drain the machines, wait for a given amount of time, and then upgrade and reboot. We roll that upgrade across the machines, each time taking some of the machines out, until we have upgraded all of the service. Running such an upgrade might take us a day or more, and again, we need to decide on the tactics and the strategy of how we do that in our deployment.
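To see why a full rollout can take a day or more, here is a rough worst-case estimate in Python. It is a sketch under assumptions: batches are processed one after another, every batch waits out the full drain timeout, and the 15-minute upgrade time is an illustrative number, not one from the lesson. The function name `worst_case_hours` is hypothetical.

```python
import math


def worst_case_hours(percent_per_batch: int, drain_timeout_h: float,
                     upgrade_h: float) -> float:
    """Upper bound on rollout time when batches run one after another."""
    if not 0 < percent_per_batch <= 100:
        raise ValueError("percent_per_batch must be between 1 and 100")
    batches = math.ceil(100 / percent_per_batch)
    # Worst case: every batch hits the drain timeout before it is upgraded.
    return batches * (drain_timeout_h + upgrade_h)


# 10% of the fleet per batch, 2-hour drain timeout, assumed 15 minutes to upgrade.
print(worst_case_hours(10, 2.0, 0.25), "hours")
```

With these numbers the bound is 22.5 hours; draining 1% at a time with the same timeout would stretch it roughly tenfold, which is the trade-off to weigh when choosing the batch size and the timeout.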

To summarize, you need to think through the maintenance procedures that you have in real-time engagement. You need to deal with how and when to upgrade each one of the components. When you upgrade, you need to think about rolling upgrades for the infrastructure. Over time, what is going to happen is that you are going to migrate from simple upgrade solutions, like just rebooting the whole system, towards sophisticated rolling upgrade solutions. Thank you.