7.1 Monitoring and Optimization
Understand why it’s essential to continuously monitor and optimize your RTE infrastructure.
Dive deeper: Monitor your live, interactive streams in real time with Agora Analytics.
When you think about maintenance of a real-time engagement platform, the first thing you should think about is monitoring and optimization. You cannot improve or support something you cannot see and measure. Here’s what we’re going to do in this lesson: we’re going to understand real-time engagement dynamics, see how they affect what we need to think of in maintenance, and then plan for the changes ahead of us.
In most cases, when I see clients talking about their efforts in real-time engagement, this is what I see. They look at the timeframe of the project that they have, and the effort looks like this: they start, do a lot of work, and put a lot of effort and sweat into their product, and then they launch the project. Once they have launched, they think that they are done with the development of that part of the real-time engagement platform itself. From there, they reduce their effort over time. This is the worst thing that you can do. Why is that?
Because RTE is dynamic in nature. Your service is going to grow, and if it grows, you are going to deal with scaling and architecture issues that will need to be solved. Technologies evolve: new things get added to the requirements and to the features that we need to develop in an RTE platform. The requirements change as well, not only the technologies. And user behavior changes too: what users want and expect evolves over time.
So how do you keep up with all that? There are two things that you need to do. First, you need constant monitoring and maintenance of the RTE platform to understand what your users are really experiencing. Then you need to go through ongoing optimization and tweaking of the platform at all times. This needs to be in your DNA. Let’s start with constant monitoring and maintenance. What does that mean exactly?
If I had to list the things that you need to do, it starts with monitoring the RTE infrastructure actively. That means collecting and measuring server CPU, memory, I/O, network utilization, and application metrics. All of these you need to collect and monitor. You need to run predictable end-user workloads through the service in order to understand whether users are getting the service that they want and expect. Predictable end-user workloads means automating mock users in order to see that the service performs as expected. Then, on top of all that, you need to check for anomalies in the changes that occur over time.
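To make the "check for anomalies" step concrete, here is a minimal sketch in Python. It assumes you already collect a metric such as CPU utilization at regular intervals. The function name `find_anomalies`, the window size, and the threshold are illustrative assumptions, not part of any specific monitoring product. It flags a sample when it sits far outside the range of the samples just before it.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indexes of samples that deviate from the trailing window
    by more than `threshold` standard deviations.

    Hypothetical helper: window and threshold are assumptions to tune.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: flag any change at all.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up CPU utilization readings (percent), one per collection interval.
cpu_percent = [41, 43, 40, 42, 44, 42, 95, 43, 41]
print(find_anomalies(cpu_percent))  # [6]
```

The same check could be run on the results of the mock-user workloads, for example on call setup time, so that a slow regression gets noticed before real users report it.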
You should also passively monitor the user experience by collecting data and quality metrics from client devices to understand what users are experiencing at the application level. Then you can aggregate and analyze all of these metrics to understand user experience on a global scale, not just for one specific user. Again, as with the active monitoring, you need to check for anomalies here as well.
When it comes to ongoing optimization and tweaking, you first need to know where your bottlenecks are. You have developed the service; you’re running it in production. You should go to the DevOps person, who will probably be able to tell you what the breaking points of the infrastructure components are.
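Going back to the passive monitoring of client devices described above, aggregation could look roughly like the following sketch. The report fields (`region`, `rtt_ms`) and the function names are hypothetical. A real client would report whatever quality metrics your application collects. The idea is to look at percentiles across many sessions instead of at a single user.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_by_region(reports):
    """Group hypothetical client reports by region and summarize round-trip time."""
    by_region = defaultdict(list)
    for report in reports:
        by_region[report["region"]].append(report["rtt_ms"])
    return {
        region: {
            "sessions": len(rtts),
            "p50_rtt_ms": percentile(rtts, 50),
            "p95_rtt_ms": percentile(rtts, 95),
        }
        for region, rtts in by_region.items()
    }


# Made-up reports, one per client session.
reports = [
    {"region": "eu", "rtt_ms": 40},
    {"region": "eu", "rtt_ms": 50},
    {"region": "eu", "rtt_ms": 60},
    {"region": "eu", "rtt_ms": 300},
    {"region": "us", "rtt_ms": 80},
    {"region": "us", "rtt_ms": 90},
]
summary = summarize_by_region(reports)
print(summary["eu"])  # {'sessions': 4, 'p50_rtt_ms': 50, 'p95_rtt_ms': 300}
```

A high 95th percentile next to a healthy median, as in the made-up "eu" data, is the kind of signal a single user's report would never show.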
Can your media server scale beyond 500 participants? Maybe. Can the signaling server handle 10,000 users? Maybe. What are these numbers? What are the breaking points? Then, what are the breaking points of the whole service? How many servers do you have? Can you scale to twice that size without hitting bottlenecks in the algorithms that you’re using? You also need to check the limitations of client devices: what can and can’t you do on these devices? What happens when new devices hit the market, like the new iPhone? How do you support that? What are the new limitations that you have? Can you do more with that device?
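Once you have answers to those questions, the arithmetic is simple enough to sketch. The example below assumes each component has a known per-server breaking point and a known number of deployed servers. The 500 and 10,000 figures only echo the questions above and are not measured limits. The server counts and function names are made up for illustration.

```python
import math


def servers_needed(target_users, capacity_per_server):
    """Number of servers required to carry target_users at the given breaking point."""
    return math.ceil(target_users / capacity_per_server)


def find_bottleneck(target_users, components):
    """Return (name, utilization) of the most heavily loaded component.

    components maps a name to (capacity_per_server, servers_deployed).
    A utilization above 1.0 means the component breaks before the target load.
    """
    worst_name, worst_util = None, 0.0
    for name, (capacity, count) in components.items():
        utilization = target_users / (capacity * count)
        if utilization > worst_util:
            worst_name, worst_util = name, utilization
    return worst_name, worst_util


# Illustrative numbers only: breaking points and server counts are assumptions.
components = {
    "media": (500, 30),         # 500 participants per media server, 30 servers
    "signaling": (10_000, 2),   # 10,000 users per signaling server, 2 servers
}
name, utilization = find_bottleneck(20_000, components)
print(name, round(utilization, 2))    # media 1.33
print(servers_needed(20_000, 500))    # 40
```

With these made-up numbers, doubling to 20,000 users would break the media tier first, which would need 40 servers instead of 30. That is only the capacity side. It says nothing about algorithmic bottlenecks, which still have to be tested.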
Now that you know the bottlenecks and limitations, you need to have some kind of ongoing improvement plan. You need to think about the technical debt that you have in your code. Everyone has it, in all companies: things that we’ve left behind that can be improved, optimized, restructured, re-architected. You need to know the technical debt in your RTE platform. You need to prioritize it against the bottlenecks that you have and work over time toward correcting it. You’re not going to do everything on day one. This is going to be an ongoing task, especially if your service is successful, so plan for that.
As an added bonus, if you have people who are doing these ongoing optimizations and who continue to put effort into your platform, you will also have experienced engineers to handle roadmap requirements and changes. Hiring them later on is going to take more effort than keeping them on to deal with technical debt and the ongoing maintenance and improvements.
So just remember: what you don’t measure, you don’t know. Once you measure something, you need to improve on it, because you know what is wrong. And then as you improve, you’re going to maintain and grow the core competencies of your company, and you will also be able to add to and enhance the feature set, because you have skilled engineers to do that. Thank you.