Detecting and Extracting Objects From Your Video Call Using Agora

By Author: Agora Superstar In Developer

This blog was written by Meherdeep Thakur, an Agora Superstar. The Agora Superstar program empowers developers around the world to share their passion and technical expertise, and create innovative real-time communications apps and projects using Agora’s customizable SDKs. Think you’ve got what it takes to be an Agora Superstar? Apply here.

Object detection is a building block for many deep learning applications, and it opens up a wide range of use cases. Object detectors run on webcams, surveillance cameras, dashboard cameras, mobile phones, and more. Its main applications are object classification and tracking.

One of the most prominent uses of object detection is in football (soccer), where the positions of the players and the ball are monitored. VAR technology relies on such detections to support decisive calls on penalties, handballs, and goal-line clearances, or simply to track the offside line.

Object Detection in football

In object detection, each detected object is enclosed in a rectangular bounding box that marks its location and carries its label. How closely a predicted box matches the real (ground-truth) box is measured by Intersection over Union (IoU): the area where the two boxes overlap divided by the total area they cover together. IoU ranges from 0 (no overlap) to 1 (a perfect match) and depends on factors such as the feature extraction method, sliding window size, and video quality. An IoU above 0.5 is commonly treated as a good detection, although this threshold is adjusted depending on how critical the situation is.
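As a rough illustration of the IoU measure described above, here is a minimal sketch in plain Python. The `(x1, y1, x2, y2)` corner format, the function name `iou`, and the sample coordinates are assumptions made for this example and are not taken from the tutorial's code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = both areas added together, minus the part counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Hypothetical boxes: a prediction shifted 10 pixels from the real box.
predicted = (50, 50, 150, 150)
ground_truth = (60, 60, 160, 160)

score = iou(predicted, ground_truth)
is_good_detection = score > 0.5  # the 0.5 threshold mentioned in the text
print(round(score, 3), is_good_detection)
```

A stricter application could raise the threshold above 0.5, at the cost of rejecting more detections.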

YOLO — You Only Look Once

YOLO is an extremely fast real-time algorithm for detecting multiple objects. It divides a frame into a grid, and processing the grid cells yields bounding boxes over the objects along with their probabilities. These confidence values are tied to IoU: in the original YOLO formulation, a box’s confidence is the probability that it contains an object multiplied by the IoU between the predicted box and the real one.
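To make the grid idea more concrete, here is a small sketch of two pieces of it: finding which grid cell an object's center falls into, and combining probabilities into a score. The 7×7 grid and the confidence formula (object probability times IoU) follow the original YOLO formulation as I understand it; the function names, frame size, and numbers are hypothetical and only illustrate the arithmetic, not a real network.

```python
def grid_cell(center_x, center_y, frame_w, frame_h, grid_size=7):
    """Return the (row, col) of the grid cell that contains an object's center."""
    # Scale the pixel position to grid units; clamp so the far edge stays in the last cell.
    col = min(int(center_x / frame_w * grid_size), grid_size - 1)
    row = min(int(center_y / frame_h * grid_size), grid_size - 1)
    return row, col


def box_confidence(p_object, iou_with_truth):
    """Box confidence: probability of an object times IoU with the real box."""
    return p_object * iou_with_truth


def class_score(p_class_given_object, confidence):
    """Class-specific score for a box: class probability times box confidence."""
    return p_class_given_object * confidence


# Hypothetical 640x480 frame with an object centered in the middle.
cell = grid_cell(320, 240, frame_w=640, frame_h=480)

confidence = box_confidence(p_object=0.9, iou_with_truth=0.8)
person_score = class_score(p_class_given_object=0.95, confidence=confidence)

# Keep the box only if its score clears an assumed threshold.
keep = person_score >= 0.5
print(cell, keep)
```

The cell that contains the center of an object is the one responsible for predicting that object, which is why the mapping from pixel coordinates to a cell matters.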

Let’s Get Started

Now I will show you how to set up your own video call interface on Agora and how to run an object detection model on top of it.

First, you will need a developer account on Agora. Once you have signed up, you will be redirected to the dashboard. Navigate to the Project Management tab and create a new project. Give your project a suitable name; this generates an App ID, which you will use to connect your project to Agora’s video call.

Project Structure

│   chromedriver.exe
│   image.jpg
│   imagenew.jpg
│   requirements.txt
│   resnet50_coco_best_v2.0.1.h5
│   test.png
│   test_output.png
│   │   AgoraRTCSDK-2.6.1.js
│   │   index.html
│   │
│   └───.vscode
│           launch.json
│       bus-15.jpg
│       car-12.jpg
│       car-13.jpg
│       car-14.jpg
│       car-16.jpg
│       car-17.jpg
│       car-18.jpg
│       car-19.jpg
│       car-20.jpg
│       car-21.jpg
│       car-22.jpg
│       car-23.jpg
│       car-24.jpg
│       car-25.jpg
│       car-26.jpg
│       car-28.jpg
│       car-29.jpg
│       car-30.jpg
│       car-31.jpg
│       car-33.jpg
│       car-34.jpg
│       car-35.jpg
│       car-36.jpg
│       car-37.jpg
│       car-38.jpg
│       motorcycle-32.jpg
│       motorcycle-4.jpg
│       person-1.jpg
│       person-10.jpg
│       person-11.jpg
│       person-2.jpg
│       person-3.jpg
│       person-5.jpg
│       person-6.jpg
│       person-7.jpg
│       person-8.jpg
│       person-9.jpg
│       truck-27.jpg

The code uses the Agora community SDK to integrate the video call and pull frames out of it. In the second line, replace the app ID field with the App ID you generated. Similarly, add the path to the Chromium driver and give your channel a suitable name. The call that takes “filename” as its argument extracts the image from the call and saves it in your local directory (here, I have used test.png).

The two major functions of this code are described below.

The first function sets up a video call with the relevant App ID and channel name. It also uses the Chromium driver, which lets the script open and navigate web pages when it runs.

The second major function processes the frames from the video call for object detection. For this I will be using a model trained on the COCO dataset, which has over 1.5 million object instances. This volume of training data helps the model detect objects more accurately, giving IoU values between 0.5 and 1.

The object detection model extracts all the detected objects from the image and saves them in a local directory named test_output.png-objects.
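The file names in the project structure (person-1.jpg, motorcycle-4.jpg, car-12.jpg, and so on) suggest that each extracted object is saved as its class label followed by a running index. The sketch below reproduces that naming pattern and counts the extracted objects per class. It is inferred from the listing only: both helpers are hypothetical, they are not part of the detection library, and they do not write any files.

```python
import os
from collections import Counter


def extracted_object_paths(output_image_path, labels):
    """Build the expected path of each extracted object image.

    Assumes the folder is '<output image>-objects' and files are named
    '<label>-<running index>.jpg', as seen in the project structure.
    """
    folder = output_image_path + "-objects"
    return [
        os.path.join(folder, f"{label}-{index}.jpg")
        for index, label in enumerate(labels, start=1)
    ]


def count_by_class(paths):
    """Count extracted images per class by parsing the label out of each file name."""
    labels = [os.path.basename(p).rsplit("-", 1)[0] for p in paths]
    return Counter(labels)


# Hypothetical detection order for the first four objects in a frame.
detected_labels = ["person", "person", "person", "motorcycle"]

paths = extracted_object_paths("test_output.png", detected_labels)
counts = count_by_class(paths)
print(paths)
print(dict(counts))
```

Counting by class this way gives a quick summary of what was found in a frame without opening the individual images.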

Here is a reference example:


After this image is processed with YOLO (trained on the COCO dataset), we get the extracted objects along with their classifications.


And as I mentioned, each object is extracted as a separate image in a new folder.

extracted image

There it is! We have successfully built object detection on top of Agora’s video call. You can find the code for this tutorial on my GitHub.