
Build a Sign Language Recognition App Using the Agora Video SDK

By: Agora Superstar, in Developer

This blog was written by Shriya Ramakrishnan, an Agora Superstar. The Agora Superstar program empowers developers around the world to share their passion and technical expertise, and create innovative real-time communications apps and projects using Agora’s customizable SDKs. Think you’ve got what it takes to be an Agora Superstar? Apply here.

Did you know that at least 10% of the U.S. population lives with some degree of hearing loss, from mild to total deafness? To adapt to this, American Sign Language (ASL) is now used by around 1 million people to help communicate.

With the growing amount of video-based content and real-time audio/video media platforms, hearing impaired users have an ongoing struggle to enjoy positive and productive interactions online.

Thankfully, the Agora SDKs are just as applicable to this use case as any other, and you don’t have to worry that a significant portion of your audience will be left “in the dark,” as it were.

The following is a demo application that recognizes American Sign Language, making the platform usable by and for deaf and nonverbal users.

In this application, the user signs ASL gestures into the webcam. After every correct letter prediction, press the key ‘N’/‘n’. To add a space, press ‘M’/‘m’. To clear the sentence, press ‘C’/‘c’; and to delete the last written letter, press ‘D’/‘d’.

I will be walking you through the step-by-step process of developing a demo application to achieve the above with the use of Tensorflow, Python, and Agora.

For anyone unfamiliar with Agora, it is a real-time engagement provider delivering voice, video, and live streaming on a global scale for mobile, native and desktop apps.

We’ll be using Agora’s live interactive video streaming for the live video input.

So let’s get started!

Step 1: Make the Connection

We need to set up a video call. To do so, sign up on Agora here. After signing up, log in and head to the ‘Project Management’ tab. Create a new project with a suitable name of your choice. Copy the App ID to your clipboard and paste it somewhere you will be able to access it later while developing the code.

Step 2: Acquire the Code

For the complete code, click on the Google Drive link here.

Step 3: Comprehend the Code

To begin with, we import all the necessary packages: ‘pyttsx3,’ ‘agora_community_sdk,’ and a helper Python file saved in the same directory, which is used to convert an image to grayscale and into an array. That helper in turn imports another Python file, which defines all the characters that can be detected and loads the trained machine-learning model.

Throughout this code, we use the functions of the AgoraRTC library from the agora_community_sdk package to connect to the video call from our remote terminal over the internet, using the Chromium driver and the Agora App ID you created. Enter your App ID, the channel name, and the path to the Chromium driver executable file.
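A minimal sketch of this connection step, assuming the AgoraRTC watcher API seen in Agora community examples (the method names create_watcher, join_channel, get_users, and the .frame attribute are taken from those examples; verify them against the package version you install):

```python
def capture_first_frame(app_id, channel_name, chromedriver_path, out_path="frame.png"):
    """Join the channel through a headless browser and save the first remote frame.

    Method names follow Agora community examples and may differ between versions.
    """
    from agora_community_sdk import AgoraRTC  # imported lazily; needs the package + Chromium driver
    client = AgoraRTC.create_watcher(app_id, chromedriver_path)
    client.join_channel(channel_name)
    users = client.get_users()            # remote users currently publishing
    frame_bytes = users[0].frame          # first frame from the first user
    with open(out_path, "wb") as f:
        f.write(frame_bytes)
    client.unwatch()                      # release the headless browser
    return out_path
```

Calling `capture_first_frame(your_app_id, your_channel, "chromedriver")` would then produce the saved frame used in the next step.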

The first frame from the live video will be extracted and saved.

In this part of the code, we are initializing the text to speech engine and setting the rate of speech along with choosing the voice we wish to use.
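As a sketch of that initialization with pyttsx3 (the rate and voice index below are illustrative choices, not the original values):

```python
def make_speech_engine(rate=150, voice_index=0):
    """Initialize a pyttsx3 text-to-speech engine, set the rate, and pick a voice."""
    import pyttsx3                      # imported lazily; requires an OS speech backend
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)    # speaking rate in words per minute
    voices = engine.getProperty("voices")
    engine.setProperty("voice", voices[voice_index].id)
    return engine
```

Once built, the engine speaks a sentence with `engine.say(sentence)` followed by `engine.runAndWait()`.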

The function withoutSkinSegment() performs the main work. It starts by setting the name of the window where we will see the output of the gesture recognition, and by initializing the dimensions frame_width, frame_height, roi_height, and roi_width, where ROI stands for the Region of Interest within which the gestures are to be made.
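The dimension values themselves are not shown in this excerpt; as a rough sketch (the numbers below are illustrative assumptions, not the original values):

```python
# Illustrative values only; the original script defines its own dimensions.
frame_width, frame_height = 640, 480   # size of each captured frame
roi_width, roi_height = 200, 200       # size of the Region of Interest box
```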

cv2.VideoCapture() is being used for accessing the frames that are being taken from Agora and that are being saved locally as a video file consisting of multiple frames.

cv2.namedWindow(window_name, cv2.WND_PROP_FULLSCREEN) is being used to display the output on full-screen mode.

x_start, y_start = 100, 100

are the initialized coordinates for the top-left corner of the bounding box that will be drawn subsequently; its endpoint is computed from the ROI dimensions.

sentence = ""

while True:
    ret, frame = cap.read()  # cap is the cv2.VideoCapture object created earlier
    if not ret:
        print("No Frame Captured")
        continue


The above code prints the message “No Frame Captured” until a frame is successfully captured.

cv2.rectangle(frame, (x_start, y_start), (x_start + roi_width, y_start + roi_height), (255, 0, 0), 3)

The above line draws a rectangle around the region of interest (ROI). Note that OpenCV uses BGR channel order, so the color (255, 0, 0) is blue.

img1 = frame[y_start: y_start + roi_height, x_start: x_start + roi_width]

img_ycrcb = cv2.cvtColor(img1, cv2.COLOR_BGR2YCR_CB)

blur = cv2.GaussianBlur(img_ycrcb, (11, 11), 0)

The ROI is being cropped out and is undergoing the following transformations for image preprocessing:

  1. Change the color space of the image from BGR (Blue, Green, Red, OpenCV’s default channel order) to the YCrCb color space using cv2.cvtColor().

The YCrCb color space is derived from the RGB color space and has the following three components

Y = Luminance or Luma component obtained from RGB after gamma correction.

Cr = R - Y (how far the red component is from Luma)

Cb = B - Y (how far the blue component is from Luma).

You can read more about it here.

  2. The image is denoised and any rough edges in the image are smoothed using cv2.GaussianBlur().

skin_ycrcb_min = np.array((0, 138, 67))

skin_ycrcb_max = np.array((255, 173, 133))

The above two lines of code are used to set the upper and lower limits of the range of skin color.

mask = cv2.inRange(blur, skin_ycrcb_min, skin_ycrcb_max)

naya = cv2.bitwise_and(img1, img1, mask=mask)

cv2.imshow("mask", mask)

cv2.imshow("naya", naya)

cv2.inRange() is used to detect the hand in the Region of interest using the skin color range we just set.

cv2.dilate() function is used to fill up the gaps within the detected hand where the skin-colored pixels aren’t being detected, by using an approximation of the pixels around the void.

naya = cv2.bitwise_and(img1, img1, mask=mask)

In this line of code, we are displaying only that part of the image within the ROI which was detected as skin using the cv2.inRange() operation.

c = cv2.waitKey(1) & 0xff reads any key pressed during the current frame; the resulting key code is stored in the variable c and used to trigger the actions described below.

Following this, the mask, as well as the masked region of the ROI (initialized as “naya” here), is being displayed on the screen.

Next, if a gesture closest to one of the gestures from the trained model is detected, the value of that gesture is drawn onto the frame using cv2.putText().

And pressing the keys ‘N’/‘n’, ‘M’/‘m’, ‘C’/‘c’, or ‘D’/‘d’ brings about the following changes respectively, as given in the introduction of this article:

  1. Add detected letter to sentence
  2. Add space to sentence
  3. Clear sentence
  4. Delete last detected letter
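The four key actions above can be sketched as a small helper (handle_key is a hypothetical name; in the original script this logic lives inline in the main loop):

```python
def handle_key(c, sentence, letter):
    """Apply the N/M/C/D key actions and return the updated sentence.

    `c` is the value of cv2.waitKey(1) & 0xff; `letter` is the last prediction.
    """
    if c in (ord("n"), ord("N")):      # confirm the detected letter
        sentence += letter
    elif c in (ord("m"), ord("M")):    # add a space
        sentence += " "
    elif c in (ord("c"), ord("C")):    # clear the sentence
        sentence = ""
    elif c in (ord("d"), ord("D")):    # delete the last character
        sentence = sentence[:-1]
    return sentence
```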

And finally, if the Escape key (Esc) is pressed, the display window is closed.

There you have it! You’ve now opened up your apps to a more diverse, inclusive user audience and offered far more value through new levels of interactivity.