How To: Build an Augmented Reality Remote Assistance App

Have you ever been on the phone with customer support and struggled to describe the issue, or had the support person fail to clearly describe the solution, leaving you unsure what or where you should be looking?

Most remote assistance today is done through audio or text-based chat. These solutions can be frustrating for users who may have a hard time describing their issues or understanding the new concepts and terminology that come with troubleshooting.

Thankfully technology has reached a point where this issue can be easily solved using Video Chat and Augmented Reality. In this guide, we’ll walk through all the steps you need to build an iOS app that leverages ARKit and video chat to create an interactive experience.


Prerequisites

  • A basic to intermediate understanding of Swift and the iOS SDK
  • Basic understanding of ARKit and Augmented Reality concepts
  • Agora Developer Account
  • CocoaPods
  • Hardware: a Mac with Xcode and 2 iOS devices
    - iPhone: 6S or newer
    - iPad: 5th Generation or newer

Please Note: While no Swift/iOS knowledge is needed to follow along, certain basic concepts in Swift/ARKit won’t be explained along the way.


The app we are going to build is meant to be used by two users who are in separate physical locations. One user will input a channel name and CREATE the channel. This will launch a back-facing AR-enabled camera. The second user will input the same channel name as the first user and JOIN the channel.

Once both users are in the channel, the user that created the channel will broadcast their rear camera into the channel. The second user can draw on their local screen and have the touch input displayed in augmented reality in the first user’s world.

Let’s take a moment to review all the steps that we’ll be going through:

  1. Download and build starter project
  2. Project structure overview
  3. Add video chat functionality
  4. Capture and normalize touch data
  5. Add data transmission
  6. Display touch data in augmented reality
  7. Add “Undo” functionality

Getting Started with the Starter Project

I have created a starter project for this tutorial that includes the initial UI elements and buttons, including the bare-bones AR and remote user views.

Let’s start by downloading the starter project repo. Once all the files have finished downloading, open a Terminal window in the project’s directory and run pod install to install all dependencies. Once the dependencies have finished installing, open AR Remote Support.xcworkspace in Xcode.

Once the project is open in Xcode, let’s build and run the project using the iOS simulator. The project should build and launch without issue.

Enter a channel name, then tap the Join and Create buttons to preview the UIs that we will be working with.

Project Structure Overview

Before we start coding, let’s walk through the starter project files to understand how everything is set up. We’ll start with the dependencies, then go over the required files, and lastly we’ll take a look at the custom classes that we’ll be working with.

Within the Podfile, there are two third-party dependencies: Agora’s Real-Time Communications SDK, which facilitates building video chat functionality, and ARVideoKit’s open-source renderer, which facilitates using the rendered AR view as a video source. We need an off-screen renderer because ARKit obfuscates the rendered view, so we need a framework to handle the task of exposing the rendered pixel buffer.

As we move into the project files, AppDelegate.swift has the standard setup with one minor update. The ARVideoKit library is imported and there’s an added delegate function that returns ARVideoKit’s orientation as the UIInterfaceOrientationMask. Within the Info.plist, the required permissions for Camera and Microphone access are included. These permissions are required by ARKit, Agora, and ARVideoKit.

Before we jump into the custom ViewControllers, let’s take a look at some of the supporting files/classes that we’ll be using. GetValueFromFile.swift allows us to store any sensitive API credentials in keys.plist so we don’t have to hard-code them into the classes. SCNVector3+Extensions.swift contains some extensions and helper functions for the SCNVector3 type that make mathematical calculations simpler. The last helper file is ARVideoSource.swift, which contains the proper implementation of the AgoraVideoSourceProtocol, which we’ll use to pass our rendered AR scene as the video source for one of the users in the video chat.

The ViewController.swift is a simple entry point for the app. It allows users to input a Channel Name and then choose whether they want to: CREATE the channel and receive remote assistance; JOIN the channel and provide remote assistance.

The ARSupportBroadcasterViewController.swift handles the functionality for the user who is receiving remote assistance. This ViewController will broadcast the rendered AR scene to the other user, so it implements the ARSCNViewDelegate, ARSessionDelegate, RenderARDelegate, and AgoraRtcEngineDelegate.

The ARSupportAudienceViewController.swift handles the functionality for the user who is providing remote assistance. This ViewController will broadcast the user’s front-facing camera and will allow the user to draw on their screen and have the touch information displayed in the remote user’s augmented reality scene, so it implements the UIGestureRecognizerDelegate and AgoraRtcEngineDelegate.

For simplicity, let’s refer to ARSupportBroadcasterViewController as BroadcasterVC and ARSupportAudienceViewController as AudienceVC.

Adding Video Chat Functionality

We’ll start by adding our AppID into the keys.plist file. Take a moment to log into your Agora Developer Account, copy your App ID and paste the hex into the value for AppID within keys.plist.

An example of the keys.plist file with an Agora AppID

Now that we have our AppID set, we will use it to initialize the Agora Engine within the loadView function for both BroadcasterVC and AudienceVC.

There are slight differences in how we set up the video configurations. In the BroadcasterVC we are using an external video source so we can set up the video configuration and the source within the loadView.

the loadView function within ARSupportBroadcasterViewController

Within the AudienceVC we will initialize the engine and set the Channel Profile in loadView, but we will wait to configure the video settings until viewDidLoad.

the loadView function within ARSupportAudienceViewController

Note: We’ll add in the touch gestures functionality later on in this tutorial.

Let’s also set up the video configuration within the AudienceVC. Within the viewDidLoad call the setupLocalVideo function.

override func viewDidLoad() {
    super.viewDidLoad()
    // Agora implementation
    setupLocalVideo() // Agora - set video configuration
    // Agora - join the channel (added in the next step)
}

Add the code below to the setupLocalVideo function.

Next we’ll join the channel from viewDidLoad. Both ViewControllers use the same function to join the channel. In both BroadcasterVC and AudienceVC, call the joinChannel function within viewDidLoad.

override func viewDidLoad() {
    super.viewDidLoad()
    joinChannel() // Agora - join the channel
}

Add the code below to the joinChannel function.

The joinChannel function will set the device to use the speakerphone for audio playback, and join the channel set by the ViewController.swift.

Note: This function will attempt to get the token value stored in keys.plist. This line is there in case you would like to use a temporary token from the Agora Console. For simplicity I have chosen not to use token security, so we have not set the value. In this case the lookup will return nil, and the Agora engine will not use token-based security for this channel.

Now that users can join a channel, we should add functionality to leave the channel. Similar to joinChannel, both ViewControllers use the same function to leave the channel. In both BroadcasterVC and AudienceVC, add the code below to the leaveChannel function.

The leaveChannel function will get called in popView and viewWillDisappear because we want to make sure we leave the channel whenever the user taps to exit the view or dismisses the app (backgrounded/exit).

The last Video Chat feature we need to implement is the toggleMic function, which gets called anytime the user taps the microphone button. Both BroadcasterVC and AudienceVC use the same function, so add the code below to the toggleMic function.

Handling Touch Gestures

In our app, the user in the AudienceVC will provide remote assistance by using their finger to draw on their screen. Within the AudienceVC we’ll need to capture and handle the user’s touches.

First we’ll want to capture the location whenever the user initially touches the screen and set that point as the starting point. As the user drags their finger across the screen, we’ll want to keep track of all those points, so we’ll append each one to the touchPoints array. That means we need to start with an empty array on every new touch. I prefer to reset the array in touchesBegan to guard against instances where the user adds a second finger to the screen.

Note: This example will only support drawing with a single finger. It is possible to support multi-touch drawing, but it would require some more effort to track the uniqueness of each touch event.

To handle the finger movement, let’s use a Pan Gesture. Within this gesture we’ll listen for the began, changed, and ended states. Let’s start by registering the Pan Gesture.

Once the Pan Gesture is recognized, we’ll calculate the position of the touch within the view. The GestureRecognizer gives us the touch positions as values relative to the gesture’s initial touch, which means that the translation reported in the .began state is (0,0). Adding self.touchStart to the translation gives us the x,y values relative to the view’s coordinate system.

Once we’ve calculated the pixelTranslation (x,y values relative to the view’s coordinate system), we can use these values to draw the points to the screen and to “normalize” the points relative to the screen’s center point.
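Here is a minimal sketch of that arithmetic, written against plain CGPoint values so it runs without UIKit. The function name and the sample values are hypothetical; in the app the translation would come from the pan gesture recognizer’s translation(in:) and the start point from self.touchStart.

```swift
import Foundation

/// Converts a pan gesture's translation (relative to the gesture's first touch)
/// into a point in the view's coordinate system.
/// `touchStart` is the location captured when the touch began.
func pixelTranslation(touchStart: CGPoint, translation: CGPoint) -> CGPoint {
    return CGPoint(x: touchStart.x + translation.x,
                   y: touchStart.y + translation.y)
}

// At .began the translation is (0, 0), so the result is the start point itself.
let start = CGPoint(x: 120, y: 300)
let atBegan = pixelTranslation(touchStart: start, translation: .zero)

// After dragging 15 pt right and 40 pt up (UIKit's y axis grows downward).
let afterDrag = pixelTranslation(touchStart: start,
                                 translation: CGPoint(x: 15, y: -40))
```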

I’ll discuss normalizing the touches in a moment, but first let’s go through drawing the touches to the screen. Since we are drawing to the screen we’ll want to use the Main thread. So within a Dispatch block we’ll use the pixelTranslation to draw the points into the DrawingView. For now don’t worry about removing the points because we’ll handle that when we transmit the points.

Before we can transmit the user’s touches we need to normalize each point relative to the screen’s center. UIKit calculates with (0,0) being the upper left-hand corner of the view, but within ARKit we’ll need to add the points relative to the ARCamera’s center point. To achieve this we’ll calculate the translationFromCenter by taking the pixelTranslation and subtracting half of the view’s width and height.

visualization of differences between UIKit and ARCamera coordinate systems
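The normalization step can be sketched as a small pure function. This is an illustration under assumptions, not the project’s exact code: the function signature and the 400 x 800 view size are made up for the example, and the y axis is left pointing downward as in UIKit, so the receiving side may still need to flip it.

```swift
import Foundation

/// Re-expresses a point from UIKit's top-left origin as an offset from the view's center.
func translationFromCenter(_ point: CGPoint, in viewSize: CGSize) -> CGPoint {
    return CGPoint(x: point.x - viewSize.width / 2,
                   y: point.y - viewSize.height / 2)
}

let viewSize = CGSize(width: 400, height: 800)

// The view's center maps to (0, 0).
let centerOffset = translationFromCenter(CGPoint(x: 200, y: 400), in: viewSize)

// The top-left corner maps to minus half the width and minus half the height.
let cornerOffset = translationFromCenter(.zero, in: viewSize)
```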

Transmitting Touches and Colors

To add an interactive layer, we’ll use the DataStream provided as part of the Agora engine. Agora’s Video SDK can create a data stream capable of sending up to 30 packets per second, each up to 1 KB. Since we will be sending small data messages this will work well for us.

Let’s start by enabling the DataStream within the firstRemoteVideoDecoded. We’ll do this in both BroadcasterVC and AudienceVC.

If the data stream is enabled successfully, self.streamIsEnabled will have a value of 0. We’ll check this value before attempting to send any messages.

Now that the DataStream is enabled, we’ll start with AudienceVC. Let’s review what data we need to send: touch-start, touch-end, the points, and color. Starting with the touch events, we’ll update the PanGesture to send the appropriate messages.

Note: Agora’s Video SDK DataStream uses raw data, so we need to convert all messages to Strings and then use the String’s data(using:) method to get the raw bytes to pass along.
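As a quick illustration of that conversion, the sketch below round-trips a message through Data using only Foundation. The message text is just an example value; how the resulting Data is handed to the DataStream is not shown here.

```swift
import Foundation

let message = "touch-start"

// Encode the String as UTF-8 bytes. data(using:) returns an Optional;
// UTF-8 encoding of a Swift String is not expected to fail, so it is unwrapped here.
let payload = message.data(using: .utf8)!

// On the receiving side, turn the bytes back into a String.
let decoded = String(data: payload, encoding: .utf8)
```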

ARKit runs at 60 fps, so sending the points individually would cause us to hit the 30-packet limit, resulting in point data not getting sent. Instead, we’ll add the points to the dataPointsArray and transmit them every 10 points. Each touch point is about 30–50 bytes, so a batch of ten is roughly 300–500 bytes sent about 6 times per second, which stays well within the limits of the DataStream.
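One way to sketch the batching idea is a small buffer type that hands back a batch once ten points have accumulated. PointBatcher and its method names are hypothetical and not part of the starter project; in the app the returned batch would be converted to a message and sent over the DataStream.

```swift
import Foundation

/// Collects touch points and hands back a batch once `batchSize` points have accumulated.
struct PointBatcher {
    let batchSize: Int
    private(set) var pending: [CGPoint] = []

    init(batchSize: Int = 10) {
        self.batchSize = batchSize
    }

    /// Adds a point; returns a full batch to transmit, or nil if the buffer is still filling.
    mutating func add(_ point: CGPoint) -> [CGPoint]? {
        pending.append(point)
        guard pending.count >= batchSize else { return nil }
        return flush()
    }

    /// Returns whatever is buffered (for example at touch-end) and empties the buffer.
    mutating func flush() -> [CGPoint] {
        defer { pending.removeAll() }
        return pending
    }
}

// Simulate 25 touch updates: two full batches go out, five points remain buffered.
var batcher = PointBatcher()
var batchesSent = 0
for i in 0..<25 {
    if batcher.add(CGPoint(x: CGFloat(i), y: CGFloat(i))) != nil {
        batchesSent += 1
    }
}
let leftover = batcher.flush()
```

Flushing at the end of the gesture makes sure the last few points of a stroke are not left unsent.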

When sending the touch data, we can also clear the DrawingView. To keep it simple we can get the DrawingView’s sublayers, loop through them, and remove each one from its superlayer.

Lastly, we need to add support for changing the color of the lines. We’ll use cgColor.components to get the color value and send it as a comma-delimited string. We’ll prefix the message with color: so that we don’t confuse it with touch data.
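A minimal sketch of building such a message follows. It takes the components as a plain array so it runs without UIKit; the function name and the exact number formatting are assumptions, and in the app the array would come from the selected color’s cgColor.components.

```swift
import Foundation

/// Builds a "color:"-prefixed, comma-delimited message from color components
/// (for example the RGBA values of the selected line color).
func colorMessage(from components: [CGFloat]) -> String {
    let joined = components.map { String(Double($0)) }.joined(separator: ",")
    return "color:" + joined
}

// Opaque red as RGBA components.
let redMessage = colorMessage(from: [1, 0, 0, 1])
```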

Now that we’re able to send data from the AudienceVC, let’s add the ability for BroadcasterVC to receive and decode the data. We’ll use the rtcEngine delegate’s receiveStreamMessage function to handle all data that is received from the DataStream.

There are a few different cases that we need to account for, so we’ll use a switch statement to check the message and handle it appropriately.
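The dispatch could look roughly like the sketch below. The StreamMessage enum and parse function are hypothetical names, the literal strings follow the message names used in this tutorial, and treating every unrecognized message as point data is an assumption made for the example.

```swift
import Foundation

/// The kinds of messages described in this tutorial.
enum StreamMessage: Equatable {
    case touchStart
    case touchEnd
    case undo
    case color(String)   // payload after the "color:" prefix
    case points(String)  // anything else is treated as touch-point data
}

/// Decodes raw DataStream bytes and classifies the message.
func parse(_ data: Data) -> StreamMessage? {
    guard let text = String(data: data, encoding: .utf8) else { return nil }
    switch text {
    case "touch-start":
        return .touchStart
    case "touch-end":
        return .touchEnd
    case "undo":
        return .undo
    case let colorText where colorText.hasPrefix("color:"):
        return .color(String(colorText.dropFirst("color:".count)))
    default:
        return .points(text)
    }
}

let parsedUndo = parse(Data("undo".utf8))
let parsedColor = parse(Data("color:1.0,0.0,0.0,1.0".utf8))
```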

When we receive the message to change the color, we need to isolate the component values, so we need to remove any excess characters from the string. Then we can use the components to initialize the UIColor.
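Here is one hedged way to do that clean-up. The function name is hypothetical, and the sketch assumes four RGBA components written as plain decimals (no exponents or negative values); a grayscale color with only two components would be rejected. In the app, the returned values could be passed to UIColor(red:green:blue:alpha:).

```swift
import Foundation

/// Extracts numeric components from a color message,
/// ignoring the prefix and any stray characters such as brackets or spaces.
func colorComponents(from message: String) -> [CGFloat]? {
    guard message.hasPrefix("color:") else { return nil }
    // Keep only digits, decimal points, and commas.
    let allowed = Set("0123456789.,")
    let cleaned = String(message.dropFirst("color:".count).filter { allowed.contains($0) })
    let values = cleaned.split(separator: ",").compactMap { Double($0) }
    // Assumption: exactly four components (red, green, blue, alpha).
    guard values.count == 4 else { return nil }
    return values.map { CGFloat($0) }
}

let rgba = colorComponents(from: "color: [0.0, 0.5, 1.0, 1.0]")
```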

In the next section we’ll go through handling the touch-start message and adding the touch points into the ARSCNView.

Display Gestures in Augmented Reality

Upon receiving the message that a touch has started, we’ll want to add a new node to the scene and then parent all the touches to this node. We do this to group all the touch points and force them to always rotate to face the ARCamera.

Note: We need to impose the LookAt constraint so that the drawn points always face the camera, and therefore the user.

When we receive touch-points we’ll need to decode the String into an Array of CGPoints that we can then append to the self.remotePoints array.
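The decoding step depends on how the points were serialized, which is not spelled out here. The sketch below assumes a hypothetical wire format of "x,y" pairs separated by semicolons; the function name is also made up, and the real project may format the string differently.

```swift
import Foundation

/// Decodes "x,y;x,y;..." into points, skipping any malformed pairs.
/// The format itself is an assumption made for this example.
func decodePoints(_ message: String) -> [CGPoint] {
    return message.split(separator: ";").compactMap { pair -> CGPoint? in
        let parts = pair.split(separator: ",")
        guard parts.count == 2,
              let x = Double(parts[0].trimmingCharacters(in: .whitespaces)),
              let y = Double(parts[1].trimmingCharacters(in: .whitespaces)) else {
            return nil
        }
        return CGPoint(x: CGFloat(x), y: CGFloat(y))
    }
}

// Append the decoded points to the queue of points waiting to be rendered.
var remotePoints: [CGPoint] = []
remotePoints.append(contentsOf: decodePoints("12.5,-40;13,-38.5;oops"))
```

Skipping malformed pairs rather than failing the whole message means one bad value does not discard an entire batch.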

Within the session delegate’s didUpdate we’ll check the self.remotePoints array. We’ll pop the first point from the list and render a single point per frame to create the effect that the line is being drawn. We’ll parent the nodes to a single root node that gets created upon receipt of the touch-start message.
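The one-point-per-frame idea can be sketched without SceneKit. In this illustration FramePointQueue is a hypothetical type, and the rendered array stands in for the nodes that would be added under the root node; in the app, renderNext would be driven by the session delegate’s per-frame update.

```swift
import Foundation

/// Queues received points and releases at most one per frame.
struct FramePointQueue {
    private(set) var pending: [CGPoint] = []
    private(set) var rendered: [CGPoint] = []  // stand-in for nodes added to the scene

    mutating func enqueue(_ points: [CGPoint]) {
        pending.append(contentsOf: points)
    }

    /// Call once per frame: moves at most one point from the queue to the rendered list.
    mutating func renderNext() {
        guard !pending.isEmpty else { return }
        rendered.append(pending.removeFirst())
    }
}

var queue = FramePointQueue()
queue.enqueue([CGPoint(x: 1, y: 1), CGPoint(x: 2, y: 2), CGPoint(x: 3, y: 3)])

// Simulate five frame updates: three draw a point, two find the queue empty.
for _ in 0..<5 {
    queue.renderNext()
}
```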

Add “Undo”

Now that we have the data transmission layer set up, we can quickly keep track of each touch gesture and undo it. We’ll start by sending the undo message from the AudienceVC to the BroadcasterVC. We’ll add the code below to the sendUndoMsg function within our AudienceVC.

Send the string “undo” as a message

Within the BroadcasterVC we’ll check for the undo message within the rtcEngine delegate’s receiveStreamMessage function. Since each set of touch points is parented to its own root node, with every undo message we’ll remove the last rootNode (in the array) from the scene.
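The bookkeeping behind undo amounts to a stack of root nodes. The sketch below uses a hypothetical generic GestureHistory type so it runs without SceneKit; in the app the element type would be SCNNode, and the node returned by undo would then be detached with removeFromParentNode().

```swift
import Foundation

/// Tracks one root per touch gesture; `Root` stands in for SCNNode in the app.
struct GestureHistory<Root> {
    private(set) var roots: [Root] = []

    /// Records the root created when a touch-start message arrives.
    mutating func beginGesture(with root: Root) {
        roots.append(root)
    }

    /// Removes and returns the most recent root, or nil when there is nothing to undo.
    mutating func undo() -> Root? {
        return roots.popLast()
    }
}

var history = GestureHistory<String>()
history.beginGesture(with: "stroke-1")
history.beginGesture(with: "stroke-2")
let removed = history.undo()
```

Returning nil on an empty history keeps repeated undo messages from crashing once every stroke has been removed.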

Build and Run

Now we are ready to build and run our app. Plug in your two test devices, build and run the app on each device. On one device enter the channel name and Create the channel, and then on the other device enter the channel name and Join the channel.

Thanks for following and coding along with me. Below is a link to the completed project. Feel free to fork it and make pull requests with any feature enhancements.

For more information about the Video SDK, please refer to the API Reference.