How To Build Hot Dog — Not Hot Dog into a Live Stream with ARKit

Are you a big TV show binge-watcher? I sure am! As a dev, one thing I really enjoy is when a particular show highlights how technology interacts with the real world in believable ways, impacting us with unexpected—and often unintentionally hilarious—results.

For instance, back in 2017, HBO’s Silicon Valley aired an episode with the “hot dog or not hot dog” scene, in which Jian-Yang builds an app that recognizes hot dogs and labels everything else “not hot dog.” The scene depicts a classic first step in training an AI for visual recognition.

Official clip from HBO (NSFW warning: contains graphic language)

For this tutorial, I will train a custom AI model using IBM Watson, and then use that model to detect “hot dog” or “not hot dog” within a live camera view. I’ll use augmented reality to display the result to the user. Since everything is more fun with friends, we’ll add a live streaming component to it!

Prerequisites

Please Note: While no CoreML/AI knowledge is needed to follow along, certain basic concepts won’t be explained along the way.

Device Requirements

In this project, we’ll be using ARKit so we have some device requirements:

  • iPhone 6S or newer
  • iPhone SE
  • iPad (2017)
  • All iPad Pro models

Training the AI

Before we can build our iOS app, we first need to train the computer vision AI model. I chose IBM’s Watson Studio because they provide a very simple, drag-n-drop interface for training a computer vision model.

Create the Watson Project

Once you’ve created and logged into your Watson Studio account, click Create Project. Give your project a name/description, add the storage, and then click to create the project.
Next, click Add to Project and select Visual Recognition. Make sure to follow the prompts to add a Watson Visual Recognition service to the project. On the Custom Models screen we’re going to Classify Images, so click the Create Model button. Now let’s name our model. I chose the name HotDog!

Source Training Images

Now that we’ve set up our Watson instance, we need images to train our model. Sourcing images for AI training may sound like a daunting undertaking. While it is quite a heavy lift, there are tools that help make this task easier.

I chose to use the Google Images Download Python script. The script makes it easy to scrape images from Google while still respecting the original owners’ copyrights.

Once you have set up the Google Images Download script, let’s open up the command line and run it using:

googleimagesdownload --keywords "hotdog" --usage_rights labeled-for-reuse
Now that we have all our images downloaded, let’s take a look through. You should see mostly pictures of hot dogs, but you’ll also notice pictures of things related to hot dogs but not actually hot dogs. For example, there’s a hot dog cart, there are pictures of hot dogs with fries/chips, there’s one of just a bun, there are even illustrated hot dogs. While corn dogs may be considered a type of hot dog, for this model we’re going to exclude them.
example of hot dog cart image that was downloaded using the script

We need to remove any photos like these so we are left with only pictures of real hot dogs (toppings are fine). After removing them, you’ll notice that only about 50 photos are left. That isn’t many, considering that you’d usually want thousands of photos to train a model. While Watson could probably work with only 50 or so photos, let’s run the script a few more times with other keywords. These are the commands I used:

googleimagesdownload --keywords "hotdog" --usage_rights labeled-for-reuse

googleimagesdownload --keywords "plain hotdog" --usage_rights labeled-for-reuse

googleimagesdownload --keywords "real hotdog" --usage_rights labeled-for-reuse

googleimagesdownload --keywords "hotdog no toppings" --usage_rights labeled-for-reuse

After running the script a few times and removing any non-real hot dog images from each set of results, I was able to source 170 images for my model.

Let’s put all our hot dog images into a single folder and name it hotdog. Now that we have our hot dog images, we need to find some not-hotdog images. Again let’s use the Google Images Download script but this time with a batch of keywords. I used:

googleimagesdownload --keywords "cake, pizza, hamburger, french fries, cup, plate, fork, glasses, computer, sandwich, table, dinner, meal, person, hand, keyboard"  --usage_rights labeled-for-reuse

IBM Watson’s free tier imposes file size limits (250 MB per round) for training models, so once we’ve downloaded all of our not-hotdog images we need to remove any images with large file sizes. Let’s move all the images into a single folder and name it nothotdog. Next, zip each folder so you have hotdog.zip and nothotdog.zip.
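
One way to find the oversized files is a small Swift script. This is only a sketch and not part of the original workflow: the helper name oversizedFiles(in:maxBytes:) is made up for illustration, and the per-image cutoff is yours to choose so that each zip stays under the limit.

```swift
import Foundation

/// Returns the regular files directly inside `directory` that are larger than `maxBytes`.
/// Hypothetical helper for illustration; the cutoff is an assumption, not a Watson rule.
func oversizedFiles(in directory: URL, maxBytes: Int) throws -> [URL] {
    let keys: Set<URLResourceKey> = [.fileSizeKey, .isRegularFileKey]
    let contents = try FileManager.default.contentsOfDirectory(
        at: directory,
        includingPropertiesForKeys: Array(keys),
        options: [.skipsHiddenFiles]
    )
    return try contents.filter { url in
        let values = try url.resourceValues(forKeys: keys)
        // Skip sub-folders and anything whose size cannot be read.
        guard values.isRegularFile == true, let size = values.fileSize else { return false }
        return size > maxBytes
    }
}
```

For example, calling oversizedFiles(in: URL(fileURLWithPath: "nothotdog"), maxBytes: 2_000_000) would list every image above roughly 2 MB so you can review or delete it before zipping.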

Now, go back to the Watson Studio project, and upload the hotdog.zip file to our computer vision model. Once our zip finishes uploading you’ll notice that a new class hotdog has been created for us.

Next, upload your nothotdog.zip file. After it finishes uploading, you’ll have two classes: hotdog and nothotdog. For this example, we only need one class, hotdog; the other class needs to be migrated into the existing Negative class. To do this, open the nothotdog class and switch to the list view from the top. Scroll to the bottom and set the list length to 200, then scroll back to the top and click the select all button.

With all your images selected, click the Reclassify button, select the Negative class, and click Submit.
Once all the images have been reclassified, click back to the list of models to select and delete the nothotdog class. Now we are ready to click the Train Model button to get Watson trained on our images.
This process takes a while, so keep reading to see how it all comes together.

Note: if you want to use a large data set you’ll have to break it up into pieces and repeat the training process (above) for each batch.

That’s about it for collecting training images. All in all, it wasn’t too bad.

Building the iOS App

Now that Watson is training the visual recognition model, we are ready to build our iOS app.

In this example we’ll build an app that allows users to create or join a channel. Users who create a channel can then live stream themselves while using the custom IBM Watson model to classify what the camera sees as hot dog or not hot dog.

Let’s start by creating a new single view app in Xcode.

I named my project Agora Watson ARKit.

Remove Scene Delegate

Since we are using the Storyboard interface, we can remove SceneDelegate.swift and the Scene Manifest entry from the Info.plist. Then we need to open AppDelegate.swift, remove the Scene Delegate methods, and add the window property. Your AppDelegate.swift should look like this:
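
A minimal sketch of such an AppDelegate, assuming the default Xcode single view template with nothing else added to it:

```swift
import UIKit

// Older Xcode templates use @UIApplicationMain here instead of @main.
@main
class AppDelegate: UIResponder, UIApplicationDelegate {

    // With the scene manifest removed, the app delegate owns the window again,
    // and UIKit assigns the storyboard's initial view controller to it.
    var window: UIWindow?

    func application(_ application: UIApplication,
                     didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
        // No extra setup is needed at launch for this project.
        return true
    }
}
```

The two UISceneSession methods from the template are gone; leaving them in while the Scene Manifest is removed is harmless but misleading.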

Since this project will implement ARKit and Agora, we’ll use the AgoraARKit library to simplify the implementation and UI for us.

Create a Podfile, open it and add the AgoraARKit pod.

Then run the install:

pod install

Permissions

Add NSCameraUsageDescription, NSMicrophoneUsageDescription, NSPhotoLibraryAddUsageDescription, and NSPhotoLibraryUsageDescription to the Info.plist with a brief description for each. AgoraARKit uses the popular ARVideoKit framework, and the last two permissions are required by ARVideoKit because of its ability to store photos/videos.

Note: I’m not implementing on-device recording, so this app never actually uses the photo library permissions. If you plan to use this in production, you will still need to include them because they are requirements for ARVideoKit. For more information, review Apple’s guidelines on permissions.

Building the UI

We are ready to start building the UI. For this app we will have to build two views, the initial view and the AR view.

In any live streaming or communication app, you (as the developer) have two options for setting a channel name: set it for the user, or allow users to input their own. The latter is more flexible, so we’re going to have our initial view inherit from AgoraLobbyVC and allow users to input a channel name. Open your ViewController.swift, add import AgoraARKit just below the import UIKit line, and set your ViewController class to inherit from AgoraLobbyVC.

Next, set your Agora App ID within the loadView method and also set a custom image for the bannerImage property.

Next, let’s override the joinSession and createSession methods within our ViewController to set the images for the audience and broadcaster views.

Adding in the AI

Once Watson has finished training your model, you’ll need to download the CoreML file. Open Watson Studio and select the Hotdog Model. Within the model details, select the Implementation tab, then select the Core ML tab from the sub-menu on the left side of the screen. At the top of the Core ML section is the link to download the CoreML model file.

Once you’ve downloaded the Hotdog.mlmodel file, drag the file into your Xcode project.

The computer vision will be running within our AR view, which is also the camera view being streamed into Agora, so we’ll extend the ARBroadcaster class. The ARBroadcaster class is a bare-bones ARSCNView that is set up as a custom video source for Agora’s SDK.

Create a new class called arHotDogBroadcaster which inherits from ARBroadcaster. Next, add properties for the VNRequest array and the DispatchQueue. Then override viewDidLoad and load the Core ML model.
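
A rough sketch of what that class could look like. It assumes Xcode generated a Hotdog class from Hotdog.mlmodel and that ARBroadcaster is importable from AgoraARKit; the property name visionRequests and the queue label are placeholders of mine, and the completion handler here only logs the top result.

```swift
import UIKit
import CoreML
import Vision
import AgoraARKit

class arHotDogBroadcaster: ARBroadcaster {

    // Serial queue so classification work stays off the main thread.
    let dispatchQueueML = DispatchQueue(label: "hotdog.dispatchqueueml") // arbitrary label
    // Vision requests that are re-run each time we classify a frame (assumed name).
    var visionRequests = [VNRequest]()

    override func viewDidLoad() {
        super.viewDidLoad()

        do {
            // `Hotdog` is the class Xcode generates from Hotdog.mlmodel (assumed name).
            let coreMLModel = try Hotdog(configuration: MLModelConfiguration()).model
            let visionModel = try VNCoreMLModel(for: coreMLModel)

            // Capture self weakly so the stored request does not keep the controller alive.
            let request = VNCoreMLRequest(model: visionModel) { [weak self] request, error in
                self?.classificationCompleteHandler(request: request, error: error)
            }
            // Classify the center of the frame, where the camera is most likely aimed.
            request.imageCropAndScaleOption = .centerCrop
            visionRequests = [request]
        } catch {
            print("Unable to load the Core ML model: \(error)")
        }
    }

    // Called by Vision whenever a classification request finishes.
    func classificationCompleteHandler(request: VNRequest, error: Error?) {
        if let error = error {
            print("Classification failed: \(error)")
            return
        }
        guard let observations = request.results as? [VNClassificationObservation],
              let top = observations.first else { return }
        print("\(top.identifier): \(top.confidence)")
    }
}
```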

We’ll use the currentFrame from the ARKit scene as our input for our computer vision. Use the currentFrame.capturedImage to create a CIImage that will be used as input for our VNImageRequestHandler.
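
That step can be sketched as a free function, so it does not depend on how the broadcaster exposes its scene view. The function name and parameters are illustrative; inside the broadcaster you would pass the ARSCNView’s session.currentFrame and your stored Vision requests.

```swift
import ARKit
import CoreImage
import Vision

/// Runs the given Vision requests against the camera image of an ARKit frame.
/// Call this off the main thread: `perform` is synchronous.
func runClassification(on frame: ARFrame?, using requests: [VNRequest]) {
    // currentFrame is nil until the AR session has delivered its first frame.
    guard let pixelBuffer = frame?.capturedImage else { return }

    // Wrap the camera's pixel buffer so Vision can consume it.
    let image = CIImage(cvPixelBuffer: pixelBuffer)
    let handler = VNImageRequestHandler(ciImage: image, options: [:])

    do {
        // Each request's completion handler fires once its results are ready.
        try handler.perform(requests)
    } catch {
        print("Vision request failed: \(error)")
    }
}
```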

How do we know what the results are? If you look at the viewDidLoad snippet above, you’ll notice we set classificationCompleteHandler as the completion block for any classification requests.

Every time the Core ML engine returns a response, we need to parse it and display either “Hot Dog” or “Not Hot Dog” to the user. You’ll notice in the snippet that once we have a result, we parse it and then check the confidence level. I set the bar fairly low, at 40% confidence. That means the model only has to be 40% confident that it sees a hot dog for the app to report “Hot Dog.”

During testing, 40% confidence proved adequate for the purposes of this project, but you may want to adjust that value depending on how sensitive you want your AI to be.
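
The threshold check itself is small enough to sketch in plain Swift. The Classification struct below is a stand-in for Vision’s VNClassificationObservation, and resultLabel is a hypothetical helper name; the class name hotdog and the 0.4 default come from the steps above. Treating exactly 40% as a match is my assumption.

```swift
/// A plain stand-in for VNClassificationObservation (identifier + confidence).
struct Classification {
    let identifier: String
    let confidence: Float
}

/// Maps classifier output to the label shown to the user.
/// Anything that is not a sufficiently confident "hotdog" counts as "Not Hot Dog".
func resultLabel(for classifications: [Classification], threshold: Float = 0.4) -> String {
    let sawHotDog = classifications.contains { item in
        item.identifier == "hotdog" && item.confidence >= threshold
    }
    return sawHotDog ? "Hot Dog" : "Not Hot Dog"
}
```

In the completion handler, the Vision results can be converted with observations.map { Classification(identifier: $0.identifier, confidence: $0.confidence) } before calling resultLabel. Raising the threshold makes the app more conservative about declaring a hot dog.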

All that’s left now is to display the result to the user using augmented reality. You’ll notice in the classificationCompleteHandler, we call the function self.showResult and pass in a string with the value of either “Hot Dog” or “Not Hot Dog.” Within showResult, we need to get the estimated position of the object and add an AR text label.
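
One possible shape for showResult, written as a free function that takes the ARSCNView as a parameter. This is a sketch under assumptions: it uses the center of the screen as the object’s location, the font size and scale are placeholder values, and it relies on ARSCNView’s hitTest(_:types:), which Apple deprecated in iOS 14 in favor of raycasting.

```swift
import ARKit
import SceneKit
import UIKit

/// Places a text label at the nearest feature point under the center of the view.
/// Must be called on the main thread because it touches the view and the scene.
func showResult(_ text: String, in sceneView: ARSCNView) {
    // Use the screen center as a stand-in for "the object the camera is looking at".
    let center = CGPoint(x: sceneView.bounds.midX, y: sceneView.bounds.midY)
    guard let hit = sceneView.hitTest(center, types: [.featurePoint]).first else { return }
    // The last column of the transform holds the world-space position.
    let translation = hit.worldTransform.columns.3

    let textGeometry = SCNText(string: text, extrusionDepth: 0.5)
    textGeometry.font = UIFont.boldSystemFont(ofSize: 10)
    textGeometry.firstMaterial?.diffuse.contents = UIColor.white

    let textNode = SCNNode(geometry: textGeometry)
    // SCNText is sized in scene units (meters), so shrink it to a readable size.
    textNode.scale = SCNVector3(0.005, 0.005, 0.005)
    textNode.position = SCNVector3(translation.x, translation.y, translation.z)
    // Keep the label facing the camera as the user moves.
    textNode.constraints = [SCNBillboardConstraint()]

    sceneView.scene.rootNode.addChildNode(textNode)
}
```

Because the classification finishes on a background queue, wrap the call in DispatchQueue.main.async when invoking it from the completion handler.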

Now that we have our model ready to run, we need to add a way for the user to invoke the computer vision model. Let’s use the View’s touchesBegan method to call the runCoreML method.

// Run the classifier off the main thread each time the user taps the screen.
override func touchesBegan(_ touches: Set<UITouch>, with event: UIEvent?) {
    dispatchQueueML.async {
        self.runCoreML()
    }
}

Implement new broadcaster class

We’re almost done. The last step (before we can start testing) is to set the ARBroadcaster in ViewController.swift by updating line 43, where the broadcaster view controller is created, to:

let arBroadcastVC = arHotDogBroadcaster()

This makes the app present our new arHotDogBroadcaster instead of the default ARBroadcaster, and we are ready to start testing!

Custom Lobby Scene

That's It!

The core application is done; I’ll leave it up to you to customize the UI. Thanks for following along. If you have any questions or feedback, please leave a comment.

I’ve uploaded my complete code, with UI customizations, to GitHub, so feel free to fork the repo and make PRs for new features.
