Webcam Object Detection with Mask R-CNN on Google Colab

Mask R-CNN algorithm in low light - thinks it sees a cat ¯\_(ツ)_/¯

There are plenty of approaches to do Object Detection. YOLO (You Only Look Once) is the algorithm of choice for many, because it passes the image through the Fully Convolutional Neural Network (FCNN) only once. This makes the inference fast. About 30 frames per second on a GPU.

Object bounding boxes predicted by YOLO (Joseph Redmon)

Another popular approach is the use of Region Proposal Network (RPN). RPN based algorithms have two components. First component gives proposals for Regions of Interests (RoI)… i.e. where in the image might be objects. The second component does the image classification task on these proposed regions. This approach is slower. Mask R-CNN is a framework by Facebook AI that makes use of RPN for object detection. Mask R-CNN can operate at about 5 frames per second on a GPU. We will use Mask R-CNN.

Why use a slow algorithm when there are faster alternatives? Glad you asked!

Mask R-CNN also outputs object-masks in addition to object detection and bounding box prediction.

Object masks and bounding boxes predicted by Mask R-CNN (Matterport)

The following sections contain explanation of the code and concepts that will help in understanding object detection, and working with camera inputs with Mask R-CNN, on Colab. It’s not a step by step tutorial but hopefully it would be as effective. At the end of this article, you will find the link to Colab notebook to try it yourself.

Matterport has a great implementation of Mask R-CNN using Keras and Tensorflow. They have provided Notebooks to play with Mask R-CNN, to train Mask R-CNN with your own dataset and to inspect the model and weights.

Why Google Colab

If you don’t have a GPU machine, or don’t want to go through the tedious task of setting up the development environment, Colab is the best temporary option.

In my case, I had lost my favorite laptop recently. So, I am on my backup machine - a windows tablet with a keyboard. Colab enables you to work in a Jupyter Notebook in your browser, connected to a powerfull GPU or a TPU (Tensor Processing Unit) virtual machine in Google Cloud. The VM comes pre-installed with Python, Tensorflow, Keras, PyTorch, Fastai and a lot of other important Machine Learning tools. All for free. Beware that your session progress gets lost due to a few minutes of inactivity.

Getting started with Google Colab

The Welcome to Colaboratory guide gets you started easily. And the Advanced Colab guide comes in handy when taking input from camera, communicating between different cells of the notebook, and communication between Python and JavaScript code. If you don’t have time to look at them, just remember the following.

A cell in Colab notebook usually contains Python code. By default the code runs inside /content directory of the connected Virtual Machine. Ubuntu is the operating system of Colab VMs and you can execute system commands by starting the line of the command with !.

The following command will clone the repository.

!git clone https://github.com/matterport/Mask_RCNN

If you have multiple system commands in the same cell, then you must have %%shell as the first line of the cell followed by system commands. Thus, the following set of commands will clone the repository, change the directory to Mask_RCNN and setup the project.

%%shell
# clone Mask_RCNN repo and install packages
git clone https://github.com/matterport/Mask_RCNN
cd Mask_RCNN
python setup.py install 

Import Mask R-CNN

The following code comes from Demo Notebook provided by Matterport. We only need to change the ROOT_DIR to ./Mask_RCNN, the project we just cloned.

The python statement sys.path.append(ROOT_DIR) makes sure that the subsequent code executes within the context of Mask_RCNN directory where we have Mask R-CNN implementation available. The code imports the necessary libraries, classes and downloads the pre-trained Mask R-CNN model. Go through it. The comments make it easier to understand the code.

import os
import sys
import random
import math
import numpy as np
import skimage.io
import matplotlib
import matplotlib.pyplot as plt

# Root directory of the project
ROOT_DIR = os.path.abspath("./Mask_RCNN/")

# Import Mask RCNN
sys.path.append(ROOT_DIR)  # To find local version of the library
from mrcnn import utils
import mrcnn.model as modellib
from mrcnn import visualize

# Import COCO config
sys.path.append(os.path.join(ROOT_DIR, "samples/coco/"))  # find local version
import coco

%matplotlib inline 

# Directory to save logs and trained model
MODEL_DIR = os.path.join(ROOT_DIR, "logs")

# Local path to trained weights file
COCO_MODEL_PATH = os.path.join(ROOT_DIR, "mask_rcnn_coco.h5")
# Download COCO trained weights from Releases if needed
if not os.path.exists(COCO_MODEL_PATH):
    utils.download_trained_weights(COCO_MODEL_PATH)

# Directory of images to run detection on
IMAGE_DIR = os.path.join(ROOT_DIR, "images")

Create Model from Trained Weights

Following code creates model object in inference mode, so we could run predictions. Then it loads the weights from the pre-trained model that we downloaded earlier, into the model object.

# Create model object in inference mode.
model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=config)

# Load weights trained on MS-COCO
model.load_weights(COCO_MODEL_PATH, by_name=True)

Run Object Detection

Now we test the model on some images. Mask_RCNN repository has a directory named images that contains… you guessed it… some images. The following code takes an image from that directory, passes it through the model and displays the result on the notebook along with bounding box information.

# Load a random image from the images folder
file_names = next(os.walk(IMAGE_DIR))[2]
image = skimage.io.imread(os.path.join(IMAGE_DIR, random.choice(file_names)))

# Run detection
results = model.detect([image], verbose=1)

# Visualize results
r = results[0]
visualize.display_instances(image, r['rois'], r['masks'], r['class_ids'], 
                            class_names, r['scores'])

The result of prediction

Processing 1 images
image                    shape: (425, 640, 3)         min:    0.00000  max:  255.00000  uint8
molded_images            shape: (1, 1024, 1024, 3)    min: -123.70000  max:  151.10000  float64
image_metas              shape: (1, 93)               min:    0.00000  max: 1024.00000  float64
anchors                  shape: (1, 261888, 4)        min:   -0.35390  max:    1.29134  float32

Working with Camera Images

In the advanced usage guide of Colab, they have provided code that can capture image from webcam in the notebook and then forward it to the Python code.

Colab notebook has pre-installed python package called google.colab which contains handy helper methods. There’s a method called output.eval_js which helps us evaluate the JavaScript code and returns the output to Python. And in JavaScript, we know that there is a method called getUserMedia() which enables us to capture the audio and/or video stream from user’s webcam and microphone.

Have a look at the following JavaScript code. Using getUserMedia() method of WebRTC API of JavaScript, it captures the video stream of the webcam and draws the individual frames on HTML canvas. Like google.colab Python package, we have google.colab library available to us in JavaScript. This library will help us invoke a Python method using kernel.invokeFunction function from our JavaScript code.

The image captured from webcam is converted to Base64 format. This Base64 image is passed to a Python callback method, which we will define later.

async function takePhoto(quality) {
    // create html elements
    const div = document.createElement('div');
    const video = document.createElement('video');
    video.style.display = 'block';

    // request the stream. This will ask for Permission to access 
    // a connected Camera/Webcam
    const stream = await navigator.mediaDevices.getUserMedia({video: true});

    // show the HTML elements
    document.body.appendChild(div);
    div.appendChild(video);
    // display the stream
    video.srcObject = stream;
    await video.play();

    // Resize the output (of Colab Notebook Cell) to fit the video element.
    google.colab.output
        .setIfrmeHeight(document.documentElement.scrollHeight, true);

    // capture 5 frames (for test)
    for (let i = 0; i < 5; i++) {
        const canvas = document.createElement('canvas');
        canvas.width = video.videoWidth;
        canvas.height = video.videoHeight;
        canvas.getContext('2d').drawImage(video, 0, 0);
        img = canvas.toDataURL('image/jpeg', quality);

        // Call a python function and send this image
        google.colab.kernel.invokeFunction('notebook.run_algo', [img], {});

        // wait for X miliseconds second, before next capture
        await new Promise(resolve => setTimeout(resolve, 250));
    }

    stream.getVideoTracks()[0].stop(); // stop video stream
}

We already discussed that having %%shell as the first line of the Colab notebook cell makes it run as terminal commands. Similarly, you can write JavaScript in the whole cell by starting the cell with %%javascript. But we will simply put the JavaScript code we wrote above, inside the Python code. Like this:

from IPython.display import display, Javascript
from google.colab.output import eval_js
from base64 import b64decode

def take_photo(filename='photo.jpg', quality=0.8):
    js = Javascript('''
        // ...
        // JavaScript code here <<==
        // ...
        ''')
    # make the provided HTML, part of the cell
    display(js) 
    # call the takePhoto() JavaScript function
    data = eval_js('takePhoto({})'.format(quality)) 

Python - JavaScript Communication

The JavaScript code we wrote above invokes notebook.run_algo method of our Python code. The following code defines a Python method run_algo which accepts a Base64 image, converts it to a numpy array and then passes it through the Mask R-CNN model we created above. Then it shows the output image and processing stats.

Important! Don’t forget to surround the Python code of your callback method in try / except block and log it. Because it will be invoked by JavaScript and there will be no sign of what error occurred while calling the Python callback.

import IPython
import time
import sys
import numpy as np
import cv2
import base64
import logging

from google.colab import output
from PIL import Image
from io import BytesIO

def data_uri_to_img(uri):
    """convert base64 image to numpy array"""
    try:
        image = base64.b64decode(uri.split(',')[1], validate=True)
        # make the binary image, a PIL image
        image = Image.open(BytesIO(image))
        # convert to numpy array
        image = np.array(image, dtype=np.uint8); 
        return image
    except Exception as e:
        logging.exception(e);print('\n')
        return None

def run_algo(imgB64):
    """
    in Colab, run_algo function gets invoked by the JavaScript, 
    that sends N images every second, one at a time.

    params:
        image: image
    """
    image = data_uri_to_img(imgB64)  
    if image is None:
        print("At run_algo(): image is None.")
        return

    try:
        # Run detection
        results = model.detect([image], verbose=1)
        # Visualize results
        r = results[0]    
        visualize.display_instances(
            image, 
            r['rois'], 
            r['masks'], 
            r['class_ids'], 
            class_names, 
            r['scores']
            )
    except Exception as e:
        logging.exception(e)
        print('\n')

Let’s register run_algo as notebook.run_algo. Now it will be invokable by the JavaScript code. We also call the take_photo() Python method we defined above, to start the video stream and object detection.

# register this function, so JS code could call this
output.register_callback('notebook.run_algo', run_algo)

# put the JS code in cell and run it
take_photo()

Try it yourself

You are now ready to try Mask R-CNN on camera in Google Colab. The notebook will walk you step by step through the process.

(Optional) For Curious Ones

The process we used above converts the camera stream to images in browser (JavaScript) and sends individual images to our Python code for object detection. This is obviously not real-time. So, I spent hours trying to upload the WebRTC stream from the JavaScript (peer A) to the Python Server (peer B) without success. Perhaps my unfamiliarity with the combination of async / await with Python Threads was the main hindrance. I was trying to use aiohttp as Python server that will handle WebRTC connection using aiortc. The Python library aiortc makes it easy to create Python as a peer of WebRTC. Here is the link to the Colab notebook with incomplete effort of creating WebRTC server.

What’s next

Facebook AI Research is working on almost similar problem, but through simulations. The algorithm that can solve the navigation problem will have applications in Augmented / Virtual Reality, Security, Robotics, and tools for Visually Impaired people.

Facebook AI has effectively solved the task of point-goal navigation by AI agents in simulated environments, using only a camera, GPS, and compass data. Agents achieve 99.9% success in a variety of virtual settings, such as houses and offices. - @facebookai

In Part 2, we will use PlaneRCNN to extract the depthmap and surface planes from a single image. The next step will be to use this depthmap information to understand the environment around us, using the smartphone camera.

A Tiny Request

I am pretty new to Deep Learning and my approach to the problem could be very naive. If you think you or someone you know can advise me on how to approach the “Navigation for Visually Impaired” problem, please connect at emadulehsan@gmail.com.