I had a relative who lost their eyesight to diabetic retinopathy. Diabetic retinopathy affects up to 80% of those who have had diabetes for 20 years or more. It occurs when blood vessels in the retina, the light-sensitive tissue at the back of the eye, become damaged, slowly causing loss of vision. It is painful to watch your loved ones go through this experience. Tony Stark, Dr. Jessica Mega and team did a great job of covering diabetic retinopathy in ‘Age of AI’.

View for a person with clear vision vs for a person with Diabetic Retinopathy (Wikipedia)

My relative is not with us today. Diabetes is common in our area (Bhakkar): about 30% of people above 40 are diagnosed with it, and many more probably don’t know they have it yet.

If you know a visually impaired person, you are probably familiar with the constant struggle they and their family go through every day.

Problem Statement

How can we help visually impaired people move around independently?

I have never been a student of biology, so my first instinct was to employ technology to solve this problem. I envisioned a smartphone app that would assist a visually impaired person with navigation in real time. Microsoft has released a tremendous app called SeeingAI (iOS only) that helps one make sense of their environment and recognize things. The app ‘Be My Eyes’ employs a network of volunteers who assist visually impaired people with small tasks by watching live video from their cameras.

The Plan

I named this project Bhakkar. Even though I had been planning to work on this problem for a long time, I completed the DeepLearning.ai specialization only very recently. My naive plan for Bhakkar is:

Take a model that can detect planes and estimate a depth map from a single photo, then tweak it to run on mobile in real time.

Thus, I went through many papers, barely understanding them. Then I decided to focus on PlaneRCNN by Liu et al., which seemed to do comparatively better at planar surface detection and depth-map estimation. PlaneRCNN is based on Mask R-CNN, an object detection and instance segmentation framework by Facebook AI Research. Conceptually, Mask R-CNN is similar to Faster R-CNN, but in addition to detecting objects, it also outputs a mask of where exactly each detected object is in the image.

Object detection and masks predicted by Mask R-CNN (Matterport)

The authors of PlaneRCNN treat the planes of the scene in an image as object instances. They use Mask R-CNN for detection and segmentation of these objects (i.e. planes). They train the model on a dataset they created from ScanNet, an RGB-D video dataset. An RGB-D image is an RGB image plus depth information: the distance of each pixel from the image plane. They took every 20th frame from the video as an RGB-D image to create an annotated dataset, skipping the frames in between because consecutive frames often aren’t much different. Although ScanNet is a dataset of indoor scenes, PlaneRCNN generalizes to outdoor scenes comparatively well.
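The dataset-building step above can be sketched in a few lines of NumPy. This is only an illustration: the shapes are made up (ScanNet’s real resolution and file formats differ), but it shows the two ingredients of an RGB-D frame (color plus a per-pixel depth map) and the every-20th-frame subsampling.

```python
import numpy as np

# Hypothetical decoded scan: illustrative shapes, not ScanNet's real sizes.
num_frames, h, w = 100, 48, 64
rgb = np.zeros((num_frames, h, w, 3), dtype=np.uint8)    # color channels
depth = np.zeros((num_frames, h, w), dtype=np.float32)   # distance of each pixel
                                                         # from the image plane

# Keep every 20th frame, since consecutive frames are near-duplicates.
rgb_kept = rgb[::20]
depth_kept = depth[::20]
assert rgb_kept.shape[0] == depth_kept.shape[0] == 5
```

A 100-frame scan thus yields 5 annotated RGB-D training images.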

PlaneRCNN output (NVIDIA Labs)

Woohoo… an algorithm that can detect surfaces and objects, and output the depth information of the scene. We are good to go.

Except there is a problem: Mask R-CNN runs at only about 5 frames per second on a GPU. PlaneRCNN adds a depth-estimation component on top of it, which adds overhead to an already slow algorithm.

A naive solution: swap the backbone of Mask R-CNN from ResNet to MobileNet, trading accuracy for speed.

Ignorance is bliss


See the project status at Bhakkar homepage.