Building a deep learning model to perform image segmentation of Indian roads

Kashif Shariff
6 min read · Jun 15, 2020


What is image segmentation?

Image segmentation is the process of taking an image as a grid of pixels and identifying the type of object each pixel belongs to.

Let’s take an example

src : https://www.google.com/search?q=what+is+image+segmentation&sxsrf=ALeKk01237XwREoTrleE4mCIcvoAPmwV-g:1592027939210&tbm=isch&source=iu&ictx=1&fir=TDY5QpYZaP2KkM%253A%252CYp4OL25JOyw41M%252C%252Fm%252F02jj2w&vet=1&usg=AI4_-kTob2AVP9gIniiPxCB1r3hyOk2PSQ&sa=X&ved=2ahUKEwjLzaeejv7pAhXEzzgGHUSdBkUQ_B0wEnoECAwQAw&biw=1366&bih=657#imgrc=TDY5QpYZaP2KkM

A segmentation model would take the topmost image, a color image, as input and identify the type of object in each pixel.

Segmentation can be treated as a multi-class classification problem: for each pixel we need to find the class to which it belongs.

Let’s say in the above image label 1 is given to road, label 2 to sidewalk, label 3 to building, label 4 to fence, label 5 to pole, label 6 to vegetation, label 7 to vehicle and label 8 to unknown.

Now, given a color image, we need to predict the mask. The mask is nothing but the label given to each pixel. Based on each pixel’s mask value we can color each class uniquely and get the output shown above.
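Coloring a mask this way is a small lookup. Here is a minimal numpy sketch, assuming a hypothetical palette (the actual IDD colors differ) and the 8-label scheme above:

```python
import numpy as np

# Hypothetical label -> RGB palette for labels 1..8; not the official IDD colors.
PALETTE = np.array([
    [128,  64, 128],  # 1: road
    [244,  35, 232],  # 2: sidewalk
    [ 70,  70,  70],  # 3: building
    [190, 153, 153],  # 4: fence
    [153, 153, 153],  # 5: pole
    [107, 142,  35],  # 6: vegetation
    [  0,   0, 142],  # 7: vehicle
    [  0,   0,   0],  # 8: unknown
], dtype=np.uint8)

def colorize_mask(mask):
    """Map an (H, W) mask with labels 1..8 to an (H, W, 3) RGB image."""
    return PALETTE[mask - 1]   # label k picks palette row k-1
```

For example, `colorize_mask(np.array([[1, 2], [7, 8]]))` returns a 2×2×3 RGB array where every road pixel becomes `[128, 64, 128]`.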

Summary of the above discussion

DATA SET

The data set consists of images obtained from a front-facing camera attached to a car. The car was driven around the cities of Hyderabad and Bangalore and their outskirts. The data set is divided into train, val and test splits.

The train and val splits each have two folders: “leftImg8bit” and “gtFine”. “leftImg8bit” contains the images captured from the camera, which need to be segmented. “gtFine” contains the actual labels (i.e. masks) for the corresponding images in “leftImg8bit”.

The “test” folder has only a “leftImg8bit” folder and no “gtFine”.

Data Set link — http://idd.insaan.iiit.ac.in/dataset/download/

I have used “Dataset Name: IDD — Segmentation (IDD 20k Part II) (5.5 GB)”

If you have sufficient compute resources, feel free to download both part 1 and part 2 and train on them.

“gtFine” doesn’t store the mask as an image or matrix; it stores the mask information in JSON. The JSON files contain polygon information; we need to extract that from the JSON file and then create the mask.

To simplify the above process, let’s understand something called rasterization.

Line rasterization — the process of taking two points as input and “glowing” (lighting up) all the pixels on the line between them. Glowing here means filling the pixel either with a color or with a mask value.

Line rasterization
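Line rasterization can be sketched in a few lines of Python. This is one common way to do it (Bresenham’s algorithm, my choice here; the notebook may do it differently), returning the pixels to glow:

```python
def rasterize_line(x0, y0, x1, y1):
    """Return the pixel coordinates on the line from (x0, y0) to (x1, y1)
    using Bresenham's algorithm (integer arithmetic only)."""
    pixels = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        pixels.append((x0, y0))          # "glow" this pixel
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:                      # step in x
            err += dy
            x0 += sx
        if e2 <= dx:                      # step in y
            err += dx
            y0 += sy
    return pixels
```

For a diagonal, `rasterize_line(0, 0, 3, 3)` gives `[(0, 0), (1, 1), (2, 2), (3, 3)]`; each returned pixel would be filled with the color or mask value.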

Polygon rasterization — the process of taking the vertices of a polygon as input and glowing all the pixels enclosed by them.

Again, glowing can mean either coloring the pixels or filling them with a mask value.

Polygon rasterization

Filling a polygon with a mask value can be done using Python libraries. Please refer to the file “Json_TO_Image.ipynb” in the repository linked at the end. The code is self-explanatory; just follow the function references to perform the task described above.
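The JSON-to-mask step can be sketched with Pillow’s `ImageDraw.polygon`, which fills a polygon’s interior. This assumes a simplified annotation layout (each object has a `label` and a `polygon` list of `[x, y]` vertices) and a hypothetical label mapping; the real gtFine schema and IDD label set are larger:

```python
import json
import numpy as np
from PIL import Image, ImageDraw

# Hypothetical label -> integer id mapping; the real IDD label set is larger.
LABEL_IDS = {"road": 1, "sidewalk": 2, "building": 3, "vehicle": 7}

def json_to_mask(json_str, width, height):
    """Rasterize polygons from a gtFine-style JSON string into an (H, W)
    uint8 mask. Pixels of unknown labels are left as background (0)."""
    ann = json.loads(json_str)
    mask = Image.new("L", (width, height), 0)      # single-channel, all zeros
    draw = ImageDraw.Draw(mask)
    for obj in ann["objects"]:
        label_id = LABEL_IDS.get(obj["label"])
        if label_id is not None:
            pts = [tuple(p) for p in obj["polygon"]]
            draw.polygon(pts, fill=label_id)       # glow interior with label id
    return np.array(mask, dtype=np.uint8)
```

A tiny annotation such as `{"objects": [{"label": "road", "polygon": [[0, 0], [7, 0], [7, 7], [0, 7]]}]}` would produce a 10×10 mask with a filled square of 1s.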

Segmentation models

UNET

Source:http://deeplearning.net/tutorial/_images/unet.jpg

I’m not going to explain each layer of UNET. You can understand it from the diagram above or from many existing blogs. Many blogs explain what is done at each stage but fail to give the big picture, so I will try to convey UNET intuitively. I would never memorize the UNET diagram; if I needed the architecture in order to implement it, I would just search for it, so there is no need to memorize it.

UNET has two parts, as shown in the figure: the left part is the contracting path (aka encoder) and the right part is the expansive path (aka decoder).

The input image is read as a matrix of pixels and fed as input to the encoder of UNET. In the encoder we perform sets of convolution and pooling operations, progressively squeezing the spatial size of the input matrix.

The encoder has multiple layers, each of which reduces the size of the feature map.

Finally, in the decoder part we expand the size of the feature map back up.

The first decoder layer takes the output of the final encoder layer as input, along with the corresponding compressed feature map from the encoder stage (via a skip connection).

UNET architecture is symmetric i.e. the number of layers in encoder = number of layers in decoder.

The intermediate decoder layers take the previous decoder layer’s output and the corresponding encoder feature map as input.

The last decoder layer takes the previous decoder layer’s output along with the first encoder layer’s feature map as input, and generates the image mask.

I hope this gives a crisp understanding of the UNET architecture.
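The data flow above can be sketched shape-wise in plain numpy (no learned weights, just pooling, upsampling and skip concatenations) to see how encoder and decoder sizes mirror each other:

```python
import numpy as np

def pool2x2(x):
    """2x2 max pooling on an (H, W, C) feature map; halves H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample2x(x):
    """Nearest-neighbour upsampling on (H, W, C); doubles H and W."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.rand(64, 64, 3)       # input "image"
e1 = x                               # encoder level 1 features (stand-in)
e2 = pool2x2(e1)                     # 32 x 32: encoder squeezes the size
e3 = pool2x2(e2)                     # 16 x 16: bottleneck

# Decoder: upsample, then concatenate the matching encoder feature map
d2 = np.concatenate([upsample2x(e3), e2], axis=-1)   # 32 x 32, skip from e2
d1 = np.concatenate([upsample2x(d2), e1], axis=-1)   # 64 x 64, skip from e1
```

In a real UNET each level also applies learned convolutions, but the shapes behave exactly like this: `d1` is back at the input resolution, carrying both coarse and fine information.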

I have tried the UNET architecture with and without batch normalization; UNET + batch normalization seems to work better.

Please note that you can find the code in the GitHub repository mentioned at the end.

UNET + Resnet34 as the backbone

Here we replace the encoder of UNET with Resnet34. We do not train the Resnet encoder blocks and train only the decoder layers. Sometimes it is useful to train only the randomly initialized decoder, so as not to damage the weights of a properly pre-trained encoder with huge gradients during the first steps of training.

This seems to work better than plain UNET. Resnet works well in the encoder part since it contains skip connections, which can bypass layers that are not useful.

Please refer to https://segmentation-models.readthedocs.io/en/latest/install.html

NOTE: I have tried multiple models; however, I have mentioned only the one that works best for this task. Some techniques like CANET were also tried, but they did not seem to work as well.

All the experiments performed are present in my GitHub repository. You can go through it to learn the experiment details.

Experiment results

We not only try to minimize the log loss; our objective is also a higher IoU (Intersection over Union), also known as the Jaccard index.

IoU

The intuition behind IoU is that we try to increase the area of the correctly classified region relative to the total region (misclassified + correctly classified regions).

Visualization of Segmentation results of Unet_Resnet and Canet on validation data

Unet_Resnet results

Canet results

Canet

Flask App

I have created a Flask app that loads the best model, breaks an input video into frames, segments each frame, and converts the segmentation results back into a video. The app then shows a response to play the video once it is generated.

On running app.py and connecting to the Flask app.
Select the desired video
Response of flask app

Now play the movie.

I have taken the above video from YouTube.

Links

GitHub link for Ipython notebooks : https://github.com/KashifAS/Segmentation_IDD

Flask app link : https://github.com/KashifAS/Flask_App_For_Segmentation_On_IDD/tree/master

My Linkedin : linkedin.com/in/kashif-shariff-583789191
