Image Segmentation with Deep Learning

Andi Sama
Nov 20, 2019


Enabled by framework

A Cognitive use-case, Semantic Segmentation based on CamVid dataset

Andi Sama CIO, Sinergi Wahana Gemilang, with Arfika Nurhudatiana, Ph.D.

Ever wonder how an intelligent machine sees the world? Consider an intelligent robot that navigates its environment, avoiding obstacles as it moves and carefully following a path to a predefined destination without being explicitly programmed for that single goal. All of this must happen with special attention to safety: it must not harm any living thing, whether human or animal.

Well, with Artificial Intelligence (AI), and especially Deep Learning, this has become increasingly possible in recent years. The robot can take the form of a drone or an autonomous vehicle (e.g. a self-driving car), for instance.

Human & Machines

Humans naturally sense their surroundings through various biological sensors, such as eyes for vision, ears for hearing, the nose for smelling, and skin for touch. These incredible capabilities have been built into our bodies since the day we were born, so we use them mostly unconsciously every day. We take them for granted: they are simply there, ready for us to enjoy.

Machines, on the other hand, cannot simply be designed and implemented to mimic everything a human can do. Years of research have been devoted to this, and many advances have emerged in the last few years, especially in computer vision, through new algorithms and new optimization methods. Advances in high-speed hardware and the availability of big data have accelerated this area of study, with selected successful implementations in the real world and many more potential practical applications in the future.

This article, “Image Segmentation with Deep Learning, enabled by framework: A Cognitive use-case, Semantic Segmentation based on CamVid dataset”, discusses Image Segmentation, an application of deep learning in computer vision that extends object detection in images to a more granular level. The companion article, “Image Classification with Deep Learning, enabled by framework: A Cognitive use-case, 4-classes Image Classification”, discusses Image Classification.

Specifically, this article discusses Semantic Image Segmentation rather than Instance Image Segmentation. In Semantic Segmentation, the pixel-wise prediction assigns each pixel to an object class such as person, car, tree or building. Instance Segmentation is more granular still: if there are multiple persons in an image, it differentiates person-1, person-2 and person-3, along with other objects such as car-1, car-2 and tree-1, tree-2, tree-3, tree-4, and so on.

Illustration-1.a and 1.b show the predicted output of two images using deep learning for Semantic Segmentation. Each pixel of those images is recognized as one of 32 trained classes (categories), along with its probability.
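To make "a class and a probability for each pixel" concrete, here is a minimal NumPy sketch. It assumes the network outputs one raw score per class for every pixel, arranged as (classes, height, width); the function name segment() and the tiny 3-class input are made up for illustration and are not taken from the article's notebook.

```python
import numpy as np

def segment(logits):
    """Turn raw per-class scores of shape (C, H, W) into per-pixel labels and probabilities."""
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=0, keepdims=True)
    labels = probs.argmax(axis=0)    # (H, W): index of the most likely class per pixel
    confidence = probs.max(axis=0)   # (H, W): probability of that class
    return labels, confidence

# Toy input: 3 hypothetical classes on a 2x2 image.
logits = np.array([
    [[2.0, 0.0], [0.0, 0.0]],
    [[0.0, 3.0], [0.0, 1.0]],
    [[0.0, 0.0], [4.0, 1.0]],
])
labels, confidence = segment(logits)
print(labels)
print(confidence.round(3))
```

With 32 classes the only change is the size of the first axis; the per-pixel argmax and probability work the same way.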

Illustration-1b: An original image and its segmented version, processed using semantic image segmentation in deep learning. The segmented image is produced by a model with 92.15% training accuracy at 512x512 pixels image resolution. Every processed pixel can carry several candidate labels from the set of labels, each with its own probability.

The model itself was previously trained using deep learning, reaching 92.15% training accuracy. Deep learning is a type of machine learning that has attracted a great deal of attention in recent years.

Machine Learning & Deep Learning

The notable breakthrough in computer vision using deep learning came in 2012, when a Convolutional Neural Network (a.k.a. CNN) trained with the back-propagation algorithm won the ImageNet competition “Large scale Visual Recognition Challenge on Image Classification” with an error rate of 16.4%, a significant improvement over the 2011 result of 25.8% (Fei-Fei Li, Justin Johnson, Serena Yeung, 2017). That neural network has been known as AlexNet ever since.

Subsequent results in 2013, 2014 and 2015 were 11.7%, 6.7% and 3.57% respectively. The 2015 ImageNet result surpassed a human expert, whose error rate was 5.1%.

Illustration-2 gives a brief overview of the evolution of and advancements in AI since the 1950s. Deep Learning, powered by the backpropagation algorithm, is part of Machine Learning within AI (alongside approaches such as supervised learning, unsupervised learning and reinforcement learning). It has been the key factor in the current exciting advancements in AI, supported by the availability of huge datasets (big data) as well as hardware accelerators such as GPUs (Graphics Processing Units), especially from NVIDIA.

Illustration-2: A brief overview on the evolution and advancements in Artificial Intelligence since 1950.

A previous edition of SWG Insight (Andi Sama et al., 2017) briefly discussed the future advancements that are possible in Machine Learning, especially with Deep Learning. Follow-on articles (2018–2019) discuss the topic further, and more will follow in the years to come, as the field is still exciting, with many new developments and breakthroughs.

Deep Learning is all about Neural Networks. It uses a lot of data to teach a machine to do things that humans can do, such as seeing things and recognizing objects.

Arfika Nurhudatiana, Ph.D., an AI practitioner who works as a Data Scientist and is based in Jakarta, Indonesia, emphasizes this: “Deep learning extends machine learning by excluding manual feature extraction and directly learns from raw input data.”

Commenting further on Image Classification & Image Segmentation, she continues “One of the reasons for the rising popularity of R-CNN-based (Region-based Convolutional Neural Network) approach for object detection is due to its sweet combination of image segmentation and image classification. Among many others, several fields which require high precision image segmentation include medical imaging, manufacturing, and agricultural technology”

Machine Learning is a subset of AI. Wikipedia defines AI as “Intelligence exhibited by machines, rather than humans or other animals.” One sub-branch of Machine Learning is the Artificial Neural Network (ANN), a “mathematical model” of the biological human brain. The simplest ANN (or just Neural Network) has 1 input layer, 1 hidden layer and 1 output layer.

Each node in the hidden layer performs quite simple operations (mainly matrix multiplications and additions). It takes inputs from the nodes of the previous layer, adjusts them with its own weights and bias, then does some calculations to produce an output that helps solve a problem by approximation. In Deep Learning there are many hidden layers (more than one, and possibly tens or hundreds), depending on which neural network architecture we are discussing.
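The multiply-and-add view of a layer can be sketched in a few lines of NumPy. This is only an illustration for the simplest network (one hidden layer); the weights, biases and the choice of ReLU as the activation are arbitrary assumptions, not values from the article's model.

```python
import numpy as np

def relu(x):
    """A common activation function: keep positive values, zero out the rest."""
    return np.maximum(x, 0.0)

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a network with one hidden layer."""
    hidden = relu(x @ w_hidden + b_hidden)  # weighted sum plus bias, then activation
    return hidden @ w_out + b_out           # output layer: another weighted sum

# One sample with 2 input features, 3 hidden nodes, 1 output (all values made up).
x = np.array([[1.0, 2.0]])
w_hidden = np.array([[1.0, -1.0, 0.5],
                     [0.5, 1.0, -1.0]])
b_hidden = np.array([0.0, 0.0, 1.0])
w_out = np.array([[1.0], [1.0], [1.0]])
b_out = np.array([0.5])

y = forward(x, w_hidden, b_hidden, w_out, b_out)
print(y)  # hidden activations are [2, 1, 0], so the output is 2 + 1 + 0 + 0.5 = 3.5
```

A deep network repeats the hidden-layer step many times, each layer feeding the next.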

Deep Learning is the current name for ANNs that learn using more than 1 hidden layer (8 layers in AlexNet, and 34, 50 and 101 layers in ResNet-34, ResNet-50 and ResNet-101 respectively). Initially, machine learning was categorized into Supervised Learning (labelled data) and Unsupervised Learning (non-labelled data). More recently a 3rd category has emerged: Reinforcement Learning (action-based learning driven by defined rewards).

These three categories of Machine Learning are briefly summarized in Table-1.

Table-1: Three categories of Machine Learning.

Recent advancements include MAML, Model Agnostic Meta Learning (Pieter Abbeel, 2019), in which a model can learn new things from just a few new samples, given that it has been trained on similar ones before (whether the task is classification, object recognition, action recognition or something else).

Machine learning offers the ability to extract knowledge and patterns from a series of observations. This is done through mathematical optimization by approximation (pattern recognition or exploration of many possibilities). In supervised learning, minimizing the error (for example, the mean difference between the expected results and the actual predictions, according to a selected measurement metric) is very important for getting the best possible learning result.
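As a small worked example of "the error" in supervised learning, the sketch below computes the mean squared error, one common choice of metric (the article does not say which metric its example has in mind). The numbers are invented purely to show that predictions closer to the expected results give a lower error.

```python
def mean_squared_error(expected, predicted):
    """Average of the squared differences between expected and predicted values."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected)

expected = [1.0, 0.0, 1.0, 1.0]          # labels from the dataset
good_predictions = [0.9, 0.1, 0.8, 1.0]  # close to the labels
poor_predictions = [0.4, 0.6, 0.5, 0.3]  # far from the labels

good_error = mean_squared_error(expected, good_predictions)
poor_error = mean_squared_error(expected, poor_predictions)
print(good_error, poor_error)
```

Training adjusts the weights and biases so that this kind of number goes down.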

Deep Learning, as a subset of Machine Learning, gives machines a better ability to mimic humans: recognizing images (image classification in supervised learning), seeing what kinds of objects are in an image (object detection in supervised learning), and teaching a robot to understand and interact with the world around it (reinforcement learning), for instance. Deep learning is the state of the art in Machine Learning. Many applications are possible, including areas such as Computer Vision and Natural Language Processing/Understanding, which have achieved a high degree of accuracy.

A Cognitive Deep Learning Use-Case

Semantic Image Segmentation of 32 classes based on CamVid database, a Supervised Learning

The discussion in this article is organized into three sections, as follows. It covers a use-case that processes the CamVid dataset to train a model for Semantic Image Segmentation, which assigns each pixel in an image to one of 32 classes (categories), by using libraries.

Section-1: Environment & Dataset Preparation.

a. Environment Preparation in Google Cloud Platform (GCP)

b. Dataset from CamVid database

Section-2: Modeling

Section-3: Inferencing

The 32-classes are defined as ‘Animal’, ‘Archway’, ‘Bicyclist’, ‘Bridge’, ‘Building’, ‘Car’, ‘CartLuggagePram’, ‘Child’, ‘Column_Pole’, ‘Fence’, ‘LaneMkgsDriv’, ‘LaneMkgsNonDriv’, ‘Misc_Text’, ‘MotorcycleScooter’, ‘OtherMoving’, ‘ParkingBlock’, ‘Pedestrian’, ‘Road’, ‘RoadShoulder’, ‘Sidewalk’, ‘SignSymbol’, ‘Sky’, ‘SUVPickupTruck’, ‘TrafficCone’, ‘TrafficLight’, ‘Train’, ‘Tree’, ‘Truck_Bus’, ‘Tunnel’, ‘VegetationMisc’, ‘Void’, and ‘Wall’.

With a basic understanding of both the concepts and, to some extent, the practice, we can better appreciate and understand the AI-related products and solutions available in the market, such as the various IBM Watson offerings known as “the Artificial Intelligence for Business”.

Let’s start with the walkthrough.


1.a. Environment Preparation in Google Cloud Platform

This article is based on a recent deep learning class (late 2018 to early 2019, course v3) taught at the University of San Francisco by Jeremy Howard, Kaggle’s #1 competitor for 2 years in a row and founder of, one of the leading deep learning libraries. Kaggle is a well-known platform for machine learning competitions, where participants keep improving and inventing better algorithms (with million-dollar rewards for selected tough, world-class challenges). Jeremy delivered the course along with Rachel Thomas, Director of the USF Center for Applied Data Ethics and also co-founder of

The class suggests using a GPU (Graphics Processing Unit) to run our deep learning modeling. Following this approach, we use a virtual server that is available in the cloud (illustration-3a): a Google Cloud Platform (GCP) Compute Engine instance, charged per hour. The state of the NVIDIA GPU is shown in illustration-4a (idle) and illustration-4b (doing modeling, processing the neural network), obtained by running the ‘nvidia-smi’ command on the remote virtual server once we have logged in.

Illustration-3a: A Virtual Server for this article that is running on Google Cloud Platform (GCP) Compute Engine equipped with GPU, that is ready and operational (started).
Illustration-3b: from local computer (Windows Subsystem for Linux) on Windows 10, connect to Google Cloud Platform (GCP) compute engine using ssh (secure shell)

As we can see, the platform for our experiment is a Debian distribution of the Linux operating system, equipped with one fairly high-end NVIDIA Tesla P4 GPU and running on GCP Compute Engine. We could use similar IaaS (Infrastructure as a Service) cloud services, such as IBM Watson Studio on IBM Cloud Platform, Amazon Web Services Elastic Compute Cloud (AWS EC2) or the Microsoft Azure Cloud Compute platform.

Illustration-4a: State of a single NVIDIA GPU (Tesla P4) with no process running. GPU utilization is at 5%, the temperature is measured at 35°C, power consumption is 22W of 75W max, and GPU RAM usage is 0MB of 7611MB.
Illustration-4b: State of a single NVIDIA GPU (Tesla P4) when running neural network computations. GPU utilization is at 99%, the temperature is measured at 42°C, power consumption is 58W of 75W max, and GPU RAM usage is 3577MB of 7611MB.

Non-cloud (on-premise) solutions are also available, such as the IBM POWER (Performance Optimized with Enhanced RISC) Accelerated Computing (AC922) platform equipped with high-end NVIDIA Tesla V100 GPUs, as well as a variety of GPU-powered, Intel x86 CPU-based platforms. RISC (Reduced Instruction Set Computing) is a type of computer architecture.

The process starts by powering-up our defined server in GCP Compute Engine (illustration-3a), and once it is started we can do ssh (secure shell) login to our virtual server that is running on GCP (illustration-3b).

Once everything is set up, we can start using Jupyter Notebook to enter our Python code and experience deep learning by pointing our browser to http://localhost:8080/tree/, then navigating to the directory where our .ipynb file resides (as in illustration-5). Jupyter Notebook is an interactive development environment typically used by data scientists for machine learning, and the Python programming language is popular among data scientists.

Illustration-5: A quick overview of the purpose of doing Semantic Image Segmentation (based on CamVid database) with deep learning. A single library with multiple functionalities (in this case we are using: for computer vision functionalities with callbacks and some utilities) is loaded with an import, using the Python programming language in the Jupyter Notebook interactive development environment. The backend system runs on a public cloud, GCP Compute Engine, and is accessed through secure shell login (ssh) from a local computer running Windows Subsystem for Linux (WSL) on Windows 10. By channeling the backend system through ssh, we are able to access Jupyter Notebook from the local browser, as if the server were running locally.

Note that we can choose to use our existing CPU-only (Central Processing Unit) laptop; that is perfectly fine. However, the process will be significantly slower (about 10–20 times slower or more, depending on which CPU and GPU we are comparing). Modeling that takes just a few minutes on a GPU can take hours on a CPU. Modeling that takes a few hours, days or even weeks on a GPU can take days, weeks or even months on a CPU.

This is the power of the parallel processing built into GPUs for complex computations that consist mostly of matrix operations (matrix multiplications and additions, as in linear algebra) as well as the first-order partial derivatives computed in the back-propagation algorithm. In an enterprise-level configuration, such as with IBM POWER AC922 servers, we can scale out to multiple servers with multiple GPUs each to speed up the modeling significantly. At least one configuration has been tested with 64 servers with 4 GPUs each (in 2018), for 256 GPUs in total, configured using DDL (Distributed Deep Learning) for HPC (High Performance Computing).

A different hardware approach is to use Tensor Processing Unit (TPU), that is developed by Google.

On IBM’s recent approach for High Performance Computing (HPC) environments, Renita Leung, P. Eng, an IBM Global Solution Architect and Data Scientist, shared news on the availability of EDT and EDI (Renita, 2019). Elastic Distributed Training (EDT) is an approach for doing large-scale modeling across many GPUs. Aligned with that, IBM’s approach also includes Elastic Distributed Inference (EDI) for inference (runtime) across many GPUs.

1.b. Preparing Dataset

Now, as the environment is ready, we need to prepare the dataset. We use training dataset from CamVid database (Brostow, Shotton, Fauqueur, Cipolla, 2008a) and test dataset from Google Images Search (manually generated).

Training (and Validation) Dataset from CamVid database

From its site, CamVid dataset is described as follows.

The Cambridge-driving Labeled Video Database (CamVid) is the first collection of videos with object class semantic labels, complete with metadata.

The database provides ground truth labels that associate each pixel with one of 32 semantic classes.

The database addresses the need for experimental data to quantitatively evaluate emerging algorithms.

While most videos are filmed with fixed-position CCTV-style cameras, our data was captured from the perspective of a driving automobile.

The driving scenario increases the number and heterogeneity of the observed object classes.

Illustration-6a and illustration-6b show the Python code that retrieves an image from the CamVid database and then visualizes it. The size used for processing is set to 50% of src_size.

Illustration-6a: Preparing and visualizing an image with its segmented data from CamVid database, for training dataset. (cont.).
Illustration-6b: Preparing and visualizing an image with its segmented data from CamVid database, for training dataset.

Illustration-7 visualizes images from the CamVid database along with their valid labels.

In the CamVid database, each image file has a corresponding label file that defines the semantic segmentation of that image at every pixel. The label file holds one index value per pixel, and each index points to one of the 32 codes (defined classes) listed in the codes.txt file.
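The relationship between a label file and the class names can be sketched with NumPy indexing. The five class names and the 3x4 mask below are a made-up excerpt for illustration only; the real codes.txt lists all 32 names, and its actual ordering is not reproduced here.

```python
import numpy as np

# Hypothetical excerpt: in practice this list would be read from codes.txt, one name per line.
codes = np.array(["Building", "Car", "Road", "Sky", "Tree"])

# Hypothetical 3x4 label mask: every entry is an index into `codes`.
mask = np.array([
    [3, 3, 3, 4],
    [0, 0, 4, 4],
    [2, 2, 1, 2],
])

# Indexing the code list with the mask gives the class name of every pixel.
names = codes[mask]
print(names)

# Count how many pixels belong to each class that appears in the mask.
values, pixel_counts = np.unique(mask, return_counts=True)
counts = {str(codes[v]): int(n) for v, n in zip(values, pixel_counts)}
print(counts)
```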

Illustration-7: Visualizing data from CamVid database.
Illustration-8: Javascript code to run on browser to download URLs for Testing dataset. As well as the codes to download the images.

Test Dataset

Google Images for test dataset are selected using search keywords (in Indonesian language): “jakarta kondisi jalan utama mobil motor sepeda orang”, which is translated to be “jakarta condition street main car motorcycle bicycle person”.

We first generate our list of Google Images URLs by doing a Google Image Search in a browser, then download the URLs using JavaScript (use ctrl-shift-j in the browser to open a new window in which we can enter JavaScript commands, as in illustration-8). We limit the list to 500 URLs at most.

Once the list of files for the test dataset has been created, Python code processes it to download the actual images. Those images can then be pruned manually. The process is simple: we use vnc to log in remotely and browse the images, then remove every image that we think is not suitable for testing, using the CamVid database as the reference.
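The "at most 500 URLs" step could look like the sketch below. It is not the code from illustration-8; select_urls() is a hypothetical helper, the URLs are placeholders, and the actual download step is left out because it needs network access.

```python
def select_urls(lines, max_urls=500):
    """Keep unique http(s) URLs in their original order, up to max_urls entries."""
    seen = set()
    selected = []
    for line in lines:
        url = line.strip()
        # Skip blank lines, non-URLs and duplicates.
        if not url.startswith(("http://", "https://")) or url in seen:
            continue
        seen.add(url)
        selected.append(url)
        if len(selected) == max_urls:
            break
    return selected

# Placeholder lines standing in for the file exported from the browser.
raw_lines = [
    "https://example.com/street-1.jpg\n",
    "\n",
    "https://example.com/street-2.jpg\n",
    "https://example.com/street-1.jpg\n",  # duplicate, dropped
    "not a url\n",
]
urls = select_urls(raw_lines)
print(urls)
```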

The data preparation is done. We are now ready to move to the next stage: Modeling.

Section-2: Modeling


First of all, we define how the accuracy of the model will be computed, then define the neural network architecture. Illustration-9a shows the Python code within Jupyter Notebook, in which we define the acc_camvid() function to calculate the accuracy of our model, then prepare the base model for training by calling the unet_learner() function and assigning the result to an object called learn (using a Convolutional Neural Network (CNN)-based architecture called ResNet-34).
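A pixel-wise accuracy metric in the spirit of acc_camvid() can be sketched in NumPy as follows. The real function appears only in illustration-9a, so this is an assumption about its behaviour: here the accuracy is the share of correctly predicted pixels, skipping pixels whose ground-truth class is 'Void'. The name pixel_accuracy(), the ignore_index parameter and the 3-class toy data are all hypothetical.

```python
import numpy as np

def pixel_accuracy(scores, target, ignore_index):
    """Share of correctly predicted pixels, skipping pixels labelled ignore_index.

    scores: (C, H, W) per-class scores; target: (H, W) ground-truth class indices.
    """
    predicted = scores.argmax(axis=0)
    keep = target != ignore_index          # mask out the ignored class
    if not keep.any():
        return 0.0
    return float((predicted[keep] == target[keep]).mean())

# Toy data with 3 classes, where class 2 plays the role of 'Void'.
target = np.array([[0, 1],
                   [2, 1]])
predicted_classes = np.array([[0, 1],
                              [0, 0]])
# Build one-hot scores of shape (C, H, W) from the predicted classes.
scores = np.eye(3)[predicted_classes].transpose(2, 0, 1)

accuracy = pixel_accuracy(scores, target, ignore_index=2)
print(accuracy)  # 2 of the 3 non-Void pixels are correct
```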

Then, we can start training the dataset (modeling), in this case for Semantic Image Segmentation. The modeling will produce a model, such that when given an image, it can predict an expected segmentation (output) within a certain confidence level.

A model is an approximation on the relationship between input and output, based on dataset.

Training is an iterative, time-consuming process (and a costly one, especially in GPU cost) that needs to be repeated again and again until we get a satisfactory result. Between these trials we can adjust a few parameters (the ones we call hyperparameters), with the expectation of minimizing the error between the prediction made during modeling and the label from the dataset (the ground truth), hence increasing accuracy, which is at least one of the measurement metrics we need to pay attention to in Image Segmentation.


Hyperparameters (like independent variables in statistics) in a neural network (or deep neural network/deep learning) include the learning rate, which determines how fast the gradient descent algorithm learns, the number of layers, the number of neurons in each layer, the number of epochs, the number of mini-batches, and the lambda used for regularization when minimizing the given cost function. In short, they are external variables that are set before training in order to produce optimized dependent variables in the neural network “model”, namely its weights and biases.

As a data scientist, one of the best practices to follow when experimenting is to start with a small set of data for efficiency (time and cost), then apply the algorithm to the larger, full dataset (as available) once we are satisfied with the code we are working on. Modeling with training and validation data from the full dataset typically requires a great amount of time, meaning more GPU time to spend.

The more GPU time we use, the higher the processing cost. Experimenting initially with a subset of the full dataset while adjusting a few hyperparameters makes effective use of GPU time, hence reducing the total cost if, for example, we are “renting” a GPU-equipped virtual server in the cloud by the hour.

As we are using a high-level neural network library (based on Facebook’s PyTorch), the code is greatly simplified compared with using the base framework directly. We just need to focus on the problem and let the appropriate functions in the library do the complex processing involved in modeling (that is, training with our training dataset, validating with the validation dataset and finally generating a model).

2.a. Training (Initial, with the Part of Dataset)

The library provides a function called lr_find() to find a range of learning rate values that are suitable for minimizing our error_rate. Note that the default learning rate in the library is set to 0.003 (3x10⁻³), and in this case we can run the fit_one_cycle() function for a few epochs before using lr_find().

We call the library’s function to find a learning rate to start with, as in illustration-9b. We then select 3x10⁻³ as our initial learning rate, based on the result of the lr_find() function.

Illustration-9a: Defining accuracy function for modeling, along with chosen neural network architecture (unet) for semantic segmentation.
Illustration-9b: lr_find() to find a range of suitable learning rate values for the next training (optimization).
Illustration-9c: Train our neural network for Semantic Image Segmentation with the library’s fit_one_cycle() function using the ResNet-34 neural network architecture, for an initial 10 epochs. The measurement metric is accuracy, computed with acc_camvid(). Save the current training stage and name it “stage-1”, so we can reload it later. The 1st hyperparameter is 10 (the number of epochs, an epoch being one full iteration over the whole dataset), the 2nd is lr, set to the previously defined learning rate value, and the 3rd is 0.9, meaning that the learning rate gradually increases during the first 90% of the training iterations, then decreases.

Learning rate (Wikipedia) is a step size in machine learning, a hyperparameter that determines to what extent newly acquired information overrides old information. A learning rate that is too high makes the learning jump over minima, while one that is too low either takes too long to converge or gets stuck in an undesirable local minimum.
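The effect of the step size can be seen on a deliberately tiny problem: minimizing f(w) = w² with plain gradient descent. This toy function has a single minimum at w = 0, so it only illustrates slow convergence and overshooting, not local minima; the three learning rates are arbitrary values chosen for the demonstration.

```python
def gradient_descent(lr, steps=20, start=5.0):
    """Minimize f(w) = w**2, whose gradient is 2*w, and return the final w."""
    w = start
    for _ in range(steps):
        w -= lr * 2 * w  # step against the gradient, scaled by the learning rate
    return w

too_low = gradient_descent(lr=0.001)    # barely moves away from the start
reasonable = gradient_descent(lr=0.1)   # gets close to the minimum at 0
too_high = gradient_descent(lr=1.1)     # overshoots further on every step

print(too_low, reasonable, too_high)
```

After 20 steps the low rate is still near the starting value, the moderate rate is close to zero, and the high rate has diverged.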

The result of lr_find() suggests setting our learning rate in the range between 3x10⁻⁴ and 3x10⁻³ (the stable range in the graph just before the loss starts going up).

Once the base model for training is defined, we can start the training (illustration-9c) by calling the library’s fit_one_cycle() function with the hyperparameters 10, lr and 0.9. This runs 10 epochs with the learning rate defined by lr, and the learning rate gradually increases during the first 90% of the training iterations, then decreases.
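The shape of such a schedule can be sketched in plain Python. This is a simplified illustration, not the library's implementation: one_cycle_lr() is a hypothetical function, the cosine shape and the starting value of max_lr / 25 are assumptions, and only pct_start=0.9 and the learning rate of 3x10⁻³ come from the text above.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.9, div_factor=25.0):
    """Learning rate at a 0-based step of a simplified one-cycle schedule."""
    warmup_steps = max(1, int(total_steps * pct_start))
    start_lr = max_lr / div_factor
    if step < warmup_steps:
        # Rising phase: cosine ramp from start_lr up to max_lr.
        progress = step / warmup_steps
        return start_lr + (max_lr - start_lr) * (1 - math.cos(math.pi * progress)) / 2
    # Falling phase: cosine decay from max_lr toward zero.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * (1 + math.cos(math.pi * progress)) / 2

# 100 training iterations in total, e.g. 10 epochs of 10 mini-batches each.
schedule = [one_cycle_lr(s, total_steps=100, max_lr=3e-3) for s in range(100)]
peak_step = schedule.index(max(schedule))
print(peak_step, schedule[0], schedule[peak_step], schedule[-1])
```

With pct_start=0.9 the peak sits at iteration 90 of 100; lowering pct_start moves the peak earlier and leaves more iterations for the decreasing phase.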

One pass of training the neural network over the full dataset is called 1 epoch; initially, size is set to 50% of src_size. Training and validation can be repeated several times to improve the accuracy, although at some point the accuracy may decrease. It is therefore suggested to save the generated file (model) for each epoch (the saved model contains the learned weights and the network architecture for all connected layers in the model).

We observe that, with all the base hyperparameters set (such as learning rate and measurement metrics), the 1st (epoch 0), 3rd, 5th, 7th, 8th, 9th and 10th of the first 10 epochs give accuracies (acc_camvid()) of 82.81%, 83.30%, 86.97%, 86.40%, 89.04%, 85.54% and 87.04% respectively.

We save the result generated at this stage and call it “stage-1”. We then review how we are doing so far (illustration-10).

Illustration-10: Following the initial 10 epochs (stage-1), we see how our model is performing.

2.b. Training Optimization

Well, maybe we can push our last accuracy of 87.04% higher. With pct_start now set to 80% and an adjusted learning rate, we continue training for 12 more epochs (illustration-11), starting from the saved stage-1. The accuracies for the last 5 epochs are as follows: 90.17%, 89.83%, 86.02%, 88.07% and 89.77% respectively.

We save the result generated at this stage and call it “stage-2”. We then review how we are doing so far (illustration-11).

Illustration-12 shows the paired result, ground truth (images from our data) and predicted images (the result of prediction with our current model at this stage). Are we satisfied? It seems that we can still improve our model to be better.

Illustration-11: Further optimize our neural network with the library’s fit_one_cycle() function using the ResNet-34 neural network architecture, for an additional 12 epochs. The measurement metric is accuracy, computed with acc_camvid(). Save the current training stage and name it “stage-2”, so we can reload it later.
Illustration-12: Comparing ground truth and predicted segmentation following further optimization with 12 epochs (stage-2), we see how our model is performing so far.

2.c. Training with full Dataset

As we are wrapping-up our initial findings with a subset of dataset, we are ready to go with all the dataset that we have. We then prepare the training with the full dataset (size = src_size), maximize the batch size as allowed by our current GPU configuration, then load the previously saved stage, stage-2 (as shown in Illustration-13). To see how we should set our lr this time, we run lr_find() again (illustration-14).

Illustration-13: This time we are going big with full available dataset by setting the size of data to be trained to be full src_size rather than only half (50%) as before (noted size = src_size // 2 is now becoming size = src_size). Prepare to do training starting with previously saved stage by reloading “stage-2”.
Illustration-14: Run the learning rate finder again to determine the learning rate value we should set next. Based on this, we decide to set our next learning rate to 1x10⁻³.

Based on the result of lr_find(), we decide to set the learning rate to 1x10⁻³ (illustration-15). We run learn.fit_one_cycle() again, save the result as stage-1-big, modify the learning rate for fine tuning, then run learn.fit_one_cycle() once more.

Illustration-15: Further optimize our neural network with additional 10 epochs (stage-2-big).

We observe that, across these 10 epochs, the 1st (epoch 0), 2nd, 3rd, 8th, 9th and 10th give accuracies of 91.91%, 92.47%, 91.09%, 91.72%, 92.21% and 92.21% respectively. This is quite a significant improvement over the last run.

We save the result generated at this stage under the filename “stage-2-big”. We review how we are doing so far in illustration-16.

Illustration-16: Comparing ground truth and predicted segmentation following further optimization, we see how our model is performing so far (stage-2-big).

Section-3: Inferencing


While it’s good to finally have a trained model, the process does not stop here. The model needs to be put to actual work by feeding it new data and having it make predictions (expecting that the predicted results will be consistent with the data the model was trained on). This is called the Inferencing stage (run time).

AI, including inferencing, can be part of a larger business process, such as Business Process Management (BPM) within an Enterprise AI. It can also run as a server process accessed by external applications such as a mobile app or a web-based app, or even by a subprocess within an external application somewhere in a multi-cloud or hybrid cloud environment.

A generated deep learning neural network model (we can just say: model) can be deployed for inferencing in many ways (in Cloud, At the Edge <AI on IoT edge>, On Mobile devices, etc.), depending on the kind of application we are targeting. Illustration-22 shows a typical AI data pipeline, where data flows through 3 stages: 1. data preparation, 2. modeling and 3. deployment/inferencing. Stage-1 and stage-2 are basically the development stage, while stage-3 is the runtime stage. Running a model (inferencing) is the final stage, in which we can select the type of deployment according to requirements.

Illustration-17 and illustration-18 show a few sample test images (different set of images, not included in dataset) that we pass through the model in which they are identified by segment according to classes they should belong to.

Illustration-17: An original image and its segmented versions, processed using semantic image segmentation in deep learning. The segmented image is produced by a model with 92.15% training accuracy, at 512x512 and 1024x1024 pixels image resolution for comparison purposes.
Illustration-18: An original image and its segmented versions, processed using semantic image segmentation in deep learning. The segmented image is produced by a model with 92.15% training accuracy, at 512x512 and 1024x1024 pixels image resolution for comparison purposes.

As mentioned before, each pixel of a segmented image carries class information, along with its probability, for one of the 32 defined classes: ‘Animal’, ‘Archway’, ‘Bicyclist’, ‘Bridge’, ‘Building’, ‘Car’, ‘CartLuggagePram’, ‘Child’, ‘Column_Pole’, ‘Fence’, ‘LaneMkgsDriv’, ‘LaneMkgsNonDriv’, ‘Misc_Text’, ‘MotorcycleScooter’, ‘OtherMoving’, ‘ParkingBlock’, ‘Pedestrian’, ‘Road’, ‘RoadShoulder’, ‘Sidewalk’, ‘SignSymbol’, ‘Sky’, ‘SUVPickupTruck’, ‘TrafficCone’, ‘TrafficLight’, ‘Train’, ‘Tree’, ‘Truck_Bus’, ‘Tunnel’, ‘VegetationMisc’, ‘Void’, and ‘Wall’.

We then write a custom Python function to extract this information (as shown in illustration-19a and illustration-19b). The function lets us set any point (coordinate) in a segmented image, then extracts the class information of the 10 pixels before and the 10 pixels after that point.
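A function of this kind could be sketched as follows. This is not the code from illustration-19a: pixels_around() is a hypothetical name, it assumes the model output is available as per-class probabilities of shape (classes, height, width), it walks along a single row, and it takes "10 before and 10 after" to mean the 20 columns from col-10 up to col+9 (clipped at the image border). The two-class toy image is made up.

```python
import numpy as np

def pixels_around(probs, codes, row, col, n=10):
    """Return (row, col, class name, probability) for n pixels before and n pixels
    from the given point onward, along one row of a (C, H, W) probability array."""
    width = probs.shape[2]
    start, stop = max(0, col - n), min(width, col + n)  # clip to the image border
    result = []
    for c in range(start, stop):
        label = int(probs[:, row, c].argmax())          # most likely class at this pixel
        result.append((row, c, codes[label], float(probs[label, row, c])))
    return result

# Toy output: 2 hypothetical classes on a 4x32 image, 'Road' on the left, 'Sky' on the right.
codes = ["Road", "Sky"]
probs = np.zeros((2, 4, 32))
probs[0, :, :16], probs[1, :, :16] = 0.9, 0.1
probs[0, :, 16:], probs[1, :, 16:] = 0.2, 0.8

sample = pixels_around(probs, codes, row=2, col=16)
for entry in sample:
    print(entry)
```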

Illustration-19a: The Python code to print the label and probability of 20 selected pixels, starting from the middle point of a segmented image.
Illustration-19b: The Python code to view segmented images along with its selected 20 pixels.

When we run the function with a defined point, we can visualize the pixels being extracted as well as the class information of each extracted pixel. Illustration-20a and Illustration-20b show one segmented image, while Illustration-21a and Illustration-21b show another segmented image being visualized and extracted.

Illustration-20a: A sample of 20 selected pixels out of the 512x512 pixels of a segmented image (each identified as an object class at a certain row & column, along with its probability), at 92.15% accuracy.
Illustration-20b: An original image and its segmented version, processed using semantic image segmentation in deep learning, with a sample of 20 selected pixels out of the 512x512 pixels of the segmented image (92.15% accuracy).
Illustration-21a: A sample of 20 selected pixels out of the 512x512 pixels of a segmented image (each identified as an object class at a certain row & column, along with its probability), at 92.15% accuracy.
Illustration-21b: An original image and its segmented version, processed using semantic image segmentation in deep learning, with a sample of 20 selected pixels out of the 512x512 pixels of the segmented image (92.15% accuracy).

Inferencing at a glance

There are many ways to do inferencing. Although tools like IBM PowerAI Vision on IBM WMLA have an integrated deployment engine out-of-the-box, a typical process is to export the trained model to an external environment and do the inferencing there. Inferencing can be done on-premise, on-cloud, or in combination; these are simply deployment options that we select by considering the reliability and scalability that fit the purpose of the deployment (cost is, of course, also an important factor to consider here).

Illustration-22: A typical AI data pipeline stages: 1. Data Preparation, 2. Modeling, 3. Inferencing.

A typical deployment approach looks like this: given a model, an external application passes new data to it for prediction. Before being given access to the inference engine, the external application can be authenticated in some way, e.g. through an assigned API (Application Programming Interface) key, typically generated by a server running in the same environment as the inference engine.
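One possible shape for that API-key check is sketched below, using only the Python standard library. The ApiKeyRegistry class, its method names and the client id are hypothetical. A production server would more likely store a hash of each key rather than the key itself.

```python
import hmac
import secrets


class ApiKeyRegistry:
    """Hypothetical key server living next to the inference engine."""

    def __init__(self):
        self._keys = {}  # client id -> issued API key

    def issue_key(self, client_id):
        # Generate an unguessable, URL-safe key for this client.
        key = secrets.token_urlsafe(32)
        self._keys[client_id] = key
        return key

    def is_authorized(self, client_id, presented_key):
        expected = self._keys.get(client_id)
        if expected is None:
            return False
        # Constant-time comparison avoids leaking key content via timing.
        return hmac.compare_digest(expected.encode("utf-8"),
                                   presented_key.encode("utf-8"))


registry = ApiKeyRegistry()
api_key = registry.issue_key("external-app-1")
print(registry.is_authorized("external-app-1", api_key))      # valid key
print(registry.is_authorized("external-app-1", "wrong-key"))  # rejected
```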

External Application to Inference Engine

Before reaching the inference engine, incoming (compressed) data typically passes through the message pooling/queuing subsystem (we can deploy this on an asynchronous messaging platform using publish/subscribe methods, for example, to promote scalability).

We can “publish to a topic”, e.g. to a request_message topic, when sending data from an external application to the messaging platform. At the other end, the application logic “subscribes to the request_message topic”, so it receives the data as soon as it arrives, decompresses it, and passes it to the inference engine. Once the inference engine generates the predicted outcome, the application logic “publishes the result back to a response topic”, e.g. to a response_message topic on the messaging platform.

Inference Engine to External Application

Once the result reaches the messaging platform, it is passed back to the external application that “subscribes to the response_message topic” for further processing, e.g. by combining the result from the inference engine with other application states to execute some actions.
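The round trip through the request_message and response_message topics can be sketched as follows. This is only an in-memory illustration: the MessageBroker class stands in for a real messaging platform (one queue per topic, so a single subscriber per topic), fake_inference is a placeholder for the inference engine, and JSON plus zlib are assumed choices for serialization and compression.

```python
import json
import queue
import threading
import zlib


class MessageBroker:
    """Tiny in-memory stand-in for a publish/subscribe messaging platform."""

    def __init__(self):
        self._topics = {}
        self._lock = threading.Lock()

    def _topic(self, name):
        with self._lock:
            return self._topics.setdefault(name, queue.Queue())

    def publish(self, topic, payload):
        self._topic(topic).put(payload)

    def receive(self, topic, timeout=5.0):
        # Blocks until a message arrives on the topic (or the timeout expires).
        return self._topic(topic).get(timeout=timeout)


def fake_inference(data):
    # Placeholder for the real inference engine.
    return {"label": "Road", "pixels": len(data["pixels"])}


def application_logic(broker):
    # Subscribe side: receive, decompress, infer, publish the result back.
    compressed = broker.receive("request_message")
    request = json.loads(zlib.decompress(compressed))
    result = fake_inference(request)
    broker.publish("response_message",
                   zlib.compress(json.dumps(result).encode("utf-8")))


broker = MessageBroker()
worker = threading.Thread(target=application_logic, args=(broker,))
worker.start()

# External application: publish compressed data, then wait for the response.
request = {"pixels": [0, 1, 2, 3]}
broker.publish("request_message",
               zlib.compress(json.dumps(request).encode("utf-8")))
response = json.loads(zlib.decompress(broker.receive("response_message")))
worker.join()
print(response)
```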

Note that using the messaging platform in asynchronous mode promotes scalability in handling multiple requests. In situations where high performance with low latency between requests and responses is really required, we may also do it synchronously rather than asynchronously. However, synchronous mode must be used carefully, as we may then need to build reliable application logic for message resend & recovery, which the queuing mechanism provides out-of-the-box in asynchronous mode.
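To give a feel for the extra logic that synchronous mode pushes into the application, here is a minimal resend sketch. The call_with_resend helper, the retry counts and the flaky_inference stand-in are all hypothetical. A real client would also have to deal with duplicate requests and partial failures.

```python
import time


def call_with_resend(send, payload, attempts=3, delay_seconds=0.1):
    """Call `send(payload)` synchronously, resending after a timeout."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send(payload)
        except TimeoutError as error:
            last_error = error
            if attempt < attempts:
                # Wait a little longer before each resend.
                time.sleep(delay_seconds * attempt)
    raise RuntimeError(f"no response after {attempts} attempts") from last_error


# Hypothetical inference endpoint that times out twice, then answers.
calls = {"count": 0}


def flaky_inference(payload):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("inference engine did not answer in time")
    return {"echo": payload}


result = call_with_resend(flaky_inference, "image-001", delay_seconds=0.01)
print(result, "after", calls["count"], "attempts")
```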

The set of application logic + inference engine may also be configured with multiple threads, so that it can handle multiple requests and perform multiple inferences in one pass within a process. Multiple application logic + inference engine sets may also be deployed as containers to promote scalability in processing multiple parallel requests. Limiting the number of threads within one virtual machine or one container prevents the system’s resources (CPU, RAM, GPU) from being exhausted within that virtualized environment.
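A bounded thread pool is one simple way to get that limit. The sketch below uses Python's ThreadPoolExecutor. The cap of 4 workers, the handle_request function and the sleep that stands in for an inference call are assumptions for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # assumed cap per container or virtual machine

stats = {"active": 0, "peak": 0}
stats_lock = threading.Lock()


def handle_request(request_id):
    # Track how many requests are being processed at the same time.
    with stats_lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    try:
        time.sleep(0.01)  # placeholder for the real inference call
        return request_id, f"result-{request_id}"
    finally:
        with stats_lock:
            stats["active"] -= 1


# 20 requests arrive, but at most MAX_WORKERS are processed concurrently.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(handle_request, range(20)))

print(f"handled {len(results)} requests, peak concurrency {stats['peak']}")
```

Requests beyond the cap simply wait in the pool's internal queue, so CPU, RAM and GPU usage stay bounded inside the container.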

What’s Next?

Adoption of Machine Learning (ML) is accelerating rapidly, especially with the availability of cloud-based platforms (with GPUs) to experiment on. The common steps for doing Deep Learning are actually quite simple:

1. Data Preparation: Prepare the right datasets, then split them into training & validation data.

2. Modeling: Select a neural network architecture, train it using the dataset, then generate the model.

3. Inferencing: Deploy the model.
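The split in step 1 can be sketched in a few lines. The 80/20 ratio, the fixed seed, the split_dataset name and the file names below are assumptions for illustration. Deep learning frameworks usually ship their own helpers for this.

```python
import random


def split_dataset(items, validation_fraction=0.2, seed=42):
    """Shuffle the items and split them into (training, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_valid = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


# Hypothetical file names standing in for the labeled images.
image_files = [f"frame_{i:04d}.png" for i in range(100)]
train_files, valid_files = split_dataset(image_files)
print(len(train_files), "training images,", len(valid_files), "validation images")
```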

Preparing the right datasets has always been the challenge in deep learning; this can take weeks or even months. Given the right resources & skill set (data scientists and computing power), modeling should be a straightforward task that can be done in hours, days, or a few weeks for a very complex, big model. Once a model has been created, deployment should be “easier” to implement, e.g. in web or mobile apps.

For modeling, cloud-based servers that support GPUs have been available for some time, starting from as low as about USD 1 per hour for an entry-level configuration up to a few thousand USD per hour for a very high-end configuration (multiple GPUs for parallelism). Note that although you can use CPU only, training will be significantly slower. The speed improvement with GPU (especially with a large dataset) may vary, but in general it ranges from 10 to 20 times.
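As a back-of-the-envelope illustration of these numbers, the sketch below turns a CPU training time into a GPU estimate. The 10 to 20 times range and the roughly USD 1 per hour entry price come from the paragraph above. The 100 CPU hours figure and the function name are made-up inputs for illustration only.

```python
def gpu_training_estimate(cpu_hours, speedup, usd_per_gpu_hour):
    """Return (gpu_hours, cost_usd) for a job that takes cpu_hours on CPU."""
    gpu_hours = cpu_hours / speedup
    return gpu_hours, gpu_hours * usd_per_gpu_hour


CPU_HOURS = 100.0       # hypothetical CPU-only training time
USD_PER_GPU_HOUR = 1.0  # entry-level price mentioned above

for speedup in (10, 20):
    hours, cost = gpu_training_estimate(CPU_HOURS, speedup, USD_PER_GPU_HOUR)
    print(f"{speedup}x speedup: {hours:.1f} GPU hours, about USD {cost:.2f}")
```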

To start exploring, especially for Inferencing, there are a few ways to get hands-on experience. Take for example the NVidia Jetson TX2 (announced in May 2017), which enables us to start using GPUs for Deep Learning (for USD 599), the more recent NVidia Jetson Nano (announced in March 2019 for just USD 129), or a higher-end version like the NVidia Jetson AGX Xavier (USD 999, for better performance).

What are you waiting for then? Let’s start by exploring some use-cases in this exciting area of AI.


Andi Sama, 2019, “AI Model Inferencing, Practical deployment approaches & considerations”, SWG Insight, Edisi Q4 2019, page 3–9.

Andi Sama et al., 2019a, “Image Classification & Object Detection”.

Andi Sama et al., 2019b, “Think like a Data Scientist”.

Andi Sama, 2019c, “Guest Lecturing on AI: Challenges & Opportunity”, Lecture to FEBUI, University of Indonesia.

Andi Sama et al., 2018, Deep Learning — Image Classification, Cats & Dogs — A Cognitive use-case: Implement a Supervised Learning for Image Classification, SWG Insight, Edisi Q1 2018.

Andi Sama et al., 2017, “The Future of Machine Learning: The State of Advancements in Deep Learning”, SWG Insight, Edisi Q4 2017, page 6–17.

Andrew Widjaya, Cahyati S. Sangaji, 2019, “Face Recognition, Powered by IBM Cloud, Watson & IoT on Edge”, SWG Insight, Edisi Q2 2019.

Brostow, Shotton, Fauqueur, Cipolla, 2008a, “Segmentation and Recognition Using Structure from Motion Point Clouds”, ECCV 2008.

Brostow, Shotton, Fauqueur, Cipolla, 2008b, “Semantic Object Classes in Video: A High-Definition Ground Truth Database”.

Fei-Fei Li, Justin Johnson, Serena Yeung, 2017, “CS231n: Convolutional Neural Networks for Visual Recognition”, Stanford University, Spring 2017.

Jeremy Howard, 2018, “Practical Deep Learning for Coders, v3”.

Pieter Abbeel, 2019, “Full Stack Deep Learning — Lecture 10: Research Directions”, Deep Learning Bootcamp, March 2019, Berkeley.

Renita Leung, 2019, “Watson ML Accelerator + Watson Studio + Watson ML Integrated Solution Example”, IBM Technical University, November 2019 @Bali, Indonesia.

Wikipedia, “Learning rate”.