Image Classification with Deep Learning, enabled by fast.ai framework
A Cognitive use-case, 4-classes Image Classification
Andi Sama — CIO, Sinergi Wahana Gemilang, with Arfika Nurhudatiana, Ph.D.
Humans can naturally sense their surroundings through various biological sensors: eyes for vision, ears for hearing, the nose for smell, and skin for touch. These incredible capabilities have been embedded in our bodies since the day we are born, and we use them, mostly unconsciously, every day. We take them for granted; they are simply there, ready for us to enjoy.
Machines, on the other hand, cannot simply be designed and implemented to mimic everything a normal human can do. Years of research have been devoted to this, and many new advances have emerged and keep coming in the last few years, especially in computer vision: new algorithms, new optimization methods, new high-speed hardware and the availability of big data have accelerated this area of study, with successful implementations in the real world and many more potential practical applications in the future.
This article, “Image Classification with Deep Learning, enabled by fast.ai framework: A Cognitive use-case, 4-classes Image Classification”, discusses image classification. A companion article, “Image Segmentation with Deep Learning, enabled by fast.ai framework: A Cognitive use-case, Semantic Segmentation based on CamVid dataset”, will discuss image segmentation, an implementation of computer vision with deep learning that extends object detection in images to a more granular level.
Illustration-1 shows correct predictions for a few images using deep learning for image classification. The input images are recognized as either ‘animeman’, ‘animewoman’, ‘realman’ or ‘realwoman’ based on a previously trained deep learning model with 86.07% accuracy. Deep learning is a type of machine learning that is attracting a great deal of attention right now.
Machine Learning & Deep Learning
The notable breakthrough in the field of computer vision using deep learning came in 2012, when an algorithm called the Convolutional Neural Network (a.k.a. CNN), trained with back-propagation, won the ImageNet competition “Large Scale Visual Recognition Challenge on Image Classification” by achieving an error rate of 16.4%, a significant improvement from the 2011 result of 25.8% (Fei-Fei Li, Justin Johnson, Serena Yeung, 2017). Since then (2012), that neural network has been known as AlexNet. Subsequent results in 2013, 2014 and 2015 were 11.7%, 6.7% and 3.57% respectively. The 2015 ImageNet result surpassed the human-expert error rate of 5.1%.
Illustration-2 shows a brief overview of the evolution and advancements in Artificial Intelligence (AI) since the 1950s. Deep Learning, powered by the backpropagation algorithm and part of Machine Learning within AI (with approaches such as supervised learning, unsupervised learning and reinforcement learning), has been the key factor in the current exciting AI advancements, supported by the availability of huge datasets (big data) as well as hardware accelerators such as GPUs (Graphic Processing Units), especially from NVidia.
A previous edition of SWG Insight (Andi Sama et al., 2017) briefly discussed the state of future advancements that are possible in Machine Learning, especially with Deep Learning. Follow-on articles (2018–2019) have discussed the topic further, and this will continue for a few more years to come, as the field is still exciting with many new developments and breakthroughs.
Machine Learning is a subset of AI. Wikipedia defines AI as “Intelligence exhibited by machines, rather than humans or other animals.” One of the sub-branches of Machine Learning is the Artificial Neural Network (ANN), a “mathematical model” of the human biological brain. The simplest ANN (or just Neural Network) has one input layer, one hidden layer and one output layer.
— — — — — — — — — — — — -
Deep Learning is all about Neural Networks. It uses a lot of data to teach the machine to do things that humans can do: to see things and be able to recognize objects, for example.
Arfika Nurhudatiana, Ph.D, a Data Scientist in Jakarta, Indonesia, emphasizes this: “Deep learning extends machine learning by excluding manual feature extraction and directly learns from raw input data.”
— — — — — — — — — — — — -
Each node in a hidden layer basically performs quite simple operations (mainly matrix multiplications and additions). It takes inputs from the previous nodes, adjusted with unique weights and biases (also coming from the previous nodes), then does some calculations (and measurements) to produce an output that solves a problem by approximation. In Deep Learning there are many hidden layers (more than one; there can be tens or hundreds of hidden layers) depending on which neural network architecture we are discussing.
Deep Learning is the current name for ANNs that learn by utilizing more than one hidden layer (8 layers in AlexNet, and 34, 50 & 101 layers in Resnet-34, Resnet-50 & Resnet-101 respectively). Initially, machine learning was categorized as Supervised Learning (labelled data) and Unsupervised Learning (non-labelled data). More recently, a third category has emerged: Reinforcement Learning.
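As a rough illustration (not code from the article's notebook), the computation at one hidden-layer node can be sketched in a few lines of Python with NumPy; the sizes, weights and choice of ReLU activation here are arbitrary examples:

```python
# A minimal sketch of what one fully-connected hidden layer computes:
# a matrix multiplication, the addition of a bias, and a non-linear activation.
import numpy as np

def hidden_layer(x, W, b):
    """Compute the output of one fully-connected hidden layer."""
    z = W @ x + b              # weighted sum of the inputs plus the bias
    return np.maximum(z, 0.0)  # ReLU activation as an example non-linearity

# Toy example: 3 inputs feeding 4 hidden nodes
x = np.array([0.5, -1.2, 0.3])      # inputs from the previous layer
W = np.random.randn(4, 3) * 0.01    # weights (learned during training)
b = np.zeros(4)                     # biases (learned during training)
print(hidden_layer(x, W, b))
```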
These three categories of Machine Learning are summarized in table-1.
Supervised Learning (Supervised Algorithm)
We teach the machine learning model what outputs are expected, given a set of inputs (training data). We expect that, given a sufficient amount of training data, it will be able to predict the output with a certain degree of accuracy (confidence level). The accuracy of the model is calculated from the number of correct predictions on untrained (unseen) data, compared to the labels those data should actually be given. In this case, the expected output is already known.
Unsupervised Learning (Unsupervised Algorithm)
Using various approaches, the machine learning model itself will try to recognize and learn from the input data (find relationships) without us telling it what to expect, e.g. automatic data classification. In this case, we do not know what output we will get from the generated model; it is all up to us, humans, to make sense of the new information.
Reinforcement Learning
The model learns to improve itself by being able to “sense” signals, automatically decide on an action, then compare the outcome against a “rewards” definition.
The latest advancements include Model-Agnostic Meta-Learning (MAML) (Pieter Abbeel, 2019), in which the model can learn new things from just a few new samples, given that it has been trained on similar ones before (whether for classification, object recognition or action recognition, for example).
Machine learning offers the ability to extract certain knowledge and patterns from a series of observations. This is done through mathematical optimization by approximation (pattern recognition or exploration of many possibilities). In supervised learning, minimizing the error (the mean difference between all expected results and actual observations, according to a selected measurement metric) is very important for getting the best possible learning result.
Deep Learning, as a subset of Machine Learning, gives machines a better capability to mimic humans in recognizing images (image classification in supervised learning), seeing what kinds of objects are in images (object detection in supervised learning), and teaching a robot (reinforcement learning) to understand and interact with the world around it, for instance. Deep learning is the state-of-the-art, emerging technology in Machine Learning. Many applications are possible, including the rapid advancements in Computer Vision and Natural Language Processing/Understanding with a high degree of accuracy.
A Cognitive Deep Learning Use-Case
Image Classification of AnimeMan/Woman and RealMan/Woman, a Supervised Learning
The discussion in this article is organized into three sections, as follows. It walks through a use-case: processing a Google Images dataset to train a model for image classification that recognizes images as one of 4 classes (‘animeman’, ‘animewoman’, ‘realman’ or ‘realwoman’) using the fast.ai library.
Section-1: Dataset & Environment Preparation.
a. Environment Preparation in Google Cloud Platform
b. Dataset from Google Images
Section-2: Modeling
Section-3: Inferencing
It is expected that, by being aware of and having a certain basic understanding of both the concepts and, to some extent, their practical application, we can better appreciate and understand the AI-related products & solutions available in the market, like the various IBM Watson offerings known as “the Artificial Intelligence for Business”.
Let’s start with the walkthrough.
SECTION 1: DATASET & ENVIRONMENT PREPARATION
1.a. Environment Preparation in Google Cloud Platform
This article is based on the recent deep learning class (late 2018 to early 2019) taught at the University of San Francisco by Jeremy Howard, Kaggle's #1 competitor for two years in a row and founder of fast.ai, one of the leading deep learning libraries. Kaggle is a recognized place to compete to be the best in the world in deep learning by continually improving and inventing better algorithms (with million-dollar rewards for selected world-class challenges). Jeremy delivered the course together with Rachel Thomas, Director of the USF Center for Applied Data Ethics and also co-founder of fast.ai. This article is based on fast.ai course v3.
The class suggests utilizing a GPU (Graphic Processing Unit) to run our deep learning modeling, and with this approach we are using a virtual server available in the cloud (illustration-3a): a Google Cloud Platform (GCP) Compute Engine (with per-hour charging). The configuration of the NVIDIA GPU is shown in illustration-4a (idle) and illustration-4b (doing modeling, processing the neural network), obtained by running the 'nvidia-smi' command on the remote virtual server once we have logged in.
As we can see, it is using a Debian distribution of the Linux operating system as the platform for us to experiment on, equipped with one quite high-end NVidia Tesla P4 GPU running on GCP Compute Engine. We can use similar IaaS (Infrastructure as a Service) cloud services such as IBM Watson Studio on the IBM Cloud Platform, Amazon Web Services Elastic Compute Cloud (AWS EC2) or the Microsoft Azure Cloud Compute platform.
A non-cloud (on-premise) solution is also available, such as the IBM POWER (Performance Optimized with Enhanced RISC) Accelerated Computing (AC922) platform equipped with NVidia's high-end Tesla V100 GPU, as well as a variety of GPU-equipped Intel x86 CPU-based platforms. RISC (Reduced Instruction Set Computing) is a type of computer architecture.
The process starts by powering-up our defined server in GCP Compute Engine (illustration-3a), and once it is started we can do ssh (secure shell) login to our virtual server that is running on GCP (illustration-3b).
Once everything is set up, we can start using Jupyter Notebook to enter our Python code and experience deep learning, by pointing our browser to http://localhost:8080/tree/ and navigating to the directory where our .ipynb file resides (as in illustration-5). Jupyter Notebook is an interactive development environment typically used by data scientists for machine learning, and the Python programming language is popular among data scientists.
Note that we can choose to use our existing CPU (Central Processing Unit)-only laptop; that is perfectly fine. However, the process will be significantly slower (about 10-20 times slower or more, depending on which CPU-GPU pair we are comparing). Modeling that takes just a few minutes on a GPU can take hours on a CPU, and modeling that takes a few hours or days on a GPU can take days or weeks on a CPU. This is the power of the parallel processing embedded in a GPU for complex computations, which consist mostly of matrix operations (matrix multiplications & additions, as in linear algebra) as well as the first-order partial derivatives in the back-propagation algorithm.
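As a small optional check (not part of the article's notebook), we can ask PyTorch, the framework underneath fast.ai, whether a GPU is visible before we start training:

```python
# Confirm that PyTorch can see the GPU before running the modeling.
import torch

if torch.cuda.is_available():
    print("GPU available:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected; training will fall back to the (much slower) CPU.")
```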
1.b. Preparing Dataset from Google Images
Now that the environment is ready, we need to prepare the dataset. We do this by generating our own dataset taken from Google Images: first do a Google Image Search from a browser, then download the URLs using JavaScript (press Ctrl-Shift-J in the browser to open the console where we can enter JavaScript commands, as in illustration-6a) for each category that we want to create, limiting each to a maximum of 500 URLs. We categorize our dataset into 4 classes: ‘animeman’, ‘animewoman’, ‘realman’ and ‘realwoman’.
Once all dataset categories have been created, we will have 4 files (illustration-6b) of 500 URLs each, which are then uploaded to GCP and processed (steps 1.1-1.3 in illustration-7) to download the actual images. The images in each directory are then curated to remove unwanted files that do not belong to the category.
The manual process (using VNC to log in remotely and browse the images) to remove unwanted images is simple: remove all images that do not belong to the category they are supposed to represent, e.g. there are both a man and a woman in the same image, the image is too small, the image contains too much background, or the object (‘animeman’, ‘animewoman’, ‘realman’ or ‘realwoman’) is not clearly distinguishable as a single object.
In our case, the final dataset for each category after the cleaning process contains about half of the originally downloaded number of files (about 50% of 500 images). Illustration-6c shows the actual number of files. Illustration-8 visualizes a few random images from each category.
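The exact notebook cells are shown in illustration-7; a rough sketch of steps 1.1-1.3 using the fastai v1 API (as in course v3) might look like the following, where the folder layout and URL file names are placeholders rather than the article's exact ones:

```python
# Download the images listed in each URL file, then drop files that cannot be opened.
from fastai.vision import *

path = Path('data/animereal')          # placeholder root folder for the dataset
classes = ['animeman', 'animewoman', 'realman', 'realwoman']

for c in classes:
    dest = path/c
    dest.mkdir(parents=True, exist_ok=True)
    # urls_<class>.txt is the file of image URLs collected from Google Images
    download_images(path/f'urls_{c}.txt', dest, max_pics=500)

# Remove files that cannot be opened as images
for c in classes:
    verify_images(path/c, delete=True, max_size=500)
```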
The data preparation is done. We are now ready to move to the next stage: Modeling.
SECTION 2: MODELING
First of all, we load all the images from storage (for all categories) into memory, scale the images so that the dimensions of all images are adjusted to 224x224 pixels (still in RGB color space, of course), split the dataset into training and validation sets at an 80%:20% ratio, and set the number of workers to 4, meaning the number of CPU processes to use (illustration-9). Note that if you run out of memory, this number-of-workers parameter can be reduced. The ImageDataBunch function, part of fast.ai, will do all of this for us in just one line of code.
Then, we can start training on the dataset (modeling), in this case for image classification. The modeling produces a model such that, when given an image, it can predict the expected classification (output) with a certain confidence level. A model is an approximation of the relationship between input and output, based on the dataset.
Training is an iterative, time-consuming process (and costly, especially the cost of GPUs), a process that needs to be repeated again and again until we get a satisfactory result. Between these trials we adjust a few parameters (the ones we call hyperparameters) with the expectation of minimizing the error between the expected result (the prediction during modeling) and the observable output (the label from the dataset, the ground truth), hence increasing accuracy, at least one of the measurement metrics we need to pay attention to in image classification.
As a data scientist, one of the best practices to follow when experimenting is to use a small set of data at the beginning for efficiency (time & cost), then apply our algorithm to the larger full dataset (as available) once we are satisfied with the code we are working on. Modeling using training and validation data with a full dataset requires a great amount of time, meaning more GPU time, and the longer we use GPU time, the higher the processing cost. The practice of experimenting with a small dataset makes effective use of the GPU, hence reducing the cost if we are using a GPU-equipped cloud-based virtual server, for example.
As we are using the high-level fast.ai neural network library (based on Facebook's PyTorch), the code is greatly simplified. We just need to focus on the problem, then let fast.ai do the necessary complex processing to do the modeling (train on our dataset and generate the model).
Illustration-9a shows the Python code within Jupyter Notebook, in which we prepare the base model for training by calling fast.ai's cnn_learner function and assigning it to an object called learner (using the Convolutional Neural Network (CNN)-based architecture called Resnet-50).
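A sketch of that one line (fastai v1 API) might look like the following; the path is a placeholder and the fixed random seed is our own addition for a reproducible split:

```python
# Build the DataBunch: load, resize, augment, split and normalize in one call.
from fastai.vision import *

path = Path('data/animereal')  # placeholder root folder with one subfolder per class
np.random.seed(42)             # fix the random 80/20 train/validation split

data = ImageDataBunch.from_folder(
    path,
    train=".",
    valid_pct=0.2,             # 20% of the images go to the validation set
    ds_tfms=get_transforms(),  # default data augmentation
    size=224,                  # rescale every image to 224x224 pixels
    num_workers=4              # number of CPU worker processes
).normalize(imagenet_stats)
```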
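A minimal sketch of that cell (fastai v1 API), assuming data is the ImageDataBunch created earlier:

```python
# Create a CNN learner based on Resnet-50, pre-trained on ImageNet,
# reporting error_rate as the metric during training.
from fastai.vision import *

learner = cnn_learner(data, models.resnet50, metrics=error_rate)
```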
2.a. Training
Once the base model for training is defined, we can start the training (illustration-9b) by calling fast.ai's fit_one_cycle() function with the parameter 5, meaning it will run for 5 epochs: learner.fit_one_cycle(5).
One cycle of training the neural network with the full dataset is called one epoch; in this case, all images from all 4 categories form our full dataset. Training and validation can be repeated several times to improve the accuracy, although at some point the accuracy may decrease. It is suggested, then, to save the generated file (model) for each stage (the saved model contains the learned weights and the network architecture for all connected layers in the model).
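A sketch of the first training run and of saving a stage (fastai v1 API; the stage name is just a label we choose):

```python
# Train for 5 epochs with the 1cycle policy, then save this stage of the model.
learner.fit_one_cycle(5)   # learner is the cnn_learner created above
learner.save('stage-1')    # save the weights so this stage can be reloaded later
```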
— — — — — — — — — — — — -
Hyper-parameters
The learning rate, which determines how fast the gradient descent algorithm learns, the number of layers, the number of neurons in each layer, the number of epochs, the number of mini-batches, and the lambda for regularization to minimize the given cost function are some of the variables known as hyper-parameters (like independent variables in statistics) in a neural network (or deep neural network/deep learning). In short, they are the external variables set before training to generate the optimized dependent variables in the neural network structure (the “model”): namely the weights & biases.
Arfika Nurhudatiana, Ph.D added during peer review: “Different machine learning algorithms have different sets of hyper-parameters. For example, boosted decision trees will have the number of trees and layer depth, among others, as hyper-parameters.”
— — — — — — — — — — — — -
We observe that, with all the default hyperparameters (such as the learning rate & measurement metrics), the 1st, 2nd, 3rd, 4th and 5th epochs give error_rates of 18.40%, 15.92%, 16.91%, 17.41% and 17.91% respectively.
We then continue by adjusting the learning rate for the subsequent training epochs. As error_rate is defined as (1 - accuracy), the accuracies of the first 5 epochs are 81.59%, 84.07%, 83.08%, 82.58% and 82.08% respectively.
The learning rate (Wikipedia) is a step size in machine learning, a hyperparameter that determines to what extent newly acquired information overrides old information. Too high a learning rate will make the learning jump over minima, while too low a learning rate will either take too long to converge or get stuck in an undesirable local minimum.
2.b. Training Optimization
In fast.ai, there is a function called lr_find() to find a range of possible learning rate values that are suitable for minimizing our error_rate (illustration-9c). Note that the default learning rate in fast.ai is set to 0.003 (3e-3), and in this case we run the fit_one_cycle() function for a few epochs before using lr_find().
The result of lr_find() suggests setting our learning rate range between 3e-5 and 3e-4 (the stable range in the graph, just before it goes up). Then, we call fit_one_cycle() with 2 epochs a few times and save the result of each stage (stage-2 to stage-5 in illustration-9d).
After a few stages of training with 2 additional epochs each at the adjusted learning rate, we have multiple saved stages. We choose stage-4, which has the smallest error_rate at 13.93% (hence the highest accuracy, 1 - error_rate, so far: 86.07%), for inferencing by exporting the model to a file named export.pkl (illustration-11).
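A sketch of this learning-rate search and the follow-up fine-tuning runs (fastai v1 API); whether the article's notebook unfreezes all layers at this point is not shown, so treat unfreeze() as an assumption:

```python
# Find a suitable learning-rate range, then fine-tune for 2 more epochs per stage.
learner.lr_find()                # sweep learning rates and record the losses
learner.recorder.plot()          # plot loss vs. learning rate (illustration-9c)

learner.unfreeze()               # (assumed) allow all layers to be fine-tuned
learner.fit_one_cycle(2, max_lr=slice(3e-5, 3e-4))  # use the suggested range
learner.save('stage-2')          # repeat and save for stage-3, stage-4, stage-5
```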
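A sketch of reloading the chosen stage and exporting it (fastai v1 API):

```python
# Reload the best stage and export the deployable model.
learner.load('stage-4')   # the stage with the lowest error_rate
learner.export()          # writes export.pkl into learner.path
```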
List of file names from each stage is shown in illustration-10. File sizes grow from 120MB in stage-1 to about 300MB in subsequent stages (stage-2 to stage-5).
SECTION 3: INFERENCING
While it is good to have a trained model, the process does not stop there. The model needs to be put to work by feeding it new data and making predictions. This is the inferencing stage.
AI, including inference, can be part of a larger business process such as Business Process Management (BPM) within an Enterprise AI, or run as a server process accessed by external applications like a mobile app or web-based app, or even accessed by a subprocess within an external application somewhere in a multi-cloud environment.
A generated neural network deep learning model (we can just say: model) can be deployed for inferencing in many ways (in the cloud, at the edge (AI on the IoT edge), on mobile devices, etc.) depending on what kind of applications we are targeting. Illustration-12 shows a typical AI data pipeline, where data flows through 3 stages: 1. data preparation, 2. modeling, and 3. deployment/inferencing. Running a model (inferencing) is the final stage, in which we select the type of deployment according to the requirements.
Illustration-13 shows a few sample test images (a different set of images, not included in the dataset) that we pass through the model and that are correctly identified according to the classes they should belong to, while illustration-14 shows a sample of an incorrectly identified image: it should belong to the animewoman class, while the model predicts it belongs to the realwoman class.
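A minimal inferencing sketch (fastai v1 API); the paths below are placeholders:

```python
# Load the exported model and predict the class of a new image.
from fastai.vision import *

learn = load_learner(Path('data/animereal'))   # loads export.pkl from that folder
img = open_image(Path('test/sample.jpg'))      # a new image, not part of the dataset
pred_class, pred_idx, probs = learn.predict(img)
print(pred_class, probs[pred_idx])             # predicted class and its probability
```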
Inferencing at a glance
There are many ways of doing inferencing. Although tools like IBM PowerAI Vision on IBM WMLA have an integrated deployment engine out-of-the-box, a typical process would be to export the trained model to an external environment, then do the inferencing there. Inferencing can be done on-premise, on-cloud or in combination; these are just deployment options that we need to select, considering the reliability and scalability that fit the purpose of the deployment (of course, cost is also one of the important factors to consider here).
A typical deployment approach is something like this: given a model, an external application passes in the new data to predict. Prior to being given access to the inference engine, the external application can be authenticated somehow, e.g. through an assigned API (Application Programming Interface) key, typically generated by a server running in the same environment as the inference engine.
External Application to Inference Engine. Before reaching the inference engine, incoming data (compressed) typically passes through a message pooling/queuing subsystem (we can deploy this on an asynchronous messaging platform using publish/subscribe methods, for example, to promote scalability).
We use "publish to a topic, e.g. the request_message topic" when sending the data from an external application to the messaging platform. At the other end, the application logic "subscribes to the request_message topic", so it will receive the data as soon as it arrives, to be passed to the inference engine (after the data has been decompressed). Once the predicted outcome is generated by the inference engine, the application logic then "publishes the result back to a response topic, e.g. the response_message topic on the messaging platform".
Inference Engine to External Application. Once the result reaches the messaging platform, it is passed back to the external application that "subscribes to the response_message topic" for further processing, e.g. by combining the result from the inference engine with other application states to execute some actions.
Note that the use of a messaging platform in asynchronous mode promotes scalability in handling multiple requests. In certain situations where high performance with low latency between requests and responses is really required, we may also do it synchronously rather than asynchronously. However, the use of synchronous mode must be exercised carefully, as we may then also need to build reliable application logic for handling message resend & recovery, which are provided out-of-the-box in asynchronous mode with its queuing mechanism.
The set of application logic + inference engine may also be configured as multiple threads so that it can handle multiple requests and perform multiple inferences in one pass within a process. Multiple application logic + inference engine sets may also be configured as containers to promote scalability in processing multiple parallel requests. Limiting the set of threads within one virtual machine or one container is meant to prevent the system's resources (CPU, RAM, GPU) from being exhausted within that virtualized environment.
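As one possible (hypothetical) realization of this request/response flow, the sketch below uses MQTT via the paho-mqtt client as the messaging platform; the article does not name a specific product, so the broker address, payload format and model path are assumptions, while the topic names follow the description above:

```python
# Application logic: subscribe to request_message, run the model, publish the prediction.
import paho.mqtt.client as mqtt
from io import BytesIO
from fastai.vision import *

learn = load_learner(Path('data/animereal'))   # placeholder path to export.pkl

def on_message(client, userdata, msg):
    # The payload is assumed to be raw (already decompressed) image bytes.
    img = open_image(BytesIO(msg.payload))
    pred_class, pred_idx, probs = learn.predict(img)
    # Publish the predicted class back for the external application to consume.
    client.publish('response_message', str(pred_class))

client = mqtt.Client()                         # paho-mqtt 1.x style client
client.on_message = on_message
client.connect('localhost', 1883)              # placeholder broker address
client.subscribe('request_message')
client.loop_forever()                          # block and serve incoming requests
```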
What’s Next?
Adoption of Machine Learning (ML) is accelerating rapidly, especially with the availability of cloud-based platforms (with GPUs) to experiment on. The common steps for doing deep learning are actually quite simple:
1. Data preparation: prepare the right dataset, then split it into training & validation data.
2. Modeling: Select neural network architecture, train using dataset, then generate model.
3. Inferencing: Deploy the model.
Preparing the right dataset has always been the challenge in doing deep learning; this can take weeks or even months. Provided the right resources & skill sets are available (data scientists and computing power), modeling should be a straightforward task, e.g. it can be done in hours, days, or just a few weeks for a very complex, big model. Once a model has been created, deployment should be "easier" to implement, e.g. deploying to a web or mobile app.
For modeling, cloud-based servers that support GPUs have been available for some time, starting as low as about USD 1 per hour for an entry-level configuration and going up to a few thousand USD per hour for very high-end configurations (multiple GPUs for parallelism). Note that although you can use a CPU only, training will be significantly slower; it can be about 10 times slower. The speed improvement with a GPU (especially with a large dataset) may vary, but in general it ranges from 10 to 20 times.
To start exploring, especially for inferencing, there are a few ways to gain hands-on experience. Take, for example, the announcement of the NVidia Jetson TX2 (in May 2017), which enables us to start using GPUs for deep learning (for USD 599), or the recent NVidia Jetson Nano (announced in March 2019 for just USD 99).
What are you waiting for then? Let’s start by exploring some use-cases in this exciting area of AI.
References:
Andi Sama, 2019a, “AI Model Inferencing, Practical deployment approaches & considerations”, SWG Insight, Edisi Q4 2019, page 3–9.
Andi Sama et al., 2019a, “Image Classification & Object Detection”.
Andi Sama et al., 2019b, “Think like a Data Scientist”.
Andi Sama et al., 2019c, “Guest Lecturing on AI: Challenges & Opportunity”, Lecture to FEBUI — University of Indonesia.
Andi Sama et al., 2018, “Deep Learning — Image Classification, Cats & Dogs — A Cognitive use-case: Implement a Supervised Learning for Image Classification”, SWG Insight, Edisi Q1 2018.
Andi Sama et al., 2017, “The Future of Machine Learning: The State of Advancements in Deep Learning”, SWG Insight, Edisi Q4 2017, page 6–17.
Andrew Widjaya, Cahyati S. Sangaji, 2019, “Face Recognition, Powered by IBM Cloud, Watson & IoT on Edge”, SWG Insight, Edisi Q2 2019.
Fei-Fei Li, Justin Johnson, Serena Yeung, 2017, “CS231n: Convolutional Neural Networks for Visual Recognition”.
Jeremy Howard, 2018, “Practical Deep Learning For Coders — v3”.
Pieter Abbeel, 2019, “Full Stack Deep Learning — Lecture 10: Research Directions”, Deep Learning Bootcamp, March 2019, Berkeley.
Wikipedia, “Learning rate”.