Image Classification & Object Detection on IBM PowerAI Vision

Andi Sama
May 25, 2019


Image Classification & Object Detection with IBM PowerAI Vision on IBM POWER AC922

This is our first attempt at writing an article here. Until now, we have posted quarterly articles on information technology on our Facebook page.

Andi Sama — CIO, PT. Sinergi Wahana Gemilang with Andrew Widjaja, and Cahyati S. Sangaji

Humans can naturally sense the things around them through many biological sensors, such as eyes for vision, ears for hearing, the nose for smell, and skin for sensing heat and cold. These capabilities are embedded within us, so we use them mostly unconsciously. They are simply there for us to use.

Illustration-1: Object recognition result for “Nasi Rames” (rice with various mixed toppings) on IBM PowerAI Vision. The deep learning model was trained for 8000 epochs, with the remaining hyperparameters left at their defaults. The modeling (training) is based on Detectron, an object recognition deep learning architecture and one of the supported base models in PowerAI Vision. The dataset provided to train the model contains only 38 labeled images, each with various objects in it. All images are then augmented (e.g. blur, rotation, sharpen, flip) to generate 1400+ images, which become the base dataset for the modeling process.

Machines, on the other hand, cannot simply be designed to do all the things that any human does naturally. Years of research have been devoted to this, and many advanced developments have emerged within just the last few years.

New algorithms, new optimization methods, and new hardware are accelerating progress in this area of study, which has many potential practical applications. Illustration-1 shows the predicted objects within an image, produced by Detectron, one of the state-of-the-art neural network architectures for object recognition. Each object in the image is recognized by name, at a certain confidence level, using a previously trained model with 89% accuracy, as shown in illustration-2.

Illustration-2: Training (modeling) result for object detection on the “Nasi Rames” dataset on IBM PowerAI Vision, running on IBM Watson Machine Learning Accelerator (PowerAI). The base model is Detectron, with 8000 epochs (iterations) and other hyperparameters as illustrated above.

Image Classification & Object Detection

Image Classification

The notable breakthrough in computer vision using deep learning came in 2012, when a neural network architecture (NN architecture) called Convolutional Neural Network (CNN) won the ImageNet “Large Scale Visual Recognition Challenge” on image classification with an error rate of 16.4%, a significant improvement over the 2011 result of 25.8% (CS231n lecture on computer vision, Stanford University, April 2017).

That initial CNN-based NN architecture is known as AlexNet. It was followed by Inception/GoogleNet and VGG (2014), and ResNet (2015). The winning error rates in 2013, 2014, and 2015 were 11.7%, 6.7%, and 3.57% respectively.

The 2015 ImageNet result surpassed a human expert, who achieved an error rate of 5.1%. One of the later developments is NASNet (2017), which is reported to reduce the error rate further, to 2.4%.

To date, there has been no further significant breakthrough. We may already have reached saturation in large-scale image classification.

The previous edition of SWG Insight (Q1 2018) showed one possible way to do image classification: recognizing dogs and cats with a state-of-the-art (as of 2017) deep learning NN architecture. The dataset was trained (modeled) using tools such as Jupyter Notebooks and the command line interface (CLI) on Linux, with an NVIDIA K80 GPU in the cloud.

Object Detection

Although approaches to image classification have been around for 6+ years, since AlexNet won the ImageNet challenge in 2012, recognizing individual objects within an image (or video stream) has only emerged in recent years. NN architectures available for object recognition include RCNN (2014), FRCNN (2015), SSD (2016), YOLO (You Only Look Once: YOLOv1 in 2016, YOLOv2 in 2017, and YOLOv3 in 2018), RetinaNet (2017), and Detectron (2018).

Image Classification & Object Detection with IBM PowerAI Vision on IBM WMLA

One of the tools currently available for image classification and object detection is IBM PowerAI Vision. It is software that runs on IBM POWER hardware (IBM POWER AC922), which is also the recommended platform for IBM Watson Machine Learning Accelerator (WMLA, previously known as IBM PowerAI Enterprise). There is also a community edition called WML-CE (WML Community Edition, previously known as IBM PowerAI). WML-CE and WMLA are IBM’s implementations of deep learning frameworks and libraries.

An IBM POWER AC922 consists of processor cores, disks, and RAM like a typical server, plus 2, 4, or 6 GPUs (NVIDIA Tesla V100, each with 5120 CUDA cores). More GPUs can be added in a DDL (Distributed Deep Learning) configuration. NVLink connections between GPUs, and between GPUs and CPUs, significantly speed up data transfer among them. The supported Linux-based operating systems are Ubuntu and Red Hat.

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model for its GPUs. Roughly speaking, a GPU with 5120 CUDA cores can carry out up to 5120 computations in parallel.

On top of that, we can install machine learning frameworks and tools (such as Caffe2, TensorFlow, Driverless AI, etc.) to experiment in depth with deep learning, for example as developers or data scientists. For vision-related work (image classification and object detection), we can install PowerAI Vision on the IBM POWER AC922, which is the focus of this article.

Object Detection in PowerAI Vision

IBM gave the author temporary access (30 days in May 2019) to an installation of PowerAI Vision version 1.1.3 in the cloud, with shared use of 12 NVIDIA V100 GPUs. That access made it possible to write this article and share IBM PowerAI Vision’s functions and capabilities with readers.

Let’s start by picking a use case: detecting the objects within images of “Nasi Rames” (rice with various mixed toppings), using a dataset that we define ourselves. First we prepare the dataset by collecting the images, labeling them, and augmenting them (1. Data Preparation). Once the dataset is ready, we pass it to the training engine (2. Modeling) until we get the result: a trained model. The trained model can then be deployed to a runtime engine (3. Inferencing).

a. Data Preparation

First, we access the PowerAI Vision installation in the cloud (illustration-3) by logging in from a web browser. We are then presented with a welcome screen (illustration-4), where we can upload our own dataset (such as the one in illustration-5) and enrich it through data labeling and data augmentation (illustration-7). The enriched dataset contains many additional images derived from the original dataset (illustration-8).

Illustration-3: Login screen to IBM PowerAI Vision using a provided temporary userid.
Illustration-4: Welcome screen of PowerAI Vision v1.1.3 on IBM Cloud.

Note that labeling is only required for object detection. No object labeling is needed for image classification. Labeling may take most of the data preparation time and can sometimes be very costly (experience with object labeling suggests that it may be outsourced to speed up the creation of a labeled dataset). The Autolabel feature in PowerAI Vision assists with this process by automatically labeling additional objects in images, provided that we already have a deployed trained model with sufficiently good metrics (such as acceptable accuracy, the percentage of correct image classifications, calculated as (true positives + true negatives) / all cases).

Illustration-5: A sample dataset, created by collecting 38 images of ‘Nasi Rames’ through Google image search and uploading them to PowerAI Vision (one image at a time, as a collection of images, or as images compressed in a zip file). Nasi Rames is typically steamed white rice with a mix of toppings such as vegetables, chicken, tofu, egg, meat, crackers, and fried onion, served on a plate or in other variations such as on a banana leaf. The images are then labeled one by one by hand (using the labeling tool provided in PowerAI Vision) to define the areas of the objects that we want to label and train on later.
Illustration-6: Two sample images of ‘Nasi Rames’ that have been tagged with multiple objects. Each object that we want to define needs to be labeled one by one in every image that the model will be trained on (the polygon being drawn needs to follow the boundary of each object as closely as possible).
Illustration-7: Available data augmentations in PowerAI Vision, to enrich (create variations) the original dataset.
Illustration-8: Augmented & labeled dataset.
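
To make the idea of augmenting a labeled dataset concrete, here is a minimal sketch in plain Python. It is not how PowerAI Vision implements augmentation; the tiny grid "image", the label dictionary, and all function names are hypothetical and chosen only for illustration. The point it shows is that when an image is flipped, the polygon label has to be mirrored as well so that it still covers the same object.

```python
# Minimal sketch: an image is a list of rows of pixel values, and a label is a
# named polygon of (x, y) points. All names here are hypothetical.

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]

def flip_polygon_horizontal(polygon, width):
    """Mirror polygon points so the label still covers the same object."""
    return [(width - 1 - x, y) for x, y in polygon]

def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def bounding_box(polygon):
    """Return (xmin, ymin, xmax, ymax) of a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

# A 4x3 "image" whose object (value 9) sits in the top-right corner.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 0, 0],
]
label = {"name": "egg", "polygon": [(2, 0), (3, 0), (3, 1), (2, 1)]}

flipped_image = flip_horizontal(image)
flipped_label = {
    "name": label["name"],
    "polygon": flip_polygon_horizontal(label["polygon"], width=len(image[0])),
}
rotated_image = rotate_90_clockwise(image)

print(flipped_image)
print(bounding_box(flipped_label["polygon"]))
print(rotated_image)
```

Each original image can produce several such variants, which is how a small set of labeled images can grow into a much larger training set.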

b. Modeling

The augmented and labeled dataset (sample as in illustration-6) is then trained using a predefined model that we select from the available list (illustration-9), depending on whether we want to do image classification or object detection. Illustration-9 shows the options for object detection, which is our use case. For image classification in PowerAI Vision 1.1.3, the supported base architecture is GoogleNet.

Illustration-9: Base model selections to train our dataset.
Illustration-10: Hyperparameters that we can adjust for the training with a given neural network architecture (in this case: Detectron).

Note also that we can bring our own custom-developed NN architecture (for example, one implementing state-of-the-art image classification or object detection from a recent paper). It needs to be coded in Python using the TensorFlow library, following certain rules detailed in the PowerAI Vision documentation. This facility is provided out of the box in PowerAI Vision.

After selecting the NN architecture (base model) to train on our dataset, we can adjust a few hyperparameters (illustration-10) to tell PowerAI Vision how to carry out the training. One of them is the number of iterations, that is, how many times data is passed through the algorithm (the NN architecture) to reduce its error, typically by using differentiation (calculus) to adjust the weights. This article loosely calls iterations epochs; strictly speaking, an iteration processes one batch of data, while an epoch is one full pass over the whole dataset.
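
As a rough illustration of what an iteration does, the sketch below fits a one-parameter model y = w * x by gradient descent in plain Python. It is a toy stand-in, not Detectron and not PowerAI Vision's training code; the dataset, the `train` function, and the learning rate are assumptions made for this example. Each iteration measures the error (loss), differentiates it with respect to the weight, and nudges the weight in the direction that reduces the error.

```python
def train(xs, ys, iterations, learning_rate=0.05):
    """Fit y = w * x by gradient descent and return (w, loss_history)."""
    w = 0.0
    history = []
    n = len(xs)
    for _ in range(iterations):
        # Mean squared error of the current model on the whole dataset.
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        history.append(loss)
        # Derivative of the loss with respect to w.
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Move the weight a small step against the gradient.
        w -= learning_rate * gradient
    return w, history

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x
w, history = train(xs, ys, iterations=200)
print(f"w = {w:.4f}, first loss = {history[0]:.2f}, last loss = {history[-1]:.6f}")
```

With more iterations the loss generally keeps shrinking until it flattens out, which is the kind of curve a loss-versus-iteration graph displays.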

When training completes, a model is generated (we call it a ‘trained model’). Strictly speaking, what PowerAI Vision calls ‘training’ is ‘modeling’ from a given dataset. However, since ‘training’ is also the commonly used word, we use the two words interchangeably in this article.

Illustration-11: Progress of the training process based on the Detectron neural network architecture. For this specific use case, with the labeled and augmented dataset and the chosen hyperparameters, training completed in less than one hour on May 8, 2019, after starting (getting its GPU allocation) at 8:28 AM (GMT+7).

Physically, a trained model is no more than a file. It is typically a few hundred MB (megabytes) in size and contains a large set of floating-point numbers structured according to a defined NN architecture. Most of these numbers are the trained weights of the connections between neurons, laid out according to the chosen architecture, such as GoogleNet or ResNet. Sometimes metadata (data that describes data) accompanies the trained model, typically as a JSON (JavaScript Object Notation) file that describes, for example, the NN architecture used to train the model.
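
The sketch below illustrates the "a model is just a file of numbers plus optional metadata" idea using only the Python standard library. The file names, the metadata fields, and the four toy weights are hypothetical; real frameworks use their own formats, and this is not the format PowerAI Vision exports.

```python
import json
import os
import tempfile
from array import array

# Hypothetical, tiny stand-in for trained weights (real models hold millions).
weights = array("f", [0.12, -0.5, 0.33, 0.9])
metadata = {"architecture": "toy-network", "dtype": "float32", "count": len(weights)}

with tempfile.TemporaryDirectory() as folder:
    weights_path = os.path.join(folder, "model.bin")
    metadata_path = os.path.join(folder, "model.json")

    # Saving: raw floating-point numbers in one file, a description in another.
    with open(weights_path, "wb") as f:
        weights.tofile(f)
    with open(metadata_path, "w") as f:
        json.dump(metadata, f)

    # Loading reverses the steps: read the metadata, then that many floats.
    with open(metadata_path) as f:
        loaded_metadata = json.load(f)
    loaded_weights = array("f")
    with open(weights_path, "rb") as f:
        loaded_weights.fromfile(f, loaded_metadata["count"])
    file_size = os.path.getsize(weights_path)

print(file_size, "bytes for", len(loaded_weights), "weights")
```

The same arithmetic explains the file sizes seen in practice: at 4 bytes per 32-bit weight, a model with 50 million weights takes about 200 MB.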

Illustration-12: The top-right corner of the screen shows GPU utilization in IBM PowerAI Vision on IBM Cloud, whether for training (modeling) or for running deployed models. There are 12 GPUs available, and 1 GPU is being used for training.

After training, PowerAI Vision produces several metrics for us to examine. When modeling, we often need to choose an NN architecture that suits the type of dataset, the use case, and the purpose of the model: whether it should be optimized for speed or for accuracy.

Typically, the higher the accuracy we want to achieve, the slower the model responds. Conversely, if we want the model to be as fast as possible at run time (for example, to maximize frames per second in video recognition), its accuracy will typically be lower than that of a model optimized for accuracy. It is a trade-off that we need to decide on for a given use case.

Illustration-13 shows the Loss vs Iteration curve, for example. According to the IBM PowerAI Vision 1.1.3 documentation, this graph shows “the relative performance of the model over time. The model should converge at the end of the training with low error and high accuracy.”

Illustration-13: Some metrics shown for the trained model. The Loss vs Iteration graph is one of the important measurements: the loss should get as close to zero as possible, although in practice it does not reach zero. The model may be optimized for speed or accuracy.

Furthermore, illustration-14 shows advanced graphs: the Confusion Matrix and the PR curve. The IBM PowerAI Vision 1.1.3 documentation says that “Confusion Matrix is used to calculate the other metrics, such as precision and recall. Each column of the matrix represents the instances in a predicted class (those that PowerAI Vision marked as belonging to a category). Each row represents the instances in an actual category. Therefore, each cell measures how many times an image was correctly and incorrectly classified.” The documentation continues: “You can view the confusion matrix as a table of values or a heat map. A heat map is a way of visualizing the data, so that the higher values appear more ‘hot’ (closer to red) and lower values appear more ‘cool’ (closer to blue). Higher values show more confidence in the model. This matrix makes it easy to see if the model is confusing categories, or not identifying certain categories.”
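
The row and column layout described in the documentation can be sketched in a few lines of Python. The class names and the sample predictions below are made up for illustration, and `confusion_matrix` is a hypothetical helper, not a PowerAI Vision function. Rows are actual categories and columns are predicted categories, so correct results land on the diagonal.

```python
def confusion_matrix(actual, predicted, classes):
    """Count results per (actual, predicted) pair: rows actual, columns predicted."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Made-up sample results for three categories.
classes = ["rice", "egg", "tofu"]
actual = ["rice", "rice", "egg", "egg", "tofu", "tofu"]
predicted = ["rice", "rice", "egg", "tofu", "tofu", "egg"]

matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(f"{name:>5}: {row}")
```

In this made-up sample, the off-diagonal counts in the "egg" and "tofu" rows show the model confusing those two categories with each other, which is exactly what the matrix is meant to reveal.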

Illustration-14: Confusion Matrix & PR Curve as part of generated metrics from IBM PowerAI Vision.
Illustration-15: List of deployed models that can be accessed through the REST API.

On the PR curve, the IBM documentation states: “The precision-recall (PR) curve plots precision vs. recall (sensitivity). Because precision and recall are typically inversely related, it can help you decide whether the model is appropriate for your needs. That is, do you need a system with high precision (fewer results, but the results are more likely to be accurate), or high recall (more results, but the results are more likely to contain false positives)?” The documentation continues: “Precision describes how ‘clean’ our population of hits is. It measures the percentage of images that are correctly classified. That is, when the model classifies an image into a category, how often is it correct? It is calculated by true positives / (true positives + false positives).” On recall, it states: “The percentage of the images that were classified into a category, compared to all images that should have been classified into that category. That is, when an image belongs in a category, how often is it identified? It is calculated as true positives/(true positives + false negatives).”
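
The formulas quoted above, together with the accuracy formula mentioned in the data preparation notes, translate directly into code. The counts below are hypothetical numbers for a single category, chosen only to show the arithmetic; they are not results from the Nasi Rames model.

```python
def precision(tp, fp):
    """true positives / (true positives + false positives)"""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """true positives / (true positives + false negatives)"""
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, tn, fp, fn):
    """(true positives + true negatives) / all cases"""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts for one category, e.g. "egg".
tp, fp, fn, tn = 8, 2, 4, 86

print(f"precision = {precision(tp, fp):.2f}")
print(f"recall    = {recall(tp, fn):.2f}")
print(f"accuracy  = {accuracy(tp, tn, fp, fn):.2f}")
```

With these counts the model is right most of the time when it says "egg" (high precision) but misses a third of the actual eggs (lower recall), while accuracy stays high because true negatives dominate. That is why precision and recall are worth reading alongside accuracy.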

Illustration-16a: A test image with identified objects according to the trained model.

image source-1

Illustration-16b: A test image with identified objects according to the trained model.

image source-2

c. Model Deployment

The trained model can then be deployed to either a server (PowerAI Vision Server) or the edge, depending on the trained model. On the edge, the model can be deployed to a smaller AI engine with an IoT footprint, such as the NVIDIA Jetson TX2 with its GPU (256 CUDA cores), or to another supported engine that has only CPUs and no GPU. An example of a CPU-based inference engine is a mobile device running the Android operating system.

If the model is small enough, deployment to the NVIDIA Jetson Nano (128 CUDA cores), announced in 2019, may be possible. A larger model can be deployed to an NVIDIA edge inference engine bigger than the Jetson TX2, such as the Jetson AGX Xavier (512 CUDA cores).

Deployment can be automatic and seamless if supported (see the deployed models in illustration-15), making the model immediately accessible through the REST API. Alternatively, we can export the model, in which case we need an application (for example, one built in Python) to load and run the generated model.

Following deployment, we can access the model through the provided REST API (Application Programming Interface) to do inferencing, that is, to use the trained model at run time. In illustrations 16a and 16b, we pass several new images (images that were not in the dataset used to train the model) to the deployed model to detect objects. Later, we can also provide more data (e.g. more new images) to improve the trained model’s prediction of classes (image classification) or objects (object detection).
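
A client for such a REST API might look roughly like the sketch below. Everything specific in it is an assumption rather than the documented PowerAI Vision interface: the endpoint URL is whatever the product shows for the deployed model, the multipart field name `files` and the response keys `classified`, `label`, and `confidence` are illustrative guesses that should be checked against the real documentation and a real response. The sample response is made up so that the parsing part can run without a server.

```python
import json

def keep_confident(response_text, min_confidence=0.5):
    """Return (label, confidence) pairs at or above the threshold.

    ASSUMPTION: the response is JSON with a "classified" list whose items have
    "label" and "confidence" keys. Check the real response of your deployment.
    """
    payload = json.loads(response_text)
    detections = payload.get("classified", [])
    kept = [(d["label"], d["confidence"]) for d in detections
            if d["confidence"] >= min_confidence]
    # Most confident detections first.
    return sorted(kept, key=lambda item: item[1], reverse=True)

def detect_objects(endpoint_url, image_path):
    """POST an image to a deployed model and return the raw response text.

    ASSUMPTION: endpoint_url is the API endpoint shown for the deployed model,
    and the image goes in a multipart form field named "files".
    Requires the third-party `requests` package.
    """
    import requests
    with open(image_path, "rb") as image_file:
        response = requests.post(endpoint_url, files={"files": image_file})
    response.raise_for_status()
    return response.text

# Made-up response, used here instead of calling a real server.
sample_response = json.dumps({
    "classified": [
        {"label": "egg", "confidence": 0.81},
        {"label": "tofu", "confidence": 0.42},
        {"label": "rice", "confidence": 0.97},
    ]
})

for name, confidence in keep_confident(sample_response, min_confidence=0.6):
    print(f"{name}: {confidence:.0%}")
```

Filtering by a confidence threshold on the client side is one simple way to trade recall for precision without retraining the model.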

What’s Next?

Technology in the areas of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) has advanced significantly in the past 6+ years, especially thanks to advances in hardware (GPUs) and the invention of new deep learning algorithms and neural network architectures.

Algorithms and neural network architectures that used to be covered only in research-oriented doctoral courses (at universities such as MIT, NYU, NTU, and Stanford, as well as Indonesian universities such as University of Indonesia and Bina Nusantara University) have quickly become topics that are also taught and explored in master’s and bachelor’s degree programs.

The more mature approach to deep learning at this moment is still Supervised Learning. This means that, for the model to predict something, it needs to be trained with a lot of data (ideally at least thousands of samples for each classification category or object category). We need to label thousands of samples before the model can capture the pattern (in the form of a model) and predict new, unseen data. IBM PowerAI Vision falls into this category of Supervised Learning in computer vision for image classification and object recognition.

Other emerging approaches that are still hot research topics (especially in post-doctoral research) lean toward Unsupervised Learning and Reinforcement Learning. There is also Imitation Learning, and others may emerge in the future as research advances.

On the other hand, the world’s big AI players, the G-MAFIA and BAT (Google, Microsoft, Amazon, Facebook, IBM, and Apple; Baidu, Alibaba, and Tencent), have been delivering products and services for multiple target industries that are more and more applicable to the market.

For us in Indonesia, opportunities have been opening up over the last few years, while the number and capabilities of implementers in the area of AI are still limited. This is an opportunity for players in the information technology industry to move quickly into the market by starting with a ready-to-implement solution such as IBM PowerAI Vision running on IBM Watson Machine Learning Accelerator.

Well, let’s get started by doing something.


Andi Sama et al., 2018, “Deep Learning - Image Classification, Cats & Dogs - A Cognitive use-case: Implement a Supervised Learning for Image Classification”, Q1 2018 edition, Accessed online on May 9, 2019 at 4:37 PM.

Andi Sama, 2018, “Processing Handwritten digit (mnist dataset)”, Accessed online on May 9, 2019 at 4:44 PM.

Andrew Widjaya, Cahyati S. Sangaji, 2019, “Face Recognition, Powered by IBM Cloud, Watson & IoT on Edge”, Q2 2019 edition, Accessed online on May 10, 2019 at 7:18 PM.

IBM, “IBM PowerAI Vision 1.1.3 on Cloud”, Accessed online during May 2019.

IBM, “IBM PowerAI Vision 1.1.3 documentation”, Accessed online during May 2019.