AI-Model Inferencing

Andi Sama
Aug 7, 2019


Practical deployment approaches & considerations

Andi Sama, CIO, Sinergi Wahana Gemilang

SWG Insight has discussed Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL) from different perspectives over the last couple of years. We shared multiple approaches and tools for modeling, that is, creating AI-models by learning directly from data. Some ran on-premise, such as IBM Watson Studio with Python on Jupyter Notebook on a regular Windows-based system, and IBM PowerAI Vision on IBM POWER AC922 hardware (IBM Watson Machine Learning Accelerator, WMLA) with NVIDIA V100 GPUs (Graphics Processing Units). Others ran in the cloud, such as IBM Watson Studio or IBM Neural Network Modeler with Python on Jupyter Notebook on IBM Cloud, or plain Python on Jupyter Notebook running on an IaaS cloud with NVIDIA K80 GPUs. Most of the modeling was done through Supervised Learning, in which a typical dataset contains a set of input features along with the corresponding known label as the expected output.

We learned that datasets are supposed to be i.i.d. (independent and identically distributed). As stated by (Kilian Weinberger, 2018), a dataset should be independent and identically distributed for Machine Learning algorithms to work. This means that all samples, including the new data the model sees after training, must be drawn independently from the same distribution.

A few well-known deep learning frameworks have also been discussed and demonstrated, such as Keras and TensorFlow from Google and Caffe2, the framework most often implemented in IBM’s tools, to name just a few. Others were also mentioned, like Facebook’s PyTorch, the ONNX model exchange format backed by Microsoft, and Apache MXNet, which is supported by Amazon. In the majority of discussions, those frameworks were applied to well-known problems in computer vision (such as Face Recognition, Image/Video Captioning and Object Classification & Detection) using deep learning neural network models based on Yann LeCun’s famous CNN (Convolutional Neural Network), together with some of its variations such as Faster R-CNN (Faster Region-based CNN) and Facebook’s Detectron.

Furthermore, for inferencing (running a trained AI-model), we demonstrated on-premise inferencing on an NVIDIA Jetson TX2 or an Android-based smartphone for different use cases such as Face Recognition by Name and Image Captioning. On-premise inferencing also included deploying a model on IBM PowerAI Vision on IBM WMLA. Use cases varied from Computer Vision and Natural Language Processing to deployments that combine IoT-based open source controllers and Node-RED with IBM Watson IoT Platform on IBM Cloud, to name a few.

Recently, we also discussed what is on the mind of a typical Data Scientist when presented with data, to illustrate the thinking process of selecting machine learning (or neural network) algorithms for feature engineering and modeling, given the various types of datasets at hand, in order to generate the best possible model.

In this article, our focus is on approaches to deploying a model. Deployment is how we use the trained model: once training is complete, we give it new data and expect it to perform as trained, predicting the output with a certain confidence level.

AI Data Pipeline

Illustration-1: An AI Data Pipeline consisting of 3 major steps: 1. Data Preparation, 2. Modeling and 3. Deployment (or Inferencing)

To recap, illustration-1 (in brief) and illustration-2 (expanded view) show a typical AI Data Pipeline consisting of three major steps: Data Preparation, Modeling/Training and Deployment/Inferencing.

Illustration-2: An expanded view on AI Data Pipeline consisting of 3 major steps: 1. Data Preparation, 2. Modeling and 3. Deployment (or Inferencing) and the typical roles of Data Engineer, Data Scientist, Machine Learning Engineer and others.

1. Data Preparation

A dataset may not be suitable for direct use in training. Images, text, sound or video, for example, need to be converted to numbers (integer or floating point) before being processed by machine learning/deep learning algorithms, which are basically optimization algorithms that approximate a solution using math & statistics. In general, data needs to be carefully selected, cleansed or transformed before it goes to the algorithms.
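As a rough illustration of "converting to numbers", the sketch below maps words to integer ids and scales 8-bit pixel values to floats. The function names (build_vocabulary, encode_text, normalize_pixels) and the tiny sample data are our own assumptions for illustration; real projects would normally rely on the preprocessing utilities of their chosen framework.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Map each distinct word to an integer id (0 is reserved for unknown words)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Most frequent words get the smallest ids; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=1)}


def encode_text(sentence, vocabulary):
    """Turn a sentence into a list of integer ids (unknown words become 0)."""
    return [vocabulary.get(word, 0) for word in sentence.lower().split()]


def normalize_pixels(pixels):
    """Scale 8-bit pixel intensities (0..255) to floats in the range 0.0..1.0."""
    return [value / 255.0 for value in pixels]


# Hypothetical sample data, only to show the shape of the transformation.
vocab = build_vocabulary(["the cat sat", "the dog sat down"])
encoded = encode_text("the bird sat", vocab)   # "bird" is not in the vocabulary
scaled = normalize_pixels([0, 51, 255])
```

The word "bird" was never seen while building the vocabulary, so it is encoded as the reserved id 0. Deciding how to treat such unseen values is part of data preparation.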

According to an article in Harvard Business Review (Thomas C. Redman, 2018), collecting and preparing data for machine learning takes about six months on average.


2. Modeling/Training

Once data is ready, training can begin. Prior to training we need to set a few hyperparameters: the number of epochs (how many full passes the training makes over the training dataset), the learning rate (the size of each small step taken towards a minimum, ideally the global minimum, of a defined loss function when running stochastic gradient descent, which follows the first-order derivative of that loss), and the dataset split ratio (the dataset is typically divided into portions such as [70%:15%:15%] or [60%:20%:20%] for the purpose of [Training:Validation:Testing]).
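To make two of these hyperparameters concrete, here is a minimal sketch of a [70%:15%:15%] split and of the role of the learning rate. It assumes a toy one-parameter loss, (w - 3)^2, and uses plain gradient descent rather than the stochastic variant. The names split_dataset and gradient_descent are hypothetical, not taken from any framework.

```python
import random


def split_dataset(samples, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle, then split into training, validation and testing portions."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)          # fixed seed: repeatable split
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]              # whatever remains
    return train, val, test


def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient; learning_rate sets the step size."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


train, val, test = split_dataset(range(100))

# Toy loss (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_best = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
```

With a learning rate that is too large (for this toy loss, anything above 1.0) the steps overshoot and the value diverges instead of settling near 3, which is why this hyperparameter usually needs some experimentation.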

If we select the k-means algorithm for Unsupervised Learning, for example, we need to specify a hyperparameter named k (the number of clusters we expect our dataset to be grouped into).
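The effect of k can be seen in a small sketch of k-means on one-dimensional data. This is a simplified illustration under our own assumptions: the function name k_means_1d, the evenly spaced starting centroids and the sample points are all hypothetical, and production code would normally use a library implementation.

```python
def k_means_1d(points, k, iterations=20):
    """Group 1-D points into k clusters; returns (centroids, clusters)."""
    ordered = sorted(points)
    step = len(ordered) / k
    # Assumption: start from k evenly spaced points of the sorted data.
    centroids = [ordered[int(i * step + step / 2)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data with three visible groups, so we choose k = 3.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
centroids, clusters = k_means_1d(data, k=3)
```

Choosing k = 2 on the same data would force two of the visible groups to share one cluster, which is why k is something we set from our expectation of the data rather than something the algorithm learns.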

For Supervised Learning, in which we have a set of input data associated with a set of output labels, the amount of training data is typically in the order of thousands of samples per class (that is, per type of object for which we want to find the relationship between the input data and the labeled output).

We need to review the result through some metrics (depending on the chosen algorithm), such as accuracy, the confusion matrix, mAP (mean Average Precision) and IoU (Intersection over Union), to find the best performing model. The selected model can then be deployed.
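Three of these metrics are simple enough to sketch directly. The example below assumes class labels given as plain Python lists and bounding boxes given as (x_min, y_min, x_max, y_max) tuples; both conventions, and the function names, are our own choices for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the known labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Rows are the true classes, columns are the predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union else 0.0


# Hypothetical labels: one "cat" is misclassified as "dog".
y_true = ["cat", "cat", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog"]
acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=["cat", "dog"])
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Accuracy and the confusion matrix apply to classification, while IoU compares a predicted bounding box with the labeled one in object detection. mAP builds on IoU and is left out here to keep the sketch short.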

3. Deployment (Inferencing)

This is the step in which we use the trained model to predict the outcome, given new data. Deployment can be done on-premise or in the cloud, using either CPU or GPU, with a few different approaches depending on the purpose: the model may be accessed by a mobile application, by a web application, or by a sub-process within an application.

From here on, we focus the discussion on Model Deployment/Inferencing.

Deployment/Inferencing in AI Data Pipeline

While it’s good to have a trained model, the process does not stop there. The model needs to be put to work by feeding it new data and letting it make predictions. This is the Inferencing step.

AI, including inference, can be part of a larger business process such as Business Process Management (BPM) within an Enterprise AI. It can also run as a server process accessed by external applications like a mobile app or a web-based app, or by a subprocess within an external application somewhere in a multi-cloud environment. Illustration-3 shows Enterprise AI within a broader Intelligent Automation Spectrum.

Illustration-3: Enterprise AI within an Intelligent Automation Spectrum (adapted and modified from Accenture 2017, HfS Research 2018 and other sources).

There are many ways to do inferencing. Although tools like IBM PowerAI Vision on IBM WMLA have an integrated deployment engine out-of-the-box, a typical process would be to export the trained model to an external environment and do inferencing there. Inferencing can be done on-premise, in the cloud, or in combination. These are just deployment options, which we select by considering the reliability and scalability that fit the purpose of the deployment (and, of course, cost is also an important factor here).

Illustration-4 shows a typical deployment approach that runs the trained model within an inference engine, to which an external application passes the new data for prediction. Before it is given access to the inference engine, an external application must be authenticated, e.g. through an assigned API (Application Programming Interface) key generated by a server running in the same environment as the inference engine.
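One possible shape for such an API-key check is sketched below. It is an assumption on our part, not the mechanism of any particular IBM product: the ApiKeyRegistry class and the client id are hypothetical, the server stores only a hash of each issued key, and the comparison uses a constant-time function from the Python standard library.

```python
import hashlib
import hmac
import secrets


class ApiKeyRegistry:
    """Issues API keys and checks them before a request reaches the engine."""

    def __init__(self):
        self._key_hashes = {}   # client id -> SHA-256 hex digest of its key

    def issue_key(self, client_id):
        """Generate a random key; only its hash is kept on the server side."""
        key = secrets.token_urlsafe(32)
        self._key_hashes[client_id] = hashlib.sha256(key.encode()).hexdigest()
        return key

    def is_authorized(self, client_id, presented_key):
        """Return True only if the presented key matches the issued one."""
        expected = self._key_hashes.get(client_id)
        if expected is None:
            return False
        presented = hashlib.sha256(presented_key.encode()).hexdigest()
        # compare_digest avoids leaking information through timing differences.
        return hmac.compare_digest(expected, presented)


registry = ApiKeyRegistry()
mobile_app_key = registry.issue_key("mobile-app")   # hypothetical client id
```

An external application would send its key with every request, and the application logic would call is_authorized before passing any data on to the inference engine.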

Illustration-4: An approach to deploying inferencing on a trained AI-model, in which on-premise/cloud-based external applications use the inference engine running somewhere on-premise or in the cloud.

External Application => Inference Engine

Before reaching the inference engine, incoming data (compressed) passes through the message pooling/queuing subsystem. We can deploy this on an asynchronous messaging platform using publish/subscribe methods, for example, to promote scalability through a message queue mechanism.

When sending data from an external application to the messaging platform, we “publish to a topic”, e.g. the request_message topic. At the other end, the application logic “subscribes to the request_message topic”, so it receives the data as soon as it arrives, decompresses it, and passes it to the inference engine.

Once the inference engine generates the predicted outcome, the application logic “publishes the result back to a response topic”, e.g. the response_message topic on the messaging platform.

Inference Engine => External Application

Once the result reaches the messaging platform, it is passed back to the external application that “subscribes to the response_message topic” for further processing, e.g. combining the result from the inference engine with other application states to execute some actions.
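The round trip described above (publish a compressed request, decompress and infer, publish the response) can be sketched with a small in-memory stand-in for the messaging platform. Everything here except the two topic names is an assumption for illustration: InMemoryBroker is not a real messaging product, fake_inference is a placeholder for a trained model, and JSON plus zlib is just one possible payload format.

```python
import json
import zlib
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy publish/subscribe broker: publish() only queues, delivery is later."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._pending = deque()

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        self._pending.append((topic, payload))   # returns without waiting

    def deliver_pending(self):
        while self._pending:
            topic, payload = self._pending.popleft()
            for callback in list(self._subscribers[topic]):
                callback(payload)


def fake_inference(features):
    """Placeholder for the inference engine: labels by the mean feature value."""
    score = sum(features) / len(features)
    return {"label": "positive" if score >= 0.5 else "negative", "score": score}


def make_application_logic(broker):
    def on_request(compressed):
        request = json.loads(zlib.decompress(compressed))   # decompress first
        result = fake_inference(request["features"])
        result["request_id"] = request["request_id"]
        broker.publish("response_message", json.dumps(result).encode())
    return on_request


broker = InMemoryBroker()
responses = []   # what the external application has received so far
broker.subscribe("request_message", make_application_logic(broker))
broker.subscribe("response_message", lambda raw: responses.append(json.loads(raw)))

# External application side: compress the request and publish it.
request = {"request_id": 1, "features": [0.9, 0.8, 0.7]}
broker.publish("request_message", zlib.compress(json.dumps(request).encode()))
broker.deliver_pending()
```

The request_id field lets the external application match a response to the request that caused it, which matters once several requests are in flight at the same time.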

Note that the use of a messaging platform in asynchronous mode promotes scalability in handling multiple requests. In situations where high performance with low latency between requests and responses is really required, we may also do it synchronously rather than asynchronously. However, synchronous mode must be used carefully, as we may then need to build reliable application logic for message resend & recovery ourselves, which asynchronous mode provides out-of-the-box through its queuing mechanism.

The set of application logic + inference engine may also be configured as multi-threaded, so that it can handle multiple requests and perform multiple inferences in one pass within a process. Multiple sets of application logic + inference engine may also be configured as containers to promote scalability in processing multiple parallel requests. Limiting the number of threads within one virtual machine or one container prevents the system’s resources (CPU, RAM, GPU) from being exhausted within that virtualized environment.
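A capped thread pool is one way to express this limit in code. The sketch below assumes a cap of four threads and simulates inference with a short sleep. The BoundedInferenceServer class, the cap and the dummy prediction are hypothetical, chosen only to show that the number of concurrent inferences never exceeds the configured limit.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BoundedInferenceServer:
    """Handles requests on a fixed-size thread pool and records peak load."""

    def __init__(self, max_workers):
        self.max_workers = max_workers   # assumed cap per VM or container
        self.peak_active = 0
        self._active = 0
        self._lock = threading.Lock()

    def _handle(self, request_id):
        with self._lock:
            self._active += 1
            self.peak_active = max(self.peak_active, self._active)
        try:
            time.sleep(0.01)             # stand-in for the real inference call
            return {"request_id": request_id, "prediction": request_id % 2}
        finally:
            with self._lock:
                self._active -= 1

    def serve(self, request_ids):
        # The pool never runs more than max_workers inferences at once;
        # extra requests wait in the executor's internal queue.
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            return list(pool.map(self._handle, request_ids))


server = BoundedInferenceServer(max_workers=4)
results = server.serve(range(10))
```

To scale beyond one such process, several copies can run as separate containers, each with its own cap, behind the same messaging platform.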

What’s next then?

Data collection and data preparation (through augmentation, dimension reduction and so forth) take a lot of time & resources, and are typically done by a role called Data Engineer. Spending months of work preparing suitable data for Machine Learning/Deep Learning is not uncommon.

Then modeling/training follows, by selecting from a variety of available algorithms and adjusting their hyperparameters. Many days or weeks (if not using high-end GPUs) can be spent on this step alone.

Implementing a trained model in a way that can scale is another important matter. Deploying on ready-to-use inference engines such as IBM Machine Learning on IBM Cloud or IBM PowerAI Vision on IBM Watson Machine Learning Accelerator (IBM POWER AC922) is something we need to consider, especially with the support of Distributed Deep Learning (DDL) for high scalability across multiple servers.

Well, to understand things better, we need to start getting our hands dirty by working with some real datasets, proceeding with some modeling, and then trying multiple ways of inferencing.

What are we waiting for?


References

Andi Sama et al., 2019a, “Image Classification & Object Detection”, Accessed online on August 6, 2019 at 6:35 PM.

Andi Sama et al., 2019b, “Think like a Data Scientist”, Accessed online on August 6, 2019 at 6:32 PM.

Andi Sama, 2019c, “Guest Lecturing on AI: Challenges & Opportunity”, Lecture to FEBUI, University of Indonesia, Accessed online on June 10, 2019 at 11:21 AM.

Andi Sama et al., 2018a, “Deep Learning - Image Classification, Cats & Dogs - A Cognitive use-case: Implement a Supervised Learning for Image Classification”, Q1 2018 Edition, Accessed online on July 9, 2019 at 4:37 PM.

Andi Sama, 2018b, “Processing Handwritten digit (mnist dataset)”, Accessed online on July 9, 2019 at 4:44 PM.

Andrew Widjaya, Cahyati S. Sangaji, 2019, “Face Recognition, Powered by IBM Cloud, Watson & IoT on Edge”, Q2 2019 Edition, Accessed online on July 10, 2019 at 7:18 PM.

Hui Li, 2017, “Which machine learning algorithm should I use?”, Accessed online on July 22, 2019 at 11:10 AM.

Kilian Weinberger, 2018, “CS4780 - Machine Learning for Intelligent Systems - Cornell Computer Science”, Cornell University, Fall 2018, Accessed online on July 31, 2019 at 7:48 PM.

Thomas C. Redman, 2018, “If Your Data Is Bad, Your Machine Learning Tools Are Useless”, Harvard Business Review, April 2018 Edition, Accessed online on August 5, 2019 at 11:55 AM.

Tom Reuner, 2018, “HfS Blueprint Report, Enterprise Artificial Intelligence (AI) Services 2018”, HfS Research, Accessed online on July 5, 2019 at 11:55 AM.