Think like a Data Scientist

Andi Sama
17 min read · Jul 19, 2019


Machine Learning Algorithms, An Introduction: with a brief discussion on IBM Hardware & Software Portfolio for Machine Learning

Andi Sama CIO, Sinergi Wahana Gemilang with Auliah Nuraini and Andrew Widjaja

This article first appeared on the SWG page on facebook.com (posted by Andi Sama on July 12, 2019). The original article can be found at the following link: https://web.facebook.com/SinergiWahanaGemilang/photos/a.208444795905448/2305716122844961/

When presented with a set of accessible data and asked to build a model that can do something useful (prediction or classification, for example) with new, unseen data, most Data Scientists have to think and experiment a lot to find the best way to model the relationships within that data.

Sometimes many algorithms are explored to find the model that is deemed the most optimized and the best fit to the data. In extreme cases this can lead to frustration: many days and nights spent on multiple experiments and variations, not to mention wasted resources such as computing power.

Illustration-1: Best practices in selecting Machine Learning algorithms when presented with data. The workflow was contributed by the late Dr. Hui Li, an experienced Data Scientist.

In this article, we briefly discuss a structured way of thinking about selecting among the various available machine learning algorithms, so we can take a better path in modeling our data. The discussion takes a different angle from the one we are used to with deep learning, since an Artificial Neural Network (ANN) with deep learning is just one of the latest popular approaches to modeling data. There are many approaches to Machine Learning (see illustration-1) that have been around for years, even decades, and those algorithms are still used today by many Data Scientists, depending on the problem at hand.

“Picking a random algorithm, given data, and hope that it will just work is a terrible idea”

Kilian Weinberger, Professor of Computer Science at Cornell University

(Kilian Weinberger, 2018) said that selecting the best machine learning algorithm for a given dataset is something a Data Scientist has to do manually, for now; that is, given a dataset, find the best available algorithm to learn from it. Some people have been trying to automate the selection of Machine Learning algorithms for a given dataset, but none has produced satisfying results so far. Picking a random algorithm, given data, and hoping that it will just work is a terrible idea.

In Supervised Learning, where we are given a dataset as input (features) and a set of corresponding labels, the purpose of a Machine Learning algorithm is to learn directly from the data, so that for any new input data within the same distribution space as the training data, the trained algorithm (the model) is expected to generate the correct output (label).

During training, the goal is to minimize the difference (driving it as close to zero as possible) between the predicted output (from the model) and the actual output (from the training data); this is done by minimizing a Loss function.

Table-1: Three categories of Machine Learning.

One of the basic Loss functions for Regression is RMSE (Root Mean Square Error). RMSE represents the sample standard deviation (as in statistics) of the differences between predicted values and observed values (called residuals). Mathematically, it is calculated using the following formula:
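With n observations, predicted values ŷ_i, and observed values y_i (this notation is ours, not taken from the original figure):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
```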

Other metrics that we will see (such as in illustration-7a and 7b) are R-Squared, MSE (Mean Squared Error), and MAE (Mean Absolute Error).

The Matlab documentation says “R-Squared is the Coefficient of determination that indicates the proportionate amount of variation in the response variable y explained by the independent variables X in the linear regression model.” It continues: “The larger the R-squared is, the more variability is explained by the linear regression model.”

MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average, over the test sample, of the absolute differences between prediction and actual observation, where all individual differences have equal weight.
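Using the same notation as above (with ȳ the mean of the observed values), the standard definitions of these metrics are:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2,\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|,\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```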

A quick Recap on Artificial Intelligence

Time flies: that is what we say when a certain period has passed surprisingly quickly. It also applies to the area of Computer Science called Artificial Intelligence (AI) and Machine Learning (ML), especially the sub-field of Deep Learning (DL), in which developments since 2012 have grown tremendously. Rapid advancements in Graphics Processing Units (GPU), Big Data (e.g. Hadoop), and new algorithms, as well as inferencing on edge devices with Field Programmable Gate Arrays (FPGA) or Application-Specific Integrated Circuits (ASIC), have started to have a real effect in various implementations across industries.

SWG Insight has been discussing AI, ML, and DL for more than two years, since its 2017 editions, and will continue to do so in upcoming editions. Topics within the typical machine learning workflow, mostly for ANN with deep learning, have been covered: general concepts, data preparation, modeling, and inferencing with or without a GPU (including inferencing on the edge, AIoT, or AI on IoT Edge) in various use cases. We have also discussed the environments for deep learning: local machines or the cloud, on multiple Operating Systems (OS), mostly Linux.

The discussions on deep neural networks (DNN, an ANN that uses Deep Learning) in previous articles were mostly related to computer vision, since this is the sub-field of research that has received much attention lately, implementing as well as improving and extending Professor Yann LeCun's CNN (Convolutional Neural Network) architecture for various use cases. In general, this is done by learning the relationship between given inputs and outputs in order to process datasets in the form of images or video.

Machine Learning

Machine Learning is a subset of Artificial Intelligence (AI). Wikipedia defines AI as “Intelligence exhibited by machines, rather than humans or other animals.” One of the sub-branches of Machine Learning is the ANN, which is a “mathematical model” of the biological human brain.

Deep Learning is the current name for an ANN that learns by utilizing more than one hidden layer. Traditionally, machine learning has been categorized as Supervised Learning (labelled datasets) and Unsupervised Learning (unlabelled data). More recently, a third category has emerged: Reinforcement Learning. These three categories of Machine Learning are summarized quickly in table-1.

Machine learning offers the ability to extract knowledge and patterns from a series of observations. This is done through mathematical optimization using models (pattern recognition or exploration of many possibilities). In supervised learning, minimizing the errors (used as feedback) is very important to get the best learning result possible.

One good example of a Deep Learning application is perception: the ability of a machine to mimic humans in recognizing images, to see what objects are in those images, and to teach a robot to understand the world around it and interact with it. Deep learning is the state-of-the-art, emerging technology in Artificial Intelligence. Many applications are possible, including Computer Vision and text/speech recognition with a high degree of accuracy, for example.

Each node in a hidden layer basically performs a series of calculations (mainly matrix operations). It takes inputs from the previous nodes, adjusted by unique biases (on the nodes) and weights (on the edges), then does some calculation to produce an output that is passed on to the next layer. In Deep Learning, there is more than one hidden layer.
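As a rough sketch of what a single hidden layer computes, here is a minimal NumPy example; the layer sizes, the random values, and the ReLU activation are illustrative assumptions, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 4 input features, 3 hidden nodes (purely illustrative).
x = rng.normal(size=(4,))        # inputs from the previous layer
W = rng.normal(size=(3, 4))      # weights on the edges into the hidden layer
b = rng.normal(size=(3,))        # biases on the hidden nodes

# Each hidden node computes a weighted sum of its inputs plus a bias,
# then applies a non-linear activation (ReLU here, as an assumption).
hidden = np.maximum(0, W @ x + b)
print(hidden)                    # output of the hidden layer, fed to the next layer
```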

Handling the dataset

As with our earlier discussion of the Neural Network approach with deep learning, it is good practice to split the dataset into training and test sets, do the modeling (training) with the selected machine learning algorithm on the training dataset, and then test the resulting model with the test data. The split can be something like Training : Testing = 70–80% : 20–30%, assuming the whole dataset comes from the same distribution space.
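A minimal sketch of such a split in Python with scikit-learn (the article's own examples use Matlab; the iris dataset and the 80/20 ratio below are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Any labelled dataset would do; iris is used here only as a stand-in.
X, y = load_iris(return_X_y=True)

# An 80% / 20% split; shuffling keeps both parts in the same distribution space.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)
print(X_train.shape, X_test.shape)
```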

(Kilian Weinberger, 2018) said that the data should be independent and identically distributed (i.i.d.) for Machine Learning algorithms to work. This means the dataset must come from the same distribution space.

If we are building a classifier (Supervised Learning) that distinguishes two types of cars, Japanese or European for instance, taking hundreds or thousands of sample images of both types during the day would be an example of taking data from the same distribution space. However, taking sample images of Japanese cars during the day and sample images of European cars at night would be considered taking samples from different distribution spaces.

Similarly, taking sample images of both Japanese and European cars during the day for the training dataset and taking sample images of both during the night for the testing dataset would also be considered taking samples from different distribution spaces. You get the idea.

The number of samples for a given category should be large enough for a model to generalize properly. In general, a few thousand samples per class (or category) is considered enough for Supervised Learning.

Dimensionality reduction

Dimensionality reduction is suggested as the first thing to explore (with multiple algorithms to select from: PCA, Principal Component Analysis; SVD, Singular Value Decomposition; or LDA, Latent Dirichlet Allocation).

The purpose of dimensionality reduction is to reduce the dimension of the features (reduce the number of features), meaning we select only the required features to train our model (in an Excel-table analogy, features are the columns). We also often hear terms like feature selection and feature extraction when referring to dimensionality reduction.

Why do we need to reduce the dimension of a dataset? Aside from not wasting resources processing unnecessary data, the modeling becomes less complex (and faster) with fewer dimensions. A reduced dimension also makes the data easier to visualize, as humans are generally comfortable working with two or three dimensions at most. More dimensions become harder to reason about, although it is sometimes inevitable that we need to work with more than three dimensions in certain situations.

Note that the accuracy of the model is typically reduced when we reduce the dimension. We need to balance the benefit of reducing the dimension against the cost of reduced accuracy. For example, removing most of the unrelated features may be worth it when the model accuracy is reduced by only a small percentage.

For example, the Matlab documentation says of the PCA dimensionality reduction algorithm: “Principal component analysis is a quantitatively rigorous method for achieving this simplification. The method generates a new set of variables, called principal components. Each principal component is a linear combination of the original variables. All the principal components are orthogonal to each other, so there is no redundant information. The principal components as a whole form an orthogonal basis for the space of the data.”
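As a sketch of dimensionality reduction with PCA in Python with scikit-learn (the article's figures were produced in Matlab; the wine dataset and the choice of two components here are only for illustration):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Any numeric feature matrix would do; the wine dataset is only a stand-in.
X, _ = load_wine(return_X_y=True)           # 13 original features

# PCA assumes centered (often standardized) features.
X_std = StandardScaler().fit_transform(X)

# Keep only 2 principal components: orthogonal linear combinations
# of the original variables, ordered by explained variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                      # (178, 2)
print(pca.explained_variance_ratio_)        # variance explained by each component
```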

Selecting Machine Learning Algorithm(s)

Rather than blindly trying multiple algorithms on the dataset, a Data Scientist (the late Dr. Hui Li) suggested an approach (illustration-1) for working with a dataset by following structured thinking.

Illustration-2: The first 15 rows of the “Credit Rating” dataset. Processed using Matlab version R2019a update 2.

Supervised or Unsupervised Learning?

Once we are done with dimensionality reduction, the next question is whether responses are available or not (responses here mean labels, i.e. data that define a relationship between input and output; if they are available, this is Supervised Learning, which can be either Regression for numerical continuous variables or Classification for categorical variables).

Illustration-3a: Classification on creditrating dataset, with PCA. Note the model accuracy is 70.1%. The graph shows the relation between 2 features: RE_TA and EBIT_TA.
Illustration-3b: Classification on creditrating dataset, without PCA. Note the model accuracy is 73.6%. The graph shows the relation between 2 features: RE_TA and EBIT_TA.

Let's see the first example of Supervised Learning: Classification. Later, we will see a second example of Supervised Learning: Regression.

We will be using a tool called Matlab (version R2019a update 2, trial version) to browse and process the “Credit Rating” dataset. This dataset (provided with Matlab) contains “financial ratios and industry sector information for a list of corporate customers”. The response variable consists of credit ratings (AAA, AA, A, BBB, BB, B, CCC) assigned by a rating agency. The dataset consists of 3932 rows, with: Number of predictors: 6, Number of observations: 3932, Number of classes: 7, Response: Rating.

Illustration-2 shows the first 15 rows of the dataset. Apart from ID, it contains six predictors: Working Capital / Total Assets (WC_TA), Retained Earnings / Total Assets (RE_TA), Earnings Before Interest and Taxes / Total Assets (EBIT_TA), Market Value of Equity / Book Value of Total Debt (MVE_BVTD), Sales / Total Assets (S_TA), and Industry. It also shows the single response: Rating.

The algorithm choices for Regression are: Decision Tree, Linear Regression, Random Forest, Neural Network, and Gradient Boosting Tree. For Classification, the algorithm choices are: Linear SVM (Linear Support Vector Machine), Naïve Bayes, Decision Tree, Logistic Regression, Kernel SVM, Random Forest, Neural Network, and Gradient Boosting Tree.

Illustration-4: ROC curve for classification on dataset without PCA.

If there are no responses (no labeled data), then we go to Clustering (also with multiple algorithms to select from: DBSCAN, Density-Based Spatial Clustering of Applications with Noise; k-means; Gaussian Mixture Models; k-modes; and Hierarchical clustering).

A Supervised Learning Example: Decision Tree with and without PCA

Illustrations 3a and 3b show the result of training a decision tree on our dataset with two options for dimensionality reduction using the PCA algorithm: the first with PCA (3a, PCA ON) and the second without PCA (3b, PCA OFF). It turns out that the accuracy without PCA is still slightly better, by a few percent, than with PCA. The graphs show the relation between the RE_TA feature on the x-axis and EBIT_TA on the y-axis.
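A rough scikit-learn equivalent of this with-PCA versus without-PCA comparison might look like the sketch below; the wine dataset stands in for the credit-rating data (which ships with Matlab), and the 95%-explained-variance threshold for PCA is an assumption:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; the article's credit-rating table ships with Matlab.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Decision tree without PCA.
tree_plain = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Decision tree with PCA in front (keeping components that explain 95% of the variance).
tree_pca = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    DecisionTreeClassifier(random_state=0),
).fit(X_train, y_train)

print("accuracy without PCA:", tree_plain.score(X_test, y_test))
print("accuracy with PCA   :", tree_pca.score(X_test, y_test))
```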

The Matlab documentation says of decision trees for classification and regression: “Decision trees, or classification trees and regression trees, predict responses to data. To predict a response, follow the decisions in the tree from the root (beginning) node down to a leaf node. The leaf node contains the response. Classification trees give responses that are nominal, such as ‘true’ or ‘false’. Regression trees give numeric responses.”

Our dataset contains only 6 predictors. PCA may give better results when there are many predictors.

Referring to illustration-3b, let's take a look at the ROC curve (Receiver Operating Characteristic curve, illustration-4) and the Confusion Matrix (illustration-5).

Illustration-5: Confusion Matrix for classification on dataset without PCA.

A true positive rate of 0.93 for the AAA rating in the ROC curve indicates that the current classifier correctly assigns 93% of those observations to the positive class. The Area Under the Curve (AUC) value (here 96%) is a measure of the overall quality of the classifier; larger AUC values indicate better classifier performance.

On the other hand, see the Confusion Matrix to understand how the currently selected classifier performed in each class (indicated by the green cells). The Confusion Matrix also helps us identify the areas where the classifier has performed poorly.

Look for areas where the classifier performed poorly by examining off-diagonal cells that display high percentages and are colored red. The higher the percentage, the brighter the hue of the cell. In these red cells, the true class and the predicted class do not match: the data points are misclassified.
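As a sketch of how these two diagnostics can be computed in scikit-learn (a binary stand-in dataset is used here, since a multi-class problem like the seven credit ratings needs a one-vs-rest ROC curve per class):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A binary stand-in dataset, only for illustration.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]    # score for the positive class

# Confusion matrix: rows = true classes, columns = predicted classes.
print(confusion_matrix(y_test, y_pred))

# ROC curve points and the area under it (AUC).
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("AUC:", roc_auc_score(y_test, y_score))
```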

A Supervised Learning Example: Linear Regression with and without PCA

Let's see another example, still in Supervised Learning: Regression. We will use another dataset, provided by the UCI Machine Learning Repository: the Abalone dataset. The dataset contains measurements of abalone (a group of sea snails). The task is to predict the age of an abalone, which is closely related to the number of rings in its shell.

Illustration-6 shows the first 10 of the 4177 rows in the Abalone dataset. The dataset itself contains: Number of predictors: 8, Number of observations: 4177, Response: Rings.

Illustration-6: The first 10 rows of the “Abalone” dataset. Processed using Matlab version R2019a update 2.

Illustrations 7a and 7b show the model's error on this dataset measured with RMSE (Root Mean Square Error). In general, a smaller RMSE value is better, although what counts as a good value really depends on a number of things, such as the amount of data (e.g. if the dataset contains many thousands of rows, an RMSE value below 1 may be considered good; the same value may not be considered good enough if the data is not sufficiently large, such as only a few thousand rows, as in our example).

Illustration-7a: Regression on Abalone Dataset with Linear Regression algorithm, with PCA. Note that the achieved RMSE is 2.6072 which is still looks quite high. The graph shows the relation between 2 features: Diameter of Abalone and its number of Rings.
Illustration-7b: Regression on Abalone Dataset with Linear Regression algorithm, without PCA. Note that the achieved RMSE is 2.1499 which is better compared with the one with PCA. The graph shows the relation between 2 features: Diameter of Abalone and its number of Rings.
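A minimal scikit-learn sketch of fitting a linear regression and reporting RMSE, MAE, and R-squared; the diabetes dataset below is only a stand-in, since the Abalone data must be downloaded separately from the UCI repository:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in regression data; the Abalone dataset can be loaded the same way
# once downloaded from the UCI repository.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE is the square root of MSE; R-squared and MAE are reported alongside it,
# as in illustrations 7a and 7b.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)
print("MAE :", mean_absolute_error(y_test, y_pred))
print("R^2 :", r2_score(y_test, y_pred))
```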

Unsupervised Learning Example: Clustering with k-means

Now, let's see a different kind of example: Unsupervised Learning. Recall that in Supervised Learning we have labeled data (observations as input and responses as output). When we do not have, or do not know whether the dataset has, paired observations and responses, one approach is Unsupervised Learning. This time we will take a look at one of the Clustering algorithms, called k-means (in which k is the number of clusters). The purpose of the algorithm is to partition the data into k mutually exclusive clusters.

The technique assigns each observation to a cluster by minimizing the distance from the data point to the mean (or median) location of its assigned cluster. We need to specify a distance metric for the algorithm; the common ones are L1 (Manhattan distance) and L2 (Euclidean distance).
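For two points x and y with p coordinates, these two distances are defined as:

```latex
d_{L1}(x, y) = \sum_{j=1}^{p} \left|x_j - y_j\right|,\qquad
d_{L2}(x, y) = \sqrt{\sum_{j=1}^{p} \left(x_j - y_j\right)^2}
```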

First we generate 1000 random data points (illustration-8a) and apply the k-means algorithm to do clustering with k=2, with the distance metric set to “cityblock”, also known as Manhattan distance (the default is the Euclidean distance metric).

The complete Matlab code is shown in illustration-9. The result of the k-means algorithm is shown in illustration-8b as two clusters. The Matlab documentation says the kmeans function performs k-means clustering to “partition the observations of the n-by-p data matrix X into k clusters, and returns an n-by-1 vector (idx) containing cluster indices of each observation. Rows of X correspond to points and columns correspond to variables.”

Illustration-9: The Matlab code that generates illustrations 8a and 8b: generate random data, then use the k-means algorithm to do clustering.
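For readers without Matlab, a minimal equivalent sketch in Python with scikit-learn is shown below; note that scikit-learn's KMeans supports only the Euclidean (L2) distance, so the “cityblock” metric used in the Matlab code is not reproduced here, and the synthetic data is generated differently:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# 1000 random 2-D points drawn around two loosely separated centers,
# purely synthetic, in the spirit of illustration-8a.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(500, 2)),
    rng.normal(loc=[4, 4], scale=1.0, size=(500, 2)),
])

# Partition the points into k = 2 mutually exclusive clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
idx = kmeans.labels_             # cluster index of each observation
print(kmeans.cluster_centers_)   # the two cluster centroids
```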

Going back to illustration-1, note that k-means is just one of the available algorithm options for Unsupervised Learning: Clustering.

In Summary

We have quickly discussed several algorithms: PCA (Principal Component Analysis) for dimensionality reduction, as well as Decision Tree for Classification and Linear Regression for Regression in Supervised Learning, with a touch of measurement metrics such as RMSE, the Confusion Matrix, and AUC (Area Under the Curve).

In addition, we also discussed the k-means algorithm for clustering in Unsupervised Learning.

We visualized the illustrations using Matlab, one of the typical tools that many Data Scientists working in Machine Learning are familiar with.

Nowadays, though, there are various tools available that can be coded in either the Python or R programming languages. Python is mostly used by Computer Scientists, and R is mostly used by Statisticians.

Sometimes other programming languages are also used for Machine Learning; Julia and Scala are two examples.

IBM Solutions for Data Science & Machine Learning

There are several options in the IBM Hardware & Software portfolio for working with Machine Learning algorithms, either for Modeling (developing a model) or Inferencing (deploying/running a model). Some of them are as follows (indicated by HW for Hardware- and SW for Software-based solutions):

[HW] IBM POWER AC922 (Accelerated Computing) + IBM Watson Machine Learning Accelerator (WMLA, previously known as PowerAI) on IBM POWER Systems + [SW] PowerAI Vision with custom models. This is for Supervised Learning, either Classification or Regression, for vision (images, videos). POWER: Performance Optimization With Enhanced RISC (Reduced Instruction Set Computing).

WMLA is available as an on-premise solution for modeling and inference, and comes in very scalable configurations equipped with multiple high-end NVIDIA Tesla V100 GPUs. Inferencing at the edge is also available with the Xilinx Alveo U200 Field Programmable Gate Array (FPGA). WMLA and PowerAI Vision are good for general users and less experienced developers.

[HW] IBM POWER AC922 + WMLA + [SW] H2O Driverless AI. This is also mostly for Supervised Learning, either Classification or Regression (text-based, images, videos), and more, with extensive parameter settings and explainability. Good for advanced users and experienced developers or Data Scientists.

[SW] IBM Watson OpenScale. The platform to choose when we want scalable model deployment (Inferencing), running a few or many models in parallel, with capabilities for tracking performance in a production environment, model tuning, and explainability (explaining why the AI made certain decisions). Available in IBM Cloud.

[SW] IBM Watson Data Platform (IBM WDP, available in the Cloud or On-Premise, or On-Premise in combination with IBM WMLA or [SW] IBM Cloud Private (ICP)). This is for those who would like deep customization (using IBM Watson Studio) and the greatest flexibility in implementing Machine Learning. The best fit for collaboration between expert data analysts, experienced Data Engineers, and Data Scientists.

A set of tools is provided in IBM WDP for data pre-processing (such as data manipulation/cleansing and data augmentation) before the data is processed by Machine Learning algorithms, as well as a Data Catalog for managing the datasets used to train models with Machine Learning algorithms.

[SW] IBM Cloud Private for Data (ICPD, available on-premises as well as in the cloud) is a cloud-native solution that enables users to put lots of data to work quickly and efficiently, generating meaningful insights that can help users avoid problems and reach their goals, while keeping the data private. The best fit for experienced Data Scientists/Analysts and experienced developers.

[SW] IBM Watson Studio (part of IBM WDP, available in the Cloud or On-Premise, or On-Premise in combination with IBM WMLA or ICP). This is for those who would like deep customization and the greatest flexibility in implementing Machine Learning. The best fit for expert or experienced developers as well as Data Scientists.

[SW] IBM Neural Network Modeler (available in IBM Cloud). This is for those who would like to work with guided tools when implementing Machine Learning. Good for advanced users or advanced developers as well as Data Scientists.

What’s next then?

Being a Data Scientist (or an aspiring one) requires continuous learning and lots of curiosity, to keep exploring and working with diverse datasets and algorithms for use-case implementations across many industries.

Just as scientists in other disciplines often dedicate their lifetimes to pursuing their area of research focus in order to improve on the state of the art, it may seem almost impossible to achieve the defined goals at the beginning of the journey. However, with strong persistence and a lot of patience, at the end of the road, although not always, all of the effort will be worth it.

Well, let's get started by doing something. And the right time is now!

References:

Andi Sama et al., 2018, “Deep Learning — Image Classification, Cats & Dogs — A Cognitive use-case: Implement a Supervised Learning for Image Classification”, https://web.facebook.com/SinergiWahanaGemilang/posts/1554879257928655, Q1 2018 edition, Accessed online on May 9, 2019 at 4:37 PM.

Andi Sama, 2018, “Processing Handwritten digit (mnist dataset)”, https://github.com/andisama/mnist, Accessed online on May 9, 2019 at 4:44 PM.

Andrew Widjaya, Cahyati S. Sangaji, 2019, “Face Recognition, Powered by IBM Cloud, Watson & IoT on Edge”, Q2 2019 edition, https://www.facebook.com/SinergiWahanaGemilang/posts/2152245434858698, Accessed online on May 10, 2019 at 7:18 PM.

Hui Li, 2017, “Which machine learning algorithm should I use?”, https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/, Accessed online on May 22, 2019 at 11:10 AM.

Kilian Weinberger, 2018, “CS4780 — Machine Learning for Intelligent Systems — Cornell Computer Science”, Cornell University, Fall 2018, http://www.cs.cornell.edu/courses/cs4780/2018fa/, Accessed online on May 31, 2019 at 7:48 PM.

MathWorks, 2019, “Matlab Documentation, R2019a”, MathWorks, https://www.mathworks.com/help/matlab/, Accessed online on June 5, 2019 at 2:23 AM.
