Learn to Learn like a Child: Few-Shot Learning for Image Classification

Andi Sama
7 min read · Jan 17, 2022

Few-Shot Learning, a part of Meta-Learning and an advancement over the traditional deep learning approach to Image Classification

Andi Sama, CIO, Sinergi Wahana Gemilang

TL;DR:
- A typical approach to Image Classification with k classes is to collect a lot of sample images for each class, hundreds or even thousands of images, which makes image collection and labeling a huge challenge.
- Few-shot learning offers an alternative: given a query image and a k-way n-shot Support Set of reference images, in which each of the k classes (ways) contains only very few images (n shots), we can find which class the query image belongs to by computing the similarities between the feature vector of the query image and the mean feature vector of each class.
An image. Source: Google Image Search.

While visiting the zoo, a child may wonder, “What is the name of this animal?” She has never seen an animal like that before in her life. Then, her parent shows her a few labeled cards containing images. The child has never seen any of the animals on the cards either.

Take a look at the following reference cards containing images of four categories (tiger, panda, giraffe, and zebra).

The four categories of animals in the cards, with one image per category. Image Source: Google Image Search.

Amazingly, most children can quickly relate and conclude that the animal is most similar to the image labeled “Giraffe” among the reference images. Notice how a child can learn from seeing only one sample of a giraffe. This is the basic idea of Few-Shot Learning in recent deep learning advancements for Image Classification.

Traditional Approach for Image Classification

The traditional deep learning approach to Image Classification requires a lot of training images to achieve a model with high accuracy. For example, even a simple four-class Image Classification task requires hundreds of images per class, 250 images per class on average in this example (Andi Sama, Arfika Nurhudatiana, 2019b).

To bring the trained model to an acceptable production level with better accuracy, it is not uncommon to require an even larger dataset, on the order of hundreds to thousands of images per class: a huge challenge for data collection and labeling effort. The available images are usually further divided into training:validation:testing subsets by some percentage, say 60%:20%:20% or 70%:15%:15%.
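
As an illustration, here is a minimal sketch of such a split in Python, assuming the dataset is a plain list of (image_path, label) pairs; split_dataset and its parameters are hypothetical names, not from a specific library:

import random

def split_dataset(samples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle and split a list of (image_path, label) pairs."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = int(len(samples) * train_frac)
    n_val = int(len(samples) * val_frac)
    return (samples[:n_train],                 # e.g., 70% for training
            samples[n_train:n_train + n_val],  # 15% for validation
            samples[n_train + n_val:])         # remaining 15% for testing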

ResNet has been one of the deep learning architectures that can achieve high accuracy, given a sufficient dataset. However, beyond a certain number of images, accuracy cannot be improved further even if we add more training images (Shusen Wang, 2021).

In deep learning for Image Classification, the goal is a model that generalizes from the training data, meaning that the trained model can make correct predictions on new, unseen images of the classes it was trained on.

The cost of training a deep learning model with lots of images (thousands to millions of images) is high. The cost can be in the form of time, GPU usage, and overall system requirements.

Learn to Learn with Few-Shot Learning

Instead of training on a bunch of images for every class, we compute similarity scores between the input image (the “Query”) and a list of images (the “Support Set”) that holds very few samples per class, as low as just a single sample. A class means an image category, e.g., anime man, anime woman, real man, and real woman as in the reference example (Andi Sama, Arfika Nurhudatiana, 2019b).

This is few-shot learning: the problem of making predictions based on a limited number of samples. The number of shots can be as low as one (only one sample image per class).

The goal of few-shot learning is NOT to obtain a generalized model as in the traditional deep learning approach. Instead, the model learns to learn based on similarities and differences between feature vectors.

Initially, a few-shot learning model is trained on a considerably larger dataset (such as mini-ImageNet or Omniglot) to learn similarities and differences between objects. Later, when we make a query by providing an input image (the Query), the model can tell which class in the Support Set the Query belongs to, even if we provide only a limited set of new images per class, as low as just one image.

The model can tell that the following two images look alike (have a high similarity score). The model does not know whether these images fall into a category such as Anime_Woman, for example.

Two images of Anime_Woman. Image Source: Google Image Search.

Ideally, for the above two images, the following similarity function returns 1.

image_sim = sim(image_1, image_2)
print(image_sim)
1.00
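
The sim() function above is only illustrative. One common way to realize it is cosine similarity between the two images' feature vectors (how to obtain the feature vectors is sketched later in this article). A minimal sketch with NumPy, where cosine_sim is a hypothetical helper:

import numpy as np

def cosine_sim(v1, v2):
    """Cosine similarity between two feature vectors: close to 1.0 for
    very similar vectors, close to 0.0 for unrelated ones."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))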

And the model can also tell that the following two images are somewhat different. The model does not know whether the first image is in the Anime_Woman class and the second is in the Real_Woman class.

The left image is in the Anime_Woman class, and the right is in the Real_Woman class. Image Source: Google Image Search.

Ideally, for the above two images, the following similarity function returns 0.

image_sim = sim(image_1, image_2)
print(image_sim)
0.00

Query and Support Set

The Query is simply an input image that we want to compare against a list of images from different categories/classes.

A list of images (the Support Set) can contain several classes (each class is called a “way”), and each class can include one or more images (each image is called a “shot”). For example, if we have a list of 8 different classes containing 2 images each, this is an 8-way 2-shot Support Set.
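
For concreteness, a Support Set can be represented as a simple mapping from class names to sample images. A hypothetical 4-way 2-shot example in Python (the file names are made up):

# A hypothetical 4-way 2-shot Support Set: 4 classes ("ways"),
# with 2 sample images ("shots") per class.
support_set = {
    "Anime_Man":   ["anime_man_1.jpg",   "anime_man_2.jpg"],
    "Anime_Woman": ["anime_woman_1.jpg", "anime_woman_2.jpg"],
    "Real_Man":    ["real_man_1.jpg",    "real_man_2.jpg"],
    "Real_Woman":  ["real_woman_1.jpg",  "real_woman_2.jpg"],
}
k_way = len(support_set)                        # 4 classes ("ways")
n_shot = len(next(iter(support_set.values())))  # 2 images per class ("shots")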

Let’s take a look at the following two examples. The first example is a Query to a 4-way 1-shot Support Set. The second is a Query to a 4-way 2-shot Support Set.

An illustration of one Query Image and a 4-way 1-shot Support Set. Some of the images are from Google Image Search.
An illustration of one Query Image and a 4-way 2-shot Support Set. Some of the images are from Google Image Search.

Let’s explore more by illustrating similarity scores between the Query and the Support Set. Similarity scores are calculated between the Query and each class within the Support Set.

An example of a query against a 4-way Support Set with only one image per class. The highest similarity score between the query image's feature vector and the feature vector of a class within the Support Set indicates the chosen class, in this case “Anime_Man”.

The answer for the above 4-way 1-shot Support Set is the class with the highest similarity score between the Query and all the classes; in this example, Anime_Man with a similarity score of 0.70.

An example of a query against a 4-way Support Set with 2 sample images per class. The highest similarity score between the query image's feature vector and the respective mean vector of a class within the Support Set indicates the chosen class, in this case “Real_Woman”.

Then, the answer for the above 4-way 2-shot Support Set is the class with the highest similarity score between the Query and all the classes; in this example, Real_Woman with a similarity score of 0.65.

Training a Few-Shot Learning Model

There are two basic steps to train a few-shot model: (1) pretraining and (2) making a few-shot prediction.

1st Step: Pretraining. Pretrain the model on a large dataset, for example a CNN (Convolutional Neural Network) trained with Supervised Learning. Then, use the CNN for feature extraction.
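
As a sketch of this step, the snippet below takes an ImageNet-pretrained ResNet-18 from torchvision and removes its final classification layer, leaving a feature extractor. This stands in for the pretrained CNN described above and is only one possible choice:

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ImageNet-pretrained ResNet-18 with its final fully-connected
# (classification) layer removed, leaving a feature extractor.
resnet = models.resnet18(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path):
    """Map one image file to a 512-dimensional feature vector."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return feature_extractor(image).flatten().numpy()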

2nd Step: Making a Few-Shot Prediction. Given the Query and the Support Set, the model needs to predict which class (within the Support Set) the Query belongs to. The prediction can be made using similarity scores:

  • Map the images in the Query and the Support Set to feature vectors.
  • Average the feature vectors within each class to obtain the mean for each class. If the Support Set has k classes, then we will have k mean vectors.
  • Compare the query feature vector with the mean vector of each of the k classes using cosine similarity to get the similarity scores (a sketch of the whole procedure follows this list).
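
Putting the three steps above together, here is a minimal sketch of the prediction, reusing the hypothetical extract_features() and cosine_sim() helpers from the earlier sketches:

import numpy as np

def predict(query_path, support_set):
    """Return the predicted class and all similarity scores for a Query."""
    # One mean feature vector per class: k classes -> k mean vectors.
    class_means = {
        cls: np.mean([extract_features(p) for p in paths], axis=0)
        for cls, paths in support_set.items()
    }
    query_vec = extract_features(query_path)
    scores = {cls: cosine_sim(query_vec, mean)
              for cls, mean in class_means.items()}
    # The class with the highest cosine similarity is the prediction.
    return max(scores, key=scores.get), scores

# Hypothetical usage:
# predicted_class, scores = predict("query.jpg", support_set)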

As a note, it is good to add a fine-tuning step between the first and second steps. We can gain a few percentage points of prediction accuracy with this additional fine-tuning. Fine-tuning can be done by combining cosine similarity with a softmax classifier, for example.
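
As a rough sketch of what such a classifier head could look like (an assumption about the setup, not necessarily the exact fine-tuning recipe), a softmax turns the k cosine similarity scores into class probabilities; during fine-tuning, the class mean vectors and the scaling factor would typically be treated as learnable parameters:

import numpy as np

def softmax_over_similarities(sims, scale=10.0):
    """Turn k cosine-similarity scores into k class probabilities.
    scale is a hypothetical sharpening factor: cosine scores live in a
    narrow range, so scaling spreads the softmax output apart."""
    logits = scale * np.asarray(sims, dtype=float)
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()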

Prediction Accuracy

When making predictions with few-shot learning, the more classes there are (k in k-way), the lower the prediction accuracy; the more sample images there are in each class (n in n-shot), the higher the prediction accuracy.
