Infusing Intelligence — AI, The Integral Building Block of the Metaverse

Andi Sama
8 min read · Jul 20, 2022


Artificial Intelligence as part of Metaverse’s decentralization and human interface layers

Andi Sama CIO, Sinergi Wahana Gemilang with Cahyati S. Sangaji & Andrew Widjaja

In Summary
- Transformer-based algorithms are among the recent developments in deep learning that support advancements in Language Modeling (e.g., NLP, Natural Language Processing) as well as Image Processing (with the Vision Transformer).
- Semantic Segmentation, Instance Segmentation, and Sequence Modeling provide the intelligence behind the Metaverse.

Artificial Intelligence (AI) has been one of the technology enablers for the Metaverse. Recent advancements in Deep Learning, a part of AI, allow Metaverse creators to be creative in designing current and future releases of Virtual Reality (VR)-, Augmented Reality (AR)-, or Mixed Reality (MR)-based Metaverses.

Deep Learning, combined with advanced Image Processing and Sequence Modeling, opens up various innovative VR/AR-based implementations. Deep Learning is a subfield of Machine Learning, which in turn is a subfield of Artificial Intelligence within Computer Science.

Machine Learning and Deep Learning

With AI, we try to mimic how humans predict things by building an AI model trained on a particular dataset.

Machine Learning

Traditional Machine Learning (ML) builds a model by finding the relationship between the input dataset and the target output dataset (function approximation). If we have both input data and target labels, this is Supervised Learning. Example applications are regression (predicting continuous variables) and classification (predicting discrete variables).

If we only have the dataset without target labels, we can still build an AI model to find patterns in the data (Unsupervised Learning); clustering is one such use-case.
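
Supervised learning as function approximation can be sketched in a few lines. The example below fits a line to noisy labeled data with ordinary least squares; the data and the true slope/intercept are made up for illustration.

```python
import numpy as np

# Supervised regression as function approximation (a minimal sketch):
# learn y ≈ w*x + b from labeled (input, target) pairs.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=50)  # noisy targets, true w=3, b=2

# Closed-form least squares: stack [x, 1] columns and solve for [w, b].
A = np.stack([x, np.ones_like(x)], axis=1)
w, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(w, b)  # close to the true slope 3.0 and intercept 2.0
```

A deep neural network plays the same role with a far more flexible function family, learned by gradient descent instead of a closed-form solve.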

Deep Learning

With the advancements in technologies in the early 2010s (algorithms, the ImageNet dataset, Big Data, and Graphics Processing Units — GPUs), the Deep Learning approach has gained popularity. Deep Learning learns directly from the dataset and, after training, stores the relationship between a set of input data and a set of output data in its deep neural network structure.

Once trained, an AI model can predict the output with a certain confidence level, given similar input within the same data domain — for example, recognizing a face by name with a 99.76% confidence level. Another example is object (or action) detection in a video stream, recognizing objects or short actions within a scene.
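
A classifier's "confidence level" typically comes from a softmax over the network's raw output scores (logits). The sketch below uses made-up logits and identity names to show how a prediction and its confidence are read off.

```python
import numpy as np

# Hypothetical logits from a face-recognition head over three known identities.
names = ["alice", "bob", "carol"]
logits = np.array([2.0, 9.5, 1.0])

# Softmax turns logits into probabilities (the "confidence level").
probs = np.exp(logits - logits.max())
probs /= probs.sum()

best = int(np.argmax(probs))
print(names[best], float(probs[best]))  # predicted name with its confidence
```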

Thus, AI with neural networks (deep learning) is commonly known as an “excellent function approximation — when they have training data” (Ava Soleimany, 2022).

In the initial years of deep learning development, most CNN-based algorithms in supervised learning required massive datasets. Collecting and labeling the data became a considerable, time-consuming challenge.

A CNN is a Convolutional Neural Network. It extracts features from an image by sliding a small filter kernel (a square of pixels, e.g., 3×3) across the image, learning the filter weights through backpropagation.
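
The sliding-filter idea can be shown directly. The sketch below implements a plain 2D convolution (strictly, the cross-correlation that deep learning frameworks compute) with a 3×3 vertical-edge kernel; the tiny image is made up.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a small kernel over the image (valid padding, stride 1)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A toy image with a vertical edge, and a vertical-edge-detector kernel.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)
print(conv2d(image, kernel))  # every output entry is 3.0: the filter fires on the edge
```

In a real CNN the kernel values are not hand-designed like this; they are the weights that backpropagation learns.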

A significant recent development in self-supervised learning opens the possibility of developing AI models without annotation (or labeling).

There is also a newer kind of algorithm, the transformer, for processing sequential data (e.g., language modeling for NLP, Natural Language Processing). The Vision Transformer is a variation of the transformer for image processing. The requirement for large datasets is still there; however, in the latest developments in self-supervised learning (Meta, 2022), the training objective is computed automatically from the input data, meaning humans do not need to label it.
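
The core of a transformer is scaled dot-product self-attention, in which every token mixes information from every other token in the sequence. Below is a minimal single-head sketch in NumPy; the dimensions and random weights are purely illustrative.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (the transformer core)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])           # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ V                               # mix values by attention

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                             # one updated vector per token
```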

Image Processing in Metaverse

Understanding the semantic meaning in an image (or stream of images, e.g., video) is one of the advancements in deep learning that contributes to Metaverse development.

Semantic Segmentation and Instance Segmentation

The following illustrates a snapshot of Samsung 837X in the Decentraland Metaverse. Samsung launched 837X at the Consumer Electronics Show (CES) in January 2022 — see “CES 2022 — Mencermati Berbagai Perkembangan dan Inovasi Teknologi” (Andi Sama, 2022c). Using a semantic segmentation model in deep learning, we can then classify each pixel in the image into a specific label (e.g., people, building, land, door, window, staircase, neon sign, pole, etc.).

A snapshot shows an avatar at the Samsung 837X building in the Decentraland Metaverse.
Semantic segmentation can classify each pixel into a specific label, e.g., people, building, land, door, etc. Semantic segmentation by (Meta, 2022).
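
Mechanically, a semantic-segmentation model outputs a score per class for every pixel, and the label map is the argmax across classes. A toy sketch (the class names and random scores here are hypothetical stand-ins for a real network's output):

```python
import numpy as np

# A segmentation head outputs per-pixel class scores of shape
# (num_classes, H, W); the label map is the argmax over the class axis.
classes = ["land", "building", "people"]
rng = np.random.default_rng(1)
scores = rng.normal(size=(len(classes), 4, 4))   # hypothetical network output
label_map = scores.argmax(axis=0)                # one class index per pixel
print(label_map.shape)                           # same spatial size as the image
```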

There is also the instance segmentation model in deep learning. If there are three people in a scene, instance segmentation recognizes them individually (person-1, person-2, and person-3) rather than as just "people," as semantic segmentation does (person).
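
One simple way to see the difference: given a binary "people" mask from semantic segmentation, separating it into connected components yields individual instances. This is only a toy stand-in for a real instance-segmentation model (e.g., Mask R-CNN); the mask below is made up.

```python
import numpy as np
from collections import deque

def label_instances(mask):
    """Split a binary mask into 4-connected components, one id per instance."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            if mask[i, j] and labels[i, j] == 0:
                current += 1                     # new instance found
                labels[i, j] = current
                queue = deque([(i, j)])
                while queue:                     # flood-fill the whole blob
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                                and mask[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = current
                            queue.append((ny, nx))
    return labels, current

# Two separate "people" blobs in one semantic mask.
mask = np.array([[1, 1, 0, 0],
                 [1, 0, 0, 1],
                 [0, 0, 1, 1]], dtype=bool)
labels, n = label_instances(mask)
print(n)  # 2 instances: person-1 and person-2
```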

By recognizing each pixel in the image, the AR/VR designer can be creative, e.g., replacing (or overlaying) an object with a completely different object in real-time.

The following video shows an example of a potential implementation, shown in mid-2021 (The Metaverse Park) by Delta Reality. The demo was shown at the Niantic Lightship Global Jam event.

Find Similarity, Given an Image

Deep learning can find similar images by computing a "distance" between their embeddings. The closer the distance (or, equivalently, the higher the cosine similarity), the more similar the images are. The following shows an example in which, given an image, we search for similar images.
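
An image similarity search reduces to ranking embedding vectors by cosine similarity. The embeddings below are hypothetical hand-picked vectors; in practice they would come from a vision model's feature layer.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical image embeddings (stand-ins for a vision model's features).
query   = np.array([0.9, 0.1, 0.0])
gallery = {"cat_photo":  np.array([0.8, 0.2, 0.1]),
           "city_photo": np.array([0.0, 0.1, 0.9])}

# Rank gallery images by similarity to the query; higher = more similar.
ranked = sorted(gallery, key=lambda k: cosine_similarity(query, gallery[k]),
                reverse=True)
print(ranked[0])  # → cat_photo
```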

First Example: Our image is given as input to the image similarity algorithm.
Given the input image, do an image similarity search (Meta, 2022).
Second Example: Our image is given as input to the image similarity algorithm.
Do an image similarity search (Meta, 2022).

The interested reader may refer to the article “Learn to Learn like a child — A Few Shots Learning for Image Classification” (Andi Sama, 2022b).

Find Similarity in Part of an Image

The similarity search can also target part of an image. In the following example, we select the nose of the cat. Then, given a new image (a rabbit), we search for similarity only for that selected part of the object (a patch similarity search). The closer the distance between the two patches, the stronger the marked similarity (here, marked in red).

Given the input image, select part of the image, then do a patch similarity search for only that part of the image (Meta, 2022).
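
A patch similarity search can be sketched by sliding the query patch over the target image and scoring each location with cosine similarity; the peak of the resulting map marks the best match (the red region in the figure above). The image here is random noise and the patch is cut from it, so the peak should land where the patch came from.

```python
import numpy as np

def patch_similarity_map(image, patch):
    """Score every patch-sized window of `image` against `patch` by cosine similarity."""
    ph, pw = patch.shape
    p = patch.ravel()
    h = image.shape[0] - ph + 1
    w = image.shape[1] - pw + 1
    sim = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = image[i:i+ph, j:j+pw].ravel()
            denom = np.linalg.norm(window) * np.linalg.norm(p)
            sim[i, j] = window @ p / denom if denom else 0.0
    return sim

rng = np.random.default_rng(2)
image = rng.uniform(size=(6, 6))
patch = image[2:4, 3:5].copy()      # "nose" region used as the query patch
sim = patch_similarity_map(image, patch)
peak = np.unravel_index(sim.argmax(), sim.shape)
print(peak)                         # the location the patch was taken from
```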

Sequence Modeling in Metaverse

Imagine if we could speak in one language and the people we talk to could understand us in their own languages, in real time. This universal translator is an advancement in Natural Language Processing (NLP) that we have already seen to some extent in reality (not just in the Metaverse), and it is expected to get better in the coming years (Meta AI, 2022).

“Universal Language Translator” enables cross-nation communications. Everyone speaks their language, and the other party hears and responds with their local language, including Bahasa Indonesia. In the future, Smart AR glasses or Smart contact lenses may serve as the communication medium (Meta AI, 2022).
A two-person interaction: a discussion on project progress in the field. The reviewer uses Smart AR glasses to inspect a document written in a foreign language he does not understand; the glasses provide real-time translation for understanding the document (Meta AI, 2022).

AIoT — AI and IoT

We live in a modern world. The invention of the smartphone has enabled better communication and allows us to enjoy better lives in society.

Sensors and actuators have been part of our lives for many years. As technology advances, they are getting smarter, with intelligent controllers and gateways, and are even connected to cloud-based systems to enable better collaboration.

Intelligent sensors, actuators, and smartphones are part of the Internet of Things (IoT). They are intelligent devices that collect data, do some local processing, and send the information to a central application (typically cloud-based). These smart devices can also receive commands from the cloud application and execute them accordingly.

For instance, a trained AI model for object detection can be deployed to these intelligent devices, enabling what we call AIoT — Artificial Intelligence on IoT devices.
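
The collect–process–send cycle of an AIoT device can be sketched as below. Everything here is a hypothetical stand-in: `run_local_model` fakes an on-device detector, and the "cloud" is just a returned JSON payload (a real device might publish it over MQTT instead).

```python
import json

def run_local_model(frame):
    """Stand-in for on-device AI inferencing (e.g., an object detector)."""
    return [{"label": "person", "confidence": 0.97}] if frame else []

def device_cycle(frame):
    detections = run_local_model(frame)          # 1. local processing
    message = json.dumps({"detections": detections})
    # 2. A real device would publish `message` to a cloud broker and
    #    listen for commands coming back; here we just return the payload.
    return message

payload = device_cycle(frame=[[0, 1], [1, 0]])   # a made-up camera frame
print(json.loads(payload)["detections"][0]["label"])  # → person
```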

Modern AR/VR headsets, haptic devices, and wrist trackers are intelligent devices that can run AI models locally (on-device AI inferencing: running trained AI models on the device itself).

Various AR/VR Headsets, including a wrist tracker. Image source: Google Image Search.
Personal VR Omni-Directional Treadmill (KATVR, 2022).
bHaptics: Haptic Glove — (will be available in Q4 2022).

Soon, with advancements in wireless technology that promise significant improvements in network speed (e.g., towards 6G), on-device AI inferencing may no longer be needed. The cloud can do the AI inferencing instead, bringing the cost of intelligent devices down (no need for specialized hardware such as a GPU — Graphics Processing Unit — or an ASIC — Application-Specific Integrated Circuit — to run the AI models).

AI and Metaverse

Many technologies support the Metaverse ecosystem (which in turn creates business and investment opportunities). Artificial Intelligence is one of them.

In “The seven layers of the metaverse” (Jon Radoff, 2021a), the role of AI lies within the decentralization and human interface layers.

The foundation infrastructure covers layers 1–6, with layer 1 at the bottom. It includes technologies such as 5G and 6G, GPUs, smartphones, wearables, Blockchain, Edge Computing, VR, AR, XR, and various design tools.

The top layer (layer 7) is the user experience layer, in which we experience various activities in the Metaverse (playing games, socializing, shopping, watching movies, and many more).

Many recent advancements in AI (Sergi Castella i Sape, 2022) may open up an exciting future for Metaverse. A few of them are:

  • evojax (February 2022): a “library for hardware-accelerated neuroevolution.”
  • Gradients without Backpropagation (February 2022) — a somewhat strange idea, as we have been doing backpropagation for a long time; however, human brains seem to communicate only in forward mode.
  • Hierarchical Perceiver (February 2022) — a “new version of the Perceiver, which was a Transformer-based approach that could be applied to arbitrary modalities as long sequences (up to 100k!) of tokens: vision, language, audio-visual tasks.”
  • MuZero (February 2022 — Deepmind): for “video compression.”
  • PyTorch’s released TorchRec (February 2022 — Meta): a “domain library built to provide common sparsity & parallelism primitives needed for large-scale recommender systems.”