Multimodal artificial intelligence

Context

A report by “The Information” revealed that Google’s yet-to-be-released multimodal large language model, ‘Gemini’, was already being tested at a number of companies.

OpenAI, meanwhile, is reportedly working on a new project called ‘Gobi’, which is expected to be built as a multimodal AI system from scratch, unlike its GPT models.

About

About multimodal AI:

Multimodal AI combines different types of information like text, images, and audio to perform various tasks, such as detecting hateful memes or predicting dialogue lines in videos.
Models like OpenAI's DALL·E use this approach to generate images from text prompts, by learning patterns that connect visual data with image descriptions.
In the case of audio, OpenAI's Whisper, a speech recognition model that can also translate speech, recognizes speech in audio and converts it into plain text.
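
To make the idea of "patterns that connect visual data with image descriptions" concrete, here is a minimal toy sketch in Python. It assumes that images and captions have already been mapped into one shared vector space, and it matches them by cosine similarity. The embedding values, file names, and the `best_caption` helper are hypothetical and written by hand for illustration; a real multimodal model learns such vectors from large datasets, and this is not a description of how DALL·E or Whisper work internally.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written embeddings. A trained model would produce
# these vectors from raw text and raw pixels instead.
TEXT_EMBEDDINGS = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a recording of rain": [0.0, 0.1, 0.9],
}
IMAGE_EMBEDDINGS = {
    "dog.jpg": [0.8, 0.2, 0.1],
    "cat.jpg": [0.2, 0.8, 0.1],
}


def best_caption(image_name):
    """Pick the caption whose vector lies closest to the image's vector."""
    image_vec = IMAGE_EMBEDDINGS[image_name]
    return max(
        TEXT_EMBEDDINGS,
        key=lambda caption: cosine_similarity(image_vec, TEXT_EMBEDDINGS[caption]),
    )


print(best_caption("dog.jpg"))  # prints "a photo of a dog"
```

Because both modalities live in the same space, the same similarity function could in principle compare text with audio or any other kind of input, which is the core intuition behind combining modalities in one system.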

Applications of Multimodal AI:

Meta introduced a complex open-source AI system called ImageBind, which incorporates text, visual data, audio, temperature, and movement readings.
- This system hints at the possibility of future AI including more sensory data like touch, smell, and brain signals.
Industries like medicine and autonomous driving benefit from multimodal AI.
- It helps analyze complex datasets in areas like identifying rare genetic variations and processing CT scans.
- Additionally, translation services like Google Translate combine multiple modes, such as text, speech, and images, to translate efficiently across different languages.