Multimodal artificial intelligence
Science & Technology
14th Oct, 2023
A report by “The Information” revealed that Google’s yet-to-be-released multimodal large language model, ‘Gemini’, was already being tested at a number of companies.
OpenAI is also reportedly working on a new project called ‘Gobi’, which is expected to be built as a multimodal AI system from scratch, unlike the GPT models.
About multimodal AI:
- Multimodal AI combines different types of information like text, images, and audio to perform various tasks, such as detecting hateful memes or predicting dialogue lines in videos.
- Models like OpenAI's DALL·E use this approach to generate images from text prompts, by learning patterns that connect visual data with image descriptions.
- In the case of audio, OpenAI's Whisper, a speech recognition and translation model, enables a system to recognize speech in audio and convert it into plain text.
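
The idea of "finding patterns that connect visual data with image descriptions" can be pictured as placing text and images in one shared vector space, where a caption and its matching image end up close together. The sketch below is a toy illustration of that general idea only, not of how any of the models named above is implemented. The embedding vectors are hand-written, hypothetical values (a real system would learn them from data), and the names `image_embeddings`, `text_embeddings`, and `best_image_for` are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written embeddings. In a real multimodal model these
# vectors would be produced by trained image and text encoders.
image_embeddings = {
    "dog_photo": [0.9, 0.1, 0.0],
    "car_photo": [0.1, 0.9, 0.1],
    "beach_photo": [0.0, 0.2, 0.9],
}

text_embeddings = {
    "a dog playing fetch": [0.8, 0.2, 0.1],
    "a red sports car": [0.2, 0.8, 0.0],
}


def best_image_for(caption):
    """Pick the image whose embedding is most similar to the caption's."""
    query = text_embeddings[caption]
    return max(
        image_embeddings,
        key=lambda name: cosine_similarity(query, image_embeddings[name]),
    )


if __name__ == "__main__":
    for caption in text_embeddings:
        print(caption, "->", best_image_for(caption))
```

Because both kinds of data live in the same space, one similarity measure is enough to compare a sentence with a picture, which is what makes cross-modal tasks such as text-to-image retrieval possible.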
Applications of Multimodal AI:
- Meta introduced a complex open-source AI system called ImageBind, which incorporates text, visual data, audio, temperature, and movement readings.
- This system hints at the possibility of future AI including more sensory data like touch, smell, and brain signals.
- Industries like medicine and autonomous driving benefit from multimodal AI.
- It helps analyze complex datasets, for example when identifying rare genetic variations or processing CT scans.
- Additionally, speech translation models like Google Translate use multiple modes for efficient translation across different languages.