Multimodal Large Language Models (MLLMs) have recently become a rising research hotspot: they use powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and OCR-free math reasoning, are rare in traditional methods, suggesting a potential path to artificial general intelligence. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the formulation of MLLM and delineate its related concepts. Then, we discuss the key techniques and applications, including Multimodal Instruction Tuning (M-IT), Multimodal In-Context Learning (M-ICL), Multimodal Chain of Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR). Finally, we discuss existing challenges and point out promising research directions. Given that the era of MLLMs has only just begun, we will keep updating this survey and hope it inspires more research. An associated GitHub link collecting the latest papers is available at

  1. Introduction
    Recent years have seen remarkable progress in large language models [1–4]. By scaling up data size and model size, these LLMs exhibit amazing emergent abilities, typically including In-Context Learning (ICL) [5], instruction following [4, 6], and Chain of Thought (CoT) [7]. Although LLMs have demonstrated surprising zero/few-shot reasoning performance on most Natural Language Processing (NLP) tasks, they are inherently “blind” to vision since they can only understand discrete text. Concurrently, large vision foundation models are making rapid progress in perception [8–10], while their traditional combination with text focuses more on modality alignment [11] and task unity [12], developing slowly in reasoning. In light of this complementarity, unimodal LLMs and vision models have been converging, ultimately giving rise to the new field of MLLM. Formally, an MLLM refers to an LLM-based model with the ability to receive and reason over multimodal information. From the perspective of developing Artificial General Intelligence (AGI), MLLMs may take a step forward from LLMs for the following reasons: (1) MLLMs are more in line with the way humans perceive the world. We humans naturally receive multisensory inputs that are often complementary and cooperative, so multimodal information is expected to make MLLMs more intelligent. (2) MLLMs offer a more user-friendly interface. Thanks to the support of multimodal input, users can interact and communicate with an intelligent assistant in a more flexible way. (3) MLLMs are more well-rounded task-solvers. While LLMs can typically perform NLP tasks, MLLMs can generally support a larger spectrum of tasks.
    GPT-4 [2] ignited a research frenzy over MLLMs because of the amazing examples it demonstrated. However, GPT-4 has not opened its multimodal interface, and no information about the model has been made public up to now. In spite of this, the research community has made many efforts to develop capable and open-sourced MLLMs, which have exhibited some surprising practical capabilities, such as writing website code based on images [13], understanding the deep meaning of a meme [14], and OCR-free math reasoning [15]. We write this survey to provide researchers with a grasp of the basic ideas, main methods, and current progress of MLLMs. Note that we mainly focus on the visual and language modalities, but also include works involving other modalities. Specifically, we divide existing MLLMs into four types with corresponding summarizations and, meanwhile, maintain a GitHub page that is updated in real time. To the best of our knowledge, this is the first survey on MLLM.
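    To make the notion of Multimodal In-Context Learning concrete, the sketch below shows one way an M-ICL prompt could be assembled: few-shot (image, question, answer) demonstrations are concatenated before the query, exactly as in text-only ICL, with a placeholder token standing in for each image. The `<image>` token, the field names, and the template are illustrative assumptions for this survey, not the prompt format of any specific model.

```python
def build_micl_prompt(shots, query_question):
    """Assemble a hypothetical M-ICL prompt: each demonstration pairs an
    image placeholder with a question and its answer; the query repeats
    the pattern but leaves the answer slot open for the MLLM to fill."""
    parts = []
    for shot in shots:
        parts.append(f"<image>\nQ: {shot['question']}\nA: {shot['answer']}")
    # Query follows the same layout, ending at "A:" so the model continues.
    parts.append(f"<image>\nQ: {query_question}\nA:")
    return "\n\n".join(parts)

demos = [
    {"question": "How many cats are in the picture?", "answer": "2"},
    {"question": "What color is the bus?", "answer": "red"},
]
prompt = build_micl_prompt(demos, "What is written on the sign?")
print(prompt)
```

    In a real system, each `<image>` placeholder would be replaced by visual features from a vision encoder before the sequence is fed to the LLM; the text template itself is the only part sketched here.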
