
This tutorial note summarizes the presentation on Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4, part of the CVPR 2023 tutorial on Recent Advances in Vision Foundation Models. The tutorial consists of three parts. We first introduce the background on recent GPT-like large models for vision-and-language modeling, to motivate research on instruction-tuned large multimodal models (LMMs). As a prerequisite, we describe the basics of instruction tuning in large language models, which is then extended to the multimodal space. Lastly, we illustrate how to build a minimum prototype of multimodal GPT-4-like models with open-source resources, and review recently emerged topics.

In view of the rapid and widespread adoption of OpenAI ChatGPT [32] and GPT-4 [33], there has been growing interest among academics and researchers in developing open-source large language models (LLMs) and, at the same time, exploring their extension to large multimodal models (LMMs). To explain this popular topic to a broader audience, we gave a lecture on Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4 in the CVPR 2023 tutorial on Recent Advances in Vision Foundation Models, based on public materials in the literature. This note summarizes the tutorial presentation and makes it more complete. It gives a guided tour of the literature and explains the topics to those who want to learn about LMMs, from the basics to recent advances. It is written for graduate students, researchers, and professionals for whom LMMs are outside their specialty, to help them develop perspective and identify trends in LMMs in an accessible way.

The focus of this note is on large multimodal models, in the context of the overall CVPR 2023 tutorial on Recent Advances in Vision Foundation Models. As shown in Figure 2, the full tutorial covers the most recent approaches and principles at the frontier of learning and applying vision foundation models, including Q1: Visual and Vision-Language Pre-training; Q2: Generic Vision Interface; Q3: Alignments in Text-to-Image Generation; Q4: Large Multimodal Models; and Q5: Multimodal Agents.
This note focuses on Q4: how to leverage LLMs for multimodality and train LMMs in an end-to-end fashion, so that the models can see and chat. The presentation consists of three parts. Section 2 shares background on recent GPT-like large models for vision-and-language modeling. Section 3 introduces, as a prerequisite, the concept of instruction tuning in the language domain, the technique that empowered ChatGPT. Finally, Section 4 covers the last part of the presentation, where we focus on how to build a minimum version of multimodal GPT-4, using LLaVA as a running example. Because LMMs are a popular research topic, many new papers have appeared in this line of research in the past three months; we summarize them so that the audience can quickly get a picture of what the LMM community has been working on. The related links for the tutorial presentation on large multimodal models are available at:

  • Slides:
  • YouTube Video:
  • Bilibili Video:
    For the full information and other parts of the CVPR tutorial, please see the official website at: