AGI / MFM: Multimodal Foundation Models: From Specialists to General-Purpose Assistants (Translation and Commentary): Multimodal Agents Chaining Tools with LLMs, Conclusions, and Research Trends

Related articles

AGI / MFM: Multimodal Foundation Models: From Specialists to General-Purpose Assistants (Translation and Commentary): Introduction (CSDN blog)

AGI / MFM: Multimodal Foundation Models: From Specialists to General-Purpose Assistants (Translation and Commentary): Visual Understanding and Visual Generation (CSDN blog)

AGI / MFM: Multimodal Foundation Models: From Specialists to General-Purpose Assistants (Translation and Commentary): Unified Vision Models and Large Multimodal Models Empowered by LLMs (CSDN blog)

AGI / MFM: Multimodal Foundation Models: From Specialists to General-Purpose Assistants (Translation and Commentary): Multimodal Agents Chaining Tools with LLMs, Conclusions, and Research Trends (CSDN blog)

6. Multimodal Agents: Chaining Tools with LLM

A new modeling paradigm: chaining multiple tools or experts with LLMs to solve complex, open problems; no training is required, only a few demonstration examples to teach the LLM.

Large Language Models (LLMs) (Chowdhery et al., 2022; OpenAI, 2023a) have shown intriguing properties generalizing to user prompts in various domains, and rapidly adapting to new scenarios, using in-context learning with a few examples. Inspired by such strong capabilities, researchers are now exploring a new modeling paradigm that shifts from standalone models for solving finite, pre-defined problems, into synergistically chaining multiple tools or experts with LLMs to solve complicated, open problems. Unlike what has been introduced in Chapter 5, such a system can be built without any training involved, just by using a few demonstration examples to teach the LLM to generate proper calling to existing tools.

In this chapter, we review the fast-evolving literature on chaining different multimodal experts with LLMs to solve complicated multimodal understanding problems, referred to as multimodal agents. We start with an overview on the evolution of this modeling paradigm in Section 6.1, highlighting the differences between traditional approaches and the new modeling paradigm of chaining tools with LLM. Section 6.2 gives a general overview of multimodal agents. Pivoting on an exemplary multimodal agent MM-REACT (Yang* et al., 2023), Section 6.3 comprehensively reviews how to build a multimodal agent, its emerging capabilities in multimodal understanding, and how it can be easily extended to incorporate the latest and strongest LLM and potentially millions of tools. Finally, in Section 6.4, we end the chapter with discussions on advanced topics, such as how to improve/evaluate multimodal agents, the diverse applications powered by multimodal agents.

6.1 Overview

Evolution of the modeling paradigm: task-specific models → large multimodal models → chaining tools with an LLM (no training needed; builds on existing open-source platforms or API tools).

We first revisit the evolution of modeling paradigms, from task-specific models to the most recent large multimodal models, which all require data curation and model training. We then introduce the new modeling paradigm of chaining tools with LLM, which may not require any training, but instead directly takes advantage of a pre-trained LLM and existing tools that are widely available through open-source platforms or APIs.

(1) Evolution of modeling paradigm

Task-specific dedicated models → the pre-trained model phase (the pretrain-then-finetune paradigm, e.g., the BERT family in NLP and UNITER/OSCAR in VL, still finetuned per task) →

As summarized in Figure 6.1, we are witnessing the transition from task-specific models towards general-purpose assistants across language, vision, and multi-modal research.

We started with task-specific models that are trained on small-scale well-annotated data. This results in dedicated models (Anderson et al., 2018; Li et al., 2019a; Yu et al., 2019) for each task or even each dataset.

We then transitioned to the phase of pre-trained models, with the pretrain-then-finetune paradigm widely adopted across both NLP and vision-language (VL) research. During pre-training, the model can take advantage of large-scale, web-crawled noisy data, for example, millions to billions of image-text pairs (Chen et al., 2020d; Wang et al., 2022a), or billions of text tokens (Devlin et al., 2019; Liu et al., 2019). However, it is still mostly task-specific finetuned, requiring similarly small-scale, well-annotated data as the ones used in training task-specific models. This paradigm has led to many well-known models, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) in NLP, and UNITER (Chen et al., 2020d), OSCAR (Li et al., 2020b) in VL. These early VL foundation models were considered to be large-scale (trained with 10M image-text pairs), but may be of intermediate or even small size in today’s view (billions of pairs).

→ generalist large language/multimodal models (e.g., the GPT family, PaLM, the LLaMA family, Flamingo) → instruction-following models built on the generalist models (e.g., Alpaca, and LLaVA with visual instruction tuning)

Nowadays, we are entering a new era of generalist modeling, where the pre-training has been further scaled up to trillions of text tokens (Gao et al., 2023b). For downstream adaptation, these generalist models have shown strong performance with in-context few-shot learning on a few demonstration examples, or even zero-shot evaluation. These models are what we now refer to as large language/multimodal models, including the GPT family (OpenAI, 2022, 2023a), PaLM family (Chowdhery et al., 2022; Driess et al., 2023), LLaMa (Touvron et al., 2023), Flamingo (Alayrac et al., 2022).

Based on the generalist models, the pipeline of building instruction-following models covered in Chapter 5, similarly follows the pretrain-then-finetune paradigm. For example, Alpaca (Taori et al., 2023), is built on top of the pre-trained LLaMa (Touvron et al., 2023), then finetuned on a smaller-scale instruction tuning dataset. Similarly, for instruction-following VL models (e.g. LLaVA (Li et al., 2023e)), an additional stage of image-text alignment pre-training is introduced to align the visual representations to the frozen LLM first, followed by visual instruction tuning.

(2) New modeling paradigm: chaining tools with LLM

Pain points: challenges with basic functionalities (mathematical reasoning, information retrieval) and a general limitation (the model only represents the world of its training data, which becomes outdated and cannot be refreshed in time).

New modeling paradigm: chaining tools with LLM. LLMs (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023a) have demonstrated exceptional abilities to tackle new tasks with only a few examples or textual instructions, showing the promise of serving as general-purpose foundations for many applications. Despite being versatile and impressive, they encounter challenges with the basic functionalities, such as mathematical reasoning and information retrieval. Furthermore, a fundamental limitation of not only LLMs but also other large-scale models nowadays, is that they only represent the world described by their training data, which will inevitably become outdated over time. Regularly re-training the model with the latest information is simply not feasible.

A new modeling paradigm in language modeling (supplementing LLMs with external NLP tools such as calculators, search engines, translation systems, and calendars) → future intelligent agents (tool use to give LLMs perception of multimodal signals).

Meanwhile, many tasks with real-world impact cannot be readily tackled by LLMs alone. For example, accessing up-to-date information and performing computations can be done via existing tools (e.g., search engine or calculator). Hence, recent research in language modeling has explored a new modeling paradigm by supplementing LLMs with external NLP tools (Nakano et al., 2021; Huang et al., 2022b; Ahn et al., 2022), including calculators, search engines, translation systems, calendars, or even API calls on other models.

The above studies mainly focus on a single modality, i.e., language, in which the output of the tools are in text format, thereby can naturally be fed into LLMs as additional knowledge. However, we live in a multimodal world and a truly intelligent agent should be able to perform advanced multimodal reasoning and actions. How to enable LLMs with perception of multimodal signals via tool using, is the focus of the remaining part of this chapter.

6.2 Multimodal Agents

Representative works: VISPROG (the first work to chain different vision tools with an LLM via a programming language), Visual ChatGPT (dialogue-based image editing by combining ChatGPT with various image generation tools), and MM-ReAct (ChatGPT collaborating with multiple advanced vision experts to perform complex cross-modal actions and reasoning).

There are several representative works on building multimodal agents with tool use of vision experts, including VISPROG (Gupta and Kembhavi, 2022b), Visual ChatGPT (Wu et al., 2023a) and MM-ReAct (Yang* et al., 2023). VISPROG is the very first work on using programming language to chain different vision tools with a LLM. Visual ChatGPT enables dialogue-based image editing by complementing ChatGPT (OpenAI, 2022) with various image generation tools. MM-ReAct shows that when collaborating with various advanced vision experts, ChatGPT can perform complex multimodal actions and reasoning. Figure 6.2 presents the fast-evolving literature in multimodal agents from November 18, 2022 to July 26th, 2023. Among them, we include a few more exemplary multimodal agents in Table 6.1, along with two representative works in the NLP domain.

A typical multimodal agent framework: the user interacts directly with a tool allocator whose brain is an LLM; the LLM plans the steps needed, using a single tool or several tools together, to fulfill the user's request; after execution, the results are gathered and fed back into the LLM to produce the response.

An overview of a typical multimodal agent framework is illustrated in Figure 6.3. The user directly interacts with the Tool Allocator, which functions as the brain of the agent. In current literature, the tool allocator is usually a LLM. To achieve the user’s goal, the LLM will outline all the steps necessary with either a single tool or collaborating multiple tools together. Subsequently, it will retrieve from all the candidate tools for the needed tools, and execute possibly multiple rounds of tools to fulfill the human requirement. Finally, the execution results from the tools are gathered as inputs of the LLM to generate a response to the user. Next, we cover the three key components of multimodal agents.
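
To make the loop in Figure 6.3 concrete, below is a minimal sketch of such an agent loop in Python. The `call_llm` stub, the plan format, and the toy tool registry are assumptions for illustration only, not the interface of any particular system.

```python
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    """Stub for the tool allocator (an LLM); replace with a real API call."""
    raise NotImplementedError

# Candidate tools: name -> callable returning a text observation.
TOOLS: Dict[str, Callable[[str], str]] = {
    "image_captioning": lambda path: f"(caption of {path})",
    "ocr": lambda path: f"(scene text found in {path})",
}

def run_agent(user_request: str, max_rounds: int = 5) -> str:
    """Plan with the LLM, execute tools, feed observations back, then respond."""
    history = f"User: {user_request}\n"
    for _ in range(max_rounds):
        # Planning: the LLM decides the next step given the history and tool list.
        plan = call_llm(
            "Available tools: " + ", ".join(TOOLS) + "\n" + history
            + "Next step (either 'TOOL <name> <input>' or 'FINAL <answer>'):"
        )
        if plan.startswith("FINAL"):
            return plan[len("FINAL"):].strip()      # response generation
        # Execution: invoke the requested tool and record the observation.
        _, name, tool_input = plan.split(maxsplit=2)
        observation = TOOLS[name](tool_input)
        history += f"Action: {plan}\nObservation: {observation}\n"
    return "Sorry, I could not complete the request."
```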

Three key components of a multimodal agent

Tools: supply the multimodal information the LLM lacks, e.g., open-source models, APIs, and code interpreters.

Tools. Tools are external modules that are callable by the LLM to obtain extra information that is missing from the model weights, including open-source models, public/private APIs, or code interpreters. As LLMs only accept language inputs, one must include tools that can process multimodal inputs to build a multimodal agent.
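
Since the LLM only ever sees text, each tool is usually wrapped behind a small, text-facing specification. A minimal sketch of such a wrapper follows; the field names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str                   # e.g. "image_captioning"
    description: str            # what the tool can do, written for the LLM
    input_format: str           # e.g. "an image file path"
    run: Callable[[str], str]   # executes the tool and returns a text observation

def describe_for_prompt(tool: Tool) -> str:
    """Render a tool as the text snippet the LLM sees during planning."""
    return f"{tool.name}: {tool.description} Input: {tool.input_format}."
```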

Planning: decompose the user's request into executable steps that call tools.

Planning. During planning, the LLM decomposes the user requests into smaller, manageable sub-problems, and outlines a step-by-step solution, each of which involves calling an external tool. There are two ways to teach LLMs for planning. One is to prompt the LLM with in-context few-shot examples of all candidate tools. This approach can extend the general model directly but is limited by the context length. The other approach relies on large amounts of annotated data to fine-tune the LLM, which most likely will damage the robustness and generalizability of the model.
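
For the first, in-context route, the planning prompt is typically assembled from the candidate tool descriptions plus a handful of demonstration dialogues, as in the sketch below; the wording is illustrative, not taken from any specific agent. Because everything must fit into the prompt, this route is bounded by the context length mentioned above.

```python
def build_planning_prompt(tools, demos, user_request):
    """tools: list of (name, description) pairs; demos: example dialogues as strings."""
    lines = ["You are an assistant that can call external tools to answer requests."]
    lines += [f"- {name}: {desc}" for name, desc in tools]   # candidate tools
    lines += ["Examples:"] + list(demos)                     # in-context few-shot examples
    lines += [f"User: {user_request}", "Plan the steps and the tools to call:"]
    return "\n".join(lines)
```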

Execution: the LLM translates the plan into tool calls; the execution results feed the response to the user.

Execution. The generated plan is further translated into executable calls to the required tools, which can be done via regular expression matching (Yang* et al., 2023); directly prompting LLMs to generate executable programs (Surís et al., 2023); or leveraging in-context few-shot learning capability of LLMs by providing natural language instructions that describe the roles of each module together with a few calling examples (Lu et al., 2023b). The execution results are fed back to the LLM to generate a response to the user.
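
As a concrete illustration of the regular-expression route, the snippet below parses an action request of the form `Assistant, <tool_name> "<file path>"` out of the LLM output; the exact watchword and format are assumptions modeled on MM-REACT's description rather than its verbatim implementation.

```python
import re

# Matches e.g.:  Assistant, image_captioning "images/two_players.png"
ACTION_RE = re.compile(r'Assistant,\s*(?P<tool>\w+)\s*"(?P<arg>[^"]+)"')

def parse_action(llm_output: str):
    """Return (tool_name, argument) if a tool was requested, else None (final answer)."""
    match = ACTION_RE.search(llm_output)
    if match is None:
        return None
    return match.group("tool"), match.group("arg")
```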

6.3 Case Study: MM-REACT

We use MM-REACT (Yang* et al., 2023) as a case study to show how to build a multimodal agent, its emerging capabilities in multimodal understanding, and how it can be easily extended to incorporate the latest and strongest LLM and potentially millions of tools.

6.3.1 System Design: MM-REACT's agent paradigm uses ChatGPT as the brain combined with multimodal vision experts, supporting multimodal inputs and outputs such as images and video for multimodal reasoning and action

MM-ReAct designs the system paradigm that composes numerous multimodal tools with ChatGPT (OpenAI, 2022) for multimodal reasoning and action. By augmenting the language-only ChatGPT with various multimodal tools, MM-REACT supports both inputs and outputs in multimodalities, including text, image and video, as shown in Figure 6.4.

Figure 6.5 shows the system design of MM-REACT. The multimodal tools explored in MM-REACT are mainly computer vision models that take an image as input and interpret the image content from different perspectives. For instance, the image captioning model generates a natural description, the OCR model extracts the scene text in the image, the celebrity recognition model identifies the celebrity names, and the object detection model extracts the salient objects with bounding box locations. LLMs such as ChatGPT serve as the brain of the agent, which plans on which tools to use, and in what order, based on the input image and the user intent. Next, with the example in Figure 6.5, we unfold the planning and execution of MM-REACT behind the scene.

User prompt: MM-REACT uses the image file path as the input to ChatGPT, so that during planning it can call vision tools to understand the image content and answer the user's question.

User prompt. As ChatGPT only accepts language inputs, to enable image as inputs, we simply use the file path as the input to ChatGPT. The file path functions as a placeholder, allowing ChatGPT to treat it as a black box and later seek help from different tools during the planning stage. Besides the input image, the user can also provide the intent in text format (e.g., a question about the input image). When there is no text input from the user, the goal is to get a general understanding about the image.
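
In code this amounts to little more than splicing the path string into the chat history as if it were part of the user's utterance; a minimal sketch, with the message format being an assumption:

```python
def make_user_turn(image_path: str, question: str = "") -> str:
    """Insert the image as a file-path placeholder inside the user's text turn."""
    # With no text input, the default goal is a general understanding of the image.
    intent = question or "Describe this image."
    return f"User: {image_path}\n{intent}"

# Example: make_user_turn("images/two_players.png", "Who are the people in this image?")
```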

Planning: MM-REACT uses prompt watchwords and regular expressions to decide whether an external tool is needed, and provides tool descriptions and usage examples to guide ChatGPT in calling the vision experts appropriately.

Planning. Upon receiving the input image and user prompt, ChatGPT plans for what tools to use. Inspired by REACT (Yao et al., 2022c), MM-REACT instructs ChatGPT to respond with certain watchwords, such as “Assistant, what objects are there in the image? <file path>”, if a specific tool is required (i.e., action request in Figure 6.5). In practice, one can tell whether a multimodal tool is needed by simply string-matching the keyword “Assistant,” in ChatGPT’s response.

MM-ReAct encourages ChatGPT to show the thought (reasoning) process to highlight why an external tool is needed, which has been proven beneficial in NLP studies (Yao et al., 2022c). In addition, for generating proper calling to each tool, both instructions and in-context examples are added as the prefix when prompting ChatGPT. Each tool is described with the model name, a general description of its capability, the input data format, and the output information. After describing each tool, a few in-context dialogue examples are also included for enhanced performance.

Execution: MM-REACT parses ChatGPT's action requests via regular-expression matching, invokes the corresponding tools for the various vision tasks, then gathers the results and continues the dialogue with ChatGPT to answer the user's question.

Execution. Given the action request from ChatGPT, the tool name and the file path can be parsed via regular expression matching, which are used to invoke the tool (action execution).

Take the example shown in Figure 6.5, upon receiving the input image, ChatGPT first invokes a series of tools for a general understanding about the image. The invoked tools include image captioning for an overall description of the image; dense captioning to get the region-level, more detailed description about the objects in the image; object tagging to get the tags of the objects in the image; face detection to get the box coordinates of the two faces mentioned in the object tags. The outputs from the tools (i.e. observations) are serialized as text, and fed back to ChatGPT.

Combining observations with the chat history, ChatGPT can further invoke additional experts or return the final answer to the user. In this specific example, ChatGPT invokes a second round of thought-action-observation over the two faces detected in the image and calls celebrity recognition to get the names of these two persons.

Response generation: MM-REACT decides whether external tools are still needed; it either analyzes and summarizes all gathered observations for the user, or uses the person names and bounding boxes to call Bing Search to answer follow-up questions whose details are not yet known.

When ChatGPT decides no external tools are needed, it takes into consideration all observations gathered and summarizes them as the response to the user, which is “This image contains two celebrities, Kobe Bryant and Paul Pierce. They are both basketball players.” for the example shown in Figure 6.5.

If the user continues to interact with MM-REACT, it repeats the process described above, but with all observations and chat history available when planning for the tools needed. For instance, if the user then asks “how many championship rings did the player on the left win in his career”, it is not available in the existing observations nor chat history, but ChatGPT has the bounding boxes to decide who is on the left, and also the names of the players. It plans to invoke Bing Search to find the right answer, which should be 5.

6.3.2 Capabilities: MM-REACT demonstrates a range of representative capabilities and application scenarios

Figure 6.6 shows the representative capabilities and application scenarios that MM-REACT demonstrates, including visual math and text reasoning, understanding visual-conditioned jokes/memes, spatial/coordinate understanding, visual planning and prediction, multi-image reasoning, multi-hop document understanding, open-world concept understanding, video analysis and summarization.

In addition, we show an example of the full response from MM-REACT on multi-image reasoning in Figure 6.7, which may not be easily achievable by visual instruction tuning in Chapter 5. For more comprehensive examples of all emerging capabilities of MM-REACT, we refer the reader to the original paper.

6.3.3 Extensibility (an advantage of building multimodal agents by chaining tools): two strategies for extending the system

One favorable property of tool chaining to build multimodal agents is that the system can be easily extended and enhanced, from two perspectives. One is to upgrade the core part of the system, the LLM, and the other is to expand the number of external tools.

(1) Upgrading the LLM: MM-REACT's system design allows the LLM to be upgraded to newer and more powerful models without retraining, e.g., upgrading from ChatGPT to language-only GPT-4.

The system design of MM-REACT allows for upgrading the core part of the system, the LLM, to newer and more powerful models as they come out, without the need of re-training. We show an example in Figure 6.8 on upgrading ChatGPT to language-only GPT-4, which improves MM-REACT to potentially match the performance of multimodal GPT-4.

(2) Plug-and-play (adding more tools): existing multimodal agents integrate tools via a plug-and-play mechanism (e.g., HuggingGPT, Chameleon, RestGPT), allowing more tools to be added without training → scaling to thousands or millions of tools remains challenging (the potential shown by TaskMatrix.AI) → SAM can serve as a tool enabling richer forms of human interaction with multimodal agents.

Plug-and-play (adding more tools). Existing multimodal agents incorporate tools via a plug-and-play mechanism, allowing adding more tools without training. One prominent work along this direction is HuggingGPT (Shen et al., 2023b), which proposes to leverage all open-source models hosted on huggingface. Chameleon (Lu et al., 2023b) incorporates not only huggingface models, but also open-source models from GitHub, Bing search API, and python compiler. RestGPT (Song et al., 2023) proposes a multi-level online planning framework that effectively handles the practical challenges associated with integrating LLMs with more than 100 RESTful APIs. However, it remains challenging to scale this framework to thousands to millions of tools, which is the potential future demonstrated in TaskMatrix.AI (Liang et al., 2023b).
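
The plug-and-play property follows from the fact that tools live in a registry the planning LLM only ever sees as text: adding a tool means adding an entry and its description, with no gradient updates. A minimal sketch, with names that are illustrative rather than the API of HuggingGPT or its peers:

```python
# Tool registry: the planning LLM only ever sees these entries as text descriptions.
REGISTRY = {}

def register(name: str, description: str, run) -> None:
    """Plug a new tool in at runtime; no retraining of the LLM is needed."""
    REGISTRY[name] = {"description": description, "run": run}

# e.g. wrapping a hypothetical open-source speech model as a new tool
register(
    "speech_to_text",
    "Transcribes an audio file into English text. Input: an audio file path.",
    lambda path: f"(transcript of {path})",   # stand-in for a real model call
)
```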

Moreover, one can leverage SAM (Kirillov et al., 2023) as a tool to allow for more types of human interaction with the multimodal agent other than text. Recall in MM-REACT, the user intent is all captured by the natural language query from the user. In InternGPT (Liu et al., 2023l), by connecting the tool SAM with GPT, it allows for more ways to interact with the system, for example, via clicks, scribbles, and drawing bounding boxes. These additional interactions, to some extent, are mimicking the action of finger-pointing when we humans are having a conversation.

6.4 Advanced Topics

In this section, we discuss more advanced topics and shed light on potential future directions.

6.4.1 Comparison to Training with LLM in Chapter 5

Two approaches to building advanced LLM-based multimodal systems → the possibility of an intermediate ground that combines the strengths of both paradigms → open questions: can LLaVA replace the LLM as the tool allocator, and which capabilities still require a tool to be enabled?
T1. End-to-end models via instruction tuning: directly interpret the rich semantics in multimodal inputs, but require data curation and training, hence higher cost.
T2. Chaining the LLM with off-the-shelf tools, teaching it to plan via in-context few-shot examples: no training needed, but tool selection can go wrong and weak domain experts lead to poor performance.

We have covered two directions on building advanced multimodal systems based on LLMs. As the key distinction, the multimodal agents in this chapter leverages LLMs’ high-level planning abilities to allocate various multimodal tools, while training multimodal models with LLMs in Chapter 5 solely leverages LLMs for text generation conditioned on multimodal inputs.

Nonetheless, both of these methods exhibit their respective advantages and disadvantages. On one hand, instruction tuning enables an end-to-end model that directly interprets rich semantics in multimodal inputs, but requires data curation and training, hence more computationally expensive. However, limited instruction tuning data may limit its capabilities in certain scenarios, such as OCR. On the other hand, one can build a multimodal agent without any training by chaining LLMs with abundant off-the-shelf models/APIs/code interpreters as tools, and leveraging in-context few-shot examples to teach LLMs on planning. However, as there is no training, the system may fail to invoke the right tool. Moreover, weak domain experts may produce noisy outputs that can confuse LLM on planning or reasoning, leading to weak performance.

Though these two approaches exhibit distinct variations, we envision the possibility of an intermediate domain that amalgamates the strengths of both paradigms, and raise the following questions. Now that we have open-source LMMs such as LLaVA (Liu et al., 2023c), can we replace the LLM with LLaVA as the tool allocator? If so, what capabilities would require a tool to be enabled? And what problems can be solved by instruction tuning? These are interesting directions that may be worth exploring in the near future.

6.4.2 Improving Multimodal Agents

Pain point: current agents mainly rely on in-context few-shot examples to teach the LLM to plan, which can be unreliable and lead to inaccurate tool use.

Existing multimodal agents mainly rely on in-context few-shot examples to teach LLM on planning, which can be unreliable, leading to inaccurate tool using. To improve the accuracy in planning, several works have been proposed and we group them into three categories below.

現(xiàn)有的多模態(tài)智能體要依賴于上下文中的少量示例來(lái)教導(dǎo)LLM進(jìn)行規(guī)劃,這可能不可靠,導(dǎo)致工具的使用不準(zhǔn)確。為了提高規(guī)劃的準(zhǔn)確性,已經(jīng)提出了一些方法,我們將它們分為以下三類。

Composing tools via code generation (the code is still generated by an LLM, so accuracy issues remain): explore programming languages instead of natural language for more accurate tool-use planning, generating Python code from natural-language instructions with GPT-3/Codex, e.g., Visual Programming and ViperGPT.

Most existing multimodal agents use natural language to prompt the LLM for planning which tool to use. Researchers (Gupta and Kembhavi, 2023; Surís et al., 2023) have also been exploring using programming language for more accurate execution. Visual programming (Gupta and Kembhavi, 2023) is a prominent work along this direction, which uses the in-context learning ability of GPT-3 (Brown et al., 2020) to generate python-like modular programs from natural language instructions for compositional visual tasks. ViperGPT (Surís et al., 2023) instructs GPT-3 Codex (Chen et al., 2021a) to generate Python code to compose multimodal tools for one-round query answering. However, as the code is still generated by an LLM, the problem of inaccurate tool using still remains.
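
To give a flavor of what such generated programs look like, here is a hypothetical example in the spirit of VISPROG/ViperGPT for the query "How many people in the image are wearing hats?". The module names and the toy image representation are invented for illustration and do not correspond to either system's actual API.

```python
# Hypothetical stubs standing in for the vision experts a real framework would provide.
def detect(image, category):
    """Object detection tool: return the detected objects of a given category."""
    return [obj for obj in image if obj["category"] == category]

def has_attribute(obj, attribute):
    """Attribute/VQA tool (stubbed): check a property of one detected object."""
    return attribute in obj.get("attributes", [])

# The kind of modular program an LLM might generate for the query.
def execute_query(image):
    people = detect(image, "person")
    with_hats = [p for p in people if has_attribute(p, "wearing a hat")]
    return len(with_hats)

# Toy run on a fake "image" represented as a list of detected objects:
print(execute_query([
    {"category": "person", "attributes": ["wearing a hat"]},
    {"category": "person", "attributes": []},
]))  # -> 1
```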

Improving accuracy in tool using via self-assessment: AssistGPT tries to improve tool-use accuracy through self-assessment.

A recent work AssistGPT (Gao et al., 2023a) tries to improve the accuracy in tool using via self-assessment. It adds a stage of inspection and learning loop into the system. When the round of plan and execution is finished, the system inspects the outcome and determines whether the reasoning path of calling the tool is a success or not; if so, it saves it as an in-context example, to teach the LLM more accurate tool calling in future rounds.
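
A minimal sketch of that inspect-and-learn idea follows; the judge function and the bookkeeping are assumptions, not AssistGPT's actual implementation. After each round, a reasoning trace judged successful is kept as an extra in-context example for future planning.

```python
in_context_examples = []   # demonstrations prepended to future planning prompts

def inspect_and_learn(trace: str, outcome: str, judge) -> None:
    """judge: a callable (e.g. an LLM call) deciding whether the outcome is a success."""
    if judge(trace, outcome):
        # Save the successful tool-calling path as a new few-shot demonstration.
        in_context_examples.append(trace + "\nResult: " + outcome)
```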

Improving accuracy in tool using via instruction tuning: generate a dataset of instruction-API pairs via self-instruct and finetune a smaller LLM on it to improve tool-use accuracy.

Improving accuracy in tool using: instruction tuning. Another thread on improving accuracy in tool using is to combine the system with instruction tuning (Patil et al., 2023; Yang et al., 2023c). One can generate a dataset of instruction-API pairs via self-instruct to tune a smaller LLM (e.g., Vicuna-7B (Vicuna, 2023)).
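
The tuning data in such works is essentially a collection of (instruction, API call) pairs produced by self-instruct; a hypothetical pair might look like the following, where both the field names and the API are invented for illustration.

```python
example_pair = {
    "instruction": "Find all the faces in photo.jpg and tell me who they are.",
    "api_calls": [
        'face_detection(image="photo.jpg")',
        "celebrity_recognition(boxes)",
    ],
}
# Many such pairs are used to finetune a smaller LLM (e.g. Vicuna-7B) so that it
# emits well-formed tool calls without relying on long in-context prompts.
```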

LMM as the tool allocator? Replacing the LLM with an LMM as the system's multimodal tool allocator removes the need to unify tool outputs into text sequences and supports more natural interaction with multimodal tools, e.g., multimodal GPT-4.

LMM as the tool allocator?

In addition, as LMMs evolve, we envision that the LLM can be replaced by a LMM as the tool allocator in the system, to enable even more advanced application scenarios. If the tool allocator can take multimodal inputs, there is no need to unify the outputs of tools into text sequence, allowing more natural interactions between the tool allocator and multi-modal tools, particularly those producing multimodal outputs. For instance, one can imagine using multimodal GPT-4 (OpenAI, 2023a) to coordinate various image or video generation tools to make a short movie by providing it with a sketch of the storyline and visual examples of the main characters.

6.4.3 Diverse Applications of Multimodal Agents

Composing tools from a specific domain lets this system paradigm support diverse domain-specific applications, such as image synthesis, robotic execution, audio generation, 3D scene generation, medical image understanding, and vision-language navigation.

By composing tools from a specific domain, this new system paradigm can also support diverse domain-specific applications.

Yu et al. (2023b) composes LLMs with image synthesis tools and object-level/pixel-level image understanding tools to build a data synthesis pipeline to provide diverse annotations on synthesized image. Instruct2Act (Huang et al., 2023c) complements the LLM with robotic executors, to enable robotic actions based on multi-modal instructions. When chaining a pool of audio models with LLM, AudioGPT (Huang et al., 2023a) can understand and generate speech, music, sound and talking head. Similarly, WavJourney (Liu et al., 2023i) further supports compositional audio creation with storylines encompassing speech, music, and sound effects. With tracking, captioning, audio understanding models, ChatVideo (Wang et al., 2023c) enables ChatGPT to understand multi-channel videos. Other application scenarios include 3D scene generation (Lin et al., 2023; Feng et al., 2023), medical image understanding (Liu and Zuo, 2023; Sun et al., 2023c) and vision-language navigation (Zhou et al., 2023b).

6.4.4 Evaluation of Multimodal Agents

Multimodal tool use enables broad capabilities, but tool-use accuracy has not yet been studied quantitatively; API-Bank is a starting point for systematically evaluating tool-augmented LLMs.

Multimodal tool using. Although we have seen qualitative examples of new scenarios enabled by multimodal agents, it remains unclear how these agents perform in terms of the accuracy in tool using. API-Bank (Li et al., 2023k) is a starting point on building pipeline in systematically evaluating tool-augmented LLMs.

Emergent capabilities: existing vision-language benchmarks do not examine the emergent abilities of large multimodal models and agents; researchers have begun designing comprehensive evaluation samples to facilitate LMM evaluation, e.g., the 6 core VL capabilities defined by MM-Vet.

Emergent capabilities. Existing VL benchmarks focus on specific capabilities of interest, such as visual recognition (Antol et al., 2015), image description (Chen et al., 2015; Agrawal et al., 2019), as well as other benchmarks for specialized capabilities such as scene text understanding (Sidorov et al., 2020; Gurari et al., 2018), commonsense reasoning (Zellers et al., 2019), outside knowledge (Schwenk et al., 2022). The intriguing abilities shown in large multimodal models and multimodal agents are not examined by existing benchmarks, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, or explaining visual jokes. Furthermore, the long, chatty outputs from these systems pose challenges to today’s evaluation metrics. Researchers (Fu et al., 2023; Liu et al., 2023j) have started to design comprehensive evaluation samples to facilitate the LMM evaluation. As an attempt to test multimodal systems on integrated capabilities, MM-Vet (Yu et al., 2023d) defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination (Figure 6.10). In addition, to accommodate for the open-ended free-form text outputs, MM-Vet proposes an LLM-based evaluator to enable evaluation across different question types and answer styles.
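
At its core, an LLM-based evaluator of this kind reduces to prompting a strong LLM to grade a free-form answer against a reference; the sketch below is a generic illustration of that idea, with the prompt wording and scoring scale assumed rather than copied from MM-Vet.

```python
def llm_score(question: str, reference: str, prediction: str, call_llm) -> float:
    """Ask an LLM to grade an open-ended answer against a reference on a 0.0-1.0 scale."""
    prompt = (
        "Grade the answer between 0.0 (completely wrong) and 1.0 (fully correct).\n"
        f"Question: {question}\nReference answer: {reference}\n"
        f"Model answer: {prediction}\nScore:"
    )
    return float(call_llm(prompt).strip())
```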

6.4.5 Tool Creation

In NLP: explore creating tools on the fly, by writing code or instructions, to meet user needs, e.g., CREATOR.
For multimodal agents: find ways to create tools that can handle multimodal inputs, e.g., ViperGPT and AutoML GPT.

Imagine if we have a completely new scenario without a robust tool to use. Can we create a tool based on the user need on-the-fly? In NLP, CREATOR (Qian et al., 2023) proposes to create tools by writing python code for math reasoning, as opposed to calling math solver API such as Wolfram Alpha. Cai et al. (2023) further explores the capabilities of LLMs to make tools, and experiment with two LLMs, one as the tool maker and the other as the tool user to collaboratively solve complicated tasks, such as scheduling a meeting. In terms of multimodal agents, the challenge is how to create a tool that can process multimodal inputs. One may follow ViperGPT (Surís et al., 2023) to instruct LLMs to generate python programs leveraging pre-existent python packages such as OpenCV. AutoML GPT (Zhang et al., 2023j) envisions that one can utilize LLMs to automate the model training pipeline. There may be potential to develop novel multimodal deep learning tools tailored to more effectively address the requirements of users.

6.4.6 Retrieval-Augmented Multimodal Agents

Background: much real-world information resides in databases, and users may need to retrieve it accurately.
In NLP: augment LLMs with external data encoded in structured language and relation representations; a retriever fetches relevant documents and a generator produces predictions from them, since not all world knowledge can be encoded into pre-trained model weights.

In real-life applications, a substantial portion of information resides within databases, and user needs may require accurate retrieval of such information. Meanwhile, it is infeasible to encode all the world knowledge into the weights of pre-trained models, particularly when it comes to the long-tail concepts and fast-evolving data.

In NLP, several works augment LLMs with external data encoded with structured language and relation representations (Peters et al., 2019; Guu et al., 2020; Lewis et al., 2020). Given input texts, such retrieval-augmented models utilize a retriever that retrieves relevant documents from an external memory, and use a generator to generate predictions given the retrieved documents.

For multimodal agents: inspired by retrieval-augmented models in NLP, leverage visual and/or textual knowledge to improve vision tasks, retrieving and applying external knowledge to supply the information the core model lacks, e.g., RAC, K-LITE, REACT, RA-CM3.

Motivated by retrieval-augmented models in NLP, several recent works leverage visual and/or textual knowledge to improve vision tasks, such as image classification (Long et al., 2022), captioning (Yang et al., 2023a), question answering (Wu et al., 2021; Marino et al., 2021; Yang et al., 2022d; Chen et al., 2022e), image generation (Blattmann et al., 2022; Sheynin et al., 2022; Chen et al., 2022f; Zhou et al., 2022c), and multi-modal tasks simultaneously (Yasunaga et al., 2022). RAC (Long et al., 2022) improves long-tail classification by retrieving from a non-parametric memory consisting of pre-encoded images and text. K-LITE (Shen et al., 2022a) enhances the text prompts with the retrieved external knowledge that is encoded in natural language. REACT (Liu et al., 2023d) retrieves from billions of the paired knowledge of image-text and aims to improve task transfer performance for core vision problems. Among them, RA-CM3 (Yasunaga et al., 2022) builds the first retrieval-augmented LMM with a multimodal retriever to retrieve multimodal documents, and a retrieval-augmented generator that can generate both text and image. Chaining tools with LLM shares a strong connection with the retrieval-augmented methods in that both leverage external knowledge to provide additional information for the core model to utilize. In the multimodal regime, the image itself can be used as the query to gain external knowledge, either retrieved from a knowledge base, or extracted from other pre-trained vision expert models.
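
A minimal sketch of the retrieve-then-generate pattern with the image itself as the query is shown below, assuming a generic joint image-text embedding model (CLIP-style) and an in-memory knowledge base; all names are placeholders rather than the interface of any cited system.

```python
import numpy as np

def retrieve(image_embedding: np.ndarray, kb_embeddings: np.ndarray,
             kb_texts: list, k: int = 3) -> list:
    """Return the k knowledge snippets closest to the image in the joint embedding space."""
    # Dot product equals cosine similarity if all rows are L2-normalized.
    sims = kb_embeddings @ image_embedding
    top = np.argsort(-sims)[:k]
    return [kb_texts[i] for i in top]

def answer_with_retrieval(question, image_embedding, kb_embeddings, kb_texts, call_llm):
    """Condition the generator (an LLM) on knowledge retrieved with the image as the query."""
    knowledge = retrieve(image_embedding, kb_embeddings, kb_texts)
    prompt = ("External knowledge:\n" + "\n".join(knowledge)
              + f"\nQuestion: {question}\nAnswer:")
    return call_llm(prompt)
```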

7. Conclusions and Research Trends

Multimodal foundation models are evolving rapidly; the shared overarching goal is to create general-purpose models that can follow human intent and perform a wide variety of vision tasks in the wild.

Multimodal foundation models have garnered significant interest among scholars in the fields of computer vision and multimodal vision-language research. Although prevailing research topics, approaches and methodologies have been evolving – encompassing image self-supervised learning, language-image contrastive learning, text-to-image generation, unified vision modeling, and large language-and-vision assistants – they converge on a common overarching objective: the creation of general-purpose models and systems capable of following human intents and effortlessly executing a diverse array of vision and vision-language tasks in the wild. In this chapter, we provide a concise summary of what has been reviewed, and delve into the prevailing research tendencies in the field.

7.1 Summary and Conclusions: two classes of recent advances in multimodal foundation model research

Specific-purpose multimodal foundation models: pre-training on problem-related data followed by zero-/few-shot transfer.

This paper surveys the most recent advances at the frontier of multimodal foundation model research, categorized into two classes discussed below.

>>Specific-purpose multimodal foundation models. There is a diverse set of problems to tackle in the computer vision community. To lay a comprehensive foundation for the introduction of general-purpose visual assistants, we have discussed many seminal papers in the era of pre-training. The major paradigm during this period is pre-training on a large amount of problem-related data, and then transferring to a number of real-world scenarios of the same problem type in a zero- or few-shot fashion. More specifically, we have presented two general topics: (i) Visual Understanding in Chapter 2: individual multimodal foundation models have been developed to analyze the content of visual data at the image, region, and pixel levels, respectively. The language-augmented vision models are a popular family, contributing to the recent success of visual understanding tasks in the wild. (ii) Visual Generation in Chapter 3: text-to-image generation models have served as the foundation for image synthesis, which has been successfully extended to allow user controllability and customization in more fine-grained manners. The availability and creation of large amounts of problem-related data has played a key role in making these multimodal foundation models possible.

General-purpose assistants: research on assistant models with a unified network architecture and interface, providing for vision tasks a counterpart to the general-purpose assistants in NLP.

>>General-purpose assistants. We have reviewed recently emerged literature on building general-purpose assistants, which often possess a unified network architecture, a unified input-output data format, and a general interface that facilitates easy interaction with humans. Inspired by the success in NLP that LLM such as ChatGPT/GPT-4 is a general assistant for a wide range of language tasks, researchers in computer vision have explored various solutions to their counterpart for vision tasks. Based on how LLM is leveraged in the methodology, existing works can be categorized into three topics: (i) Unified Vision Models in Chapter 4: The spirit of unifying modeling in LLM is borrowed to build unified vision models at different levels and across different tasks. (ii) Training with LLM in Chapter 5: Starting with a pre-trained LLM, visual data is connected to LLM for end-to-end training. (iii) Chaining with LLM in Chapter 6: By freezing LLM, existing vision experts can be triggered by prompt engineering LLM to complete specific vision tasks.

The comparisons among these models are summarized in Table 7.1.

7.2 Towards Building General-Purpose AI Agents

From specialist multimodal foundation models to general-purpose visual assistants: powerful visual assistants such as Flamingo and multimodal GPT-4 already exist, but they are preliminary compared with future multimodal AI agents.

At the end of each chapter, we have discussed future trends for individual topics. The paper itself is organized to demonstrate the transition from specialist multimodal foundation models to general-purpose visual assistants. Though powerful, existing visual assistants such as Flamingo (Alayrac et al., 2022) and multimodal GPT-4 (OpenAI, 2023b) are in a preliminary form, compared with the grand vision of building a general-purpose multimodal AI agent via foundation models. In what follows, we highlight a number of research trends towards this goal.

Generalist agents with multi-modality: the research trend is to build a single generalist multimodal agent that interacts with the world by fusing multiple channels and that perceives and synthesizes visual signals (e.g., Gato, PaLM-E); visual perception is both crucial and challenging.

Generalist agents with multi-modality. This aligns with the grand goal of building a single generalist agent that interacts with world like humans through fusing multiple channels such as language, vision, speech and actions. From this perspective, the notion of multimodal foundation models becomes somewhat indistinct on its own. Instead, it serves as a crucial component of the agent for perceiving and synthesizing visual signals. For example, Gato (Reed et al., 2022) and PaLM-E (Driess et al., 2023) perform a wide range of language, multimodal and control tasks with a single set of model weights, where visual perception is a crucial component in understanding the environment. It also raises challenges in the effective and scalable pre-training objectives for unified vision and multimodal modeling.

Alignment with human intents: visual prompts can express some intents better than language; multimodal human-machine interaction based on visual prompts is key to unlocking new scenarios.

Alignment with human intents. AI alignment research focuses on steering AI systems towards humans’ intended goals, values, or ethical guidelines. An AI system is deemed aligned when it effectively promotes the desired goals. Though language has exhibited its generality in expressing human intents, it is not always the best option. As demonstrated in SAM (Kirillov et al., 2023) and ControlNet/GLIGEN (Zhang and Agrawala, 2023; Li et al., 2023n), human intents can be more precisely and conveniently represented in visual prompts such as key points, bounding boxes and sketch drawing, for visual understanding and generation tasks, respectively. Building foundation models that are well equipped with such a multimodal human-machine interaction interface is a key step to unlock new use scenarios, where human intents are best represented visually. For example, the spatial arrangement of elements within a scene, as well as the artistic style and visual appeal of a piece of visual art.

An AI agent system framework has four parts: an LLM-powered agent brain plus three key components, namely planning (where the visual modality calls for improved visual understanding), memory (short-term memory via in-context learning and interleaved multimodal prompts; long-term memory via fast retrieval from a multimodal vector space), and tool use (the agent learns to call external APIs for knowledge missing from the foundation model weights).

Planning, memory, and tool use. It is highlighted in Weng (2023) that a LLM-powered autonomous agent system can be built, where LLM functions as the agent’s brain, complemented by several key components: planning, memory and tool use. Following the framework, we could foresee the role of multimodal foundation models in this AI agent system. (i) Planning. To complete complex tasks in real-world scenarios, the agent should be able to decompose large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks. In the ideal case, the AI agent possesses the self-improvement ability, engaging in self-assessment and introspection regarding previous actions, enabling it to learn from errors and enhance its approach for subsequent endeavors, ultimately leading to better outcomes. Visual modality is a common channel to represent state of the environment. To facilitate planning, it raises challenges in improving the capability of the current visual understanding models in perceiving more fine-grained visual details and longer sequence videos. (ii) Memory. For short-term memory, in-context learning (or prompt engineering) is utilized as short-term memory for the model to learn. Interleaved multimodal prompts can enable new scenarios to clarify the human intents. For long-term memory, it provides the agent with the capability to recall external knowledge over extended sessions, which can be implemented by fast retrieving from a multi-modal vector space (Liu et al., 2023d). In terms of modeling, foundation models are required to learn the new skills to effectively leverage both types of memory. (iii) Tool use. The agent learns to utilize external APIs for knowledge that is missing from the foundation model weights. New capabilities are required to deal with the vision modality in several scenarios. For example, based on both the input visual signal and instructions, the model decides and plans whether certain external APIs are needed to complete the goal, such as code execution of detection/segmentation/OCR/generator experts.

The field of multimodal foundation models is evolving at a rapid speed, with new directions/methods emerging frequently. There are many important research topics that are not discussed in this paper, mostly due to the daily-updated research innovation. We are optimistic about the future of multimodal foundation models, not only because we are convinced that foreseeable exciting research innovations/ideas in individual areas are becoming reality by following the path of LLM in the near future, but also because connecting computer vision with the broader AI community, and building general-purpose AI agents is going to significantly advance the daily life of human being.

Acknowledgments

This book is largely based on our CVPR 2023 tutorial on vision foundation models. Many people have supported us and provided valuable feedback to the writing of this book. We thank all the authors who have contributed to the related papers, which makes the tutorial and book possible. We are also grateful to Mark de Jongh, the editor from the journal of Foundations and Trends? in Computer Graphics and Vision, for inspiring and encouraging us to write the book on multimodal foundation models.
