# STEVE-EYE: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds
<div align="center">

[[Website]](https://sites.google.com/view/steve-eye)
[[Arxiv Paper]](https://arxiv.org/abs/2310.13255)

[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/MineDojo)](https://pypi.org/project/MineDojo/)
[<img src="https://img.shields.io/badge/Framework-PyTorch-red.svg"/>](https://pytorch.org/)
[![GitHub license](https://img.shields.io/github/license/MineDojo/MineCLIP)](https://github.com/BAAI-Agents/SteveEye/blob/main/LICENSE)


<div align="left">

Steve-Eye is an end-to-end trained large multimodal model that equips LLM-based embodied agents with visual perception in open worlds: it integrates an LLM with a visual encoder to process visual-text inputs and generate multimodal feedback.
We adopt a semi-automatic strategy to collect an extensive dataset of 850K open-world instruction pairs, enabling the model to cover three essential functions for an agent: multimodal perception, a foundational knowledge base, and skill prediction and planning.
Our contributions can be summarized as follows:

* **Open-World Instruction Dataset:** We construct instruction data for acquiring the three functions above; the data contain not only the agent's per-step status and environmental features but also the essential knowledge agents need to act and plan.

* **Large Multimodal Model and Training:** Steve-Eye combines a visual encoder, which converts visual inputs into a sequence of embeddings, with a pre-trained LLM that empowers embodied agents to engage in skill or task reasoning in an open world (a minimal sketch of this visual-text fusion appears after the figure below).

* **Open-World Benchmarks:** We develop the following benchmarks to evaluate agent performance from a broad range of perspectives: (1) environmental visual captioning (ENV-VC); (2) foundational knowledge question answering (FK-QA); (3) skill prediction and planning (SPP).


![](figs/plan4mc.png)
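
At a high level, the fusion described above follows the common pattern of projecting visual-encoder features into the LLM's token-embedding space and prepending them to the instruction tokens. The sketch below only illustrates that pattern; the dimensions, class names, and the generic Transformer stand-in for the Llama-2 backbone are placeholders, not part of the Steve-Eye codebase.

```python
import torch
import torch.nn as nn

class VisualTextFusionSketch(nn.Module):
    """Minimal sketch: project visual features into the LLM embedding
    space and prepend them to the text token embeddings."""

    def __init__(self, visual_dim=512, llm_dim=256, vocab_size=32000):
        super().__init__()
        # Stand-ins for the real components (e.g. CLIP/VQ-GAN encoder, Llama-2);
        # real dimensions would be far larger than these toy values.
        self.visual_proj = nn.Linear(visual_dim, llm_dim)
        self.token_embed = nn.Embedding(vocab_size, llm_dim)
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, visual_feats, text_ids):
        # visual_feats: (batch, n_patches, visual_dim) from the visual encoder
        # text_ids:     (batch, n_tokens) tokenized instruction
        vis_tokens = self.visual_proj(visual_feats)         # map into LLM space
        txt_tokens = self.token_embed(text_ids)
        fused = torch.cat([vis_tokens, txt_tokens], dim=1)  # visual-text sequence
        return self.lm_head(self.backbone(fused))           # next-token logits


# Toy usage: 16 image patches and a 12-token instruction.
model = VisualTextFusionSketch()
logits = model(torch.randn(2, 16, 512), torch.randint(0, 32000, (2, 12)))
print(logits.shape)  # torch.Size([2, 28, 32000])
```

In a real system a causal (decoder-only) LLM with its own tokenizer would replace the toy encoder stack, and the projection layer is typically the main new component trained to align the two modalities.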

## Environmental Visual Captioning (ENV-VC) Results
| Model | Visual Encoder | Inventory <img src="figs/icons/inventory.png" height="12pt"> | Equip <img src="figs/icons/iron-axe.png" height="12pt"> | Object in Sight <img src="figs/icons/cow.png" height="12pt"> | Life <img src="figs/icons/heart.jpg" height="12pt"> | Food <img src="figs/icons/hunger.png" height="12pt"> | Sky <img src="figs/icons/sky.png" height="12pt"> |
|----------------|----------------|-----------|-------|-----------------|------|------|-----|
| BLIP-2 | CLIP | 41.6 | 58.5 | 64.7 | 88.5 | 87.9 | 57.6|
| Llama-2-7b | - | - | - | - | - | - | - |
| Steve-Eye-7b | VQ-GAN | 89.9 | 78.3 | 87.4 | 92.1 | 90.2 | 68.5|
| Steve-Eye-13b | MineCLIP | 44.5 | 61.8 | 72.2 | 89.2 | 88.6 | 68.2|
| Steve-Eye-13b | VQ-GAN | 91.1 | 79.6 | 89.8 | 92.7 | 90.8 | 72.7|
| Steve-Eye-13b | CLIP | **92.5** | **82.8** | **92.1** | **93.1** | **91.5** | **73.8** |
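
The README does not spell out how these per-attribute numbers are computed. Purely as an illustration, a per-field exact-match accuracy such as the hypothetical helper below would yield scores on the same 0-100 scale; it is not the official ENV-VC evaluation code, and the field names are made up.

```python
from typing import Dict, List

def field_accuracy(predictions: List[Dict[str, str]],
                   references: List[Dict[str, str]],
                   field: str) -> float:
    """Hypothetical metric: % of samples whose predicted value for one
    environment field (e.g. 'equip', 'life') matches the reference."""
    correct = sum(p.get(field) == r.get(field)
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

preds = [{"equip": "iron axe", "life": "20/20"},
         {"equip": "none",        "life": "14/20"}]
refs  = [{"equip": "iron axe", "life": "20/20"},
         {"equip": "stone sword", "life": "14/20"}]
print(field_accuracy(preds, refs, "equip"))  # 50.0
print(field_accuracy(preds, refs, "life"))   # 100.0
```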

## Foundational Knowledge Question Answering (FK-QA) Results
Values in parentheses are score differences relative to the Llama-2-7b baseline.
| Model | Wiki Page (score) | Wiki Table (score) | Recipe (score) | TEXT All (score) | TEXT (accuracy) | IMG (accuracy) |
|---------------|-------------------|--------------------|----------------|------------------|-----------------|----------------|
| Llama-2-7b | 6.90 | 6.21 | 7.10 | 6.62 | 37.01% | - |
| Llama-2-13b | 6.31 (-0.59)| 6.16 (-0.05)| 6.31 (-0.79)| 6.24 (-0.38)| 37.96%| - |
| Llama-2-70b | 6.91 (+0.01)| 6.97 (+0.76)| 7.23 (+0.13)| 7.04 (+0.42)| 38.27%| - |
| GPT-3.5-turbo | 7.26 (+0.36)| 7.15 (+0.94)| 7.97 (+0.87)| 7.42 (+0.80)| 41.78%| - |
| Steve-Eye-7b | 7.21 (+0.31)| 7.28 (+1.07)| 7.82 (+0.72)| 7.54 (+0.92)| 43.25%| 62.83%|
| Steve-Eye-13b | 7.38 (+0.48)| 7.44 (+1.23)| 7.93 (+0.83)| 7.68 (+1.06)| 44.36%| 65.13%|


## Skill Prediction and Planning (SPP) Results


## Citation
```
@article{zheng2023steve,
title={Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds},
author={Zheng, Sipeng and Liu, Jiazheng and Feng, Yicheng and Lu, Zongqing},
journal={arXiv preprint arXiv:2310.13255},
year={2023}
}
```