InstructBLIP_SCST

An improved version of InstructBLIP that uses SCST to reduce visual reasoning errors (oversights, hallucinations, ...).


LAVIS's InstructBLIP model finetuned on remote sensing image-text data via Reinforcement Learning. The aim is to teach Visual Reasoning to a VLM on Remote Sensing imagery: since Visual Reasoning data is quite scarce in the remote sensing domain, the goal of this RL finetuning is to better exploit the existing data and to "enforce" Visual Reasoning in RS VLMs.



❓ About

Forked from SalesForce's LAVIS repository, this improved version implements Reinforcement Learning to bolster image captioning abilities in the specific domain of remote sensing. On top of optimization through Cross-Entropy loss minimization, a few supplementary Reinforcement Learning epochs are run to guide the model towards more desirable outputs, using learning signals tailored to the domain of Remote Sensing. More precisely, Self-Critical Sequence Training (https://arxiv.org/abs/1612.00563), a variant of the REINFORCE algorithm which is similar to PPO, is used to enforce these learning signals.

Additional info
Note that SCST can be made compatible with PPO/GRPO, with the caveat that there are no intermediate rewards during the generation of a caption (the full generated caption is required to compute the learning signals).
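For intuition, here is a minimal PyTorch sketch of the SCST update used in this kind of training. Names and shapes are illustrative only; the actual implementation lives in this repository's LAVIS task/loss code.

```python
import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-Critical Sequence Training policy loss (sketch).

    sample_logprobs: (batch, seq_len) log-probabilities of the sampled caption tokens,
                     with padding positions already masked/zeroed.
    sample_reward:   (batch,) reward of the sampled caption (e.g. CIDEr, SDE, NKL, ...).
    greedy_reward:   (batch,) reward of the greedily decoded caption, used as baseline.
    """
    # Advantage: how much better the sampled caption is than the greedy baseline.
    advantage = (sample_reward - greedy_reward).detach()          # (batch,)
    # REINFORCE with a self-critical baseline: increase the log-probability of
    # sampled captions that beat the baseline, decrease it otherwise.
    loss = -(advantage.unsqueeze(1) * sample_logprobs).mean()
    return loss
```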

🛠 Built With

  • πŸ— SalesForce's LAVIS - Core vision-language model, easily adaptable to RL
  • πŸ“Š FACTUAL Scene Graph Extractor - One of the most impactful reward function is obtained by measuring the closeness of generated captions and ground-truth (human-annotated) captions. FACTUAL extracts "scene graphs", like the SPICE metric, to compute such a reward by comparing the graphs. It also highlights the missing objects and the hallucinations made by the model.
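As an illustration of the idea only (the repository's actual rewards live in lavis/tasks/rewards.py and rely on FACTUAL), a toy graph-comparison reward could look like the sketch below; `extract_entities` is a hypothetical stand-in for a FACTUAL-based extractor.

```python
def caption_error_sets(generated, references, extract_entities):
    """Toy comparison of a generated caption against reference captions.

    extract_entities: hypothetical callable mapping a caption to a set of
    lemmatized object names, e.g. a thin wrapper around the FACTUAL parser.
    """
    gen = extract_entities(generated)
    ref = set().union(*(extract_entities(r) for r in references))
    oversights = ref - gen        # ground-truth objects the model missed
    hallucinations = gen - ref    # objects the model invented
    # F1-style overlap used as the reward: cover ground-truth objects
    # without inventing new ones.
    if not gen or not ref:
        reward = 0.0
    else:
        precision = len(gen & ref) / len(gen)
        recall = len(gen & ref) / len(ref)
        reward = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return reward, oversights, hallucinations
```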

📈 Diagram of the model

Figure BLIP_SCST

📊 Qualitative results

Examples extracted from RSICD's test split

Example captioning

Caption generated by our best model

  • three tennis courts are next to a road and some green trees

Human-created captions

  • Several orange and green tennis courts sit side by side beside the road.
  • Three tennis courts are surrounded by several buildings and green trees.
  • Three tennis courts are surrounded by several buildings and trees.
  • Three tennis courts semi-surrounded by some trees and buildings is next to a road.
  • Three tennis courts are surrounded by several buildings and trees.

Our model fails to mention the buildings.

Example captioning

Caption generated by our best model

  • six white storage tanks are near a road and some green meadows

Human-created captions

  • There are seven white columnar tanks near a road with two tracks.
  • Seven same cylinder storage tanks are placed on the grass between a forest and a road.
  • Seven same cylinder storage tanks are placed on the grass between a forest and a road.
  • Seven storage tanks stand alongside the straight road where two trucks are running.
  • Seven white storage tanks in two lines are near some green trees.

Our model doesn't count the storage tanks correctly, likely because two of them are not fully within the picture. It also fails to mention the two trucks, and mistakes a forest for meadows.

📈 Quantitative results

📖 Standard captioning metrics

Our model is first evaluated on standard captioning metrics; among these, SPICE is the most correlated with human judgement.

Experiments were conducted on RSICD, UCM-Captions, and on VRSBench.

When evaluated on RSICD using these metrics, our method demonstrates state-of-the-art (SOTA) performance. CE = Cross-Entropy loss training.

RSICD standard captioning metrics

📈 Custom metrics (oversights, hallucinations)

Reward functions used to optimize these metrics directly (A.K.A. learning signals)

NKL: Negative Kullback-Leibler Divergence. Using a small language model (a pretrained BERT model from spaCy), we compute embeddings for every token of the ground-truth captions and of the generated captions. This yields two distributions of embeddings, which we bring closer by minimizing their KL-Divergence.
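One possible instantiation of such an NKL reward is sketched below, summarizing each caption's token-embedding cloud as a diagonal Gaussian and using a spaCy pipeline with static vectors (en_core_web_md) as a stand-in for the BERT-based spaCy model; the exact formulation in rewards.py may differ.

```python
import numpy as np
import spacy

# A spaCy pipeline with word vectors; assumed installed. The repository uses a
# BERT-based spaCy model instead, which exposes contextual embeddings.
nlp = spacy.load("en_core_web_md")

def token_embeddings(text):
    """Return a (num_tokens, dim) array of token vectors."""
    return np.stack([tok.vector for tok in nlp(text)])

def negative_kl_reward(generated, reference, eps=1e-6):
    """NKL-style reward (sketch): negative KL divergence between diagonal Gaussians
    fitted to the token-embedding clouds of the generated and reference captions."""
    g, r = token_embeddings(generated), token_embeddings(reference)
    mu_g, var_g = g.mean(0), g.var(0) + eps
    mu_r, var_r = r.mean(0), r.var(0) + eps
    # KL(N_gen || N_ref) for diagonal Gaussians, summed over embedding dimensions.
    kl = 0.5 * np.sum(np.log(var_r / var_g) + (var_g + (mu_g - mu_r) ** 2) / var_r - 1.0)
    return -float(kl)  # higher (closer to 0) means the two clouds are closer
```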

CIDEr: a classic captioning metric that measures the similarity of TF-IDF n-gram vectors between the generated and reference sentences.

length: the opposite (negative) of the number of tokens in the generated caption (since the policy loss is minimized, minimizing the negative length amounts to maximizing the length of the generated captions).

SDE: Scene Description Exhaustiveness, the proportion of entities from the ground-truth caption(s) that appear in the generated caption. It serves to pull ground-truth entities into the generated captions, aligning the model with the expert human annotators. Entities are lemmatized before the score is computed, to avoid false negatives.

SDE computation example:

β€’ Generated caption: There is a forest. (object: forest)
β€’ Ground-truth caption 1: There is a forest and a river. (objects: forest, river)
β€’ Ground-truth caption 2: There is a forest, a river and a road. (objects: forest, river, road)

Objects detected in the human-annotated (ground-truth) captions: forest, river, road (3 objects)
Object detected in the model's output caption: forest (1 object)

Therefore, the SDE score is 1/3 in this example.
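A minimal sketch of the SDE computation, using spaCy noun lemmas as an illustrative stand-in for the FACTUAL-extracted entities:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline, assumed installed

def entities(caption):
    """Lemmatized noun lemmas as a rough stand-in for FACTUAL-extracted objects."""
    return {tok.lemma_.lower() for tok in nlp(caption) if tok.pos_ == "NOUN"}

def sde(generated, references):
    """Scene Description Exhaustiveness: fraction of ground-truth entities
    that also appear in the generated caption."""
    gt = set().union(*(entities(r) for r in references))
    if not gt:
        return 1.0
    return len(entities(generated) & gt) / len(gt)

# Worked example from above:
gen = "There is a forest."
refs = ["There is a forest and a river.",
        "There is a forest, a river and a road."]
print(sde(gen, refs))  # -> 0.333... (1 of 3 ground-truth objects mentioned)
```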

RSICD Dataset

RSICD oversights/hallucinations

The up and down arrows next to the scores indicate the direction in which a score should move to improve. For instance, -2.62% in the "oversights" column on the first line follows the arrow's direction, meaning it is moving the right way. However, our method fails to address hallucinations on this dataset, probably because of the relatively short captions. VRSBench has longer, more expressive captions and a more elaborate vocabulary than the other datasets, making it the best candidate for testing hallucination reduction.

UCM Dataset

UCM" width="600" height="150" class=

Our method seems even more effective on the UCM dataset. This might be because the dataset is quite small and contains many duplicate captions.

VRSBench

VRSBench

VRSBench captions are particularly long, which allows us to demonstrate the effectiveness of our approach in decreasing hallucinations without affecting its ability to decrease oversights.

➕ Addendum to the policy loss of SCST

Another loss term, termed V/E for Varentropy/Entropy, is jointly minimized with the policy loss. Inspired by Entropix, the point is to balance diverse vocabulary usage (high entropy) against consistent token distributions (low varentropy). This significantly limits degenerate generated token distributions while encouraging vocabulary exploration, which broadens the model's vocabulary by taking inspiration from the human-annotated captions. The constant $\lambda = 10^{-4}$ is the weight the V/E term is multiplied by to control its magnitude. A sketch of such a term is given below.
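One plausible form of this regularizer, computed from the decoder's per-step token distributions, is sketched here; the exact combination of entropy and varentropy used in the repository may differ.

```python
import torch

def varentropy_entropy_term(logits, lam=1e-4):
    """Sketch of a V/E regularizer on the per-step token distributions.

    logits: (batch, seq_len, vocab) raw decoder scores for the generated sequence.
    Returns lam * (mean varentropy - mean entropy); minimizing it encourages
    high entropy (diverse vocabulary) and low varentropy (consistent distributions).
    """
    logp = torch.log_softmax(logits, dim=-1)
    p = logp.exp()
    surprisal = -logp                                        # per-token surprisal
    entropy = (p * surprisal).sum(-1)                        # (batch, seq_len)
    varentropy = (p * (surprisal - entropy.unsqueeze(-1)) ** 2).sum(-1)
    return lam * (varentropy.mean() - entropy.mean())
```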

A simple ablation study, where two models are trained under the same conditions but one with CIDEr only and the other with CIDEr and V/E, demonstrates a slight improvement in oversights reduction, BLEU, METEOR, and CIDEr.

🚀 Getting Started

Prerequisites

  • Clone the present repository (installing the original LAVIS repository would require multiple precise code modifications that have already been made in this repository).

RS-LAVIS with RL

  • After cloning this repository, create an environment, activate it, and install the libraries from requirements.txt. Python 3.9+ is required.

conda

conda create --name lavis_rl python=3.9
conda activate lavis_rl

pip

pip install -r requirements.txt

FACTUAL Scene Graph Extraction

Crucial for the "Object Proportion" (SDE) learning signal to work.

pip install FactualSceneGraph

Alternatively, choose a pretrained model from Hugging Face, as listed in https://github.com/zhuang-li/FactualSceneGraph
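As a hedged usage sketch, FACTUAL checkpoints are seq2seq models and can be driven through Hugging Face transformers; the checkpoint name below is a placeholder to be replaced by an actual model from the FACTUAL listings.

```python
from transformers import pipeline

# Placeholder checkpoint name: pick an actual FACTUAL scene-graph model
# from the FactualSceneGraph repository / its Hugging Face listings.
FACTUAL_CHECKPOINT = "path-or-hub-id-of-a-factual-flan-t5-model"

parser = pipeline("text2text-generation", model=FACTUAL_CHECKPOINT)

caption = "three tennis courts are next to a road and some green trees"
# FACTUAL maps a caption to a textual scene-graph representation, which can
# then be parsed into (subject, relation, object) triplets for the reward.
print(parser(caption, max_new_tokens=128)[0]["generated_text"])
```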

πŸŽ›οΈ Training configuration

The training configuration for captioning can be found here: lavis/projects/blip2/train/caption_rs_ft.yaml

BLIP2 models

Alternative frozen vision encoders can be used with BLIP2. They can be found in lavis/configs/models/blip2.

Datasets configurations

The .yaml file for dataset configuration may be found here: lavis/configs/datasets/rs/defaults_cap.yaml. The image folder must contain every image in the dataset, regardless of the split it belongs to. The JSON files containing the captions for the train, val, and test splits must be in COCO format (see the sketch below).
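For reference, a minimal COCO-style caption file has the following layout; file names and ids are illustrative, and the dataset loader in lavis/datasets may expect additional fields.

```python
import json

# Minimal COCO-style caption annotations (file names and ids are illustrative).
coco_captions = {
    "images": [
        {"id": 1, "file_name": "airport_1.jpg"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "caption": "many planes are parked near the terminal"},
    ],
}

with open("rs_train_coco.json", "w") as f:
    json.dump(coco_captions, f)
```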

Object-detector-based "pseudo-captioning" can be activated by editing lines 48 and 72 of lavis/datasets/datasets/rs.py. This can slightly improve performance.

In case you need to modify the dataset config, edit this code: lavis/datasets/builders/rs_caption.py.

Finally, set the paths to your val and test JSON files in lavis/tasks/captioning.py, lines 138-139.

⌛ Start training

Once everything is correctly installed and configured, run the following command:

python train.py --cfg-path your_main_folder/LAVIS/lavis/projects/blip2/train/caption_rs_ft.yaml --model_name eva_clip_g_plus

πŸ† Best model

Weights for the best InstructBLIP model we have obtained are available at https://huggingface.co/tdujardin/InstructBLIP_RS_RL/tree/main.
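A minimal loading sketch with LAVIS is given below; the model name/type and the checkpoint path are assumptions and should be adapted to the configuration the released weights were trained with.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Model name/type are assumptions for a standard LAVIS InstructBLIP config;
# use the configuration matching the released checkpoint.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct", model_type="vicuna7b", is_eval=True, device=device
)
# Path to the fine-tuned weights downloaded from the Hugging Face repository above.
model.load_checkpoint("path/to/InstructBLIP_RS_RL_checkpoint.pth")

image = vis_processors["eval"](
    Image.open("example_rs_image.jpg").convert("RGB")
).unsqueeze(0).to(device)
print(model.generate({"image": image, "prompt": "Describe the image in detail."}))
```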

βš™οΈ Learning signals registry

The "rewards.py" registry of learning signals may be found in InstructBLIP_SCST/lavis/tasks/rewards.py

🧾 License

This repository contains code derived from SalesForce's LAVIS which is licensed under the BSD 3-Clause License, and code from FACTUAL which is licensed under the MIT License.

  • New contributions to this repository are licensed under the MIT License.
  • Portions derived from LAVIS remain under the BSD 3-Clause License.
  • The sentence object extractor, FACTUAL, is licensed under the MIT License.

πŸ™ Acknowledgements

We extend our gratitude to SalesForce for developing the LAVIS repository, which provides an intuitive library of Vision-Language models. Implementing Reinforcement Learning was made significantly easier by their work.

Additionally, one of our main learning signals for RL was based on FACTUAL, a finetuned FLAN-T5 model that extracts scene graphs.
