An all-bugs-fixed (I hope so), cleaner and improved implementation of LLaVA-UHD, based on the code from the official repo, LLaVA-UHD.
The study of LLaVA-UHD is a great work for the community of LMMs.
However, the official-released repo is full of bugs, wrong logic, uncleared/uncommented debugging codes, and missing but necessary procedures. For some examples, it wrongly deal with the patch numbers of the slices and the overall image in a batch, wrongly calculate pos_embed
and attention_mask
in resampler
, misses the separate tokens between sub-images, and misses the procedures of normalizing the image according CLIP's preprocessing, etc. And many people have raised problems related to this issue. It seems the official team will not fix these issues in the near future.
Thus, this repo aims to address these issues and improve the overall quality of the implementation, following the spirit of the paper LLaVA-UHD. I also want to open-source this modified repo to welcome others from the community, to use it, review the code to see if there are still undiscovered mistakes, and feedback.
- cleaned up the file structure and the code
- fixed the bugs
- supplemented the missing procedures
- improved the implementation of some important functions
- a little modification to the architecture definition of the
Resampler
You can find details about the major modifications by searching the comments like [Edited by zhenwei - 2024-xx-xx xx:xx]
in the code. I have left detailed comments there to explain what I have changed and why (mostly in Chinese though 😊).
The preparation of the environment and datasets follows LLaVA. Run the bash script to start the training:
bash scripts/train.sh
Note: dependency packages of higher versions may lead problems. For example, you may meet token mismatch
warning at stage-2. Commenting out the line use_fast=False
of the tokenizer may help.
It is ongoing... I will release checkpoints as soon as I reproduce the results of a satisfactory performance.
- add codes for model evaluation on common benchmarks
- release reproducible checkpoint
- add logic for simultaneously fine-tuning the vision encoder when stage-2
This repo is based on the official repo LLaVA-UHD. I would like to express my profound gratitude and respect for their valuable work.
If you find this repo useful for your research and applications, please cite:
@misc{llava-uhd-better,
author = {Shao Zhenwei},
title = {LLaVA-UHD-Better},
year = {2024},
publisher = {GitHub},
url = {https://github.com/ParadoxZW/LLaVA-UHD-Better},
note = {GitHub repository},
}
@article{xu2024llava-uhd,
title={{LLaVA-UHD}: an LMM Perceiving Any Aspect Ratio and High-Resolution Images},
author={Xu, Ruyi and Yao, Yuan and Guo, Zonghao and Cui, Junbo and Ni, Zanlin and Ge, Chunjiang and Chua, Tat-Seng and Liu, Zhiyuan and Huang, Gao},
journal={arXiv preprint arXiv:2403.11703},
year={2024}
}