We recommend setting up a conda environment for the project:
```bash
conda create --name=pg_video_llava python=3.10
conda activate pg_video_llava

git clone https://github.com/mbzuai-oryx/Video-LLaVA.git
cd Video-LLaVA
pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"
```
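If the environment is set up correctly, the repo's own package should now be importable (a minimal sanity check; `video_chatgpt` is the package used by the demo commands further below):

```bash
# Run from the repo root: confirms PYTHONPATH resolves the video_chatgpt package
python -c "import video_chatgpt; print('video_chatgpt imported OK')"
```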
Additionally, install FlashAttention, which is required for training:
```bash
pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
git checkout v1.0.7
python setup.py install
```
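To confirm the build succeeded, a minimal check (assuming the install exposes the `flash_attn` module, as the FlashAttention v1.x setup script does):

```bash
# Import the compiled extension and print its version
python -c "import flash_attn; print(flash_attn.__version__)"
```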
- Download LLaVA-v1.5 weights from HuggingFace.
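For example, one way to fetch the 7B checkpoint directly into the expected location is via git-lfs (a sketch; the `liuhaotian/llava-v1.5-7b` repo id is an assumption based on the official LLaVA-v1.5 release, substitute your preferred source):

```bash
# Sketch: clone the LLaVA-v1.5-7B checkpoint from the Hugging Face Hub
# (the repo id below is an assumption; adjust if you use a different mirror)
git lfs install
git clone https://huggingface.co/liuhaotian/llava-v1.5-7b weights/llava/llava-v1.5-7b
```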
- Download projector weights:
  - Projector for LLaVA-v1.5-7B (download)
  - Projector for LLaVA-v1.5-13B (download)
The rest of the documentation assumes these weight files are stored in the following structure:
```
Video-LLaVA
└── weights
    ├── llava
    │   ├── llava-v1.5-7b
    │   └── llava-v1.5-13b
    └── projection
        ├── mm_projector_7b_1.5_336px.bin
        └── mm_projector_13b_1.5_336px.bin
```
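To create this layout before moving the downloaded files in, a minimal sketch (run from the repo root):

```bash
# Create the expected weight directories
mkdir -p weights/llava weights/projection
```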
- Set up DEVA as mentioned here.
- Set up Grounded-Segment-Anything as mentioned here.
- Save or symlink all the tracker weights at Video-LLaVA/grounding_evaluation/weights. (The weight files can be downloaded from here.)
```
Video-LLaVA
└── grounding_evaluation
    └── weights
        ├── DEVA-propagation.pth
        ├── groundingdino_swint_ogc.pth
        ├── GroundingDINO_SwinT_OGC.py
        ├── mobile_sam.pt
        ├── ram_swin_large_14m.pth
        └── sam_vit_h_4b8939.pth
```
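If the weight files were downloaded elsewhere, symlinking them into place is enough (a sketch; /path/to/downloads is a placeholder for wherever you saved them):

```bash
# Symlink previously downloaded tracker weights into the expected folder
mkdir -p grounding_evaluation/weights
ln -s /path/to/downloads/* grounding_evaluation/weights/
```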
E.g.: Run the CLI demo without grounding:
```bash
export PYTHONPATH="./:$PYTHONPATH"

python video_chatgpt/chat.py \
    --model-name <path_to_LLaVA-7B-1.5_weights> \
    --projection_path <path_to_projector_weights_for_LLaVA-7B-1.5> \
    --use_asr \
    --conv_mode pg-video-llava
```
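With the directory layout above, a concrete invocation would look like this (the paths are illustrative and follow the structure shown earlier; adjust to your setup):

```bash
export PYTHONPATH="./:$PYTHONPATH"
python video_chatgpt/chat.py \
    --model-name weights/llava/llava-v1.5-7b \
    --projection_path weights/projection/mm_projector_7b_1.5_336px.bin \
    --use_asr \
    --conv_mode pg-video-llava
```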
E.g.: Run the CLI demo with grounding:
```bash
export PYTHONPATH="./:$PYTHONPATH"
export OPENAI_API_KEY=<OpenAI API Key>

python video_chatgpt/chat.py \
    --model-name <path_to_LLaVA-7B-1.5_weights> \
    --projection_path <path_to_projector_weights_for_LLaVA-7B-1.5> \
    --use_asr \
    --conv_mode pg-video-llava \
    --with_grounding
```
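The grounded variant differs from the previous invocation only by the API key and the --with_grounding flag; with the same illustrative paths:

```bash
export PYTHONPATH="./:$PYTHONPATH"
export OPENAI_API_KEY=<OpenAI API Key>
python video_chatgpt/chat.py \
    --model-name weights/llava/llava-v1.5-7b \
    --projection_path weights/projection/mm_projector_7b_1.5_336px.bin \
    --use_asr \
    --conv_mode pg-video-llava \
    --with_grounding
```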