COCA/GIT/BLIP/CLIP Caption tool as a Colab notebook and Python script
Similarity matching for tags now eliminates duplicate-sounding tags produced by CLIP interrogation. This affects interrogation with the interrogate_classic and interrogate_fast methods. The uniquify_tags feature also eliminates similar tags, which is useful when running CLIP flavors on existing caption files that already include tags.
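To illustrate the idea, here is a minimal sketch of similarity-based tag deduplication. It is not the tool's actual code: it uses the standard library's difflib rather than whatever matcher the tool uses, the function names are hypothetical, and it assumes a tag is dropped when it scores above the fuzz ratio (default 60.0) against a tag that was already kept.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two tags (case-insensitive)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def uniquify_tags(tags, fuzz_ratio=60.0):
    """Keep the first occurrence of each tag.

    Assumption: a later tag is dropped when its similarity to any
    already-kept tag is greater than fuzz_ratio.
    """
    kept = []
    for tag in tags:
        tag = tag.strip()
        if not tag:
            continue  # skip empty entries left by stray commas
        if any(similarity(tag, k) > fuzz_ratio for k in kept):
            continue  # too close to a tag we already have
        kept.append(tag)
    return kept


if __name__ == "__main__":
    print(uniquify_tags(["a dog", "a dog", "portrait", "a dogs"]))
```

A lower fuzz ratio removes more tags, since more pairs count as similar. A higher one only removes near-identical tags.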
- Notebook v0.3.0 - Added similarity matching UI support
- v0.3.0 - Added similarity matching when uniquifying tags
- Notebook v0.2.4 - Added caption editor widget
- v0.2.1 - Added experimental BLIP2 questions support
- Notebook v0.2.3 - Added experimental BLIP2 questions support
- Notebook v0.2.2 - Added BLIP2 support
- v0.2.0 - Added BLIP2 support
- Notebook v0.2.1 - Added GDrive cell
- Notebook v0.2.0 - Refactored to support new module
- v0.1.0 - Initial release of local script
Feature requests? Support? Find me on the FinetunerAI Labs Discord
Warning: BLIP2 takes up a large amount of disk space. Users have reported as much as 45GB. It also requires at least 24GB of VRAM.
Following the instructions below with a Python version newer than 3.8 will cause errors.
Clone the repo
git clone
Install requirements
pip install -r requirements.txt
Sample arguments that will execute a GIT pass, followed by a Coca pass if the fail phrases are triggered, and then append CLIP flavors, for multiple datasets intended for an SD 2.x model:
python e:\data\set1 e:\data\set2 --existing=skip --cap_length=300 --git_pass --coca_pass --model_order='git,coca' --clip_model_name=ViT-H-14/laion2b_s32b_b79k --clip_flavor --clip_max_flavors=32 --clip_method=interrogate_fast --fail_phrases="a sign that says,writing that says,that says,with the word" --uniquify_tags --prepend_text="a photo of " --device=cuda --extension=txt
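The fallback behavior in the command above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's implementation: caption_failed, caption_with_fallback, and the captioners mapping are hypothetical names, the phrase match is assumed to be a case-insensitive substring check, and returning the last caption when every model fails is also an assumption.

```python
DEFAULT_FAIL_PHRASES = "a sign that says,writing that says,that says,with the word"


def caption_failed(caption, fail_phrases=DEFAULT_FAIL_PHRASES):
    """Return True if the caption contains any of the comma-separated fail phrases."""
    phrases = [p.strip().lower() for p in fail_phrases.split(",") if p.strip()]
    text = caption.lower()
    return any(p in text for p in phrases)


def caption_with_fallback(image, captioners, model_order="git,coca",
                          fail_phrases=DEFAULT_FAIL_PHRASES):
    """Try each model in model_order until one yields a caption with no fail phrase.

    captioners maps a model name to a callable that takes an image and
    returns a caption string (hypothetical interface).
    """
    caption = ""
    for name in model_order.split(","):
        captioner = captioners.get(name.strip())
        if captioner is None:
            continue  # this pass was not enabled
        caption = captioner(image)
        if not caption_failed(caption, fail_phrases):
            return caption
    # Assumption: if every model fails, keep the last caption produced.
    return caption


if __name__ == "__main__":
    # Stub captioners standing in for real GIT and Coca models.
    stubs = {
        "git": lambda img: "a sign that says hello",
        "coca": lambda img: "a red barn in a field",
    }
    print(caption_with_fallback("example.jpg", stubs))
```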
python --help
Caption a set of images
positional arguments:
folder One or more folders to scan for images. Images should be jpg/png.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
--output OUTPUT Output to a folder rather than side by side with image files
--existing {skip,ignore,copy,prepend,append}
Action to take for existing caption files (default: skip)
--cap_length CAP_LENGTH
Maximum length of caption. (default: 0)
--git_pass Perform a GIT model pass
--coca_pass Perform a Coca model pass
--blip_pass Perform a BLIP model pass
--model_order MODEL_ORDER
Perform captioning/fallback using this order (default: coca,git,blip)
--use_blip2 Uses BLIP2 for BLIP pass. Only activated when --blip_pass also specified
--blip2_model {blip2_t5/pretrain_flant5xxl,blip2_opt/pretrain_opt2.7b,blip2_opt/pretrain_opt6.7b,blip2_opt/caption_coco_opt2.7b,blip2_opt/caption_coco_opt6.7b,blip2_t5/pretrain_flant5xl,blip2_t5/caption_coco_flant5xl}
Specify the BLIP2 model to use
--blip2_question_file BLIP2_QUESTION_FILE
Specify a question file to use to query BLIP2 and add answers as tags
--blip_beams BLIP_BEAMS
Number of BLIP beams (default: 64)
--blip_min BLIP_MIN BLIP min length (default: 30)
--blip_max BLIP_MAX BLIP max length (default: 75)
--clip_model_name {ViT-H-14/laion2b_s32b_b79k,ViT-L-14/openai,ViT-bigG-14/laion2b_s39b_b160k}
CLIP model to use. Use ViT-H for SD 2.x, ViT-L for SD 1.5 (default: ViT-H-14/laion2b_s32b_b79k)
--clip_flavor Add CLIP Flavors
--clip_max_flavors CLIP_MAX_FLAVORS
Max CLIP Flavors (default: 8)
--clip_artist Add CLIP Artists
--clip_medium Add CLIP Mediums
--clip_movement Add CLIP Movements
--clip_trending Add CLIP Trendings
--clip_method {interrogate,interrogate_fast,interrogate_classic}
CLIP method to use
--fail_phrases FAIL_PHRASES
Phrases that will fail a caption pass and move to the fallback model. (default: "a sign that says,writing that says,that says,with the word")
--ignore_tags IGNORE_TAGS
Comma separated list of tags to ignore
--find FIND Perform find and replace with --replace REPLACE
--replace REPLACE Perform find and replace with --find FIND
--folder_tag Tag the image with folder name
--folder_tag_levels FOLDER_TAG_LEVELS
Number of folder levels to tag. (default: 1)
--folder_tag_stop FOLDER_TAG_STOP
Do not tag folders any deeper than this path. Overrides --folder_tag_levels if --folder_tag_stop is shallower
--uniquify_tags Ensure tags are unique
--fuzz_ratio FUZZ_RATIO
Sets the similarity ratio allowed for tags when uniquifying (default: 60.0)
--prepend_text PREPEND_TEXT
Prepend text to final caption
--append_text APPEND_TEXT
Append text to final caption
--preview Do not write to caption file. Just displays preview in STDOUT
--use_filename Read the existing caption from the filename, stripping all special characters/numbers
--device {cuda,cpu} Device to use. (default: cuda)
--extension {txt,caption}
Caption file extension. (default: txt)
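As an example of what the --use_filename option describes, here is a minimal sketch that derives a caption from a filename by stripping special characters and numbers. The function name and the exact stripping rule (everything except letters becomes a space) are assumptions, not the tool's actual behavior.

```python
import re
from pathlib import Path


def caption_from_filename(path):
    """Turn an image filename into caption text.

    Assumption: every run of digits, underscores, and other non-letter
    characters is replaced by a single space.
    """
    stem = Path(path).stem  # filename without folder or extension
    text = re.sub(r"[\W\d_]+", " ", stem)
    return text.strip()


if __name__ == "__main__":
    print(caption_from_filename("photos/red_barn-0042 (1).png"))
```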
- @cacoe for the inception of the idea. Be sure to check out his new IlluminatiAI model v1.1. It slaps.
- @Kaz, @jvkas, and @PeePa for help with testing
- @Stille Willem and @NimbusFPV for the blunts