
Simple interactive web-based demo for Socratic models

[paper] [official repository] [official project website] [Hugging Face Spaces]

Abstract

Large pretrained (e.g., “foundation”) models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot, i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.


NOTE:

  • This demo produces captions and search keywords that are highly relevant to an input image.
  • This is an unofficial repository providing a simple interactive web-based demo for Socratic Models.
    • I would like to share an easy-to-use demo for Socratic Models.
    • The authors provide demo code, but it is built as a Python notebook and the zero-shot classifiers must be computed before it can be tested, which takes a long time.
  • This repo contains zero-shot classifiers precomputed with the text encoder of the CLIP ViT-L/14 model (see the sketch below):
    • an object classifier using class names from Tencent ML-Images
    • a place classifier using class names from Places365
    • an additional object classifier using class names from Open Images
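
A minimal sketch of how such a classifier can be precomputed, assuming a hypothetical class-name list and prompt template (the repo already ships the resulting weights, so this step is only needed to rebuild them):

    import torch
    import clip

    # Precompute a zero-shot classifier: encode one text prompt per class name
    # with the CLIP ViT-L/14 text encoder and keep the normalized embeddings.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-L/14", device=device)

    # Illustrative class names; the repo uses Tencent ML-Images, Places365, and Open Images.
    class_names = ["dog", "bicycle", "coffee cup"]
    prompts = [f"a photo of a {name}" for name in class_names]

    with torch.no_grad():
        text_tokens = clip.tokenize(prompts).to(device)
        text_features = model.encode_text(text_tokens)
        text_features /= text_features.norm(dim=-1, keepdim=True)

    # text_features has shape (num_classes, 768); save it once and reuse it for every image.
    torch.save(text_features.cpu(), "object_classifier_vit_l14.pt")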

prompt for image captioning

    prompt_caption = f'''I am an intelligent image captioning bot.
    This image is a {img_type}. There {ppl_result}.
    I think this photo was taken at a {sorted_places[0]}, {sorted_places[1]}, or {sorted_places[2]}.
    I think there might be a {object_list} in this {img_type}.
    A creative short caption I can generate to describe this image is:'''
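
The README does not show how the placeholders above are filled. A minimal sketch of one way to do it with the precomputed classifiers; the file names and the `place_names`/`object_names` lists are assumptions for illustration, not the repo's actual loading code:

    import torch
    import clip
    from PIL import Image

    # Score the input image against the precomputed zero-shot classifiers and
    # rank classes by cosine similarity to fill the prompt placeholders.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-L/14", device=device)

    # Hypothetical files: classifier weight matrices and their parallel class-name lists.
    place_classifier = torch.load("place_classifier_vit_l14.pt").to(device)    # (num_places, 768)
    object_classifier = torch.load("object_classifier_vit_l14.pt").to(device)  # (num_objects, 768)
    place_names = open("places365_names.txt").read().splitlines()
    object_names = open("object_names.txt").read().splitlines()

    image = preprocess(Image.open("input.jpg")).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        image_features /= image_features.norm(dim=-1, keepdim=True)

    place_scores = (image_features.float() @ place_classifier.float().T).squeeze(0)
    object_scores = (image_features.float() @ object_classifier.float().T).squeeze(0)

    sorted_places = [place_names[i] for i in place_scores.topk(3).indices.tolist()]
    object_list = ", ".join(object_names[i] for i in object_scores.topk(5).indices.tolist())
    # img_type and ppl_result can be filled the same way with small
    # "photo/painting/sketch" and "people count" text classifiers.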

prompt for keyword generation

    prompt_search = f'''Let's list keywords that include the following description.
    This image is a {img_type}. There {ppl_result}.
    I think this photo was taken at a {sorted_places[0]}, {sorted_places[1]}, or {sorted_places[2]}.
    I think there might be a {object_list} in this {img_type}.
    Relevant keywords which we can list and are separated with comma are:'''
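
With `prompt_caption` and `prompt_search` built as above, the completions can be requested from GPT-3. A minimal sketch using the legacy Completion API of the pre-1.0 `openai` package; the engine name and sampling parameters are assumptions, not necessarily what demo_socratic.py uses:

    import openai

    openai.api_key = "YOUR_OpenAI_API_KEY"

    def complete(prompt: str) -> str:
        # Ask GPT-3 for a single completion of the given prompt.
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=prompt,
            max_tokens=64,
            temperature=0.7,
        )
        return response["choices"][0]["text"].strip()

    caption = complete(prompt_caption)
    keywords = complete(prompt_search)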

Usage

installation

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git
$ pip install openai
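
A quick sanity check that PyTorch and CLIP installed correctly (a sketch; the ViT-L/14 weights are downloaded on first load):

    import torch
    import clip

    # Should print the available CLIP model names and load ViT-L/14 without errors.
    print(clip.available_models())
    model, preprocess = clip.load("ViT-L/14", device="cuda" if torch.cuda.is_available() else "cpu")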

Get your OpenAI API key for GPT-3

How to run the demo

$ python demo_socratic.py --port 5000 --openai-API-key {YOUR_OpenAI_API_KEY}

How to use the demo

  • Just fetch an image URL.

(screenshot: fetching an image URL)

Result

(screenshot: reasoning and generated results)
