
Question: The code to generate Vision Features #35

Open
roapple10 opened this issue Mar 2, 2023 · 5 comments

Comments

@roapple10

I would like to study the Vision Features in more detail. Would it be possible to share the code used to generate the .npy file?
Much appreciated; thanks for the hard work here.

@gianfrancodemarco

Bump

@Francesco-Ranieri

Hi,
I wrote this code snippet for visual feature extraction. Unfortunately, the results it produces on the ScienceQA dataset differ slightly from those in this repository. Despite this, the extracted features have the expected shape and work for both classification and rationale generation.
Hope it can be useful.

from transformers import AutoImageProcessor, DetrForObjectDetection
from PIL import Image
import torch

pretrained_model = "facebook/detr-resnet-101-dc5"
image_processor = AutoImageProcessor.from_pretrained(pretrained_model)
model = DetrForObjectDetection.from_pretrained(pretrained_model)
model.eval()  # inference mode: disables dropout for deterministic features

image_path = "img.jpg"
image = Image.open(image_path).convert("RGB")  # DETR expects a 3-channel image
inputs = image_processor(images=image, return_tensors="pt")

# run without gradient tracking so the output can be converted to NumPy directly
with torch.no_grad():
    outputs = model(**inputs)

# the last hidden state holds the final query embeddings of the Transformer decoder,
# shape (1, num_queries, hidden_dim) = (1, 100, 256) for this checkpoint
vision_features = outputs.last_hidden_state.numpy()
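
To produce a dataset-level .npy file like the one shipped with the repo, a loop over the per-question images along these lines might work (continuing from the snippet above; the image paths and the stacked layout are assumptions for illustration, not the authors' confirmed pipeline):

import numpy as np

# hypothetical list of image paths, one per question with an image (assumed layout)
image_paths = ["images/1/image.png", "images/2/image.png"]

all_features = []
for path in image_paths:
    image = Image.open(path).convert("RGB")
    inputs = image_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # drop the batch dimension -> (num_queries, hidden_dim)
    all_features.append(outputs.last_hidden_state.squeeze(0).numpy())

# stack to (num_questions, num_queries, hidden_dim) and save
np.save("vision_features.npy", np.stack(all_features))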

@aiPenguin

Thanks to the author for this awesome work!

Some questions in the dataset have both an image for the question and images for the choices. I was wondering how the authors get the visual features in this case. Is some pooling function applied?

How do you deal with this case, Francesco-Ranieri?

@Francesco-Ranieri

As far as I understood from their implementation, exactly one image feature vector is used for each question. Since the code for vision feature generation is not available, we need an answer from the authors to know whether any pooling function was applied.
However, I honestly think that only one image was taken into consideration.
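
For what it's worth, if one did want to combine several images for a single question, mean pooling the per-image query embeddings would keep the output at the same (num_queries, hidden_dim) shape. This is just a sketch of one possibility, not what this repo does:

import torch

def pool_vision_features(feature_list):
    # feature_list: one (num_queries, hidden_dim) tensor per image,
    # e.g. the DETR decoder outputs from the snippet above
    stacked = torch.stack(feature_list)  # (num_images, num_queries, hidden_dim)
    return stacked.mean(dim=0)           # mean pool -> (num_queries, hidden_dim)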

@aiPenguin

Same opinion here. But I found that there are more features in the .npy file than there are questions with image contexts, so I opened another issue about it: #46
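
A quick sanity check is to compare the number of feature rows against the number of problems that actually have an image (the file names below are assumptions). If the feature count matches the total number of problems instead, a placeholder such as all-zeros features for image-less questions would explain the surplus, but that is only a guess pending the authors' answer:

import json
import numpy as np

features = np.load("vision_features.npy")    # assumed file name
with open("problems.json") as f:             # assumed file name
    problems = json.load(f)
n_with_image = sum(1 for p in problems.values() if p.get("image"))

print(f"feature rows: {features.shape[0]}, questions with images: {n_with_image}")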
