
Real image input example #4

Closed
JamshedAlamQaderi opened this issue Mar 30, 2024 · 6 comments

Comments


JamshedAlamQaderi commented Mar 30, 2024

Hello @kyegomez ,

Thank you so much for this awesome repo. I'm very excited to test this project, so I tried the example code, but it gives me the error below:

SyntaxError: Non-UTF-8 code starting with '\xff' in file C:\Users\alamj\Downloads\screenai.py on line 1, but no encoding declared; see https://peps.python.org/pep-0263/ for details
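For what it's worth, a leading `\xff` byte usually means the file was saved in a non-UTF-8 encoding: UTF-16 files begin with the BOM bytes `0xFF 0xFE` (or `0xFE 0xFF`), and Python source must be UTF-8 unless an encoding is declared. A minimal sketch of how to check (the `demo_utf16.py` filename is just for illustration):

```python
# Simulate a file saved as UTF-16 (as some editors or downloads produce)
# and inspect its first bytes.
path = "demo_utf16.py"
with open(path, "w", encoding="utf-16") as f:
    f.write("print('hello')\n")

with open(path, "rb") as f:
    head = f.read(2)

print(head)  # a UTF-16 BOM: b'\xff\xfe' or b'\xfe\xff'
# Fix: re-save the file as UTF-8, e.g. open(path, "w", encoding="utf-8")
```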

Could you provide a real example of giving an input image and text, converting them to tensors, and feeding them to the model? I really want to try it out.

Thank you!


@Yingrjimsch

Hello,

I could run it with an actual image using the following code:

import torch
from torchvision.io import read_image
from screenai.main import ScreenAI

# Create a tensor for the image
image = read_image('test.png').unsqueeze(0).to(torch.float32)
# Create a tensor for the text
text = torch.randint(0, 20000, (1, 1028))

# Create an instance of the ScreenAI model with specified parameters
model = ScreenAI(
    num_tokens = 20000,
    max_seq_len = 1028,
    patch_size=16,
    image_size=224,
    dim=512,
    depth=6,
    heads=8,
    vit_depth=4,
    multi_modal_encoder_depth=4,
    llm_decoder_depth=4,
    mm_encoder_ff_mult=4,
)

# Perform forward pass of the model with the given text and image tensors
out = model(text, image)

# Print the output tensor
print(out)

and a test image, which needs to be 224 × 224 pixels. [attached: test image]

Maybe this helps.
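If an image is not already 224 × 224, it can be resized before the forward pass. A minimal sketch using plain `torch.nn.functional.interpolate` (a random tensor stands in here for the tensor returned by `read_image(...).unsqueeze(0)`):

```python
import torch
import torch.nn.functional as F

# Stand-in for a loaded image: batch of one 3-channel 480x640 image
image = torch.rand(1, 3, 480, 640)

# Resize to the model's expected image_size of 224 x 224
image = F.interpolate(
    image, size=(224, 224), mode="bilinear", align_corners=False
)

print(image.shape)  # torch.Size([1, 3, 224, 224])
```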


JamshedAlamQaderi commented Apr 6, 2024

@Yingrjimsch thank you so much for the help. Can you also tell me how to encode a prompt text to a tensor, and how to decode the output tensor?

@Yingrjimsch

Hi @JamshedAlamQaderi, I haven't had time to try that yet, but I would suggest using the Hugging Face transformers library to find a tokenizer. Use the tokenizer on your input text and set num_tokens and max_seq_len to the tokenizer's specs. If I have time, I'll try it as well and keep you updated.
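To illustrate the encode/decode round trip without pinning down a specific tokenizer, here is a toy whitespace-tokenizer sketch. In practice you would swap in a real Hugging Face tokenizer and its `encode`/`decode` methods, sizing `num_tokens` from its vocabulary size; the `ToyTokenizer` class and the sample phrases below are purely hypothetical:

```python
# Toy whitespace tokenizer: maps words to ids and back, illustrating the
# encode -> model -> argmax -> decode round trip at a small scale.
class ToyTokenizer:
    def __init__(self, corpus):
        # Build a fixed vocabulary from the corpus
        words = sorted({w for text in corpus for w in text.split()})
        self.word_to_id = {w: i for i, w in enumerate(words)}
        self.id_to_word = {i: w for w, i in self.word_to_id.items()}

    def encode(self, text):
        # Text -> list of token ids (feedable into a LongTensor)
        return [self.word_to_id[w] for w in text.split()]

    def decode(self, ids):
        # Token ids -> text
        return " ".join(self.id_to_word[i] for i in ids)

tok = ToyTokenizer(["click the submit button", "open the menu"])
ids = tok.encode("click the submit button")
print(ids)
print(tok.decode(ids))  # "click the submit button"
# num_tokens would be len(tok.word_to_id); a model's output logits of shape
# (batch, seq_len, num_tokens) are decoded by taking argmax over the last
# dimension and passing the resulting ids to tok.decode.
```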

@Barney-Steven

Barney-Steven commented Apr 9, 2024

Hi @JamshedAlamQaderi, this repo is not the official implementation; you can see the definition in `from screenai.main import ScreenAI`, and it is a very simple structure. ScreenAI is not open source for now. I found something similar on Hugging Face; try moondream2.

@JamshedAlamQaderi

Thank you guys for helping me

@JamshedAlamQaderi closed this as not planned on Apr 9, 2024.