Stable Diffusion for 4.1 or later #877
from Inference group task summary slide: https://docs.google.com/presentation/d/1jHuhzyo_4zR1gjIsAxMywpDDN_D7H0mG0CoPqkPi3PU/edit?usp=drive_link
@mohitmundhragithub please check if you need more information. For mobile devices, we should start from
more on stable diffusion
@Mostelk With AI Edge Torch, a tool that can convert PyTorch models directly to TFLite, I managed to convert the HuggingFace SD 1.5 UNet to saved_model and tflite. The tool has some rough edges, but it mostly works.
Converting Stable Diffusion 1.5 Text Encoder and UNet to TFLite: https://colab.research.google.com/drive/11S1LM9S0eHO22VF3-AgOJJXJ9jNXr-ZN?usp=sharing
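For reference, a minimal sketch of that kind of conversion with AI Edge Torch, assuming the HuggingFace diffusers SD 1.5 UNet; the model id, the wrapper, and the example input shapes are assumptions, and the actual Colab may differ:

```python
# Sketch: convert the SD 1.5 UNet to TFLite with AI Edge Torch.
# Model id, wrapper, and input shapes are assumptions based on the
# HuggingFace diffusers SD 1.5 pipeline.
import torch
import ai_edge_torch
from diffusers import UNet2DConditionModel


class UNetWrapper(torch.nn.Module):
    """Wrap the diffusers UNet so forward() returns a plain tensor."""

    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, latents, timestep, text_emb):
        return self.unet(latents, timestep, text_emb, return_dict=False)[0]


unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
wrapper = UNetWrapper(unet).eval()

# Example inputs: 64x64 latents, one denoising step, 77-token CLIP text embedding.
sample_args = (
    torch.randn(1, 4, 64, 64),
    torch.tensor(1, dtype=torch.int64),
    torch.randn(1, 77, 768),
)

edge_model = ai_edge_torch.convert(wrapper, sample_args)
edge_model.export("sd15_unet.tflite")
```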
Freedom, I'm unable to access the link. Is it possible to share it in some Google Drive location?
Access permission updated. Please try again. It's in my personal Google Drive.
Hi @freedomtan, thanks. We were able to convert the text encoder and UNet to TFLite. What about the VAE decoder from the SD pipeline? Are there any limitations in converting it, or have we just not tried yet?
I had trouble converting the VAE decoder. It seems the VAE decoder model doesn't follow some kind of rules, so that the
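For context, one way to attempt that conversion is sketched below; the wrapper around the decode path, the model id, and the latent shape are assumptions, and the export may hit the same rough edges mentioned above:

```python
# Sketch: attempt to convert the SD 1.5 VAE decoder with AI Edge Torch.
# Wrapper, model id, and latent shape are assumptions; this is the kind
# of export that ran into trouble as described above.
import torch
import ai_edge_torch
from diffusers import AutoencoderKL


class VaeDecoderWrapper(torch.nn.Module):
    """Expose only the decode path as a plain tensor-in/tensor-out forward."""

    def __init__(self, vae):
        super().__init__()
        self.vae = vae

    def forward(self, latents):
        return self.vae.decode(latents, return_dict=False)[0]


vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)
wrapper = VaeDecoderWrapper(vae).eval()

sample_args = (torch.randn(1, 4, 64, 64),)
edge_model = ai_edge_torch.convert(wrapper, sample_args)
edge_model.export("sd15_vae_decoder.tflite")
```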
Let's try to implement FID and CLIP (with images / the distribution of output tensors pre-generated on a host machine; I mean, not trying to run Inception V3 and the CLIP image part on Android devices). @RSMNYS to check feasibility.
For the CLIP score: it's the cosine similarity between text features and image features, where the text features come from sending the captions to the CLIP text encoder. So yes, surely the text features can be pre-computed, but we still need to generate them (by sending the prompts to the text encoder). For the image features, we need to send an image to the CLIP image encoder and take its output. Using COCO images is not enough; we have to send the generated images to the CLIP image encoder to get the image features. So I guess we need the CLIP image encoder on Android.
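For illustration, a minimal host-side sketch of that computation using the HuggingFace transformers CLIP implementation; the model id, file name, and helper name are assumptions:

```python
# Sketch: CLIP score = cosine similarity between caption features and
# generated-image features. Host-side reference; the model id is an assumption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_score(caption: str, image: Image.Image) -> float:
    with torch.no_grad():
        text_inputs = processor(text=[caption], return_tensors="pt", padding=True)
        text_feat = model.get_text_features(**text_inputs)    # could be pre-computed

        image_inputs = processor(images=image, return_tensors="pt")
        image_feat = model.get_image_features(**image_inputs)  # needs the generated image

    # Cosine similarity of L2-normalized features.
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    return (text_feat * image_feat).sum().item()


print(clip_score("a photo of a cat", Image.open("generated.png")))
```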
For the FID score, we need to compare two distributions, obtained by sending ground-truth images and generated images through Inception V3. The former can be computed offline; the latter is supposed to be computed on device. So we need the Inception V3 related tooling.
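For reference, a minimal sketch of the Fréchet distance between the two Inception V3 feature distributions, assuming the pool features (e.g. 2048-dim) have already been extracted, ground truth offline and generated images on device:

```python
# Sketch: Frechet Inception Distance between two sets of Inception V3
# pool features. The feature extraction itself is assumed to have
# happened already (ground truth offline, generated images on device).
import numpy as np
from scipy import linalg


def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """real_feats, gen_feats: arrays of shape (N, 2048)."""
    mu1, mu2 = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma1 = np.cov(real_feats, rowvar=False)
    sigma2 = np.cov(gen_feats, rowvar=False)

    # FID = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2))
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```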
Let's try to discuss this in the mobile working group meeting.
@freedomtan to do CLIP score in C/C++
For the CLIP score: it turns out to be quite straightforward. Convert an OpenAI CLIP model to TFLite and run it with the TFLite interpreter, and then we can get CLIP scores. See my reference code at https://github.com/freedomtan/clip_score_on_android/. For our accuracy use, we need to
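The reference code above is C/C++ on Android; below is a host-side Python sketch of the same flow with the TFLite interpreter, where the file names, the 224x224 input size, the simplified preprocessing, and the pre-computed text features are assumptions:

```python
# Sketch: run a TFLite CLIP image encoder and score against a pre-computed
# text feature vector. File names, input size, and the single-input /
# single-output tensor layout are assumptions.
import numpy as np
import tensorflow as tf
from PIL import Image

interpreter = tf.lite.Interpreter(model_path="clip_image_encoder.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Pre-computed on the host by sending the caption to the CLIP text encoder.
text_feat = np.load("caption_text_features.npy")  # assumed shape (1, feature_dim)

# Simplified preprocessing of the generated image; a real scorer would also
# apply CLIP's mean/std normalization.
image = Image.open("generated.png").resize((224, 224))
pixels = np.asarray(image, dtype=np.float32)[None, ...] / 255.0

interpreter.set_tensor(inp["index"], pixels)
interpreter.invoke()
image_feat = interpreter.get_tensor(out["index"])

# Cosine similarity between L2-normalized features = CLIP score.
text_feat = text_feat / np.linalg.norm(text_feat, axis=-1, keepdims=True)
image_feat = image_feat / np.linalg.norm(image_feat, axis=-1, keepdims=True)
print(float((text_feat * image_feat).sum()))
```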
For the output to LoadGen: the
For the non-offline case: mobile_app_open/flutter/cpp/mlperf_driver.cc, lines 66 to 80, at 09e4b41
What is returned in the
My understanding is that LoadGen actually treats output data as opaque blobs, and it's not necessary to return accuracy metrics.
We may need to or extend the mobile_app_open/flutter/cpp/dataset.h, line 68, at 09e4b41
To confirm my understanding, here is a quick overview of the steps to implement the score calculation:
Step 1 can be done beforehand using a Python script, to pre-generate the
@freedomtan Is my understanding correct? Is there anything else I need to pay attention to?
YES, that's correct.
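For illustration, a minimal sketch of such a pre-generation script: it runs the evaluation captions through the CLIP text encoder on the host and dumps normalized text features for later use on device; the caption file, the model id, and the .npy output format are assumptions:

```python
# Sketch: pre-generate CLIP text features for the evaluation captions on a
# host machine. Caption file, model id, and .npy output are assumptions.
import json

import numpy as np
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

with open("coco_captions.json") as f:
    captions = json.load(f)  # assumed: a list of caption strings

features = []
with torch.no_grad():
    for caption in captions:
        tokens = tokenizer([caption], padding=True, return_tensors="pt")
        feat = model.get_text_features(**tokens)
        feat = feat / feat.norm(dim=-1, keepdim=True)  # L2-normalize once, offline
        features.append(feat.squeeze(0).numpy())

# One row per caption; loaded on device (or by the C++ scorer) at accuracy time.
np.save("caption_text_features.npy", np.stack(features))
```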