Stable Diffusion for 4.1 or later #877

Open · freedomtan opened this issue Apr 30, 2024 · 18 comments

@freedomtan
Contributor

  1. mainly follow what the main Inference group has, replacing SD XL with SD 1.5
  2. we don't have 1.5 Keras or TFLite files, so we can start with the 1.4 TFLite files (1.5 and 1.4 share the same architecture, just different weights)
  3. we can try to convert the Inference scripts to C/C++ (with selected images from COCO as the benchmark dataset)
  4. there are 2 metrics (FID and ??); @freedomtan to update this.
@freedomtan
Contributor Author

From the Inference group task summary slide: https://docs.google.com/presentation/d/1jHuhzyo_4zR1gjIsAxMywpDDN_D7H0mG0CoPqkPi3PU/edit?usp=drive_link

@freedomtan
Contributor Author

@mohitmundhragithub please check if you need more information

For mobile devices, we should start with the following (a reference-pipeline sketch follows the list):

  1. Stable Diffusion 1.5 (or 1.4, when the 1.5 TFLite files are not ready)
  2. output image size 512x512
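
A rough host-side reference for this configuration: a minimal sketch assuming the diffusers package and the runwayml/stable-diffusion-v1-5 checkpoint (checkpoint name and step count are illustrative, not decided here); the mobile benchmark would mirror this pipeline with TFLite models.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the SD 1.5 pipeline (text encoder + UNet + VAE decoder + scheduler).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32)

# Generate one 512x512 image from a COCO-style caption.
image = pipe("a photo of an astronaut riding a horse",
             height=512, width=512, num_inference_steps=20).images[0]
image.save("sd15_512x512.png")
```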

@freedomtan
Contributor Author

more on stable diffusion

@freedomtan
Contributor Author

@Mostelk With AI Edge Torch, a tool that can convert PyTorch models directly to TFLite, I managed to convert the HuggingFace SD 1.5 UNet to saved_model and TFLite. The tool has some rough edges, but it mostly works.
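
The conversion path looks roughly like this (a sketch, assuming the ai_edge_torch and diffusers packages; the sample shapes are SD 1.5 defaults and the wrapper is my own, not part of either library):

```python
import ai_edge_torch
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet").eval()

class UNetWrapper(torch.nn.Module):
    """Give the exporter a plain tensor-in / tensor-out forward()."""
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, latents, timestep, text_embeddings):
        return self.unet(latents, timestep, text_embeddings, return_dict=False)[0]

sample_inputs = (
    torch.randn(1, 4, 64, 64),   # latents for a 512x512 image
    torch.tensor(1),             # denoising timestep
    torch.randn(1, 77, 768),     # CLIP text-encoder hidden states
)
edge_model = ai_edge_torch.convert(UNetWrapper(unet).eval(), sample_inputs)
edge_model.export("sd15_unet.tflite")
```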

@freedomtan
Contributor Author

Converting Stable Diffusion 1.5 Text Encoder and Unet to TFLite, https://colab.research.google.com/drive/11S1LM9S0eHO22VF3-AgOJJXJ9jNXr-ZN?usp=sharing.

@mohitmundhragithub
Contributor

Converting Stable Diffusion 1.5 Text Encoder and Unet to TFLite, https://colab.research.google.com/drive/11S1LM9S0eHO22VF3-AgOJJXJ9jNXr-ZN?usp=sharing.

Freedom, I'm unable to access the link. Is it possible to share it in some Google Drive location?

@freedomtan
Contributor Author

Converting Stable Diffusion 1.5 Text Encoder and Unet to TFLite, https://colab.research.google.com/drive/11S1LM9S0eHO22VF3-AgOJJXJ9jNXr-ZN?usp=sharing.

Freedom, I'm unable to access the link. Is it possible to share it in some Google Drive location?

Access permission updated. Please try again. It's in my personal Google Drive.

@aswib

aswib commented May 21, 2024

Hi @freedomtan, thanks. We were able to convert the text encoder and UNet to TFLite. What about the VAE decoder from the SD pipeline? Are there any limitations in converting it, or have we just not tried it yet?

@freedomtan
Contributor Author

Hi @freedomtan, thanks. We were able to convert the text encoder and UNet to TFLite. What about the VAE decoder from the SD pipeline? Are there any limitations in converting it, or have we just not tried it yet?

I had trouble converting the VAE decoder. It seems the VAE decoder doesn't follow some of the rules torch.export.export() expects, so the export doesn't work. More work is needed; that is, I guess we need to modify the VAE decoder.
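
One possible direction (an untested sketch; whether it actually satisfies the exporter is exactly the open question above): wrap the decoder so torch.export.export() sees a plain tensor-in / tensor-out forward() instead of the diffusers output dataclass.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae").eval()

class VaeDecoderWrapper(torch.nn.Module):
    def __init__(self, vae):
        super().__init__()
        self.vae = vae

    def forward(self, latents):
        # 1/0.18215 is the SD 1.5 latent scaling applied before decoding.
        return self.vae.decode(latents / 0.18215, return_dict=False)[0]

latents = torch.randn(1, 4, 64, 64)
exported = torch.export.export(VaeDecoderWrapper(vae).eval(), (latents,))
```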

@anhappdev added this to the v4.1 milestone May 29, 2024
@freedomtan
Contributor Author

Let's try to implement FID and CLIP (with the images / output-tensor distributions pre-generated on a host machine; that is, not trying to run Inception V3 and the CLIP image part on Android devices).

@RSMNYS to check feasibility.

@freedomtan
Contributor Author

freedomtan commented Jun 21, 2024

For the CLIP score: it's the cosine similarity between text features and image features, where the text features come from sending the captions to the CLIP text encoder. So yes, it's certainly possible to pre-compute the text features (they are produced by sending the prompts to the text encoder). For the image features, however, we need to send an image to the CLIP image encoder and take its output. Using COCO images is not enough; we have to send the generated images to the CLIP image encoder to get the image features.

So: I guess we need the CLIP image encoder on Android.
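
For reference, the metric itself reduces to this (a rough host-side sketch, assuming the transformers package and the openai/clip-vit-base-patch32 checkpoint as an example; the on-device version would run a converted TFLite CLIP instead):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(generated_image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=generated_image,
                       padding=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the (already L2-normalized) image and text
    # embeddings, conventionally scaled by 100 and clipped at 0.
    cos = (out.image_embeds * out.text_embeds).sum(dim=-1)
    return max(0.0, 100.0 * cos.item())
```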

@mohitmundhragithub and @AhmedTElthakeb

@freedomtan
Contributor Author

For the FID score, we need to compare two distributions obtained by sending ground-truth images and generated images through Inception V3. The former can be computed offline; the latter is supposed to be computed on-device. So we need the Inception V3 related tooling.
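
The score itself is the Fréchet distance between two Gaussians fitted to the Inception V3 pool features; a minimal sketch assuming both feature sets are already available (e.g. the ground-truth side pre-computed offline), using numpy/scipy only:

```python
import numpy as np
from scipy import linalg

def fid_from_features(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FID from two arrays of Inception V3 features, shape (num_images, 2048)."""
    mu1, sigma1 = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
    mu2, sigma2 = gen_feats.mean(axis=0), np.cov(gen_feats, rowvar=False)
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):  # drop tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * np.trace(covmean))
```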

@freedomtan
Contributor Author

Let's discuss in the mobile working group meeting:

  • do we want to have two scores or only one?
  • if two scores are used, do we need to change either LoadGen or our dataset interface? And maybe the UI?

@freedomtan to do CLIP score in C/C++
@aswib to do FID score

@freedomtan
Contributor Author

freedomtan commented Jun 28, 2024

For the CLIP score: it turns out to be quite straightforward. Convert an OpenAI CLIP model to TFLite and run it with the TFLite interpreter; then we can get CLIP scores.

See my reference code at https://github.com/freedomtan/clip_score_on_android/.

For our accuracy use, we need to (a preprocessing sketch follows the list):

  1. use the tokenizer output and pad the attention mask
  2. resize the image (512x512) to 224x224 and convert NHWC to NCHW (we need resizing anyway; Inception V3 expects 1x299x299x3)
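
For item 2, a sketch with NumPy/PIL only (the normalization constants are the standard OpenAI CLIP values; the exact resize filter is an assumption):

```python
import numpy as np
from PIL import Image

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess_for_clip(image_512: Image.Image) -> np.ndarray:
    """512x512 RGB image -> 1x3x224x224 float32 tensor (NCHW)."""
    img = image_512.resize((224, 224))              # default (bicubic) resampling
    x = np.asarray(img, dtype=np.float32) / 255.0   # HWC, scaled to [0, 1]
    x = (x - CLIP_MEAN) / CLIP_STD                  # per-channel normalization
    return x.transpose(2, 0, 1)[None, ...]          # NHWC -> NCHW, add batch dim
```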

@freedomtan
Contributor Author

For the output to LoadGen: ::mlperf::QuerySamplesComplete() is called to return the processed outputs:

::mlperf::QuerySamplesComplete(responses.data(), responses.size());

For the non-offline case:

for (int idx = 0; idx < samples.size(); ++idx) {
  ::mlperf::QuerySample sample = samples.at(idx);
  std::vector<void*> inputs = dataset_->GetData(sample.index);
  backend_->SetInputs(inputs);
  backend_->IssueQuery();
  // Report to mlperf.
  std::vector<void*> outputs = backend_->GetPredictedOutputs();
  response_data.push_back(dataset_->ProcessOutput(sample.index, outputs));
  responses.push_back(
      {sample.id,
       reinterpret_cast<std::uintptr_t>(response_data[idx].data()),
       response_data[idx].size()});
  backend_->FlushQueries();
  query_counter_ += 1;
}

What is returned is a QuerySampleResponse, which uses uintptr_t data:

https://github.com/mlcommons/inference/blob/9e2c9f642e6e12b74e7c08d2e099c8af0e542873/loadgen/query_sample.h#L49-L76

My understanding is that LoadGen actually treats the output data as opaque blobs, so it's not necessary to return accuracy metrics through it.

@freedomtan
Contributor Author

We may need to change or extend ComputeAccuracy() if we use two scores (FID and CLIP). However, this has nothing to do with the LoadGen interface.

virtual float ComputeAccuracy() { return -1.0f; }

@anhappdev
Collaborator

For the CLIP score: it turns out to be quite straightforward. Convert an OpenAI CLIP model to TFLite and run it with the TFLite interpreter; then we can get CLIP scores.

See my reference code at https://github.com/freedomtan/clip_score_on_android/.

For our accuracy use, we need to

  1. use the tokenizer output and pad the attention mask
  2. resize the image (512x512) to 224x224 and convert NHWC to NCHW (we need resizing anyway; Inception V3 expects 1x299x299x3)

To confirm my understanding, here is a quick overview of the steps to implement the score calculation:

  1. Use the CLIPTokenizer to preprocess the prompts and get input_ids and attention_mask.
  2. Get the pixel_values (512x512 image) from the backend output, resize to 224x224, normalize, and convert the data layout (what the CLIPImageProcessor does).
  3. Pass input_ids, attention_mask and pixel_values to a CLIPModel, then get the logits_per_image and logits_per_text as outputs. This is also the CLIP score.

Step 1 can be done beforehand using a Python script, to pre-generate the input_ids and attention_mask and store them in a .tfrecord file.
Steps 2 and 3 will be done on-device in C++.
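
For step 1, something along these lines should work (a sketch assuming the transformers and tensorflow packages; the feature names and output file name are illustrative, not a fixed format):

```python
import tensorflow as tf
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

def write_caption_tfrecord(captions, path="captions.tfrecord"):
    with tf.io.TFRecordWriter(path) as writer:
        for caption in captions:
            tok = tokenizer(caption, padding="max_length", max_length=77,
                            truncation=True)
            example = tf.train.Example(features=tf.train.Features(feature={
                "input_ids": int64_feature(tok["input_ids"]),
                "attention_mask": int64_feature(tok["attention_mask"]),
            }))
            writer.write(example.SerializeToString())
```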

@freedomtan Is my understanding correct? Is there anything else I need to pay attention to?

@freedomtan
Contributor Author

@freedomtan Is my understanding correct? Is there anything else I need to pay attention to?

YES, that's correct.
