
Check Mobilenet V4 Large on iPhones #865

Open
freedomtan opened this issue Mar 10, 2024 · 56 comments

@freedomtan
Contributor

Currently, I got:

| device | Mobilenet V4 Large (qps) | Mobilenet EdgeTPU (qps) |
| --- | --- | --- |
| iPhone 13 | 220.11 | 617.78 |
| iPhone 14 Pro | 300.06 | 970.95 |
| iPhone 15 Pro Max | 332.95 | 1145.05 |

Roughly, > 300 qps for iPhone 13 should be possible.

#821 (comment)

@RSMNYS
Contributor

RSMNYS commented Mar 10, 2024

@freedomtan please share how to check the model accuracy for Mobilenet V4: what dataset do I need to use, and are there specific steps to set up the accuracy test on an iOS device? Thanks

@freedomtan
Contributor Author

> @freedomtan please share how to check the model accuracy for Mobilenet V4: what dataset do I need to use, and are there specific steps to set up the accuracy test on an iOS device? Thanks

To validate the accuracy of image classification models, we use the full ImageNet 2012 validation dataset (50,000 images) from https://www.image-net.org/index.php.

@RSMNYS
Contributor

RSMNYS commented Mar 22, 2024

@freedomtan I've tried the accuracy test with the CoreML backend and the TF backend for the Image Classification tasks v1 and v2. In each case it crashes after reaching 100%: EXC_BAD_ACCESS (code=1, address=0x27c8) in the accuracy computation. I'm going to check what the problem is.

@RSMNYS
Contributor

RSMNYS commented Mar 25, 2024

> @freedomtan I've tried the accuracy test with the CoreML backend and the TF backend for the Image Classification tasks v1 and v2. In each case it crashes after reaching 100%: EXC_BAD_ACCESS (code=1, address=0x27c8) in the accuracy computation. I'm going to check what the problem is.

I've found that the validation results were expected in a different format from the one I had (only the category number, without the image name). I can run the accuracy test now, but it reports 0.05% accuracy, so it might again be a dataset issue. When I tried the dataset from our tests it gives 100%, but that one has only 10 images.

(screenshot: IMG_9972)
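For reference, a minimal sketch of the kind of format fix described above: converting a ground-truth file that pairs image names with labels into a label-only file (one category index per line, in validation-image order). The file names here are hypothetical; the exact format the app expects may differ.

```python
# Hypothetical input: one "image_name label" pair per line, e.g.
#   ILSVRC2012_val_00000001.JPEG 65
# Output: one category index per line, ordered by image name.
with open("val_ground_truth.txt") as f:
    rows = [line.split() for line in f if line.strip()]

rows.sort(key=lambda r: r[0])  # keep the 50,000 entries in image order

with open("imagenet_val_labels.txt", "w") as f:
    f.writelines(label + "\n" for _, label in rows)
```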

@freedomtan
Contributor Author

freedomtan commented Mar 26, 2024

@RSMNYS I don't get it.

Is this the original Mobilenet EdgeTPU model we had or a new V4 one? As far as I can remember, we checked that we can get the expected accuracy numbers for the original one.

Please check

  1. run all the benchmark items with TFLite + CoreML Delegate backend as the baseline (optional)
  2. Mobilenet EdgeTPU's accuracy numbers (including both non-offline and offline ones)
  3. accuracy numbers of other models: as far as I can remember, all models except the MobileBERT should have good enough accuracy.

@freedomtan
Contributor Author

FYR, on an iPhone 13, for Mobilenet EdgeTPU I got 76.21% running a binary built from the latest master branch.

@RSMNYS
Contributor

RSMNYS commented Mar 26, 2024

Thanks, all works. For iPhone 14 Pro, I have the same 76.21%. I will try with ImageNet V2 and different optimised models based on it.

@RSMNYS
Contributor

RSMNYS commented Apr 2, 2024

All tests were done on iPhone 14 Pro

| Model Name | Performance (QPS) | Accuracy (%) | Size (MB) |
| --- | --- | --- | --- |
| MobilenetV4_Large.mlmodel | 268.58 | 81.82 | 130.1 |
| MobilenetV4_Large.mlpackage | 251.36 | 82.73 | 65.5 |
| MobilenetV4_Large.mlpackage (8 bit quantization) | 299.25 | 82.7 | 33.3 |
| MobilenetV4_Large.mlpackage (20% sparsity) | 258.39 | 82.26 | 56.6 |
| MobilenetV4_Large.mlpackage (30% sparsity) | 244.7 | 80.83 | 50.1 |
| MobilenetV4_Large.mlpackage (40% sparsity) | 261.22 | 74.15 | 43.6 |
| MobilenetV4_Large.mlpackage (30% sparsity, 8 bit quantization) | 299.4 | 80.83 | 50.1 |

Also, during the tests I noticed a performance drop when the device is warm (after several runs); sometimes it drops from 300 to 200 qps. Please also check the screenshot, which shows runs of MobilenetV4_Large.mlpackage (8 bit quantization) only. You can see how much the performance can differ. @freedomtan

(screenshot: IMG_0003)

Here is the link to models: https://github.com/RSMNYS/mobile_models/tree/main/v4_0/CoreML
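For context, variants like the 8-bit and sparsity rows above can be produced from an existing mlpackage with coremltools' data-free compression utilities. A minimal sketch, assuming coremltools 7.x (the config values are illustrative, not the exact settings used for the table):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("MobilenetV4_Large.mlpackage")

# 8-bit weight-only quantization (data-free).
quant_cfg = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
)
cto.linear_quantize_weights(mlmodel, config=quant_cfg).save(
    "MobilenetV4_Large_w8.mlpackage")

# Magnitude pruning to a target sparsity (30% here, as in one of the rows above).
prune_cfg = cto.OptimizationConfig(
    global_config=cto.OpMagnitudePrunerConfig(target_sparsity=0.3)
)
cto.prune_weights(mlmodel, config=prune_cfg).save(
    "MobilenetV4_Large_sparse30.mlpackage")
```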

@freedomtan
Contributor Author

freedomtan commented Apr 5, 2024

@RSMNYS thermal throttling is a well-known issue on cell phones. A typical way to get the numbers we want is to cool the device down before running a new test :-)

@freedomtan
Contributor Author

Please try to do the first 3 items and ensure that there is no thermal throttling, e.g., cold start, wait for 5 minutes, and then measure the performance numbers.

Note that currently we don't allow model pruning (sparsity above) for submission. If we want to allow that, we need to change our rules.

@RSMNYS
Contributor

RSMNYS commented Apr 12, 2024

All tests were done on iPhone 14 Pro

| Model Name | Performance (QPS) | Accuracy (%) | Size (MB) |
| --- | --- | --- | --- |
| MobilenetV4_Large.mlmodel | 294.85 | 81.2 | 124 |
| MobilenetV4_Large.mlpackage | 296.93 | 82.73 | 65.5 |
| MobilenetV4_Large.mlpackage (8 bit quantization) | 295.11 | 82.7 | 33.3 |

@freedomtan
Contributor Author

> All tests were done on iPhone 14 Pro
>
> | Model Name | Performance (QPS) | Accuracy (%) | Size (MB) |
> | --- | --- | --- | --- |
> | MobilenetV4_Large.mlmodel | 294.85 | 81.2 | 124 |
> | MobilenetV4_Large.mlpackage | 296.93 | 82.73 | 65.5 |
> | MobilenetV4_Large.mlpackage (8 bit quantization) | 295.11 | 82.7 | 33.3 |

These numbers look reasonable now. But let's see if we can improve them further.

Let's check if @colbybanbury can comment on this.

@freedomtan
Contributor Author

MobilenetV4 was made public last week; see https://arxiv.org/abs/2404.10518 or https://arxiv.org/html/2404.10518v1
According to the numbers in the paper, it should be possible to get > 300 qps on an iPhone 13.

@colbybanbury

The V4 paper results use an iPhone 13 and fp16 quantization. The model was also derived from a PyTorch equivalent in order to be in (batch, channel, height, width) tensor format, which I measured to be slightly faster.

I recommend using fp16 on iPhones older than the 15 Pro, which is where they added int8-int8 compute.

Happy to help if needed!

@anhappdev
Collaborator

@RSMNYS
From the paper https://arxiv.org/abs/2404.10518

> for benchmarks on the Apple Neural Engine (conducted on an iPhone 13 with iOS 16.6.1, CoreMLTools 7.1, and Xcode 15.0.1 for profiling), PyTorch models were converted to CoreML's MLProgram format in Float16 precision, with float16 MultiArray inputs to minimize input copying
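For reference, a minimal sketch of that conversion recipe (MLProgram in Float16 with a float16 MultiArray input) using coremltools. The checkpoint is the timm one discussed later in this thread, and the names/targets below are assumptions rather than the paper authors' exact script:

```python
import numpy as np
import timm
import torch
import coremltools as ct

# Load and trace a MobileNetV4 (timm checkpoint used later in this thread).
torch_model = timm.create_model(
    "hf-hub:timm/mobilenetv4_conv_large.e600_r384_in1k", pretrained=True)
torch_model.eval()
example_input = torch.rand(1, 3, 384, 384)
traced_model = torch.jit.trace(torch_model, example_input)

# Convert to an MLProgram in Float16 precision with a float16 MultiArray input.
model = ct.convert(
    traced_model,
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,
    minimum_deployment_target=ct.target.iOS16,  # needed for float16 typed inputs
    inputs=[ct.TensorType(name="images", shape=example_input.shape, dtype=np.float16)],
)
model.save("mobilenetv4_fp16.mlpackage")
```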

@RSMNYS
Contributor

RSMNYS commented Apr 25, 2024

@freedomtan can you please point to where we can get the MobileNet V4 PyTorch model? Currently we have only the TFLite one.

@colbybanbury

The PyTorch model has yet to be officially released. Sorry for the delay!

The TensorFlow model should still get similar latency results, but let me know if I can help with anything.

@freedomtan
Contributor Author

freedomtan commented Apr 30, 2024

@freedomtan to try it on iPhone 13 again.

@freedomtan
Contributor Author

> @freedomtan to try it on iPhone 13 again.

As I got before, on an iPhone 13 it's about 220 qps.

@freedomtan
Contributor Author

Let's try to have a PyTorch model (with weights from the TensorFlow model).

@RSMNYS
Contributor

RSMNYS commented May 14, 2024

@colbybanbury can you please tell us if you use mlmodel or mlpackage CoreML models in your tests?

@colbybanbury

I used MLPackage

@freedomtan
Contributor Author

@RSMNYS With Xcode 16.0 beta and iOS 18 + MLPackage targeting iOS 15 or later, it's possible to get per-op time. Please check https://developer.apple.com/videos/play/wwdc2024/10161/?time=927

@freedomtan
Contributor Author

Per-op profiling is actually possible on iOS 17.4+ / macOS 14.4+. I wrote a little command-line program and tested it on my MacBook Pro M1; see https://github.com/freedomtan/coreml_modelc_profling

@rwightman

FWIW, there are still no official weights from the paper authors, but I've trained a number of PyTorch-native MobileNetV4 models and made them available in timm. The conv-medium runs quite nicely on CPU w/o much extra optimization.
https://github.com/huggingface/pytorch-image-models?tab=readme-ov-file#june-12-2024

@freedomtan
Contributor Author

> FWIW, there are still no official weights from the paper authors, but I've trained a number of PyTorch-native MobileNetV4 models and made them available in timm. The conv-medium runs quite nicely on CPU w/o much extra optimization. https://github.com/huggingface/pytorch-image-models?tab=readme-ov-file#june-12-2024

@rwightman: FYI, thanks to @colbybanbury, one of the co-authors of the paper, we did have a MobileNetV4-Conv-Large saved_model and tflites; see https://github.com/mlcommons/mobile_open/tree/main/vision/mobilenetV4

@freedomtan
Contributor Author

freedomtan commented Jun 14, 2024

@RSMNYS pip install git+https://github.com/huggingface/pytorch-image-models.git then

import timm
import torch
import coremltools as ct

torch_model = timm.create_model("hf-hub:timm/mobilenetv4_conv_large.e600_r384_in1k", pretrained=True)
torch_model.eval()

# Trace the model with random data.
example_input = torch.rand(1, 3, 384, 384) 
traced_model = torch.jit.trace(torch_model, example_input)
out = traced_model(example_input)

model = ct.convert(
    traced_model,
    convert_to="mlprogram",
    inputs=[ct.TensorType(shape=example_input.shape)]
)

model.save("mobilenetv4.mlpackage")

This model takes around 3.10 ms (> 300 qps) on my iPhone 13.
iPhone 14 Pro: 2.29 ms (436 qps).

These match what @colbybanbury and others said in the paper. Please try to see if we can get the same performance with the TF saved_model.

Thanks @rwightman
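As a starting point for that TF comparison, a minimal sketch of converting the saved_model the same way (the directory path below is a placeholder; coremltools accepts a TF2 SavedModel directory directly):

```python
import coremltools as ct

# Placeholder path to the MobileNetV4-Conv-Large SavedModel from mlcommons/mobile_open.
saved_model_dir = "vision/mobilenetV4/saved_model"

model = ct.convert(
    saved_model_dir,            # a TF2 SavedModel directory can be passed directly
    source="tensorflow",
    convert_to="mlprogram",
    inputs=[ct.TensorType(shape=(1, 384, 384, 3))],  # the TF model is NHWC
)
model.save("mobilenetv4_from_tf.mlpackage")
```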

@freedomtan
Contributor Author

freedomtan commented Jun 14, 2024

@RSMNYS and @anhappdev
According to coremltools 8.0b1 doc on quantization, it's possible to create a calibrated quantized A8W8 PTQ model from an existing Core ML model.

I used random data as calibration data. Then I got (unit: ms):

| device | fp32 | quantized a8w8 |
| --- | --- | --- |
| iPhone 13 | 3.10 | 2.23 |
| iPhone 14 Pro | 2.29 | 1.83 |
| iPhone 15 Pro | 2.24 | 1.38 |

Maybe we can use "real" calibration data to check if quantized int8 models could meet accuracy thresholds.
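A sketch of that flow, following the coremltools 8.0b1 quantization docs. The activation-quantization API lives in an experimental module there, so the exact import paths and names may differ between betas; the input name and random calibration samples below are placeholders:

```python
import numpy as np
import coremltools as ct
import coremltools.optimize.coreml as cto
from coremltools.optimize.coreml import experimental as cto_exp

mlmodel = ct.models.MLModel("mobilenetv4.mlpackage")

# Calibration samples keyed by the model's input name (random here; swap in real
# preprocessed ImageNet images to see whether accuracy holds up).
calibration_data = [
    {"x": np.random.rand(1, 3, 384, 384).astype(np.float32)} for _ in range(32)
]

# 1) Quantize activations to 8 bits using the calibration data.
act_cfg = cto.OptimizationConfig(
    global_config=cto_exp.OpActivationLinearQuantizerConfig(mode="linear_symmetric")
)
mlmodel_a8 = cto_exp.linear_quantize_activations(mlmodel, act_cfg, calibration_data)

# 2) Quantize weights to 8 bits (data-free), giving an A8W8 model.
wt_cfg = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
)
mlmodel_a8w8 = cto.linear_quantize_weights(mlmodel_a8, wt_cfg)
mlmodel_a8w8.save("mobilenetv4_a8w8.mlpackage")
```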

@anhappdev
Collaborator

> Maybe we can use "real" calibration data to check if quantized int8 models could meet accuracy thresholds.

I will try to do that.

@rwightman

@freedomtan Good to hear that. For quantization, some weights quantize 'better' (less performance drop) than others; the training hparams have an impact. I'd be curious to know how the timm weights I've trained so far fare in that regard.

@RSMNYS
Contributor

RSMNYS commented Jun 17, 2024

@RSMNYS pip install git+https://github.com/huggingface/pytorch-image-models.git then

import timm
import torch
import coremltools as ct

torch_model = timm.create_model("hf-hub:timm/mobilenetv4_conv_large.e600_r384_in1k", pretrained=True)
torch_model.eval()

# Trace the model with random data.
example_input = torch.rand(1, 3, 384, 384) 
traced_model = torch.jit.trace(torch_model, example_input)
out = traced_model(example_input)

model = ct.convert(
    traced_model,
    convert_to="mlprogram",
    inputs=[ct.TensorType(shape=example_input.shape)]
)

model.save("mobilenetv4.mlpackage")

This model takes around 3.10 ms (> 300 qps) on my iPhone 13. iPhone 14 Pro: 2.29 ms (436 qps).

These match what @colbybanbury and others said in the paper. Please try to see if we can get the same performance with the TF saved_model.

Thanks @rwightman

@freedomtan can you please share the versions of coremltools, torch, and tensorflow? So far I'm getting the same 300 qps on my iPhone 14 Pro with iOS 18 beta.

@freedomtan
Contributor Author

coremltools: 7.2
PyTorch: 2.2.2
iPhone 13 and iPhone 14 Pro running iOS 17.5.1.

@freedomtan
Contributor Author

@freedomtan to share the models he converted.

@freedomtan
Contributor Author

> @freedomtan to share the models he converted.

@RSMNYS, https://drive.google.com/drive/folders/1rR7SsqO2ZfVI7whn8ky1biuZRIB-5AMC?usp=sharing
@RSMNYS
Contributor

RSMNYS commented Jun 18, 2024

> @freedomtan to share the models he converted.

> @RSMNYS, https://drive.google.com/drive/folders/1rR7SsqO2ZfVI7whn8ky1biuZRIB-5AMC?usp=sharing

Thanks for sharing. So only with the int8 model do I get 373 qps on iPhone 14 Pro with iOS 18 beta. The other model still gets <= 300 qps.

@freedomtan
Contributor Author

> > @freedomtan to share the models he converted.
>
> > @RSMNYS, https://drive.google.com/drive/folders/1rR7SsqO2ZfVI7whn8ky1biuZRIB-5AMC?usp=sharing
>
> Thanks for sharing. So only with the int8 model do I get 373 qps on iPhone 14 Pro with iOS 18 beta. The other model still gets <= 300 qps.

@RSMNYS that's hard to believe. Did you fully charge the phone and avoid thermal throttling? How do you test it?
Does the Xcode performance tab show similar results?

@anhappdev Do you have any results?

@freedomtan
Contributor Author

I upgraded the iPhone 14 Pro I tested to iOS 18. I still got similar results in Xcode 16 beta 1.

@RSMNYS
Contributor

RSMNYS commented Jun 20, 2024

> I upgraded the iPhone 14 Pro I tested to iOS 18. I still got similar results in Xcode 16 beta 1.

I'm using Xcode 15.4. Will upgrade today and see. Thanks

@RSMNYS
Contributor

RSMNYS commented Jun 20, 2024

Xcode 16 beta, built the app with iOS SDK 18, using your model. The maximum I get is 300 qps for Image Classification V2. I'm not able to run this model in Xcode either; I get an error: the data could not be read because it is missing.

@freedomtan
Contributor Author

> Xcode 16 beta, built the app with iOS SDK 18, using your model. The maximum I get is 300 qps for Image Classification V2. I'm not able to run this model in Xcode either; I get an error: the data could not be read because it is missing.

Most likely you didn't use Xcode 16 beta 1 to run the profiling.

@freedomtan
Contributor Author

> Xcode 16 beta, built the app with iOS SDK 18, using your model. The maximum I get is 300 qps for Image Classification V2. I'm not able to run this model in Xcode either; I get an error: the data could not be read because it is missing.

@RSMNYS I downloaded the models I shared and tested them with an iPhone 14 Pro + iOS 18.0 beta. I got the same results.
Please make sure that you use the correct model and the correct environment when testing the model.

I use an iPhone 14 Pro w/ iOS 18 (actually 17.5.1 is fine). To run model performance profiling on iOS 18 devices, you need Xcode 16 beta.

@anhappdev
Collaborator

anhappdev commented Jun 22, 2024

I added a permute layer to the PyTorch model to have the expected NHWC input. The accuracy is plausible, but the latency increased from 2.30 ms to 3.99 ms (iPhone 14 Pro, iOS 17.5.1).

mobilenetv4_float32.mlpackage

import timm
import torch
import coremltools as ct

# Load the pretrained model
torch_model = timm.create_model("hf-hub:timm/mobilenetv4_conv_large.e600_r384_in1k", pretrained=True)
torch_model.eval()

# Inspect the model
print("num_classes", torch_model.num_classes)
print("data_config", timm.data.resolve_model_data_config(torch_model))

# Define a wrapper so the exported model takes NHWC input (permuted to NCHW internally)
class WrappedModel(torch.nn.Module):
    def __init__(self, model):
        super(WrappedModel, self).__init__()
        self.model = model

    def forward(self, x):
        # Permute from NHWC to NCHW
        x = x.permute(0, 3, 1, 2)
        x = self.model(x)
        return x

wrapped_model = WrappedModel(torch_model)
wrapped_model.eval()

# Trace the wrapped model with random data
example_input = torch.rand(1, 384, 384, 3)
traced_model = torch.jit.trace(wrapped_model, example_input)
out = traced_model(example_input)

# Convert the traced model to CoreML
ml_model = ct.convert(
    traced_model,
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="images", shape=(1, 384, 384, 3))],
    outputs=[ct.TensorType(name="Softmax")],
)

ml_model.short_description = "hf-hub:timm/mobilenetv4_conv_large.e600_r384_in1k"

# Save the CoreML model
ml_model.save("mobilenetv4.mlpackage")

@freedomtan
Contributor Author

> I added a permute layer to the PyTorch model to have the expected NHWC input. The accuracy is plausible, but the latency increased from 2.30 ms to 3.99 ms (iPhone 14 Pro, iOS 17.5.1).
>
> mobilenetv4_float32.mlpackage

2.30 ms is what I expected. That adding a transpose (permute in PyTorch is compiled/translated into a MIL transpose op) slows down the other ops is a curious case. Models converted from TensorFlow usually come with some transpose op(s). Maybe removing the leading transpose op could increase inference speed?

@RSMNYS please check if you can reproduce @anhappdev's results.

@RSMNYS
Contributor

RSMNYS commented Jun 24, 2024

@freedomtan @anhappdev With Anh's model I get 245.8 qps for performance, and the accuracy is 82.43%.

@freedomtan
Contributor Author

> @freedomtan @anhappdev With Anh's model I get 245.8 qps for performance, and the accuracy is 82.43%.

@RSMNYS How about the original (2.30 ms) one?

@freedomtan
Contributor Author

freedomtan commented Jun 25, 2024

I ran the two models (w/ and w/o the leading transpose op) on an iPhone 14 Pro running iOS 18 and got numbers close to what @anhappdev reported. The permutation/transpose op is an interesting case. If the transpose op is the problem (e.g., if removing the leading transpose from the TF-based MobileNet V4 model we used helps), then we can simply do the NHWC -> NCHW transformation in the preprocessing stage (see https://github.com/mlcommons/mobile_app_open/blob/master/flutter/cpp/datasets/imagenet.cc#L97-L99).
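A minimal sketch of that preprocessing-side transform (plain numpy; the 384x384 shape follows the models in this thread), so the converted model can keep an NCHW input without an in-graph transpose:

```python
import numpy as np

def nhwc_to_nchw(batch_nhwc: np.ndarray) -> np.ndarray:
    """Convert a preprocessed NHWC batch, e.g. (1, 384, 384, 3), to the
    NCHW layout, e.g. (1, 3, 384, 384), expected by the PyTorch-derived model."""
    return np.ascontiguousarray(batch_nhwc.transpose(0, 3, 1, 2))

# Example with a dummy preprocessed batch.
batch = np.random.rand(1, 384, 384, 3).astype(np.float32)
print(nhwc_to_nchw(batch).shape)  # (1, 3, 384, 384)
```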

@anhappdev
Collaborator

I compared the latency of each op and saw that the transpose op itself takes only 59 microseconds, but every other op (conv, relu, ...) also takes longer compared to the version without the transpose op. I first guessed that it has something to do with the memory layout, but adding .contiguous() after .permute() has no effect.

@freedomtan
Contributor Author

> I compared the latency of each op and saw that the transpose op itself takes only 59 microseconds, but every other op (conv, relu, ...) also takes longer compared to the version without the transpose op. I first guessed that it has something to do with the memory layout, but adding .contiguous() after .permute() has no effect.

Yes, other ops are slowed down. That's why it's intriguing.

@freedomtan
Contributor Author

@RSMNYS will try to remove the leading transpose op to see if we can get performance numbers matching the PyTorch model.

@RSMNYS
Contributor

RSMNYS commented Jul 1, 2024

Hi guys! So here are the 2 graphs for the models (CoreML) with and without the transpose layer:
(screenshot: Screenshot 2024-07-02 at 00 40 40)
(screenshot: Screenshot 2024-07-02 at 00 41 05)
After removing the leading transpose layer I get 363 qps. Here is the link to the model if you want to try it: https://www.dropbox.com/s/o0l2xic7en2nmhb/MobilenetV4_Large_no_transpose.mlpackage.zip?dl=1

@freedomtan
Contributor Author

@RSMNYS to check if we can reduce the inference latency of the MobileBERT model.

@anhappdev
Collaborator

anhappdev commented Jul 10, 2024

Quantization

Here is the Python script to convert, quantize and test the model:
mobilenetv4_pytorch.py

| Model | Latency | Accuracy | Note |
| --- | --- | --- | --- |
| mobilenetv4_fp32.mlpackage | 2.31 ms | 82.11% | |
| mobilenetv4_w8.mlpackage | 2.16 ms | 78.69% | |
| mobilenetv4_w8a8.mlpackage | 233.55 ms | 81.14% | Calibration with ImageNet data. Almost all the ops are run on CPU, some on GPU. |
| Freedom's fp32 model | 2.31 ms | 82.11% | |
| Freedom's int8 model | 1.96 ms | 63.96% | Calibration with random data. All ops are run on ANE. |
| RSMNYS's no transpose model | - | - | Not able to run because of wrong input shape. |

Performance is tested with Xcode 16.0b1 on an iPhone 14 Pro (iOS 17.5.1).
I don't have a device with the latest chipset (A17 Pro, M4) to test the increased performance expected from a W8A8 model.

Accuracy is tested on a MacBook with the Python script above.
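The attached mobilenetv4_pytorch.py is not reproduced here; a minimal sketch of such a MacBook-side accuracy check (file layout, input name, and preprocessing below are placeholders and must match the model's actual data config):

```python
import numpy as np
from PIL import Image
import coremltools as ct

model = ct.models.MLModel("mobilenetv4_fp32.mlpackage")

# Hypothetical layout: labels and image paths in matching order.
labels = [int(line) for line in open("imagenet_val_labels.txt")]
image_paths = sorted(open("imagenet_val_files.txt").read().split())

correct = 0
for path, label in zip(image_paths, labels):
    img = Image.open(path).convert("RGB").resize((384, 384))
    # Illustrative preprocessing only; the real script must match the model's
    # expected resize/crop/normalization (e.g. timm's data config).
    x = (np.asarray(img, dtype=np.float32) / 255.0).transpose(2, 0, 1)[None]
    out = model.predict({"x": x})                 # input name is a placeholder
    pred = int(np.argmax(next(iter(out.values()))))
    correct += int(pred == label)

print(f"top-1 accuracy: {correct / len(labels):.2%}")
```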

@anhappdev
Collaborator

I added a script to test the accuracy on a MacBook. The comment above is updated with the accuracy numbers.

@freedomtan
Contributor Author

> Quantization
>
> Here is the Python script to convert, quantize and test the model: mobilenetv4_pytorch.py
>
> | Model | Latency | Accuracy | Note |
> | --- | --- | --- | --- |
> | mobilenetv4_fp32.mlpackage | 2.31 ms | 82.11% | |
> | mobilenetv4_w8.mlpackage | 2.16 ms | 78.69% | |
> | mobilenetv4_w8a8.mlpackage | 233.55 ms | 81.14% | Calibration with ImageNet data. Almost all the ops are run on CPU, some on GPU. |
> | Freedom's fp32 model | 2.31 ms | 82.11% | |
> | Freedom's int8 model | 1.96 ms | 63.96% | Calibration with random data. All ops are run on ANE. |
> | RSMNYS's no transpose model | - | - | Not able to run because of wrong input shape. |
>
> Performance is tested with Xcode 16.0b1 on an iPhone 14 Pro (iOS 17.5.1). I don't have a device with the latest chipset (A17 Pro, M4) to test the increased performance expected from a W8A8 model.
>
> Accuracy is tested on a MacBook with the Python script above.

@anhappdev I cannot download your models; it says something like "No preview available. File is in owner's trash", and I cannot find a way to download them.

81.14% looks good enough. I really don't understand why your a8w8 model is on CPU only :-(

@anhappdev
Collaborator

anhappdev commented Jul 11, 2024

@freedomtan I updated the link. You can download the models from here:
https://drive.google.com/drive/folders/1-Mloub0e41mkYs09i9tQZl95dTnGdZ6f?usp=sharing

I am not sure why the w8a8 model I converted runs on CPU. Can you share your conversion script?

@freedomtan
Contributor Author

@anhappdev I got the expected numbers with your w8a8 one (Xcode 16 beta 2 / iOS 18 beta 3):

| model | iPhone 14 Pro (iOS 18 beta 3), ms | iPhone 15 Pro (iOS 18 beta 3), ms |
| --- | --- | --- |
| mobilenetv4_w8a8.mlpackage | 1.85 | 1.45 |
