
transfer-tune option tries only one-round for each task #23

Open
iasakura opened this issue Oct 11, 2021 · 2 comments
@iasakura
Hello everyone.

We are interested in optimizing vision DNN models for Jetson devices, so we tried to use TenSet dataset for optimizing DNN models on Jetson Xavier NX.
With some modification to use auto_scheduler.RPCRunner in tune_network.py, we were able to tune networks on Jetson Xavier NX.
We evaluated the models available in tune_network.py with --n-trials 10000 and found that Ansor with the TenSet pretrained model finds good programs earlier than Ansor without it during the first few thousand trials, and its final results are slightly better.

We expected that enabling the --transfer-tune option would improve the results further, because --transfer-tune appears to refine the cost model using results measured on the real device. However, with --transfer-tune enabled, the tuned programs were slower for all models. The following are the results:

ResNet 18

| compiler | execution time | # trials |
| --- | --- | --- |
| w/o transfer tune | 7.21 | 10044 |
| w/ transfer tune | 10.24 | 1692 |

ResNet 50

| compiler | execution time | # trials |
| --- | --- | --- |
| w/o transfer tune | 15.56 | 10044 |
| w/ transfer tune | 22.62 | 1692 |

MobileNet v2

| compiler | execution time | # trials |
| --- | --- | --- |
| w/o transfer tune | 2.49 | 10048 |
| w/ transfer tune | 2.9 | 2048 |

MobileNet v3

| compiler | execution time | # trials |
| --- | --- | --- |
| w/o transfer tune | 3.06 | 10048 |
| w/ transfer tune | 3.53 | 3328 |

Wide ResNet 50

| compiler | execution time | # trials |
| --- | --- | --- |
| w/o transfer tune | 35.32 | 10044 |
| w/ transfer tune | 48.8 | 1692 |

DenseNet 121

| compiler | execution time | # trials |
| --- | --- | --- |
| w/o transfer tune | 15.61 | 10044 |
| w/ transfer tune | 17.88 | 4604 |

Inception v3

| compiler | execution time | # trials |
| --- | --- | --- |
| w/o transfer tune | 29.08 | 10015 |
| w/ transfer tune | 45.46 | 3487 |

We used the following commands for the evaluation:

```shell
n_trials=10000
target="cuda -keys=cuda,gpu -arch=sm_72 -max_num_threads=1024 -max_threads_per_block=1024 -registers_per_block=65536 -shared_memory_per_block=49152 -thread_warp_size=32"
target_host="llvm -keys=arm_cpu -mtriple=aarch64-linux-gnu -mattr=+neon"
# w/o transfer tune
python3 tune_network.py --network ${model} --n-trials ${n_trials} --cost-model xgb-no-update --load-model xgb.pkl --target "$target" --target-host "$target_host"
# w/ transfer tune
python3 tune_network.py --network ${model} --n-trials ${n_trials} --cost-model xgb-no-update --transfer-tune --load-model xgb.pkl --target "$target" --target-host "$target_host"
```

To investigate the slower results of transfer tuning, we read the code related to the --transfer-tune option and found some seemingly strange points in its implementation:

  • It tunes each task for only one round, even when a much larger trial count is given. For ResNet 50, normal Ansor with the TenSet model runs 10044 trials, while transfer tune runs only 1692.
  • It uses the fine-tuned model only for the last half of the tasks; the first half is always tuned with the given pretrained model.
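The trial-count gap is consistent with a single round per task capped at a fixed per-round limit. A back-of-the-envelope sketch (the task count and per-round cap below are assumed illustrative values, not taken from our logs):

```python
# With one tuning round per task, the total number of trials is bounded by
# num_tasks * per_round, no matter how large --n-trials is.
n_trials = 10000   # requested budget (--n-trials)
num_tasks = 25     # assumed number of extracted tasks (hypothetical)
cap = 64           # assumed per-round measurement cap (hypothetical)

per_round = min(cap, n_trials // num_tasks)  # capped per-round budget
total = num_tasks * per_round                # one round per task

print(per_round, total)  # 64 1600 -- far below the requested 10000
```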

Could you please explain the intention behind this implementation, or how to improve the results of transfer tuning?

@iasakura iasakura changed the title transfer-tune option tries only 64 trials for each task transfer-tune option tries only one-round for each task Oct 12, 2021
@merrymercy
Collaborator

@ruochen99

@ruochen99
Collaborator

Thank you for bringing up this issue! Transfer learning is not a complete feature in our codebase yet. The purpose of the --transfer-tune option is mostly to test how useful transfer learning is for improving the cost model. We have only tested the effect of transfer learning with a small number of trials, so we ignored the later parts of this procedure. A quick fix for your problem could be changing this line

```python
self.num_measures_per_round = min(tune_option.num_measures_per_round, num_measure_trials // len(self.tasks))
```

into

```python
self.num_measures_per_round = num_measure_trials // len(self.tasks)
```

After this modification, the algorithm will collect measurement data on the first half of tasks, train a local model, and use it on the second half of tasks. However, I'm also uncertain about how well transfer learning would perform on a large number of trials.
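To illustrate the effect of this change, here is a hypothetical comparison of the per-task budget before and after the fix (the task count and per-round default below are assumed values for the sketch, not measured from the repo):

```python
num_measure_trials = 10000   # total budget (--n-trials)
num_tasks = 25               # assumed task count (hypothetical)
num_measures_per_round = 64  # assumed per-round default (hypothetical)

# Before: the single round is capped at num_measures_per_round.
before = min(num_measures_per_round, num_measure_trials // num_tasks)
# After the quick fix: the full per-task share is spent in that round.
after = num_measure_trials // num_tasks

print(before, after)  # 64 400
```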
