MIXKD: TOWARDS EFFICIENT DISTILLATION OF LARGE-SCALE LANGUAGE MODELS #14

izhx opened this issue Jun 17, 2021 · 1 comment
izhx commented Jun 17, 2021

Because task-specific data is limited, it is hard to distill a good student model, so the paper uses mixup to augment the data: the teacher predicts on the interpolated inputs and the student learns from those predictions (loss_3, with a weighting hyperparameter), the student also directly learns the interpolated labels corresponding to the interpolated inputs (loss_2, with a weighting hyperparameter), and it learns from the original data as usual (loss_1, no coefficient). Some theoretical analysis is provided, and experiments on GLUE show the method is effective.
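A rough sketch of the overall objective described above, in my own notation (not copied from the paper): α and β stand in for the two weighting hyperparameters, and f_S, f_T denote the student and teacher.

```latex
% Sketch of the training objective in my own notation (not the paper's exact equations).
% (x_i, y_i) and (x_j, y_j) are a pair of training examples; lambda is the mixup coefficient.
\tilde{x} = \lambda x_i + (1 - \lambda)\, x_j, \qquad
\tilde{y} = \lambda y_i + (1 - \lambda)\, y_j, \qquad
\lambda \sim \mathrm{Beta}(\alpha_{\mathrm{mix}}, \alpha_{\mathrm{mix}})

\mathcal{L} =
    \mathcal{L}_{\mathrm{CE}}\big(f_S(x),\, y\big)                                  % loss_1: original data
  + \alpha\, \mathcal{L}_{\mathrm{CE}}\big(f_S(\tilde{x}),\, \tilde{y}\big)         % loss_2: interpolated labels
  + \beta\,  \mathcal{L}_{\mathrm{KD}}\big(f_S(\tilde{x}),\, f_T(\tilde{x})\big)    % loss_3: match the teacher on interpolated inputs
```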


Info

  • Main authors: Kevin J Liang, Weituo Hao (co-first authors)
  • Affiliations: Duke University & Facebook AI; Duke University
  • Paper link

1 New things learned:

  1. When there is too little data for knowledge distillation, the student has few opportunities to learn from the teacher.

    While knowledge distillation can be a powerful technique, if the size of the available data is small, then the student has only limited opportunities to learn from the teacher. This may make it much harder for knowledge distillation to close the gap between student and teacher model performance.

    ...

    Instead, we propose using the augmented samples to further query the teacher model, whose large size often allows it to learn more powerful features.

  2. Mixup makes decision boundaries smoother and improves generalization.

    Mixup’s vicinal risk minimization tends to result in smoother decision boundaries and better generalization, while also being cheaper to compute than methods such as backtranslation. Mixup was initially proposed for continuous data, where interpolations between data points remain in-domain; its efficacy was demonstrated primarily on image data, but examples in speech recognition and tabular data were also shown to demonstrate generality.

  3. Interpolating text representations can be viewed as a special case of Manifold Mixup (a minimal code sketch follows this list).
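A minimal sketch of what this interpolation looks like when applied to hidden text representations. This is my own illustration assuming a PyTorch encoder; the function name mix_hidden and the Beta parameter are placeholders, not the paper's code.

```python
# Minimal sketch (not the authors' code) of interpolating text representations,
# i.e. a TMix / Manifold-Mixup-style operation on a pair of encoder outputs.
import torch


def mix_hidden(h_i: torch.Tensor, h_j: torch.Tensor, alpha: float = 0.4):
    """Interpolate two batches of hidden representations with lambda ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    h_mix = lam * h_i + (1.0 - lam) * h_j
    return h_mix, lam


# Usage: mix a batch of sentence representations with a shuffled copy of itself,
# then feed h_mix to the classifier head and combine the two labels with lam.
h = torch.randn(8, 768)            # stand-in for [CLS] / pooled encoder outputs
perm = torch.randperm(h.size(0))
h_mix, lam = mix_hidden(h, h[perm])
```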

2 Knowledge gained from the Related Work

  1. Model Compression:
    1. Task-specific knowledge distillation: distilling a model for a specific task. Pre-training the student model on task-specific data substantially improves performance [1].
    2. Patient knowledge distillation (PKD): besides matching the teacher's logits, the student also learns the teacher's intermediate-layer representations [2].
  2. Data Augmentation in NLP:
    1. EDA [3]: rule-based augmentation operations such as synonym replacement and word insertion, swap, and deletion (see the sketch after this list).
    2. Back-translation: translate into another language and back again.
    3. Paraphrase generation [4].
    4. MixText [5]: a semi-supervised text classification model that proposes TMix, which is essentially interpolation of text representations.
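To make the EDA operations concrete, here is a small illustration of two of the four operations (random swap and random deletion). It is my own simplification assuming whitespace tokenization, not the reference implementation of [3].

```python
# Toy illustration (not the EDA reference implementation) of two rule-based operations:
# random swap and random deletion on a whitespace-tokenized sentence.
import random


def random_swap(words: list[str], n_swaps: int = 1) -> list[str]:
    """Swap two randomly chosen positions, n_swaps times."""
    words = words.copy()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words: list[str], p: float = 0.1) -> list[str]:
    """Drop each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]


print(random_swap("the movie was surprisingly good".split()))
print(random_deletion("the movie was surprisingly good".split()))
```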

3 Evaluation tasks (briefly described if unfamiliar)

BERT-base is distilled on several GLUE datasets, and MixKD beats all baselines.

The less data there is, the larger the gains (a characteristic of mixup).

t-SNE visualization of the [CLS] representations on SST-2: the plain model (a) cannot separate the classes, while the MixKD-trained model (b) separates them fairly clearly.
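A rough sketch of how such a plot can be produced, using Hugging Face transformers and scikit-learn as stand-ins; this is my own code, not the authors' script, and the sentences are toy placeholders for SST-2 examples.

```python
# Rough sketch (not the authors' script): project [CLS] representations of a few
# SST-2-style sentences to 2-D with t-SNE and color them by label.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

sentences = [
    "a gorgeous, witty film",
    "one of the year's best",
    "a dull and lifeless mess",
    "painfully bad acting",
]  # toy stand-in for SST-2 sentences
labels = [1, 1, 0, 0]

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

with torch.no_grad():
    batch = tok(sentences, padding=True, return_tensors="pt")
    cls_vecs = model(**batch).last_hidden_state[:, 0]   # [CLS] vectors, shape (N, 768)

# perplexity must be smaller than the number of samples; keep it tiny for this toy example
xy = TSNE(n_components=2, perplexity=2).fit_transform(cls_vecs.numpy())
plt.scatter(xy[:, 0], xy[:, 1], c=labels)
plt.savefig("cls_tsne.png")
```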


4 Other tasks that could be tried, within my knowledge

Other distillation tasks.

5 Good sentences

Collected sentence by sentence: excerpts that read well and might come in handy later.

  1. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (both memory and computation) platforms.
  2. However, large-scale neural network systems are prone to memorize training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly.
  3. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability.
  4. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher’s behavior on the linear interpolation of example pairs as well.
  5. In settings where computation may be limited (e.g. mobile, edge devices), such characteristics may preclude such powerful models from deployment entirely.
  6. this is akin to a student learning more from a teacher by asking more questions to further probe the teacher’s answers and thoughts.
  7. but may be too bulky or slow for certain applications.
  8. Reducing the number of layers makes such models significantly more portable and efficient, but at the expense of accuracy.
  9. Intuitively, MixKD allows the student model additional queries to the teacher model, granting it more opportunities to absorb the latter’s richer representations.

References

  1. Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read Students Learn Better: On the Importance of Pre-training Compact Models. arXiv preprint arXiv:1908.08962, 2019.
  2. Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient Knowledge Distillation for BERT Model Compression. arXiv preprint arXiv:1908.09355, 2019a.
  3. Jason Wei and Kai Zou. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. arXiv preprint arXiv:1901.11196, 2019.
  4. Ashutosh Kumar, Satwik Bhattamishra, Manik Bhandari, and Partha Talukdar. Submodular Optimization-based Diverse Paraphrasing and Its Effectiveness in Data Augmentation. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
  5. Jiaao Chen, Zichao Yang, and Diyi Yang. MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification. Association for Computational Linguistics, July 2020.
izhx commented Jun 17, 2021

The L_MSE in Equation (8) of the paper is probably written wrong; it should be a cross-entropy.
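For reference, the distinction in my own notation (z are logits, p are softmax outputs); this only spells out the two forms and is not a quote from the paper.

```latex
% MSE on logits vs. cross-entropy against the teacher's soft labels (my notation).
\mathcal{L}_{\mathrm{MSE}} = \big\lVert z_S(x) - z_T(x) \big\rVert_2^2,
\qquad
\mathcal{L}_{\mathrm{CE}}  = - \sum_{c} p_T^{(c)}(x)\, \log p_S^{(c)}(x)
```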
