UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost


Please cite:
title={UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost},
author={Wu, Zhen and Wu, Lijun and Qi, Meng and Xia, Yingce and Xie, Shufang and Qin, Tao and Dai, Xinyu and Liu, Tie-Yan},
booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies,
Volume 1 (Long Papers)},


The Transformer architecture has achieved great success in a wide range of natural language processing tasks. The over-parameterization of the Transformer model has motivated much work on alleviating its overfitting to obtain better performance. Through some exploration, we find that simple techniques such as dropout can greatly boost model performance when carefully designed. Therefore, in this paper, we integrate different dropout techniques into the training of Transformer models. Specifically, we propose an approach named $\mathtt{UniDrop}$ that unites three different dropout techniques, from fine-grained to coarse-grained: feature dropout, structure dropout, and data dropout. Theoretically, we demonstrate that these three dropouts play different roles from a regularization perspective. Empirically, we conduct experiments on both neural machine translation and text classification benchmark datasets. Extensive results indicate that Transformer with $\mathtt{UniDrop}$ achieves an improvement of around $1.5$ BLEU on IWSLT14 translation tasks, and better classification accuracy even when using the strong pre-trained RoBERTa as the backbone.
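
As a rough illustration of the three granularities named above, the following minimal sketch applies dropout to individual features, to whole layers, and to whole input tokens in a toy forward pass. It is not the authors' implementation: the function names, the toy `embed` and `layers` callables, the placement of each dropout, and the dropout rates are all assumptions made for illustration only.

```python
import random


def feature_dropout(vec, p, rng):
    """Fine-grained: zero individual features with probability p (requires p < 1).

    Surviving features are rescaled by 1 / (1 - p) (inverted dropout).
    """
    if p <= 0.0:
        return list(vec)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else x * scale for x in vec]


def structure_dropout(layers, p, rng):
    """Coarser: skip each whole layer with probability p."""
    return [layer for layer in layers if rng.random() >= p]


def data_dropout(tokens, p, rng):
    """Coarsest: drop each input token with probability p.

    Assumption: at least one token is always kept so the input is never empty.
    """
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def forward(tokens, embed, layers, p_feat, p_struct, p_data, rng):
    """Toy training-time forward pass combining the three dropouts.

    `embed` maps a token to a list of floats; each layer maps a list of
    floats to a list of floats. Both are hypothetical stand-ins for real
    Transformer components.
    """
    tokens = data_dropout(tokens, p_data, rng)
    hidden = [embed(t) for t in tokens]
    for layer in structure_dropout(layers, p_struct, rng):
        hidden = [feature_dropout(layer(h), p_feat, rng) for h in hidden]
    return hidden
```

With all three rates set to zero, `forward` reduces to the plain deterministic pass; at inference time one would typically disable all three in the same way.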