Sequence Generation with Mixed Representations

Published:

Please cite:
@inproceedings{wu2020seqgenmix,
title={Sequence Generation with Mixed Representations},
author={Wu, Lijun and Xie, Shufang and Xia, Yingce and Fan, Yang and Qin, Tao and Zhou, Wengang and Li, Houqiang and Liu, Tie-Yan},
booktitle={International Conference on Machine Learning},
year={2020}
}

Abstract

Tokenization is the first step of many natural language processing (NLP) tasks and plays an important role in neural NLP models. Tokenization methods such as byte-pair encoding and SentencePiece, which can greatly reduce vocabulary size and handle out-of-vocabulary words, have been shown to be effective and are widely adopted for sequence generation tasks. While various tokenization methods exist, there is no consensus on which one is best. In this work, we propose to leverage mixed representations from different tokenizers for sequence generation tasks, which can take advantage of the strengths of each individual tokenization method. Specifically, we introduce a new model architecture to incorporate mixed representations and a co-teaching algorithm to better utilize the diversity of different tokenization methods. Our approach achieves significant improvements on neural machine translation tasks with six language pairs, as well as on an abstractive summarization task.
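To make the idea of "different tokenizers" concrete, here is a minimal sketch showing how one sentence yields two different token sequences. Everything in it is an assumption for illustration: `word_tokenize`, `subword_tokenize`, and `TOY_VOCAB` are hypothetical names, and the greedy longest-match segmenter is a simple stand-in, not byte-pair encoding or SentencePiece. It does not reproduce the paper's model architecture or co-teaching algorithm.

```python
# Toy illustration only: two hypothetical tokenizers applied to one sentence.
# The vocabulary and both functions are assumptions, not taken from the paper.

def word_tokenize(sentence):
    """Split on whitespace (a stand-in for a word-level tokenizer)."""
    return sentence.split()


def subword_tokenize(sentence, vocab):
    """Greedy longest-match segmentation against a fixed subword vocabulary.

    Pieces that continue a word are prefixed with '##'. A single character
    is emitted as a fallback when no longer vocabulary entry matches.
    """
    pieces = []
    for word in sentence.split():
        start = 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate until it is in the vocabulary
            # or only one character remains.
            while end > start + 1 and word[start:end] not in vocab:
                end -= 1
            piece = word[start:end]
            pieces.append(piece if start == 0 else "##" + piece)
            start = end
    return pieces


# Hypothetical subword vocabulary, chosen by hand for this example.
TOY_VOCAB = {"token", "ization", "help", "s", "model"}

sentence = "tokenization helps models"
word_view = word_tokenize(sentence)
subword_view = subword_tokenize(sentence, TOY_VOCAB)

print(word_view)     # ['tokenization', 'helps', 'models']
print(subword_view)  # ['token', '##ization', 'help', '##s', 'model', '##s']
```

The two lists describe the same sentence at different granularities. A mixed-representation model, as described in the abstract, would consume more than one such view rather than committing to a single tokenizer.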
