Sequence Prediction with Unlabeled Data by Reward Function Learning


Please cite:
title={Sequence Prediction with Unlabeled Data by Reward Function Learning.},
author={Wu, Lijun and Zhao, Li and Qin, Tao and Lai, Jianhuang and Liu, Tie-Yan},


Reinforcement learning (RL), which has been successfully applied to sequence prediction, introduces reward as a sequence-level supervision signal to evaluate the quality of a generated sequence. Existing RL approaches use the ground-truth sequence to define the reward, which limits the application of RL techniques to labeled data. Since labeled data is usually scarce and/or costly to collect, it is desirable to leverage large-scale unlabeled data. In this paper, we extend existing RL methods for sequence prediction to exploit unlabeled data. We propose to learn the reward function from labeled data and use the predicted reward as a pseudo reward for unlabeled data, so that the model can also learn from unlabeled data. To get a good pseudo reward on unlabeled data, we propose an RNN-based reward network with an attention mechanism, trained with a purposely biased data distribution. Experiments show that the pseudo reward can provide good supervision and guide the learning process on unlabeled data. We observe significant improvements on both neural machine translation and text summarization.
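
The overall idea can be illustrated with a minimal toy sketch. It is not the implementation from the paper: the RNN reward network with attention is replaced by a two-weight linear regressor, the task is a made-up "copy the source" problem, and every name here (`features`, `fit_reward_model`, `predict_reward`, `sample`, the one-parameter policy `theta`) is hypothetical. It only shows the three steps: compute true rewards on labeled data, fit a reward predictor that does not need the reference, and use its output as the pseudo reward in a REINFORCE-style update on unlabeled sources.

```python
import math
import random

def true_reward(hyp, ref):
    """Reward that needs the ground truth: fraction of matching tokens."""
    return sum(h == r for h, r in zip(hyp, ref)) / len(ref)

def features(src, hyp):
    """Hypothetical reference-free features of a (source, hypothesis) pair."""
    agree = sum(h == s for h, s in zip(hyp, src)) / len(src)
    return [1.0, agree]  # bias term + source/hypothesis agreement

def predict_reward(w, src, hyp):
    """Pseudo reward: linear stand-in for the learned reward network."""
    return sum(wi * xi for wi, xi in zip(w, features(src, hyp)))

def fit_reward_model(triples, lr=0.5, epochs=200):
    """Fit the reward predictor by SGD on squared error.
    triples: (src, hyp, reward) built from labeled data."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for src, hyp, reward in triples:
            x = features(src, hyp)
            err = predict_reward(w, src, hyp) - reward
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def sample(theta, src, rng):
    """Toy policy: copy each source token with probability sigmoid(theta),
    otherwise emit a different token. Returns the sample and d log p / d theta."""
    p = 1 / (1 + math.exp(-theta))
    copied = [rng.random() < p for _ in src]
    hyp = [s if c else (s + 1) % 10 for s, c in zip(src, copied)]
    grad_logp = sum((1 - p) if c else -p for c in copied)
    return hyp, grad_logp

# Step 1: labeled data (toy copy task, so the reference equals the source).
labeled = [([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),
           ([7, 0, 9, 3, 6], [7, 0, 9, 3, 6])]
triples = []
for src, ref in labeled:
    for j in range(len(src) + 1):  # hypotheses with the first j tokens correct
        hyp = src[:j] + [(s + 1) % 10 for s in src[j:]]
        triples.append((src, hyp, true_reward(hyp, ref)))

# Step 2: learn the reward function from labeled data only.
w = fit_reward_model(triples)

# Step 3: REINFORCE on unlabeled sources, using the predicted reward.
unlabeled = [[4, 4, 8, 1, 2], [9, 5, 0, 0, 3]]
rng = random.Random(0)
theta, baseline = 0.0, 0.5  # constant baseline, chosen for simplicity
for step in range(200):
    src = unlabeled[step % len(unlabeled)]
    hyp, grad_logp = sample(theta, src, rng)
    r_hat = predict_reward(w, src, hyp)  # no reference needed here
    theta += 0.5 * (r_hat - baseline) * grad_logp
```

In this toy setting the true reward happens to be an exact linear function of the features, so the regressor recovers it almost perfectly. On real tasks the predicted reward is only an approximation, which is why the quality of the reward network matters.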