Paper, Tags: #nlp
The dominant approach to subword segmentation: Byte Pair Encoding (BPE). It keeps the most frequent words intact and splits the rare ones into multiple tokens. BPE splits each word into a unique (deterministic) sequence of subwords, meaning that for each word a model observes only one segmentation, which may prevent it from learning the compositionality of words and from being robust to segmentation errors. The subwords into which rare words are segmented end up poorly understood.
We show that BPE itself incorporates the ability to produce multiple segmentations of the same word. It builds a vocabulary of subwords and a merge table, specifying which subwords have to be merged into a bigger subword, and the priority of the merges.
During segmentation, words are first split into sequences of characters, then the learned merge operations are applied to merge the characters into larger, known symbols, until no merge can be done.
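Below is a minimal sketch of this segmentation procedure, assuming the merge table is represented as a dict mapping a pair of symbols to its merge priority (the function and variable names are illustrative, not from the paper):

```python
# Sketch of standard BPE segmentation: split into characters, then greedily
# apply the highest-priority merge from the merge table until none applies.
# merge_table: dict mapping (left_symbol, right_symbol) -> priority (lower = learned earlier).
def bpe_segment(word, merge_table, end_of_word="</w>"):
    symbols = list(word) + [end_of_word]
    while True:
        # Find the applicable merge with the best (lowest) priority rank.
        best_pair, best_rank = None, None
        for i in range(len(symbols) - 1):
            rank = merge_table.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_pair, best_rank = (symbols[i], symbols[i + 1]), rank
        if best_pair is None:              # no merge can be done any more
            return symbols
        # Merge every occurrence of the chosen pair.
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best_pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
```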
BPE-Dropout stochastically corrupts the segmentation procedure of BPE, producing multiple segmentations within the same fixed BPE framework.
A natural way to fix this problem is to enable multiple segmentation candidates: subword regularization (Kudo, 2018), which samples segmentations on the fly and is not specific to the NMT architecture. But it's quite complicated.
BPE is originally a data compression technique in which the most common pair of consecutive bytes is replaced by a byte that doesn't occur within that data.
- aaabdaaabac
- ZabdZabac, Z=aa
- ZYdZYac, Z=aa, Y=ab
- XdXac, Z=aa, Y=ab, X=ZY
From Wikipedia; I don't think this is the way it's used in NLP.
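For illustration only, a tiny sketch reproducing the compression example above (the helper name and the pool of replacement symbols are assumptions; ties between equally frequent pairs may be broken differently than in the hand-worked example):

```python
from collections import Counter

# Toy sketch of the compression flavour of BPE: repeatedly replace the most
# common adjacent pair with a fresh single-character symbol and record it.
def compress(data, fresh_symbols="ZYXWV"):
    table = {}
    for symbol in fresh_symbols:
        pairs = Counter(zip(data, data[1:]))
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break                                  # no pair occurs more than once
        (a, b), _ = pairs.most_common(1)[0]
        table[symbol] = a + b
        data = data.replace(a + b, symbol)
    return data, table

# Compresses 'aaabdaaabac' down to 'XdXac'; the intermediate replacements may
# differ from the hand-worked example above when two pairs are equally frequent.
print(compress("aaabdaaabac"))
```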
To perform subword tokenization, BPE is slightly modified: frequently occurring subword pairs are merged together instead of being replaced by another byte. So the rare word 'athazagoraphobia' would be split into ['_ath', 'az', 'agor', 'aphobia'].
- First step: initialize the vocabulary
- Represent each word in the corpus as a sequence of characters plus the special end-of-word token
- Iteratively count character pairs in all tokens of the vocabulary
- Merge every occurrence of the most frequent pair and add the new character n-gram to the vocabulary
- Repeat the counting and merging steps until the desired number of merge operations is completed or the desired vocab size is achieved (a hyperparameter of the algorithm)
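A minimal sketch of the vocabulary-learning loop described in the steps above, assuming the corpus is given as a word-to-frequency mapping (learn_bpe and its signature are illustrative, not from the paper):

```python
from collections import Counter

# Sketch of BPE learning over a corpus given as word -> frequency.
# Each word starts as characters plus an end-of-word marker; the most frequent
# adjacent pair of tokens is merged repeatedly and recorded in the merge table.
def learn_bpe(word_freqs, num_merges, end_of_word="</w>"):
    words = {tuple(w) + (end_of_word,): f for w, f in word_freqs.items()}
    vocab = {sym for w in words for sym in w}
    merge_table = {}                                  # pair -> priority
    for rank in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merge_table[best] = rank
        vocab.add(best[0] + best[1])
        # Re-segment every word with the new merge applied.
        new_words = {}
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] = freq
        words = new_words
    return vocab, merge_table
```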
BPE enables the encoding of rare words with appropriate subword tokenization without introducing any 'unknown' tokens.
From paper:
- Initialize the token vocabulary with the character vocabulary, and the merge table with an empty table
- Represent each word as a sequence of tokens plus an end-of-word symbol
- Iteratively count all pairs of tokens and merge the most frequent pair into a new token; the token is added to the vocab, and the merge operation to the table
- Do this until the desired vocab size is reached.
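As a toy usage example of the hypothetical learn_bpe sketch above (the corpus and counts are made up for illustration, not taken from the paper):

```python
# Tiny made-up corpus, given as word -> frequency.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab, merge_table = learn_bpe(word_freqs, num_merges=10)
# Earliest merges are the most frequent character pairs, e.g. ('e', 's') here.
print(sorted(merge_table, key=merge_table.get)[:3])
```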
Exploits the innate ability of BPE to be stochastic. The merge table remains the same, but the segmentation procedure is altered: during segmentation, at each merge step some merges are randomly dropped with probability p. We use p=0.1 during training and p=0 during inference. #TODO why? For Chinese and Japanese, we use p=0.6 to match the increase in length of segmented sentences for other languages.
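A minimal sketch of the dropout-modified segmentation, following the description above: the same merge table and greedy loop as the earlier segmentation sketch, but each candidate merge is dropped with probability p (the function name is illustrative; the exact termination rule in the paper's pseudocode may differ slightly):

```python
import random

# Sketch of BPE-Dropout segmentation: same merge table as standard BPE, but at
# each step every candidate merge is dropped with probability p before the
# highest-priority surviving merge is applied. With p = 0 this reduces to
# standard BPE; the sketch stops once no candidate survives a step.
def bpe_dropout_segment(word, merge_table, p=0.1, end_of_word="</w>"):
    symbols = list(word) + [end_of_word]
    while True:
        candidates = []
        for i in range(len(symbols) - 1):
            rank = merge_table.get((symbols[i], symbols[i + 1]))
            if rank is not None and random.random() >= p:   # keep with prob 1 - p
                candidates.append((rank, i))
        if not candidates:
            return symbols
        _, i = min(candidates)             # highest-priority (lowest rank) merge
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]
```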
We hypothesize that exposing a model to different segmentations might result in better understanding of the whole words as well as their subword units.
Results indicate that using BPE-Dropout on the source side is more beneficial than using it on the target side. We speculate that it's more important for the model to understand the source sentence than to be exposed to different ways to generate the same target sentence.
The improvements with respect to normal BPE are consistent regardless of the vocabulary size. But we see that the effect from using BPE-Dropout vanishes as the corpus size gets bigger.
Sentences segmented with BPE-Dropout are longer. There's a danger that models trained with BPE-Dropout might use a more fine-grained segmentation at inference time and hence slow inference down.
- When using BPE, frequent sequences of characters rarely appear in a segmented text as individual tokens; they mostly appear as parts of bigger tokens. BPE-Dropout alleviates this issue
- Using BPE-Dropout leads to a better understanding of rare tokens
- Models trained with BPE-Dropout are more robust to misspelled input.
With standard BPE, only rare words are split into subwords. This forces frequent sequences of characters to mostly appear in a segmented text as parts of bigger tokens, and not as individual tokens. #TODO
Future work: adaptive dropout rates for different merges, and an in-depth analysis of other pathologies in learned token embeddings for different segmentations.