Paper, Tags: #nlp, #ner
We apply pre-trained LMs to low-resource NER for historic German, and we show that character-based pre-trained LMs cope well with low-resource datasets.
Our pre-trained character-based LMs improve upon classical CRF-based methods and biLSTMs, boosting F1 by up to 6%. Code is available on GitHub.
Traditionally, NER systems are built with CRFs; recent systems use biLSTM-CRFs and pre-trained word embeddings.
More recent approaches produce different representations for the same word depending on its context (Peters et al., 2018; Akbik et al., 2018; Devlin et al., 2018).
This paper builds on the work of Riedl and Padó, 2018, who showed how to build a German NER model with state-of-the-art performance for both contemporary and historical texts. For historical texts, they used transfer learning with labeled data from other high-resource domains (CoNLL-2003 or GermEval), and they showed that a biLSTM-CRF with word embeddings outperforms CRFs with hand-crafted features.
We use the same low-resource datasets for historic German. We use only unlabeled data via pre-trained LMs and word embeddings, and we also introduce a novel LM pre-training objective that uses only contemporary texts for training to achieve comparable state-of-the-art results on historical texts.
We use contextual string embeddings as proposed by Akbik et al., 2018, and train all NER models and pre-trained LMs with the Flair library (Akbik et al., 2018). We use FastText (Wikipedia and Crawl) as word embeddings.
The Flair library allows stacking different embedding types.
Contextualized string embeddings are trained with a biLSTM on two historic datasets; this process is called 'pre-training'. We add a CRF layer on top.
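A minimal sketch of this setup with the Flair API (the corpus path, LM checkpoint names and hyperparameters are placeholders, not the paper's exact configuration):

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# load a CoNLL-style NER corpus (column format: token, NER tag) -- path is a placeholder
corpus = ColumnCorpus('data/lft', {0: 'text', 1: 'ner'})
tag_dictionary = corpus.make_label_dictionary(label_type='ner')

# stack word embeddings with forward/backward contextual string embeddings;
# the historic LM checkpoint paths are hypothetical
embeddings = StackedEmbeddings([
    WordEmbeddings('de'),                                  # FastText (Wikipedia)
    WordEmbeddings('de-crawl'),                            # FastText (Common Crawl)
    FlairEmbeddings('resources/lm-historic-forward.pt'),   # pre-trained forward char LM
    FlairEmbeddings('resources/lm-historic-backward.pt'),  # pre-trained backward char LM
])

# biLSTM encoder with a CRF decoding layer on top
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type='ner',
                        use_crf=True)

ModelTrainer(tagger, corpus).train('resources/taggers/historic-ner', max_epochs=100)
```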
We use the same two datasets for Historic German as used by Riedl and Padó, 2018.
- The first corpus is the collection of Tyrolean periodicals and newspapers from LFT, which consists of ~87k tokens from 1926.
- The second corpus is a collection of Austrian newspapers from ONB, with ~35k tokens between 1710 and 1873.
The tagset includes locations, organizations, persons and miscellaneous. No MISC entities occur in the ONB dataset, and only a few are annotated in LFT. Both corpora pose some challenges:
- They are relatively small compared to CoNLL-2003 and GermEval
- They have a different language variety (German and Austrian)
- They include a high rate of OCR errors.
We use a) FastText embeddings trained on German Wikipedia articles, b) FastText embeddings trained on Common Crawl, and c) character embeddings (Lample et al., 2016b).
We use pre-trained FastText embeddings without subword information, since subword information could harm the performance of our system.
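A short sketch of this baseline embedding stack in Flair ('de' and 'de-crawl' are Flair's precomputed German FastText identifiers; treating them as equivalent to the vectors used in the paper is an assumption):

```python
from flair.embeddings import WordEmbeddings, CharacterEmbeddings, StackedEmbeddings

# word-embedding baseline: FastText (Wikipedia) + FastText (Common Crawl)
# + task-trained character embeddings (Lample et al., 2016)
baseline_embeddings = StackedEmbeddings([
    WordEmbeddings('de'),        # German FastText trained on Wikipedia
    WordEmbeddings('de-crawl'),  # German FastText trained on Common Crawl
    CharacterEmbeddings(),       # character features learned during NER training
])
```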
Combining pre-trained FastText embeddings for Wikipedia and Common Crawl leads to an F1 score of 72.50% on the LFT dataset. Adding character embeddings has a positive impact of 2% and yields 74.50%, which is higher than Riedl and Padó, 2018, who used transfer learning with more labeled data. The result on the ONB corpus is also higher than theirs.
We train contextualized string embeddings as proposed by Akbik et al., 2018, training the underlying LMs on two datasets from the Europeana collection of historical newspapers.
- The first corpus consists of articles from HHA, covering 741M tokens from 1888 to 1945.
- The second corpus consists of articles from WZ, covering 891M tokens from 1703 to 1875.
We chose these two corpora because they have a temporal overlap with the LFT corpus (1926) and the ONB corpus (1710-1873).
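A sketch of how such a character LM can be pre-trained with Flair's LanguageModelTrainer (the corpus path and all hyperparameters are illustrative assumptions, not the paper's values):

```python
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# default character dictionary shipped with Flair
dictionary = Dictionary.load('chars')

# train a forward character LM; repeat with is_forward_lm=False for the backward LM
is_forward_lm = True

# corpus folder (placeholder path) with a train/ split plus valid.txt and test.txt
corpus = TextCorpus('corpora/hha', dictionary, is_forward_lm, character_level=True)

language_model = LanguageModel(dictionary, is_forward_lm, hidden_size=2048, nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/lm-hha-forward',
              sequence_length=250,
              mini_batch_size=100,
              max_epochs=10)
```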
Additionally, we use the multilingual BERT model for comparison and perform a per-layer analysis of it on the development set to find the best layer for our task. For the German LM, we use the same pre-trained German model as in Akbik et al., 2018.
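A hedged sketch of such a per-layer sweep using Flair's current TransformerWordEmbeddings class (the paper predates this class; the layer indexing and evaluation loop are illustrative only):

```python
from flair.embeddings import TransformerWordEmbeddings

# evaluate each BERT layer separately on the development set and keep the best one
for layer in range(1, 13):
    bert = TransformerWordEmbeddings('bert-base-multilingual-cased',
                                     layers=str(layer),
                                     layer_mean=False)
    # ...train a tagger with `bert` as its only embedding and score it on the dev set...
```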
The temporal aspect of the LMs' training data has a strong impact on performance.
LMs trained on contemporary data such as German Wikipedia (Akbik et al., 2018) or multilingual BERT do not perform well on the ONB dataset; the LM trained on HHA performs better due to the temporal overlap.
We also considered the masked LM objective of Devlin et al., 2018, but this technique cannot be used directly, since it relies on a subword-based LM, in contrast to our character-based LM. Instead, we introduce a novel pre-training objective, synthetic masked language modeling (SMLM), that randomly adds noise during training.
The main goal of SMLM is to transfer a corpus from one domain ('clean', contemporary texts) into another ('noisy', historical texts). SMLM takes the vocabulary (characters) from the target domain and injects it into the source domain.
With this technique we create a synthetic corpus that 'emulates' OCR errors or spelling mistakes, without having any data from the target domain.
To use SMLM, we extract all vocabulary (characters) from the ONB and LFT datasets; this is the target vocabulary. We then obtain a corpus of contemporary German texts with ~388M tokens. During training, we iterate over all characters in the contemporary corpus and leave each character unchanged 90% of the time; otherwise, 20% of the time we replace it with a masked character that does not exist in the target vocabulary, and 80% of the time we randomly replace it with a symbol from the target vocabulary.
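A minimal sketch of this noise-injection step (the function name and mask symbol are illustrative; only the probabilities follow the description above):

```python
import random

def smlm_corrupt(text, target_vocab, mask_char='\u2591'):
    """Apply SMLM noise to a contemporary (source-domain) string.

    Each character is kept with probability 0.9. Otherwise it is replaced:
    20% of the time by a mask symbol that is not in the target vocabulary,
    80% of the time by a random symbol from the target-domain vocabulary
    (the characters observed in ONB and LFT).
    """
    vocab = tuple(target_vocab)
    out = []
    for ch in text:
        if random.random() < 0.9:
            out.append(ch)                      # keep character (90%)
        elif random.random() < 0.2:
            out.append(mask_char)               # out-of-vocabulary mask symbol (2% overall)
        else:
            out.append(random.choice(vocab))    # random target-domain symbol (8% overall)
    return ''.join(out)

# target vocabulary: all characters observed in ONB and LFT (placeholder file)
# target_vocab = set(open('onb_lft_chars.txt', encoding='utf-8').read())
```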
The use of pre-trained character-based LMs boosts performance on both the LFT and ONB datasets. A corpus with a large degree of temporal overlap with the downstream task performs better than a corpus with little to no temporal overlap.
CRF-based methods outperform traditional biLSTMs in low-resource settings. This shortcoming of biLSTMs can now be eliminated by combining them with pre-trained language models.