Paper, Tags: #nlp, #architectures, #ner
In this paper we present a novel NN architecture that automatically detects word- and character-level features using a hybrid, bidirectional LSTM and CNN architecture. We also propose a novel method of encoding partial lexicon matches.
Given only tokenized text and publicly available word embeddings, our system is competitive on the CoNLL-2003 dataset and surpasses the previously reported performance by 2.13 F1 points.
Simple feed-forward NNs restrict the use of context to a fixed-size window around each word, and since they depend solely on word embeddings, they are unable to exploit explicit character-level features such as prefixes and suffixes.
BiLSTMs can take into account an effectively infinite amount of context on both sides of a word and eliminate the problem of limited context that applies to any feed-forward model.
We present a hybrid model of biLSTMs and CNNs that learns both character- and word-level features. We also propose a new lexicon encoding scheme and matching algorithm that can make use of partial matches.
We use biLSTMs for tagging, and CNNs to induce character-level features. The features extracted by the CNN are fed into a forward LSTM and a backward LSTM; the output of each network at each time step is decoded by a linear layer and a log-softmax layer, and the two outputs are then added together to produce the final output.
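A minimal sketch of this tagging architecture is shown below, written in PyTorch rather than the torch7 used in the paper; the module names and dimensions are illustrative assumptions, and the per-word input features are assumed to be pre-computed.

```python
# Hypothetical sketch of the biLSTM tagger described above: a forward and a
# backward LSTM, each decoded by a linear + log-softmax layer, with the two
# per-tag score vectors added together.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, feat_dim, hidden_dim, num_tags):
        super().__init__()
        self.fwd_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fwd_out = nn.Linear(hidden_dim, num_tags)
        self.bwd_out = nn.Linear(hidden_dim, num_tags)
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, feats):                                 # feats: (batch, seq_len, feat_dim)
        fwd, _ = self.fwd_lstm(feats)                         # left-to-right pass
        bwd, _ = self.bwd_lstm(torch.flip(feats, dims=[1]))   # right-to-left pass
        bwd = torch.flip(bwd, dims=[1])                       # re-align backward outputs
        # decode each direction separately, then add the per-tag log-scores
        return self.log_softmax(self.fwd_out(fwd)) + self.log_softmax(self.bwd_out(bwd))
```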
To extract character features using a CNN, for each word we apply a convolution and a max-pooling layer to produce a new feature vector. The hyper-parameters of the CNN are the window size and the output vector size.
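A rough sketch of that character-level CNN, again in PyTorch; the embedding dimension and output size here are illustrative assumptions, not the paper's values.

```python
# Character-level CNN: embed each character, convolve over the character
# sequence of a word, then max-pool over time to get a fixed-size word feature.
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, num_chars, char_emb_dim=25, window=3, out_dim=50):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_emb_dim, padding_idx=0)
        # 1-D convolution over the character sequence of a single word
        self.conv = nn.Conv1d(char_emb_dim, out_dim, kernel_size=window, padding=window // 2)

    def forward(self, char_ids):                       # char_ids: (num_words, max_word_len)
        x = self.char_emb(char_ids).transpose(1, 2)    # -> (num_words, emb_dim, max_word_len)
        x = self.conv(x)                               # -> (num_words, out_dim, max_word_len)
        return x.max(dim=2).values                     # max-over-time pooling -> (num_words, out_dim)
```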
We experiment with Senna, GloVe and word2vec embeddings. All words are lowercased before being passed through the lookup table that converts them to their corresponding embeddings. The pre-trained embeddings are allowed to be modified during training. Additional word-level features (a small sketch follows this list):
- Capitalization feature
- Lexicons
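The sketch below illustrates the word-level feature lookup: the embedding is retrieved for the lowercased word, while a separate capitalization feature preserves the casing information that lowercasing discards. The category names and the `embeddings`/`unk_vector` arguments are assumptions for illustration.

```python
# Word-level features: lowercased embedding lookup plus a capitalization category.
def capitalization_feature(word: str) -> str:
    if word.isupper():
        return "allCaps"
    if word[:1].isupper() and word[1:].islower():
        return "upperInitial"
    if word.islower():
        return "lowercase"
    if any(c.isupper() for c in word):
        return "mixedCaps"
    return "noinfo"                                   # e.g. punctuation or digits

def word_features(word: str, embeddings: dict, unk_vector):
    vec = embeddings.get(word.lower(), unk_vector)    # lowercased lookup into pre-trained table
    return vec, capitalization_feature(word)
```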
We implement the NN using the torch7 library, and training and inference are done on a per-sentence level. We train the network to maximize the sentence-level log-likelihood.
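As a reminder of what the sentence-level log-likelihood objective looks like (in the style of Collobert et al.), here is a hedged sketch: the score of a tag path combines per-token network scores with learned tag-transition scores, and is normalized over all paths with the forward algorithm. Variable names and shapes are assumptions.

```python
# Sentence-level log-likelihood of the gold tag sequence.
import torch

def sentence_log_likelihood(emissions, transitions, tags):
    # emissions: (seq_len, num_tags) per-token scores from the biLSTM decoder
    # transitions: (num_tags, num_tags), transitions[i, j] = score of tag i -> tag j
    # tags: (seq_len,) gold tag indices
    seq_len, num_tags = emissions.shape

    # score of the gold path
    gold = emissions[0, tags[0]]
    for t in range(1, seq_len):
        gold = gold + transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]

    # log partition function via the forward algorithm
    alpha = emissions[0]                               # (num_tags,)
    for t in range(1, seq_len):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    log_z = torch.logsumexp(alpha, dim=0)

    return gold - log_z                                # maximize this during training
```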
Evaluation was performed on the CoNLL-2003 NER shared task dataset, which tags four types of named entities: location, organization, person and miscellaneous.
- All digit sequences are replaced by a single '0' (see the sketch after this list)
- Before training, we group sentences by word length into mini-batches and shuffle them.
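A small sketch of these two preprocessing steps; the bucketing granularity and batch size are illustrative assumptions.

```python
# Digit normalization and length-bucketed mini-batching before training.
import random
import re
from collections import defaultdict

def normalize_digits(token: str) -> str:
    return re.sub(r"\d+", "0", token)       # e.g. "1923" -> "0", "12-34" -> "0-0"

def make_batches(sentences, batch_size=10):
    buckets = defaultdict(list)
    for sent in sentences:
        buckets[len(sent)].append(sent)      # group sentences of equal word length
    batches = []
    for sents in buckets.values():
        for i in range(0, len(sents), batch_size):
            batches.append(sents[i:i + batch_size])
    random.shuffle(batches)                  # shuffle the mini-batches
    return batches
```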
The BLSTM-CNN models fail to converge around 5-10% of the time, depending on the feature set. BLSTM-CNN models significantly outperform the BLSTM models when given the same feature set.
We obtain a large, significant improvement when pre-trained word embeddings are used, as opposed to random embeddings, regardless of the additional features used.