What do you learn from context? Probing for sentence structure in contextualized word representations, Tenney et al., 2019
Paper, Tags: #nlp
We introduce a novel edge probing task design and construct a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline. We probe word-level contextual representations from four recent models and investigate how they encode sentence structure across a range of syntactic, semantic, local, and long-range phenomena.
Existing models trained on language modeling and translation produce strong representations for syntactic phenomena, but offer only comparably small improvements on semantic tasks over a non-contextual baseline.
Our goal in this work is to understand where these contextual representations improve over conventional word embeddings. Previous work has explored many token-level properties of these representations, such as their ability to capture POS tags or perform word-sense disambiguation. We expand on this by introducing a suite of edge probing tasks covering a broad range of syntactic, semantic, local, and long-range phenomena.
We focus on asking what information is encoded at each position, and how well it encodes structural information about that word's role in the sentence. Is this information primarily syntactic in nature, or do the representations also encode higher-level semantic relationships?
We use a probing model that sees only the contextual embeddings from a fixed, pretrained encoder. The model can access only the embeddings within given spans and must predict properties of those spans, such as a syntactic or semantic label. We call the technique 'edge probing' because we decompose each structured task into a set of graph edges, which we can predict independently using a common classifier architecture.
Code is available on GitHub.
We define a novel 'edge probing' framework motivated by the need for a uniform set of metrics and architectures across tasks. The formal formulation can be found in the paper.
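As a concrete illustration of the shared format (the class and field names below are my own, not from the paper), every task reduces to one or two token spans plus one or more labels:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class EdgeProbingExample:
    """One labeled edge from a structured annotation (hypothetical sketch)."""
    tokens: List[str]                  # the sentence, pre-tokenized
    span1: Tuple[int, int]             # [start, end) token indices
    span2: Optional[Tuple[int, int]]   # second span for relational tasks, else None
    labels: List[str]                  # gold labels (tasks can be multi-label)

# A dependency edge: "Mary" is the nominal subject of "pushed".
example = EdgeProbingExample(
    tokens=["Mary", "pushed", "John", "."],
    span1=(1, 2),   # head: "pushed"
    span2=(0, 1),   # dependent: "Mary"
    labels=["nsubj"],
)
```

Because every task shares this shape, one classifier architecture and one metric suite can cover all of them.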
We focus on eight core NLP labeling tasks:
- POS
- Constituent labeling: assigning a non-terminal label for a span of tokens within the phrase-structure parse of the sentence (noun phrase, verb phrase)
- Dependency labeling: predicting the functional relationship of one token relative to another (modifier-head relationship, subject-object relationship)
- Named entity labeling
- Semantic role labeling (SRL): imposing predicate-argument structure onto a natural language sentence
- Mary pushed John
- Mary: pusher, John: pushee
- Coreference
- Semantic proto-role (SPR) labeling: annotating fine-grained, non-exclusive semantic attributes
- Mary pushed John
- is the pusher 'aware' of doing the pushing?
- Relation classification (Rel.): predicting the real-world relation that holds between two entities, given an inventory of symbolic relation types
- Mary is walking to work
- Mary-work: Entity-Destination relation
We use the annotations in the OntoNotes 5.0 corpus (Weischedel et al., 2013) for these tasks:
- POS
- Constituents
- Named entities
- Semantic roles
- Coreference
In all cases, we simply cast the original annotation into our edge probing format. The OntoNotes corpus doesn't contain annotations for:
- Dependencies
- Proto-roles
- Semantic relations
For these, we use the English Web Treebank portion of Universal Dependencies; for relation classification, we use the SemEval 2010 Task 8 dataset.
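To make the casting concrete, here is a hypothetical rendering of one sentence's constituent annotations in the shared edge probing format (field names are an assumption, not necessarily the released data format):

```python
# Single-span tasks like constituent labeling simply omit the second span.
record = {
    "text": "Mary pushed John .",
    "targets": [
        {"span1": [0, 1], "label": "NP"},  # "Mary"
        {"span1": [1, 3], "label": "VP"},  # "pushed John"
    ],
}
```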
The probing architecture is illustrated in Figure 1 of the paper. It takes a list of contextual vectors and integer spans as inputs, and applies a projection layer followed by a self-attention pooling operator. Pooling is restricted to the bounds of a span; the only information the model can access about the rest of the sentence is what the contextual embeddings provide.
The span representations are concatenated and fed into a two-layer MLP followed by a sigmoid output layer.
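A minimal PyTorch sketch of this probing head; dimensions, names, and the exact attention form are assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn

class EdgeProbe(nn.Module):
    """Projection -> per-span self-attentive pooling -> two-layer MLP -> sigmoid."""

    def __init__(self, enc_dim: int, proj_dim: int = 256,
                 hidden_dim: int = 256, n_labels: int = 20):
        super().__init__()
        self.proj = nn.Linear(enc_dim, proj_dim)   # per-token projection
        self.attn = nn.Linear(proj_dim, 1)         # attention scores for pooling
        self.mlp = nn.Sequential(
            nn.Linear(2 * proj_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_labels),
        )

    def pool(self, h: torch.Tensor, span: tuple) -> torch.Tensor:
        """Self-attentive pooling restricted to the tokens inside one span."""
        start, end = span
        seg = h[start:end]                              # (span_len, proj_dim)
        weights = torch.softmax(self.attn(seg), dim=0)  # attend within the span only
        return (weights * seg).sum(dim=0)               # (proj_dim,)

    def forward(self, contextual_vectors, span1, span2):
        # contextual_vectors: (seq_len, enc_dim) from the frozen encoder.
        h = self.proj(contextual_vectors)
        z = torch.cat([self.pool(h, span1), self.pool(h, span2)])
        return torch.sigmoid(self.mlp(z))  # independent per-label probabilities
```

For single-span tasks, the second pooled vector is simply dropped, halving the MLP input width.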
We use four recent contextual encoder models:
- CoVe (McCann et al., 2017)
- ELMo (Peters et al., 2018)
- GPT (Radford et al., 2018)
- BERT (Devlin et al., 2018)
The question to answer: what do contextual representations encode that conventional word embeddings don't? We compare against several baselines and ablations:
- Lexical baselines
- Randomized ELMo
- Word-level CNN
- Comparison of representation models
- ELMo and GPT outperform CoVe, meaning that the information in their representations makes it easier to recover details of sentence structure.
- Using ELMo-style scalar mixing instead of concatenation improves performance on both BERT and GPT. We attribute this to the most relevant information being contained in the intermediate layers, with the top layers overly specialized for next-word prediction (a sketch of scalar mixing follows this list)
- Genre effects
- Our probing suite is drawn mostly from newswire and web text. This is a good match for the Billion Word Benchmark (BWB) used to train ELMo, but a weaker match for the Books Corpus used to train GPT. To control for this, we train a clone of the GPT model on the BWB. This model performs slightly better on our probing suite than the Books Corpus-trained model, but still underperforms ELMo
- Encoding of syntactic vs. semantic information
- ELMo, CoVe and GPT show the largest gains on tasks considered to be largely syntactic (dependency and constituent labeling), and smaller gains on tasks considered to require more semantic reasoning
- Semantic role labeling benefits greatly from contextual encoders overall, but this is predominantly due to better labeling of core roles, which are known to be closely tied to syntax.
- Effect of architecture
- How much of the model's performance can be attributed to the architecture rather than to knowledge from pretraining? We compare to an orthonormal encoder, structurally identical to ELMo but containing no information in the recurrent weights (sketched after this list). The orthonormal encoder improves significantly over the lexical baseline, but still falls short of the trained ELMo, so both architecture and learned weights contribute
- Encoding non-local context
- How much information is carried over long distances in the sentence? We extend our lexical baseline with a convolutional layer. Adding a CNN of width 3 (±1 token) closes 72% of the gap between the lexical baseline and full ELMo (also sketched after this list)
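On the scalar mixing referenced above: a minimal sketch of the ELMo-style learned layer mix, assuming the encoder exposes per-layer hidden states of equal shape (names are my own):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style learned weighted sum over an encoder's layer outputs."""

    def __init__(self, n_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(n_layers))  # softmax-normalized
        self.gamma = nn.Parameter(torch.ones(()))           # global scale

    def forward(self, layer_outputs):
        # layer_outputs: list of (seq_len, dim) tensors, one per layer.
        w = torch.softmax(self.weights, dim=0)
        mixed = sum(w_i * h_i for w_i, h_i in zip(w, layer_outputs))
        return self.gamma * mixed
```

The softmax-normalized weights let the probe select among layers without inflating its input dimension the way concatenation does.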
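For the orthonormal control, one plausible construction, assuming an ELMo-shaped BiLSTM; this is a sketch under my own assumptions, not the paper's exact recipe:

```python
import torch.nn as nn

def orthonormal_bilstm(input_dim: int, hidden_dim: int, num_layers: int = 2):
    """An ELMo-shaped BiLSTM with random orthonormal weight matrices and no
    training, isolating what the architecture contributes on its own."""
    lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                   bidirectional=True, batch_first=True)
    for name, param in lstm.named_parameters():
        if "weight" in name:
            nn.init.orthogonal_(param)  # random orthonormal input/recurrent weights
        else:
            nn.init.zeros_(param)       # zero biases
    for param in lstm.parameters():
        param.requires_grad = False     # frozen: it never learns from data
    return lstm
```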
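And for the local-context control, a sketch of extending a lexical (context-independent) baseline with a width-3 convolution; class and parameter names are hypothetical:

```python
import torch.nn as nn

class LocalContextBaseline(nn.Module):
    """A lexical baseline plus a width-3 convolution: each token sees ±1 neighbor."""

    def __init__(self, emb_dim: int):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)

    def forward(self, word_embeddings):
        # word_embeddings: (batch, seq_len, emb_dim), context-independent vectors.
        x = word_embeddings.transpose(1, 2)  # Conv1d wants (batch, channels, seq_len)
        return self.conv(x).transpose(1, 2)  # back to (batch, seq_len, emb_dim)
```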
We've introduced a suite of 'edge probing' tasks designed to probe the sub-sentential structure of contextualized word embeddings. We use these tasks to explore how contextual embeddings improve on their lexical (context-independent) baselines.
In general, contextualized embeddings improve over their non-contextualized counterparts more on syntactic tasks than on semantic ones, suggesting that these embeddings encode syntax more so than higher-level semantics. ELMo's performance can't be fully explained by a model with access only to local context, suggesting that the contextualized representations do encode distant linguistic information.