SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
Implementing: U-Net: Machine Reading Comprehension with Unanswerable Questions, Fudan University & Liulishuo Lab
This paper decomposes the problem of Machine Reading Comprehension with unanswerable questions into three sub-tasks: answer pointer, no-answer pointer, and answer verifier.
- an answer pointer to predict a candidate answer span for a question;
- a no-answer pointer to avoid selecting any text span when a question has no answer; and
- an answer verifier to determine the probability of the "unanswerability" of a question with candidate answer information.
Introduce a universal node and process the question and its context passage as a single contiguous sequence of tokens, which greatly improves the conciseness of U-Net.
Represent the MRC problem as: given a set of tuples (Q, P, A), where Q = (q1, q2, · · · , qm) is the question with m words, P = (p1, p2, · · · , pn) is the context passage with n words, and A = p_rs:re is the answer with rs and re indicating the start and end points, the task is to estimate the conditional probability P(A|Q, P).
Embedding: Embed both the question and the passage with GloVe embeddings and ELMo embeddings. Use POS embedding, NER embedding, and a feature embedding that includes the exact match, lower-case match, lemma match, and a TF-IDF feature. We get the question representation Q = {q1, q2, · · · , qm} and the passage representation P = {p1, p2, · · · , pn}, where each word is represented as a d-dim embedding obtained by combining the features/embeddings described above.
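As a rough sketch (not the authors' code), assuming each per-word feature has already been looked up as a tensor, the combined d-dim word representation is just a concatenation; all feature dimensions below are illustrative, not the paper's configuration:

```python
import torch

# Hypothetical per-word feature tensors for a sequence of length seq_len.
seq_len = 12
glove = torch.randn(seq_len, 300)       # pre-trained GloVe vectors
elmo = torch.randn(seq_len, 1024)       # contextual ELMo vectors
pos = torch.randn(seq_len, 12)          # POS tag embedding
ner = torch.randn(seq_len, 8)           # NER tag embedding
# exact match, lower-case match, lemma match (binary) and a TF-IDF feature
match_feats = torch.cat([torch.randint(0, 2, (seq_len, 3)).float(),
                         torch.rand(seq_len, 1)], dim=-1)

# each word is represented by the concatenation of all features/embeddings
word_repr = torch.cat([glove, elmo, pos, ner, match_feats], dim=-1)
print(word_repr.shape)  # (seq_len, 300 + 1024 + 12 + 8 + 4)
```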
Universal Node: We create a universal node u to learn universal information from both the passage and the question. This universal node is added between the question and passage at the embedding stage, connecting the two, and is then carried along through the whole representation, so it is a key factor in the information representation.
We concatenate the question representation, universal node representation, and passage representation together as:
V = [Q, u, P] = [q1, q2, · · · , qm, u, p1, p2, · · · , pn]
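A minimal PyTorch sketch of this concatenation, assuming the question and passage have already been embedded to the same dimension; the universal node is simply one extra trainable vector (all names and sizes are illustrative):

```python
import torch

d, m, n = 128, 10, 60                       # embedding dim, question length, passage length
Q = torch.randn(m, d)                       # embedded question q1..qm
P = torch.randn(n, d)                       # embedded passage p1..pn
u = torch.nn.Parameter(torch.randn(1, d))   # trainable universal node

# V = [q1..qm, u, p1..pn]: one contiguous sequence of m + 1 + n vectors
V = torch.cat([Q, u, P], dim=0)
print(V.shape)  # (m + 1 + n, d)
```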
Word-level Fusion: We first use a two-layer bidirectional LSTM (BiLSTM) to fuse the joint representation of the question, universal node, and passage.
Hl = BiLSTM(V)
Hh = BiLSTM(Hl)
Hf = BiLSTM([Hl;Hh])
Thus, H = [Hl; Hh; Hf] represents the deep fusion information of the question and passage at the word level. When a BiLSTM is applied to encode representations, it learns the semantic information bi-directionally. Since the universal node u sits between the question and passage, its hidden state hm+1 can learn both question and passage information.
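A hedged sketch of the word-level fusion with off-the-shelf PyTorch BiLSTMs; hidden sizes and sequence length are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

d, h = 128, 64                        # input dim, LSTM hidden size (illustrative)
lstm_low  = nn.LSTM(d, h, bidirectional=True, batch_first=True)
lstm_high = nn.LSTM(2 * h, h, bidirectional=True, batch_first=True)
lstm_fuse = nn.LSTM(4 * h, h, bidirectional=True, batch_first=True)

V = torch.randn(1, 71, d)             # [batch, m + 1 + n, d] joint sequence

Hl, _ = lstm_low(V)                   # low-level states
Hh, _ = lstm_high(Hl)                 # high-level states
Hf, _ = lstm_fuse(torch.cat([Hl, Hh], dim=-1))  # fusion of the two levels

H = torch.cat([Hl, Hh, Hf], dim=-1)   # word-level deep fusion [Hl; Hh; Hf]
print(H.shape)                        # (1, 71, 6 * h)
```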
We first divide H into two representations, the question block Hq and the passage block Hp, and let the universal node representation hm+1 be attached to both:
Hq = [h1, h2, · · · , hm+1]
Hp = [hm+1, hm+2, · · · , hm+n+1]
Note hm+1 is shared by Hq and Hp. Here the universal node works as a special information carrier: both the passage and the question can focus attention on this node, so the connection between them is closer than in a traditional bi-attention interaction.
We first compute the affine matrix of Hlq and Hlp by
S = (ReLU(W1Hlq))TReLU(W2Hlp)
where W1 and W2 are learnable parameters. Next, a bi-directional attention is used to compute the interacted representations ^Hlq and ^Hlp.
^Hlq = Hlp × softmax(ST)
^Hlp = Hlq × softmax(S)
where softmax(·) is the column-wise normalization function. We use the same attention layer to model the interactions at all three levels and obtain the final fused representations ^Hlq, ^Hhq, ^Hfq for the question and ^Hlp, ^Hhp, ^Hfp for the passage.
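A sketch of this interaction for one level, assuming Hq and Hp have been sliced out of H as above; W1 and W2 are modelled here as bias-free linear layers and the projection size k, like all other dimensions, is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, k, m, n = 384, 128, 10, 60        # illustrative sizes
Hq = torch.randn(m + 1, d)           # question block (universal node last)
Hp = torch.randn(n + 1, d)           # passage block (universal node first)

W1 = nn.Linear(d, k, bias=False)     # learnable projections
W2 = nn.Linear(d, k, bias=False)

# affine (similarity) matrix S: one score per (question position, passage position)
S = F.relu(W1(Hq)) @ F.relu(W2(Hp)).t()        # (m+1, n+1)

# column-wise softmax: attend over passage positions for each question word,
# and over question positions for each passage word
Hq_hat = torch.softmax(S, dim=1) @ Hp          # (m+1, d) passage-aware question
Hp_hat = torch.softmax(S, dim=0).t() @ Hq      # (n+1, d) question-aware passage
```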
We then concatenate all the history information: we first concatenate the encoded representation H and the attended representation ^H, and pass the concatenation through a BiLSTM to get HA.
HA = BiLSTM([Hl; Hh; Hf; ^Hl; ^Hh; ^Hf])
where the representation HA is a fusion of information from different levels. Then we concatenate the original embedded representation V and HA for better representation of the fused information of passage, universal node, and question.
A = [V;HA]
Finally, we use a self-attention layer to capture the attention information within the fused representation. The self-attention layer is constructed in the same way as the attention layer above:
^A = A × softmax(ATA)
where ^A is the representation after self-attention over the fused information A. Next we concatenate the representations HA and ^A and pass them through another BiLSTM layer:
O = BiLSTM([HA; ^A])
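Putting the history fusion, the self-attention, and the final BiLSTM together, a hedged sketch (illustrative dimensions, not the paper's hyperparameters) could look like:

```python
import torch
import torch.nn as nn

B, L, h, dV = 1, 71, 64, 128                       # illustrative sizes
H     = torch.randn(B, L, 6 * h)                   # [Hl; Hh; Hf]
H_hat = torch.randn(B, L, 6 * h)                   # [^Hl; ^Hh; ^Hf]
V     = torch.randn(B, L, dV)                      # original embedded sequence

fuse_lstm  = nn.LSTM(12 * h, h, bidirectional=True, batch_first=True)
final_lstm = nn.LSTM(4 * h + dV, h, bidirectional=True, batch_first=True)

HA, _ = fuse_lstm(torch.cat([H, H_hat], dim=-1))   # HA = BiLSTM([Hl;Hh;Hf;^Hl;^Hh;^Hf])
A = torch.cat([V, HA], dim=-1)                     # A = [V; HA]

# self-attention: ^A = A · softmax(AᵀA), normalized over positions
scores = A @ A.transpose(1, 2)                     # (B, L, L) dot products between positions
A_hat = torch.softmax(scores, dim=-1) @ A          # (B, L, dV + 2*h)

O, _ = final_lstm(torch.cat([HA, A_hat], dim=-1))  # O = BiLSTM([HA; ^A])
print(O.shape)                                     # (B, L, 2*h)
```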
Now O is the final fused representation of all the information. At this point, we divide O into two parts, OQ and OP, representing the fused information of the question and passage respectively.
OQ = [o1, o2, · · · , om]
OP = [om+1, om+2, · · · , om+n+1]
Note that for the final representation, we attach the universal node only to the passage representation OP. This is because we need the universal node as a focus for the pointer when the question is unanswerable.
The prediction layer receives fused information of passage OP and question OQ, and tackles three prediction tasks:
- answer pointer,
- no-answer pointer and
- answer verifier
First, we summarize the question information OQ into a fixed-dim representation cq with a weighted sum:
cq = Σi γi oQi, with γi ∝ exp(Wq oQi)
where Wq is a learnable weight matrix and oQi represents the ith word in the question representation. Then we feed cq into the answer pointer to find the boundaries of the answer, and into the classification layer to distinguish whether the question is answerable.
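A small sketch of this weighted summary, assuming (as above) that Wq can be modelled as a bias-free linear layer mapping each word vector to a scalar score:

```python
import torch
import torch.nn as nn

d, m = 128, 10                                   # illustrative sizes
OQ = torch.randn(m, d)                           # fused question representation
Wq = nn.Linear(d, 1, bias=False)                 # learnable summary weight

# weights over question words, then a weighted sum gives the fixed-dim cq
gamma = torch.softmax(Wq(OQ).squeeze(-1), dim=0) # (m,)
cq = gamma @ OQ                                  # (d,)
```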
We use two trainable matrices Ws and We to estimate the probabilities αi and βi that the ith word in the passage is the start and end boundary of the answer.
αi ∝ exp(cqWsoPi)
βi ∝ exp(cqWeoPi)
Note here when the question is answerable, we do not consider the universal node in answer boundary detection, so we have i > 0.
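A hedged sketch of the answer pointer and its loss LA; the gold indices and all dimensions are hypothetical:

```python
import torch
import torch.nn as nn

d, n = 128, 60                               # illustrative sizes
OP = torch.randn(n + 1, d)                   # passage block, universal node at index 0
cq = torch.randn(d)                          # question summary from the step above
Ws = nn.Linear(d, d, bias=False)             # trainable matrix for the start boundary
We = nn.Linear(d, d, bias=False)             # trainable matrix for the end boundary

# αi ∝ exp(cq Ws oPi) and βi ∝ exp(cq We oPi), normalized over passage positions
alpha = torch.softmax(OP @ Ws(cq), dim=0)    # start-boundary distribution
beta  = torch.softmax(OP @ We(cq), dim=0)    # end-boundary distribution

# answerable case: only real passage words (i > 0) can be answer boundaries
a, b = 5, 9                                  # hypothetical gold start/end indices
loss_A = -(torch.log(alpha[a]) + torch.log(beta[b]))   # LA = -(log αa + log βb)
```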
The loss function for the answerable question pairs is:
LA = −(log αa + log βb)
where a and b are the ground-truth start and end boundaries of the answer.
We use the same pointer for questions that are not answerable:
LNA = −(log α0 + log β0) − (log α′a* + log β′b*)
Here α0 and β0 correspond to the position of the universal node, which is at the front of the passage representation OP. For this scenario, the boundary loss is calculated on the universal node.
Additionally, since there exists a plausible answer for each unanswerable question in SQuAD 2.0, we introduce an auxiliary plausible-answer pointer to predict the boundaries of the plausible answer, where α′ and β′ are the output of the plausible-answer pointer and a∗ and b∗ are the start and end boundaries of the plausible answer.
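A sketch of the no-answer loss LNA under these definitions; the pointer outputs are faked with random tensors here just to show the indexing:

```python
import torch

# alpha/beta: output of the no-answer pointer; alpha_p/beta_p: output of the auxiliary
# plausible-answer pointer (same bilinear form as above, separate parameters)
n = 60
alpha   = torch.softmax(torch.randn(n + 1), dim=0)
beta    = torch.softmax(torch.randn(n + 1), dim=0)
alpha_p = torch.softmax(torch.randn(n + 1), dim=0)
beta_p  = torch.softmax(torch.randn(n + 1), dim=0)

a_star, b_star = 12, 15                      # hypothetical plausible-answer boundaries

# LNA = -(log α0 + log β0) - (log α'a* + log β'b*): both boundaries point at the
# universal node (index 0) while the plausible-answer pointer is supervised separately
loss_NA = -(torch.log(alpha[0]) + torch.log(beta[0])) \
          - (torch.log(alpha_p[a_star]) + torch.log(beta_p[b_star]))
```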
The answer verifier applies a weighted summary layer to summarize the passage information into fixed-dim representations. Reusing the attention weights obtained from the answer pointer, we get two summaries of the passage, cs = Σi αi oPi and ce = Σi βi oPi, and combine them with the question summary cq and the universal node:
F = [cq; om+1; cs; ce]
This fixed vector F includes cq representing the question information, and cs and ce representing the passage information. Since these representations are highly summarized, specifically for classification, we believe this passage-question pair contains enough information to distinguish whether the question is answerable.
Finally, we pass this fixed vector F through a linear layer to obtain the prediction whether the question is answerable.
pc = σ(WTf F)
where σ is the sigmoid function and Wf is a learnable weight matrix. We use the cross-entropy loss in training:
LAV = − (δ · log pc + (1 − δ) · (log (1 − pc)))
where δ ∈ {0, 1} indicates whether the question has an answer in the passage.
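A hedged sketch of the verifier, assuming cs and ce are the α- and β-weighted summaries of the passage as described above; everything not defined in the text (sizes, gold label) is illustrative:

```python
import torch
import torch.nn as nn

d, n = 128, 60                               # illustrative sizes
OP = torch.randn(n + 1, d)                   # passage block, universal node at index 0
cq = torch.randn(d)                          # question summary
alpha = torch.softmax(torch.randn(n + 1), dim=0)  # start weights from the answer pointer
beta  = torch.softmax(torch.randn(n + 1), dim=0)  # end weights from the answer pointer

cs = alpha @ OP                              # passage summary under the start weights
ce = beta @ OP                               # passage summary under the end weights
u_node = OP[0]                               # fused universal node o_{m+1}

F_vec = torch.cat([cq, u_node, cs, ce])      # F = [cq; o_{m+1}; cs; ce]
Wf = nn.Linear(4 * d, 1)                     # linear verifier layer
p_c = torch.sigmoid(Wf(F_vec)).squeeze()     # probability that the question is answerable

delta = 1.0                                  # 1 if the question has an answer, else 0
loss_AV = -(delta * torch.log(p_c) + (1 - delta) * torch.log(1 - p_c))
```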
We jointly train the three tasks by combining the three loss functions. The final loss function is:
L = δLA + (1 − δ)LNA + LAV
where δ ∈ {0, 1} indicates whether the question has an answer in the passage, LA, LNA and LAV are the three loss functions of the answer pointer, no-answer pointer, and answer verifier.
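A minimal sketch of the joint objective for one example (the helper name is ours):

```python
def unet_loss(loss_A, loss_NA, loss_AV, delta):
    """Combine the three objectives: delta = 1 supervises the answer pointer,
    delta = 0 the no-answer pointer; the answer-verifier loss is always applied."""
    return delta * loss_A + (1 - delta) * loss_NA + loss_AV
```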
References:
- Attention and Memory in Deep Learning and NLP
- Glove Vectors
- Elmo
- Loss function
- BiLSTM
- Softmax
- Affine matrix