SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
Implementing: U-Net: Machine Reading Comprehension with Unanswerable Questions, Fudan University & Liulishuo Lab
This paper decomposes the problem of Machine Reading Comprehension with unanswerable questions into three sub-tasks: answer pointer, no-answer pointer, and answer verifier.
- an answer pointer to predict a candidate answer span for a question;
- a no-answer pointer to avoid selecting any text span when a question has no answer; and
- an answer verifier to determine the probability of the "unanswerability" of a question with candidate answer information.
Introduce a universal node and process the question and its context passage as a single contiguous sequence of tokens, which greatly improves the conciseness of U-Net.
Represent the MRC problem as: given a set of tuples (Q, P, A), where Q = (q1, q2, · · · , qm) is the question with m words, P = (p1, p2, · · · , pn) is the context passage with n words, and A = p_rs:re is the answer with rs and re indicating the start and end points, the task is to estimate the conditional probability P(A|Q, P).
Embedding: Embed both the question and the passage with GloVe embeddings and ELMo embeddings. Use POS embedding, NER embedding, and a feature embedding that includes the exact match, lower-case match, lemma match, and a TF-IDF feature. We get the question representation Q = {q1, q2, · · · , qm} and the passage representation P = {p1, p2, · · · , pn}, where each word is represented as a d-dim embedding obtained by combining the features/embeddings described above.
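As a rough sketch (not the authors' code), assuming each per-word feature has already been looked up as a tensor, the combined d-dim word representation is just a concatenation; all feature dimensions below are illustrative, not the paper's configuration:

```python
import torch

# Hypothetical per-word feature tensors for a sequence of length seq_len.
seq_len = 12
glove = torch.randn(seq_len, 300)       # pre-trained GloVe vectors
elmo = torch.randn(seq_len, 1024)       # contextual ELMo vectors
pos = torch.randn(seq_len, 12)          # POS tag embedding
ner = torch.randn(seq_len, 8)           # NER tag embedding
# exact match, lower-case match, lemma match (binary) and a TF-IDF feature
match_feats = torch.cat([torch.randint(0, 2, (seq_len, 3)).float(),
                         torch.rand(seq_len, 1)], dim=-1)

# each word is represented by the concatenation of all features/embeddings
word_repr = torch.cat([glove, elmo, pos, ner, match_feats], dim=-1)
print(word_repr.shape)  # (seq_len, 300 + 1024 + 12 + 8 + 4)
```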
Universal Node: We create a universal node u to learn universal information from both the passage and the question. This universal node is added between the question and passage at the embedding stage, connecting the two, and is then carried along through the whole representation, so it is a key factor in the information representation.
We concatenate the question representation, universal node representation, and passage representation together as:
V = [Q, u, P] = [q1, q2, · · · , qm, u, p1, p2, · · · , pn]
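A minimal PyTorch sketch of this concatenation, assuming the question and passage have already been embedded to the same dimension; the universal node is simply one extra trainable vector (all names and sizes are illustrative):

```python
import torch

d, m, n = 128, 10, 60                       # embedding dim, question length, passage length
Q = torch.randn(m, d)                       # embedded question q1..qm
P = torch.randn(n, d)                       # embedded passage p1..pn
u = torch.nn.Parameter(torch.randn(1, d))   # trainable universal node

# V = [q1..qm, u, p1..pn]: one contiguous sequence of m + 1 + n vectors
V = torch.cat([Q, u, P], dim=0)
print(V.shape)  # (m + 1 + n, d)
```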
Word-level Fusion: We first use a two-layer bidirectional LSTM (BiLSTM) to fuse the joint representation of the question, universal node, and passage.
Hl = BiLSTM(V)
Hh = BiLSTM(Hl)
Hf = BiLSTM([Hl;Hh])
Thus, H = [Hl; Hh; Hf] represents the deep fusion information of the question and passage at the word level. When a BiLSTM is applied to encode representations, it learns the semantic information bi-directionally. Since the universal node u sits between the question and passage, its hidden state hm+1 can learn both question and passage information.
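A hedged sketch of the word-level fusion with off-the-shelf PyTorch BiLSTMs; hidden sizes and sequence length are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

d, h = 128, 64                        # input dim, LSTM hidden size (illustrative)
lstm_low  = nn.LSTM(d, h, bidirectional=True, batch_first=True)
lstm_high = nn.LSTM(2 * h, h, bidirectional=True, batch_first=True)
lstm_fuse = nn.LSTM(4 * h, h, bidirectional=True, batch_first=True)

V = torch.randn(1, 71, d)             # [batch, m + 1 + n, d] joint sequence

Hl, _ = lstm_low(V)                   # low-level states
Hh, _ = lstm_high(Hl)                 # high-level states
Hf, _ = lstm_fuse(torch.cat([Hl, Hh], dim=-1))  # fusion of the two levels

H = torch.cat([Hl, Hh, Hf], dim=-1)   # word-level deep fusion [Hl; Hh; Hf]
print(H.shape)                        # (1, 71, 6 * h)
```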
We first divide H into two representations, the question block Hq and the passage block Hp, and let the universal node representation hm+1 be attached to both:
Hq = [h1, h2, · · · , hm+1]
Hp = [hm+1, hm+2, · · · , hm+n+1]
Note hm+1 is shared by Hq and Hp. Here the universal node works as a special information carrier: both the passage and the question can focus attention on this node, so the connection between them is closer than in a traditional bi-attention interaction.
We first compute the affine matrix of Hlq and Hlp by
S = (ReLU(W1Hlq))TReLU(W2Hlp)
where W1 and W2 are learnable parameters. Next, a bi-directional attention is used to compute the interacted representations ^Hlq and ^Hlp.
^Hlq = Hlp × softmax(ST)
^Hlp = Hlq × softmax(S)
where softmax(·) is the column-wise normalization function. We use the same attention layer to model the interactions at all three levels and obtain the final fused representations ^Hlq, ^Hhq, ^Hfq for the question and ^Hlp, ^Hhp, ^Hfp for the passage.
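A sketch of this interaction for one level, assuming Hq and Hp have been sliced out of H as above; W1 and W2 are modelled here as bias-free linear layers and the projection size k, like all other dimensions, is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, k, m, n = 384, 128, 10, 60        # illustrative sizes
Hq = torch.randn(m + 1, d)           # question block (universal node last)
Hp = torch.randn(n + 1, d)           # passage block (universal node first)

W1 = nn.Linear(d, k, bias=False)     # learnable projections
W2 = nn.Linear(d, k, bias=False)

# affine (similarity) matrix S: one score per (question position, passage position)
S = F.relu(W1(Hq)) @ F.relu(W2(Hp)).t()        # (m+1, n+1)

# column-wise softmax: attend over passage positions for each question word,
# and over question positions for each passage word
Hq_hat = torch.softmax(S, dim=1) @ Hp          # (m+1, d) passage-aware question
Hp_hat = torch.softmax(S, dim=0).t() @ Hq      # (n+1, d) question-aware passage
```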
We then concatenate all the history information: we first concatenate the encoded representation H and the attended representation ^H, and pass the concatenation through a BiLSTM to get HA.
HA = BiLSTM([Hl; Hh; Hf; ^Hl; ^Hh; ^Hf])
where the representation HA is a fusion of information from different levels. Then we concatenate the original embedded representation V and HA for better representation of the fused information of passage, universal node, and question.
A = [V;HA]
Finally, we use a self-attention layer to capture the attention information within the fused representation. The self-attention layer is constructed in the same way as the attention layer above:
^A = A × softmax(ATA)
where ^A is the representation after self-attention over the fused information A. Next we concatenate the representations HA and ^A and pass them through another BiLSTM layer:
O = BiLSTM([HA; ^A])
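Putting the history fusion, the self-attention, and the final BiLSTM together, a hedged sketch (illustrative dimensions, not the paper's hyperparameters) could look like:

```python
import torch
import torch.nn as nn

B, L, h, dV = 1, 71, 64, 128                       # illustrative sizes
H     = torch.randn(B, L, 6 * h)                   # [Hl; Hh; Hf]
H_hat = torch.randn(B, L, 6 * h)                   # [^Hl; ^Hh; ^Hf]
V     = torch.randn(B, L, dV)                      # original embedded sequence

fuse_lstm  = nn.LSTM(12 * h, h, bidirectional=True, batch_first=True)
final_lstm = nn.LSTM(4 * h + dV, h, bidirectional=True, batch_first=True)

HA, _ = fuse_lstm(torch.cat([H, H_hat], dim=-1))   # HA = BiLSTM([Hl;Hh;Hf;^Hl;^Hh;^Hf])
A = torch.cat([V, HA], dim=-1)                     # A = [V; HA]

# self-attention: ^A = A · softmax(AᵀA), normalized over positions
scores = A @ A.transpose(1, 2)                     # (B, L, L) dot products between positions
A_hat = torch.softmax(scores, dim=-1) @ A          # (B, L, dV + 2*h)

O, _ = final_lstm(torch.cat([HA, A_hat], dim=-1))  # O = BiLSTM([HA; ^A])
print(O.shape)                                     # (B, L, 2*h)
```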
Now O is the final fused representation of all the information. At this point, we divide O into two parts, OQ and OP, representing the fused information of the question and passage respectively.
OQ = [o1, o2, · · · , om]
OP = [om+1, om+2, · · · , om+n+1]
Note that for the final representation, we attach the universal node only to the passage representation OP. This is because we need the universal node as a focus for the pointer when the question is unanswerable.
The prediction layer receives fused information of passage OP and question OQ, and tackles three prediction tasks:
- answer pointer,
- no-answer pointer and
- answer verifier
First, we summarize the question information OQ into a fixed-dim representation cq with a weighted sum:
cq = Σi γi oQi, with γi ∝ exp(Wq oQi)
where Wq is a learnable weight matrix and oQi represents the ith word in the question representation. Then we feed cq into the answer pointer to find the boundaries of the answer, and into the classification layer to distinguish whether the question is answerable.
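A small sketch of this weighted summary, assuming (as above) that Wq can be modelled as a bias-free linear layer mapping each word vector to a scalar score:

```python
import torch
import torch.nn as nn

d, m = 128, 10                                   # illustrative sizes
OQ = torch.randn(m, d)                           # fused question representation
Wq = nn.Linear(d, 1, bias=False)                 # learnable summary weight

# weights over question words, then a weighted sum gives the fixed-dim cq
gamma = torch.softmax(Wq(OQ).squeeze(-1), dim=0) # (m,)
cq = gamma @ OQ                                  # (d,)
```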
We use two trainable matrices Ws and We to estimate the probabilities αi and βi that the ith word in the passage is the start and end boundary of the answer.
αi ∝ exp(cqWsoPi)
βi ∝ exp(cqWeoPi)
Note here when the question is answerable, we do not consider the universal node in answer boundary detection, so we have i > 0.
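A hedged sketch of the answer pointer and its loss LA; the gold indices and all dimensions are hypothetical:

```python
import torch
import torch.nn as nn

d, n = 128, 60                               # illustrative sizes
OP = torch.randn(n + 1, d)                   # passage block, universal node at index 0
cq = torch.randn(d)                          # question summary from the step above
Ws = nn.Linear(d, d, bias=False)             # trainable matrix for the start boundary
We = nn.Linear(d, d, bias=False)             # trainable matrix for the end boundary

# αi ∝ exp(cq Ws oPi) and βi ∝ exp(cq We oPi), normalized over passage positions
alpha = torch.softmax(OP @ Ws(cq), dim=0)    # start-boundary distribution
beta  = torch.softmax(OP @ We(cq), dim=0)    # end-boundary distribution

# answerable case: only real passage words (i > 0) can be answer boundaries
a, b = 5, 9                                  # hypothetical gold start/end indices
loss_A = -(torch.log(alpha[a]) + torch.log(beta[b]))   # LA = -(log αa + log βb)
```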
The loss function for the answerable question pairs is:
LA = −(log αa + log βb)
where a and b are the ground-truth start and end boundaries of the answer.
We use the same pointer for questions that are not answerable:
LNA = −(log α0 + log β0) − (log α′a* + log β′b*)
Here α0 and β0 correspond to the position of the universal node, which is at the front of the passage representation OP. For this scenario, the boundary loss is calculated on the universal node.
Additionally, since there exists a plausible answer for each unanswerable question in SQuAD 2.0, we introduce an auxiliary plausible-answer pointer to predict the boundaries of the plausible answer, where α′ and β′ are the output of the plausible-answer pointer and a∗ and b∗ are the start and end boundaries of the plausible answer.
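A sketch of the no-answer loss LNA under these definitions; the pointer outputs are faked with random tensors here just to show the indexing:

```python
import torch

# alpha/beta: output of the no-answer pointer; alpha_p/beta_p: output of the auxiliary
# plausible-answer pointer (same bilinear form as above, separate parameters)
n = 60
alpha   = torch.softmax(torch.randn(n + 1), dim=0)
beta    = torch.softmax(torch.randn(n + 1), dim=0)
alpha_p = torch.softmax(torch.randn(n + 1), dim=0)
beta_p  = torch.softmax(torch.randn(n + 1), dim=0)

a_star, b_star = 12, 15                      # hypothetical plausible-answer boundaries

# LNA = -(log α0 + log β0) - (log α'a* + log β'b*): both boundaries point at the
# universal node (index 0) while the plausible-answer pointer is supervised separately
loss_NA = -(torch.log(alpha[0]) + torch.log(beta[0])) \
          - (torch.log(alpha_p[a_star]) + torch.log(beta_p[b_star]))
```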
The answer verifier applies a weighted summary layer to summarize the passage information into fixed-dim representations. Reusing the attention weights obtained from the answer pointer, we get two summaries of the passage, cs = Σi αi oPi and ce = Σi βi oPi, and combine them with the question summary cq and the universal node:
F = [cq; om+1; cs; ce]
This fixed vector F includes cq representing the question information, and cs and ce representing the passage information. Since these representations are highly summarized, specifically for classification, we believe this passage-question pair contains enough information to distinguish whether the question is answerable.
Finally, we pass this fixed vector F through a linear layer to obtain the prediction whether the question is answerable.
pc = σ(WTf F)
where σ is the sigmoid function and Wf is a learnable weight matrix. We use the cross-entropy loss in training:
LAV = − (δ · log pc + (1 − δ) · (log (1 − pc)))
where δ ∈ {0, 1} indicates whether the question has an answer in the passage.
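A hedged sketch of the verifier, assuming cs and ce are the α- and β-weighted summaries of the passage as described above; everything not defined in the text (sizes, gold label) is illustrative:

```python
import torch
import torch.nn as nn

d, n = 128, 60                               # illustrative sizes
OP = torch.randn(n + 1, d)                   # passage block, universal node at index 0
cq = torch.randn(d)                          # question summary
alpha = torch.softmax(torch.randn(n + 1), dim=0)  # start weights from the answer pointer
beta  = torch.softmax(torch.randn(n + 1), dim=0)  # end weights from the answer pointer

cs = alpha @ OP                              # passage summary under the start weights
ce = beta @ OP                               # passage summary under the end weights
u_node = OP[0]                               # fused universal node o_{m+1}

F_vec = torch.cat([cq, u_node, cs, ce])      # F = [cq; o_{m+1}; cs; ce]
Wf = nn.Linear(4 * d, 1)                     # linear verifier layer
p_c = torch.sigmoid(Wf(F_vec)).squeeze()     # probability that the question is answerable

delta = 1.0                                  # 1 if the question has an answer, else 0
loss_AV = -(delta * torch.log(p_c) + (1 - delta) * torch.log(1 - p_c))
```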
We jointly train the three tasks by combining the three loss functions. The final loss function is:
L = δLA + (1 − δ)LNA + LAV
where δ ∈ {0, 1} indicates whether the question has an answer in the passage, LA, LNA and LAV are the three loss functions of the answer pointer, no-answer pointer, and answer verifier.
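A minimal sketch of the joint objective for one example (the helper name is ours):

```python
def unet_loss(loss_A, loss_NA, loss_AV, delta):
    """Combine the three objectives: delta = 1 supervises the answer pointer,
    delta = 0 the no-answer pointer; the answer-verifier loss is always applied."""
    return delta * loss_A + (1 - delta) * loss_NA + loss_AV
```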
References:
- Attention and Memory in Deep Learning and NLP
- Glove Vectors
- Elmo
- Loss function
- BiLSTM
- Softmax
- Affine matrix