
2018, ACL, Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations #80

Open
Sepideh-Ahmadian opened this issue Sep 12, 2024 · 2 comments

@Sepideh-Ahmadian
Member

Sepideh-Ahmadian commented Sep 12, 2024

Paper
Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations

Introduction
This article is part of a series of efforts that use language models for data augmentation. It assumes invariance (i.e., the change remains natural) when a word in a sentence is substituted with other words that are paradigmatically related. The replacement words are suggested by a bi-directional language model at each word position in the sentence.

Main problem
It is difficult to establish a universal rule for transforming language in a way that preserves class labels while remaining applicable across various domains (generalization).
This work suggests word substitutions based on paradigmatic relations. Previous efforts that considered only synonym substitution using WordNet were limited, since the number of synonyms for each word is small. This research also takes the label into account when determining contextual substitutions. For instance, in the sentence "the actors are fantastic", a substitute for "fantastic" under a positive label might be "funny", while under a negative label it could be "dull". The method has been tested to ensure label validity after substitution.

Illustrative Example
In this example, only substitution of the word "actors" is considered:
Original review: The actors are fantastic
Augmented sentences: The performances (films, movies, stories) are fantastic.

Input
A sentence, i.e., a sequence of words (e.g., "The actors are fantastic")

Output
K sentences composed of the model's high-probability substitutions
(e.g., "The characters are funny" (positive label), "The characters are tired" (negative label))
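The mapping from input to output can be sketched as a loop over word positions: at each position, a language model proposes top-k replacement words given the surrounding context, and each replacement yields one augmented sentence. Below is a minimal sketch in Python, where `topk_candidates` is a hypothetical toy lookup table standing in for the paper's bidirectional LSTM language model (the candidate lists are invented purely for illustration):

```python
def topk_candidates(left_context, right_context, k):
    """Stub LM: return up to k plausible fillers for the blank between
    left_context and right_context. A toy lookup table, NOT the paper's
    bi-LSTM; keys are (previous word, next word) pairs."""
    table = {
        ("the", "are"): ["actors", "performances", "films", "movies", "stories"],
        ("are",): ["fantastic", "funny", "great"],
    }
    key = tuple(left_context[-1:]) + tuple(right_context[:1])
    return table.get(key, [])[:k]


def augment(sentence, k=3):
    """Return sentences where each position is replaced, in turn,
    by the LM's top-k words for that position."""
    words = sentence.lower().split()
    out = []
    for i, w in enumerate(words):
        for cand in topk_candidates(words[:i], words[i + 1:], k):
            if cand != w:  # skip the original word itself
                out.append(" ".join(words[:i] + [cand] + words[i + 1:]))
    return out


print(augment("The actors are fantastic"))
# e.g. "the performances are fantastic", "the actors are funny", ...
```

With a real LM, `topk_candidates` would score every vocabulary word given both the left and right context and return the k most probable ones.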

Motivation
In previous work, the word "actors" in "The actors are fantastic" could be replaced, using its WordNet synset, by "players" or "historian", based on the average similarity of the words. However, "actors" can also be replaced with non-synonym words such as "characters", "movies", or "stories" in a way that keeps the sentiment, naturalness, and context.

Related works and their gaps
Previous work has used the following methods for data augmentation:

  1. using synonym lists (Zhang et al., 2015; Wang and Yang, 2015),
    In earlier works, researchers used synonyms selected from WordNet (Miller, 1995; Zhang et al., 2015) or similarity calculations (Wang and Yang, 2015). Since words with the same or near meaning are few, and the approach applies to only a small subset of the words in a sentence, it cannot generate numerous patterns from the original sentence.
  2. grammar induction (Jia and Liang, 2016),
  3. task-specific heuristic rules (Furstenau and Lapata, 2009; Kafle et al., 2017; Silfverberg et al., 2017),
  4. Neural decoders of autoencoders (Bergmanis et al., 2017; Xu et al., 2017; Hu et al., 2017)
  5. Encoder-decoder models (Kim and Rush, 2016; Sennrich et al., 2016; Xia et al., 2017).
  6. The most similar research to this work is by Kolomiyets et al. (2011) and Fadaee et al. (2017). Fadaee et al. used this method to address the rare-word problem in machine translation.

Contribution of this paper
They proposed a way to generate context-based substitutes, overcoming the drawbacks of plain synonym replacement.
They added a label parameter to the conditional probability to avoid label flipping.

Proposed Method
They proposed an LM that calculates the probability of a word in a specific position based on its context, where the context is the sequence of words surrounding that position. They used a bidirectional LSTM-RNN to encode the context and choose relevant words from the vocabulary.
There is also a risk of class-label flipping: since substitutions are suggested for every word in a sentence, the class label may change. For instance, the sentence "all actors are fantastic" could be altered to "no actors are fantastic", which would change the meaning and, consequently, the class label. To prevent this, they used a label-conditioning technique.
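The label-conditioning idea can be illustrated with a toy conditional distribution p(w | context, y): the same context yields different substitutes depending on the class label y, so a positive sentence is never augmented toward negative words. The table and function below are hypothetical illustrations, not the paper's actual label-conditional bi-LSTM:

```python
# Toy illustration of label-conditioned substitution: candidates depend
# on both the local context and the class label y. Table contents are
# invented for illustration.
LABEL_TABLE = {
    # (previous word, label) -> candidate substitutes, best first
    ("are", "positive"): ["fantastic", "funny", "great"],
    ("are", "negative"): ["dull", "boring", "tired"],
}


def substitutes(prev_word, label, k=2):
    """Stub for the top-k words under p(w | context, y)."""
    return LABEL_TABLE.get((prev_word, label), [])[:k]


print(substitutes("are", "positive"))  # substitutes that keep a positive label
print(substitutes("are", "negative"))  # substitutes that keep a negative label
```

In the actual model, the label y is fed into the LM as an extra conditioning input, so that the predicted distribution over substitutes shifts with the label instead of being filtered by a fixed table.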

Experiments
Datasets:

  1. SST-2 and SST-5 (sentiment classification on movie reviews)
  2. SUBJ (subjectivity dataset): whether a sentence is subjective or objective
  3. MPQA: opinion polarity detection of short phrases
  4. RT: movie review sentiment analysis
  5. TREC: classification of question types

Models:

  1. LSTM-RNN
  2. CNN

Implementation
https://github.com/pfnet-research/

Gaps of the work:
This work may face limitations with low-resource languages and biases inherited from pretrained models. Additionally, it may struggle with complex sentences, since substituted words can change the structure dramatically.
I am doubtful that this paradigmatically related substitution can keep producing proper sentences in every domain.

@Sepideh-Ahmadian Sepideh-Ahmadian added the literature-review label Sep 12, 2024
@Sepideh-Ahmadian Sepideh-Ahmadian self-assigned this Sep 12, 2024
@hosseinfani
Member

@Sepideh-Ahmadian
Why "I am doubtful that this paradigmatically related substitution can keep producing proper sentences in every domain"? Like what other domains?

@Sepideh-Ahmadian
Member Author

Sure @hosseinfani,
Consider analyzing cancer-related data in the medical domain. My concern is the following: take the sentence "The tumor is benign" and its augmented version "The tumor is harmless". The word benign has a specific meaning in this context. Although the word harmless might appear in medical descriptions (such as a CT scan report), if the model suggests harmless as the augmented version of benign, it could confuse the model during a classification task, as the model may fail to understand the specific terminology of the domain.
