We present a dataset where the tweets are labeled directly by their authors. To our knowledge, this is the first attempt to create noise-free examples of intended sarcasm.
We investigate how third-party annotators perceive the sarcastic intention of the authors reflected in iSarcasm. These labels capture the intention of the authors with an F-score of 0.616, indicating that sarcasm is a subjective phenomenon that is challenging even for humans to detect.
iSarcasm contains 4,484 English tweets, each with an associated intended sarcasm label provided by its author, with a ratio of roughly 1:5 of sarcastic to non-sarcastic labels.
The way sarcasm is defined varies greatly across socio-cultural backgrounds. Dress et al. (2008) notice that members of collectivist cultures tend to express sarcasm in a more subtle way than individualists. They also point out gender differences. Those who identify as females seem to have a more self-deprecating attitude when using sarcasm.
Rockwell and Theriot (2001) find some cultures to associate sarcasm with humour more than others. There also seem to be cultures who do not use sarcasm at all. An utterance that is intended sarcastic by its author might not be perceived as such by the audience.
Two methods were used so far to label texts for sarcasm: distant supervision and manual labelling.
Text-based models, contextual models.
The hashtags might not mark sarcasm, but constitute the subject or object of the conversation --> false positives.
The intended sarcasm that they do capture might be of a specific flavor, either when tweeted by a specific account or when carrying a given tag. Building a model trained on this dataset might, therefore, be biased to a flavour of sarcasm, being unable to capture other flavours, increasing the risk of false negatives.
Finally, if a text does not contain the predefined tags, it is considered non-sarcastic.
Sarcasm perception varies across socio-cultural contexts.
Joshi et al. (2016a) provide more insight into this problem on the Riloff dataset. They present the dataset, initially labelled by Americans, to be labelled by Indians who are trained linguists. They find higher disagreement between Indian and American annotators, than between annotators of the same nationality.
As we saw, both labelling methods use a proxy for detecting intended sarcasm, in the form of predefined tags, predefined sources, or third-party annotators. They may create noisy labels, in terms of both false positives and false negatives.
In a survey, twitter users had to submit one sarcastic tweet from them + 3 non-sarcastic, just text and no retweets. Users had to explain why it was sarcastic and what they would say to convey the same message non-sarcastically. we asked for their age, gender, birth country and region, and current country and region.
We asked a human trained linguist to label each sarcastic tweet as belonging to one of the categories of ironic speech:
- sarcasm: tweets that contradict a knowable state of affairs and are critical towards an addressee;
- irony: tweets that contradict a knowable state of affairs but are not obviously critical towards an addressee;
- satire
- understatement
- overstatement
- rhetorical question
- invalid
Our aim is to estimate the human performance in detecting sarcasm as intended by the authors. Previous studies have shown gender and country to be the variables most influential on this definition.
We received 1.2k responses, about 52% from the UK, 32% from US, and others from Canada, Australia. 48 females vs. 52 males, 68% less than 35 years old.
After substracting the invalid tweets as marked by the linguist, we end up with 777 sarcastic vs. 3.7k non-sarcastic tweets.
Further, the explanations, rephrases, and user demographic information, might prove invaluable for future modelling and analysis purposes. As such, we are happy to provide all these to researchers who contact us, under an agreement to protect the privacy of the contributors according to the consent form.
Others tweets that only humans understand seem to allow more possible interpretations. One example is “I’m buzzing to get back to my double workouts tomorrow”. Depending on the background of the person reading, it might be perceived as sarcastic (e.g. by a sedentary person) or non-sarcastic (e.g. by an athlete).