Rauh dictionary wrong negated forms #24

Open
Astelix opened this issue Mar 25, 2019 · 8 comments

@Astelix

Astelix commented Mar 25, 2019

In the German "data_dictionary_Rauh", the negated forms should be "nicht ..." instead of "not ...". For nominalized forms ending in "...ung", it should be "keine".

@kbenoit
Owner

kbenoit commented Mar 25, 2019

Thanks @Astelix! @stefan-mueller want to verify and fix?

@stefan-mueller
Collaborator

Thanks! I am aware of this, but the original dictionary indicates negations through "not" in the categories neg_negative and neg_positive. Changing the forms to "nicht" or "keine" would therefore also mean changing the entries of the original dictionary; otherwise, negations will not be detected. I am not sure whether we should touch the dictionary entries. What do you think?

library(quanteda.dictionaries)

head(data_dictionary_Rauh$neg_positive, 15)
#>  [1] "not aalen"             "not abbauwürdig"      
#>  [3] "not abfangschirm"      "not abgefahren"       
#>  [5] "not abgeheilt"         "not abgehend"         
#>  [7] "not abgeklärtheit"     "not abgelagert"       
#>  [9] "not abgemacht"         "not abgeschlossenheit"
#> [11] "not abgesichert"       "not abgestimmt"       
#> [13] "not abgeworben"        "not abgleich"         
#> [15] "not abgleichen"

@Astelix
Author

Astelix commented Mar 25, 2019

From the original dictionary:

      pattern                                      replacement     feature         sentiment
1999  (nicht|nichts|kein|keine|keinen) aalen       NOT_aalen       NOT_aalen              -1
21164 (nicht|nichts|kein|keine|keinen) aalglatt    NOT_aalglatt    NOT_aalglatt            1
21165 (nicht|nichts|kein|keine|keinen) aasen       NOT_aasen       NOT_aasen               1
21166 (nicht|nichts|kein|keine|keinen) aasig       NOT_aasig       NOT_aasig               1
17540 (nicht|nichts|kein|keine|keinen) abandon     NOT_abandon     NOT_abandon             1
21167 (nicht|nichts|kein|keine|keinen) abarbeiten  NOT_abarbeiten  NOT_abarbeiten          1
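
For illustration, here is a rough base-R sketch of how pattern/replacement rows like the ones above could be applied to raw text. The `rules` data frame and the `apply_negation()` helper are hypothetical names made up for this example (they are not part of the original dictionary or of quanteda); only the two patterns are copied from the excerpt.

```r
# Hypothetical mini rule table: two rows copied from the excerpt above
rules <- data.frame(
  pattern = c("(nicht|nichts|kein|keine|keinen) aalen",
              "(nicht|nichts|kein|keine|keinen) abarbeiten"),
  replacement = c("NOT_aalen", "NOT_abarbeiten"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply every rule in turn as a regular expression
apply_negation <- function(text, rules) {
  for (i in seq_len(nrow(rules))) {
    # \\b keeps a rule from firing inside a longer word
    rx <- paste0("\\b", rules$pattern[i], "\\b")
    text <- gsub(rx, rules$replacement[i], text, perl = TRUE)
  }
  text
}

marked <- apply_negation("etwas nicht abarbeiten und etwas abarbeiten", rules)
print(marked)
```

Only the negated occurrence is rewritten to the compound marker; the plain "abarbeiten" is left for the ordinary dictionary entries to pick up.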

@kbenoit
Owner

kbenoit commented Mar 27, 2019

It would work as in the original dictionary if it were structured as a regular-expression dictionary. Unlike glob patterns, regular expressions would let us prefix each positive word with the possible negation words.

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
dicttest <- dictionary(list(neg_positive = c("^(nicht|nichts|kein|keine|keinen)$ ^abarbeiten$")))

txt <- c(
  "etwas nicht abarbeiten und etwas keine abarbeiten",
  "etwas abarbeiten und keinen abarbeiten"
)

tokens(txt) %>%
  tokens_lookup(dictionary = dicttest, valuetype = "regex", exclusive = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "etwas"        "NEG_POSITIVE" "und"          "etwas"       
## [5] "NEG_POSITIVE"
## 
## text2 :
## [1] "etwas"        "abarbeiten"   "und"          "NEG_POSITIVE"

We don't currently store a valuetype in the dictionary object class, but we do have an open issue for it (#1264). This would be a good argument for adding that attribute, so that the lookup functions would use it as the default rather than "glob". That would let us make sure that every dictionary is associated with the correct pattern matching type (valuetype).
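
To make the idea concrete, here is a minimal base-R sketch of a dictionary-like object that carries its own valuetype, with a lookup that defaults to it. This is not quanteda's API: `make_dict()` and `count_matches()` are hypothetical helpers, and a plain list stands in for the dictionary class.

```r
# Hypothetical constructor: a plain list with a "valuetype" attribute attached
make_dict <- function(entries, valuetype = c("glob", "regex")) {
  valuetype <- match.arg(valuetype)
  structure(entries, valuetype = valuetype)
}

# Hypothetical lookup: the default valuetype is read from the dictionary itself
count_matches <- function(tokens, dict, valuetype = attr(dict, "valuetype")) {
  vapply(dict, function(patterns) {
    # Convert glob patterns to anchored regular expressions when needed
    rx <- if (valuetype == "glob") utils::glob2rx(patterns) else patterns
    sum(vapply(rx, function(p) sum(grepl(p, tokens, perl = TRUE)), integer(1)))
  }, integer(1))
}

toks <- c("etwas", "abarbeiten", "und", "keinen")

regex_dict <- make_dict(list(positive = "^abarbeiten$"), valuetype = "regex")
glob_dict  <- make_dict(list(positive = "abarb*"), valuetype = "glob")

# No valuetype argument needed: each dictionary knows how to be matched
regex_hits <- count_matches(toks, regex_dict)
glob_hits  <- count_matches(toks, glob_dict)
```

The caller can still override `valuetype` explicitly, but the default travels with the dictionary, which is the behaviour the attribute would enable.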

@stefan-mueller
Collaborator

That would be a very elegant solution. I just asked Christian Rauh what he thinks about this idea.

@ChRauh

ChRauh commented Mar 27, 2019

Great to see interest in the dictionary, and thanks again for including it in your fantastic package!

On the issue: The dictionary is structured such that it matches valuetype = "regex". Thus (and also more generally), I'd consider a valuetype attribute in the dictionary object class very convenient from the user perspective.

Note, however, that I would still suggest first replacing the negation patterns in the original text with a compound marker such as "NOT_[token]" (maybe via tokens_replace()) before retrieving the dictionary counts via tokens_lookup() or dfm(). This makes a difference when aggregating the counts to some sentiment score.

For example, directly counting dictionary terms in the string 'nicht abarbeiten' would retrieve one negative and one negated negative hit. Yet having this replaced with 'NOT_abarbeiten' beforehand would retrieve only the negated negative hit.
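
A small base-R sketch of that difference, assuming simple whitespace tokenisation. The helper `count_plain_and_negated()` is hypothetical and only meant to show the double count, not how quanteda does it internally.

```r
negators <- "(nicht|nichts|kein|keine|keinen)"

# Hypothetical helper: count plain and negated hits for one term
count_plain_and_negated <- function(text, term, replace_first = FALSE) {
  if (replace_first) {
    # Stage 1: collapse "negator term" into the compound marker NOT_term
    rx <- paste0("\\b", negators, " ", term, "\\b")
    text <- gsub(rx, paste0("NOT_", term), text, perl = TRUE)
  }
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  bigrams <- if (length(tokens) > 1) paste(head(tokens, -1), tail(tokens, -1)) else character(0)
  c(
    plain = sum(tokens == term),
    # A negated hit is either a raw "negator term" bigram or a NOT_term marker
    negated = sum(grepl(paste0("^", negators, " ", term, "$"), bigrams)) +
      sum(tokens == paste0("NOT_", term))
  )
}

direct   <- count_plain_and_negated("nicht abarbeiten", "abarbeiten")
replaced <- count_plain_and_negated("nicht abarbeiten", "abarbeiten", replace_first = TRUE)
print(direct)
print(replaced)
```

Counting directly yields one plain and one negated hit for the same word, while replacing first leaves only the negated hit, so the two approaches give different aggregate scores.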

Hope this helps...

@kbenoit
Owner

kbenoit commented Mar 27, 2019

Thanks @ChRauh, that's a good point. It could be done in two stages:

dicttest <-
  dictionary(list(
    neg_positive = c("^(nicht|nichts|kein|keine|keinen)$ ^abarbeiten$"),
    positive = "^abarbeiten$"
  ))

txt <- c(
  "etwas nicht abarbeiten und etwas keine abarbeiten",
  "etwas abarbeiten und keinen abarbeiten"
)

dfm(txt, dictionary = dicttest, valuetype = "regex")
## Document-feature matrix of: 2 documents, 2 features (0.0% sparse).
## 2 x 2 sparse Matrix of class "dfm"
##        features
## docs    neg_positive positive
##   text1            2        2
##   text2            1        2

tokens(txt) %>%
  tokens_lookup(dicttest["neg_positive"], valuetype = "regex", exclusive = FALSE) %>%
  tokens_lookup(dicttest, valuetype = "regex", exclusive = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "etwas"        "NEG_POSITIVE" "und"          "etwas"       
## [5] "NEG_POSITIVE"
## 
## text2 :
## [1] "etwas"        "POSITIVE"     "und"          "NEG_POSITIVE"

@ChRauh

ChRauh commented Mar 27, 2019

@kbenoit Yes, 'piping' it in that order does the trick. Learned something, thanks!
This may also be a useful example for the help file, in which @stefan-mueller has already flagged the replacement issue.
