Big dataset of unsolicited reviews found #98

Open
Mahmoud-s-programs opened this issue Nov 4, 2024 · 2 comments

@Mahmoud-s-programs

I have written a Python script that removes every review containing an explicit term and stores the remaining data in a new file with the same format. As a result, the number of reviews dropped from 63000 to 11000. I checked the results and, sure enough, the remaining reviews do not contain any of the explicit terms specified in the script, including forms that carry prefixes or suffixes.
[Image: filtering_explicits]

My issue is that some of the terms are not aspects, as noted by Dr. Fani. Here is the list of terms, with English glosses, whose presence caused a review to be deleted:
كتاب - book
كتب - books
مؤلف - author (male)
كاتب - writer (male)
مؤلفة - author (female)
كاتبة - writer (female)
رواية - novel
روايات - novels
قصة - story
قصص - stories
حكاية - tale
حكايات - tales
مجلد - volume
جزء - part or section
فصول - chapters
فصل - chapter
شخصية - character
بطل - hero or protagonist (male)
بطلة - heroine or protagonist (female)
أبطال - heroes or protagonists
عدو - enemy or antagonist
أعداء - enemies or antagonists
صديق - friend (male)
صديقة - friend (female)
أصدقاء - friends
حبكة - plot
حدث - event
أحداث - events
نهاية - ending or conclusion
بداية - beginning or start
ذروة - climax
حل - resolution or solution
عقدة - conflict or knot (in a story)
أسلوب - style
لغة - language
تعبير - expression
وصف - description
سرد - narration
حوار - dialogue
كلمة - word
جملة - sentence
مفردات - vocabulary
مصطلحات - terminology or terms
As you can see, some of these terms may be implicit rather than explicit, or may be aspects that affect the review.
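One way to judge which terms are worth keeping in the list is to count how many reviews each term matches on its own. The following is a minimal sketch, assuming the reviews are available as a list of strings; the helper name `reviews_hit_per_term` and the sample reviews are illustrative and do not come from the dataset or the script.

```python
from collections import Counter

def reviews_hit_per_term(reviews, terms):
    """Count, for each term, how many reviews contain it as a substring."""
    counts = Counter()
    for review in reviews:
        for term in terms:
            if term in review:
                counts[term] += 1
    return counts

# Illustrative sample data, not taken from the dataset
sample_reviews = ["كتاب رائع", "قصة جميلة في كتاب قصير", "ممتع جدا"]
sample_terms = ["كتاب", "قصة", "حوار"]

counts = reviews_hit_per_term(sample_reviews, sample_terms)
for term in sample_terms:
    print(term, counts[term])
```

Terms that account for a large share of the removed reviews but are doubtful as aspects are the natural candidates to review first.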

@hosseinfani
Member

hosseinfani commented Nov 4, 2024

@Mahmoud-s-programs
Thank you.
You can also paste your code here (as text, not as an image), along with the link to the dataset's git repository, etc.

@Sepideh-Ahmadian
Do you think we can assume that these reviews contain implicit aspect terms, given that we can remove the book name (looked up from the book id) from the review text?
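Removing the book name from the review text could be sketched as below. This is only an illustration under assumptions: the script's columns include `book_id` but no title, so the `book_titles` lookup table, its id `"123"`, and the helper name `strip_book_title` are all hypothetical.

```python
import re

def strip_book_title(review, title):
    """Remove every occurrence of the book title from the review text."""
    if not title:
        return review
    # re.escape keeps any regex metacharacters in the title literal
    cleaned = re.sub(re.escape(title), " ", review)
    # Collapse the whitespace left behind by the removal
    return re.sub(r"\s+", " ", cleaned).strip()

# Hypothetical mapping from book_id to title; it would have to be built separately
book_titles = {"123": "ثلاثية غرناطة"}

review = "قرأت ثلاثية غرناطة مرتين"
cleaned_review = strip_book_title(review, book_titles.get("123", ""))
print(cleaned_review)
```

In practice the title and the review would likely need the same letter normalization before matching, since reviewers do not always spell a title exactly as catalogued.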

@Mahmoud-s-programs
Author

Mahmoud-s-programs commented Nov 5, 2024

Dataset link: https://github.com/mohamedadaly/LABR/tree/master/data (filename: reviews.tsv)

code:

import re

import pandas as pd

# Loading the dataset (tab-separated, no header row)
column_names = ['rating', 'review_id', 'user_id', 'book_id', 'review']
df = pd.read_csv('arabic_reviews.tsv', sep='\t', names=column_names, encoding='utf-8')
# Drop missing reviews before casting to str; casting first would turn NaN into the string 'nan'
df.dropna(subset=['review'], inplace=True)
df['review'] = df['review'].astype(str)

#Defining and normalizing explicit terms
explicit_terms = [
    'كتاب', 'كتب', 'مؤلف', 'كاتب', 'مؤلفة', 'كاتبة', 'رواية', 'روايات', 'قصة', 'قصص',
    'حكاية', 'حكايات', 'مجلد', 'جزء', 'فصول', 'فصل',
    'شخصية', 'بطل', 'بطلة', 'ابطال', 'عدو', 'اعداء', 'صديق', 'صديقة', 'اصدقاء',
    'حبكة', 'حدث', 'احداث', 'نهاية', 'بداية', 'ذروة', 'حل', 'عقدة',
    'اسلوب', 'لغة', 'تعبير', 'وصف', 'سرد', 'حوار', 'كلمة', 'جملة', 'مفردات', 'مصطلحات', 'مسرحية'
]

def normalize_arabic(text):
    # Normalizing certain letters
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "و", text)
    text = re.sub("ئ", "ي", text)
    text = re.sub("ة", "ه", text)
    text = re.sub("گ", "ك", text)
    return text

explicit_terms_normalized = [normalize_arabic(term) for term in explicit_terms]

# Creating regex patterns for explicit terms with prefixes and suffixes
prefixes = ['ال', 'و', 'ف', 'ب', 'ك', 'ل', 'لل', 'س']  # Common Arabic prefixes
# Common Arabic suffixes; 'ة' is not listed because normalize_arabic maps it to 'ه'
suffixes = ['ه', 'ها', 'ك', 'ي', 'هم', 'نا', 'كم', 'هن', 'تن', 'ا', 'ات', 'ون', 'ين', 'ان']

def create_regex_pattern(term):
    prefix_pattern = '(?:' + '|'.join(prefixes) + ')?'  # Optional prefixes
    suffix_pattern = '(?:' + '|'.join(suffixes) + ')?'  # Optional suffixes
    # Note: the pattern is not anchored to word boundaries, so it also matches
    # the bare term when it occurs inside a longer word.
    return prefix_pattern + re.escape(term) + suffix_pattern

regex_patterns = [create_regex_pattern(term) for term in explicit_terms_normalized]
combined_pattern = '|'.join(regex_patterns)
compiled_pattern = re.compile(combined_pattern)

# Filtering out reviews containing explicit terms
def contains_explicit_term(review):
    review_normalized = normalize_arabic(review)
    return bool(compiled_pattern.search(review_normalized))

mask = ~df['review'].apply(contains_explicit_term)
df_filtered = df[mask].reset_index(drop=True)

#Storing the filtered dataset
df_filtered.to_csv('arabic_reviews_filtered.tsv', sep='\t', index=False, header=False, encoding='utf-8')
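Because the prefix and suffix groups in the script are optional and the pattern has no boundaries, each term is effectively matched as a substring, so a short term such as 'حل' also matches inside unrelated words. This may explain part of the drop from 63000 to 11000 reviews. The following is a minimal sketch of whole-word matching, not the script's actual behavior: the function name `build_word_pattern`, the choice to allow up to two stacked prefixes, and the test strings are assumptions, and the input is expected to be already normalized.

```python
import re

PREFIXES = ['ال', 'و', 'ف', 'ب', 'ك', 'ل', 'لل', 'س']
SUFFIXES = ['ه', 'ها', 'ك', 'ي', 'هم', 'نا', 'كم', 'هن', 'تن', 'ا', 'ات', 'ون', 'ين', 'ان']

def build_word_pattern(terms):
    """Compile a pattern that matches a term only as a whole word,
    optionally wrapped in known prefixes and one known suffix."""
    # Assumption: allow up to two stacked prefixes, e.g. 'و' followed by 'ال'
    prefix = '(?:' + '|'.join(PREFIXES) + '){0,2}'
    suffix = '(?:' + '|'.join(SUFFIXES) + ')?'
    body = '(?:' + '|'.join(re.escape(t) for t in terms) + ')'
    # The lookarounds require that no word character touches the match
    return re.compile(r'(?<!\w)' + prefix + body + suffix + r'(?!\w)')

pattern = build_word_pattern(['حل', 'كتب'])

for text in ['الحل بسيط', 'وحلها', 'رحله جميله', 'مكتبه', 'حلم']:
    print(text, bool(pattern.search(text)))
```

With this pattern the first two strings match, while 'رحله', 'مكتبه', and 'حلم' do not, because the term sits inside a longer word there. Whether such stricter matching is what the task needs is a design decision for the maintainers.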
