Big dataset of unsolicited reviews found #98

Mahmoud-s-programs · 2024-11-04T00:36:54Z

I have written a Python script that filters out reviews containing explicit terms and stores the remaining data in a new file with the same format. As a result, the number of reviews dropped from 63000 to 11000. I checked the results and, sure enough, the remaining reviews do not contain any of the explicit terms specified in the script, including forms that carry prefixes or suffixes.

My issue is that some of the terms are not aspects, as noted by Dr. Fani. Here is the list of terms, with English translations, whose presence caused a review to be deleted:
كتاب - book
كتب - books
مؤلف - author (male)
كاتب - writer (male)
مؤلفة - author (female)
كاتبة - writer (female)
رواية - novel
روايات - novels
قصة - story
قصص - stories
حكاية - tale
حكايات - tales
مجلد - volume
جزء - part or section
فصول - chapters
فصل - chapter
شخصية - character
بطل - hero or protagonist (male)
بطلة - heroine or protagonist (female)
أبطال - heroes or protagonists
عدو - enemy or antagonist
أعداء - enemies or antagonists
صديق - friend (male)
صديقة - friend (female)
أصدقاء - friends
حبكة - plot
حدث - event
أحداث - events
نهاية - ending or conclusion
بداية - beginning or start
ذروة - climax
حل - resolution or solution
عقدة - conflict or knot (in a story)
أسلوب - style
لغة - language
تعبير - expression
وصف - description
سرد - narration
حوار - dialogue
كلمة - word
جملة - sentence
مفردات - vocabulary
مصطلحات - terminology or terms
As you can see, some of the terms may be implicit or aspects that would affect the review.
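
To help decide which terms to keep, it may be useful to see how many reviews each term removes on its own. The following is only a minimal sketch: the function name `count_reviews_per_term` and the toy reviews are assumptions for illustration, not part of the script or the dataset, and it uses plain substring matching.

```python
from collections import Counter

def count_reviews_per_term(reviews, terms):
    """Return, for each term, the number of reviews that contain it as a substring."""
    counts = Counter()
    for review in reviews:
        for term in terms:
            if term in review:
                counts[term] += 1
    return counts

# Toy data for illustration only (not taken from the dataset)
sample_reviews = [
    'كتاب رائع ونهاية جميلة',   # contains كتاب and نهاية
    'رواية مملة',               # contains رواية
    'لم يعجبني',                # contains none of the terms
]
sample_terms = ['كتاب', 'رواية', 'نهاية']

term_counts = count_reviews_per_term(sample_reviews, sample_terms)
for term, n in term_counts.most_common():
    print(term, n)
```

Terms with very high counts that are not real aspects would be the first candidates to drop from the list.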

hosseinfani · 2024-11-04T06:40:43Z

@Mahmoud-s-programs
Thank you.
You can also paste your code here (as text, not an image), along with the link to the dataset's git repository, etc.

@Sepideh-Ahmadian
Do you think we can assume that these reviews contain implicit aspect terms, since we can remove the book name (from the book id) from the review text?
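
The book-name removal mentioned above could be sketched roughly as follows. This assumes a lookup from book id to title is available; `book_titles` and `remove_book_title` are hypothetical names, and the id and title below are made up for illustration.

```python
import re

def remove_book_title(review, book_id, book_titles):
    """Remove the title of the reviewed book from the review text, if the title is known."""
    title = book_titles.get(book_id)
    if not title:
        return review
    # Replace each occurrence of the title and collapse leftover whitespace
    without_title = review.replace(title, ' ')
    return re.sub(r'\s+', ' ', without_title).strip()

# Hypothetical lookup table from book id to title (illustrative values only)
book_titles = {101: 'الأيام'}

cleaned = remove_book_title('قرأت الأيام مرتين', 101, book_titles)
print(cleaned)
```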

Mahmoud-s-programs · 2024-11-05T01:32:35Z

Dataset link: https://github.com/mohamedadaly/LABR/tree/master/data (file name: reviews.tsv)

code:

import pandas as pd
import re
#Loading the dataset
column_names = ['rating', 'review_id', 'user_id', 'book_id', 'review']
df = pd.read_csv('arabic_reviews.tsv', sep='\t', names=column_names, encoding='utf-8')
#Dropping missing reviews before casting to str; otherwise NaN becomes the string 'nan' and is never dropped
df.dropna(subset=['review'], inplace=True)
df['review'] = df['review'].astype(str)

#Defining and normalizing explicit terms
explicit_terms = [
    'كتاب', 'كتب', 'مؤلف', 'كاتب', 'مؤلفة', 'كاتبة', 'رواية', 'روايات', 'قصة', 'قصص',
    'حكاية', 'حكايات', 'مجلد', 'جزء', 'فصول', 'فصل',
    'شخصية', 'بطل', 'بطلة', 'ابطال', 'عدو', 'اعداء', 'صديق', 'صديقة', 'اصدقاء',
    'حبكة', 'حدث', 'احداث', 'نهاية', 'بداية', 'ذروة', 'حل', 'عقدة',
    'اسلوب', 'لغة', 'تعبير', 'وصف', 'سرد', 'حوار', 'كلمة', 'جملة', 'مفردات', 'مصطلحات', 'مسرحية'
]

def normalize_arabic(text):
    # Normalizing certain letters
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "و", text)
    text = re.sub("ئ", "ي", text)
    text = re.sub("ة", "ه", text)
    text = re.sub("گ", "ك", text)
    return text

explicit_terms_normalized = [normalize_arabic(term) for term in explicit_terms]

#Creating regex patterns for explicit terms with optional prefixes and suffixes
prefixes = ['ال', 'و', 'ف', 'ب', 'ك', 'ل', 'لل', 'س']  # Common Arabic prefixes
# Common Arabic suffixes ('ة' is not listed because normalize_arabic maps it to 'ه')
suffixes = ['ه', 'ها', 'ك', 'ي', 'هم', 'نا', 'كم', 'هن', 'تن', 'ا', 'ات', 'ون', 'ين', 'ان']

def create_regex_pattern(term):
    prefix_pattern = '(?:' + '|'.join(prefixes) + ')?'  # Optional prefix
    suffix_pattern = '(?:' + '|'.join(suffixes) + ')?'  # Optional suffix
    return prefix_pattern + re.escape(term) + suffix_pattern

# Note: the pattern has no word-boundary anchors, so search() also matches a term
# that occurs inside a longer word; the optional affix groups do not narrow the match.
regex_patterns = [create_regex_pattern(term) for term in explicit_terms_normalized]
combined_pattern = '|'.join(regex_patterns)
compiled_pattern = re.compile(combined_pattern)

#Filtering out reviews containing explicit terms
def contains_explicit_term(review):
    review_normalized = normalize_arabic(review)
    return bool(compiled_pattern.search(review_normalized))

mask = ~df['review'].apply(contains_explicit_term)
df_filtered = df[mask].reset_index(drop=True)

#Storing the filtered dataset
df_filtered.to_csv('arabic_reviews_filtered.tsv', sep='\t', index=False, header=False, encoding='utf-8')
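
One thing that may be worth checking in the script above: the pattern is applied with `search()` and has no word-boundary anchors, so a short term such as حل can also match inside an unrelated word, which could remove more reviews than intended. Below is a minimal sketch of a bounded variant. The shortened affix lists, the two sample terms, and the name `create_bounded_pattern` are assumptions for illustration only, and whether stricter matching is actually wanted here is an open question.

```python
import re

# Assumption for this sketch: short affix lists and two terms, for illustration only
prefixes = ['ال', 'و', 'ب', 'ل']
suffixes = ['ه', 'ها', 'ات']
terms = ['حل', 'كتاب']

def create_bounded_pattern(term):
    prefix_pattern = '(?:' + '|'.join(prefixes) + ')?'
    suffix_pattern = '(?:' + '|'.join(suffixes) + ')?'
    # (?<!\w) and (?!\w) require that the match is not embedded in a longer word
    return r'(?<!\w)' + prefix_pattern + re.escape(term) + suffix_pattern + r'(?!\w)'

bounded = re.compile('|'.join(create_bounded_pattern(t) for t in terms))
unbounded = re.compile('|'.join(re.escape(t) for t in terms))

text = 'رحله جميله'  # "a beautiful trip" after normalization; حل occurs only inside رحله
print(bool(unbounded.search(text)))            # substring match on the embedded term
print(bool(bounded.search(text)))              # no whole-word match
print(bool(bounded.search('الكتاب ممتع')))     # prefixed form of كتاب as a whole word
```

Comparing the number of reviews kept under both variants would show how much of the drop from 63000 to 11000 comes from embedded matches.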
