
Sentiment Analysis of Reviews

Table of Contents

Project Overview

Data Sources

Data Description

Tools

EDA Steps

Data Preprocessing Steps and Inspiration

Graphs/Visualizations

Choosing the Algorithm for the Project

Assumptions

Model Evaluation Metrics

Results

Recommendations

Limitations

Future Possibilities of the Project

References

Project Overview

The primary goal of this project is to effectively analyze customer reviews to understand the sentiment and quality perception of products based on user-generated content. The analysis aims to identify patterns and trends in the data that provide insights into customer satisfaction and product quality. Additionally, the project seeks to classify each review based on the sentiment expressed for each product, aiding in the qualitative assessment of feedback.

Data Sources

The primary dataset used for this analysis contains detailed information about product reviews, including text reviews, ratings, and other metadata.

Reviews Dataset

Data Description

The dataset consists of 568,411 rows and 10 columns, including unique identifiers for reviews, products, and users, as well as textual data for reviews and summaries. The columns are:

  1. Id: Unique identifier for each review.
  2. ProductId: Unique identifier for the product being reviewed.
  3. UserId: Unique identifier for the user who wrote the review.
  4. ProfileName: Name of the user profile.
  5. HelpfulnessNumerator: Number of users who found the review helpful.
  6. HelpfulnessDenominator: Number of users who indicated whether they found the review helpful or not.
  7. Score: Rating given to the product by the reviewer.
  8. Time: Timestamp when the review was posted.
  9. Summary: Summary of the review.
  10. Text: Full text of the review.
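
For orientation, a minimal snippet to load and inspect the dataset with pandas (the file name `Reviews.csv` is an assumption; adjust the path to wherever the dataset is stored):

```python
import pandas as pd

df = pd.read_csv('Reviews.csv')   # file name/path is an assumption
print(df.shape)                   # expect (568411, 10) per the description above
print(df.dtypes)                  # 'Time' arrives as int64 (Unix epoch seconds)
print(df.head())
```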


Tools

Libraries

The project uses the standard Python data-science and NLP stack: pandas and NumPy for data handling; NLTK for tokenization, stop-word removal, lemmatization, and the Sentiment Intensity Analyzer; scikit-learn for vectorization (Count Vectorizer, TF-IDF) and classification (Logistic Regression, Naive Bayes); and Matplotlib and WordCloud for visualization. Installation commands are shown below.
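
Assuming a standard Python 3 environment, the packages implied by the rest of this README can be installed as follows (the exact package set is inferred, not taken from a requirements file):

```
pip install pandas numpy scikit-learn nltk matplotlib wordcloud
```

The NLTK resources used in the snippets below must be downloaded once:

```python
import nltk
nltk.download('punkt')          # tokenizer models
nltk.download('stopwords')      # stop-word lists
nltk.download('wordnet')        # lemmatizer dictionary
nltk.download('vader_lexicon')  # Sentiment Intensity Analyzer lexicon
```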

EDA Steps

Exploratory Data Analysis (EDA) involved exploring the reviews data to answer key questions, such as:

  1. What is the distribution of scores?
  2. How do review lengths vary?
  3. What are the common themes in positive and negative reviews?
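
A sketch of how these questions might be answered, assuming the DataFrame `df` loaded earlier (the score thresholds splitting positive from negative reviews are illustrative):

```python
# 1. Distribution of scores
print(df['Score'].value_counts().sort_index())

# 2. Variation in review lengths (characters)
print(df['Text'].str.len().describe())

# 3. Common words in positive (Score >= 4) vs. negative (Score <= 2) reviews
from collections import Counter
for label, subset in [('positive', df[df['Score'] >= 4]),
                      ('negative', df[df['Score'] <= 2])]:
    words = Counter(' '.join(subset['Text'].astype(str)).lower().split())
    print(label, words.most_common(10))
```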

Data Preprocessing Steps and Inspiration

  1. Data Cleaning:

Handling Missing Values: Any missing values are identified and removed to ensure the quality of the data.
Removing Duplicates: Duplicate entries are checked for and removed to ensure the uniqueness of data points.
Consistency Checks: Helpfulness numerators are verified not to exceed denominators, and text data is standardized for uniformity.

  2. Data Transformation:

Converting Data Types: The 'Time' column is converted from 'int64' (Unix epoch seconds) to 'datetime'.
Feature Engineering: New features such as helpfulness ratio, text length, and summary length are generated.

  3. Text Preprocessing (a code sketch follows this list):

  1. Tokenization: Breaking down the text into individual words or tokens.
  2. Stop Words Removal: Eliminating common words that offer little value for analysis.
  3. Lemmatization: Converting words into their base form.
  4. Vectorization: Transforming text data into numerical format using techniques like Count Vectorization and TF-IDF.
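
A minimal sketch of the cleaning, transformation, and text-preprocessing steps, assuming the DataFrame `df` loaded earlier; column names follow the data description, while the function name and exact rules are illustrative. Vectorization (Count/TF-IDF) is shown with the models further below.

```python
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Cleaning: drop missing values and duplicates, enforce consistency
df = df.dropna().drop_duplicates()
df = df[df['HelpfulnessNumerator'] <= df['HelpfulnessDenominator']]

# Transformation: epoch seconds -> datetime, plus engineered features
df['Time'] = pd.to_datetime(df['Time'], unit='s')
df['HelpfulnessRatio'] = df['HelpfulnessNumerator'] / df['HelpfulnessDenominator'].replace(0, 1)
df['TextLength'] = df['Text'].str.len()
df['SummaryLength'] = df['Summary'].str.len()

# Text preprocessing: tokenize, drop stop words and non-alphabetic tokens, lemmatize
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(str(text).lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)

df['CleanText'] = df['Text'].apply(preprocess)  # slow on the full dataset; illustrative
```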

Inspiration for Data Preprocessing

The inspiration for the specific preprocessing steps comes from typical challenges encountered in natural language processing and sentiment analysis tasks, particularly noise reduction, dimensionality reduction, and bias removal.

Graphs/Visualizations

Distribution of Text Length

Distribution of Summary Length

Distribution of Review Scores

Distribution of Helpfulness Ratio

Scatter Plot of Demand vs. Rating (Score)

Word Cloud - Good Reviews

Word Cloud - Bad Reviews
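
The two word clouds can be reproduced with the wordcloud package, roughly as follows; the score thresholds for "good" and "bad" reviews are assumptions, and `CleanText` comes from the preprocessing sketch above:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

for title, subset in [('Good Reviews', df[df['Score'] >= 4]),
                      ('Bad Reviews', df[df['Score'] <= 2])]:
    text = ' '.join(subset['CleanText'].astype(str))
    wc = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Word Cloud - {title}')
    plt.show()
```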

Choosing the Algorithm for the Project

  1. Logistic Regression - TF-IDF: Uses term frequency-inverse document frequency (TF-IDF) to weigh words by their importance, with logistic regression for binary classification, providing a balance of interpretability and performance.

  2. Naive Bayes - Count Vectorizer: Uses a count vectorizer to transform text data into token counts and applies Naive Bayes for probabilistic classification; effective for large datasets and for capturing word frequency.

  3. Logistic Regression - Count Vectorizer: Combines a count vectorizer for token counts with logistic regression to predict sentiment; suitable for linear relationships and high-dimensional data.

  4. Naive Bayes - TF-IDF: Employs TF-IDF to emphasize important words and Naive Bayes for classification, balancing word importance and probabilistic predictions.

  5. NLTK SIA Polarity Scores: Uses NLTK's Sentiment Intensity Analyzer (VADER) to assess sentiment polarity scores directly, offering a simple and fast lexicon-based approach.

A sketch of these combinations appears after this list.
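
A minimal sketch of the four vectorizer/classifier pipelines plus the NLTK baseline, using scikit-learn; the binary labeling rule (Score >= 4 positive, Score <= 2 negative, 3 dropped) and the train/test split are assumptions, not taken from the project code:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Binary labels: 1 = positive (4-5), 0 = negative (1-2); neutral (3) dropped
labeled = df[df['Score'] != 3]
y = (labeled['Score'] >= 4).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    labeled['CleanText'], y, test_size=0.2, random_state=42)

pipelines = {
    'Logistic Regression - TF-IDF': make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    'Naive Bayes - Count Vectorizer': make_pipeline(CountVectorizer(), MultinomialNB()),
    'Logistic Regression - Count Vectorizer': make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)),
    'Naive Bayes - TF-IDF': make_pipeline(TfidfVectorizer(), MultinomialNB()),
}
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    print(name, pipe.score(X_test, y_test))

# Lexicon-based baseline: NLTK's Sentiment Intensity Analyzer (VADER)
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("This product is great!"))  # compound > 0 => positive
```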

Assumptions

  1. Independence of Features: Assuming that words are independent of each other.
  2. Linear Relationships: Assuming linear separability of sentiment based on word presence.
  3. Text Preprocessing Decisions: Assuming preprocessing steps adequately capture important features.
  4. Quality and Completeness of Data: Assuming the dataset accurately represents the population of interest.
  5. Sentiment Labeling Accuracy: Assuming sentiment labels are correct.

Model Evaluation Metrics

  1. Accuracy: Measures the proportion of total predictions that were correct.
  2. Precision: Measures the accuracy of positive predictions.
  3. Recall (Sensitivity): Measures the ability to find all relevant cases within a dataset.
  4. F1 Score: The harmonic mean of precision and recall.
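
Given predictions from any of the pipelines above, scikit-learn computes all four metrics in one call; a sketch, continuing the illustrative names from earlier:

```python
from sklearn.metrics import accuracy_score, classification_report

best = pipelines['Logistic Regression - Count Vectorizer']
y_pred = best.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))
```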

Results

Breakdown of Each Model's Performance

  1. Logistic Regression - TF-IDF: Accuracy: 91.29%; Precision: 0.84 (negative), 0.93 (positive); Recall: 0.73 (negative), 0.96 (positive); F1-Score: 0.79 (negative), 0.95 (positive)

  2. Naive Bayes - Count Vectorizer: Accuracy: 89.42%; Precision: 0.77 (negative), 0.93 (positive); Recall: 0.73 (negative), 0.94 (positive); F1-Score: 0.75 (negative), 0.93 (positive)

  3. Logistic Regression - Count Vectorizer: Accuracy: 91.64%; Precision: 0.84 (negative), 0.93 (positive); Recall: 0.76 (negative), 0.96 (positive); F1-Score: 0.80 (negative), 0.95 (positive)

  4. Naive Bayes - TF-IDF: Accuracy: 85.38%; Precision: 0.90 (negative), 0.85 (positive); Recall: 0.37 (negative), 0.99 (positive); F1-Score: 0.52 (negative), 0.91 (positive)

  5. NLTK SIA Polarity Scores: Accuracy: 81.97%; Precision: 0.74 (negative), 0.83 (positive); Recall: 0.26 (negative), 0.97 (positive); F1-Score: 0.39 (negative), 0.89 (positive)

Models Accuracy


| Model | Accuracy | Precision (Negative) | Precision (Positive) | Recall (Negative) | Recall (Positive) | F1-Score (Negative) | F1-Score (Positive) |
|---|---|---|---|---|---|---|---|
| Logistic Regression - TF-IDF | 91.29% | 0.84 | 0.93 | 0.73 | 0.96 | 0.79 | 0.95 |
| Naive Bayes - Count Vectorizer | 89.42% | 0.77 | 0.93 | 0.73 | 0.94 | 0.75 | 0.93 |
| Logistic Regression - Count Vectorizer | 91.64% | 0.84 | 0.93 | 0.76 | 0.96 | 0.80 | 0.95 |
| Naive Bayes - TF-IDF | 85.38% | 0.90 | 0.85 | 0.37 | 0.99 | 0.52 | 0.91 |
| NLTK SIA Polarity Scores | 81.97% | 0.74 | 0.83 | 0.26 | 0.97 | 0.39 | 0.89 |

Balanced Performance: Logistic Regression - Count Vectorizer stands out as the best model, with the highest accuracy (91.64%) and balanced precision and recall across both classes.

Recommendations

  1. Implement targeted improvements based on feedback from reviews.
  2. Use sentiment analysis results to guide product development and marketing strategies.
  3. Continuously update and refine the models with new data for improved accuracy.

Limitations

  1. Data Quality: Potential inaccuracies due to underreporting or the subjective nature of reviews.
  2. Model Limitations: Models may not capture all nuances of sentiment in reviews.
  3. External Factors: Other factors not included in the analysis can impact sentiment.

Future Possibilities of the Project

  1. Advanced Predictive Modeling: Explore advanced NLP models, such as LSTMs and transformer-based architectures (e.g., BERT, RoBERTa), for enhanced accuracy.
  2. Product-Specific Analysis: Conduct detailed analysis for each product category to uncover unique patterns and tailor models to individual product characteristics.
  3. External Factors Integration: Incorporate additional context, such as product metadata, pricing, and regional factors, for a more comprehensive analysis.

References

  1. Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O’Reilly Media, Inc.
  2. Jurafsky, D., & Martin, J. H. (2019). Speech and Language Processing (3rd ed.).