GitHub - recodehive/Scrape-ML: To generate new data for the Semi-supervised-sequence-learning-Project, we have written a Python script 💻 that fetches data 📊 from the IMDb website 🌐 and converts it into txt files.

IMDb Movie Review Scraping

Scraping movie reviews ✏️ using the Python programming language 💻.

🔍 Welcome to the IMDb Movie Review Scraper project! 🌟

This Python script scrapes movie reviews from IMDb to facilitate analysis and research. The IMDb Movie Review Scraping project aims to build a new dataset by automatically extracting these reviews. The dataset will support various natural language processing tasks, including sentiment analysis and recommendation systems. Reviews are collected with web scraping tools such as Beautiful Soup, preprocessed, and structured into a CSV format suitable for analysis, including Support Vector Machine classification. 📈

Features

Semi-supervised-sequence-learning-Project: the replication process is carried out here, and further analysis requires new data to be created.

Scraping Movie Reviews 🕵️‍♂️

Movie_review_imdb_scrapping.ipynb - The script fetches user reviews from IMDb, providing access to a diverse range of opinions and feedback for different movies. It utilizes BeautifulSoup, a powerful Python library for web scraping, to extract data from IMDb's web pages efficiently and accurately. 🎥🔎
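As a rough illustration of the extraction step, the sketch below parses a small inline HTML sample with BeautifulSoup. The markup, the class names (`review-container`, `title`, `text`), and the `parse_reviews` helper are assumptions made for this example. They are not taken from the notebook, and IMDb's real page structure differs and changes over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded review page.
SAMPLE_HTML = """
<div class="review-container">
  <a class="title">Great film</a>
  <div class="text">Loved every minute of it.</div>
</div>
<div class="review-container">
  <a class="title">Not for me</a>
  <div class="text">Too long and too slow.</div>
</div>
"""


def parse_reviews(html):
    """Return a list of {"title", "text"} dicts, one per review block."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for block in soup.select("div.review-container"):
        title = block.select_one("a.title")
        text = block.select_one("div.text")
        if title is None or text is None:
            continue  # skip blocks that do not match the assumed layout
        reviews.append({
            "title": title.get_text(strip=True),
            "text": text.get_text(strip=True),
        })
    return reviews


if __name__ == "__main__":
    for review in parse_reviews(SAMPLE_HTML):
        print(review["title"], "->", review["text"])
```

Skipping blocks that lack an expected element keeps the loop from failing when only part of a page matches the assumed layout.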

Customizable Scraper 🛠️

rename_files.ipynb - Users can customize the scraper to target specific time periods, ratings, and other parameters, enabling focused data collection based on their requirements. This flexibility allows researchers, analysts, and enthusiasts to tailor the scraping process to their specific needs.
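One way to picture this kind of customization is a filter applied to reviews after they have been fetched. The sketch below is an assumption for illustration only: the `Review` record, its fields, and `filter_reviews` are hypothetical names and are not taken from the project's notebooks.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Review:
    """Hypothetical record for one scraped review."""
    text: str
    rating: int       # assumed to be on a 1-10 scale
    posted_on: date


def filter_reviews(reviews, min_rating=None, max_rating=None, start=None, end=None):
    """Return the reviews whose rating and date fall inside the given bounds.

    Any bound left as None is ignored; all bounds are inclusive.
    """
    kept = []
    for review in reviews:
        if min_rating is not None and review.rating < min_rating:
            continue
        if max_rating is not None and review.rating > max_rating:
            continue
        if start is not None and review.posted_on < start:
            continue
        if end is not None and review.posted_on > end:
            continue
        kept.append(review)
    return kept
```

For example, `filter_reviews(reviews, min_rating=8, start=date(2020, 1, 1))` would keep only highly rated reviews posted from 2020 onward.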

CSV Output 📁

convert_texts_to_csv.ipynb - The scraped data is saved into a CSV file, allowing for easy import into data analysis software or further processing. The CSV format ensures compatibility with a wide range of tools and platforms, making it convenient to incorporate the scraped data into various workflows and projects. 💾💼
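The conversion from text files to CSV can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the notebook's actual code: the `texts_to_csv` function, the one-review-per-file layout, and the `file` and `review` column names are all hypothetical.

```python
import csv
from pathlib import Path


def texts_to_csv(text_dir, csv_path):
    """Write one CSV row per .txt file in text_dir; return the row count."""
    text_dir = Path(text_dir)
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "review"])  # assumed column names
        for path in sorted(text_dir.glob("*.txt")):
            review = path.read_text(encoding="utf-8").strip()
            writer.writerow([path.name, review])
            count += 1
    return count
```

The `csv` writer quotes any review that contains commas, quotes, or line breaks, so the text does not need to be escaped by hand.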

Getting Started

Dependencies

Make sure you have the following dependencies installed:

Python 3.x
BeautifulSoup (install using pip install beautifulsoup4)
Pandas (install using pip install pandas)

Installation

Fork the Semi-supervised-sequence-learning-Project repository, following the instructions on how to fork a repository.

Clone the repository to your local machine

Using SSH:

git clone git@github.com:your-username/Semi-supervised-sequence-learning-Project.git

Or using HTTPS:

git clone https://github.com/your-username/Semi-supervised-sequence-learning-Project.git

Navigate to the project directory.

cd Semi-supervised-sequence-learning-Project

Troubleshooting

Dependency Installation Issues

If you encounter issues while installing dependencies such as BeautifulSoup or Pandas, try the following:

Ensure you're using the correct version of Python (check the project's requirements).
Use pip to install the necessary libraries:
```
pip install beautifulsoup4 pandas
```
If you encounter permission errors, try adding --user to the installation command:
```
pip install --user beautifulsoup4 pandas
```

For missing or outdated dependencies, create a virtual environment and install the required packages:

python -m venv env
source env/bin/activate  # On Windows use `env\Scripts\activate`
pip install -r requirements.txt
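To confirm which interpreter is active and whether the libraries can be found, a short check such as the following may help. It is a sketch only: `missing_modules` is a hypothetical helper, and the module list simply mirrors the dependencies named in this README.

```python
import importlib.util
import sys


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    # bs4 is the import name of the beautifulsoup4 package.
    missing = missing_modules(["bs4", "pandas"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Running it inside the activated virtual environment shows whether the packages were installed there or into a different Python installation.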

Scraping Errors

If the script fails to fetch reviews or if there are changes to the website:

Inspect the Website: The structure of the HTML may have changed. Use browser developer tools (F12) to inspect the elements you're scraping.
Update Selectors: Modify the CSS selectors or XPath in the script to match the current structure of the webpage.

Check for Blocked Requests: Websites may block scraping requests. Use headers in your requests to mimic a regular browser:

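import requests

# Send a browser-like User-Agent so the request is less likely to be rejected.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
# url is the address of the review page being scraped.
response = requests.get(url, headers=headers)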
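If requests are still rejected from time to time, one common approach is to retry after a pause. The sketch below is an assumption, not part of the project: `fetch_with_retries` and its parameters are hypothetical, and the `fetch` argument stands for any callable that returns a `(status_code, body)` pair, for example a thin wrapper around `requests.get`.

```python
import time


def fetch_with_retries(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTTP 200 or the attempts run out.

    fetch must return a (status_code, body) tuple. The wait doubles after
    each failed attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    Returns the body on success, or None if every attempt failed.
    """
    for attempt in range(retries):
        status, body = fetch(url)
        if status == 200:
            return body
        if attempt < retries - 1:
            # Pause before the next attempt, but not after the last one.
            sleep(base_delay * 2 ** attempt)
    return None
```

Passing `fetch` and `sleep` as arguments keeps the retry logic independent of any particular HTTP library and makes it easy to try out without network access.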

CSV Format Issues

If you're facing problems with the CSV file format:

Ensure Proper Formatting: Verify that the CSV file is correctly formatted. Each field should be separated by commas, and text fields should be enclosed in quotes if they contain commas.
Check Encoding: Ensure the file is saved with UTF-8 encoding to prevent issues with special characters.
Verify Column Names: If your script requires specific column names, ensure they match exactly.
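The checks above can be sketched as a small validation function. This is an illustrative sketch under stated assumptions: `check_csv` is a hypothetical helper, and the required column names are whatever your own script expects.

```python
import csv


def check_csv(path, required_columns):
    """Return a list of problems found in the CSV file at path (empty if none)."""
    problems = []
    try:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            header = reader.fieldnames or []
            missing = [c for c in required_columns if c not in header]
            if missing:
                problems.append("missing columns: " + ", ".join(missing))
            for number, row in enumerate(reader, start=1):
                # DictReader stores surplus fields under the key None and
                # fills absent fields with the value None.
                if None in row or None in row.values():
                    problems.append(f"data row {number}: wrong number of fields")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    return problems
```

An empty list means the header contains every required column, each data row has the same number of fields as the header, and the file decodes as UTF-8.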

Usage

Starting the Streamlit app

Navigate to the Web_app directory

cd Web_app

Install requirements with pip

pip install -r requirements.txt

Run the Streamlit app

streamlit run streamlit_app.py

Uploading the CSV file

When prompted by the app, upload a CSV (comma-separated values) file containing the reviews.

Demo Link

Streamlit app link: https://scrape-review-analysis.streamlit.app

Contribution

Contributions are welcome! If you have any suggestions for improvements or new features, please feel free to submit a pull request. Your contributions help make this project better for everyone.

Contributors

Thanks to all the contributors for their valuable contributions to this project.

Final Dataset

🔬 Here is the link to the final dataset: Drive Link, containing the scraped IMDb movie reviews. This dataset can be used for analysis, research, or any other purpose you require. 📦

Support

For any issues regarding the scraper, feel free to open an issue on GitHub. We'll be happy to assist you with any problems or inquiries you may have. 🛠️


🌐 Connect with Me

Thank you for visiting! Feel free to reach out through any of the links above.
