Predict IMDB movie rating

.===================================================================================

Predict IMDB movie rating

by Chuan Sun (sundeepblue at gmail dot com)

https://twitter.com/sundeepblue

Scrapy project @ NYC Data Science Academy

8/14/2016

===================================================================================

STEP 1:

Fetch a list of 5000 movie titles and budgets from www.the-numbers.com

This step will generate a JSON file 'movie_budget.json'

$ scrapy crawl movie_budget -o movie_budget.json

===================================================================================

STEP 2:

Load 5000+ movie titles from the JSON file 'movie_budget.json'

Then search those titles from IMDB website to get the real IMDB movie links

It will generate a JSON file 'fetch_imdb_url.json' containing movie-link pairs

$ scrapy crawl fetch_imdb_url -o fetch_imdb_url.json

===================================================================================

STEP 3:

Scrape 5000+ IMDB movie information

This step will load the JSON file 'fetch_imdb_url.json', go into each movie page, and grab data

This step will generate a JSON file 'imdb_output.json' (20M) containing detailed info of 5000+ movies

It will also download all available posters for all movies.

A total of 4907 posters can be downloaded (998MB). Note that I am not sure if I can upload all those posters into github, so I only uploaded a few. You can see from my code how to use scrapy to grab them all.

$ scrapy crawl imdb -o imdb_output.json

===================================================================================

STEP 4:

Perform face recognition to count face numbers from all posters

This step will save result into JSON file 'image_and_facenumber_pair_list.json'

$ python detect_faces_from_posters.py

===================================================================================

STEP 5:

Load two JSON files 'imdb_output.json' and 'image_and_facenumber_pair_list.json'

Parse all variables into valid format.

Generate a final CSV table containing 28 variables that can be loaded in R or Pandas

The output will be a CSV file 'movie_metadata.csv' (1.5MB)

"movie_title"
"color"
"num_critic_for_reviews"
"movie_facebook_likes" "duration"
"director_name"
"director_facebook_likes"
"actor_3_name" "actor_3_facebook_likes"
"actor_2_name"
"actor_2_facebook_likes"
"actor_1_name" "actor_1_facebook_likes"
"gross"
"genres"
"num_voted_users"
"cast_total_facebook_likes" "facenumber_in_poster"
"plot_keywords"
"movie_imdb_link"
"num_user_for_reviews"
"language"
"country"
"content_rating"
"budget"
"title_year"
"imdb_score"
"aspect_ratio"

$ python parse_scraped_data.py

===================================================================================

STEP 6:

Load the 'movie_metadata.csv' file in RStudio, and perform EDA and LASSO regression

$ > run the RStudio

$ > load the file 'movie_rating_prediction.R'

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Predict IMDB movie rating

STEP 1:

STEP 2:

STEP 3:

STEP 4:

STEP 5:

STEP 6:

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
full		full
movie		movie
README.md		README.md
detect_faces_from_posters.py		detect_faces_from_posters.py
fetch_imdb_url.json		fetch_imdb_url.json
image_and_facenumber_pair_list.json		image_and_facenumber_pair_list.json
imdb_output.json		imdb_output.json
movie_budget.json		movie_budget.json
movie_metadata.csv		movie_metadata.csv
movie_rating_prediction.r		movie_rating_prediction.r
parse_scraped_data.py		parse_scraped_data.py
scrapy.cfg		scrapy.cfg

sundeepblue/movie_rating_prediction

Folders and files

Latest commit

History

Repository files navigation

Predict IMDB movie rating

STEP 1:

STEP 2:

STEP 3:

STEP 4:

STEP 5:

STEP 6:

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages