Tennis-prediction

This is my first project in Data Science. I'm a fan of tennis, after practicing during 7 years. I missed the court. What could be more interesting than studying Data Science with a little bit of nostalgia. We tried to answer : Can tennis outcomes be predicted thanks to data available on the internet ?

I did this project with a friend of mine : Florian DECONINCK, we had 3 weeks to do the analysis and create a dashboard on a website to see latest tweets regarding tennis, see different results of tennis matches and link our predictions to it.

We reached an accuracy of 65% by using elo rating of player in a Logistic Regression. That is a good score already but we then tried to do a more deeper analysis.

We tried to compute more features to see if we can beat this accuracy :

We performed different steps to reach our goal :

Data collection : we got our data from a website that freely gives a pre-populated database with tennis results and a lot of information on players. This website is Ultimate tennis statistics
Data cleaning : we checked all features to see if some important data was missing. We saw that we had matches duplicates, we decided to drop those duplicates and reverse half of the matches to get a balanced sample. We also dropped features that we weren't interested in like match_id, player_rank (we preferred elo_rating which is more accurate) and surface. We handled NA values by dropping records where missing values were essential for future steps. We also saw that missing values were mostly coming from atypical tournaments like Davis Cup or ATP Finals.
Feature scaling : we then scaled our features for the Logistic Regression
Feature Engineering : We computed new statistics as first serve success percentage, winning on first serve percentage, aces, percentage of matches won, and also head to head statistics between players. We then performed feature difference to have one feature for each statistics representing the difference of levels between the 2 players on this statistics.
Feature selection : We performed recursive feature elimination to keep only the most important features.

Modeling : We got a 64% accuracy, this model didn't outperform our previous model, our features don't seem to bring more information than elo rating.

In conclusion, with more time to research really important features, with more time to do exploratory analysis and statistics tests to evaluate our features, we could expect to improve our model. We were a little bit "disappointed" not by our work but by the results, but we will definitely come back on this study, with more knowledge (we are starting a major in Data Science) and with more time to create a good model, and study its return on investment.

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
2018 matches		2018 matches
big analysis balanced		big analysis balanced
big analysis		big analysis
elo rating prediction		elo rating prediction
images		images
rank point prediction		rank point prediction
rank prediction		rank prediction
.gitignore		.gitignore
Goals.md		Goals.md
README.md		README.md
Tennis first attempt.ipynb		Tennis first attempt.ipynb
conseils.txt		conseils.txt
matches.csv		matches.csv
variables.txt		variables.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tennis-prediction

About

Releases

Packages

Contributors 2

Languages

Thrynk/Tennis-prediction

Folders and files

Latest commit

History

Repository files navigation

Tennis-prediction

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages