Play with Elasticsearch querying data from UCI Machine Learning Repository

sulmanen/es-movies

Elasticsearch is a good full-text search engine.

  • Wikipedia search is powered by Elasticsearch.
  • The Guardian joins access log data with social network data using Elasticsearch to give editors an idea of how the public is responding to articles.
  • Stack Overflow full-text search is powered by Elasticsearch. They use the More Like This feature to find similar answers.
  • GitHub uses Elasticsearch to query 130 billion lines of code.

Prerequisites

Docker, Python 2.7 with pip or easy_install, and internet access.

  1. Get code. git clone git@github.com:sulmanen/es-movies.git
  2. Fire up elasticsearch. docker-compose up
  3. Verify. curl http://localhost:9200
  4. Deps. pip install requests && pip install BeautifulSoup
  5. Create index. ./et index create 0
  6. Create alias. ./et index alias movies 0
  7. Verify alias. curl http://localhost:9200/_aliases
  8. Load data. python2.7 import-movies.py
  9. Fire up crappy ui. http-server
  10. Navigate to http://localhost:8080/
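Steps 3 and 7 check the cluster by eye with curl. If you would rather script the alias check, a minimal sketch could look like the following. It only parses a response shaped like the output of `/_aliases`; the helper name `indices_for_alias` and the sample data are made up for illustration.

```python
import json


def indices_for_alias(aliases_response, alias):
    """Return the sorted names of the indices that carry the given alias."""
    return sorted(
        index
        for index, info in aliases_response.items()
        if alias in info.get("aliases", {})
    )


# Sample shaped like the output of: curl http://localhost:9200/_aliases
sample = json.loads('{"0": {"aliases": {"movies": {}}}}')
print(indices_for_alias(sample, "movies"))  # ['0']
```

After step 6 the `movies` alias should point at index `0`, which is what the sample mimics.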

Exercises

We are using the UCI Movies Dataset of over 10k films. The titles range from the late 1800s to 1999.

Find all the Academy Award winners in the database. AA stands for winning an Academy Award.

Find the film Elmer Gantry in the raw data. Did it win an Academy Award?

  1. Find all the Academy Award winners, excluding those that were only nominated (AAN).
  2. Try to filter for all the movies that contain the word 'Vampire'. How many are there? What's up with the score?
  3. The best films are not in any particular order. Let's see if we can use a function score to order the results after matches have been made. Perhaps the field_value_factor or the decay functions can help us order our movies.
  4. Something isn't right. Let's look at what our index looks like: curl http://localhost:9200/movies. What's the problem?
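As a starting point for the filter and function score exercises, here is a hedged sketch of two request bodies. The field names `title` and `year` are assumptions (the real names depend on what import-movies.py indexes), and the `bool` filter syntax assumes a reasonably recent Elasticsearch. Clauses in filter context only include or exclude documents and do not contribute to the score, which is worth keeping in mind when looking at the 'Vampire' results. `field_value_factor` expects a numeric field.

```python
import json

# Hypothetical field names; check your index for the real ones.
TITLE_FIELD = "title"
YEAR_FIELD = "year"


def filter_by_title(word):
    """Match in filter context: documents are included or excluded, not ranked."""
    return {"query": {"bool": {"filter": [{"match": {TITLE_FIELD: word}}]}}}


def score_by_field(word, field=YEAR_FIELD):
    """Match on the title, then replace the score with a numeric field's value."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {TITLE_FIELD: word}},
                "field_value_factor": {"field": field, "missing": 0},
                "boost_mode": "replace",
            }
        }
    }


print(json.dumps(filter_by_title("Vampire"), indent=2))
print(json.dumps(score_by_field("Vampire"), indent=2))
```

Either body can be sent to http://localhost:9200/movies/_search, for example with curl -H 'Content-Type: application/json' -d.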

Creating an index mapping.

Tuning relevance in Elasticsearch is a dance between the index and the query. Let's add some mappings! The mapping of an existing field cannot be changed in place, so we will create a new index named 1. There are some ready-made mappings. But is there something we should change to make the function score work?
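For orientation, an explicit mapping might look roughly like the sketch below. This is not the repository's ready-made mapping: the field names are hypothetical, and the typeless `properties` layout with `text` and `keyword` types assumes Elasticsearch 7 or later (older versions nest the properties under a type name and use `string`). The point is only that a field used for numeric scoring needs a numeric type.

```python
import json


def movie_index_body():
    """Index creation body with an explicit mapping (hypothetical field names)."""
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "awards": {"type": "text"},
                "category": {"type": "keyword"},
                # A numeric type so that function_score can compute with it.
                "year": {"type": "integer"},
            }
        }
    }


print(json.dumps(movie_index_body(), indent=2))
```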

  1. Create index. ./et create index 1
  2. Reindex. ./et reindex 0 1
  3. Switch alias. ./et index alias movies 1 0

You can update the index in production in this way without downtime, and also roll back if the new index has a problem.
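The internals of the `et` script are not shown here, but a swap like this presumably comes down to a single request to the Elasticsearch `_aliases` endpoint, which applies all of its actions atomically. A minimal sketch of such a request body, with a made-up helper name:

```python
import json


def swap_alias_actions(alias, old_index, new_index):
    """Body for POST /_aliases that moves an alias from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


print(json.dumps(swap_alias_actions("movies", "0", "1"), indent=2))
```

Because both actions happen in one step, clients querying `movies` never see a moment without an index behind the alias. Rolling back is the same call with the index names reversed.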

Find Academy Award winners in the drama category

'Dram' is the keyword to find dramas. Can you find the drama Academy Award winners?
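One possible shape for this query is a `bool` query that requires both conditions. The field names `category` and `awards` are assumptions for illustration; adjust them to whatever the index actually contains.

```python
import json


def drama_award_winners():
    """Require both the 'Dram' category and an 'AA' award (hypothetical fields)."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"category": "Dram"}},
                    {"match": {"awards": "AA"}},
                ]
            }
        }
    }


print(json.dumps(drama_award_winners(), indent=2))
```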

Once you start typing into the typeahead field, the experience isn't very satisfying. Let's create a typeahead index.
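A common way to build a typeahead index is to index edge n-grams (prefixes) of each word, so that a partial word typed by the user matches a whole indexed token. The sketch below first shows what edge n-grams are in plain Python, then a possible settings and mapping body. The analyzer and filter names, the `title` field, and the Elasticsearch 7+ mapping layout are all assumptions.

```python
import json


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of a token, mimicking what an edge n-gram filter emits."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def typeahead_index_body():
    """Index body that applies edge n-grams at index time only."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "autocomplete_filter": {
                        "type": "edge_ngram",
                        "min_gram": 1,
                        "max_gram": 20,
                    }
                },
                "analyzer": {
                    "autocomplete": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "autocomplete_filter"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "autocomplete",
                    # Search with the plain analyzer so the typed text is not n-grammed.
                    "search_analyzer": "standard",
                }
            }
        },
    }


print(edge_ngrams("elmer"))  # ['e', 'el', 'elm', 'elme', 'elmer']
print(json.dumps(typeahead_index_body(), indent=2))
```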

Let's add language analyzers into the mix

Language analyzers have inherent weaknesses, so let's index the original field alongside the analyzed one.
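One way to keep both versions is a multi-field: the main field uses a language analyzer and a sub-field keeps the standard analysis. The sketch below assumes a `title` field, the built-in `english` analyzer, and the Elasticsearch 7+ mapping layout. The sub-field name `original` is made up.

```python
import json


def title_multifield_mapping():
    """Title analyzed as English, with the standard-analyzed text alongside."""
    return {
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "english",
                    "fields": {
                        "original": {"type": "text", "analyzer": "standard"}
                    },
                }
            }
        }
    }


def search_both_fields(text):
    """Query the stemmed and the original field and combine their scores."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["title", "title.original"],
                "type": "most_fields",
            }
        }
    }


print(json.dumps(title_multifield_mapping(), indent=2))
print(json.dumps(search_both_fields("vampires"), indent=2))
```

With `most_fields`, a document that matches both the stemmed and the unstemmed form scores higher than one that matches only through stemming.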

But what about stop words?

Bigrams for efficiently matching names
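Word bigrams (called shingles in Elasticsearch) index pairs of adjacent words as single tokens, so a two-word name can match as one term. The sketch below illustrates the idea in plain Python and then gives a possible `shingle` token filter definition. The filter and analyzer names are hypothetical.

```python
import json


def word_bigrams(text):
    """Pairs of adjacent lowercased words, similar to two-word shingles."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def shingle_analysis_settings():
    """Analysis settings with a bigram-only shingle filter."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "bigram_filter": {
                        "type": "shingle",
                        "min_shingle_size": 2,
                        "max_shingle_size": 2,
                        "output_unigrams": False,
                    }
                },
                "analyzer": {
                    "bigrams": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "bigram_filter"],
                    }
                },
            }
        }
    }


print(word_bigrams("Francis Ford Coppola"))  # ['francis ford', 'ford coppola']
print(json.dumps(shingle_analysis_settings(), indent=2))
```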

Exact phrase matching
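For exact phrases, Elasticsearch offers the `match_phrase` query, which requires the terms to appear in order and next to each other. A minimal sketch, assuming a `title` field:

```python
import json


def phrase_query(phrase, field="title"):
    """Match the words of the phrase in order, with no gaps (slop 0)."""
    return {"query": {"match_phrase": {field: {"query": phrase, "slop": 0}}}}


print(json.dumps(phrase_query("Elmer Gantry"), indent=2))
```

Raising `slop` lets the words sit further apart while still favoring close matches.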

Fuzzy query and minimum should match
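A `match` query can tolerate typos through `fuzziness` and can require that only part of the query terms match through `minimum_should_match`. The sketch below assumes a `title` field. The chosen values are illustrative, not recommendations from this repository.

```python
import json


def fuzzy_title_query(text, min_match="75%"):
    """Typo-tolerant match where only a share of the terms has to match."""
    return {
        "query": {
            "match": {
                "title": {
                    "query": text,
                    "fuzziness": "AUTO",
                    "minimum_should_match": min_match,
                }
            }
        }
    }


# The misspelling is deliberate, to exercise the fuzzy matching.
print(json.dumps(fuzzy_title_query("elmr gantry"), indent=2))
```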
