search-engine

We designed and implemented a web search engine based on the vector space model. It uses MongoDB, a non-relational database, and NLTK, a natural language processing toolkit.

Dataset

File: data.csv

Source: https://dataverse.harvard.edu/dataset.xhtml?id=3010077

Description: A reusable, publicly available dataset for “media bias” studies. It contains the publish date, title, subtitle, and text of 3824 news articles, collected by a project over three months, from December 2016 to March 2017. The articles come from ABC News, CNN News, The Huffington Post, BBC News, DW News, TASS News, Al Jazeera News, China Daily, and RTE News, and were all gathered from each site's RSS feed. (2017-3-31)

Install MongoDB

$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 68818C72E52529D4
$ echo "deb http://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
$ sudo apt-get update
$ sudo apt-get install -y mongodb-org
$ sudo systemctl start mongod
$ sudo systemctl enable mongod

Create a database

$ mongo
> use search-engine

Ingest the articles into the database

$ python3 ingestion.py
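
The contents of ingestion.py are not shown here. As a rough sketch of what such a step could look like, the code below reads rows from a CSV file and inserts them in batches. The column names in `FIELDS`, the helper names, and the batch size are assumptions for illustration, not taken from the project; `collection` can be any object with an `insert_many` method, such as a PyMongo collection.

```python
import csv

# Hypothetical column names; the real header of data.csv may differ.
FIELDS = ("publish_date", "title", "subtitle", "text")


def rows_to_documents(csv_file):
    """Yield one dict per CSV row, keeping only the expected fields."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        doc = {field: (row.get(field) or "").strip() for field in FIELDS}
        # Skip rows without body text, since there is nothing to search.
        if doc["text"]:
            yield doc


def ingest(csv_file, collection, batch_size=500):
    """Insert documents in batches and return how many were inserted."""
    batch, total = [], 0
    for doc in rows_to_documents(csv_file):
        batch.append(doc)
        if len(batch) >= batch_size:
            collection.insert_many(batch)
            total += len(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        collection.insert_many(batch)
        total += len(batch)
    return total
```

With PyMongo installed and mongod running, this could be called as `ingest(open("data.csv", newline="", encoding="utf-8"), MongoClient()["search-engine"]["articles"])`, where the collection name `articles` is again an assumption.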

Run the search algorithm

$ python3 search.py
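
The contents of search.py are not shown here. To illustrate the vector space model in general terms, the sketch below weights terms with TF-IDF and ranks documents by cosine similarity to the query. It is not the project's implementation: the regex tokenizer stands in for NLTK, the smoothed IDF formula is one common variant chosen as an assumption, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simple regex tokenizer used as a stand-in for NLTK tokenization.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return (TF-IDF vectors, IDF table) for a list of document strings."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter()
    for counts in term_counts:
        df.update(counts.keys())  # each term counted once per document
    # Assumed IDF variant: log(N / df) + 1, so no term gets weight zero.
    idf = {t: math.log(n / d) + 1.0 for t, d in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in c.items()} for c in term_counts]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, top_k=3):
    """Return up to top_k (doc_index, score) pairs, best match first."""
    vectors, idf = build_index(docs)
    q_counts = Counter(tokenize(query))
    # Query terms that never occur in the corpus are ignored.
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    scored = [(s, i) for s, i in scored if s > 0.0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, round(s, 3)) for s, i in scored[:top_k]]
```

Rebuilding the index on every query keeps the sketch short; a real engine would more likely compute the document vectors once and store them, for example alongside the articles in MongoDB.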

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
