Assignment 2 for the Mining Big Datasets Course of AUEB's MSc in Business Analytics.
You are provided with a subset of the high energy physics theory citation network, comprising authors, articles, journals, and citations between articles. The dataset contains:
- 29,555 articles with id, title, year, journal, and abstract
- 15,420 authors with names
- 836 journals with names
- 352,807 citations among papers
You can download the dataset (Citation Dataset) from Moodle in CSV format. The dataset files include:
- ArticleNodes.csv: Information about Article nodes (id, title, year, journal, and abstract).
- AuthorNodes.csv: Article id and the name of the author(s).
- Citations.csv: Information about citations between articles (articleId,--[Cites]->, articleId).
Model the data as a property graph by designing the appropriate entities and assigning the relevant labels, types, and properties. Include attributes that describe each node and edge type without repetitions. Ensure nodes are connected only when necessary.
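One candidate model, sketched as a tiny hand-made sample rather than the expected answer, uses `Author`, `Article`, and `Journal` nodes joined by `WROTE`, `PUBLISHED_IN`, and `CITES` relationships. The labels, relationship types, and sample values below are assumptions for illustration; only the property names (id, title, year, abstract, name) come from the dataset description. Keeping the journal as its own node, instead of a property on every article, is one way to avoid repeating the journal name.

```cypher
// Hypothetical sample data illustrating one possible property graph model.
// Labels and relationship types are assumptions, not part of the dataset.
CREATE (a1:Article {id: 1, title: 'Sample paper A', year: 1998, abstract: 'An abstract.'})
CREATE (a2:Article {id: 2, title: 'Sample paper B', year: 2001, abstract: 'Another abstract.'})
CREATE (j:Journal {name: 'Sample Journal'})
CREATE (au:Author {name: 'A. Author'})
CREATE (au)-[:WROTE]->(a1)          // author-to-article edge
CREATE (au)-[:WROTE]->(a2)
CREATE (a1)-[:PUBLISHED_IN]->(j)    // article-to-journal edge
CREATE (a2)-[:CITES]->(a1);         // citing article points at the cited article
```

In this sketch no author-to-author or author-to-journal edges are stored, since both can be derived by traversing through `Article` nodes.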
Create a graph database on Neo4j and load the citation network elements using the provided CSV files. You can load the dataset directly from the CSV files using the Neo4j browser, Neo4j import tool, or any supported programming language. Consider creating proper indexes on your model properties to improve loading and query response times.
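A loading script from the Neo4j browser might look roughly like the sketch below. It rests on several assumptions that should be checked against the actual files: the CSVs sit in the Neo4j import directory, are comma-separated, have no header row, and list their columns in the order given above, with one author per row in AuthorNodes.csv. The labels (`Article`, `Author`, `Journal`) and relationship types (`WROTE`, `PUBLISHED_IN`, `CITES`) are illustrative choices, and the constraint syntax assumes Neo4j 4.4 or later.

```cypher
// Uniqueness constraints also create the indexes used by MERGE and MATCH below.
// Assumes Neo4j 4.4+; older versions use CREATE CONSTRAINT ON ... ASSERT ...
CREATE CONSTRAINT article_id IF NOT EXISTS FOR (a:Article) REQUIRE a.id IS UNIQUE;
CREATE CONSTRAINT author_name IF NOT EXISTS FOR (a:Author) REQUIRE a.name IS UNIQUE;
CREATE CONSTRAINT journal_name IF NOT EXISTS FOR (j:Journal) REQUIRE j.name IS UNIQUE;

// Articles and journals. Assumed column order: id, title, year, journal, abstract.
LOAD CSV FROM 'file:///ArticleNodes.csv' AS row
MERGE (a:Article {id: toInteger(row[0])})
SET a.title = row[1], a.year = toInteger(row[2]), a.abstract = row[4]
WITH a, row
WHERE row[3] IS NOT NULL AND row[3] <> ''   // skip articles without a journal
MERGE (j:Journal {name: row[3]})
MERGE (a)-[:PUBLISHED_IN]->(j);

// Authors. Assumed column order: article id, author name.
LOAD CSV FROM 'file:///AuthorNodes.csv' AS row
WITH row
WHERE row[1] IS NOT NULL AND row[1] <> ''   // MERGE cannot use a null name
MATCH (a:Article {id: toInteger(row[0])})
MERGE (au:Author {name: row[1]})
MERGE (au)-[:WROTE]->(a);

// Citations. Assumed columns: citing id, the literal '--[Cites]->' marker, cited id.
LOAD CSV FROM 'file:///Citations.csv' AS row
MATCH (src:Article {id: toInteger(row[0])})
MATCH (dst:Article {id: toInteger(row[2])})
MERGE (src)-[:CITES]->(dst);
```

Creating the constraints before loading matters here: without the backing indexes, each `MERGE` and `MATCH` would scan all nodes of that label, which is slow for roughly 350,000 citation rows.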
Execute the following queries using the Cypher language:
- Identify the top 5 authors with the most citations from other papers.
- Determine the top 5 authors with the most collaborations with different authors.
- Find the author who has written the most papers without collaborations.
- Discover the author who published the most papers in 2001.
- Identify the journal with the most papers about "gravity" in 1998.
- Find the top 5 papers with the most citations.
- Retrieve papers that mention both "holography" and "anti de sitter" in the abstract.
- Find the shortest path between two authors ('C.N. Pope' and 'M. Schweda').
- Repeat the previous query but only using edges between authors and papers.
- Find all authors with shortest path lengths > 25 from author 'Edward Witten' considering only edges between authors and articles.
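As an illustration of the query style, three of the items above could be sketched as follows. These are not reference solutions: they assume a hypothetical model with `Article` and `Author` nodes, `CITES` edges from the citing to the cited article, and `WROTE` edges from authors to articles, and they would need adapting to the model actually chosen.

```cypher
// Top 5 papers with the most citations (counts incoming CITES edges).
MATCH (p:Article)<-[:CITES]-(citing:Article)
RETURN p.title AS title, count(citing) AS citations
ORDER BY citations DESC
LIMIT 5;

// Papers whose abstract mentions both terms, compared case-insensitively.
MATCH (p:Article)
WHERE toLower(p.abstract) CONTAINS 'holography'
  AND toLower(p.abstract) CONTAINS 'anti de sitter'
RETURN p.id AS id, p.title AS title;

// Shortest path between two authors using only author-article edges.
MATCH (a:Author {name: 'C.N. Pope'}), (b:Author {name: 'M. Schweda'})
MATCH path = shortestPath((a)-[:WROTE*]-(b))
RETURN path, length(path) AS hops;
```

Restricting the relationship type inside `shortestPath` is what separates the last two path queries: leaving the type out lets the path also pass through citation or journal edges.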
Your deliverable should include:
- Report.pdf:
  - Detailed graph model description.
  - Commands used for importing files to the database.
  - Cypher code for required queries with results.
- Program/Script: Implementations for any step of the assignment.
- queries.cy: A text file containing the queries expressed in Cypher language.