📰 BigData Project: Arxiv_Analysis (Computer Science)

Explore what is happening at the academic frontier of CS by analysing a large collection of papers from arXiv.

Group Leader: Jianshu Zhang

Group Member: Yanfu Kai; Ziheng Peng

Adviser: Prof. Run Wang

Report Pre

Our work

image

Arxiv_Analysis Project Structure

This project is structured as follows:

  • ./crawler_utils: Contains utilities for crawling data from Arxiv.

  • ./dataset: To replicate the whole project, you need to download bert-base-uncased. All the CSV files can be reproduced by running ./crawler_utils/crawl.py, ./dataset/prepocess.py, and ./dataset/trans_to_bert.py.

  • ./results: The result of data analysis.

  • ./tools: Contains all the tools used for analysis; their outputs are saved in ./visualization. Below is a list of the scripts along with a brief description of their purpose:

    cata_kmeans.py: Performs K-Means clustering on the dataset to identify distinct groups based on characteristics.

    cata_num_rank.py: Ranks the categories by number of papers from 11/30/2022 to 12/01/2023.

    cata_rela_cs.py: Analyzes the relationship between different categories.

    cata_rela_sum.py: Summarizes the relationships between categories by using a network.

    cata_wordcloud.py: Generates a word cloud from categorical data to visualize the frequency or importance of categories.

    month_inter.py: Looks for statistical regularities in the interval between a paper's initial and last submission.

    month_statistic.py: Interprets monthly data, possibly to identify trends or patterns over time.

    rela_cs_radar.py: Creates a radar chart to show the relationship of CS with other categories.

    year_statistic.py: Calculates yearly statistics to provide insights into long-term trends.

  • ./visualization: Figures visualizing the data analysis results.

  • ./test_if_spark_can_work.py: Test the Spark environment setup.
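
As an illustration of the kind of category-relationship analysis described for the scripts in ./tools, here is a minimal sketch that counts how often two categories are listed on the same paper. It is not the project's actual code: the function name category_cooccurrence, the sample data, and the assumption that each paper's categories arrive as one space-separated string are all hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import combinations


def category_cooccurrence(papers):
    """Count how often each unordered pair of categories appears on the same paper.

    `papers` is an iterable of strings, each holding one paper's categories
    separated by spaces (an assumed input format for this sketch).
    """
    pair_counts = Counter()
    for categories in papers:
        # Deduplicate and sort so that (a, b) and (b, a) map to the same key.
        unique = sorted(set(categories.split()))
        pair_counts.update(combinations(unique, 2))
    return pair_counts


# Hypothetical sample data: one category string per paper.
sample = [
    "cs.LG cs.AI",
    "cs.LG stat.ML",
    "cs.AI cs.LG cs.CL",
    "cs.CV",
]

counts = category_cooccurrence(sample)
for pair, n in counts.most_common():
    print(pair, n)
```

Pair counts of this kind could plausibly serve as edge weights in a category network or as the values plotted on a radar chart, though how the scripts above actually compute their relationships is not stated here.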

Tasks

image

Algorithm Design

image

A few examples of our visualizations

image