Explore the academic frontier in computer science by analyzing a large number of papers from arXiv.
Group Leader: Jianshu Zhang
Group Members: Yanfu Kai; Ziheng Peng
Adviser: Prof. Run Wang
This project is structured as follows:
- ./crawler_utils: Contains utilities for crawling data from arXiv.
- ./dataset: To replicate the whole project, you need to download bert-base-uncased. All the CSV files can be reproduced by running ./crawler_utils/crawl.py, ./dataset/prepocess.py, and ./dataset/trans_to_bert.py.
- ./results: The results of the data analysis.
- ./tools: Includes all the tools used for analysis; their outputs are saved in ./visualization. Below is a list of the scripts along with a brief description of their purpose:
  - cata_kmeans.py: Performs K-Means clustering on the dataset to identify distinct groups based on their characteristics.
  - cata_num_rank.py: Ranks the categories by number of papers from 11/30/2022 to 12/01/2023.
  - cata_rela_cs.py: Analyzes the relationships between different categories.
  - cata_rela_sum.py: Summarizes the relationships between categories as a network.
  - cata_wordcloud.py: Generates a word cloud from the category data to visualize the frequency or importance of categories.
  - month_inter.py: Looks for statistical regularities in the interval between the initial and the last submission.
  - month_statistic.py: Analyzes monthly data to identify trends or patterns over time.
  - rela_cs_radar.py: Creates a radar chart showing the relationship between cs and other categories.
  - year_statistic.py: Calculates yearly statistics to provide insights into long-term trends.
- ./visualization: Visualizations of the data analysis results, presented as a variety of figures.
- ./test_if_spark_can_work.py: Tests the Spark environment setup.
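
To give a feel for the kind of analysis the scripts in ./tools perform, here is a minimal sketch of a category ranking over a date window, similar in spirit to what cata_num_rank.py is described as doing. It is not the project's actual code: the in-memory `papers` records, their (date, category string) layout, and the `rank_categories` function are all assumptions made for illustration, and it uses only the Python standard library rather than Spark.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (submission date, space-separated arXiv categories).
# The real CSV schema produced by the crawler may differ.
papers = [
    (date(2023, 1, 15), "cs.LG cs.AI"),
    (date(2023, 3, 2), "cs.CV cs.LG"),
    (date(2023, 6, 20), "cs.CL"),
    (date(2022, 10, 5), "cs.LG stat.ML"),  # falls outside the window used below
    (date(2023, 11, 30), "cs.LG math.OC"),
]


def rank_categories(records, start, end):
    """Count category occurrences for papers dated within [start, end]."""
    counts = Counter()
    for submitted, categories in records:
        if start <= submitted <= end:
            # A paper listed under several categories counts once for each.
            counts.update(categories.split())
    # most_common() returns (category, count) pairs sorted by count, descending.
    return counts.most_common()


# Window taken from the script description: 11/30/2022 to 12/01/2023.
ranking = rank_categories(papers, date(2022, 11, 30), date(2023, 12, 1))
for category, n in ranking:
    print(f"{category}: {n}")
```

On the full dataset the same counting would presumably be done with Spark or pandas over the CSV files in ./dataset; the standard-library version above only illustrates the logic.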