Simpler Dataset Files #120

pmbittner · 2024-01-27T19:01:51Z

Currently, datasets are given as markdown files with lots of unused columns:

| Project name | Domain | Source code available (yes/no)? | Is it a git repository (yes/no)? | Repository URL | Clone URL | Estimated number of commits |
| --- | --- | --- | --- | --- | --- | --- |
| apache-httpd | web server | y | y | https://github.com/apache/httpd | https://github.com/DiffDetective/httpd.git | 32,927 |
| berkeley-db-libdb | database system | y | y | https://github.com/berkeleydb/libdb | https://github.com/DiffDetective/libdb.git | 7 |

Our dataset loader only uses the project name and the clone URL, so both the dataset files and the loading code should be simplified. The Domain and Repository URL columns are interesting but not essential, so maybe they could stay in the files as the last two columns.

Also, except for line 2 of the file (the header separator row), a markdown file that contains just a single table like this is effectively a CSV file with | as the separator instead of , or ;. So maybe we could reuse our CSV IO classes here.
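To make the idea concrete, here is a minimal sketch of reading such a table as |-separated values while skipping line 2. It is not the project's existing loader or its CSV IO classes: `DatasetTableSketch`, `DatasetEntry`, `splitRow`, and `parse` are hypothetical names made up for illustration. The sketch assumes the first line is the header row, that no cell contains an escaped `|`, and that only the "Project name" and "Clone URL" columns are needed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not DiffDetective's actual loader or CSV IO API.
class DatasetTableSketch {
    /** The two fields the loader is said to need (hypothetical type). */
    record DatasetEntry(String name, String cloneUrl) {}

    /** Splits one table row on '|' and trims each cell, ignoring the outer pipes. */
    static List<String> splitRow(String line) {
        String s = line.trim();
        if (s.startsWith("|")) s = s.substring(1);
        if (s.endsWith("|")) s = s.substring(0, s.length() - 1);
        List<String> cells = new ArrayList<>();
        // Limit -1 keeps empty trailing cells; escaped pipes are not handled here.
        for (String cell : s.split("\\|", -1)) {
            cells.add(cell.trim());
        }
        return cells;
    }

    /** Parses a single-table markdown file given as a list of lines. */
    static List<DatasetEntry> parse(List<String> lines) {
        List<String> header = splitRow(lines.get(0));
        // Look columns up by name so that reordering columns does not break loading.
        int nameCol = header.indexOf("Project name");
        int urlCol = header.indexOf("Clone URL");
        if (nameCol < 0 || urlCol < 0) {
            throw new IllegalArgumentException("Missing 'Project name' or 'Clone URL' column");
        }
        List<DatasetEntry> entries = new ArrayList<>();
        // Index 1 is the markdown separator row (| --- | --- |), so data starts at index 2.
        for (int i = 2; i < lines.size(); i++) {
            if (lines.get(i).isBlank()) continue;
            List<String> cells = splitRow(lines.get(i));
            if (cells.size() <= Math.max(nameCol, urlCol)) {
                throw new IllegalArgumentException("Too few columns in line " + (i + 1));
            }
            entries.add(new DatasetEntry(cells.get(nameCol), cells.get(urlCol)));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "| Project name | Clone URL | Domain | Repository URL |",
            "| --- | --- | --- | --- |",
            "| apache-httpd | https://github.com/DiffDetective/httpd.git | web server | https://github.com/apache/httpd |");
        for (DatasetEntry entry : parse(lines)) {
            System.out.println(entry.name() + " -> " + entry.cloneUrl());
        }
    }
}
```

Looking columns up by header name rather than by position means the optional Domain and Repository URL columns can be moved to the end, or dropped, without touching the loading code.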

pmbittner added the Good First Issue Low hanging fruits label Jan 27, 2024
