Simpler Dataset Files #120

Open
pmbittner opened this issue Jan 27, 2024 · 0 comments
Labels
Good First Issue Low hanging fruits

Comments

@pmbittner
Member

Currently, datasets are given as markdown files with lots of unused columns:

| Project name | Domain | Source code available (yes/no)? | Is it a git repository (yes/no)? | Repository URL | Clone URL | Estimated number of commits |
| --- | --- | --- | --- | --- | --- | --- |
| apache-httpd | web server | y | y | https://github.com/apache/httpd | https://github.com/DiffDetective/httpd.git | 32,927 |
| berkeley-db-libdb | database system | y | y | https://github.com/berkeleydb/libdb | https://github.com/DiffDetective/libdb.git | 7 |

Our dataset loader only uses the project name and the clone URL, so both the dataset files and the loading code should be simplified. The Domain and Repository URL columns are interesting but not essential, so maybe they could stay in the files as the last two columns.
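To make the proposed layout concrete, here is a minimal sketch of a one-off conversion that keeps only the four columns named above and reorders them. It is written in Python purely for illustration; `simplify_table`, `to_markdown`, and `KEPT_COLUMNS` are hypothetical names, and the column titles are assumed to match the example table in this issue.

```python
# Hypothetical target layout: the two columns the loader needs come first,
# the two "interesting but not essential" columns come last.
KEPT_COLUMNS = ["Project name", "Clone URL", "Domain", "Repository URL"]


def simplify_table(header, rows):
    """Keep only KEPT_COLUMNS, in that order, and drop every other column."""
    indices = [header.index(name) for name in KEPT_COLUMNS]
    return list(KEPT_COLUMNS), [[row[i] for i in indices] for row in rows]


def to_markdown(header, rows):
    """Render a header and its rows as the lines of a markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(" --- " for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return lines
```

Whether the optional columns should be kept at all is still open; dropping them would only mean shortening `KEPT_COLUMNS`.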

Also, except for line 2 of the file (the header separator row), a markdown file containing just a single table like this is effectively a CSV file that uses `|` as the separator instead of `,` or `;`. So maybe we could reuse our CSV IO classes here.
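The observation above can be sketched with a generic CSV reader. This is only an illustration in Python using the standard `csv` module, not the project's own CSV IO classes (which are not shown in this issue); `load_datasets` is a hypothetical name, and the sketch assumes each table row may start and end with a single `|`.

```python
import csv


def load_datasets(markdown_text):
    """Return (project name, clone URL) pairs from a single-table markdown file."""
    # Drop blank lines and at most one leading and one trailing '|' per row,
    # so that splitting on '|' does not yield empty edge cells.
    lines = [
        line.strip().removeprefix("|").removesuffix("|")
        for line in markdown_text.splitlines()
        if line.strip()
    ]
    # Treat the remaining text as CSV with '|' as the separator.
    rows = [[cell.strip() for cell in row] for row in csv.reader(lines, delimiter="|")]
    # rows[1] is line 2 of the file, the '---' separator row, so it is skipped.
    header, body = rows[0], rows[2:]
    name, url = header.index("Project name"), header.index("Clone URL")
    return [(row[name], row[url]) for row in body]
```

Looking columns up by header name rather than by position means the same loader would accept both the current files and a reordered, shorter layout.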

@pmbittner added the Good First Issue Low hanging fruits label on Jan 27, 2024