Study of research software in repositories. Contact: @karacolada
We want to investigate how research software projects change over time, how they evolve, and how they differ between disciplines by analysing relevant software repositories. This will help us better understand ongoing processes in the research software community and how they can be supported. It will also supply evidence about which practices help build and maintain software with wider community engagement.
We consider multiple approaches to finding relevant software repositories:
- existing datasets about software mentions in research (e.g. CZI Software Mentions, Softcite Dataset)
- crawling publicly available publications for in-text Git(Hub) links (e.g. ePrints)
The following is not meant to be a fixed set of things we want to find, but rather some ideas that may help guide what data is worth collecting. We hypothesise that
- research software repositories evolve in four stages:
  - no engagement: sparse commits, no issues, few authors, no license, no DOI citation
  - publication: DOI, license, usage guidelines, some watchers/stars, some issues created and resolved by repository maintainers
  - low engagement: external users create issues, maintainers resolve issues, forks
  - community engagement: external users create and resolve issues, merge requests
- research software repositories that employ good practices reach higher stages (or reach them earlier)
This list of indicators is meant for brainstorming. Not all data listed here will be collected in the end.
- Issues: can be queried using the GitHub API and naturally show the repository's timeline. Interesting first data points might be creation date, creator, resolver, and closing date. Interesting future data points might be labels and linked merge requests.
- Commits: can be queried using the GitHub API and naturally show the repository's timeline. Rather than the commit contents, we are more interested in the metadata (author, date) to deduce commit frequency and team dynamics.
- Content history: we might be able to gather this data from Git commits to specific files (license, README, contributing guidelines). Interesting data points might be the date when each file was added and, for the README, section titles and when they were updated, as well as the presence of a DOI and whether that DOI is CrossRef (paper) or DataCite (dataset, software).
- Engagement history: forks, watchers, stars. The GitHub API only returns the current counts, but GitHub Archive or GHTorrent might be a helpful resource.
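As a sketch of how the issue timeline could be collected via PyGithub, the record below follows the "interesting first data points" listed above; the repository name and token handling in the usage comment are placeholders:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IssueRecord:
    """The issue metadata relevant to a repository's timeline."""
    number: int
    created_at: str
    author: str
    closed_at: Optional[str]
    closer: Optional[str]

def issue_to_record(issue):
    """Flatten a PyGithub-style issue object into an IssueRecord."""
    return IssueRecord(
        number=issue.number,
        created_at=issue.created_at.isoformat(),
        author=issue.user.login,
        closed_at=issue.closed_at.isoformat() if issue.closed_at else None,
        closer=issue.closed_by.login if issue.closed_by else None,
    )

# Usage against the live API (needs an access token, see below):
#   from github import Github
#   repo = Github(token).get_repo("owner/name")
#   records = [issue_to_record(i) for i in repo.get_issues(state="all")]
```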
To contextualise the results, we record information about the initial publication (title, author, ...). This could later be used to find the publication on CrossRef etc. and collect further information such as:
- affiliation
- whether the software's developers appear as authors of the publication
- topic
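A CrossRef lookup for these fields could look roughly like the sketch below. Field names follow CrossRef's public REST API schema; note that the `subject` field is at best a rough proxy for topic:

```python
import json
from urllib.request import urlopen

def crossref_metadata(doi):
    """Fetch work metadata for a DOI from the public CrossRef REST API."""
    with urlopen(f"https://api.crossref.org/works/{doi}", timeout=30) as resp:
        return json.load(resp)["message"]

def extract_fields(message):
    """Pull the contextual fields listed above out of a CrossRef record."""
    authors = [
        {
            "name": " ".join(filter(None, [a.get("given"), a.get("family")])),
            "affiliations": [aff.get("name") for aff in a.get("affiliation", [])],
        }
        for a in message.get("author", [])
    ]
    return {
        "title": (message.get("title") or [None])[0],
        "authors": authors,
        "subjects": message.get("subject", []),
    }
```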
Clone this repository, then fetch the submodules:

```shell
cd rse-repo-analysis
git submodule init
git submodule update
```
Most folders contain their own README file summarising their contents and how the scripts should be used. More details can be found in the wiki.
src
: any source code for the main bulk of this work (collecting and analysing RSE GitHub repositories)

data
: data used for the main bulk of this work

tex
: preliminary report on the main bulk of this work

software-mentions
: submodule containing a fork of the Chan Zuckerberg Initiative's Software Mentions Repository

SSI-notebooks
: our own scripts handling the CZI Software Mentions dataset
As this project is developed using Python, it is recommended to set up a new virtual environment, e.g. using Conda.
Inside the environment, install the following packages, or use `environment.yml`:

- `pandas`
- `Jupyter`
- `matplotlib`
- `PyGithub` for access to the GitHub API
- `lxml` for parsing XMLs
- `pdfminer.six` for parsing PDFs
  - not to be confused with `pdfminer`, which is no longer actively maintained
- `pyspark`
  - not to be confused with `pySpark`
  - depends on Java (e.g. OpenJDK)
  - depends on `pyarrow`
- `emoji`
- `Levenshtein`
- `unidecode`
- `pydriller`
- `wordcloud`
- `seaborn`
- `tol_colors`
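For orientation, an `environment.yml` covering the list above would take roughly this shape. The channel choice, Python version, and conda/pip split here are assumptions; the `environment.yml` shipped in the repository is authoritative:

```yaml
name: rse-repo-analysis
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas
  - jupyter
  - matplotlib
  - pyspark
  - pyarrow
  - seaborn
  - wordcloud
  - pip
  - pip:
      - PyGithub
      - lxml
      - pdfminer.six
      - emoji
      - Levenshtein
      - unidecode
      - pydriller
      - tol_colors
```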
It is advised to create an access token to authenticate with the GitHub API; otherwise, you will quickly run into the rate limit. You can create a token in your GitHub settings (https://github.com/settings/tokens).
Scripts making use of the GitHub API in this project will usually check for a file called `config.cfg` and expect your access token to be in there. As access tokens should be kept secret, files named `config.cfg` are not tracked by Git. To provide the code with your access token, simply create a copy of `config_example.cfg`, fill in your data, and rename the copy to `config.cfg`. This file should be located in the root directory of this repository, but you might also need a copy of it in the `software-mentions` directory if you want to work with the code in the submodule.
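Reading the token with the standard library's `configparser` might look like the sketch below. The section and key names are assumptions; check `config_example.cfg` for the names the scripts in this repository actually expect:

```python
import configparser

def load_token(path="config.cfg", section="ACCESS", key="token"):
    """Read a GitHub access token from an INI-style config file.

    The section and key names default to placeholders; adjust them to
    match config_example.cfg.
    """
    config = configparser.ConfigParser()
    if not config.read(path):
        raise FileNotFoundError(f"no config file found at {path}")
    return config[section][key]
```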
Depending on what you are trying to do, you will need to download datasets and place them in the correct spot of the repository. This might be changed to configurable paths in the future, but for now, that is out of scope.
The code in `software-mentions` expects the CZI dataset in its root directory, i.e. `software-mentions/data`. You can download the dataset from the CZI Software Mentions repository and extract it into the correct location.
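A quick way to check that the extraction landed in the right place is to list the files; the parquet hint in the comment is an assumption based on the `pyarrow` dependency above, not a statement about the release's actual file names:

```python
from pathlib import Path

def list_dataset_files(data_dir="software-mentions/data"):
    """Return the names of the extracted dataset files, sorted."""
    return sorted(p.name for p in Path(data_dir).glob("*") if p.is_file())

# Once you know which table you need, it can typically be loaded with
# pandas (pyarrow handles parquet files), e.g.:
#   import pandas as pd
#   df = pd.read_parquet("software-mentions/data/<file>.parquet")
```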
Here, we list some works that we make use of.
Istrate, A. M., Li, D., Taraborelli, D., Torkar, M., Veytsman, B., & Williams, I. (2022). A large dataset of software mentions in the biomedical literature. arXiv preprint arXiv:2209.00693.