Mining Software Engineering Data from GitHub

by Georgios Gousios and Diomidis Spinellis

This repository contains the code and slides for a tutorial delivered at the 39th International Conference on Software Engineering.

- Tutorial description
  - Green open access pre-print
  - Publisher's page
- Tutorial notes (Slides in print format)

Abstract

GitHub is the largest collaborative source code hosting site built on top of the Git version control system. The availability of a comprehensive API has made GitHub a target for many software engineering and online collaboration research efforts. In our work, we have discovered that a) obtaining data from GitHub is not trivial, b) the data may not be suitable for all types of research, and c) improper use can lead to biased results. In this tutorial, we analyze how data from GitHub can be used for large-scale, quantitative research, while avoiding common pitfalls. We use the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples from and present pitfall avoidance strategies.
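As a rough illustration of point (c), the sketch below shows how one modelling choice can change a result when querying a relational mirror of GitHub metadata. It is not taken from the tutorial material. The in-memory table, its rows, and the helper `count_by_language` are hypothetical, and the `projects` table with a `forked_from` column is an assumption about how such a mirror might represent forks.

```python
import sqlite3

# Hypothetical toy data: the table layout (projects, forked_from) is an
# assumption for illustration, not a guaranteed schema of any real dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT, "
    "language TEXT, forked_from INTEGER)"
)
conn.executemany(
    "INSERT INTO projects VALUES (?, ?, ?, ?)",
    [
        (1, "alpha", "Ruby", None),   # original repository
        (2, "alpha", "Ruby", 1),      # fork of project 1
        (3, "alpha", "Ruby", 1),      # another fork of project 1
        (4, "beta", "Python", None),  # original repository
    ],
)


def count_by_language(conn, exclude_forks):
    """Count projects per language, optionally ignoring forks."""
    where = "WHERE forked_from IS NULL" if exclude_forks else ""
    rows = conn.execute(
        f"SELECT language, COUNT(*) FROM projects {where} "
        "GROUP BY language ORDER BY language"
    ).fetchall()
    return dict(rows)


naive = count_by_language(conn, exclude_forks=False)
filtered = count_by_language(conn, exclude_forks=True)
print(naive)     # forks are counted as separate projects
print(filtered)  # only original repositories are counted
```

In this toy data, counting every row makes Ruby look three times as popular as Python, while counting only original repositories puts the two on equal footing. Which count is appropriate depends on the research question.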

Bibtex record

@inproceedings{GS17,
  author = {Gousios, Georgios and Spinellis, Diomidis},
  title = {Mining Software Engineering Data from GitHub},
  booktitle = {Proceedings of the 39th International Conference on Software Engineering Companion},
  series = {ICSE-C '17},
  year = {2017},
  isbn = {978-1-5386-1589-8},
  location = {Buenos Aires, Argentina},
  pages = {501--502},
  numpages = {2},
  doi = {10.1109/ICSE-C.2017.164},
  publisher = {IEEE Press},
  address = {Piscataway, NJ, USA},
  keywords = {GHTorrent, GitHub, empirical software engineering, git},
  url = {/pub/mining-soft-eng-data-github.pdf},
  github = {ghtorrent/tutorial}
}
