layout | title | tagline |
---|---|---|
page | The GHTorrent project | |
{% include JB/setup %}
Welcome to the GHTorrent project, an effort to create a scalable, queryable, offline mirror of data offered through the GitHub REST API.
Follow @ghtorrent on Twitter for project updates and exciting research done with GHTorrent.
## What does GHTorrent do?
GHTorrent monitors the GitHub public event timeline. For each event, it exhaustively retrieves the event's contents and their dependencies. It then stores the raw JSON responses in a MongoDB database, while also extracting their structure into a MySQL database.
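The store-then-extract step described above can be sketched roughly as follows. This is a minimal illustration, not GHTorrent's actual code: a plain dict stands in for the MongoDB collection, an in-memory SQLite database stands in for MySQL, and the `events` table layout and the `process_event` function are hypothetical. Dependency retrieval is left out.

```python
import json
import sqlite3

# Stand-ins (assumptions): a dict plays the role of the MongoDB collection,
# and an in-memory SQLite database plays the role of MySQL.
raw_store = {}
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE events (id TEXT PRIMARY KEY, type TEXT, actor TEXT, repo TEXT)"
)


def process_event(raw_json):
    """Keep the raw JSON document and extract a structured row from it."""
    event = json.loads(raw_json)
    raw_store[event["id"]] = raw_json  # raw response, stored untouched
    db.execute(
        "INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)",
        (event["id"], event["type"], event["actor"]["login"], event["repo"]["name"]),
    )
    db.commit()
    return event["id"]


# A hand-written sample shaped like a GitHub event (values are made up).
sample = json.dumps({
    "id": "1",
    "type": "PushEvent",
    "actor": {"login": "octocat"},
    "repo": {"name": "octocat/Hello-World"},
})
process_event(sample)
print(db.execute("SELECT type, actor, repo FROM events").fetchall())
```

Keeping the raw document next to the extracted row means the relational data can be rebuilt later if the extraction logic changes.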
GHTorrent works in a distributed manner. A RabbitMQ message queue sits between the event mirroring and data retrieval phases, so that both can be run on a cluster of machines. Have a look at this presentation and read this paper if you want to know more. Here is the source code.
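To illustrate how a queue decouples the two phases, here is a small sketch under stated assumptions: Python's `queue.Queue` stands in for RabbitMQ, threads stand in for machines in a cluster, and the names `mirror_events` and `retrieve_worker` are hypothetical. It shows only the producer/consumer shape, not the project's real implementation.

```python
import queue
import threading

# Assumption: queue.Queue stands in for the RabbitMQ queue; in a real
# deployment the two phases would run as separate processes on a cluster.
event_queue = queue.Queue()
retrieved = []
retrieved_lock = threading.Lock()
STOP = object()  # sentinel telling a worker to exit


def mirror_events(event_ids):
    """Event mirroring phase: publish each observed event to the queue."""
    for event_id in event_ids:
        event_queue.put(event_id)


def retrieve_worker():
    """Data retrieval phase: consume events and record them as handled."""
    while True:
        item = event_queue.get()
        if item is STOP:
            break
        # A real worker would fetch the event's contents from the API here.
        with retrieved_lock:
            retrieved.append(item)


workers = [threading.Thread(target=retrieve_worker) for _ in range(3)]
for w in workers:
    w.start()

mirror_events(range(10))
for _ in workers:
    event_queue.put(STOP)  # one sentinel per worker
for w in workers:
    w.join()

print(sorted(retrieved))
```

Because the producer only writes to the queue and the workers only read from it, either side can be scaled out independently.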
The project releases the collected data as downloadable archives.
Currently (Jan 2015), MongoDB stores around 4TB of JSON data (compressed), while MySQL holds more than 1.5 billion rows of extracted metadata. A large part of the activity of 2012, 2013, 2014 and 2015 has been retrieved, and we are also going backwards to retrieve the full recorded history of important projects.
GHTorrent needs contributions on the following fronts:
- API keys: We can run multiple GHTorrent worker instances concurrently. To work around GitHub's API rate limit, we need multiple GitHub API keys provided by users. If you use GHTorrent for your research, please consider donating a key.
- Linking and analysis: GHTorrent currently does only limited analysis and linking within the dataset (user geolocation). There are many possibilities for expansion. One could, for example, think of linking commits to issues.
- Reporting bugs: Please use GitHub's issue tracker here to report any data consistency issues you have found.
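As one illustration of the linking idea mentioned in the list above, commit messages could be scanned for issue references. The sketch below assumes that commits refer to issues as `#<number>`, optionally preceded by one of GitHub's closing keywords; the function name `link_commit_to_issues` is hypothetical and this is not part of GHTorrent. Note that issues and pull requests share a numbering space on GitHub, so a real linker would still need to check what each number refers to.

```python
import re

# Assumption: commits reference issues with "#<number>", optionally preceded
# by one of GitHub's closing keywords (close/fix/resolve and their variants).
ISSUE_REF = re.compile(
    r"(?:\b(close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+)?#(\d+)",
    re.IGNORECASE,
)


def link_commit_to_issues(message):
    """Return (issue_number, closes) pairs referenced in a commit message."""
    return [(int(num), bool(keyword)) for keyword, num in ISSUE_REF.findall(message)]


print(link_commit_to_issues("Fixes #12 and refactors parser, see #7"))
```

The boolean in each pair distinguishes references that claim to close an issue from plain mentions.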
We are doing research on software repositories. GitHub is an exciting new data source for us, one that solves several of the problems we face as data miners. The uniformity of the data will allow research to scale to hundreds or thousands of repositories spanning multiple languages and application domains.
Initially, the project offered the data through the BitTorrent network (gh: from GitHub, torrent: from BitTorrent). Since the data is currently offered only through HTTP, the name now signifies a torrent of data coming from GitHub.
Have a look at the following presentation for a short introduction.
If you find this dataset useful and want to use it in your work, please cite the following paper:
Georgios Gousios: The GHTorrent dataset and tool suite. MSR 2013: 233-236
{%highlight text%}
@inproceedings{Gousi13,
  author = {Gousios, Georgios},
  title = {The GHTorrent dataset and tool suite},
  booktitle = {Proceedings of the 10th Working Conference on Mining Software Repositories},
  series = {MSR '13},
  year = {2013},
  isbn = {978-1-4673-2936-1},
  location = {San Francisco, CA, USA},
  pages = {233--236},
  numpages = {4},
  url = {http://dl.acm.org/citation.cfm?id=2487085.2487132},
  acmid = {2487132},
  publisher = {IEEE Press},
  address = {Piscataway, NJ, USA},
}
{%endhighlight%}