---
layout: page
title: Querying MongoDB programmatically
tagline:
---
While the GHTorrent project offers downloadable versions of the MongoDB raw dataset, downloading and restoring them to MongoDB can be very time-consuming. For this reason, we have created a publicly available version of the data as it is collected by our main MongoDB server. The only prerequisites are a MongoDB client (command line, graphical, or program library) and SSH installed on your machine.
- To obtain access, please send us your public key as described here.
- When we contact you back, you will be able to set up an SSH tunnel with the
  following command: `ssh -L 27017:dutihr.st.ewi.tudelft.nl:27017 [email protected]`.
  Keep in mind that no shell will be allocated in the open SSH session.
- You will then be able to connect to our server using the command:
  `mongo -u ghtorrentro -p ghtorrentro github`.
Here is an example session:

{% highlight bash %}
$ ssh -L 27017:dutihr.st.ewi.tudelft.nl:27017 [email protected]
PTY allocation request failed on channel 2

$ mongo -u ghtorrentro -p ghtorrentro github
MongoDB shell version: 3.0.3
connecting to: github
> db.events.count()
401209493
> db.commits.count()
311041915
{% endhighlight %}
Have a look here.
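Beyond the interactive shell, the tunnel can be used from a program as well. Below is a minimal sketch, not part of the official instructions: the `ghtorrent_uri` helper is our own name, and it simply assembles the connection URI for the tunneled server from the read-only credentials shown above.

```python
def ghtorrent_uri(user="ghtorrentro", password="ghtorrentro",
                  host="localhost", port=27017, db="github"):
    # The SSH tunnel forwards localhost:27017 to dutihr.st.ewi.tudelft.nl:27017,
    # so clients connect to localhost with the read-only credentials.
    return "mongodb://%s:%s@%s:%d/%s" % (user, password, host, port, db)

# With the tunnel up and the third-party pymongo driver installed,
# connecting would look roughly like:
#   from pymongo import MongoClient
#   client = MongoClient(ghtorrent_uri())
#   db = client.github
#   print(db.events.count())
```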
Due to its heavy load, the MongoDB server cannot process non-indexed field searches within the 100-second time limit. To address this, we recommend querying MySQL first to get references to the data you want, and then using MongoDB to retrieve the raw data.

Below are the fields that MongoDB uses as indexes. Make sure your query hits those; otherwise querying will be extremely slow (and will overload our server as well).
<script src="http://gist-it.appspot.com/https://github.com/gousiosg/github-mirror/blob/master/lib/ghtorrent/adapters/mongo_persister.rb?slice=21:41"> </script>
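To illustrate, a query against the `commits` collection should filter on an indexed field such as `sha`. A hypothetical helper (our own name, not part of any GHTorrent API) that builds such a filter:

```python
def commits_by_sha(shas):
    # An $in filter on `sha` hits the index on the commits collection,
    # avoiding a collection scan that would exceed the 100-second limit.
    return {"sha": {"$in": list(shas)}}

# With an open connection `db` (see the session above), usage would be:
#   for doc in db.commits.find(commits_by_sha(some_shas)):
#       ...
```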
- The hosting machine, while powerful, is not capable of processing the data very quickly. At the time of this writing, the data is more than 10TB.
- Other people may be using the machine as well. Make sure that you do not run very heavy queries. It is better to run many small queries (e.g. in a loop) than aggregation queries. Make sure you only query on indexed fields.
- Queries running in excess of 100 seconds are killed without any warning.
- At any time, the machine may become unavailable.
- Some data may be missing; if you are willing to provide workers to collect it, please contact us.
- The data is provided in kind to help other people do research. Please do not abuse the service.
- The data is offered as is, without any explicit or implicit quality or service guarantee on our part.
- All operations are logged for security purposes.
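The advice above (many small queries instead of one big aggregation) can be sketched as follows. This is only an illustration: `chunked` is a hypothetical helper, and the `shas` list is assumed to have come from a prior MySQL query for references.

```python
def chunked(items, size=100):
    # Yield successive small batches so that each MongoDB query stays
    # cheap and finishes well under the 100-second limit.
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Recommended pattern: get references from MySQL first, then fetch the
# raw data from MongoDB in small batches on an indexed field:
#   for batch in chunked(shas, 100):
#       for doc in db.commits.find({"sha": {"$in": batch}}):
#           process(doc)
```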