The Open Archives Initiative (OAI) river allows to harvest metadata from OAI data providers into ElasticSearch. It is provided as a plugin.
The river allows to automatically harvest RDF / DC / XML metadata from OAI data providers (OAI servers). Metadata is internally represented as resources, using Resource Description Framework of the W3C Semantic Web Initiative. These resources are serialized into JSON for Elasticsearch schema-less indexing.
Setting it up is as simple as executing the following against ElasticSearch:
curl -XPUT 'localhost:9200/_river/my_oai_harvester_1/_meta' -d '{
"type" : "oai",
"oai" : {
"url" : "http://pub.uni-bielefeld.de",
"set" : "bi",
"metadataPrefix" : "oai_dc"
}
}'
This call will create a river that harvests all the Dublin Core metadata within the data provider pub.uni-bielefeld.de
within the set bi
.
In case of a failover, the OAI river will automatically be started on another ElasticSearch node, and continue indexing from the last OAI harvest state.
Many OAI rivers can run in parallel. A list of OAI servers can be found [here] (http://www.openarchives.org/Register/BrowseSites)
Prerequisites:
-
Elasticsearch 0.90.0.Beta1+
-
Java 7 Runtime Environment
Binaries are available at Bintray:
https://bintray.com/pkg/show/general/jprante/elasticsearch-plugins/elasticsearch-river-oai
The current version is 2.1.0 (17 Mar 2013)
`bin/plugin -url http://bit.ly/YAkqeO -install elasticsearch-river-oai`
The Maven project site is here
The Javadoc API can be found here
Each OAI server describes itself with the verbs Identify
, ListSets
, and ListMetadataFormats
. Harvesting OAI servers is possible by specifying date periods. If a river is configured with a date period, e.g.
curl -XPUT 'localhost:9200/_river/my_oai_harvester_1/_meta' -d '{
"type" : "oai",
"oai" : {
"url" : "http://pub.uni-bielefeld.de",
"set" : "bi",
"metadataPrefix" : "oai_dc",
"from" : "2012-05-01T00:00:00Z",
"until" : "2012-05-02T00:00:00Z"
}
}'
This will harvest the metadata with a server defined date between 2012-05-01T00:00:00Z and 2012-05-02T00:00:00Z.
Resumption tokens are handled automatically.
If the period is harvested, the date period is increased automatically. Harvesting is throttled when the time period exceeds the current date.
To delay harvesting between time periods, you can give a parameter poll
. poll
is a Elasticsearch TimeValue type. This is convenient if updates are rare.
curl -XPUT 'localhost:9200/_river/my_oai_harvester_1/_meta' -d '{
"type" : "oai",
"oai" : {
"url" : "http://pub.uni-bielefeld.de",
"set" : "bi",
"metadataPrefix" : "oai_dc",
"from" : "2012-05-01T00:00:00Z",
"until" : "2012-05-02T00:00:00Z",
"poll" : "5m"
}
}'
Each river can index into a specified index. Example:
curl -XPUT 'localhost:9200/_river/my_oai_harvester_1/_meta' -d '{
"type" : "oai",
"oai" : {
"url" : "http://pub.uni-bielefeld.de",
"set" : "bi",
"metadataPrefix" : "oai_dc",
"from" : "2012-05-01T00:00:00Z",
"until" : "2012-05-02T00:00:00Z",
"poll" : "5m"
},
"index" : {
"index" : "metadata",
"type" : "oai"
}
}'
Bulk indexing is automatically done in order to speed up the indexing process. Each OAI response will be indexed by a single bulk. A bulk size can be defined, also a maximum size of active bulk requests to cope with high load situations. A bulk timeout defines the time period after which bulk feeds continue.
curl -XPUT 'localhost:9200/_river/my_oai_harvester_1/_meta' -d '{
"type" : "oai",
"oai" : {
"url" : "http://pub.uni-bielefeld.de",
"set" : "bi",
"metadataPrefix" : "oai_dc",
"proxyhost" : "proxyserver.domain",
"proxyport" : 3128
},
"index" : {
"index" : "metadata",
"type" : "oai",
"bulk_size" : 100,
"max_bulk_requests" : 30
}
}'
A HTTP proxy can be defined with proxyhost
and proxyport
.
curl -XPUT 'localhost:9200/_river/my_oai_harvester_1/_meta' -d '{
"type" : "oai",
"oai" : {
"url" : "http://pub.uni-bielefeld.de",
"set" : "bi",
"metadataPrefix" : "oai_dc",
"proxyhost" : "proxyserver.domain",
"proxyport" : 3128
}
}'
OAI servers can respond with a variety of error conditions. If an error occurs, the OAI river is stopped. Then, it must be started again by deleting the river and re-adjusting the harvest date period manually.
Stopping the river:
curl -XDELETE 'localhost:9200/_river/my_dnb_oai_river/'
Starting the river:
curl -XPUT 'localhost:9200/_river/my_dnb_oairiver/_meta' -d '{
"type" : "oai",
"oai" : {
"url" : "http://services.dnb.de/oai/repository",
"set" : "authorities",
"metadataPrefix" : "RDFxml",
"from" : "2012-02-24T00:00:00Z",
"until" : "2012-02-25T00:00:00Z",
"poll" : "15s",
"proxyhost" : "localhost",
"proxyport" : 3128
},
"index" : {
"index" : "dnb",
"type" : "authorities"
}
}'
The following example illustrates an indexed Dublin Core resource from Universität zu Köln.
{
"_source" : {
"dc:creator" : "Hölscher, Georg",
"dc:format" : "image/jpg",
"dc:identifier" : [
"RHG2077",
"823532",
"HT008931035",
"http://www.ub.uni-koeln.de/permalink/2012/01/rheinmono,8522"
],
"dc:title" : "Köln am Rhein",
"dc:type" : "Image",
"dc:date" : "1903"
},
"_score" : 1,
"_index" : "oai",
"_id" : "oai:contentdm.ub.uni-koeln.de:rheinmono/8522",
"_type" : "default"
}
The following example illustrates an indexed RDF resource from Deutsche Nationalbibliothek.
{
"_score" : 1,
"_index" : "dnb",
"_id" : "http://d-nb.info/gnd/1019116501",
"_type" : "authorities"
"_source" : {
"gnd:biographicalOrHistoricalInformation" : "von Mittweida, hohe Schule in Leipzig (Universität?)",
"gnd:publication" : "Den Anfang die Früchte der Weisheit zu sammlen, als der Wohledle und Wohlgelahrte Herr, Herr Johann August König, von Mittweyda, auf der hohen Schule zu Leipzig den 25 Tag des Hornungs 1751 auf einem philosophischen Ehrentage die Meisterwürde rühmlich erlangte, besinget Ul*. - 1751",
"rdf:type" : "http://d-nb.info/standards/elementset/gnd#DifferentiatedPerson",
"gnd:variantNameEntityForThePerson" : [
{
"gnd:surname" : "König",
"gnd:forename" : "Johannes August"
},
{
"gnd:surname" : "König",
"gnd:forename" : "Johannes Augustus"
}
],
"gnd:oldAuthorityNumber" : "(DE-588a)1019116501",
"gnd:gndIdentifier" : "1019116501",
"gnd:preferredNameEntityForThePerson" : {
"gnd:surname" : "König",
"gnd:forename" : "Johann August"
},
"gnd:gender" : "http://d-nb.info/standards/vocab/gnd/Gender#male",
"gnd:periodOfActivity" : "1751",
"gnd:variantNameForThePerson" : [
"König, Johannes August",
"König, Johannes Augustus"
],
"gnd:preferredNameForThePerson" : "König, Johann August"
}
}
The OAI river is using the OAI module of the xbib framework, an upcoming effort to build scalable and efficient bibliographic semantic web infrastructures, which will be released soon as open source.