Luigi for orchestration (but trying to keep task code small, cf. bl.uk). Structured paths for data artifacts. Scheduling with cron.
Data acquisition and processing results in three (kinds of) sqlite3 databases:
- a) one id-doi "mapping" database
- b) one "citations" database (doi-doi)
- c) one or more index "metadata" databases (id-doc)
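A minimal sketch of how the three kinds of databases could be combined to answer a lookup. All table names, column names and sample rows are assumptions made up for illustration; the real schemas may differ.

```python
import json
import sqlite3

# NOTE: table and column names below are hypothetical, not the project's schema.

# a) id-doi mapping
id_doi = sqlite3.connect(":memory:")
id_doi.execute("CREATE TABLE id_doi (id TEXT, doi TEXT)")
id_doi.executemany(
    "INSERT INTO id_doi VALUES (?, ?)",
    [("ai-49-x", "10.1/a"), ("ai-49-y", "10.1/b")],
)

# b) doi-doi citation edges (source cites target)
citations = sqlite3.connect(":memory:")
citations.execute("CREATE TABLE citations (source TEXT, target TEXT)")
citations.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [("10.1/a", "10.1/b"), ("10.1/a", "10.1/unknown")],
)

# c) id-doc index metadata, one JSON document per id
metadata = sqlite3.connect(":memory:")
metadata.execute("CREATE TABLE docs (id TEXT, doc TEXT)")
metadata.execute(
    "INSERT INTO docs VALUES (?, ?)",
    ("ai-49-y", json.dumps({"id": "ai-49-y", "title": "Example B"})),
)


def cited_documents(doi):
    """Return (matched docs, unmatched DOIs) for the works cited by `doi`."""
    matched, unmatched = [], []
    query = "SELECT target FROM citations WHERE source = ?"
    for (target,) in citations.execute(query, (doi,)):
        # DOI -> local id, then local id -> metadata document
        row = id_doi.execute(
            "SELECT id FROM id_doi WHERE doi = ?", (target,)
        ).fetchone()
        doc = None
        if row is not None:
            doc = metadata.execute(
                "SELECT doc FROM docs WHERE id = ?", (row[0],)
            ).fetchone()
        if doc is not None:
            matched.append(json.loads(doc[0]))
        else:
            unmatched.append(target)
    return matched, unmatched
```

Calling `cited_documents("10.1/a")` walks the citation edges first, then resolves each cited DOI through the mapping and metadata databases; DOIs without a local record end up in the unmatched list.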
Currently, we use a FetchGroup over multiple sqlite3 databases, but the interface would allow the use of different local or remote backing stores. A server assembles fused results from these databases on the fly, caches expensive requests, and builds JSON responses. Cache warming can be a one-liner that runs in parallel and in the background:
$ time zstdcat -T0 /usr/share/labe/data/OpenCitationsRanked/current | \
awk '{print $2}' | \
head -300000 | \
shuf | \
parallel -j 32 -I {} "curl -sL 'http://localhost:8000/doi/{}'" > /dev/null
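The FetchGroup idea mentioned above (one lookup interface over several backing stores) might look roughly like the following. This is a sketch under assumptions, not the project's implementation: only the name FetchGroup comes from the text, while `Fetcher`, `SqliteFetcher`, `DictFetcher`, `make_sqlite_backend` and the `docs(id, doc)` table are hypothetical.

```python
import sqlite3
from typing import Iterable, Optional, Protocol


class Fetcher(Protocol):
    """Hypothetical interface: map an id to a document blob, or None."""

    def fetch(self, id_: str) -> Optional[str]: ...


class SqliteFetcher:
    """Reads from one sqlite3 database with an assumed docs(id, doc) table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def fetch(self, id_: str) -> Optional[str]:
        row = self.conn.execute(
            "SELECT doc FROM docs WHERE id = ?", (id_,)
        ).fetchone()
        return row[0] if row else None


class DictFetcher:
    """Stand-in for a different backing store, e.g. a remote service."""

    def __init__(self, docs: dict) -> None:
        self.docs = docs

    def fetch(self, id_: str) -> Optional[str]:
        return self.docs.get(id_)


class FetchGroup:
    """Asks each backend in order and returns the first hit."""

    def __init__(self, backends: Iterable[Fetcher]) -> None:
        self.backends = list(backends)

    def fetch(self, id_: str) -> Optional[str]:
        for backend in self.backends:
            doc = backend.fetch(id_)
            if doc is not None:
                return doc
        return None


def make_sqlite_backend(rows) -> SqliteFetcher:
    """Build an in-memory database for demonstration purposes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, doc TEXT)")
    conn.executemany("INSERT INTO docs VALUES (?, ?)", rows)
    return SqliteFetcher(conn)


group = FetchGroup([
    make_sqlite_backend([("ai-1", '{"title": "A"}')]),
    make_sqlite_backend([("ai-2", '{"title": "B"}')]),
    DictFetcher({"ai-3": '{"title": "C"}'}),
])
```

Because callers only depend on `fetch`, swapping a sqlite3 file for another store does not change the code that assembles responses.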
This way we should get a good balance between a batch and an on-the-fly approach:
- we need little preprocessing, we mostly turn CSV or SOLR JSON into sqlite databases
- we can still be fast through caching, which can be done ahead of time ("cache warming") or as data is actually requested; this is in essence the same work that a batch approach would need, but we can do it lazily (i.e. we pre-compute only about 0.5% of the results).
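The lazy caching and cache warming described in the list above can be illustrated with a small sketch. It assumes that "expensive" means a lookup took at least some threshold in seconds; the names `CachingHandler`, `min_seconds` and `warm` are hypothetical and the in-memory dict only stands in for whatever cache the server uses.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class CachingHandler:
    """Wraps a lookup function and caches results that were slow to compute."""

    def __init__(self, compute, min_seconds):
        self.compute = compute          # function: doi -> dict
        self.min_seconds = min_seconds  # assumed threshold for "expensive"
        self.cache = {}

    def get(self, doi):
        if doi in self.cache:
            return {**self.cache[doi], "cached": True}
        started = time.perf_counter()
        result = self.compute(doi)
        elapsed = time.perf_counter() - started
        if elapsed >= self.min_seconds:
            # Only keep results that were costly enough to be worth caching.
            self.cache[doi] = result
        return {**result, "cached": False}


def warm(handler, dois, workers=4):
    """Request a list of DOIs in parallel so later requests hit the cache."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so every request actually completes.
        list(pool.map(handler.get, dois))


# With a threshold of 0.0 every result is cached, which makes the demo
# deterministic; a real threshold would be larger.
handler = CachingHandler(lambda doi: {"doi": doi}, min_seconds=0.0)
warm(handler, ["10.1/a", "10.1/b"])
print(handler.get("10.1/a")["cached"])
```

Warming and on-demand requests go through the same `get` path, so the precomputed share can be chosen freely. In this simplified version two concurrent requests for the same uncached DOI may both compute the result.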
The server delivers JSON responses, which can be processed in catalog frontends.
$ curl -sL "http://localhost:8000/doi/10.1016/s0273-1177(97)00070-7" | jq .
{
  "id": "ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTAxNi9zMDI3My0xMTc3KDk3KTAwMDcwLTc",
  "doi": "10.1016/s0273-1177(97)00070-7",
  "cited": [
    {
      "author": [
        "Meljac, Claire",
        "Voyazopoulos, Robert"
      ],
      "doi_str_mv": [
        "10.4000/rechercheseducations.819"
      ],
      "format": [
        "ElectronicArticle"
      ],
      "id": "ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuNDAwMC9yZWNoZXJjaGVzZWR1Y2F0aW9ucy44MTk",
      "institution": [
        "DE-Zi4",
        "DE-14",
        "DE-Ch1",
        "DE-Gla1",
        "DE-D161",
        "DE-Brt1",
        "DE-Pl11",
        "DE-Rs1",
        "DE-82",
        "DE-D275",
        "DE-15",
        "DE-105",
        "DE-L229",
        "DE-Bn3",
        "DE-Zwi2"
      ],
      "title": "Binet, citoyen indigne ?",
      "url": [
        "http://dx.doi.org/10.4000/rechercheseducations.819"
      ]
    }
  ],
  "unmatched": {},
  "extra": {
    "took": 0.001490343,
    "unmatched_citing_count": 0,
    "unmatched_cited_count": 0,
    "citing_count": 0,
    "cited_count": 1,
    "cached": false
  }
}
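As an illustration of how a catalog frontend might process such a response, the sketch below parses a trimmed copy of the JSON above and extracts the fields a display would likely need. The function name `summarize` is hypothetical, and the `citing` key is an assumption: it does not appear in the example response, so it is read with an empty default.

```python
import json


def summarize(raw):
    """Reduce a response body (JSON string) to a few display fields."""
    payload = json.loads(raw)
    extra = payload.get("extra", {})
    return {
        "doi": payload.get("doi"),
        "cited_titles": [d.get("title") for d in payload.get("cited", [])],
        # "citing" is assumed; it is absent from the example response.
        "citing_titles": [d.get("title") for d in payload.get("citing", [])],
        "cited_count": extra.get("cited_count", 0),
        "citing_count": extra.get("citing_count", 0),
        "cached": extra.get("cached", False),
    }


# Trimmed copy of the response shown above.
sample = """
{
  "doi": "10.1016/s0273-1177(97)00070-7",
  "cited": [
    {
      "title": "Binet, citoyen indigne ?",
      "url": ["http://dx.doi.org/10.4000/rechercheseducations.819"]
    }
  ],
  "unmatched": {},
  "extra": {"citing_count": 0, "cited_count": 1, "cached": false}
}
"""

summary = summarize(sample)
print(summary["cited_titles"])
```

Reading every field with a default keeps the frontend robust when a key is missing from a response.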