Harvest dataservices #3029

Merged

72 commits merged into master from harvest_dataservices on Jun 14, 2024

Commits
9b35b43
Switch DCAT backend to not use one job for each dataset
ThibaudDauce Apr 30, 2024
cb622a8
Fix missing owner/org in new datasets
ThibaudDauce Apr 30, 2024
cdd5f01
Remove prints
ThibaudDauce Apr 30, 2024
35ad714
Refactor using two functions
ThibaudDauce Apr 30, 2024
76d99d8
Add back should_stop
ThibaudDauce Apr 30, 2024
e90ddcc
Add back autoarchive and done with failed items
ThibaudDauce Apr 30, 2024
203b394
Merge branch 'master' into sync_harvest_backend
ThibaudDauce Apr 30, 2024
5dd3c04
Always return the graphs for debug
ThibaudDauce May 2, 2024
d8eaf45
Add test for stopping due to HARVEST_MAX_ITEMS
ThibaudDauce May 2, 2024
ebf2af7
Update test backends
ThibaudDauce May 14, 2024
a2701af
Fix some tests
ThibaudDauce May 14, 2024
a163b70
Revert ID change for FakeBackend
ThibaudDauce May 14, 2024
3693bd2
Simplify SyncBackend
ThibaudDauce May 14, 2024
d5c0e29
Merge branch 'master' into sync_harvest_backend
maudetes May 21, 2024
a71b46e
fix wrong remote_id
ThibaudDauce May 21, 2024
914b69d
Merge branch 'master' into sync_harvest_backend
ThibaudDauce May 23, 2024
fc98e28
Remove dead code
ThibaudDauce May 23, 2024
20ce7e4
update comment
ThibaudDauce May 23, 2024
c3c4c27
Merge branch 'master' into sync_harvest_backend
ThibaudDauce May 27, 2024
b9b41e2
Move docstring
ThibaudDauce May 27, 2024
0a6dfa5
Switch is_done() to its own function
ThibaudDauce May 27, 2024
523c754
Rename process_datasets method
ThibaudDauce May 27, 2024
064f8eb
yield instead of callback
ThibaudDauce May 27, 2024
263c3a3
fix other backends
ThibaudDauce May 27, 2024
6b6d11e
Merge branch 'master' into sync_harvest_backend
ThibaudDauce May 29, 2024
9124fc0
Update changelog
ThibaudDauce May 29, 2024
e9b9af1
Remove unused process method
ThibaudDauce May 29, 2024
318603d
Skip dataset without remote_id
ThibaudDauce May 30, 2024
e1b1bc0
Harvest dataservices
ThibaudDauce Apr 29, 2024
d536874
Add more data inside dataservices
ThibaudDauce Apr 29, 2024
ce30a60
Revert hvd labels from rebase
ThibaudDauce May 16, 2024
3e4d901
Fix tests
ThibaudDauce May 16, 2024
ae02115
Remove duplicate code
ThibaudDauce May 27, 2024
345f05b
wip add printing graphs
ThibaudDauce May 29, 2024
735f6d5
Remove duplicate harvest update info
ThibaudDauce May 29, 2024
b83d697
Add link between dataservices and datasets
ThibaudDauce May 30, 2024
29cede6
link dataset with dataservices
ThibaudDauce May 30, 2024
2c7a24d
Use forked XSLT
ThibaudDauce Jun 3, 2024
23d8111
Merge branch 'master' into harvest_dataservices
ThibaudDauce Jun 4, 2024
aa125aa
fix merge issues
ThibaudDauce Jun 4, 2024
d667b7b
Fix tests
ThibaudDauce Jun 4, 2024
e6816c9
Improve some skips
ThibaudDauce Jun 4, 2024
f630a25
Merge branch 'master' into harvest_dataservices
ThibaudDauce Jun 4, 2024
4eec493
Remove merge tag
ThibaudDauce Jun 4, 2024
f5c6917
Separate metadata
ThibaudDauce Jun 5, 2024
c9b9940
Add comment
ThibaudDauce Jun 5, 2024
69ec784
Add dataservice card to harvester admin
ThibaudDauce Jun 5, 2024
3403944
Revert XSLT modifications
ThibaudDauce Jun 5, 2024
7c5b0ca
Remove some blanks lines
ThibaudDauce Jun 5, 2024
f5a5108
Merge branch 'master' into harvest_dataservices
ThibaudDauce Jun 5, 2024
8e3e5ef
Update changelog
ThibaudDauce Jun 5, 2024
9a9a43b
Do not duplicate datasets on each harvesting
ThibaudDauce Jun 5, 2024
e94a23f
Merge branch 'master' into harvest_dataservices
ThibaudDauce Jun 10, 2024
5fee125
Cleanup imports
ThibaudDauce Jun 10, 2024
31bfda7
Fix changelog
ThibaudDauce Jun 10, 2024
f477ef6
Fix merge
ThibaudDauce Jun 10, 2024
8f910e6
Merge branch 'master' into harvest_dataservices
ThibaudDauce Jun 10, 2024
aba27e4
Merge branch 'master' into harvest_dataservices
ThibaudDauce Jun 11, 2024
9f9a7f8
Apply suggestions from code review
ThibaudDauce Jun 11, 2024
2a0da24
add harvest metadata to API
ThibaudDauce Jun 11, 2024
dd54158
harvest metadata as readonly
ThibaudDauce Jun 11, 2024
30ba48d
Rename last_harvested_at and add harvest.created_at
ThibaudDauce Jun 11, 2024
b083c37
Fix wrong attribute
ThibaudDauce Jun 11, 2024
fbe6f22
Do not empty the datasets list if no datasets found in harvesting
ThibaudDauce Jun 11, 2024
7863547
fix casing
ThibaudDauce Jun 11, 2024
b401700
Fix tests
ThibaudDauce Jun 11, 2024
1d6714a
Save node id if it's a URL
ThibaudDauce Jun 12, 2024
cc6b5e8
Add landing page as remote_url
ThibaudDauce Jun 13, 2024
dc1a093
Merge branch 'master' into harvest_dataservices
ThibaudDauce Jun 13, 2024
d484199
Rename rdf_node_id_as_url to follow the same name as dataset :-(
ThibaudDauce Jun 14, 2024
fa816cc
Remove dynamic to harvest dataservice metadata
ThibaudDauce Jun 14, 2024
546c1be
Update udata/core/dataservices/models.py
ThibaudDauce Jun 14, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -2,6 +2,7 @@
 
 ## Current (in progress)
 
+- Harvest dataservices [#3029](https://github.com/opendatateam/udata/pull/3029)
 - Refactor catalog exports [#3052](https://github.com/opendatateam/udata/pull/3052)
 - Add a filter to filter dataservices by dataset [#3056](https://github.com/opendatateam/udata/pull/3056)
 - Fix reuses' datasets references [#3057](https://github.com/opendatateam/udata/pull/3057)
13 changes: 13 additions & 0 deletions js/components/harvest/item.vue
@@ -34,6 +34,19 @@
                 :dataset="item.dataset">
             </dataset-card>
         </dd>
+        <dt v-if="item.dataservice">{{ _('Dataservice') }}</dt>
+        <dd v-if="item.dataservice">
+            <div class="card">
+                <div class="card-body">
+                    <h4>
+                        <a :title="item.dataservice.title" :href="item.dataservice.self_web_url">
+                            {{ item.dataservice.title | truncate 80 }}
+                        </a>
+                        <div class="clamp-3">{{{ item.dataservice.description | markdown 180 }}}</div>
+                    </h4>
+                </div>
+            </div>
+        </dd>
         <dt v-if="item.errors.length">{{ _('Errors') }}</dt>
         <dd v-if="item.errors.length">
             <div v-for="error in item.errors">
10 changes: 7 additions & 3 deletions udata/api_fields.py
@@ -70,10 +70,14 @@ def convert_db_to_field(key, field, info = {}):
         constructor_write = restx_fields.String
     elif isinstance(field, mongo_fields.EmbeddedDocumentField):
         nested_fields = info.get('nested_fields')
-        if nested_fields is None:
-            raise ValueError(f"EmbeddedDocumentField `{key}` requires a `nested_fields` param to serialize/deserialize.")
+        if nested_fields is not None:
+            constructor = lambda **kwargs: restx_fields.Nested(nested_fields, **kwargs)
+        elif hasattr(field.document_type_obj, '__read_fields__'):
+            constructor_read = lambda **kwargs: restx_fields.Nested(field.document_type_obj.__read_fields__, **kwargs)
+            constructor_write = lambda **kwargs: restx_fields.Nested(field.document_type_obj.__write_fields__, **kwargs)
+        else:
+            raise ValueError(f"EmbeddedDocumentField `{key}` requires a `nested_fields` param to serialize/deserialize or a `@generate_fields()` definition.")
 
-        constructor = lambda **kwargs: restx_fields.Nested(nested_fields, **kwargs)
     else:
         raise ValueError(f"Unsupported MongoEngine field type {field.__class__.__name__}")
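
In practice, this fallback means an embedded document decorated with `@generate_fields()` no longer needs an explicit `nested_fields` param. A minimal sketch, simplified from the `Dataservice` model below and assuming `field` and `generate_fields` are exported by `udata.api_fields` (not the PR's literal code):

```python
from udata.api_fields import field, generate_fields
from udata.models import db


@generate_fields()
class HarvestMetadata(db.EmbeddedDocument):
    # @generate_fields() generates __read_fields__ / __write_fields__,
    # which convert_db_to_field() now picks up automatically.
    backend = field(db.StringField())
    remote_id = field(db.StringField())


@generate_fields()
class Dataservice(db.Document):
    # No `nested_fields` info is needed for this embedded field anymore.
    harvest = field(db.EmbeddedDocumentField(HarvestMetadata), readonly=True)
```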
31 changes: 29 additions & 2 deletions udata/core/dataservices/models.py
@@ -31,6 +31,27 @@ def hidden(self):
             db.Q(deleted_at__ne=None) |
             db.Q(archived_at__ne=None))
 
+@generate_fields()
+class HarvestMetadata(db.DynamicEmbeddedDocument):
+    backend = field(db.StringField())
+    domain = field(db.StringField())
+
+    source_id = field(db.StringField())
+    source_url = field(db.URLField())
+
+    remote_id = field(db.StringField())
+    remote_url = field(db.URLField())
+
+    created_at = field(
+        db.DateTimeField(),
+        description="Date of the creation as provided by the harvested catalog"
+    )
+    last_update = field(
+        db.DateTimeField(),
+        description="Date of the last harvesting"
+    )
+    archived_at = field(db.DateTimeField())
+
 @generate_fields()
 class Dataservice(WithMetrics, Owned, db.Document):
     meta = {
@@ -119,12 +140,18 @@ class Dataservice(WithMetrics, Owned, db.Document):
         },
     )
 
+    harvest = field(
+        db.EmbeddedDocumentField(HarvestMetadata),
+        readonly=True,
+    )
+
     @function_field(description="Link to the API endpoint for this dataservice")
     def self_api_url(self):
        return endpoint_for('api.dataservice', dataservice=self, _external=True)
 
-    def self_web_url():
-        pass
+    @function_field(description="Link to the udata web page for this dataservice")
+    def self_web_url(self):
+        return endpoint_for('dataservices.show', dataservice=self)
 
     # TODO
     # frequency = db.StringField(choices=list(UPDATE_FREQUENCIES.keys()))
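
A quick sketch of how a harvester fills this new read-only metadata (field values invented; in this PR the real assignments happen in `dataservice_from_rdf()` below):

```python
from udata.core.dataservices.models import Dataservice, HarvestMetadata

dataservice = Dataservice(title="Example API")
# `readonly=True` on the model field means API clients cannot write `harvest`;
# only server-side harvest jobs populate it.
dataservice.harvest = HarvestMetadata(
    backend="DCAT",                       # hypothetical backend name
    domain="data.example.org",            # domain of the harvested catalog
    remote_id="https://data.example.org/dataservices/1",
    remote_url="https://data.example.org/api/doc",
)
```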
61 changes: 61 additions & 0 deletions udata/core/dataservices/rdf.py
@@ -0,0 +1,61 @@
from datetime import datetime
from typing import List, Optional

from rdflib import RDF, Graph, URIRef

from udata.core.dataservices.models import Dataservice, HarvestMetadata as HarvestDataserviceMetadata
from udata.core.dataset.models import Dataset, License
from udata.core.dataset.rdf import sanitize_html
from udata.harvest.models import HarvestSource
from udata.rdf import DCAT, DCT, contact_point_from_rdf, rdf_value, remote_url_from_rdf, theme_labels_from_rdf, themes_from_rdf, url_from_rdf


def dataservice_from_rdf(graph: Graph, dataservice: Dataservice, node, all_datasets: List[Dataset]) -> Dataservice:
    '''
    Create or update a dataservice from an RDF/DCAT graph
    '''
    if node is None:  # Assume the first match is the only match
        node = graph.value(predicate=RDF.type, object=DCAT.DataService)

    d = graph.resource(node)

    dataservice.title = rdf_value(d, DCT.title)
    dataservice.description = sanitize_html(d.value(DCT.description) or d.value(DCT.abstract))

    dataservice.base_api_url = url_from_rdf(d, DCAT.endpointURL)
    dataservice.endpoint_description_url = url_from_rdf(d, DCAT.endpointDescription)

    dataservice.contact_point = contact_point_from_rdf(d, dataservice) or dataservice.contact_point

    datasets = []
    for dataset_node in d.objects(DCAT.servesDataset):
        id = dataset_node.value(DCT.identifier)
        dataset = next((d for d in all_datasets if d is not None and d.harvest.remote_id == id), None)

        if dataset is None:
            # Try `endswith` because the Europe XSLT has problems with IDs:
            # sometimes they are prefixed with the domain of the catalog, sometimes not.
            dataset = next((d for d in all_datasets if d is not None and d.harvest.remote_id.endswith(id)), None)

        if dataset is not None:
            datasets.append(dataset.id)

    if datasets:
        dataservice.datasets = datasets

    license = rdf_value(d, DCT.license)
    if license is not None:
        dataservice.license = License.guess(license)

    if not dataservice.harvest:
        dataservice.harvest = HarvestDataserviceMetadata()

    # If the node ID is a `URIRef`, it links to something external; if not, it is often an
    # auto-generated ID used just to link multiple RDF nodes together. When exporting as RDF
    # to other catalogs, we want to re-use this node ID (only if it is not auto-generated)
    # to improve compatibility.
    dataservice.harvest.rdf_node_id_as_url = d.identifier.toPython() if isinstance(d.identifier, URIRef) else None
    dataservice.harvest.remote_url = remote_url_from_rdf(d)
    dataservice.harvest.created_at = rdf_value(d, DCT.issued)
    dataservice.metadata_modified_at = rdf_value(d, DCT.modified)

    dataservice.tags = themes_from_rdf(d)

    return dataservice
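
A hedged usage sketch for the new mapper — the turtle payload is invented, and running this end to end needs a udata application context (for themes, licenses, and contact points):

```python
from rdflib import Graph

from udata.core.dataservices.models import Dataservice
from udata.core.dataservices.rdf import dataservice_from_rdf

TTL = '''
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.example.org/dataservices/1> a dcat:DataService ;
    dct:title "Example API" ;
    dcat:endpointURL <https://data.example.org/api> .
'''

graph = Graph().parse(data=TTL, format='turtle')
# node=None lets the mapper look up the first dcat:DataService itself.
dataservice = dataservice_from_rdf(graph, Dataservice(), node=None, all_datasets=[])
assert dataservice.title == "Example API"
```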
128 changes: 9 additions & 119 deletions udata/core/dataset/rdf.py
@@ -6,7 +6,6 @@
 import logging
 
 from datetime import date
-from html.parser import HTMLParser
 from typing import Optional
 from dateutil.parser import parse as parse_dt
 from flask import current_app
@@ -18,14 +17,14 @@
 
 from udata import i18n, uris
 from udata.core.spatial.models import SpatialCoverage
-from udata.frontend.markdown import parse_html
 from udata.core.dataset.models import HarvestDatasetMetadata, HarvestResourceMetadata
-from udata.models import db, ContactPoint
+from udata.harvest.exceptions import HarvestSkipException
+from udata.models import db
 from udata.rdf import (
-    DCAT, DCATAP, DCT, FREQ, SCV, SKOS, SPDX, SCHEMA, EUFREQ, EUFORMAT, IANAFORMAT, VCARD, RDFS,
-    HVD_LEGISLATION, namespace_manager, schema_from_rdf, url_from_rdf
+    DCAT, DCATAP, DCT, FREQ, SCV, SKOS, SPDX, SCHEMA, EUFREQ, EUFORMAT, IANAFORMAT, TAG_TO_EU_HVD_CATEGORIES, RDFS,
+    namespace_manager, rdf_value, remote_url_from_rdf, sanitize_html, schema_from_rdf, themes_from_rdf, url_from_rdf, HVD_LEGISLATION,
+    contact_point_from_rdf,
 )
-from udata.tags import slug as slugify_tag
 from udata.utils import get_by, safe_unicode
 from udata.uris import endpoint_for
@@ -77,44 +76,6 @@
     EUFREQ.NEVER: 'punctual',
 }
 
-# Map High Value Datasets URIs to keyword categories
-EU_HVD_CATEGORIES = {
-    "http://data.europa.eu/bna/c_164e0bf5": "Météorologiques",
-    "http://data.europa.eu/bna/c_a9135398": "Entreprises et propriété d'entreprises",
-    "http://data.europa.eu/bna/c_ac64a52d": "Géospatiales",
-    "http://data.europa.eu/bna/c_b79e35eb": "Mobilité",
-    "http://data.europa.eu/bna/c_dd313021": "Observation de la terre et environnement",
-    "http://data.europa.eu/bna/c_e1da4e07": "Statistiques"
-}
-TAG_TO_EU_HVD_CATEGORIES = {slugify_tag(EU_HVD_CATEGORIES[uri]): uri for uri in EU_HVD_CATEGORIES}
-
-
-class HTMLDetector(HTMLParser):
-    def __init__(self, *args, **kwargs):
-        HTMLParser.__init__(self, *args, **kwargs)
-        self.elements = set()
-
-    def handle_starttag(self, tag, attrs):
-        self.elements.add(tag)
-
-    def handle_endtag(self, tag):
-        self.elements.add(tag)
-
-
-def is_html(text):
-    parser = HTMLDetector()
-    parser.feed(text)
-    return bool(parser.elements)
-
-
-def sanitize_html(text):
-    text = text.toPython() if isinstance(text, Literal) else ''
-    if is_html(text):
-        return parse_html(text)
-    else:
-        return text.strip()
-
-
 def temporal_to_rdf(daterange, graph=None):
     if not daterange:
         return
@@ -255,18 +216,6 @@ def dataset_to_rdf(dataset, graph=None):
 }
 
 
-def serialize_value(value):
-    if isinstance(value, (URIRef, Literal)):
-        return value.toPython()
-    elif isinstance(value, RdfResource):
-        return value.identifier.toPython()
-
-
-def rdf_value(obj, predicate, default=None):
-    value = obj.value(predicate)
-    return serialize_value(value) if value else default
-
-
 def temporal_from_literal(text):
     '''
     Parse a temporal coverage from a literal ie. either:
@@ -341,29 +290,6 @@ def temporal_from_rdf(period_of_time):
     # so we log the error for future investigation and improvement
     log.warning('Unable to parse temporal coverage', exc_info=True)
 
-
-def contact_point_from_rdf(rdf, dataset):
-    contact_point = rdf.value(DCAT.contactPoint)
-    if contact_point:
-        name = rdf_value(contact_point, VCARD.fn) or ''
-        email = (rdf_value(contact_point, VCARD.hasEmail)
-                 or rdf_value(contact_point, VCARD.email)
-                 or rdf_value(contact_point, DCAT.email))
-        if not email:
-            return
-        email = email.replace('mailto:', '').strip()
-        if dataset.organization:
-            contact_point = ContactPoint.objects(
-                name=name, email=email, organization=dataset.organization).first()
-            return (contact_point or
-                    ContactPoint(name=name, email=email, organization=dataset.organization).save())
-        elif dataset.owner:
-            contact_point = ContactPoint.objects(
-                name=name, email=email, owner=dataset.owner).first()
-            return (contact_point or
-                    ContactPoint(name=name, email=email, owner=dataset.owner).save())
-
-
 def spatial_from_rdf(graph):
     geojsons = []
     for term in graph.objects(DCT.spatial):
@@ -503,43 +429,6 @@ def title_from_rdf(rdf, url):
     else:
         return i18n._('Nameless resource')
 
-
-def remote_url_from_rdf(rdf):
-    '''
-    Return DCAT.landingPage if found and uri validation succeeds.
-    Use RDF identifier as fallback if uri validation succeeds.
-    '''
-    landing_page = url_from_rdf(rdf, DCAT.landingPage)
-    uri = rdf.identifier.toPython()
-    for candidate in [landing_page, uri]:
-        if candidate:
-            try:
-                uris.validate(candidate)
-                return candidate
-            except uris.ValidationError:
-                pass
-
-
-def theme_labels_from_rdf(rdf):
-    '''
-    Get theme labels to use as keywords.
-    Map HVD keywords from known URIs resources if HVD support is activated.
-    '''
-    for theme in rdf.objects(DCAT.theme):
-        if isinstance(theme, RdfResource):
-            uri = theme.identifier.toPython()
-            if current_app.config['HVD_SUPPORT'] and uri in EU_HVD_CATEGORIES:
-                label = EU_HVD_CATEGORIES[uri]
-                # Additionnally yield hvd keyword
-                yield 'hvd'
-            else:
-                label = rdf_value(theme, SKOS.prefLabel)
-        else:
-            label = theme.toPython()
-        if label:
-            yield label
-
-
 def resource_from_rdf(graph_or_distrib, dataset=None, is_additionnal=False):
     '''
     Map a Resource domain model to a DCAT/RDF graph
@@ -617,6 +506,9 @@ def dataset_from_rdf(graph: Graph, dataset=None, node=None):
     d = graph.resource(node)
 
     dataset.title = rdf_value(d, DCT.title)
+    if not dataset.title:
+        raise HarvestSkipException("missing title on dataset")
+
     # Support dct:abstract if dct:description is missing (sometimes used instead)
     description = d.value(DCT.description) or d.value(DCT.abstract)
     dataset.description = sanitize_html(description)
@@ -634,9 +526,7 @@ def dataset_from_rdf(graph: Graph, dataset=None, node=None):
     if acronym:
         dataset.acronym = acronym
 
-    tags = [tag.toPython() for tag in d.objects(DCAT.keyword)]
-    tags += theme_labels_from_rdf(d)
-    dataset.tags = list(set(tags))
+    dataset.tags = themes_from_rdf(d)
 
     temporal_coverage = temporal_from_rdf(d.value(DCT.temporal))
     if temporal_coverage:
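
These helpers were not dropped: per the updated imports they now live in `udata.rdf`, shared by the dataset and dataservice mappers. A small sketch of one shared helper in use (the graph content is invented):

```python
from rdflib import Graph

from udata.rdf import DCT, rdf_value

graph = Graph().parse(data='''
@prefix dct: <http://purl.org/dc/terms/> .
<https://example.org/datasets/1> dct:title "A title" .
''', format='turtle')

resource = graph.resource(next(iter(graph.subjects())))
# rdf_value() unwraps the RDF literal into a plain Python value.
assert rdf_value(resource, DCT.title) == "A title"
```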
4 changes: 4 additions & 0 deletions udata/harvest/api.py
@@ -5,6 +5,7 @@
 from udata.api import api, API, fields
 from udata.auth import admin_permission
 
+from udata.core.dataservices.models import Dataservice
 from udata.core.dataset.api_fields import dataset_ref_fields, dataset_fields
 from udata.core.organization.api_fields import org_ref_fields
 from udata.core.organization.permissions import EditOrganizationPermission
@@ -45,6 +46,9 @@ def backends_ids():
     'dataset': fields.Nested(dataset_ref_fields,
                              description='The processed dataset',
                              allow_null=True),
+    'dataservice': fields.Nested(Dataservice.__read_fields__,
+                                 description='The processed dataservice',
+                                 allow_null=True),
     'status': fields.String(description='The item status',
                             required=True,
                             enum=list(HARVEST_ITEM_STATUS)),
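
From a client's perspective, harvest job items can now carry a `dataservice` payload next to `dataset`. A hedged sketch — the endpoint path and values are illustrative assumptions, not taken from this diff:

```python
import requests  # any HTTP client; not part of udata

# Hypothetical job endpoint; replace <job-id> with a real harvest job id.
job = requests.get('https://www.data.gouv.fr/api/1/harvest/job/<job-id>/').json()
for item in job.get('items', []):
    if item.get('dataservice'):
        # `dataservice` is serialized with Dataservice.__read_fields__, so it
        # exposes the read fields defined by @generate_fields() on the model.
        print(item['status'], item['dataservice']['title'])
```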