Harvest dataservices #3029

ThibaudDauce · 2024-04-30T07:23:41Z

To harvest dataservices, we first need to harvest datasets, because dataservices reference datasets through the dcat:servesDataset property. Right now, datasets are harvested asynchronously (by saving a HarvestJob and then queuing these jobs independently). This means we have to wait until all datasets are done before we start harvesting dataservices. There are several options:

- Harvest the datasets synchronously (previously discussed in "Gérer le moissonnage de gros catalogues via DCAT (CSW-DCAT ou XSLT)", datagouv/data.gouv.fr#1046 (comment)). We can then simply loop over the datasets in the graph, then over the dataservices in the graph, in the same function. This requires some changes in all backends, and we need to keep the HarvestJob for debugging purposes only.
- Harvest the dataservices inside the finalize function that is called at the end of all the jobs. I'm not a big fan because it adds one more class and a lot of code.
- Harvest the dataservices inside some HarvestJob (either the same model as for the datasets or a new one) and do some Celery magic to dispatch all jobs with dependency chains. I'm not a big fan because it complicates the architecture a lot.
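
To make the first option more concrete, here is a minimal sketch of the "two loops in the same function" idea. It is not the actual udata code: the graph is modelled as a plain list of dicts, and the names `harvest_catalog`, `HarvestResult`, and the `type`/`id`/`serves` keys are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HarvestResult:
    # remote id -> harvested dataset record (hypothetical structure)
    datasets: dict = field(default_factory=dict)
    dataservices: list = field(default_factory=list)


def harvest_catalog(nodes):
    """Synchronous two-pass harvest: every dataset first, then every dataservice.

    `nodes` stands in for the parsed graph; each node is a dict with
    hypothetical keys: "type", "id", "title" and, for dataservices, "serves".
    """
    result = HarvestResult()

    # Pass 1: harvest all datasets so they exist before anything references them.
    for node in nodes:
        if node["type"] == "dataset":
            result.datasets[node["id"]] = {
                "remote_id": node["id"],
                "title": node["title"],
            }

    # Pass 2: harvest dataservices and resolve their dataset references.
    # References to datasets that were not harvested are skipped.
    for node in nodes:
        if node["type"] == "dataservice":
            linked = [
                result.datasets[ref]
                for ref in node.get("serves", [])
                if ref in result.datasets
            ]
            result.dataservices.append({
                "remote_id": node["id"],
                "title": node["title"],
                "datasets": linked,
            })

    return result
```

Because both passes run in the same function, the order of nodes in the graph does not matter: a dataservice listed before the dataset it serves still gets linked, with no need to wait on independently queued jobs.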

maudetes

Some comments on the way

udata/core/dataservices/rdf.py

udata/core/dataservices/models.py

udata/harvest/backends/dcat.py

maudetes

Wow, thank you for this PR! I'm looking forward to seeing this live 👏

I haven't dived into all the weird DCAT cases, but let's add support for poorly described catalogs in follow-up PRs if needed!

udata/core/dataservices/rdf.py

udata/core/dataservices/models.py

udata/harvest/backends/base.py

udata/core/dataservices/models.py

udata/harvest/backends/base.py

udata/core/dataservices/rdf.py

udata/core/dataset/rdf.py

Co-authored-by: maudetes <[email protected]>

maudetes

I think we're good to go for a first iteration! 👏
I've added some minor suggestions as well :)

udata/core/dataservices/models.py

udata/core/dataservices/rdf.py

Co-authored-by: maudetes <[email protected]>

ThibaudDauce and others added 13 commits April 30, 2024 11:25

Switch DCAT backend to not use one job for each dataset

9b35b43

Fix missing owner/org in new datasets

cb622a8

Remove prints

cdd5f01

Refactor using two functions

35ad714

Add back should_stop

76d99d8

Add back autoarchive and done with failed items

e90ddcc

Merge branch 'master' into sync_harvest_backend

203b394

Always returns the graphs for debug

5dd3c04

Add test for stopping due to HARVEST_MAX_ITEMS

d8eaf45

Update test backends

ebf2af7

Fix some tests

a2701af

Revert ID change for FakeBackend

a163b70

Simplify SyncBackend

3693bd2

ThibaudDauce force-pushed the harvest_dataservices branch from cece558 to 53b2cd8 Compare May 16, 2024 12:38

ThibaudDauce changed the base branch from master to sync_harvest_backend May 16, 2024 12:38

ThibaudDauce force-pushed the harvest_dataservices branch from 352d255 to 5fa9e5b Compare May 16, 2024 12:43

maudetes and others added 5 commits May 21, 2024 14:40

Merge branch 'master' into sync_harvest_backend

d5c0e29

fix wrong remote_id

a71b46e

Merge branch 'master' into sync_harvest_backend

914b69d

Remove dead code

fc98e28

update comment

20ce7e4

ThibaudDauce force-pushed the harvest_dataservices branch from 053a3bd to 0c66e1e Compare May 23, 2024 13:41

maudetes reviewed May 24, 2024

udata/core/dataservices/rdf.py Outdated

udata/core/dataservices/models.py Outdated

udata/harvest/backends/dcat.py

udata/harvest/backends/dcat.py Outdated

ThibaudDauce and others added 6 commits May 27, 2024 14:38

Merge branch 'master' into sync_harvest_backend

c3c4c27

Move docstring

b9b41e2

Switch is_done() do its own function

0a6dfa5

Rename process_datasets method

523c754

yield instead of callback

064f8eb

fix other backends

263c3a3

ThibaudDauce force-pushed the harvest_dataservices branch from 0c66e1e to 66fd737 Compare May 27, 2024 14:03

ThibaudDauce force-pushed the harvest_dataservices branch from 79c140a to 8e3e5ef Compare June 5, 2024 09:04

ThibaudDauce and others added 8 commits June 5, 2024 11:05

Update changelog

8e3e5ef

Do not duplicate datasets on each harvesting

9a9a43b

Merge branch 'master' into harvest_dataservices

e94a23f

Cleanup imports

5fee125

Fix changelog

31bfda7

Fix merge

f477ef6

Merge branch 'master' into harvest_dataservices

8f910e6

Merge branch 'master' into harvest_dataservices

aba27e4

maudetes reviewed Jun 11, 2024

Apply suggestions from code review

9f9a7f8

Co-authored-by: maudetes <[email protected]>

ThibaudDauce mentioned this pull request Jun 11, 2024

Add autoarchive for dataservice #3059

Open

ThibaudDauce added 6 commits June 11, 2024 14:46

add harvest metadata to API

2a0da24

harvest metadata as readonly

dd54158

Rename last_harvested_at and add harvest.created_at

30ba48d

Fix wrong attribute

b083c37

Do not empty the datasets list if no datasets found in harvesting

fbe6f22

fix casing

7863547

maudetes self-requested a review June 11, 2024 15:47

ThibaudDauce force-pushed the harvest_dataservices branch from 8290e1f to b401700 Compare June 11, 2024 15:56

ThibaudDauce and others added 4 commits June 11, 2024 17:56

Fix tests

b401700

Save node id if it's an URL

1d6714a

Add landing page as remote_url

cc6b5e8

Merge branch 'master' into harvest_dataservices

dc1a093

maudetes approved these changes Jun 13, 2024

udata/core/dataservices/models.py Outdated

udata/core/dataservices/models.py Outdated

udata/core/dataservices/rdf.py Outdated

ThibaudDauce and others added 3 commits June 14, 2024 09:29

Rename rdf_node_id_as_url to follow the same name as dataset :-(

d484199

Remove dynamic to harvest dataservice metadata

fa816cc

Update udata/core/dataservices/models.py

546c1be

Co-authored-by: maudetes <[email protected]>

ThibaudDauce merged commit 14dbee6 into master Jun 14, 2024
1 check passed

ThibaudDauce deleted the harvest_dataservices branch June 14, 2024 07:39
