Harvest dataservices #3029

Merged: 72 commits into master, Jun 14, 2024

ThibaudDauce (Contributor) commented Apr 30, 2024

Fix datagouv/data.gouv.fr#1353

To harvest dataservices, we first need to harvest datasets, because dataservices reference datasets in their serveDatasets attribute. Right now, datasets are harvested asynchronously: a HarvestJob is saved, then the jobs are queued independently. This means we have to wait until all datasets are done before we start harvesting dataservices. There are multiple options:

  1. Harvest the datasets synchronously (previously discussed in "Gérer le moissonnage de gros catalogues via DCAT (CSW-DCAT ou XSLT)", datagouv/data.gouv.fr#1046 (comment)). We can then loop over the datasets in the graph, then over the dataservices in the graph, in the same function. This requires some changes in all backends, and we need to keep the HarvestJob for debugging purposes only.
  2. Harvest the dataservices inside the finalize function, which is called at the end of all the jobs. Not a big fan, because it adds one more class and a lot of code.
  3. Harvest the dataservices inside some HarvestJob (either the same model as for the datasets or a new one) and do some Celery magic to dispatch all jobs with dependency chains. Not a big fan, because it complicates the architecture a lot.
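
To make option 1 concrete, here is a minimal sketch of the "datasets first, then dataservices, in the same function" loop. It is not udata's code: the graph is modelled as plain (subject, predicate, object) tuples, `Dataset`, `Dataservice`, `subjects_of_type` and `harvest` are hypothetical names, and it assumes the serveDatasets attribute mentioned above corresponds to the DCAT property `dcat:servesDataset`.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-ins: not udata's real models or RDF graph.
RDF_TYPE = "rdf:type"
DCAT_DATASET = "dcat:Dataset"
DCAT_DATASERVICE = "dcat:DataService"
DCAT_SERVES_DATASET = "dcat:servesDataset"  # assumed mapping for serveDatasets


@dataclass
class Dataset:
    uri: str


@dataclass
class Dataservice:
    uri: str
    datasets: list = field(default_factory=list)


def subjects_of_type(triples, rdf_class):
    """Return the subjects declared with the given rdf:type, in graph order."""
    return [s for s, p, o in triples if p == RDF_TYPE and o == rdf_class]


def harvest(triples):
    """Harvest datasets first, then dataservices, in one synchronous pass."""
    # Step 1: every dataset is processed before any dataservice.
    datasets = {}
    for uri in subjects_of_type(triples, DCAT_DATASET):
        datasets[uri] = Dataset(uri=uri)

    # Step 2: all datasets exist by now, so references resolve directly.
    dataservices = []
    for uri in subjects_of_type(triples, DCAT_DATASERVICE):
        served = [o for s, p, o in triples if s == uri and p == DCAT_SERVES_DATASET]
        # References to datasets that are absent from the catalog are skipped.
        linked = [datasets[d] for d in served if d in datasets]
        dataservices.append(Dataservice(uri=uri, datasets=linked))

    return list(datasets.values()), dataservices
```

Because both loops run in the same function, no job dependency chain is needed: by the time the second loop starts, every dataset the dataservices may reference has already been handled.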

ThibaudDauce changed the base branch from master to sync_harvest_backend on May 16, 2024 at 12:38

maudetes (Contributor) left a comment

Some comments on the way

udata/core/dataservices/rdf.py (outdated, resolved)
udata/core/dataservices/models.py (outdated, resolved)
udata/harvest/backends/dcat.py (resolved)
udata/harvest/backends/dcat.py (outdated, resolved)

maudetes (Contributor) left a comment

Wow, thank you for this PR! I'm looking forward to seeing this live 👏

I haven't dived into all the weird DCAT cases, but let's add support for poorly described catalogs in follow-up PRs if needed!

udata/core/dataservices/rdf.py (outdated, resolved)
udata/core/dataservices/models.py (outdated, resolved)
udata/harvest/backends/base.py (outdated, resolved)
udata/core/dataservices/models.py (outdated, resolved)
udata/core/dataservices/models.py (outdated, resolved)
udata/harvest/backends/base.py (resolved)
udata/core/dataservices/rdf.py (resolved)
udata/core/dataservices/rdf.py (resolved)
udata/core/dataservices/rdf.py (outdated, resolved)
udata/core/dataset/rdf.py (outdated, resolved)

maudetes (Contributor) left a comment

I think we're good to go for a first iteration! 👏
I've added some minor suggestions as well :)

udata/core/dataservices/models.py (outdated, resolved)
udata/core/dataservices/models.py (outdated, resolved)
udata/core/dataservices/rdf.py (outdated, resolved)

ThibaudDauce merged commit 14dbee6 into master on Jun 14, 2024 (1 check passed)
ThibaudDauce deleted the harvest_dataservices branch on June 14, 2024 at 07:39

Linked issue: Moissonner les dataservices en DCAT ("Harvest dataservices in DCAT")
2 participants