Feat/add borger dk #10

Merged
merged 20 commits into from
Nov 27, 2023

@AJDERS (Collaborator) commented Nov 17, 2023

This PR depends on #9.

Adds the borger.dk scrape. borger.dk is not a simple tree structure of categories/subcategories/articles, but rather:

  • A set of different adjacent navigation hierarchies (regular tree hierarchy / user questionnaires / javascript-toggle-bullshit).
  • These hierarchies all navigate a single set of categories/subcategories.
  • Each subcategory contains a set of nested articles.
  • Each article is a text and a link, and the link navigates up, down and sideways in the category/subcategory/nested-article hierarchy (and also to adjacent hierarchies).

To mitigate this, we:

  • Pick one navigation hierarchy (the regular tree hierarchy) and avoid the others.
  • Process each category analogously to how sundhed.dk was scraped, with the following additions:
    • We only follow links downward or sideways within a category/subcategory, i.e. we do not leave a category once we've entered it.
    • Since there are many ways to navigate to the same article, we store which URLs we've visited to avoid parsing them twice.
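
The two additions above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/tts_text/borger_dk.py`: the names `is_within_category`, `crawl_category` and `get_links` are hypothetical, and the rule "a URL belongs to a category if its path sits at or below the category's path" is an assumption about how staying inside a category could be checked.

```python
from collections import deque
from typing import Callable, Iterable, Optional
from urllib.parse import urldefrag, urlparse


def is_within_category(url: str, category_url: str) -> bool:
    """Assumed rule: same host, and path equal to or below the category path."""
    target, root = urlparse(url), urlparse(category_url)
    root_path = root.path.rstrip("/")
    return target.netloc == root.netloc and (
        target.path.rstrip("/") == root_path
        or target.path.startswith(root_path + "/")
    )


def crawl_category(
    category_url: str,
    get_links: Callable[[str], Iterable[str]],
    visited: Optional[set[str]] = None,
) -> list[str]:
    """Walk one category breadth-first without leaving it or revisiting a URL.

    `get_links` is a hypothetical callable returning the absolute URLs found
    on a page; in a real scraper it would fetch and parse the page.
    """
    visited = set() if visited is None else visited
    order: list[str] = []
    queue = deque([category_url])
    while queue:
        # Drop "#fragment" so the same page is not counted twice.
        url, _fragment = urldefrag(queue.popleft())
        if url in visited or not is_within_category(url, category_url):
            continue
        visited.add(url)
        order.append(url)
        queue.extend(get_links(url))
    return order
```

Passing the same `visited` set to successive `crawl_category` calls would also skip pages already reached from another category, which is one way to read "store which URLs we've visited".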

@AJDERS AJDERS self-assigned this Nov 17, 2023
src/tts_text/borger_dk.py Outdated Show resolved Hide resolved
@saattrupdan (Collaborator) left a comment

Just a small extra addition. Feel free to merge in any case.

src/tts_text/utils.py Show resolved Hide resolved
@AJDERS AJDERS merged commit 3a8f0d0 into main Nov 27, 2023
6 checks passed