
Add support for URLs in DUD_LOAD_RULE_PATHS. #15

Open
wRAR wants to merge 1 commit into main

Conversation

@wRAR (Member) commented May 24, 2024:

Fixes #1.

This uses treq for downloading, but there are many other options, so this is open for discussion:

  • Any sync function (so requests). Pros: synchronous requests simplify the code. Cons: blocking the reactor during downloads may be bad? (See the sketch after this list.)
  • Any async function (the options listed below). Pros: can run in parallel if we want, and doesn't block (is this important?). Cons: makes the code more complicated, though all DUD code is self-contained, so it doesn't influence the design of user code.
  • Scrapy downloader. Pros: reuses Scrapy (not a benefit by itself, I think?), easy parallel downloading (I think), better logging and error handling, etc. Cons: UrlCanonicalizer() is currently decoupled from Scrapy.
  • treq. Pros: straightforward. Cons: an additional dependency.
  • aiohttp. Pros: just a more modern thing. Cons: an additional dependency, and it requires the asyncio reactor, which is a blocker.
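
For contrast, the sync option from the first bullet would look roughly like this (requests is not a dependency of this project; the function name and timeout are purely illustrative):

```python
import requests


def fetch_rules_sync(rule_path: str) -> str:
    # Simple, but blocks the Twisted reactor for the whole download,
    # which is the concern raised in the first bullet above.
    response = requests.get(rule_path, timeout=30)
    response.raise_for_status()
    return response.text
```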

Also, this doesn't have tests for URLs; should we add a mock server, or wait until we publish the rules and use them in the tests? (One possible shape for a mock server is sketched below.)
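
If we go the mock-server route, one possible shape is a throwaway stdlib HTTP server that serves a rules file; everything below (names, the empty JSON body) is an assumption, not existing test code:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class RulesHandler(BaseHTTPRequestHandler):
    rules_body = json.dumps([]).encode()  # stand-in for a real rules file

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(self.rules_body)


def start_rules_server() -> str:
    """Start a local server and return a URL usable in DUD_LOAD_RULE_PATHS."""
    server = HTTPServer(("127.0.0.1", 0), RulesHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return f"http://127.0.0.1:{server.server_port}/rules.json"
```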

Comment on lines +35 to +36
+        response = await maybe_deferred_to_future(treq.get(rule_path))
+        data = await response.text()

Member commented:

Let's handle the case where the rules were not successfully retrieved due to things like connection issues, timeouts, etc.:

  1. Log the error.
  2. Terminate the crawl. I think this would be good behavior, since if the spider proceeds without any rules, the user would accumulate a lot of unfiltered duplicate requests. (See the sketch below.)
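
A sketch of what that could look like; only the two awaited calls come from the diff, while the wrapper function, logger, and non-200 check are assumptions:

```python
import logging

import treq
from scrapy.utils.defer import maybe_deferred_to_future

logger = logging.getLogger(__name__)


async def fetch_rules_or_fail(rule_path: str) -> str:
    try:
        response = await maybe_deferred_to_future(treq.get(rule_path))
        if response.code != 200:  # treat HTTP errors as failures too
            raise RuntimeError(f"Got HTTP {response.code} for {rule_path}")
        return await response.text()
    except Exception:
        # Log, then re-raise so the crawl terminates instead of proceeding
        # with no rules and generating a flood of unfiltered requests.
        logger.exception("Failed to retrieve rules from %s", rule_path)
        raise
```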

 rules: Set[UrlRule] = set()
 full_rule_count = 0
 for rule_path in rule_paths:
-    data = Path(rule_path).read_text()
+    data: str
+    if isinstance(rule_path, str) and self._is_url(rule_path):

Member commented:

minor: isinstance(rule_path, str) can be placed inside _is_url().
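
A sketch of that refactor; the _is_url() name comes from the diff, while the class name and method body here are assumptions:

```python
from urllib.parse import urlparse


class RuleLoader:  # hypothetical stand-in for the class in the diff
    @staticmethod
    def _is_url(rule_path) -> bool:
        # Non-strings (e.g. pathlib.Path) can never be URLs, so the
        # isinstance() check moves in here from the call site.
        if not isinstance(rule_path, str):
            return False
        return urlparse(rule_path).scheme in ("http", "https")
```

The call site then simplifies to `if self._is_url(rule_path):`.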

Successfully merging this pull request may close these issues:

  • DUD_LOAD_POLICY_PATH: support URLs (#1)