
Add support for URLs in DUD_LOAD_RULE_PATHS. #15

Open
wRAR wants to merge 1 commit into main

Conversation

@wRAR (Member) commented May 24, 2024:

Fixes #1.

This uses treq for downloading, but there are many other options, so this is open for discussion:

  • Any sync function (so requests). Pros: synchronous requests simplify the code. Cons: blocking the reactor during downloads may be bad? (See the sketch after this list.)
  • Any async function (the options listed below). Pros: can run in parallel if we want, and doesn't block (is this important?). Cons: makes the code more complicated, though all DUD code is self-contained, so it doesn't influence the design of user code.
  • Scrapy downloader. Pros: reuses Scrapy (not a benefit by itself, I think?), easy parallel downloading (I think), better logging and error handling, etc. Cons: UrlCanonicalizer() is currently decoupled from Scrapy.
  • treq. Pros: straightforward. Cons: an additional dependency.
  • aiohttp. Pros: just a more modern thing. Cons: an additional dependency, and it requires the asyncio reactor, which is a blocker.
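
For contrast, the sync option from the first bullet would look roughly like this (requests is not a dependency of this project; the function name and timeout are purely illustrative):

```python
import requests


def fetch_rules_sync(rule_path: str) -> str:
    # Simple, but blocks the Twisted reactor for the whole download,
    # which is the concern raised in the first bullet above.
    response = requests.get(rule_path, timeout=30)
    response.raise_for_status()
    return response.text
```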

Also, this doesn't have tests for URLs; should we add a mock server, or wait until we publish the rules and use them in the tests? (One possible shape for a mock server is sketched below.)
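
If we go the mock-server route, one possible shape is a throwaway stdlib HTTP server that serves a rules file; everything below (names, the empty JSON body) is an assumption, not existing test code:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class RulesHandler(BaseHTTPRequestHandler):
    rules_body = json.dumps([]).encode()  # stand-in for a real rules file

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(self.rules_body)


def start_rules_server() -> str:
    """Start a local server and return a URL usable in DUD_LOAD_RULE_PATHS."""
    server = HTTPServer(("127.0.0.1", 0), RulesHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return f"http://127.0.0.1:{server.server_port}/rules.json"
```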

Comment on lines +35 to +36
+        response = await maybe_deferred_to_future(treq.get(rule_path))
+        data = await response.text()

Member commented:

Let's handle the case where the rules were not successfully retrieved due to things like connection issues, timeouts, etc.:

  1. Log the error.
  2. Terminate the crawl. I think this would be good behavior, since if the spider proceeds without any rules, the user would accumulate a lot of unfiltered duplicate requests. (See the sketch below.)
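
A sketch of what that could look like; only the two awaited calls come from the diff, while the wrapper function, logger, and non-200 check are assumptions:

```python
import logging

import treq
from scrapy.utils.defer import maybe_deferred_to_future

logger = logging.getLogger(__name__)


async def fetch_rules_or_fail(rule_path: str) -> str:
    try:
        response = await maybe_deferred_to_future(treq.get(rule_path))
        if response.code != 200:  # treat HTTP errors as failures too
            raise RuntimeError(f"Got HTTP {response.code} for {rule_path}")
        return await response.text()
    except Exception:
        # Log, then re-raise so the crawl terminates instead of proceeding
        # with no rules and generating a flood of unfiltered requests.
        logger.exception("Failed to retrieve rules from %s", rule_path)
        raise
```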

 rules: Set[UrlRule] = set()
 full_rule_count = 0
 for rule_path in rule_paths:
-    data = Path(rule_path).read_text()
+    data: str
+    if isinstance(rule_path, str) and self._is_url(rule_path):

Member commented:

minor: isinstance(rule_path, str) can be placed inside _is_url().
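
A sketch of that refactor; the _is_url() name comes from the diff, while the class name and method body here are assumptions:

```python
from urllib.parse import urlparse


class RuleLoader:  # hypothetical stand-in for the class in the diff
    @staticmethod
    def _is_url(rule_path) -> bool:
        # Non-strings (e.g. pathlib.Path) can never be URLs, so the
        # isinstance() check moves in here from the call site.
        if not isinstance(rule_path, str):
            return False
        return urlparse(rule_path).scheme in ("http", "https")
```

The call site then simplifies to `if self._is_url(rule_path):`.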

Successfully merging this pull request may close these issues:

  • DUD_LOAD_POLICY_PATH: support URLs (#1)