
KeyError when using follow_urls = True on run() #369

Open
bernardodalfovo opened this issue Jun 6, 2023 · 3 comments
bernardodalfovo commented Jun 6, 2023

How to reproduce:

from dude import select

@select(css="a")
def result_url(element):
    return {"url": element.get_attribute("href")}

if __name__ == "__main__":
    import dude

    dude.run(urls=["https://www.google.com"], follow_urls=True, ignore_robots_txt=True)

Error location:

https://github.com/roniemartinez/dude/blob/53d53c2bd840ea52fc341089313f122735dd6ab4/dude/base.py#LL65C13-L65C40

Error origin:

https://github.com/roniemartinez/dude/blob/53d53c2bd840ea52fc341089313f122735dd6ab4/dude/scraper.py#LL96C44-L96C55

roniemartinez (Owner) commented Jun 6, 2023

Thanks for your report.

This is a limitation of the JSON support: you cannot simply append to a JSON file without breaking the format (and re-reading, appending, and re-saving the JSON gets expensive as it grows). What you want here is JSON Lines (https://jsonlines.org/examples/).

You can use the example in https://github.com/roniemartinez/dude/blob/master/examples/save_per_page.py as a reference for JSONL custom storage.

Don't forget to add format="jsonl" in dude.run():

    dude.run(urls=["https://www.google.com"], follow_urls=True, ignore_robots_txt=True, format="jsonl")

bernardodalfovo (Author) commented Jun 6, 2023

Thanks for the fast reply, Ronie.

I agree about the JSON limitation. The issue here, as far as I understand it, is that save_per_page is forced to True by the save_per_page = save_per_page or follow_urls statement in https://github.com/roniemartinez/dude/blob/53d53c2bd840ea52fc341089313f122735dd6ab4/dude/scraper.py#LL96C44-L96C55.
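The coercion in question can be reduced to a one-line sketch (a hypothetical simplification of the linked scraper.py statement, not dude's actual function signature):

```python
def effective_save_per_page(save_per_page: bool, follow_urls: bool) -> bool:
    # Mirrors the `save_per_page = save_per_page or follow_urls` statement:
    # whenever follow_urls is True, save_per_page is forced to True as well.
    return save_per_page or follow_urls
```

So passing follow_urls=True silently enables per-page saving even when the caller left save_per_page at its default of False, which is why the JSON format then breaks.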

roniemartinez (Owner) commented:

> save_per_page is forced to True

Yes, I intended to put that in there as a safeguard since, by default (and also as a limitation), all data are kept in memory until follow_urls ends.

follow_urls currently does not track the URLs it has visited, which puts it into an infinite loop. You will lose the data if the program is terminated (e.g. Ctrl-C).

I am open to new PRs for this (since I am not using follow_urls with the JSON format, I haven't had time to make a proper implementation).
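One way to address the missing URL tracking is a visited set plus a queue, which both prevents revisits and lets the crawl terminate. This is only a sketch of the idea under stated assumptions (the get_links callback and max_pages bound are hypothetical, not dude's internals):

```python
from collections import deque

def crawl(start_urls, get_links, max_pages=100):
    # Track visited URLs in a set so already-seen pages are skipped,
    # avoiding the infinite loop described above.
    visited = set()
    queue = deque(start_urls)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # Enqueue newly discovered links for later visits.
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order
```

Even with a cyclic link graph, the visited set guarantees each URL is processed at most once.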
