
KeyError when using follow_urls = True on run() #369

Open
bernardodalfovo opened this issue Jun 6, 2023 · 3 comments
bernardodalfovo commented Jun 6, 2023

How to reproduce:

from dude import select

@select(css="a")
def result_url(element):
    return {"url": element.get_attribute("href")}

if __name__ == "__main__":
    import dude

    dude.run(urls=["https://www.google.com"], follow_urls=True, ignore_robots_txt=True)

Error location:

https://github.com/roniemartinez/dude/blob/53d53c2bd840ea52fc341089313f122735dd6ab4/dude/base.py#LL65C13-L65C40

Error origin:

https://github.com/roniemartinez/dude/blob/53d53c2bd840ea52fc341089313f122735dd6ab4/dude/scraper.py#LL96C44-L96C55

roniemartinez (Owner) commented Jun 6, 2023

Thanks for your report.

This is a limitation of the JSON support: you cannot simply append to a JSON file without breaking the format (and re-reading, appending, and re-saving the JSON gets expensive as it grows). What you want here is JSON Lines (https://jsonlines.org/examples/).

You can use the example in https://github.com/roniemartinez/dude/blob/master/examples/save_per_page.py as a reference for JSONL custom storage.

Don't forget to add format="jsonl" in dude.run():

    dude.run(urls=["https://www.google.com"], follow_urls=True, ignore_robots_txt=True, format="jsonl")

bernardodalfovo (Author) commented Jun 6, 2023

Thanks for the fast reply, Ronie.

I agree about the JSON limitation. The issue here, as far as I understand it, is that save_per_page is forced to True by the save_per_page = save_per_page or follow_urls statement in https://github.com/roniemartinez/dude/blob/53d53c2bd840ea52fc341089313f122735dd6ab4/dude/scraper.py#LL96C44-L96C55.
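The coercion in question can be reduced to a one-line sketch (a hypothetical simplification of the linked scraper.py statement, not dude's actual function signature):

```python
def effective_save_per_page(save_per_page: bool, follow_urls: bool) -> bool:
    # Mirrors the `save_per_page = save_per_page or follow_urls` statement:
    # whenever follow_urls is True, save_per_page is forced to True as well.
    return save_per_page or follow_urls
```

So passing follow_urls=True silently enables per-page saving even when the caller left save_per_page at its default of False, which is why the JSON format then breaks.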

roniemartinez (Owner) commented:

> save_per_page is forced to True

Yes, I intended to put that in there as a safeguard since, by default (and also as a limitation), all data are kept in memory until follow_urls ends.

follow_urls currently does not track the URLs it has visited, which puts it into an infinite loop. You will lose the data if the program is terminated (e.g. Ctrl-C).

I am open to new PRs for this (since I am not using follow_urls with the JSON format, I haven't had time to make a proper implementation).
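One way to address the missing URL tracking is a visited set plus a queue, which both prevents revisits and lets the crawl terminate. This is only a sketch of the idea under stated assumptions (the get_links callback and max_pages bound are hypothetical, not dude's internals):

```python
from collections import deque

def crawl(start_urls, get_links, max_pages=100):
    # Track visited URLs in a set so already-seen pages are skipped,
    # avoiding the infinite loop described above.
    visited = set()
    queue = deque(start_urls)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # Enqueue newly discovered links for later visits.
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order
```

Even with a cyclic link graph, the visited set guarantees each URL is processed at most once.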
