Astra scraper #31

Merged
merged 11 commits into from
Oct 14, 2024

TyHil
Member

@TyHil TyHil commented Sep 26, 2024

Resolves #9

This Astra scraper dumps the data as JSON, with a key for each day scraped that maps to a list of that day's room reservations in the format mentioned in issue #9.

It stops after 90 days of fewer than 10 events each. There is one persistent all-day event, marked "No Events Allowed", that shows up on holidays and in any semester after this one. There is also a perpetual event with no time in FO 3.616, idk why. So once you get past this semester and the next, you regularly have fewer than 10 events per day. I chose 90 to be safe about holidays. This works out to scraping about a year and 2 months out, which takes a couple of minutes on my machine.
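A minimal sketch of the stopping rule described above, not the code from this PR. It assumes the 90 low-activity days must be consecutive (the description does not say so explicitly), and the names `daysToScrape`, `minEventsPerDay`, and `maxQuietDays` are hypothetical.

```go
package main

import "fmt"

const (
	minEventsPerDay = 10 // a day with fewer events than this counts as "quiet"
	maxQuietDays    = 90 // stop once this many quiet days in a row are seen (assumed consecutive)
)

// daysToScrape reports how many days would be scraped for the given
// per-day event counts, starting from the first day.
func daysToScrape(eventCounts []int) int {
	quiet := 0
	for day, n := range eventCounts {
		if n < minEventsPerDay {
			quiet++
		} else {
			quiet = 0 // a busy day resets the streak
		}
		if quiet >= maxQuietDays {
			return day + 1
		}
	}
	return len(eventCounts)
}

func main() {
	// 30 busy days, then a long run of days with only the 2 placeholder events.
	counts := make([]int, 0, 200)
	for i := 0; i < 30; i++ {
		counts = append(counts, 50)
	}
	for i := 0; i < 170; i++ {
		counts = append(counts, 2)
	}
	fmt.Println(daysToScrape(counts)) // prints 120
}
```

Resetting the streak on any busy day is what keeps a short holiday break from ending the scrape early.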

The events are sorted by start time, but this could easily be switched to sorting by room using LocationName.
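For illustration only, here is a sketch of the output shape and the two sort orders. Only `LocationName` and the room FO 3.616 come from this PR. The `Event` struct, the `StartTime` field, the date key format, and the second room name are assumptions, and the real fields are whatever issue #9 specifies.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Event is a hypothetical reservation record.
type Event struct {
	StartTime    string `json:"StartTime"`    // assumed zero-padded "HH:MM", so string order is time order
	LocationName string `json:"LocationName"` // room name, as mentioned in the PR
}

// sortByStart orders events by start time.
func sortByStart(events []Event) {
	sort.SliceStable(events, func(i, j int) bool {
		return events[i].StartTime < events[j].StartTime
	})
}

// sortByRoom orders events by room name instead.
func sortByRoom(events []Event) {
	sort.SliceStable(events, func(i, j int) bool {
		return events[i].LocationName < events[j].LocationName
	})
}

func main() {
	// One key per scraped day, each holding that day's reservations.
	days := map[string][]Event{
		"2024-09-26": {
			{StartTime: "13:00", LocationName: "FO 3.616"},
			{StartTime: "09:30", LocationName: "Example Room"},
		},
	}
	for _, events := range days {
		sortByStart(events)
	}
	out, err := json.MarshalIndent(days, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Swapping `sortByStart` for `sortByRoom` in the loop is the only change needed to group each day's output by room.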

@TyHil TyHil linked an issue Sep 26, 2024 that may be closed by this pull request
2 tasks
@TyHil
Member Author

TyHil commented Sep 26, 2024

The current scraper logs in to Astra successfully but then throws the below error before exiting the sign in function:
ERROR: received DOM.documentUpdated when there's no top-level frame
@jpahm do you think you'd be able to help me out?

After that I think I'll just have to get the cookies sorted and it'll be able to make its first successful scraping request.

@jpahm
Contributor

jpahm commented Sep 26, 2024

> The current scraper logs in to Astra successfully but then throws the below error before exiting the sign in function:
> ERROR: received DOM.documentUpdated when there's no top-level frame
> @jpahm do you think you'd be able to help me out?
>
> After that I think I'll just have to get the cookies sorted and it'll be able to make its first successful scraping request.

Thanks for looking into this! As far as that error goes, it pops up on the other scrapers occasionally as well -- I'm not actually 100% certain what causes it. It seems to just be an error that chromedp throws internally when it can't handle a documentUpdated event, though I don't think this is usually fatal. Is this error causing a panic for you or is chromedp just becoming unresponsive afterwards?

@TyHil
Member Author

TyHil commented Sep 26, 2024

Ah ok, good to know.

It was never exiting the RefreshAstraToken function and seemed to be hanging on chromedp.WaitVisible(`body`). Working on cookie stuff now, will lyk if I need any more help.

@democat3457
Member

> It was never exiting the RefreshAstraToken function and seemed to be hanging on chromedp.WaitVisible(`body`). Working on cookie stuff now, will lyk if I need any more help.

@TyHil There seems to be a suggestion at this GitHub issue comment to add the chromedp.ByQuery argument to the WaitVisible call to explicitly specify you're querying the DOM and not just searching for text: chromedp/chromedp#440 (comment)

I tried it with chromedp.WaitVisible(`body`, chromedp.ByQuery) and the function no longer hung. Might be worth also adding it to the WaitVisible for the coursebook scraper.

TODOs:
sorting
scrape each day
look into login sometimes entering the user/pass incorrectly
After this semester and the next, there seem to be only ever 2 events: one in FO 3.616 with no time (?) and one with no location that always shows up after the current semester and lists either the holiday or "Events for Future Terms", as well as "No Events Allowed". This just scrapes 90 days into that and stops at about a year and 2 months out.
Closes chromedp when not necessary
@TyHil TyHil marked this pull request as ready for review October 11, 2024 06:42
Contributor

@jpahm jpahm left a comment

Looks good, merging now!

@jpahm jpahm merged commit 6b96e8d into develop Oct 14, 2024
2 checks passed
Development

Successfully merging this pull request may close these issues.

Make astra scraper, schemas & docs
3 participants