Fix uploader to work without -replace flag #27

Open
jpahm opened this issue Apr 25, 2024 · 3 comments
Labels
Help Wanted Extra attention is needed L3 A task suitable for someone who is comfortable implementing large-scale features/projects.

Comments

@jpahm
Contributor

jpahm commented Apr 25, 2024

The uploader is now functional; however, it currently only works when the -replace flag is provided. This means that we can only ever replace all of the data in the DB rather than simply updating things that have changed. This is unwanted for a multitude of reasons, including the fact that it makes the DB far more mutable than it needs to be and performs an enormous amount of unnecessary writes.

Luckily, the main cause of this is fairly simple -- when replacing old documents with a $merge pipeline, we cannot modify the immutable _id field of the original document, or, in other words, we cannot change the _id between the old and new version.
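As a rough illustration of the constraint, here is a minimal sketch of a $merge pipeline that matches on identifying fields and never touches _id. The helper name `build_merge_pipeline` and the "courses" collection name are hypothetical, the key fields are the ones mentioned later in this thread, and the exact whenMatched behavior should be checked against the MongoDB documentation before relying on it.

```python
def build_merge_pipeline(target_collection, key_fields):
    """Build an aggregation pipeline (as plain dicts) that merges staged
    documents into target_collection without modifying existing _id values.

    Hypothetical helper: the names here are illustrative, not the uploader's.
    """
    return [
        # Drop the freshly generated _id so the merge never attempts to
        # overwrite the immutable _id of a matched document.
        {"$unset": "_id"},
        {
            "$merge": {
                "into": target_collection,
                # MongoDB requires a unique index covering the "on" fields.
                "on": list(key_fields),
                "whenMatched": "replace",
                "whenNotMatched": "insert",
            }
        },
    ]


pipeline = build_merge_pipeline(
    "courses", ["catalog_year", "subject_prefix", "course_number"]
)
```

This only avoids the _id conflict on the document being merged; it does nothing about references held by other documents.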

There is a problem that comes with this, however, which I will outline with an overview of how the data collection process works:

  1. Data is scraped
  2. Scraped data is parsed, links between courses/profs/sections are created (via _id references)
  3. Parsed data is uploaded via either replacement or update via $merge aggregate

The problem lies in the second point above -- any new "links" that have been created between newly parsed courses/profs/sections will be using new _ids, not the original ones. Thus, if we were to simply ignore the new _ids when performing the $merge, we would end up with countless invalid links.
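To make the failure mode concrete, here is a small in-memory simulation using plain dicts instead of a real database. The field names `course_reference` and `section_number`, the sample values, and the helper names are assumptions for illustration only.

```python
# Existing database state: one course, keyed by its original _id.
db_courses = {
    "old-c1": {
        "_id": "old-c1",
        "catalog_year": "24",
        "subject_prefix": "CS",
        "course_number": "1337",
    }
}

# A fresh scrape re-parses the same course and assigns brand-new _ids.
parsed_course = {
    "_id": "new-c9",
    "catalog_year": "24",
    "subject_prefix": "CS",
    "course_number": "1337",
}
# The freshly parsed section links to the NEW course _id.
parsed_section = {
    "_id": "new-s4",
    "section_number": "001",
    "course_reference": "new-c9",
}


def natural_key(course):
    """Identify a course by the fields used for merging."""
    return (course["catalog_year"], course["subject_prefix"], course["course_number"])


def merge_keeping_old_id(existing, incoming):
    """Update the matching course in place, discarding the incoming _id."""
    for doc in existing.values():
        if natural_key(doc) == natural_key(incoming):
            doc.update({k: v for k, v in incoming.items() if k != "_id"})
            return doc["_id"]
    existing[incoming["_id"]] = incoming
    return incoming["_id"]


kept_id = merge_keeping_old_id(db_courses, parsed_course)
link_is_valid = parsed_section["course_reference"] in db_courses
print(kept_id, link_is_valid)  # prints: old-c1 False
```

The course keeps its original _id, but the new section still points at an _id that never reached the database.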

Thoughts on how to resolve this are welcome; there are multiple ways that we could implement a solution.

@jpahm jpahm added Help Wanted Extra attention is needed L2 A task suitable for someone who is comfortable helping with implementing features. labels Apr 25, 2024
@mohammadmehrab
Contributor

I was thinking about any potential solutions to this, but I've had no real luck so far. However, would it be possible to save the original _id in a new field for each document and base the links on that field rather than the _id field? This would allow the _id fields to be updated, but still maintain the integrity of the connections between courses/profs/sections. In theory this sounds like it would work, but I'm definitely worried that in execution something major could break down.

The other solution I could think of would somehow replace the old _id fields used for the links with the new _id fields, but I feel like implementing that is very unruly and unnecessarily difficult, so I would much prefer the first option.
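For comparison, the remapping idea could look roughly like the sketch below: match parsed documents to existing ones by their identifying fields, then rewrite every reference before upload. All function and field names here (`build_id_remap`, `apply_remap`, `course_reference`) are hypothetical, and it assumes references are stored either as a single _id or as a list of _ids.

```python
def build_id_remap(existing_docs, parsed_docs, key_fields):
    """Map each newly generated _id to the _id already stored for the same key."""
    existing_by_key = {
        tuple(doc[f] for f in key_fields): doc["_id"] for doc in existing_docs
    }
    remap = {}
    for doc in parsed_docs:
        key = tuple(doc[f] for f in key_fields)
        if key in existing_by_key:
            remap[doc["_id"]] = existing_by_key[key]
    return remap


def apply_remap(docs, remap, reference_fields):
    """Rewrite each doc's own _id and its reference fields in place."""
    for doc in docs:
        doc["_id"] = remap.get(doc["_id"], doc["_id"])
        for field in reference_fields:
            value = doc.get(field)
            if isinstance(value, list):
                doc[field] = [remap.get(v, v) for v in value]
            elif value is not None:
                doc[field] = remap.get(value, value)


existing = [{"_id": "old-c1", "subject_prefix": "CS", "course_number": "1337"}]
parsed_courses = [{"_id": "new-c9", "subject_prefix": "CS", "course_number": "1337"}]
parsed_sections = [{"_id": "new-s4", "course_reference": "new-c9"}]

remap = build_id_remap(existing, parsed_courses, ["subject_prefix", "course_number"])
apply_remap(parsed_courses, remap, [])
apply_remap(parsed_sections, remap, ["course_reference"])
```

Even in this toy form, the remap has to be built for every collection and applied to every reference field, which gives a sense of why this route feels unwieldy.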

Another potential variation of the first solution: instead of using _id fields to link courses/profs/sections, we could use multiple identification fields, the same ones we use for merging. For example, courses would be identified by catalog_year, course_number, and subject_prefix. I'm not too sure of the viability of this option, but I just wanted to throw it out there as an alternative to using _id fields altogether.

I'd love anyone's feedback on this and I would also love to see others' solutions!

@democat3457
Member

> Another potential variation of the first solution: instead of using _id fields to link courses/profs/sections, we could use multiple identification fields, the same ones we use for merging. For example, courses would be identified by catalog_year, course_number, and subject_prefix. I'm not too sure of the viability of this option, but I just wanted to throw it out there as an alternative to using _id fields altogether.

This is what I was thinking, maybe changing the primary key to a new field calculated like f'{catalog_year}_{subject_prefix}_{course_number}', which would guarantee uniqueness of those fields and still retain links when merging.
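A quick sketch of that idea, assuming a hypothetical `course_key` helper and a hypothetical `course_key` field on sections: because the key is derived purely from the course's own fields, two independent parses of the same course produce the same value, so links survive a merge.

```python
def course_key(course):
    """Derive a deterministic key from the fields used for merging."""
    return (
        f"{course['catalog_year']}_"
        f"{course['subject_prefix']}_"
        f"{course['course_number']}"
    )


# Two separate scrapes of the same course, with unrelated generated _ids.
first_parse = {
    "_id": "a1",
    "catalog_year": "24",
    "subject_prefix": "CS",
    "course_number": "1337",
}
second_parse = {
    "_id": "z9",
    "catalog_year": "24",
    "subject_prefix": "CS",
    "course_number": "1337",
}

# A section links to its course by key rather than by _id.
section = {"section_number": "001", "course_key": course_key(second_parse)}

same_key = course_key(first_parse) == course_key(second_parse)
print(course_key(first_parse), same_key)  # prints: 24_CS_1337 True
```

One caveat worth keeping in mind: the key is only stable as long as none of its component fields change for an existing course.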

@jpahm
Contributor Author

jpahm commented Oct 24, 2024

I think letting Mongo auto-generate the _id field and then using our own primary key would be the ideal system. Going to bump this up to L3; anyone who wants to try undertaking it is welcome to do so. Even a small proof-of-concept would be nice.
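As a starting point for a proof-of-concept, here is an in-memory sketch of that scheme. A plain dict stands in for the collection, `uuid` stands in for Mongo's generated _id, and `upsert_by_key` and the `key` field are hypothetical names; none of this is the uploader's actual code.

```python
import uuid


def upsert_by_key(store, doc, key_field="key"):
    """Upsert doc into store (a dict mapping key -> stored document).

    The _id is generated once on insert and never changed afterwards.
    Returns "inserted", "updated", or "unchanged".
    """
    # Ignore any _id carried by the parsed document; the store owns _id.
    fields = {k: v for k, v in doc.items() if k != "_id"}
    key = fields[key_field]
    existing = store.get(key)
    if existing is None:
        store[key] = {"_id": uuid.uuid4().hex, **fields}
        return "inserted"
    incoming = {"_id": existing["_id"], **fields}
    if incoming == existing:
        return "unchanged"  # nothing to write
    store[key] = incoming
    return "updated"


courses = {}
r1 = upsert_by_key(courses, {"key": "24_CS_1337", "title": "Computer Science I"})
original_id = courses["24_CS_1337"]["_id"]
r2 = upsert_by_key(courses, {"key": "24_CS_1337", "title": "Computer Science I"})
r3 = upsert_by_key(courses, {"key": "24_CS_1337", "title": "CS I"})
```

Skipping the "unchanged" case is what would cut down the unnecessary writes described at the top of the issue, and the stored _id stays the same across every update.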

@jpahm jpahm added L3 A task suitable for someone who is comfortable implementing large-scale features/projects. and removed L2 A task suitable for someone who is comfortable helping with implementing features. labels Oct 24, 2024
3 participants