Fix uploader to work without -replace flag #27

jpahm · 2024-04-25T21:58:06Z

The uploader is now functional, but it currently only works when the -replace flag is provided. This means that we can only ever replace all of the data in the DB rather than updating only the things that have changed. This is unwanted for several reasons: it makes the DB far more mutable than it needs to be, and it performs an enormous number of unnecessary writes.

Luckily, the main cause of this is fairly simple -- when replacing old documents with a $merge pipeline, we cannot modify the immutable _id field of the original document, or, in other words, we cannot change the _id between the old and new version.

There is a problem that comes with this, however, which I will outline with an overview of how the data collection process works:

1. Data is scraped
2. Scraped data is parsed, and links between courses/profs/sections are created (via _id references)
3. Parsed data is uploaded, either by full replacement or by update via a $merge aggregate

The problem lies in the second point above -- any new "links" that have been created between newly parsed courses/profs/sections will be using new _ids, not the original ones. Thus, if we were to simply ignore the new _ids when performing the $merge, we would end up with countless invalid links.
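To make the failure mode concrete, here is a minimal sketch in plain Python (no MongoDB involved). The `parse` and `merge_keep_old_ids` helpers and the field names `course_reference` and `number` are hypothetical stand-ins, not the actual parser or uploader code. It assumes each parse run assigns fresh _ids and that the update step matches documents on their identifying fields while leaving the stored _id untouched.

```python
import uuid


def parse(scraped):
    """Hypothetical parse step: every run assigns brand-new _ids."""
    courses = {}
    for row in scraped:
        courses[row["course"]] = {
            "_id": uuid.uuid4().hex,
            "course": row["course"],
            "sections": [],
        }
    sections = []
    for row in scraped:
        course = courses[row["course"]]
        section = {
            "_id": uuid.uuid4().hex,
            "course": row["course"],
            "number": row["section"],
            "course_reference": course["_id"],  # link via _id
        }
        course["sections"].append(section["_id"])
        sections.append(section)
    return list(courses.values()), sections


def merge_keep_old_ids(existing, incoming, key):
    """Mimic an update that matches on `key` and never overwrites _id."""
    by_key = {key(doc): doc for doc in existing}
    for doc in incoming:
        match = by_key.get(key(doc))
        if match is None:
            existing.append(doc)
            by_key[key(doc)] = doc
        else:
            match.update({k: v for k, v in doc.items() if k != "_id"})
    return existing


scraped = [{"course": "CS 1337", "section": "001"}]

# First run populates the "DB"; the second run re-parses the same data.
courses_db, sections_db = parse(scraped)
new_courses, new_sections = parse(scraped)

merge_keep_old_ids(courses_db, new_courses, key=lambda d: d["course"])
merge_keep_old_ids(sections_db, new_sections, key=lambda d: (d["course"], d["number"]))

# The stored _ids are unchanged, but the links now point at the new _ids.
course_ids = {c["_id"] for c in courses_db}
dangling = [s for s in sections_db if s["course_reference"] not in course_ids]
print(len(dangling))
```

After the second run, every section's link refers to an _id that was never written to the course collection, which is the invalid-link situation described above.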

Thoughts on how to resolve this are welcome; there are multiple ways that we could implement a solution.


mohammadmehrab · 2024-05-01T18:24:48Z

I was thinking about any potential solutions to this, but I've had no real luck so far. However, would it be possible to save the original _id in a new field for each document and base the links on that field rather than the _id field? This would allow the _id fields to be updated, but still maintain the integrity of the connections between courses/profs/sections. In theory this sounds like it would work, but I'm definitely worried that in execution something major could break down.

The other solution I could think of would somehow replace the old _id fields used for the links with the new _id fields, but I feel like implementing that is very unruly and unnecessarily difficult, so I would much prefer the first option.

Another potential approach of the first solution would be instead of using _id fields to link courses/profs/sections, we could potentially use multiple identification fields, the same ones we use for merging. For example, courses would be with catalog_year, course_number, and subject_prefix. I'm not too sure of the viability of this option, but I just wanted to throw it out there as an alternative to using _id fields altogether.

I'd love anyone's feedback on this and I would also love to see others' solutions!

democat3457 · 2024-09-26T09:04:37Z

> Another potential approach of the first solution would be instead of using _id fields to link courses/profs/sections, we could potentially use multiple identification fields, the same ones we use for merging. For example, courses would be with catalog_year, course_number, and subject_prefix. I'm not too sure of the viability of this option, but I just wanted to throw it out there as an alternative to using _id fields altogether.

This is what I was thinking: maybe change the primary key to a new field calculated like f'{catalog_year}_{subject_prefix}_{course_number}', which would guarantee that the combination of those fields is unique and would still retain links when merging.
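A minimal sketch of that computed key, assuming course documents carry the three fields as strings. The `course_key` function name and the sample values are illustrative only. Because the key is derived from the data rather than generated per run, two independent parses of the same course produce the same key, so links built on it survive a merge.

```python
def course_key(course):
    """Derive a stable key from the fields already used for merging."""
    return f'{course["catalog_year"]}_{course["subject_prefix"]}_{course["course_number"]}'


# Two separate parse runs of the same course (sample values are made up).
first_run = {"catalog_year": "24", "subject_prefix": "CS", "course_number": "1337"}
second_run = {"catalog_year": "24", "subject_prefix": "CS", "course_number": "1337"}

key_a = course_key(first_run)
key_b = course_key(second_run)
print(key_a)           # 24_CS_1337
print(key_a == key_b)  # True
```

One caveat with this sketch: if any of the component values could itself contain an underscore, two different courses might map to the same string, so the separator would need to be chosen with the real data in mind.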

jpahm · 2024-10-24T22:34:26Z

I think letting Mongo auto-generate the _id field and then using our own primary key would be the ideal system. Going to bump this up to L3; anyone who wants to try undertaking it is welcome to do so. Even a small proof-of-concept could be nice.
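As a starting point for a proof-of-concept, the pipeline stages for such a merge might look like the sketch below. This only builds the pipeline as plain Python data and does not connect to a database. The `build_merge_pipeline` helper, the `course_key` field name, and the `courses` collection name are assumptions for illustration. The sketch relies on two documented $merge behaviours: the `on` field must be backed by a unique index in the target collection, and dropping _id from the incoming documents (here with a preceding $unset stage) avoids the "cannot modify _id" error when matching on a different field.

```python
def build_merge_pipeline(target_collection, key_field="course_key"):
    """Build aggregation stages that upsert on our own key instead of _id.

    Hypothetical helper; the target collection needs a unique index on
    `key_field` for $merge to accept it as the `on` field.
    """
    return [
        # Drop the freshly generated _id so the stored one is never touched.
        {"$unset": "_id"},
        {
            "$merge": {
                "into": target_collection,
                "on": key_field,
                "whenMatched": "replace",
                "whenNotMatched": "insert",
            }
        },
    ]


pipeline = build_merge_pipeline("courses")
for stage in pipeline:
    print(stage)
```

In a real uploader this list would presumably be passed to an aggregate call on a staging collection holding the newly parsed documents. Links between courses/profs/sections would also need to reference the custom key rather than _id for this to resolve the original problem.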

jpahm added labels Help Wanted (Extra attention is needed) and L2 (A task suitable for someone who is comfortable helping with implementing features.) on Apr 25, 2024

jpahm added label L3 (A task suitable for someone who is comfortable implementing large-scale features/projects.) and removed label L2 (A task suitable for someone who is comfortable helping with implementing features.) on Oct 24, 2024
