-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix some scrambled references #2275
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Quite some time ago an "order" field was added to the Reference and PublishedReference models, to make the order explicit. However, this field was neither required to be non-null, nor required to be unique, so nothing prevented creating and publishing a project with an ill-defined reference order. In particular: (a) NewProjectVersionForm did not copy "order" from PublishedReference into Reference; so newly created versions would have unpredictable ordering. (This bug has been fixed by commit 6b42613.) (b) ReferenceFormSet tried to set "order", but only succeeded in saving the value for references that were being created/modified; so if a reference was deleted and then new references added, there could be multiple references with the same "order" value. (This bug is still present.) It is difficult, bordering on impossible - especially after a project has been published - to guess the author's intent and fix the resulting issues automatically. However, a common situation is that two versions of the same project have *identical* lists of references, and the second version has all "order" values set to None. That means that the author created a new version of the existing project and *never edited* any of the references; so it is probably safe to assume that the author intended to keep the order exactly as it was in the previous version. An example of this would be: - https://web.archive.org/web/20240418010540/https://physionet.org/content/annotation-dataset-sdoh/1.0.0/ - https://web.archive.org/web/20240710153623/https://physionet.org/content/annotation-dataset-sdoh/1.0.1/ A slightly less common situation is that the second version of the project has all the same references as the first version, but also has one or more additional references added to (what the author intended to be) the end of the list. Since those new references would have been created via ReferenceFormSet, the new references have "order" greater than the number of old references, while the old references still have "order" equal to None. Again, if the old references were never edited, it's probably reasonable to assume that the author didn't intend to change their order. An example of this would be: - https://web.archive.org/web/20240731053747/https://physionet.org/content/mimiciv/2.2/ - https://web.archive.org/web/20240806233833/https://physionet.org/content/mimiciv/3.0/ (notice that the three new references are listed as the *first* three items, because null is sorted after any non-null value.) Here, we add a data migration that should automatically fix those two cases, and other similarly unambiguous cases. It works by iterating over projects in publication order, meaning that it should be able to repair references that were copied from version A to version B, and then later copied from version B to version C. One thing worth noting here is that we do not simply change the "order" of existing Reference and PublishedReference objects. ReferenceFormSet has never honored the "order" when it comes to displaying references in the Project Content (or Copyedit) pages. That formset has always ordered references by their primary key ("id"). That logic would be difficult to change, and doing so would only make the problem worse by further obscuring the author's intent. On the other hand, it also seems impractical to revert to "id"-ordering in previews and published projects. Thus, when repairing a broken project, we actually shuffle the "descriptions" of the existing Reference objects, so that the final "order" and "id" should result in the same actual ordering. It's possible that this could lead to errors if an author is trying to edit the reference list while this migration is being deployed, but that seems fairly unlikely and less bad than the alternatives. It's not really necessary to do this for PublishedReferences, but using the same logic as for References doesn't hurt.
thanks for looking at this! |
This is the output on production:
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Project references have a field called
order
which is nominallysupposed to indicate the order they're displayed.
Correct use of this field hasn't been enforced (see issue #2137), and
a lot of published projects on PhysioNet (probably fair to say most
projects that have more than one published version) have references
displayed in the wrong order, because the order wasn't copied from one
version to another.
This script should retroactively fix a large number of both published
and active projects.
I tried running it on a recent database dump from PhysioNet, and it
identified a total of 36 projects that it was able to fix. I
spot-checked
annotation-dataset-sdoh
andmimiciv
and both lookedcorrect.
There are about 12 published projects that will still require manual
fixing.
As I said in issue #2137, I don't intend to do any manual fixing of
active projects, but I do intend to add an integrity check.