Switch from pickled blobs to JSON data #1786
base: master
Conversation
If we are moving from BLOBs to JSON then we should really use the new format. See PR #800. The main benefit of the new format is that it is easier to maintain and debug. Instead of lists we use dictionaries, so, for example, we refer to the field "parent_family_list" instead of field number 9. Upgrades are no problem; we just read and write the raw data. When I have more time I'll update you on the discussion whilst you have been away.
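For illustration, a minimal sketch of the difference described above. The record layout below is made up; only index 9 and the field name "parent_family_list" come from the comment, and the handle at index 0 matches the diffs later in this PR.

import json

# Old blob format: a positional list, so callers must know that index 9
# means the parent family list (the intermediate values here are padding).
old_style = ["H0001", "I0001", 1, None, None, None, None, None, None, ["F0001"]]

# New JSON format: a dictionary keyed by schema field names.
new_style = {"handle": "H0001", "gramps_id": "I0001", "parent_family_list": ["F0001"]}

print(old_style[9])                       # -> ['F0001'] (index must be memorised)
print(new_style["parent_family_list"])    # -> ['F0001'] (self-documenting)
print(json.dumps(new_style))              # what would be stored in the database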
Oh, that sounds like a great idea! I'll take a look at the JSON format and switch to that. It should work even better with the SQL JSON_EXTRACT() function.
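As a hedged illustration of that point, the sketch below queries a JSON column directly from Python with SQLite. The table and column names are assumptions, not the actual Gramps schema, and it relies on an SQLite build with the JSON functions available.

import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, json_data TEXT)")
con.execute(
    "INSERT INTO person VALUES (?, ?)",
    ("H0001", json.dumps({"handle": "H0001", "gramps_id": "I0001"})),
)

# With JSON text stored per row, a field can be filtered or selected in SQL
# instead of unpickling every blob in Python.
row = con.execute(
    "SELECT json_extract(json_data, '$.gramps_id') FROM person WHERE handle = ?",
    ("H0001",),
).fetchone()
print(row[0])  # -> I0001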
There are a few places where the new format is used, so we will get some bonus performance improvements. Feel free to make changes to my existing code if you see a benefit. You may also want to have a quick look at how we serialize
Making some progress. It turns out the serialized format had leaked into many other places, probably for speed. Probably good candidates for business logic.
I added a
@Nick-Hall, I will probably need your assistance regarding the complete save/load of the to_json and from_json functions. I looked at your PR but, as it touches 590 files, there is a lot there. In this PR, I can now upgrade a database and load the people views (except for the name functions, which I still have to figure out).
Thanks @Nick-Hall, that was very useful. I think that I will cherry-pick some of the changes (like attribute name changes and elimination of private attributes). You'll see that I made many of the same changes you did. But one thing I found is that if we want to allow upgrades from previous versions, then we need to be able to read in blob_data and write out json_data. I think my version has that covered. I'll continue to make progress.
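A minimal sketch of the upgrade path described in that comment, assuming the legacy column holds a pickled tuple and the new column holds JSON text; the helper names and the toy record layout are illustrative, not the code in this PR.

import json
import pickle

def blob_to_json(blob_data, tuple_to_dict):
    """Hypothetical upgrade helper: read in the legacy pickled blob for one
    record and return the JSON text the new schema expects.

    `tuple_to_dict` stands in for whatever maps the old positional tuple
    onto the dictionary keyed by schema field names.
    """
    raw_tuple = pickle.loads(blob_data)          # read in blob_data
    return json.dumps(tuple_to_dict(raw_tuple))  # write out json_data

# Toy usage with a made-up two-field record:
legacy = pickle.dumps(("H0001", "I0001"))
print(blob_to_json(legacy, lambda t: {"handle": t[0], "gramps_id": t[1]}))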
@dsblank Why are you removing the properties? The validation in the setters will no longer be called.
@Nick-Hall, I thought that was what @prculley did for optimization, and I thought it was needed. I can put those back :)
Perhaps we could consider a solution similar to that provided by the pickle module. I expect that only a handful of classes would need to override the default methods.
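A rough sketch of that kind of approach, on the assumption that it means a generic default conversion most classes inherit, with only a few classes overriding it. The class and method names here are illustrative, not the actual Gramps API.

class JSONSerializableMixin:
    """Default conversion shared by most classes (illustrative names)."""

    def to_dict(self):
        # Default behaviour: expose the public attributes directly.
        return {k: v for k, v in self.__dict__.items() if not k.startswith("_")}

    @classmethod
    def from_dict(cls, data):
        # Rebuild the object without running __init__, mirroring how
        # pickle restores state; special classes would override this.
        obj = cls.__new__(cls)
        obj.__dict__.update(data)
        return obj


class Example(JSONSerializableMixin):
    def __init__(self, handle, gramps_id):
        self.handle = handle
        self.gramps_id = gramps_id


e = Example("H0001", "I0001")
assert Example.from_dict(e.to_dict()).gramps_id == "I0001"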
@emyoulation, nice catch! Can you turn your above comment into a feature request? I'll handle those issues after this is merged.
Done. Thanks for the direction.
@@ -211,17 +212,23 @@ def add_row2(self, handle, data):
        # add the citation as a child of the source. Otherwise we add the source
        # first (because citations don't have any meaning without the associated
        # source)
-       if self._get_node(data[5]):
+       if self._get_node(data["source_handle"]):
Pull "source_handle" out as a constant and use it here and at line 222 below.
See note.
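For reference, a tiny sketch of the refactoring suggested above; the constant name follows the COLUMN_* convention used elsewhere in this PR but is only an assumption, not code from the PR itself.

# Defined once, next to the other COLUMN_* constants (name assumed):
COLUMN_SOURCE_HANDLE = "source_handle"

# ...and then reused at both call sites instead of the raw string:
#     if self._get_node(data[COLUMN_SOURCE_HANDLE]):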
@@ -137,7 +139,7 @@ def column_description(self, data):
        return data[COLUMN_DESCRIPTION]

    def column_participant(self, data):
-       handle = data[0]
+       handle = data["handle"]
Suggested change:
-       handle = data["handle"]
+       handle = data[COLUMN_HANDLE]
See note.
@@ -148,12 +150,11 @@ def column_participant(self, data):

    def column_place(self, data):
        if data[COLUMN_PLACE]:
-           cached, value = self.get_cached_value(data[0], "PLACE")
+           cached, value = self.get_cached_value(data["handle"], "PLACE")
Suggested change:
-           cached, value = self.get_cached_value(data["handle"], "PLACE")
+           cached, value = self.get_cached_value(data[COLUMN_HANDLE], "PLACE")
See note.
                value = place_displayer.display_event(self.db, event)
-               self.set_cached_value(data[0], "PLACE", value)
+               self.set_cached_value(data["handle"], "PLACE", value)
Suggested change:
-               self.set_cached_value(data["handle"], "PLACE", value)
+               self.set_cached_value(data[COLUMN_HANDLE], "PLACE", value)
See note.
@@ -219,7 +218,7 @@ def column_tag_color(self, data):
        """
        Return the tag color.
        """
-       tag_handle = data[0]
+       tag_handle = data["handle"]
Suggested change:
-       tag_handle = data["handle"]
+       tag_handle = data[COLUMN_HANDLE]
Yes, that would be more consistent, and I love consistency! But I went the other way.
@@ -116,63 +117,63 @@ def on_get_n_columns(self):
        return len(self.fmap) + 1

    def column_father(self, data):
-       handle = data[0]
+       handle = data["handle"]
Add constants and use them instead of raw strings throughout the file?
See note.
    COLUMN_CHANGE = 9
    COLUMN_TAGS = 10
    COLUMN_PRIV = 11
    COLUMN_HANDLE = "handle"
Would the COLUMN_XXX constants be clearer if called KEY_XXX, since they are now keys into a Python dictionary?
See note. But I think using the actual schema name is clearer.
        else:
-           return data[2:] == empty_data[2:]
+           for key in empty_data:
+               if key in ["change", "gramps_id", "handle"]:
Is it worth extracting ["change", "gramps_id", "handle"] and giving it a name that describes why these keys are skipped, e.g. internal_data_keys = ["change", "gramps_id", "handle"]?
I think I'll leave it like this, as the only note about it is in the comments here, and I think this makes it clear. Also, as far as I saw, it is used only here.
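For completeness, a small self-contained sketch of the refactoring the reviewer suggested (the list name is the reviewer's example above; the sample data and the check itself are made up for illustration):

# Keys that identify a record rather than describe its contents; they are
# skipped when deciding whether an object is otherwise empty.
internal_data_keys = ["change", "gramps_id", "handle"]

empty_data = {"handle": "", "gramps_id": "", "change": 0, "note_list": []}
data = {"handle": "H1", "gramps_id": "I1", "change": 123, "note_list": []}

is_empty = all(
    data[key] == empty_data[key]
    for key in empty_data
    if key not in internal_data_keys
)
print(is_empty)  # -> True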
@dsblank apologies for all the nit-picky comments.
Co-authored-by: stevenyoungs <[email protected]>
@stevenyoungs, I appreciate the review! For the
If others disagree, they can make it consistent a different way in a follow-up PR. Thanks!
@Nick-Hall, all tests are passing, and I have addressed all review comments so far.
Yes, I agree that this is the best approach.
I can make arguments for either approach. Consistency is of greater value. I can't imagine the strings will change frequently.
This PR converts the database interface to use JSON data rather than the pickled blobs used since the early days.
db.serializer:
  a. abstracts data column names
  b. contains serialize/unserialize functions
It does this by switching between serializers.
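A rough sketch of the serializer-switching idea summarised above. The class and function names are illustrative only and do not reflect the actual db.serializer API in this PR.

import json
import pickle


class BlobSerializer:
    """Legacy format: pickled positional data (sketch only)."""

    @staticmethod
    def loads(stored):
        return pickle.loads(stored)

    @staticmethod
    def dumps(obj_data):
        return pickle.dumps(obj_data)


class JSONSerializer:
    """New format: JSON documents keyed by schema field names (sketch only)."""

    @staticmethod
    def loads(stored):
        return json.loads(stored)

    @staticmethod
    def dumps(obj_data):
        return json.dumps(obj_data)


def get_serializer(database_stores_json):
    """Pick the serializer based on what the database actually contains, so
    an old database can still be read and then written back out as JSON."""
    return JSONSerializer if database_stores_json else BlobSerializer


# Toy usage: round-trip one record through the chosen serializer.
serializer = get_serializer(database_stores_json=True)
stored = serializer.dumps({"handle": "H0001", "gramps_id": "I0001"})
print(serializer.loads(stored)["gramps_id"])  # -> I0001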