Switch from pickled blobs to JSON data #1786

Open · wants to merge 59 commits into master
Conversation

dsblank
Member

dsblank commented Oct 10, 2024

This PR converts the database interface to use JSON data rather than the pickled blobs used since the early days.

  1. Uses a new abstraction in the database: db.serializer (sketched below)
    a. abstracts data column name
    b. contains serialize/unserialize functions
  2. Updates database format to 21
  3. The conversion from 20 to 21 reads pickled blobs, and writes JSON data.
    a. It does this by switching between serializers
  4. New databases do not contain pickled blobs
  5. Converted databases contain both fields
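To give a feel for the shape of this (the class and attribute names below are only a sketch, not the exact code in this PR), the serializer abstraction works along these lines:

```python
import json
import pickle


class BlobSerializer:
    """Sketch: read/write the legacy pickled-blob column (schema 20)."""

    data_field = "blob_data"  # abstracted column name

    @staticmethod
    def serialize(obj):
        # Gramps objects already provide serialize()/unserialize()
        return pickle.dumps(obj.serialize())

    @staticmethod
    def unserialize(blob, cls):
        return cls().unserialize(pickle.loads(blob))


class JSONSerializer:
    """Sketch: read/write the JSON column introduced in schema 21."""

    data_field = "json_data"

    @staticmethod
    def serialize(obj):
        # assumes a to_dict()/from_dict() pair on the objects
        return json.dumps(obj.to_dict())

    @staticmethod
    def unserialize(text, cls):
        return cls.from_dict(json.loads(text))
```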

@Nick-Hall
Member

If we are moving from BLOBs to JSON then we should really use the new format. See PR #800.

The new format uses the to_json and from_json methods in the serialize module to build the JSON from the underlying classes. It comes with get_schema class methods which provide a JSON Schema, allowing the validation that we already use in our unit tests.

The main benefit of the new format is that it is easier to maintain and debug. Instead of lists we use dictionaries. So, for example, we refer to the field "parent_family_list" instead of field number 9.
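For example, reading the same field from a raw Person record changes roughly like this (index 9 is the old positional slot; the dict key is the schema field name):

```python
# old BLOB format: a nested tuple/list read by position
parent_families = person_data[9]

# new JSON format: a dictionary read by schema field name
parent_families = person_data["parent_family_list"]
```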

Upgrades are no problem. We just read and write the raw data.

When I have more time I'll update you on the discussion whilst you have been away.

@dsblank
Member Author

dsblank commented Oct 11, 2024

Oh, that sounds like a great idea! I'll take a look at the JSON format and switch to that. Should work even better with the SQL JSON_EXTRACT().
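For instance, something like this should become possible directly in SQLite (the table and column names here are only a guess at what the new schema will look like):

```python
import sqlite3

con = sqlite3.connect("grampsdb.sqlite")  # illustrative path
cur = con.cursor()
# json_extract() is part of SQLite's built-in JSON support; "person" and
# "json_data" are assumed names for the schema-21 table and column.
cur.execute(
    "SELECT json_extract(json_data, '$.gramps_id') "
    "FROM person "
    "WHERE json_extract(json_data, '$.private') = 0"
)
print(cur.fetchall())
```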

@Nick-Hall
Member

There are a few places where the new format is used, so we will get some bonus performance improvements.

Feel free to make changes to my existing code if you see a benefit.

You may also want to have a quick look at how we serialize GrampsType. Enough information is stored so that we can recreate the object, but I don't think that I chose to store all fields.

@dsblank
Member Author

dsblank commented Oct 12, 2024

Making some progress. It turns out the serialized format had leaked into many other places, probably for speed. Those places are good candidates for moving into business logic.

@dsblank
Member Author

dsblank commented Oct 13, 2024

I added a to_dict() and from_dict() based on the to_json() and from_json(). I didn't know about the object hooks. Brilliant! That saves so much code.
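Roughly what I mean (a sketch only; the "_class" key and the registry are stand-ins for however the serialize module tags object types):

```python
import json

CLASS_REGISTRY = {}  # hypothetical: maps "_class" names to gen.lib classes


def gramps_object_hook(obj):
    # json.loads() calls this for every decoded JSON object (innermost
    # first), so one hook can rebuild a whole nested structure.
    cls = CLASS_REGISTRY.get(obj.get("_class"))
    if cls is not None:
        return cls.from_dict(obj)  # the from_dict() mentioned above
    return obj


# usage: person = json.loads(json_text, object_hook=gramps_object_hook)
```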

@dsblank
Member Author

dsblank commented Oct 13, 2024

@Nick-Hall, I will probably need your assistance regarding the complete save/load of the to_json and from_json functions. I looked at your PR, but as it touches 590 files, there is a lot there.

In this PR, I can now upgrade a database, and load the people views (except for name functions which I have to figure out).

(screenshot attached in the original comment)

@Nick-Hall
Member

@dsblank I have rebased PR #800 on the gramps51 branch. Only 25 files were actually changed.

You can also see the changes suggested by @prculley resulting from his testing and performance benchmarks.

@dsblank
Member Author

dsblank commented Oct 13, 2024

Thanks @Nick-Hall, that was very useful. I think that I will cherry-pick some of the changes (like the attribute name changes and the elimination of private attributes).

You'll see that I made many of the same changes you did. But one thing I found is that if we want to allow upgrades from previous versions, then we need to be able to read in blob_data and write out json_data. I think my version has that covered.
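In rough terms, the 20-to-21 step does something like this for each object type (serializer objects as sketched in the description; the raw read/write accessors here are illustrative, not the exact API):

```python
from gramps.gen.lib import Person


def upgrade_person_table(db, old_serializer, new_serializer):
    """Sketch: read with the pickle serializer, write back as JSON."""
    for handle in db.get_person_handles():
        raw = db.get_raw_person_data(handle)           # legacy blob_data
        person = old_serializer.unserialize(raw, Person)
        json_text = new_serializer.serialize(person)
        db.write_raw_person_data(handle, json_text)    # hypothetical writer
```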

I'll continue to make progress.

@Nick-Hall
Member

@dsblank Why are you removing the properties? The validation in the setters will no longer be called.

@dsblank
Member Author

dsblank commented Oct 14, 2024

@Nick-Hall, I thought that was what @prculley did for optimization, and I thought it was needed. I can put those back :)

@Nick-Hall
Member

Perhaps we could consider a solution similar to that provided by the pickle __getstate__ and __setstate__ methods.

A get_state method in a base class could return a dictionary of public attributes by default. This could be overridden to add properties if required.

A set_state method could write the values back. In the case of properties we could just set the corresponding private variable rather than calling the setter. The list to tuple conversion could also be done in this method.

I expect that only a handful of classes would need to override the default methods.
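Something along these lines (just a sketch of the idea, not final code):

```python
class BaseObject:
    def get_state(self):
        # Default: a dictionary of the public attributes.  Classes that use
        # properties can override this to include them.
        return {key: value for key, value in self.__dict__.items()
                if not key.startswith("_")}

    def set_state(self, state):
        # Default: write the values straight back.  For a property we would
        # instead set the corresponding private variable, and any
        # list-to-tuple conversion could also happen here.
        for key, value in state.items():
            setattr(self, key, value)
```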

@dsblank
Member Author

dsblank commented Nov 1, 2024

@emyoulation, nice catch! Can you turn your above comment into a feature request? I'll handle those issues after this is merged.

@emyoulation
Contributor

Can you turn your above comment into a feature request?

Done. Thanks for the direction.
https://gramps-project.org/bugs/view.php?id=13488

@@ -211,17 +212,23 @@ def add_row2(self, handle, data):
# add the citation as a child of the source. Otherwise we add the source
# first (because citations don't have any meaning without the associated
# source)
if self._get_node(data[5]):
if self._get_node(data["source_handle"]):
Contributor

Pull "source_handle" out as a constant and use here and line 222 below

Member Author

See note.

@@ -137,7 +139,7 @@ def column_description(self, data):
return data[COLUMN_DESCRIPTION]

def column_participant(self, data):
handle = data[0]
handle = data["handle"]
Contributor

Suggested change
handle = data["handle"]
handle = data[COLUMN_HANDLE]

Member Author

See note.

@@ -148,12 +150,11 @@ def column_participant(self, data):

def column_place(self, data):
if data[COLUMN_PLACE]:
cached, value = self.get_cached_value(data[0], "PLACE")
cached, value = self.get_cached_value(data["handle"], "PLACE")
Contributor

Suggested change
cached, value = self.get_cached_value(data["handle"], "PLACE")
cached, value = self.get_cached_value(data[COLUMN_HANDLE], "PLACE")

Member Author

See note.

value = place_displayer.display_event(self.db, event)
self.set_cached_value(data[0], "PLACE", value)
self.set_cached_value(data["handle"], "PLACE", value)
Contributor

Suggested change
self.set_cached_value(data["handle"], "PLACE", value)
self.set_cached_value(data[COLUMN_HANDLE], "PLACE", value)

Member Author

See note.

@@ -219,7 +218,7 @@ def column_tag_color(self, data):
"""
Return the tag color.
"""
tag_handle = data[0]
tag_handle = data["handle"]
Contributor

Suggested change
tag_handle = data["handle"]
tag_handle = data[COLUMN_HANDLE]

Member Author

Yes, that would be more consistent, and I love consistency! But I went the other way.

@@ -116,63 +117,63 @@ def on_get_n_columns(self):
return len(self.fmap) + 1

def column_father(self, data):
handle = data[0]
handle = data["handle"]
Contributor

Add constants and use them instead of raw strings throughout the file?

Member Author

See note.

COLUMN_CHANGE = 9
COLUMN_TAGS = 10
COLUMN_PRIV = 11
COLUMN_HANDLE = "handle"
Contributor

Would the COLUMN_XXX be clearer if called KEY_XXX, since they are now keys into a Python dictionary?

Member Author

See note. But I think using the actual schema name is clearer.

else:
return data[2:] == empty_data[2:]
for key in empty_data:
if key in ["change", "gramps_id", "handle"]:
Contributor

Is it worth extracting ["change", "gramps_id", "handle"] and giving it a name that describes why these keys are skipped, e.g. internal_data_keys = ["change", "gramps_id", "handle"]?

Member Author

I think I'll leave it like this as the only note about it is in the comments here, and I think this makes it clear. Also, as far as I saw, it was used only here.

@stevenyoungs
Contributor

@dsblank apologies for all the nit-picky comments.
Overall it looks really good.

@dsblank
Member Author

dsblank commented Nov 4, 2024

@stevenyoungs, I appreciate the review!

For the COLUMN_ suggestions, I decided to go the other way, and remove all of them. I did that for the following reasons:

  1. There are more places that use the key directly than places that use the constants, and I'd like the usage to be consistent
  2. All of the uses of the COLUMN_ constants are inside methods whose names already name the field. So, COLUMN_DATE was used in a method called citation_date()
  3. Positions (like in data["tag_list"][0]) are not constants.

If others disagree, they can make it consistent a different way in a follow-up PR.

Thanks!

@dsblank
Member Author

dsblank commented Nov 4, 2024

@Nick-Hall, all tests are passing, and I have addressed all review comments so far.

@Nick-Hall
Member

For the COLUMN_ suggestions, I decided to go the other way, and remove all of them.

Yes, I agree that this is the best approach.

@stevenyoungs
Contributor

@stevenyoungs, I appreciate the review!

If others disagree, they can make it consistent a different way in a follow-up PR.

I can make arguments for either approach. Consistency is of greater value. I can't imagine the strings will change frequently.
