pre-queries for geojson geoms in bounding box before cluster query re… #10663

whatisgalen · 2024-03-07T08:02:21Z

Types of changes

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Description of Change

This PR adds a much simpler count_query that runs before the clustering query. If the count is 0, the view skips the more complex query, caches an empty tile, and raises a 404 for the MVT.

The count_query first counts how many geojson geoms for the requested node actually lie within the tile bounding box. If count >= the min_points parameter for that node config, the view proceeds to the clustering query, which reuses the count_query's subquery to narrow its search and speed things up. If 0 < count < min_points, it renders the unclustered MVT geometry tile instead.
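
The branching described above can be sketched as a small pure function. This is only an illustration under assumed names: `choose_mvt_strategy` and the labels it returns are hypothetical, not identifiers from the PR.

```python
def choose_mvt_strategy(geom_count: int, min_points: int) -> str:
    """Pick how to build a tile from the number of geoms in its bounding box.

    Hypothetical helper: the name and return labels are illustrative only.
    """
    if geom_count == 0:
        # Nothing in the bounding box: cache an empty tile and answer 404.
        return "empty"
    if geom_count >= min_points:
        # Enough points to cluster: run the heavier clustering query.
        return "clustered"
    # 0 < geom_count < min_points: render the geometries unclustered.
    return "unclustered"
```

For instance, with `min_points=3`, a tile holding two geoms would take the unclustered path, while a tile holding three or more would be clustered.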

This PR also adds a method that builds an MVT cache key incorporating...

nodeid, x, y, and zoom
the userid (proxy for viewable_nodegroups)
a snapshot of the database in the form of the editlog count for the nodegroup linked to the request node

...so that stale MVT/data won't override more recent MVT/data for that node/zoom/x/y/user.

Anecdotal benchmarks of this branch vs the target branch show up to a 30% speed improvement for rendering the largest clustered MVTs.

Testing This PR

To test the query-related performance enhancement, you'll want to set tile = None between lines 285 and 286 to temporarily disable MVT caching.
To test the caching of 404s, you'll want to clear out your cache.

Issues Solved

#10452

Checklist

Unit tests pass locally with my changes
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)

Ticket Background

Sponsored by: @scholiumtech & Los Angeles OHR
Found by: @
Tested by: @
Designed by: @whatisgalen

Further comments

…10452

chiatt · 2024-03-07T12:16:26Z

While this looks like a nice performance improvement, it doesn't actually address #10452. A solution to #10452 would cache the fact that the database query was just performed and returned a 404, so that a subsequent request would skip any database query altogether.

whatisgalen · 2024-03-07T17:49:25Z

@chiatt we can do that, but I would strongly advise adding in some logic that embeds the state of the db into the cache key so that we avoid a case where a 404 was cached, then a new geom was created but it doesn't get included in a tile because the cache hasn't timed out yet.

For example:

cache_key = create_mvt_cache_key(node, zoom, x, y, request.user)
...

def create_mvt_cache_key(node, zoom, x, y, user, mvt_snapshot=None):
    if not mvt_snapshot:
        mvt_snapshot = models.EditLog.objects.filter(nodegroupid=str(node.nodegroup_id)).count()
    return f"mvt_{str(node.nodeid)}_{zoom}_{x}_{y}_{mvt_snapshot}_{user.id}"
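
To illustrate why the snapshot belongs in the key, here is a self-contained sketch that swaps Django's cache and the EditLog query for a plain dict and an integer. Both are stand-ins, and `make_key` is a hypothetical name. Once the edit count changes, the old 404 entry is never looked up again:

```python
def make_key(nodeid: str, zoom: int, x: int, y: int, user_id: int, snapshot: int) -> str:
    # Same layout as the f-string above; snapshot stands in for the EditLog count.
    return f"mvt_{nodeid}_{zoom}_{x}_{y}_{snapshot}_{user_id}"


cache = {}  # stand-in for the Django cache

# First request: no geometries yet, so an empty tile (404) is cached.
key_before = make_key("node-1", 12, 700, 1600, 7, snapshot=41)
cache[key_before] = b""

# An editor saves a geometry, so the edit count for the nodegroup goes up.
key_after = make_key("node-1", 12, 700, 1600, 7, snapshot=42)

# False: the new key misses, so the cached 404 is bypassed and the tile is re-queried.
stale_hit = key_after in cache
```

The trade-off is that the old entry stays in the cache until it times out; it is just no longer reachable through the new key.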

whatisgalen · 2024-03-08T09:06:49Z

still testing a couple alternative approaches here

chiatt

It really seems like this should offer a nice performance improvement, but when I look at request times in Silk, I'm not seeing any improvement. It might be that I'm not testing it with the right data because I'm sure the scale and distribution of points makes a difference. Also, the additional query to the edit log is probably increasing the request time by a wee bit as well.

Also, it doesn't seem consequential, but I'm getting this 500 error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.10/dist-packages/django/core/handlers/base.py", line 197, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/django/views/generic/base.py", line 104, in view
    return self.dispatch(request, *args, **kwargs)
  File "/web_root/arches/arches/app/views/api.py", line 107, in dispatch
    return super(APIBase, self).dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/django/views/generic/base.py", line 143, in dispatch
    return handler(request, *args, **kwargs)
  File "/web_root/arches/arches/app/views/api.py", line 396, in get
    tile = bytes(cursor.fetchone()[0])
TypeError: 'NoneType' object is not subscriptable
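
The traceback is consistent with `cursor.fetchone()` returning `None` when the query yields no row. Below is a minimal sketch of a guard, using an in-memory SQLite table as a stand-in for the real Postgres query; the `tiles` table and the `fetch_tile` helper are hypothetical.

```python
import sqlite3


def fetch_tile(cursor, tile_id):
    """Return tile bytes, or b"" when the query produces no row."""
    cursor.execute("SELECT data FROM tiles WHERE id = ?", (tile_id,))
    row = cursor.fetchone()  # None when there is no matching row
    if row is None or row[0] is None:
        return b""
    return bytes(row[0])


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tiles (id INTEGER PRIMARY KEY, data BLOB)")
cur.execute("INSERT INTO tiles VALUES (1, ?)", (b"\x1a\x02",))

present = fetch_tile(cur, 1)  # row exists
missing = fetch_tile(cur, 2)  # no row: the guard avoids the TypeError
```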

chiatt · 2024-03-16T12:38:28Z

arches/app/views/api.py

+                            [nodeid, zoom, x, y, nodeid, resource_ids],
+                        )
+                    else:
+                        tile = ""


It seems like this alone will prevent a requery for the MVT, because the value saved to the cache isn't None, and None is what is checked for before running the query.
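
A small sketch of that point, with a dict standing in for the cache (`get_tile` and `run_query` are hypothetical names): because only `None` counts as a miss, a cached empty value short-circuits the query on later requests.

```python
cache = {}      # stand-in for the Django cache
query_log = []  # records every time the pretend query runs


def run_query(key):
    """Pretend database query that finds no geometries."""
    query_log.append(key)
    return ""  # empty tile


def get_tile(key):
    tile = cache.get(key)  # None only when the key was never stored
    if tile is None:
        tile = run_query(key)
        cache[key] = tile  # "" is stored, so the next lookup is a hit
    return tile


first = get_tile("mvt_node_12_700_1600")
second = get_tile("mvt_node_12_700_1600")
```

Both calls return the empty tile, but the query runs only for the first one.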

arches/app/views/api.py

@@ -365,6 +400,10 @@ def get(self, request, nodeid, zoom, x, y):
            raise Http404()
        return HttpResponse(tile, content_type="application/x-protobuf")

+    def create_mvt_cache_key(node, zoom, x, y, user, mvt_snapshot=None):
+        if not mvt_snapshot:
+            mvt_snapshot = models.EditLog.objects.filter(nodegroupid=str(node.nodegroup_id)).count()


chiatt · 2024-03-16T13:02:35Z

> @chiatt we can do that, but I would strongly advise adding in some logic that embeds the state of the db into the cache key so that we avoid a case where a 404 was cached, then a new geom was created but it doesn't get included in a tile because the cache hasn't timed out yet.
>
> For example:
> cache_key = create_mvt_cache_key(node, zoom, x, y, request.user)
> ...
>
> def create_mvt_cache_key(node, zoom, x, y, user, mvt_snapshot=None):
>     if not mvt_snapshot:
>         mvt_snapshot = models.EditLog.objects.filter(nodegroupid=str(node.nodegroup_id)).count()
>     return f"mvt_{str(node.nodeid)}_{zoom}_{x}_{y}_{mvt_snapshot}_{user.id}"

As long as the value saved to the cache for a 404 differs from the one saved for an MVT tile, it seems OK to return no tile even if an editor has recently created a new geometry within those bounds. That's the price of relying on a cache, and the cache timeout can always be adjusted to an admin's preference for performance over accuracy.

whatisgalen · 2024-03-16T20:41:57Z

@chiatt the SELECT COUNT query should leverage Postgres indexes and caching (vs a regular SELECT query), so while it does represent a query, it's still performant. If we want to basically rule out creating any new queries on principle, the alternative/workaround would be to cache a count of all editlog objects every time a new editlog instance gets created. I would be a little surprised, though, if removing the SELECT COUNT query actually sped things up by more than a hundred milliseconds.

> It really seems like this should offer a nice performance improvement, but when I look at request times in Silk, I'm not seeing any improvement. It might be that I'm not testing it with the right data because I'm sure the scale and distribution of points makes a difference. Also, the additional query to the edit log is probably increasing the request time by a wee bit as well.

How many geometries does the resource layer you're testing with have? I would expect little to no improvement for lower-density resource layers, but once you get into the tens of thousands I would expect noticeably faster responses for MVT requests.
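
The workaround mentioned above (keeping a cached per-nodegroup edit count that is updated whenever an EditLog row is created) could look roughly like this. It is a plain-Python sketch under assumptions: the dict stands in for the cache, and `on_editlog_created` is a hypothetical hook marking where a Django `post_save` receiver would go.

```python
from collections import defaultdict

# Stand-in for cached counts, keyed by nodegroupid.
edit_counts = defaultdict(int)


def on_editlog_created(nodegroupid: str) -> None:
    """Hypothetical hook: call once for every newly created EditLog row."""
    edit_counts[nodegroupid] += 1


def mvt_snapshot(nodegroupid: str) -> int:
    """Read the snapshot from the cache instead of running SELECT COUNT."""
    return edit_counts[nodegroupid]


on_editlog_created("ng-a")
on_editlog_created("ng-a")
on_editlog_created("ng-b")
```

This moves the cost from each MVT request to each edit, so whether it helps depends on how often edits happen relative to tile requests.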

whatisgalen added 2 commits March 7, 2024 00:01

pre-queries for geojson geoms in bounding box before cluster query re #10452

bddfed8

fixes type issue re #10452

4e80fd6

whatisgalen added 3 commits March 7, 2024 17:09

creates cache_key create method, including db snapshot of user_x_editlog.nodegroup.count re #10452

9ad015d

calls create_mvt_cache_key to make cache_key for mvt request re #10452

c1efb72

caches an empty mvt so as to not re-query for non-existent mvt re #10452

2eb521a

whatisgalen requested a review from chiatt March 8, 2024 01:25

whatisgalen assigned chiatt Mar 8, 2024

whatisgalen removed the request for review from chiatt March 8, 2024 01:36

whatisgalen unassigned chiatt Mar 8, 2024

whatisgalen added 3 commits March 8, 2024 11:16

alters pre-query to only return count of matching geoms re #10452

3f5901c

store only search_geom count instead of geoms themselves re #10452

1151cd3

narrows query of matching geoms in cluster query to those within TileBBox re #10452

071f144

whatisgalen requested a review from chiatt March 8, 2024 19:22

whatisgalen assigned chiatt Mar 8, 2024

whatisgalen added 2 commits March 8, 2024 11:30

adds control flow for when search_geom_count < min_points to avoid clustering unclusterable tile re #10452

b9aea9a

rm redundant 404 exception re #10452

53bf50b

chiatt reviewed Mar 16, 2024

whatisgalen added 3 commits March 28, 2024 17:54

creates receiver method on post_save for EditLog to cache instance count for that nodegroupid, creates editlog user in CACHE_BY_USER re #10452

1fcda6e

gets cached editlog_nodegroupid_count if available to make mvt_cache_key re #10452

2e7c6e8

restyles to conform to line char limit, sets new editlog snapshot in cache if None in MVT cache key method re #10452

e4688a6

chiatt requested changes Apr 5, 2024

arches/app/views/api.py (outdated, resolved)

arches/app/models/models.py (outdated, resolved)

whatisgalen added 4 commits April 5, 2024 11:45

removes refs to editlog or mvt_snapshot in mvt cache key create method re #10452

9687708

rm editlog receiver due to perform implicns re #10452

bbd6f90

rm CACHE_BY_USER entry for editlog re #10452

e2a3a94

nit in settings.py re #10452

801bc3b

rm redundant cache.set statement, also retain tile from earlier query re #10452

72bc030

chiatt approved these changes Apr 17, 2024

chiatt merged commit 866bb69 into dev/7.6.x Apr 17, 2024
4 checks passed

chiatt deleted the 10452_mvt_optimize branch April 17, 2024 18:05
