jena-geosparql - Add assembler option to disable spatial index #1344

Open: wants to merge 1 commit into base: main

@vtermanis vtermanis commented May 31, 2022

The Spatial Index is generated on server startup and, as per design, thereafter cannot be updated (until the next Fuseki restart).

Currently there are two options for the Spatial Index in assembler configuration:

  1. geosparql:spatialIndexFile set => Index loaded from / generated + written to disk on startup
  2. geosparql:spatialIndexFile unset => Index generated in memory

For a read + write dataset, this index is not very useful: startup time is spent regenerating an index that is out of date after the next write operation. This proposal adds a new assembler option, geosparql:spatialIndexEnabled (defaulting to true), so that there is now a third mode:

  3. geosparql:spatialIndexEnabled set to false => geosparql:spatialIndexFile is ignored and no index is loaded or generated

- Now have to/from-file, in-memory and no index options
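
To make the three modes concrete, here is a minimal Java sketch of how the two settings could combine. The class, enum and method names are hypothetical and are not part of jena-geosparql; only the option names geosparql:spatialIndexEnabled and geosparql:spatialIndexFile come from this proposal.

```java
import java.util.Optional;

/** Hypothetical model of the three spatial index modes (not Jena's API). */
final class SpatialIndexModes {

    enum Mode { FILE_BACKED, IN_MEMORY, DISABLED }

    /**
     * @param enabled   value of geosparql:spatialIndexEnabled (proposed default: true)
     * @param indexFile value of geosparql:spatialIndexFile, or empty if unset
     */
    static Mode resolve(boolean enabled, Optional<String> indexFile) {
        if (!enabled) {
            // Assumption taken from the proposal: the file setting is ignored entirely.
            return Mode.DISABLED;
        }
        return indexFile.isPresent() ? Mode.FILE_BACKED : Mode.IN_MEMORY;
    }
}
```

With this reading, setting the flag to false wins over any configured index file, which matches the "ignored" wording of the third mode.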
@afs afs added the GeoSPARQL label May 31, 2022
@afs
Member

afs commented May 31, 2022

There are no tests covering the assembler change nor the functionality change.

Experts - what is the impact of no index on performance?

@vtermanis
Author

> There are no tests covering the assembler change nor the functionality change.

I did look to see what there was, but like you say, the assembler part is not currently tested. For the assembler, I presume you mean something like this? Or would it be more appropriate to explicitly call the updated assembler's createDataset and inspect the output (e.g. dataset has no spatial index in its context)?

@afs
Member

afs commented May 31, 2022

Whatever works for the GeoSPARQL interest community.

One approach, as in Fuseki main :: TestSecurityConfig, is to launch a server with a configuration and send it requests for testing.

@LorenzBuehmann
Contributor

LorenzBuehmann commented Jun 3, 2022

> Experts - what is the impact of no index on performance?

Not an expert, but I have been using Fuseki with GeoSPARQL for quite some time now ...

Containment checks can be much slower without the index.
For example, spatial containment queries that lead to point-in-polygon checks can currently use the index first: it takes the envelope of the polygon (i.e. a rectangle) to gather all points inside that rectangle, and a second, exact point-in-polygon check then filters out the points that are not in the polygon. For a large dataset and a small polygon this can be a huge performance gain. I don't have exhaustive numbers at the moment, but here is a small example on a dataset about companies (2,374,998 in total):

The query gives the number of companies (10,270) in a small part of Germany:

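PREFIX geo:     <http://www.opengis.net/ont/geosparql#>
PREFIX geof:    <http://www.opengis.net/def/function/geosparql/>
PREFIX spatial: <http://jena.apache.org/spatial#>
# The coy: prefix is specific to the dataset and is not given in the thread.

SELECT (count(?c) as ?cnt) {
  BIND("POLYGON((7.654288035299954 51.82366598560922,11.257803660299954 51.82366598560922,11.257803660299954 49.59800926392628,7.654288035299954 49.59800926392628,7.654288035299954 51.82366598560922))"^^geo:wktLiteral as ?box)
  ?c spatial:withinBoxGeom(?box) .  # the explicit spatial index lookup
  ?c a coy:Company ;
     geo:hasGeometry/geo:asWKT ?lit .
  FILTER(geof:sfContains(?box, ?lit))
}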

With the index lookup triple pattern it takes 0.1s; without it, ~10s.
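
The two-phase check described above (cheap envelope filter first, exact point-in-polygon test second) can be sketched in plain Java. This is an illustration only: the class and method names are made up, it uses no JTS or Jena code, and the loop over all points stands in for the index lookup, which in a real spatial index would avoid scanning every point.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: envelope pre-filter followed by an exact point-in-polygon test. */
final class TwoPhaseContainment {

    /** Exact test by ray casting; ring holds the polygon's {x, y} vertices in order. */
    static boolean contains(double[][] ring, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = ring.length - 1; i < ring.length; j = i++) {
            double xi = ring[i][0], yi = ring[i][1];
            double xj = ring[j][0], yj = ring[j][1];
            // Does a horizontal ray from (x, y) towards +x cross the edge (j -> i)?
            boolean crosses = (yi > y) != (yj > y)
                    && x < (xj - xi) * (y - yi) / (yj - yi) + xi;
            if (crosses) {
                inside = !inside;
            }
        }
        return inside;
    }

    static List<double[]> pointsInPolygon(double[][] ring, List<double[]> points) {
        // Envelope (bounding rectangle) of the polygon.
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (double[] v : ring) {
            minX = Math.min(minX, v[0]);
            maxX = Math.max(maxX, v[0]);
            minY = Math.min(minY, v[1]);
            maxY = Math.max(maxY, v[1]);
        }
        List<double[]> hits = new ArrayList<>();
        for (double[] p : points) {
            // Phase 1: cheap rectangle test. A spatial index answers this part
            // without visiting every point; the plain loop here does not.
            if (p[0] < minX || p[0] > maxX || p[1] < minY || p[1] > maxY) {
                continue;
            }
            // Phase 2: exact test on the few remaining candidates.
            if (contains(ring, p[0], p[1])) {
                hits.add(p);
            }
        }
        return hits;
    }
}
```

The gain reported above comes from phase 1 being answered by the index, so the expensive exact test only runs on points inside the rectangle.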

@neumarcx
Contributor

neumarcx commented Jun 3, 2022 via email

@afs
Member

afs commented Jun 10, 2022

The PR will add an option to make jena-geosparql ignore any persistent index. All lookups will only look in the geosparql RDF data. This way, queries are correct with respect to data updates but slow.

Is this the right thing to include in the codebase?

@vtermanis - at what scale have you used this? Does that usage include containment queries?

I propose merging this if there is a PR to update the documentation (https://github.com/apache/jena-site/blob/main/source/documentation/geosparql/geosparql-assembler.md).

Is there a reason why the index can't be updated?

@LorenzBuehmann
Contributor

LorenzBuehmann commented Jun 12, 2022

Nitpicking: why would we call the method prepareSpatialExtension at all if the spatial index isn't enabled? All it would do is check whether the dataset is empty (which has no benefit) and then return at the next `if` clause, so there is no need to call the method.

> Is there a reason why the index can't be updated?

@afs The reason is the underlying JTS data structure, the STRtree, into which items cannot be inserted once it has been built. We could allow for an update mode and switch to a Quadtree (a bit slower, but it allows insert/remove operations).
Moreover, we will have to think carefully about updating the "other" indexing structure of the geospatial layer as well, i.e. that literal, transformation and query rewrite part I think.
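
The build-once constraint can be pictured with a toy stand-in written in plain Java. It is not JTS code and does not reflect JTS internals; the class name, the point-only model and the choice to freeze the index on the first query are all assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a build-once index; mimics the constraint only, not JTS. */
final class BuildOnceIndex {
    private final List<double[]> pending = new ArrayList<>(); // {x, y} points awaiting bulk load
    private double[][] packed;                                // frozen snapshot, set by build()

    void insert(double x, double y) {
        if (packed != null) {
            throw new IllegalStateException("index already built; inserts are rejected");
        }
        pending.add(new double[] {x, y});
    }

    void build() {
        if (packed == null) {
            packed = pending.toArray(new double[0][]);
        }
    }

    /** Counts points inside the rectangle; the first query freezes the index (sketch assumption). */
    int countWithin(double minX, double minY, double maxX, double maxY) {
        build();
        int n = 0;
        for (double[] p : packed) {
            if (p[0] >= minX && p[0] <= maxX && p[1] >= minY && p[1] <= maxY) {
                n++;
            }
        }
        return n;
    }
}
```

Any data written after the freeze is invisible to queries until the whole structure is rebuilt, which is the situation a read + write dataset ends up in.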

@afs
Member

afs commented Jun 12, 2022

@LorenzBuehmann thank you for the background. jena-geosparql isn't an area I have looked into much and it has quite a high learning curve.

All - what are the implications of using Quadtree? Is it a relatively contained change in class SpatialIndex or does it have wider implications? What, very roughly, is the performance difference of an STRtree and a Quadtree? What about #1327 (PR for "allow geo index search for literals")?

@LorenzBuehmann
Contributor

This article contains some numbers for JTS Quadtree vs STRtree: https://link.springer.com/article/10.1007/s41019-020-00147-9

It covers

  • indexing costs
  • index size
  • range queries
  • distance queries
  • point-in-polygon join query

We could keep the STRtree for read-only datasets, and I think we have to live with the Quadtree for read-write Datasets. Internally, only the query operation is called on the STRtree, so changing the data structure should be trivial.

@vtermanis
Author

> at what scale have you used this? Does that usage include containment queries?

@afs, we've only used the geof:(distance|sfWithin|sfContains) functions so far, the latter two with geof:buffer only. The scale is small for now (100k geometries).

> we will have to think carefully about updating the "other" indexing structure of the geospatial layer as well, i.e. that literal, transformation and query rewrite part I think.

@LorenzBuehmann, do you mean because of the suggested QuadTree change for the spatial index or from a general performance perspective? (I saw your suggestion on using a different caching lib in Jira.)

@vtermanis
Author

(Sorry, one more question @LorenzBuehmann)

> we have to live with the Quadtree for read-write Datasets.

What would it mean for persistence? (From my understanding the current STRtree index is serialised to disk in full.)
For the case where a write-heavy dataset is only used sparingly for GeoSPARQL queries, is it still useful to offer the "no index" option also, i.e.:

  1. STRtree index pre-generated either to file or into memory (current mode)
  2. QuadTree index updated during writes (in memory and/or disk?)
    1. Can update geometries in data & continue to perform spatial queries
    2. With an existing large dataset, the initial index has to be pre-generated on startup
  3. Spatial index disabled
    1. No write & startup perf impact (if (2) persisted)
    2. GeoSPARQL queries slow(er), choose option 1 or 2 if this matters

@afs
Member

afs commented Jun 13, 2022

> (I [@vtermanis] saw your suggestion on using a different caching lib in Jira.)

JENA-2311 and PR #1235.

@LorenzBuehmann
Contributor

> do you mean because of the suggested QuadTree change for the spatial index or from a general performance perspective? (I saw your suggestion on using a different caching lib in Jira.)

@vtermanis I mean that once we allow updates, and removals in particular, we might have to address the current caching, e.g. by simply invalidating or emptying the current cache in the simplest case.

> What would it mean for persistence? (From my understanding the current STRtree index is serialised to disk in full.)

Yep, that is one of the things that would have to be discussed. I don't think JTS provides any disk-mapped data structure, which means it remains open when to persist the updates. That's always the case for in-memory index structures.

@afs
Member

afs commented Jun 17, 2022

The persistence is part of jena-geosparql:

https://github.com/apache/jena/blob/main/jena-geosparql/src/main/java/org/apache/jena/geosparql/spatial/SpatialIndexStorage.java

public final void insertItems(Collection<SpatialIndexItem> indexItems) throws SpatialIndexException {

@LorenzBuehmann
Contributor

Well, that only adds items to the index before it is finally built; after that the index remains immutable. It then serializes the index to disk as a Java object stream. That is just the collection of items though, not the underlying STRtree, which is rebuilt on each startup.

But there is no mechanism yet that would write changes made to a mutable R-tree index to disk, i.e. the index would only change in memory, and the question is how to make those changes persistent. Re-serializing the index each time the RDF graph changes seems infeasible, as it is somewhat slow for larger indexes and currently just dumps the whole index.
The main problem is simply that JTS doesn't provide any on-disk index afaik.
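
The "dump the whole collection of items" style of persistence described above looks roughly like the following in plain Java serialization. This is a sketch under assumptions: the Item class and its fields are hypothetical and do not mirror SpatialIndexItem, and a byte array is used in place of a file to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Sketch: persist only the item collection; the tree itself is rebuilt after loading. */
final class IndexItemsSnapshot {

    /** Hypothetical entry: an envelope plus the URI of the feature it belongs to. */
    static final class Item implements Serializable {
        private static final long serialVersionUID = 1L;
        final double minX, minY, maxX, maxY;
        final String uri;

        Item(double minX, double minY, double maxX, double maxY, String uri) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
            this.uri = uri;
        }
    }

    /** Writes the complete list in one go; there is no incremental update. */
    static byte[] write(List<Item> items) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(items));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    @SuppressWarnings("unchecked")
    static List<Item> read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (List<Item>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because write always emits every item, its cost grows with the index size rather than with the size of the change, which is why re-serializing on each graph update scales poorly.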

@Aklakan
Contributor

Aklakan commented Jun 25, 2022

Ideally there would be a persistent R-Tree implementation similar to dboe's BPlusTree.

But even just serializing the in-memory data structure as a whole rather than having to rebuild it on start-up would be an improvement.
Also, using Kryo serialization (is BSD-3 compatible with Apache v2?) would most likely be faster than Java serialization.
I suppose parallel de-/serialization of tree data structures should be rather trivial to implement when going with the in-memory index solution for now.

One approach is also to represent grid cells (with optional nesting) as IDs and then link spatial objects to the grid cell ids - so a kind of poor man's quad tree represented in a B+ tree. This could be implemented with the TDB machinery - but not sure whether that'd be a worthwhile endeavor.
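
A flat, single-level version of the grid-cell idea can be sketched as follows. Everything here is an assumption for illustration: the class name, the key packing, point-only objects, and the use of java.util.TreeMap as an in-memory stand-in for a B+ tree. It has no nesting and uses no TDB code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy grid index: cell IDs as keys in a sorted map (stand-in for a B+ tree). */
final class GridCellIndex {
    private final double cellSize;
    private final TreeMap<Long, List<String>> cells = new TreeMap<>();

    GridCellIndex(double cellSize) {
        this.cellSize = cellSize;
    }

    private long index(double v) {
        return (long) Math.floor(v / cellSize);
    }

    /** Packs row and column into one key; assumes both fit into 32 bits. */
    private static long key(long col, long row) {
        return (row << 32) | (col & 0xFFFFFFFFL);
    }

    void insert(String id, double x, double y) {
        cells.computeIfAbsent(key(index(x), index(y)), k -> new ArrayList<>()).add(id);
    }

    boolean remove(String id, double x, double y) {
        List<String> bucket = cells.get(key(index(x), index(y)));
        return bucket != null && bucket.remove(id);
    }

    /** Returns IDs from every cell overlapping the box; an exact check is still needed. */
    List<String> candidates(double minX, double minY, double maxX, double maxY) {
        List<String> out = new ArrayList<>();
        for (long row = index(minY); row <= index(maxY); row++) {
            for (long col = index(minX); col <= index(maxX); col++) {
                List<String> bucket = cells.get(key(col, row));
                if (bucket != null) {
                    out.addAll(bucket);
                }
            }
        }
        return out;
    }
}
```

Unlike a build-once tree, inserts and removals are plain key-value operations here, which is what would make such a layout a candidate for an updatable, disk-backed store.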

@SimonBin
Contributor

@vtermanis I believe you can use the geof: functions without wrapping in a GeoDS so you don't need to add this option :-)

@vtermanis
Author

vtermanis commented Jan 29, 2023

> I believe you can use the geof: functions without wrapping in a GeoDS so you don't need to add this option :-)

That's a good idea @SimonBin - but then surely Geometry Literal, Geometry Transform, Query Rewrite indexes/caching won't be available (which from my understanding are still useful for repeated queries against the same geometries).

@SimonBin
Contributor

I see, you're right. I guess this small addition to the code is straightforward and won't hurt.

(I also noticed that the code in fact needs to be updated because currently it uses a single Cache for all DSes)

@davidmireles

What is/was the outcome of the discussion on enabling updates to the GeoSPARQL spatial index? I find this to be a limiting aspect of the Jena GeoSPARQL implementation. A number of other triple stores provide it out of the box, and it would be a very welcome addition.

@SimonBin
Contributor

Maybe we shouldn't derail this thread, but as a stop-gap solution to your concern: we have implemented a method to manually update the geospatial index, which is currently good enough for our project: https://github.com/AKSW/fuseki-mods/tree/adaptions/jena-fmod-geosparql/src/main/java/org/apache/jena/fuseki/mod/geosparql

7 participants