Update ElasticSearch version #3153

matthias-ronge · 2020-02-07T08:43:23Z

Version 6 of Elasticsearch uses a new index format: the so-called mapping types are being removed, and an index created in version 6 can hold only a single type. We have to change the way we address Elasticsearch before we can use a more recent version.
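
To illustrate what the removal of mapping types means for addressing documents, here is a minimal sketch of how the REST path of a single document differs. The function name `document_path`, the index name `kitodo` and the type name `process` are assumptions made up for this example; they are not taken from the Kitodo.Production code.

```python
def document_path(index, doc_id, mapping_type=None):
    """Build the REST path of a single document.

    With a mapping type (Elasticsearch 5 style):    /{index}/{type}/{id}
    Without one (typeless style, Elasticsearch 7):  /{index}/_doc/{id}
    """
    # "_doc" is the fixed endpoint name that takes the place of the type.
    segment = mapping_type if mapping_type is not None else "_doc"
    return "/{}/{}/{}".format(index, segment, doc_id)


# Hypothetical names, for illustration only.
old_style = document_path("kitodo", 42, mapping_type="process")
new_style = document_path("kitodo", 42)
```

Code that builds paths in the old style for several types within one index therefore has to move to either one index per former type or a single combined document class.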

henning-gerhardt · 2020-02-07T09:26:33Z

Maybe the ElasticSearch (ES) integration of Hibernate is a better alternative, as it resolves a few issues, such as

correct deletion of data in both Hibernate and ES
writing one kind of query instead of two different ones (HQL for Hibernate and the ES dialect for ES)
not having to adapt to the changes in new ES versions ourselves (less maintenance for Kitodo.Production)
...

I know that this is a big change, but I think it is a good approach in the long term.

henning-gerhardt · 2020-11-11T09:44:10Z

According to https://www.elastic.co/support/matrix#matrix_jvm, a newer ElasticSearch version should be used, one that is supported on many Java versions. At the time of writing (11/11/2020), ElasticSearch versions 6.8.x and 7.7.x support the most Java versions.

matthias-ronge · 2020-11-25T10:55:20Z

The last release of ElasticSearch version 5 ran out of support on March 11, 2019. The current version at the time of writing is 7.11.

matthias-ronge · 2021-01-21T08:40:38Z

After Elastic changed the license for newer versions, which is ambiguous in some formulations, so that it can be interpreted against the operators of Production instances (even differently than it is currently intended), and since the ElasticSearch integration has to be rewritten anyway, my suggestion is to replace them entirely. Presentation uses Solr, and since Production and Presentation run on the same server in a large number of installations anyway, I would be very inclined to use Solr for Production as well. The search engine integration should be reconsidered and then implemented in a meaningful way by a search engine specialist.

The following points should be thought through beforehand and incorporated into a concept:

ElasticSearch 5 caused confusion because it made it appear that different object classes can be indexed independently, although internally they are not. This confusion runs through our implementation as well. It must be taken into account that all indexed object classes have to be grouped together under a common index object superclass. That class should be guided by what is to be searched for and found, and by what is to be displayed in the search hit list.
There is some confusion in the code about the ID number of an object, because the ID of the index object is different from that of the database object.
There should be only one indexed object in Production's index for each item to be searched for. In other words: there should be a reason why a class is indexed, that is, why the function cannot be provided by the database alone. At the moment I see this only for processes (with searchable metadata) and not for any other objects, but it would need to be discussed for every class.
A complete index structure should proceed according to the following scheme:

Production switches to a database-only mode (if it has not already done so because no index was found). Search functions on the index are not available in this mode; Production operates on the database alone. Some more complicated searches are not possible, but you can basically work with the application.
All existing database entries are set to null.
Each record to be indexed (marked null) is written to a file in a special directory, in expanded form, i.e. there are no references between these files. The record is then set to DONE. If this process is interrupted, it can restart on the remaining null records.
When objects are changed during operation, all changed database entries are set to INDEX. They are not indexed immediately.
The index is fed from the created files by a CLI call. Whether Production starts this external process can be set in the config file. It can be deactivated if the feeder is to be started manually on a separate server. The feeder deletes each file it has sent to the index. It can be restarted if it is canceled.
After the initial index build is finished, the database-only mode can be switched off. A thread then checks the database asynchronously, indexes all database entries that are set to INDEX, and then sets them to DONE. In this way the index is always updated asynchronously after changes.

This means that the search is never guaranteed to be up to date with the database, and program operation should be designed for this. In other words, the search engine is used for searching, not for normal operation.
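
The scheme above can be sketched roughly as follows. This is only an illustration under assumptions: the names `IndexState`, `Record`, `reset_all`, `export_pending`, `feed_index`, `mark_changed` and `index_changed` are hypothetical, a plain list stands in for the database, a dict stands in for the search index, JSON is just one possible file format, and the asynchronous thread is replaced by an ordinary function call.

```python
import json
import os
from enum import Enum


class IndexState(Enum):
    """Indexing marker kept with each database record (hypothetical)."""
    NULL = "null"    # waiting to be exported for the initial index build
    INDEX = "INDEX"  # changed during operation, waiting for the indexer
    DONE = "DONE"    # exported or indexed


class Record:
    """Stand-in for a database entry."""
    def __init__(self, record_id, data):
        self.record_id = record_id
        self.data = dict(data)
        self.state = IndexState.DONE


def reset_all(records):
    """Start of a complete rebuild: mark every record as null."""
    for record in records:
        record.state = IndexState.NULL


def export_pending(records, directory):
    """Write each null record to its own file, then mark it DONE.

    Restartable: records already marked DONE are skipped.
    """
    written = []
    for record in records:
        if record.state is not IndexState.NULL:
            continue
        path = os.path.join(directory, "{}.json".format(record.record_id))
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"id": record.record_id, **record.data}, handle)
        record.state = IndexState.DONE
        written.append(path)
    return written


def feed_index(directory, index):
    """Feeder: send each file to the index, then delete the file.

    Restartable: files that were already sent no longer exist.
    """
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as handle:
            document = json.load(handle)
        index[document["id"]] = document
        os.remove(path)


def mark_changed(record, **changes):
    """A change during operation only marks the record; it is not indexed yet."""
    record.data.update(changes)
    record.state = IndexState.INDEX


def index_changed(records, index):
    """Work of the background thread: index INDEX records, set them to DONE."""
    for record in records:
        if record.state is IndexState.INDEX:
            index[record.record_id] = {"id": record.record_id, **record.data}
            record.state = IndexState.DONE
```

Between `mark_changed` and the next run of `index_changed`, the index still holds the old version of a record, which is exactly the window in which the search is not up to date with the database.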

henning-gerhardt · 2021-02-01T08:17:32Z

After Elastic changed the license for newer versions, which is ambiguous in some formulations, so that it can be interpreted against the operators of Production instances (even differently than it is currently intended), and since the ElasticSearch integration has to be rewritten anyway,

I agree with you in part, but not overall. There are already forks of ElasticSearch, and I hope the open source community will create a stable fork with long-term support. Our current implementation is not that bad, but it could be improved in many ways.

my suggestion is to replace them entirely

I do not agree. I would rather move to the ElasticSearch integration of Hibernate, as it already implements all the improvements you list further down. See my comment from 02/07/2020.

Presentation uses Solr, and since Production and Presentation run on the same server in a large number of installations anyway, I would be very inclined to use Solr for Production as well.

This may be true if you have only a small amount of data (< 100,000 processes in .Production) and not much full-text data in Solr. Moreover, .Production and .Presentation use totally different execution stacks (.Production a Java EE environment, .Presentation a full PHP and TYPO3 environment), which do not necessarily have to run on one system. You will split them up if you have a lot of data (> 300,000 .Production processes with a few hundred GB of full-text data for .Presentation).

The search engine integration should be reconsidered and then implemented in a meaningful way by a search engine specialist.

As long as

the new implementation does not reinvent the wheel instead of using already existing implementations
the new integration has good unit and integration tests with high code coverage (> 80%), as the current implementation does
it is performance tested against a lot of data (a few million entries at least) for (initial) indexing and for search usage
there is a migration path from the current ElasticSearch usage to the new integration

I would agree with the new integration.

matthias-ronge added the 3.x and dependencies labels Feb 7, 2020

matthias-ronge changed the title ~~Change the way the Elasticsearch index is addressed~~ Update ElasticSearch version Nov 25, 2020

matthias-ronge added the legal label Jan 21, 2021

Kathrin-Huber added the development fund 2021 label Feb 18, 2021

Kathrin-Huber mentioned this issue Feb 24, 2021

Update Elasticsearch #4208

Closed

2 tasks

Kathrin-Huber removed the development fund 2021 label Feb 24, 2021

Kathrin-Huber closed this as completed Sep 14, 2021
