Update ElasticSearch version #3153

Closed
matthias-ronge opened this issue Feb 7, 2020 · 5 comments
Labels
dependencies, legal

Comments

@matthias-ronge
Collaborator

Version 6 of Elasticsearch uses a new index format: the so-called mapping types are being removed. This requires changing the way the Elasticsearch index is addressed so that we can use a more recent version of Elasticsearch.
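
To illustrate what "addressing" means here, the following is a minimal sketch of how REST document paths differ with and without mapping types. The index and type names (`kitodo`, `process`) and all helper names are assumptions for illustration and are not taken from the Production code base; the one-index-per-type migration is only one possible approach. Python is used purely for brevity.

```python
def typed_path(index: str, mapping_type: str, doc_id: int) -> str:
    """Document path with a mapping type (Elasticsearch 5 style)."""
    return f"/{index}/{mapping_type}/{doc_id}"


def typeless_path(index: str, doc_id: int) -> str:
    """Document path without mapping types, using the fixed `_doc` endpoint."""
    return f"/{index}/_doc/{doc_id}"


def index_per_type(base_index: str, mapping_type: str) -> str:
    """Hypothetical migration: give each former mapping type its own index."""
    return f"{base_index}_{mapping_type}"


old = typed_path("kitodo", "process", 42)
new = typeless_path(index_per_type("kitodo", "process"), 42)
```

Under these assumptions, a document formerly addressed by index, type and ID is addressed by index and ID only, so every place in the code that passes a type name has to change.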

matthias-ronge added the 3.x and dependencies labels on Feb 7, 2020
@henning-gerhardt
Collaborator

Maybe the ElasticSearch (ES) integration of Hibernate is a better alternative, as it resolves a few issues, such as

  • correct deletion of data in both Hibernate and ES
  • writing one kind of query instead of two different ones (HQL for Hibernate and the ES dialect for ES)
  • not having to adapt to the changes in new ES versions ourselves (less maintenance for Kitodo.Production)
  • ...

I know that this is a big change, but I think it is a good approach in the long term.

@henning-gerhardt
Collaborator

According to https://www.elastic.co/support/matrix#matrix_jvm, we should move to a newer ElasticSearch version that is supported on many Java versions. At the time of writing (11/11/2020), ElasticSearch 6.8.x and 7.7.x support the most Java versions.

matthias-ronge changed the title from "Change the way the Elasticsearch index is addressed" to "Update ElasticSearch version" on Nov 25, 2020
@matthias-ronge
Collaborator Author

Support for the last 5.x version of ElasticSearch ended on March 11, 2019. The current version at the time of writing is 7.11.

@matthias-ronge
Collaborator Author

matthias-ronge commented Jan 21, 2021

After Elastic changed the license for newer versions, which is ambiguous in some formulations, so that it can be interpreted against the operators of Production instances (even differently than it is currently intended), and since the ElasticSearch integration has to be rewritten anyway, my suggestion is to replace it entirely. Presentation uses Solr, and since Production and Presentation run on the same server in a large number of installations anyway, I would be very inclined to use Solr for Production as well. The search engine integration should be reconsidered and then implemented in a meaningful way by a search engine specialist.

The following points should be thought through beforehand and incorporated into a concept:

  • ElasticSearch 5 caused confusion because it made it appear that different object classes can be indexed independently, although internally they are not. This confusion runs through our implementation as well. It must be taken into account here that all indexed object classes must be grouped together under an index object superclass. That class should be guided by what is to be searched for and found, and what is to be displayed in the search hit list.

  • There is some confusion in the code about the ID number of an object because the ID of the index object is different from that of the database object.

  • There should only be one indexed object for each item to be searched for in the index in production. What I mean is: there should be a reason why a class is indexed, i.e. why the function is not possible via the database alone. At the moment I see this only for processes (with searchable metadata) and not for any other objects, but that would need to be discussed for every class.

  • A complete index build should proceed according to the following scheme:

  1. Production switches to a database-only mode (if it has not already done so because no index was found). Search functions on the index are not available in this mode; the application operates purely on the database. Some more complicated searches are not possible, but you can basically work with the application.
  2. The index status of all existing database entries is set to null.
  3. Each record to be indexed (marked: null) is written to a file in a special directory, in expanded form, i.e. there are no references between these files. It is then set to DONE. If this process is interrupted, it can restart on the remaining null records.
  4. When objects are changed during operation, all changed database entries are set to INDEX. They are not indexed immediately.
  5. The index is fed from the created files by a CLI call. Whether Production starts this external process can be set in the config file; this can be deactivated if the feeder is to be started manually on a separate server. The feeder deletes each file once it has sent it to the index, so it can be restarted if it is canceled.
  6. After the first index build is finished, the database-only mode can be switched off. A thread then checks the database asynchronously, indexes all database entries that are set to INDEX, and then sets them to DONE. In this way the index is always updated after changes, but with a delay.

This means that the search is never guaranteed to be up to date with the database, and program operation should be designed for this. In other words, the search engine is used for searching, not for normal operation.
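
The scheme above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Record` stands in for a database row, a plain dictionary stands in for the search index, and all names (`export_pending`, `feed_index`, `index_changed`, the `kind-id.json` file naming) are hypothetical. Python is used for brevity; nothing here is taken from the Production code base.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

INDEX = "INDEX"  # changed during operation, waiting to be indexed
DONE = "DONE"    # already exported to a file or sent to the index


@dataclass
class Record:
    """Stand-in for a database row; index_state None means 'not yet exported'."""
    kind: str
    db_id: int
    data: dict
    index_state: Optional[str] = None


def reset_states(records):
    """Step 2: mark every record as not yet exported."""
    for record in records:
        record.index_state = None


def export_pending(records, spool_dir):
    """Step 3: write each unmarked record to its own self-contained file.

    Only records still marked None are handled, so an interrupted run can
    simply be started again.
    """
    for record in records:
        if record.index_state is None:
            target = spool_dir / f"{record.kind}-{record.db_id}.json"
            target.write_text(json.dumps(record.data))
            record.index_state = DONE


def feed_index(spool_dir, index):
    """Step 5: send each spooled file to the index, then delete the file."""
    for path in sorted(spool_dir.glob("*.json")):
        index[path.stem] = json.loads(path.read_text())
        path.unlink()


def index_changed(records, index):
    """Step 6: index all records marked INDEX, then set them to DONE."""
    for record in records:
        if record.index_state == INDEX:
            index[f"{record.kind}-{record.db_id}"] = dict(record.data)
            record.index_state = DONE


records = [
    Record("process", 1, {"title": "Atlas"}),
    Record("process", 2, {"title": "Codex"}),
]
index = {}

with tempfile.TemporaryDirectory() as tmp:
    spool = Path(tmp)
    reset_states(records)
    export_pending(records, spool)
    feed_index(spool, index)
    leftover_files = list(spool.glob("*.json"))

# Step 4: a change during operation only marks the record.
records[0].data["title"] = "Atlas, 2nd ed."
records[0].index_state = INDEX
stale_title = index["process-1"]["title"]  # index still holds the old title

index_changed(records, index)
```

Between step 4 and the next run of `index_changed`, the index still holds the old title, which is exactly the window in which search results may lag behind the database.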

@henning-gerhardt
Collaborator

> After Elastic changed the license for newer versions, which is ambiguous in some formulations, so that it can be interpreted against the operators of Production instances (even differently than it is currently intended), and since the ElasticSearch integration has to be rewritten anyway,

I agree with you in part, but not overall. There are already forks of ElasticSearch, and I hope the open source community will create a stable fork with long-term support. Our current implementation is not so bad, but it could be improved in many ways.

> my suggestion is to replace it entirely

I do not agree. I would rather move to the ElasticSearch integration of Hibernate, as it already implements all the improvements you list further down. See my comment from 02/07/2020.

> Presentation uses Solr, and since Production and Presentation run on the same server in a large number of installations anyway, I would be very inclined to use Solr for Production as well.

This may be true if you have only a small amount of data (< 100,000 processes in .Production) and not much full-text-enriched data in Solr. Moreover, .Production and .Presentation use totally different execution stacks (.Production a Java EE environment, .Presentation a full PHP and TYPO3 environment), which do not necessarily have to run on one system. You will split them up if you have a lot of data (> 300,000 .Production processes with a few hundred GB of full-text data for .Presentation).

> The search engine integration should be reconsidered and then implemented in a meaningful way by a search engine specialist.

As long as

  • the new implementation does not reinvent the wheel but uses already existing implementations
  • the new integration has good unit and integration tests with high code coverage (> 80%), like the current implementation
  • it is performance tested against a lot of data (a few million entries at least), both for (initial) indexing and for search usage
  • there is a migration path from the current ElasticSearch usage to the new integration

I would agree with the new integration.

Kathrin-Huber added the "development fund 2021" label (a candidate for the Kitodo e.V. development fund) on Feb 18, 2021
Kathrin-Huber removed the "development fund 2021" label on Feb 24, 2021

3 participants