Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bookreader Navigation of search results within Creative Work Series objects has incorrect sequence #470

Open
patdunlavey opened this issue Oct 8, 2024 · 3 comments

Comments

@patdunlavey
Copy link
Collaborator

We have a object type that we call "CWSBook", which is basically a Creative Work Series (digital object collection) object with a series of "Page" objects as children that have the individual page image files. We are able to display this in IA Bookreader using the v2.1 IIIF Manifest. OCR is indexed for these pages with both sequence_id and parent_sequence_id - where the latter records the order of the Pages in the CWSBook.

The problem we are having is that when a search is performed in IAB, the search results are returned sorted by relevance, and not by the page sequence, and this results in the previous/next links in the search results area to jump the user seemingly wildly back and forth.

This is because the sequence of the pages being displayed in IAB corresponds to (comes from?) the parent_sequence_id value indexed in solr, not the sequence_id, whereas \Drupal\strawberryfield\Controller\StrawberryfieldFlavorDatasourceSearchController::flavorfromSolrIndex sorts only by sequence_id, when present.

I'm able to force the correct sequence in the results and correct IAB behavior in my testing with this simple change

      if (isset($allfields_translated_to_solr['parent_sequence_id'])) {
        $fields_to_retrieve['parent_sequence_id'] = $allfields_translated_to_solr['parent_sequence_id'];
        // Creative Work Series need to first be sorted by parent_sequence_id.
        $query->sort('parent_sequence_id', QueryInterface::SORT_ASC);
      }

However, I'm not sure if there are cases where this flavorfromSolrIndex method may be called where this sort is not the correct behavior. I also wonder if sorting by sequence_id - the current behavior - is always correct? The flavorfromSolrIndex method seems to be intended to be used pretty broadly. Are there cases where only sort by relevance is correct?

@patdunlavey
Copy link
Collaborator Author

From @DiegoPino on slack:
Diego Pino 4:39 PM
So there is one issue (why i avoided it...)

if (isset($allfields_translated_to_solr['parent_sequence_id']) &&
  isset($extradata_from_item['search_api_solr_document'][$allfields_translated_to_solr['parent_sequence_id']]) && !$result_is_top_media) {
  $sequence_number = (array) $extradata_from_item['search_api_solr_document'][$allfields_translated_to_solr['parent_sequence_id']];
  if (isset($sequence_number[0]) && !empty($sequence_number[0]) && ($sequence_number[0] != 0)) {
    // We do all this checks to avoid adding a strange offset e.g a collection instead of a CWS
    $page_number_by_id[$extradata_from_item['search_api_solr_document']['id']] = $sequence_number[0];
  }
}

@patdunlavey
Copy link
Collaborator Author

@DiegoPino - in https://github.com/esmero/strawberryfield/blob/1.5.0/src/Controller/StrawberryfieldFlavorDatasourceSearchController.php#L112-L122 we have...

    //search for IAB highlight
    // if format=json, page=all and q not null
    if (($input = $request->query->get('q')) && ($format == 'json') && ($page == 'all')) {

      $snippets = $this->flavorfromSolrIndex(
        $input,
        $node->id(),
        $processor,
        $fileuuid,
        $indexes
      );

Do I understand the "search for IAB highlight" comment to mean that the conditions there (non-empty 'q' param, result format = 'json', page = 'all') indicate that we are searching for and intend to return ocr highlights? If so (and assuming the if condition is valid for this purpose), then can we use this as the test to determine if the search results should be sorted by sequence ids - both sequence_id and parent_sequence_id?

@patdunlavey
Copy link
Collaborator Author

Hi @DiegoPino, I've ascertained that, for my project, the solution I proposed here is safe/valid: I have no cases where adding sorting by parent_sequence_id would throw off the desired sorting.

For my purposes, for now, I think I'll just do a composer patch so that I'm not blocked on this.

But I know we need to think more expansively and identify cases where this assumption is not valid. What are the cases where parent_sequence_id would have a value in a strawberryflavor record where it would invalidate the sorting? How can we programmatically identify those cases, either directly inside ::flavorFromSolrIndex, or inside ::search (passing a flag into ::flavorFromSolrIndex), so that it knows not to sort on the parent sequence id)?

Closely related question: if sorting on parent_sequence_id is sometimes not correct, is always sorting on sequence_id correct? Are there times when ::flavorfromSolrIndex is called when the search result relevance actually is the correct sort?

Finally, do we handle missing values sort correctly for this case (missing values should sort last in an ASC sort, I think - we would not want to sort by parent_sequence_id and have all the records with no parent_sequence_id come out first)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant