
CV2-4953: Storing ClassyCat data in Alegre #105

Draft · wants to merge 5 commits into master
Conversation

@ashkankzme (Contributor) commented Aug 8, 2024

Description

Store successful classification results of ClassyCat in Alegre

Reference: CV2-4953

How has this been tested?

Has been tested locally and in unit tests.

Are there any external dependencies?

New SSM parameters need to be added to Presto for the Alegre endpoint URL

Have you considered secure coding practices when writing this code?

None

@ashkankzme marked this pull request as ready for review August 9, 2024 17:32
@ashkankzme changed the title from Cv2 4953 to CV2-4953: Storing ClassyCat data in Alegre on Aug 9, 2024
@skyemeedan (Contributor) left a comment:


Did this work when running locally with Alegre (i.e. a manual integration test)?

  • I think it needs the key to specify which model will be used for vectorization

Comment:
I'm still not entirely sure why we are storing this again in Alegre. I believe the text, the associated vector, the cluster, and the labels are already stored in Alegre or the calling systems by the time we call ClassyCat. Have we ruled out appending an additional 'context' to existing items to store the labels? Or having a separate "label propagation" function in ClassyCat that accepts an item and a list of items with associated labels (assumed to be in a cluster), and returns the propagated label, or a new one if none is available?
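An editorial sketch of the "label propagation" alternative described above; the function name, the item shape, and the majority-vote choice are assumptions for illustration only and are not part of this PR.

from collections import Counter
from typing import Optional

def propagate_label(item: dict, cluster_items: list[dict]) -> Optional[str]:
    # Hypothetical: the item parameter is kept to match the proposed signature.
    # Return the most common label among already-labeled cluster members, or None
    # if the cluster carries no labels yet (the caller would then classify the item).
    labels = [label
              for neighbor in cluster_items
              for label in neighbor.get('labels', [])]
    if not labels:
        return None  # nothing to propagate; fall back to a fresh classification
    return Counter(labels).most_common(1)[0][0]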

{'doc_id': str(uuid.uuid4()), # adding a unique id for each item to not rely on the input id for uniqueness
'content': items[i]['text'],
'context': {
'input_id': items[i]['id'],
Contributor:

I think it would be helpful to put a `'source': 'ClassyCat'` key in the context, so if we later want to query these objects there is a key to do it.
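A minimal sketch of that suggestion (the build_document helper is hypothetical, not in the PR), mirroring the document payload from the diff with the suggested 'source' key added:

import uuid

def build_document(item: dict, result: dict, schema_id: str, model_name: str) -> dict:
    # Same shape as the payload in the diff, plus the suggested 'source' key.
    return {
        'doc_id': str(uuid.uuid4()),  # unique per item, so we don't rely on the input id for uniqueness
        'content': item['text'],
        'context': {
            'source': 'ClassyCat',  # suggested key, so these documents can be queried later
            'input_id': item['id'],
            'labels': result['labels'],
            'schema_id': schema_id,
            'model_name': model_name,
        },
    }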


# call alegre endpoint to store the results: /text/bulk_similarity/
alegre_url = os.getenv('ALEGRE_URL')
httpx.post(alegre_url + '/text/bulk_similarity/', json=final_results_to_be_stored_in_alegre)
Contributor:

ooh, does this insert endpoint work? maybe I can switch timpani to use it as well when we convert to batch mode!

Reply:

We think so, but @DGaffney is currently moving all vectorization to Presto and will need to ensure this endpoint continues to work after that migration 😅

Contributor (Author):

Makes sense, Scott. I added a comment on Devin's ticket to make sure he sees it.

@skyemeedan as of this moment the endpoint works fine locally for me, feel free to test it out!

Collaborator:

I'm not sure if we want to support this after we move to presto vectorization. This is a blocking, bulk query which breaks expectations in Presto-based vectorization doubly (not-blocking, and single query per item). This is, to my knowledge, the only location in Check (minus the re-index job in Check-API) that uses bulk_similarity, and if we could get away from this pattern, that would be vastly preferable. Is there any way we can move this? Otherwise we'll be signing up for new complicated measures to support this long term.

Contributor:

  • if the bulk_similarity is not going to be supported, then I think we can't build the classycat feature on top of it and will need to find another approach for this PR?
  • My memory is that we decided that everything will be 'default bulk', so probably we need to adjust the design of new endpoints to support bulk?

Collaborator:

I don't think it is reasonable to take on making async endpoints default bulk until after we have made text work on Presto. We've shunted in secondary features to this refactor before and the net effect is that they have greatly complicated our existing migration plan.

Contributor:

If supporting bulk requires some refactoring of presto payload structures, wouldn't it be easier to do that before we are supporting the full text processing in live?

'input_id': items[i]['id'],
'labels': final_results[i]['labels'],
'schema_id': schema_id,
'model_name': self.llm_client.model_name}}
Contributor:

I think this needs to be the model name for the text vectorization? i.e. "paraphrase-multilingual-mpnet-base-v2"?
These are the parameters timpani used:
https://github.com/meedan/timpani/blob/6021d4fcb251d83ae48ed2c1566a16fad6971450/timpani/model_service/alegre_wrapper_service.py#L122

Reply:

Within context, model_name (or any parameter) can be anything. We will, however, need to specify the models parameter for Alegre itself to know how to do vectorization. That parameter is currently missing.
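An editorial sketch of what specifying the vectorization model might look like, operating on the final_results_to_be_stored_in_alegre payload built in the diff above. Whether the bulk endpoint expects a 'models' list per document or at the top level of the payload (and the exact key name) is an assumption to verify against Alegre; the model name is taken from the review comment above.

import os
import httpx

# Sketch only: attach a vectorization model key to each document before posting.
# The key name and its placement must be confirmed against Alegre's bulk endpoint.
final_results_to_be_stored_in_alegre['documents'] = [
    {**doc, 'models': ['paraphrase-multilingual-mpnet-base-v2']}
    for doc in final_results_to_be_stored_in_alegre['documents']
]

alegre_url = os.getenv('ALEGRE_URL')
httpx.post(alegre_url + '/text/bulk_similarity/', json=final_results_to_be_stored_in_alegre)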

Contributor (Author):

Good catch, Scott. I did not know that we should specify model_name; on local, the documents are being indexed only by Elasticsearch (full-text search). I will update the code and redo the tests to make sure vectorization works on local.

# content is text, doc_id is the item's unique id, and context is input id, labels, schema_id, and model name
final_results_to_be_stored_in_alegre = {'documents': [
{'doc_id': str(uuid.uuid4()), # adding a unique id for each item to not rely on the input id for uniqueness
'content': items[i]['text'],
@computermacgyver commented Aug 12, 2024:

For the /text/similarity endpoint this parameter is text and not content. Let's double-check that this is the correct name for the bulk endpoint.

Contributor (Author):

At least for /text/bulk_similarity/ we replace text with content:

# Rename "text" to "content" if present
if 'text' in params:
  params['content'] = params.get('text')
  del params["text"]

We can use both content and text for bulk but it ultimately gets renamed to content in Alegre.


@ashkankzme (Contributor, Author) commented:

> Did this work when running locally with Alegre (i.e. a manual integration test)?
>
>   • I think it needs the key to specify which model will be used for vectorization
>
> Comment: I'm still not entirely sure why we are storing this again in Alegre. I believe the text, the associated vector, the cluster, and the labels are already stored in Alegre or the calling systems by the time we call ClassyCat. Have we ruled out appending an additional 'context' to existing items to store the labels? Or having a separate "label propagation" function in ClassyCat that accepts an item and a list of items with associated labels (assumed to be in a cluster), and returns the propagated label, or a new one if none is available?

We plan to use Alegre for KNN similarity queries; since we already have Open/ElasticSearch set up for that service, it makes a lot of sense. I agree that there are optimizations to be made here, but I would rather not concern ClassyCat with those at the moment; we are still trying to achieve basic functionality in this ticket.

Also, to address the issue of the items already being stored in Alegre prior to calling ClassyCat: I think it's a lot easier to make sure that Alegre updates existing records instead of adding duplicate ones, rather than adding additional endpoints and complexity to ClassyCat. We can do that by allowing an optional "doc_id" for each record submitted to ClassyCat. If that field exists in the call, I will pass it on to Alegre (as opposed to generating a new unique id) and ensure that Alegre does "upserts" rather than "inserts" into the DB. As mentioned before, I think this is work for another time and should be a separate ticket.
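A hypothetical sketch of that optional doc_id passthrough (not implemented in this PR); the resolve_doc_id helper name is illustrative only.

import uuid

def resolve_doc_id(item: dict) -> str:
    # Reuse a caller-supplied doc_id when present so Alegre could upsert the existing
    # record; otherwise generate a fresh unique id, which is the current behavior in the diff.
    return item.get('doc_id') or str(uuid.uuid4())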

@DGaffney (Collaborator) left a comment:

Let's pretty pretty please not use bulk_similarity for this. I know we went through how to use it the other day, but I didn't fully appreciate that this was part of new functionality that would be coming through a PR.

@ashkankzme marked this pull request as draft August 14, 2024 21:55