
CV2-4953: Storing ClassyCat data in Alegre #105

Draft · wants to merge 5 commits into master
Conversation

@ashkankzme (Contributor) commented Aug 8, 2024

Description

Store successful classification results of ClassyCat in Alegre

Reference: CV2-4953

How has this been tested?

Has been tested locally and in unit tests.

Are there any external dependencies?

New SSM parameters need to be added to Presto for the Alegre endpoint URL

Have you considered secure coding practices when writing this code?

None

@ashkankzme marked this pull request as ready for review August 9, 2024 17:32
@ashkankzme changed the title from Cv2 4953 to CV2-4953: Storing ClassyCat data in Alegre on Aug 9, 2024
@skyemeedan (Contributor) left a comment:


Did this work when running locally with Alegre (i.e. a manual integration test)?

  • I think it needs the key to specify which model will be used for vectorization

Comment:
I'm still not entirely sure why we are storing this again in Alegre. I believe the text, the associated vector, the cluster, and the labels are already stored in Alegre or the calling systems by the time we call ClassyCat. Have we ruled out appending an additional 'context' to existing items to store the labels? Or having a separate "label propagation" function in ClassyCat that accepts an item and a list of items with associated labels (assumed to be in a cluster), and returns the propagated label, or a new one if none is available?
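An editorial sketch of the "label propagation" alternative described above; the function name, the item shape, and the majority-vote choice are assumptions for illustration only and are not part of this PR.

from collections import Counter
from typing import Optional

def propagate_label(item: dict, cluster_items: list[dict]) -> Optional[str]:
    # Hypothetical: the item parameter is kept to match the proposed signature.
    # Return the most common label among already-labeled cluster members, or None
    # if the cluster carries no labels yet (the caller would then classify the item).
    labels = [label
              for neighbor in cluster_items
              for label in neighbor.get('labels', [])]
    if not labels:
        return None  # nothing to propagate; fall back to a fresh classification
    return Counter(labels).most_common(1)[0][0]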

{'doc_id': str(uuid.uuid4()), # adding a unique id for each item to not rely on the input id for uniqueness
'content': items[i]['text'],
'context': {
'input_id': items[i]['id'],
Contributor:

I think it would be helpful to put a `'source': 'ClassyCat'` key in the context, so if we later want to query these objects there is a key to do it.
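A minimal sketch of that suggestion (the build_document helper is hypothetical, not in the PR), mirroring the document payload from the diff with the suggested 'source' key added:

import uuid

def build_document(item: dict, result: dict, schema_id: str, model_name: str) -> dict:
    # Same shape as the payload in the diff, plus the suggested 'source' key.
    return {
        'doc_id': str(uuid.uuid4()),  # unique per item, so we don't rely on the input id for uniqueness
        'content': item['text'],
        'context': {
            'source': 'ClassyCat',  # suggested key, so these documents can be queried later
            'input_id': item['id'],
            'labels': result['labels'],
            'schema_id': schema_id,
            'model_name': model_name,
        },
    }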


# call alegre endpoint to store the results: /text/bulk_similarity/
alegre_url = os.getenv('ALEGRE_URL')
httpx.post(alegre_url + '/text/bulk_similarity/', json=final_results_to_be_stored_in_alegre)
Contributor:

ooh, does this insert endpoint work? maybe I can switch timpani to use it as well when we convert to batch mode!

Reply:

We think so, but @DGaffney is currently moving all vectorization to Presto and will need to ensure this endpoint continues to work after that migration 😅

Contributor (Author):

Makes sense, Scott. I added a comment on Devin's ticket to make sure he sees it.

@skyemeedan as of this moment the endpoint works fine locally for me, feel free to test it out!

Collaborator:

I'm not sure if we want to support this after we move to presto vectorization. This is a blocking, bulk query which breaks expectations in Presto-based vectorization doubly (not-blocking, and single query per item). This is, to my knowledge, the only location in Check (minus the re-index job in Check-API) that uses bulk_similarity, and if we could get away from this pattern, that would be vastly preferable. Is there any way we can move this? Otherwise we'll be signing up for new complicated measures to support this long term.

Contributor:

  • if the bulk_similarity is not going to be supported, then I think we can't build the classycat feature on top of it and will need to find another approach for this PR?
  • My memory is that we decided that everything will be 'default bulk', so probably we need to adjust the design of new endpoints to support bulk?

Collaborator:

I don't think it is reasonable to take on making async endpoints default bulk until after we have made text work on Presto. We've shunted in secondary features to this refactor before and the net effect is that they have greatly complicated our existing migration plan.

Contributor:

If supporting bulk requires some refactoring of presto payload structures, wouldn't it be easier to do that before we are supporting the full text processing in live?

'input_id': items[i]['id'],
'labels': final_results[i]['labels'],
'schema_id': schema_id,
'model_name': self.llm_client.model_name}}
Contributor:

I think this needs to be the model name for the text vectorization? i.e. "paraphrase-multilingual-mpnet-base-v2"?
These are the parameters timpani used:
https://github.com/meedan/timpani/blob/6021d4fcb251d83ae48ed2c1566a16fad6971450/timpani/model_service/alegre_wrapper_service.py#L122

Reply:

Within context, model_name (or any parameter) can be anything. We will, however, need to specify the models parameter for Alegre itself to know how to do vectorization. That parameter is currently missing.
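An editorial sketch of what specifying the vectorization model might look like, operating on the final_results_to_be_stored_in_alegre payload built in the diff above. Whether the bulk endpoint expects a 'models' list per document or at the top level of the payload (and the exact key name) is an assumption to verify against Alegre; the model name is taken from the review comment above.

import os
import httpx

# Sketch only: attach a vectorization model key to each document before posting.
# The key name and its placement must be confirmed against Alegre's bulk endpoint.
final_results_to_be_stored_in_alegre['documents'] = [
    {**doc, 'models': ['paraphrase-multilingual-mpnet-base-v2']}
    for doc in final_results_to_be_stored_in_alegre['documents']
]

alegre_url = os.getenv('ALEGRE_URL')
httpx.post(alegre_url + '/text/bulk_similarity/', json=final_results_to_be_stored_in_alegre)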

Contributor (Author):

Good catch, Scott. I did not know that we should specify model_name; on local, the documents are being indexed only by Elasticsearch (full-text search). I will update the code and redo the tests to make sure vectorization works on local.

# content is text, doc_id is the item's unique id, and context is input id, labels, schema_id, and model name
final_results_to_be_stored_in_alegre = {'documents': [
{'doc_id': str(uuid.uuid4()), # adding a unique id for each item to not rely on the input id for uniqueness
'content': items[i]['text'],
@computermacgyver commented Aug 12, 2024:

For the /text/similarity endpoint this parameter is text and not content. Let's double-check that this is the correct name for the bulk endpoint.

Contributor (Author):

At least for /text/bulk_similarity/ we replace text with content:

# Rename "text" to "content" if present
if 'text' in params:
  params['content'] = params.get('text')
  del params["text"]

We can use both content and text for bulk but it ultimately gets renamed to content in Alegre.


@ashkankzme (Contributor, Author) commented:

> Did this work when running locally with Alegre (i.e. a manual integration test)?
>
>   • I think it needs the key to specify which model will be used for vectorization
>
> Comment: I'm still not entirely sure why we are storing this again in Alegre. I believe the text, the associated vector, the cluster, and the labels are already stored in Alegre or the calling systems by the time we call ClassyCat. Have we ruled out appending an additional 'context' to existing items to store the labels? Or having a separate "label propagation" function in ClassyCat that accepts an item and a list of items with associated labels (assumed to be in a cluster), and returns the propagated label, or a new one if none is available?

We plan to use Alegre for KNN similarity queries; since we already have Open/ElasticSearch set up for that service, it makes a lot of sense. I agree that there are optimizations to be made here, but I would rather not concern ClassyCat with those at the moment; we are still trying to achieve basic functionality in this ticket.

Also, to address the issue of the items already being stored in Alegre prior to calling ClassyCat: I think it's a lot easier to make sure that Alegre updates existing records instead of adding duplicate ones, rather than adding additional endpoints and complexity to ClassyCat. We can do that by allowing an optional "doc_id" for each record submitted to ClassyCat. If that field exists in the call, I will pass it on to Alegre (as opposed to generating a new unique id) and ensure that Alegre does "upserts" rather than "inserts" into the DB. As mentioned before, I think this is work for another time and should be a separate ticket.
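A hypothetical sketch of that optional doc_id passthrough (not implemented in this PR); the resolve_doc_id helper name is illustrative only.

import uuid

def resolve_doc_id(item: dict) -> str:
    # Reuse a caller-supplied doc_id when present so Alegre could upsert the existing
    # record; otherwise generate a fresh unique id, which is the current behavior in the diff.
    return item.get('doc_id') or str(uuid.uuid4())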

@DGaffney (Collaborator) left a comment:

Let's pretty pretty please not use bulk_similarity for this. I know we went through how to use it the other day, but I didn't fully appreciate that this was part of new functionality that would be coming through a PR.

@ashkankzme marked this pull request as draft August 14, 2024 21:55