
Refactoring parse_message() #106

Merged: 21 commits, Aug 26, 2024
Conversation

@ashkankzme (Contributor) commented Aug 13, 2024

Description

The goal of this ticket is to refactor the generic presto parse_message() function into model-level input validations, with separate parse_input_message() and parse_output_message() functions instead of a single parse_message() handling both input and output.

Reference: CV2-5001
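
As a rough sketch of the intended shape (the function bodies below are placeholders for illustration, not the actual diff):

```python
from typing import Dict

def parse_input_message(message_data: Dict) -> "Message":
    # presto-level checks on the incoming request (body, model_name, ...),
    # then model-level input validation for the model named in the message
    ...

def parse_output_message(message_data: Dict) -> "Message":
    # map the result payload to the appropriate response class for the model
    ...
```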

How has this been tested?

Has it been tested locally? Are there automated tests?

Are there any external dependencies?

Are there changes required in sysops terraform for this feature or fix?

Have you considered secure coding practices when writing this code?

Please list any security concerns that may be relevant.

@skyemeedan (Contributor) left a comment:

Overall, I think the approach looks great and easy to implement for each model.

> i don't think we should have code that checks responses have specific fields in their bodies (callback_url, id, etc) bc i think that would be useful as a unit test, but we don't want to catch those errors in production.

Maybe a different design philosophy, but I would think we do want to validate precisely these fields, especially in production. In my experience, in an async context, we want to "fail fast" before the message is submitted, while we still have the request open and so still have the ability to tell the caller that something is wrong. We want to avoid errors when things are pulled off the queue, because the calling system won't know about them and they are hard to debug. (I think this is why messaging systems like Kafka are so strict about schemas.) And it seems like these are fields where the presto system itself won't work if they are missing.

For example:

  • if the id is missing, I think we are going to have an error downstream, likely in async code, but there will be no id, so it will be really hard to know where the error is coming from
  • if the callback_url is missing, we won't know until the model has processed and the message is pulled off the response queue and tries to do the callback. Presto will be logging errors, but the caller is just going to see it as silently failing to return data. (A minimal sketch of the fail-fast idea follows below.)
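
A hypothetical sketch of that fail-fast check, run while the request is still open (the function name and the required-field list are assumptions for illustration):

```python
from typing import Dict

REQUIRED_INPUT_FIELDS = ("id", "callback_url")

def validate_input_or_raise(message_data: Dict) -> None:
    body = message_data.get("body") or {}
    missing = [f for f in REQUIRED_INPUT_FIELDS if not body.get(f)]
    if missing:
        # Raising here lets the API layer return an error to the caller
        # immediately, instead of failing silently after the message has
        # been dequeued.
        raise ValueError(f"missing required input fields: {missing}")
```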


```diff
 class Message(BaseModel):
     body: GenericItem
     model_name: str
     retry_count: int = 0

-def parse_message(message_data: Dict) -> Message:
+def parse_input_message(message_data: Dict) -> Message:
     body_data = message_data['body']
```
Contributor:

I guess we are implicitly validating that there is a 'body' and 'model_name', as it will error here if missing ;-)

Contributor Author:

Correct, those are the system-level rules enforced at the presto level. I will add validation that those two fields exist before passing the message on to the model-level validator.
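
A minimal sketch of that presto-level check before the hand-off (names follow the code in this PR; the delegation line is illustrative):

```python
from typing import Dict

def parse_input_message(message_data: Dict) -> "Message":
    # system-level rules enforced at the presto level
    for field in ("body", "model_name"):
        if field not in message_data:
            raise ValueError(f"message is missing required field '{field}'")
    # ...then delegate to the model-level validator, e.g.
    # model_class.validate_input(message_data["body"])
    ...
```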

lib/schemas.py (outdated)

```python
elif event_type == 'schema_lookup' or event_type == 'schema_create':
    result_instance = ClassyCatSchemaResponse(**result_data)

modelClass = get_class('lib.model.', os.environ.get('MODEL_NAME'))
```
Contributor:

why does this need to be different from the model_name associated with the message? do the names not quite align?

Contributor Author:

That's a great question. I copied this from create() in lib.model.model.py, but I agree that it makes more sense to use the model_name mentioned in the input, and I will update the code.

This function also has some other not-so-great logic that needs to be refactored down the line, such as assuming that every model_name input that is not video or yake will return a MediaResponse. I have outlined my ideas for this work in a separate ticket: https://meedan.atlassian.net/browse/CV2-5093
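
Roughly, the agreed change would resolve the class from the message's own model_name (get_class is the existing helper this snippet already calls; the wrapper function and import path are illustrative):

```python
from typing import Dict

def resolve_model_class(message_data: Dict):
    # resolve from the message itself rather than the MODEL_NAME env var;
    # get_class is the existing presto helper (import path assumed)
    return get_class('lib.model.', message_data['model_name'])
```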


```python
if event_type == 'schema_lookup':
    return ClassyCatSchemaResponse(**result_data)
```
Contributor:

seems like this needs to check that either the schema_id or the schema_name is not empty? It looks like ClassyCatSchemaResponse considers schema_id as optional?

Contributor Author:

The class-specific validator checks that those fields exist; look at classycat_schema_lookup.py.

However, this function as implemented before only searches by name, not by both name and id. Maybe we can file a ticket for that if it's necessary?

Contributor:

I thought there was already a function for lookup by schema name and a separate one for id? (I just wasn't sure which this was)

@@ -0,0 +1,150 @@
# How to make a presto model
## Your go-to guide, one-stop shop for writing your model in presto
Contributor:

This is really helpful! :-)

Collaborator:

Yes this rules

Very needed. Thank you!



```python
class ClassyCatBatchClassificationResponse(ClassyCatResponse):
    classification_results: Optional[List[dict]] = []
```
Contributor:

seems like this should be non-optional? (must at least return an empty list?)

Contributor Author:

Well, I think I mostly agree with you. However, these types are only enforced upon creation of these objects, and the response object starts out as an empty object waiting to be populated by the process() method. If we lift the Optional, we would have to populate the fields on creation or give them defaults. I agree that down the line we want to do better typing, but for now this doesn't fit our message processing model without some major refactoring work.
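
To illustrate the constraint being described (a trimmed-down stand-in for the real classes, using pydantic as the PR does):

```python
from typing import List, Optional
from pydantic import BaseModel

class ClassyCatResponse(BaseModel):  # stand-in for the real base class
    pass

class ClassyCatBatchClassificationResponse(ClassyCatResponse):
    classification_results: Optional[List[dict]] = []

# The workflow in question: create an empty response object first, then
# let process() populate it. A required field with no default would raise
# a ValidationError on the first line below.
resp = ClassyCatBatchClassificationResponse()
resp.classification_results = [{"label": "politics", "score": 0.9}]
```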

Contributor:

seems like the current default of [] would work?

Contributor Author:

Yes, it will work for this specific case; I have updated the code, and these now live in classycat_response.py. But I think overall this is a subtle problem with our specific use of pydantic in presto.

lib/model/model.py (resolved)
@skyemeedan (Contributor):
Also, I think it is super helpful to have an early draft PR like this to be able to discuss!!

@skyemeedan (Contributor):

oh wait, I'm realizing I missed a crucial word:

> i don't think we should have code that checks responses have specific fields in their bodies (callback_url, id, etc)

*responses*

I agree we don't need to validate these fields in responses (if these are missing, they are about to fail anyway), but I think we need to validate these on inputs.

@DGaffney (Collaborator) left a comment:

Ok, you've done a great job here, but I have two things:

  1. I hate how much we're calling parse_input_message. Is there any way we can refactor so we don't call it in a billion different places?
  2. I know you're stubbing out the validations per fingerprinter, but I think we could probably just do basic type checking of the sort we already do in unit tests for each fingerprinter, and that would probably be sufficient as a first pass? We can talk more about that in the ML call next week if that's screwy or confusing.

```diff
@@ -27,3 +27,18 @@ def process(self, audio: schemas.Message) -> Dict[str, Union[str, List[int]]]:
         finally:
             os.remove(temp_file_name)
         return {"hash_value": hash_value}
+
+    @classmethod
+    def validate_input(cls, data: Dict) -> None:
```
Collaborator:

Should we just be passing?

Contributor Author:

I think passing is equivalent to what we have right now for most of these models (with the exception of classycat), not counting the validations that happen inside schema.py.

I do agree that we should start implementing them, and I have created a ticket for that work (CV2-5093), but the current design is backward compatible, so there is no need to implement these right now. Unless you think it's urgent that we address it?
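
The backward-compatible stub being described, roughly (the base class name is a stand-in):

```python
from typing import Dict

class Model:  # stand-in for the shared presto base model class
    @classmethod
    def validate_input(cls, data: Dict) -> None:
        # no model-specific rules yet (tracked in CV2-5093); the
        # presto-level checks in schemas.py still run before this point
        pass
```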



```python
@classmethod
def parse_input_message(cls, data: Dict) -> Any:
```
Collaborator:

Should we do nothing?

Collaborator:

Edit: ah, I see what the issue is after reading your Jira message. I think we should talk through how to not just stub all of these on our next ML call, but in the meantime, I think looking at what we typically unit-test for each model, and doing type checking based on the types of responses we're testing for, would be appropriate.
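
One reading of that suggestion, for an audio-style model (the field name and expected type here are assumptions for illustration, mirroring what a unit test might assert):

```python
from typing import Dict

class AudioModel:  # illustrative subclass
    @classmethod
    def validate_input(cls, data: Dict) -> None:
        # lightweight runtime version of the shape our unit tests assert
        url = data.get("url")
        if not isinstance(url, str) or not url:
            raise TypeError("expected a non-empty string 'url' in audio input")
```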


```diff
@@ -32,34 +31,42 @@ class GenericItem(BaseModel):
     text: Optional[str] = None
     raw: Optional[Dict] = {}
     parameters: Optional[Dict] = {}
-    result: Optional[Union[ErrorResponse, MediaResponse, VideoResponse, YakeKeywordsResponse, ClassyCatSchemaResponse, ClassyCatBatchClassificationResponse]] = None
+    result: Optional[Any] = None
```
Collaborator:

Is there any way to not just do an Any without it being a big headache?

Contributor Author:

I mean, I agree, Any is not the best. But do we want to keep updating the schema.py file for every new model? The dependency on model names is not great; we don't want lower-tier presto infra (e.g. schema.py) to have upward dependencies, imo. Is there a way to dynamically load these types without the massive headache of manually pasting them into a config file for every model? Hmm, not sure, maybe, but I'm not convinced it's worth our effort at this time. Happy to discuss more.
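
A hedged sketch of the "dynamically load these types" idea (the per-module ResponseClass attribute is hypothetical, not an existing presto convention):

```python
import importlib
from typing import Optional, Type

def resolve_result_class(model_name: str) -> Optional[Type]:
    # let each model module expose its own response class, so schemas.py
    # never has to enumerate them (avoiding the upward dependency)
    module = importlib.import_module(f"lib.model.{model_name}")
    return getattr(module, "ResponseClass", None)
```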

@ashkankzme (Contributor Author):
> oh wait, I'm realizing I missed a crucial word:
>
> > i don't think we should have code that checks responses have specific fields in their bodies (callback_url, id, etc)
>
> *responses*
>
> I agree we don't need to validate these fields in responses (if these are missing, they are about to fail anyway), but I think we need to validate these on inputs.

yes, exactly. the quoted checks on inputs are already being enforced by pydantic in schemas.py, as I mention in my other comments too.

@ashkankzme (Contributor Author):

> Ok, you've done a great job here, but I have two things:
>
>   1. I hate how much we're calling parse_input_message. Is there any way we can refactor so we don't call it in a billion different places?
>   2. I know you're stubbing out the validations per fingerprinter, but I think we could probably just do basic type checking of the sort we already do in unit tests for each fingerprinter, and that would probably be sufficient as a first pass? We can talk more about that in the ML call next week if that's screwy or confusing.

thanks Devin! I agree with the suggested solutions on both issues you raised. let's definitely discuss before making a decision, on the ML call or one-on-one, but my current hope is to keep the scope of this refactor limited and have separate tickets for both reducing parse_input_message() usage and implementing model-specific validations. (One way the first ticket could go is sketched below.)
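
A possible direction for reducing the call sites: parse exactly once at the queue boundary, so downstream code only ever sees Message objects (the consumer function here is hypothetical):

```python
from typing import Dict

from lib import schemas  # existing module; usage here is illustrative

def handle_queue_message(raw: Dict) -> None:
    # the single place where raw dicts become validated Message objects
    message = schemas.parse_input_message(raw)
    # ...everything downstream works with `message`, never with raw dicts
```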

@ashkankzme ashkankzme marked this pull request as ready for review August 19, 2024 16:45
```
# Conflicts:
#	lib/model/generic_transformer.py
#	test/lib/model/test_generic.py
```
@ashkankzme (Contributor Author) commented Aug 19, 2024:

> Ok, you've done a great job here, but I have two things:
>
>   1. I hate how much we're calling parse_input_message. Is there any way we can refactor so we don't call it in a billion different places?
>   2. I know you're stubbing out the validations per fingerprinter, but I think we could probably just do basic type checking of the sort we already do in unit tests for each fingerprinter, and that would probably be sufficient as a first pass? We can talk more about that in the ML call next week if that's screwy or confusing.

Per our convo today, I have created two tickets per your suggestion @DGaffney to address these issues in future work. Those tickets are CV2-5093 and CV2-5102.

@DGaffney DGaffney self-requested a review August 26, 2024 19:28
@DGaffney (Collaborator) left a comment:

:shipit:

@ashkankzme ashkankzme merged commit 0b8857c into master Aug 26, 2024
2 checks passed