Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cv2 3408 add fasttext #20

Merged
merged 9 commits into from
Aug 3, 2023
Merged

Cv2 3408 add fasttext #20

merged 9 commits into from
Aug 3, 2023

Conversation

amydunphy
Copy link
Contributor

initial addition of fasttext language identification to presto https://meedan.atlassian.net/browse/CV2-3408

Copy link
Contributor

@skyemeedan skyemeedan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool! some thoughts:

  • maybe have a class with fields instead of dict so that we can enforce structure of items?
  • I think it would be a good idea to carry an id along with the item we are predicting, so that if they get out of order (like coming off of queue, or dict keys, or if one fails) we will still be able to match up the correct items to their classifications
  • @devin does this map well to how the batches of items will come off the queue?

class FasttextModel(Model):
def __init__(self):
"""
Load fasttext model
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you include the link to the model docs here?

@DGaffney
Copy link
Collaborator

DGaffney commented Aug 2, 2023

cool! some thoughts:

  • maybe have a class with fields instead of dict so that we can enforce structure of items?
  • I think it would be a good idea to carry an id along with the item we are predicting, so that if they get out of order (like coming off of queue, or dict keys, or if one fails) we will still be able to match up the correct items to their classifications
  • @devin does this map well to how the batches of items will come off the queue?

Quick responses:

  • I like the idea of the structure enforcement - wouldn't we just do that with pydantic schemas if we wanted to go that route though? I feel like that's a more targeted approach for that goal (separately, I think that this is all better done in a separate ticket)
  • IDs are also a good idea - right now, that's more or less supported by the webhook approach. If you go through the code, there's a structure where we call back on URLs to let external services know that work is done. I think when presto is live, requestors will do that in effect by injecting their IDs into that callback, without presto needing to know anything about canonicality of records. That was my thought at least (again I think that's a conversation to have outside this fairly narrow ticket)
  • Yes, this maps well - we paired while @amydunphy was working on it, I think this is about as close a translation as we can get for the moment.

@DGaffney
Copy link
Collaborator

DGaffney commented Aug 2, 2023

So this is failing because we're on huggingface_hub 0.0.8, which is forced by our old sentence-transformers module. Looks like a previous auto PR tried to update sentence-transformers - I'm going to see if I can edit that to get sentence-transformers up to speed, then we can address re-updating huggingface_hub here.

@DGaffney
Copy link
Collaborator

DGaffney commented Aug 2, 2023

Woo! Ok, this works now with some updates I made to python package versions (well, at least, we're not failing on builds anymore...)

@amydunphy the code is now failing on a test - looks like we accidentally named some variables wrongly - can you take a look and have a first pass at fixing these when you've got time?

======================================================================
ERROR: test_respond (test.lib.model.test_fasttext.TestFasttextModel)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/app/test/lib/model/test_fasttext.py", line 17, in test_respond
    response = self.model.respond(query)
  File "/app/lib/model/fasttext.py", line 30, in respond
    doc["response"] = detected_lang
NameError: name 'detected_lang' is not defined

@DGaffney
Copy link
Collaborator

DGaffney commented Aug 2, 2023

This is close enough that we can test it in QA! Thank you @amydunphy for this great work.

@amydunphy amydunphy merged commit d7053f7 into master Aug 3, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants