Predictions are horribly wrong #145
Well, if you didn't feed it any Spanish text before, the network will return random results. For the network to build representations for words (in any language), they need to appear in the training set at least N times (N=5 by default). Otherwise Magpie has no idea what is being fed into it and might be triggered by random noise like "Washington" or "Trump" in your case. The rule is: you should test/predict on the same type of data as you train on.
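To make the min-count rule concrete, here is a minimal sketch of the kind of vocabulary filtering described above. The `build_vocab` helper is hypothetical, not Magpie's actual API; it only illustrates why words seen fewer than N times never get a representation:

```python
from collections import Counter

def build_vocab(corpus, min_count=5):
    """Keep only words that appear at least `min_count` times.

    Words below the threshold never get an embedding, so at
    prediction time they are effectively invisible to the model.
    This mirrors in spirit the N=5 default mentioned above.
    """
    counts = Counter(word for doc in corpus for word in doc.split())
    return {w for w, c in counts.items() if c >= min_count}

corpus = ["the cat sat"] * 5 + ["unseen spanish palabras"]
vocab = build_vocab(corpus)
# "the", "cat", "sat" each appear 5 times and survive;
# "palabras" appears once and is dropped.
```

Anything outside that vocabulary, such as an entire Spanish document, contributes no signal the model can interpret.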
The thing that worries me is the high confidence: 95% in some cases. If it does not recognize the words, shouldn't it at least be less certain about its predictions?
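One likely reason for the high confidence: a softmax classifier's output must sum to 1 over the known labels, so it has no way to say "none of the above". Even for input full of unknown words, whatever logits come out get normalized, and one label can dominate by chance. A minimal illustration (the logit values are made up):

```python
import math

def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits produced for an input the model has never seen can still
# be spread out arbitrarily; softmax just normalizes them.
logits = [4.0, 1.0, 0.5, 0.2]
probs = softmax(logits)
print(max(probs))  # ~0.91: high "confidence" with no real evidence
```

Detecting out-of-distribution input is a separate problem that a plain softmax classifier does not solve on its own.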
I have the same issue, and I get these poor results even when I use part of the training corpus as the test set.
I have trained Magpie on a news dataset with 9 labels.
I trained the model and tested the following text using magpie.predict_from_text():
Más de 690 mil casos de inmigrantes esperan ser resueltos por tribunales de Inmigración WASHINGTON— La Administración Trump ha convertido las protecciones de menores en sinónimo de “lagunas legales” que el Congreso debe eliminar pero mientras tanto, sobre el terreno, tampoco ha mejorado el atasco de más de 692,000 casos pendientes en los tribunales de Inmigración, según expertos.
While I don't have ANY Spanish documents in my training samples, Magpie returns a 90% chance that this text belongs to one of my labels! It even predicts similar scores for 3 other categories, all of them irrelevant. I tried to see whether any particular words were causing this, but could not find any.
What could be wrong here? I trained on 400-500 documents per category and set epochs to 30 as well as 50 (no change in results).
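A quick sanity check that follows from the discussion above: before trusting a prediction, measure what fraction of the input's tokens the model ever saw in training. The `oov_fraction` helper below is illustrative, not part of Magpie; `vocab` is assumed to be the set of words kept at training time:

```python
def oov_fraction(text, vocab):
    """Fraction of tokens in `text` the model never saw in training.

    A high value means the prediction is essentially noise,
    regardless of how confident the reported score looks.
    """
    tokens = text.lower().split()
    if not tokens:
        return 1.0
    unknown = sum(1 for t in tokens if t not in vocab)
    return unknown / len(tokens)

# English-only training vocabulary vs. a Spanish sample:
vocab = {"congress", "court", "immigration", "cases"}
sample = "casos pendientes en los tribunales de inmigración"
print(oov_fraction(sample, vocab))  # 1.0: every token is unknown
```

If this fraction is near 1.0 for your test text, any confidence score the model reports is meaningless for that input.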