Problem of disambiguation of ENs according to the case or spelling of terms #132

Open
aa303554 opened this issue Jul 23, 2021 · 1 comment

Comments

@aa303554

In French, entity-fishing has difficulty recognising Ireland depending on case and spelling. "Irlande" is the correct French spelling, "Ireland" is the English spelling, and the other forms are misspellings of "Irlande". The behaviour is not consistent with respect to case: some spellings are recognised differently depending on whether or not they start with a capital letter.
With other countries there is no such problem (I tested with Japan).

Correctly recognized: irlande, irland (incorrect), irelande (incorrect), Irelande (incorrect)
Recognized only as NER type LOCATION: Irlande, Ireland (incorrect)
Not recognized: Irland, ireland

See below for example.

image
image

@kermitt2
Owner

kermitt2 commented Aug 2, 2021

Thanks for the issue @aa303554 !

Currently, each spelling variant relies on its own statistics as an anchor in the French Wikidata, which explains these different behaviours. If we don't have enough context for a spelling variant, the term will never be linked to the Wikidata entity, at least in the current state of entity-fishing.

To improve this, my idea is to use better smoothing and priors for variants and for Wikidata labels (which are currently not used); see #72.
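
To illustrate the kind of smoothing meant here, below is a minimal sketch, not the entity-fishing implementation. It assumes each surface form has a pair of counts (times used as a link anchor, times seen), pools the counts of all forms that differ only by case, and pulls each form's link probability toward that pooled prior (an m-estimate). The counts, the weight `m`, and the function names are all hypothetical, and this only covers case variants, not misspellings.

```python
def pooled_prior(variant, stats):
    """Link probability pooled over all forms that match `variant` ignoring case."""
    key = variant.casefold()
    linked = sum(l for form, (l, s) in stats.items() if form.casefold() == key)
    seen = sum(s for form, (l, s) in stats.items() if form.casefold() == key)
    return linked / seen if seen else 0.0


def smoothed_link_prob(variant, stats, m=5.0):
    """m-estimate: the variant's own counts, smoothed toward the pooled prior."""
    linked, seen = stats.get(variant, (0, 0))
    prior = pooled_prior(variant, stats)
    return (linked + m * prior) / (seen + m)


# Hypothetical counts: (times used as anchor, times seen in text).
stats = {"Irlande": (90, 100), "irlande": (1, 10)}

for form in ("Irlande", "irlande"):
    linked, seen = stats[form]
    print(form, "raw:", linked / seen, "smoothed:", smoothed_link_prob(form, stats))
```

With this scheme a rare variant such as "irlande" borrows evidence from the well-attested "Irlande" instead of depending only on its own sparse statistics.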

Please note: the example you are using is too short for the normal text disambiguation field (the normal unit for the text field is closer to a paragraph), so you need to use the short text input. It might not solve the spelling error/variant problem in general, but it will work better. I think, however, that it solves your first example:

Screenshot from 2021-08-02 23-20-55
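
For reference, a short-text query could be built roughly as in the sketch below. The `shortText` and `language` field names follow the entity-fishing REST query format as I understand it; the host, port, and `/service/disambiguate` path in the comment are assumptions to check against the documentation of your installation, and `build_short_text_query` is a hypothetical helper.

```python
import json


def build_short_text_query(text, lang="fr"):
    """Build a query that uses the short text field instead of the text field."""
    return {"shortText": text, "language": {"lang": lang}}


query = build_short_text_query("Irlande")
payload = json.dumps(query, ensure_ascii=False)
print(payload)

# The payload would then be sent as the "query" form field of a POST request,
# e.g. to http://localhost:8090/service/disambiguate (assumed host, port and path).
```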
