Princeton Wordnet ID conversions #3
More relevant comments are available in issue #15 (now closed).

See also #16 (connotations), which pertains to the sentiment analysis work done by Sussi. These relations have also been left out for now.
Bart has created an SQL dump of the data that Sussi has produced. I might be able to create an in-memory SQLite db, import the data into it, and then extract the needed table(s) using JDBC. Some more information: https://grishaev.me/en/clj-sqlite/
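A minimal sketch of the in-memory idea, in Python rather than Clojure/JDBC, using a hypothetical miniature dump. Note that a real MySQL dump is not directly SQLite-compatible and would first need converting; the table and values below are made up for illustration:

```python
import sqlite3


def load_dump(dump_sql: str) -> sqlite3.Connection:
    """Create an in-memory SQLite db and populate it from an SQL dump string.

    Assumes the dump has already been converted to SQLite-compatible SQL.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(dump_sql)
    return conn


# Hypothetical miniature stand-in for the real wordnetloom-wordnet.sql dump:
dump = """
CREATE TABLE synset_attributes (synset_id TEXT, princeton_id TEXT, ili_id TEXT);
INSERT INTO synset_attributes VALUES ('888123000', 'ENG30-00001740-n', 'i35545');
"""

conn = load_dump(dump)
rows = conn.execute("SELECT princeton_id FROM synset_attributes").fetchall()
print(rows)  # [('ENG30-00001740-n',)]
```

Once the tables are in SQLite, the needed ones can be pulled out with ordinary SELECT queries over JDBC (or, as here, the DB-API).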
Currently attempting detective work on the SQL dump using a dockerized MySQL db:

```shell
# create and populate database
docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:latest
docker exec -i some-mysql sh -c 'exec mysql -uroot -p"my-secret-pw"' < /Users/rqf595/Downloads/wordnetloom-wordnet.sql

# connect to container and database inside it
docker exec -it some-mysql /bin/bash
mysql -p wordnet
```
After running some SQL queries in the database, only a few tables appear to be relevant. The actual synsets are linked using makeshift binary IDs generated by the software that Sussi used to create them. However, the ILI (Interlingual Index) ID isn't relevant unless we have links to DanNet from this ID.
Importing the Open English WordNet presents an interesting challenge, as the dataset resource has a relation to every entry it encompasses: http://localhost:3456/dannet/external?subject=%3Chttps%3A%2F%2Fen-word.net%2F%3E

Another challenge is that the dataset is quite minimal and doesn't have labels for any resources. The only label-like relation is the canonicalForm.
- add relevant schemas
- move hash functions to separate ns
- also include hash of prefix data
Apparently, the original links to the Princeton WordNet are not included in the WordNetLoom data, so they will need to be imported via the old DanNet data and converted to Open English WordNet IDs.
Having looked more thoroughly into the old link data, I found that it contains two different types of IDs: one unfamiliar format vs. the more familiar one.
John McCrae was very helpful and wrote me the following guide:
I think the last link (the directory of YAML files) is what I need. I'll parse them all and build a mapping from these sense IDs to the OEWN synsets.
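A sketch of that mapping step. The exact layout of the OEWN YAML files is an assumption here (roughly lemma → POS → list of senses), and the example entry below is made up; the structure stands in for what a YAML parser would return:

```python
def build_sense_index(entries: dict) -> dict:
    """Map sense IDs to OEWN synset IDs.

    `entries` mimics a parsed OEWN YAML entry file with the assumed shape:
    lemma -> POS -> {"sense": [{"id": ..., "synset": ...}, ...]}
    """
    index = {}
    for lemma, poses in entries.items():
        for pos, body in poses.items():
            for sense in body.get("sense", []):
                index[sense["id"]] = sense["synset"]
    return index


# Hypothetical miniature example entry:
entries = {
    "dog": {"n": {"sense": [{"id": "dog%1:05:00::", "synset": "02086723-n"}]}}
}
index = build_sense_index(entries)
print(index)  # {'dog%1:05:00::': '02086723-n'}
```

Running this over every parsed YAML file and merging the resulting dicts would yield the full sense-ID-to-synset mapping.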
I have mapped the eq_synonym relations with senseidx in the 5000 old links, but not the remaining 123, as the GWA schema had no equivalent relations. I still need to map the wn20 IDs. Eventually, I should also link directly to the ILI instead, using the existing links in the OEWN.

As an aside, I think a companion dataset containing labels for the OEWN would be very valuable, since the OEWN dataset currently doesn't contain any labels. Such a dataset can be generated from the lexical forms present in the dataset (= lemmas).
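A sketch of how such a label dataset could be generated from member lemmas. The synset ID and lemmas below are made up, and the brace-and-semicolon label style is just one possible convention:

```python
def synset_labels(members: dict) -> dict:
    """Generate one human-readable label per synset from its member lemmas.

    `members` maps synset IDs to lists of lemma strings (hypothetical input);
    the output label joins them as '{lemma1; lemma2; ...}'.
    """
    return {sid: "{" + "; ".join(lemmas) + "}" for sid, lemmas in members.items()}


# Hypothetical example input:
members = {"oewn-02086723-n": ["dog", "domestic dog", "Canis familiaris"]}
labels = synset_labels(members)
print(labels)  # {'oewn-02086723-n': '{dog; domestic dog; Canis familiaris}'}
```

Each generated label could then be attached to its synset as an rdfs:label triple in the companion dataset.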
In the process of linking DanNet to the Open English WordNet, I discovered a couple of errors in the OEWN dataset, one critical (ILI linking) and one less so:
Since the CILI resources do not have outgoing relations matching the incoming relations, e.g.
Getting ready to release most of the old English links as a sort of preview: 039ecc0
I added the eq_hyperonym and eq_hyponym rels of the links in e88598d and defined a new, complementary inverse property. This illustrates some issues in the existing data, where we have both synonym and hyponym relations to the same English synset, e.g. http://localhost:3456/dannet/data/synset-298
**Update on new links**

I can log in directly to the WordNetLoom database at wordnetweb01fl.unicph.domain (wordties) and access it as root. I will have to investigate this database more to see if any data can be recovered.
Some semi-good news finally: the synset IDs can seemingly be recovered, even though it will still require some work. They are saved in a weird format.

**UPDATE:** No, they actually all start with `888` (and end with `000`). This is how to retrieve the relevant IDs:

```sql
SELECT synset_id FROM synset_attributes WHERE synset_id LIKE '888%000';
```
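Assuming the `LIKE '888%000'` pattern captures the format exactly, recovering the original ID could be as simple as stripping those surrounding digits. A hypothetical sketch (`recover_dannet_id` and the example value are not from the source):

```python
def recover_dannet_id(raw: str) -> str:
    """Recover an original synset ID from a WordNetLoom synset_attributes ID.

    Assumes every stored ID has the '888' prefix and '000' suffix observed
    in the database (cf. the LIKE '888%000' query).
    """
    assert raw.startswith("888") and raw.endswith("000"), raw
    return raw[3:-3]


# Hypothetical example value:
recovered = recover_dannet_id("8884711000")
print(recovered)  # 4711
```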
Adapted from https://stackoverflow.com/questions/356578/how-can-i-output-mysql-query-results-in-csv-format:

**synset_relation**: these are the actual relations, it seems. Many of them seem to go from the Princeton WordNet to DanNet, which is not ideal.

```sql
SELECT id, child_synset_id, parent_synset_id, synset_relation_type_id
INTO OUTFILE '/var/lib/mysql-files/synset_relation.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM synset_relation;
```

**synset_attributes**: this is the table with links to Princeton and, in the case of the Princeton words, the CILI. In DanNet's case, the synset IDs have to be reconstructed by removing the surrounding digits.

```sql
SELECT synset_id, princeton_id, ili_id
INTO OUTFILE '/var/lib/mysql-files/synset_attributes.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM synset_attributes;
```

**application_localised_string**: this seems to contain the relation names.

```sql
SELECT id, value
INTO OUTFILE '/var/lib/mysql-files/application_localised_string.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM application_localised_string;
```
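The exported CSV files (comma-separated, double-quoted, as specified in the queries above) could then be read back downstream. A small hypothetical helper; the sample row is made up:

```python
import csv
import io


def read_outfile(text: str, fields: list) -> list:
    """Parse a MySQL INTO OUTFILE export into a list of row dicts.

    Assumes the FIELDS TERMINATED BY ',' ENCLOSED BY '"' settings
    used in the export queries.
    """
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    return [dict(zip(fields, row)) for row in reader]


# Hypothetical sample line from synset_attributes.csv:
attrs = read_outfile(
    '"888123000","ENG30-00001740-n","i35545"\n',
    ["synset_id", "princeton_id", "ili_id"],
)
print(attrs[0]["princeton_id"])  # ENG30-00001740-n
```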
The new English link triples have been successfully imported as of eb47f48, so this is finally done!

It seems like trying to infer every triple now results in
In the old RDF export, several thousand references to both DanNet and Princeton Wordnet IDs use a format that is not recognised by Apache Jena. In more concrete terms, attempting to import DanNet as-is results in many thousands of lines of this style of warning:
The issue has to do with suffixes such as `%1:06:02::` or `%1:15:00::` not being valid according to the XML processor used by Jena.

According to Nicolai Hartvig Sørensen (the old maintainer of DanNet), the IDs have apparently undergone several changes since then and will have to be changed anyway. In addition, they might have been mangled in the current export; `%5:00:00:rich:03` is also an example suffix.

Nicolai says the current IDs are based on this paper: https://www.aclweb.org/anthology/W11-0129.pdf
We need to come up with some type of scheme to reliably convert these old IDs to the newer IDs used by Princeton - or the Open English Wordnet in case we use that.
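As a first step toward such a scheme, the offending suffixes can at least be decomposed: after the `%`, they follow the Princeton WordNet sense-key layout `ss_type:lex_filenum:lex_id:head_word:head_id`. A sketch (`parse_sense_key_suffix` is a hypothetical helper, not part of the codebase):

```python
def parse_sense_key_suffix(suffix: str) -> dict:
    """Split a Princeton sense-key suffix like '%1:06:02::' into its parts.

    The five colon-separated fields after '%' follow the WordNet sense-key
    format: ss_type, lex_filenum, lex_id, head_word, head_id (the last two
    are empty except for adjective satellites, ss_type 5).
    """
    assert suffix.startswith("%"), suffix
    parts = suffix[1:].split(":")
    names = ["ss_type", "lex_filenum", "lex_id", "head_word", "head_id"]
    return dict(zip(names, parts))


print(parse_sense_key_suffix("%5:00:00:rich:03"))
# {'ss_type': '5', 'lex_filenum': '00', 'lex_id': '00',
#  'head_word': 'rich', 'head_id': '03'}
```

With the parts isolated, a conversion scheme could then map them onto whatever ID format the Open English WordNet uses, or at least percent-escape them into IRI-safe strings that Jena will accept.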
Some further complications: