Princeton Wordnet ID conversions #3

simongray · 2021-06-10T09:30:39Z

In the old RDF export, several thousand references to both DanNet and Princeton Wordnet IDs use a format that is not recognised by Apache Jena. In more concrete terms, attempting to import DanNet as-is results in many thousands of lines of this style of warning:

...
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9618, col: 114] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-beard%1:08:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9619, col: 82] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-salon%1:06:02::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9620, col: 113] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-salon%1:06:02::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9621, col: 94] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-supermarket%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9622, col: 120] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-supermarket%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9623, col: 80] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-hall%1:06:04::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9624, col: 113] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-hall%1:06:04::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9625, col: 80] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-home%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9626, col: 112] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-home%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9627, col: 90] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-residence%1:15:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9628, col: 117] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-residence%1:15:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9629, col: 80] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-flat%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9630, col: 113] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-flat%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9631, col: 80] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-tent%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9632, col: 113] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-tent%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9633, col: 84] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-stable%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
...

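The issue is that suffixes such as %1:06:02:: or %1:15:00:: are not valid inside a URI according to the parser used by Jena: as the warning states, a percent sign must be followed by two hexadecimal digits, whereas here it is followed by 1: instead.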
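The check described in the warning can be reproduced in a few lines. This is a minimal sketch in Python, not the project's own code; `has_illegal_percent` and `escape_local_name` are hypothetical helper names. Percent-encoding the local name is shown only as one generic way of making such a string URI-safe, not as the approach this issue settles on.

```python
import re
from urllib.parse import quote

# In a URI, "%" must be followed by exactly two hexadecimal digits.
BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_illegal_percent(uri: str) -> bool:
    """True if the URI contains a '%' that does not start a valid escape."""
    return BAD_PERCENT.search(uri) is not None


def escape_local_name(local_name: str) -> str:
    """Percent-encode everything except unreserved characters."""
    # safe="" means that "%" and ":" are escaped too.
    return quote(local_name, safe="")


base = "http://www.wordnet.dk/owl/instance/2009/03/instances/"
old = base + "synset-beard%1:08:00::"
new = base + escape_local_name("synset-beard%1:08:00::")

print(has_illegal_percent(old))  # the original URI from the log
print(has_illegal_percent(new))  # the escaped variant
print(new)
```

Escaping keeps the old sense key recoverable, but the resulting URIs are hard to read, which is one more argument for converting to a different ID scheme.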

According to Nicolai Hartvig Sørensen (the old maintainer of DanNet) the IDs have apparently undergone several changes since then and will have to be changed anyway. In addition, they might have been mangled in the current export, e.g. %5:00:00:rich:03 is also an example suffix.

Nicolai says the current IDs are based on this paper: https://www.aclweb.org/anthology/W11-0129.pdf

We need to come up with some type of scheme to reliably convert these old IDs to the newer IDs used by Princeton - or the Open English Wordnet in case we use that.

Some further complications:

The COR project is also producing a new set of IDs and we are nominally required to accept them in some form.
There is also a requirement by the team at DSL (the old DanNet maintainers) to maintain the connection to the data at DSL, which is facilitated by the IDs present in the current version of DanNet.


simongray · 2021-11-06T10:06:08Z

More relevant comments available in issue #15 (now closed).

simongray · 2021-11-15T13:55:40Z

See also #16 (connotations) which pertains to sentiment analysis work done by Sussi. These relations have also been left out for now.

simongray · 2022-10-25T14:18:18Z

Bart has created an SQL dump for the data that Sussi has produced.

I might be able to create an in-memory SQLite db, import the data into that, and then extract the needed table(s) using JDBC. Some more information: https://grishaev.me/en/clj-sqlite/
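A rough sketch of the in-memory idea, using Python's built-in `sqlite3` module instead of JDBC. The table `tbl_example` and its rows are made up for illustration. A real MySQL dump usually contains MySQL-specific statements that SQLite may reject, so it would probably need preprocessing first.

```python
import sqlite3

# Hypothetical stand-in for the relevant part of an SQL dump.
DUMP = """
CREATE TABLE tbl_example (id INTEGER PRIMARY KEY, label TEXT);
INSERT INTO tbl_example (id, label) VALUES (1, 'production'), (2, 'bundle');
"""

# ":memory:" creates a database that only lives as long as the connection.
conn = sqlite3.connect(":memory:")
conn.executescript(DUMP)

rows = conn.execute("SELECT id, label FROM tbl_example ORDER BY id").fetchall()
conn.close()

print(rows)
```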

simongray · 2022-12-15T09:37:35Z

Currently attempting detective work on the SQL dump using a dockerized MySQL db:

# create and populate database
docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:latest
docker exec -i some-mysql sh -c 'exec mysql -uroot -p"my-secret-pw"' < /Users/rqf595/Downloads/wordnetloom-wordnet.sql

# connect to container and database inside it
docker exec -it some-mysql /bin/bash
mysql -p wordnet

simongray · 2022-12-15T10:08:25Z

After running some SQL queries in the mysql shell, it appears that the tables

tbl_synset
tbl_synset_attributes
tbl_synset_relation
... and possibly tbl_relation_type

are the only relevant tables.

The actual synsets are linked using makeshift binary IDs generated by the software that Sussi used to create them. The table tbl_synset_attributes includes two columns that bear witness to other ID types:

However, the ILI (Interlingual Index ID) isn't relevant unless we have links to DanNet from this ID.

simongray · 2022-12-19T08:43:19Z

Importing the Open English WordNet presents an interesting challenge as the dataset resource has a relation to every entry it encompasses: http://localhost:3456/dannet/external?subject=%3Chttps%3A%2F%2Fen-word.net%2F%3E

Another challenge is the fact that the dataset is quite minimal and doesn't have labels for any resources. The only label-like relation is for the canonicalForm.

- add relevant schemas - move hash functions to separate ns - also include hash of prefix data

simongray · 2023-05-17T09:32:27Z

Apparently, the original links to the Princeton WordNet are not included in the WordNetLoom data, so they will need to be imported via the old DanNet data and converted to Open English WordNet IDs.

https://github.com/globalwordnet/cili

simongray · 2023-05-25T08:56:30Z

Having looked more thoroughly into the two different types of IDs in the old link data, e.g.

production%1:23:00::
bundle%1:06:00::
equipment%1:06:00::

vs. the more familiar ENG20-07523126-n, which seems to be the ID type used in the cili repo, I have had some difficulty understanding how to get from these other IDs to ones that are mapped. The unfamiliar IDs are seemingly based on the complex, makeshift database of the WordNet project, and refer to lemmas present in multiple different files (mapped to integers) in the different WordNet releases. I haven't been able to find a translation table anywhere.

simongray · 2023-05-26T09:06:01Z

John McCrae was very helpful and wrote me the following guide:

Hi Simon,

These are sense keys, that are used to indicate the word in its synset (i.e., there is one sense key for each member of a synset). They are supposed to be more stable than the synset identifiers (but they aren't) and are preferred by the Princeton team. The full description of them is here:

https://wordnet.princeton.edu/documentation/senseidx5wn

They are quite tricky to calculate, OEWN has a whole script for doing it here:

https://github.com/globalwordnet/english-wordnet/blob/main/scripts/sense_keys.py

You can find all the sense keys and the relevant synsets in the src data for OEWN in the entries-*.yaml files, such as:

https://github.com/globalwordnet/english-wordnet/blob/main/src/yaml/entries-x.yaml

For Princeton WordNet releases, they are normally in a file called sense.index.

Regards,

John

I think the last link (the directory of YAML files) is what I need. I'll parse them all and build a mapping from these sense IDs to the OEWN synsets.
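As a sketch of that plan: per the Princeton senseidx documentation, a sense key has the form lemma%ss_type:lex_filenum:lex_id:head_word:head_id, so it can be split into fields directly. The mapping function below assumes the YAML files have already been loaded (e.g. with PyYAML's `yaml.safe_load`, not shown) into nested dicts of the shape lemma → part of speech → list of senses with `id` and `synset` keys. That shape is my assumption about the OEWN source files, and the synset IDs in the sample are placeholders, not real data.

```python
from typing import NamedTuple


class SenseKey(NamedTuple):
    lemma: str
    ss_type: int      # 1 noun, 2 verb, 3 adjective, 4 adverb, 5 adjective satellite
    lex_filenum: int
    lex_id: int
    head_word: str    # only filled in for adjective satellites
    head_id: str      # only filled in for adjective satellites


def parse_sense_key(key: str) -> SenseKey:
    """Split e.g. 'production%1:23:00::' into its documented fields."""
    lemma, _, lex_sense = key.rpartition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return SenseKey(lemma, int(ss_type), int(lex_filenum), int(lex_id),
                    head_word, head_id)


def sense_to_synset(entries: dict) -> dict:
    """Build {sense key: synset ID} from the (assumed) nested entry dicts."""
    mapping = {}
    for by_pos in entries.values():
        for entry in by_pos.values():
            for sense in entry.get("sense", []):
                mapping[sense["id"]] = sense["synset"]
    return mapping


# Hypothetical sample; the synset IDs are placeholders.
sample = {
    "bundle": {"n": {"sense": [{"id": "bundle%1:06:00::", "synset": "00000001-n"}]}},
    "equipment": {"n": {"sense": [{"id": "equipment%1:06:00::", "synset": "00000002-n"}]}},
}
mapping = sense_to_synset(sample)
parsed = parse_sense_key("production%1:23:00::")
print(parsed)
print(mapping["bundle%1:06:00::"])
```

Since the old links already carry full sense keys, a plain dictionary lookup is enough; the parser is only needed for inspecting or validating keys.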

simongray · 2023-05-26T15:26:27Z

I have mapped the eq_synonym relations with senseidx in the 5000 old links, but not the remaining 123 as the GWA schema had no equivalent relations. I still need to map the wn20 IDs. Eventually, I should also link directly to the ILI instead by using the existing links in the OEWN.

As an aside, I think a companion dataset containing labels for the OEWN would be very valuable since the OEWN dataset currently doesn't contain any labels. This dataset can be generated based on the lexical forms present in the dataset (= lemmas).

simongray · 2023-05-30T13:18:59Z

In the process of linking DanNet to the Open English WordNet, I discovered a couple of errors in the OEWN dataset, one critical (ILI linking) and one less so:

simongray · 2023-05-31T07:55:58Z

Since the CILI resources do not have outgoing relations for the incoming relations, e.g. wn:ili, it really makes a lot of sense to implement #53 to make navigating via the CILI feasible.

simongray · 2023-06-01T12:27:46Z

Getting ready to release most of the old English links as a sort of preview: 039ecc0

simongray · 2023-07-04T08:43:53Z

I added the eq_hyperonym and eq_hyponym rels of the links in e88598d + defined a new, complementary inverse property of wn:ili to allow easier navigation of the graph.
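The idea of a complementary inverse property can be illustrated on plain tuples. This is only a sketch: `ex:iliOf` is a placeholder name, not the property actually defined in the commit, and the subject and object identifiers are made up.

```python
def add_inverse(triples, prop, inverse_prop):
    """Return the triples plus an (o, inverse_prop, s) for every (s, prop, o)."""
    triples = set(triples)
    inverse = {(o, inverse_prop, s) for s, p, o in triples if p == prop}
    return triples | inverse


# Hypothetical data: two synsets pointing at the same ILI concept.
graph = {
    ("dn:synset-1", "wn:ili", "ili:i1"),
    ("en:synset-2", "wn:ili", "ili:i1"),
}
expanded = add_inverse(graph, "wn:ili", "ex:iliOf")

# The ILI concept now has outgoing relations to follow.
outgoing = sorted(o for s, p, o in expanded if s == "ili:i1")
print(outgoing)
```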

This illustrates some issues in the existing data where we have both synonym and hyponym relations to an English synset, e.g. http://localhost:3456/dannet/data/synset-298

simongray · 2023-07-04T13:01:16Z

Update on new links

I can log in directly to the Wordnetloom database at wordnetweb01fl.unicph.domain (wordties) and access it as root via the mysql wordnet command. From there, I can run similar queries to the ones I ran on the SQL dump and it seems like the synset IDs are at least different, though they seem to be regular incrementing integers, which isn't an improvement.

I will have to investigate this database more to see if any data can be recovered.

simongray · 2023-07-04T13:12:14Z

Some semi-good news finally: the synset IDs seem to be recoverable, even though it will still require some work.

They are saved in a weird format, e.g. 88830000 for synset 30. The 888 is a prefix, but the only way to know whether the trailing zeroes are part of the ID is to make a guess based on the position in the table (they seem to come sequentially, so a preceding 88839000 means the following 88840000 is synset 40).

UPDATE

No, they actually all start with 888 and end with 000, so this is quite easy to convert.

This is how to retrieve the relevant IDs:

SELECT synset_id FROM synset_attributes WHERE synset_id LIKE '888%000';
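The conversion itself can be sketched as follows, assuming every DanNet ID really is wrapped in exactly 888 and 000 as described in the update. `dannet_synset_id` is a hypothetical helper name.

```python
import re

# 888 + original synset ID + 000; exactly three trailing zeroes are stripped.
WRAPPED_ID = re.compile(r"^888(\d+)000$")


def dannet_synset_id(raw):
    """Return the original synset ID as an int, or None if the value doesn't fit."""
    match = WRAPPED_ID.match(str(raw))
    return int(match.group(1)) if match else None


print(dannet_synset_id(88830000))      # wrapped synset 30
print(dannet_synset_id("8883000000"))  # wrapped synset 3000
print(dannet_synset_id("12345"))       # not a wrapped DanNet ID
```

Because exactly three zeroes are removed from the end, IDs that themselves end in zero (40, 3000) are not ambiguous.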

simongray · 2023-07-05T11:21:13Z

Adapted from https://stackoverflow.com/questions/356578/how-can-i-output-mysql-query-results-in-csv-format:

synset_relation: These are the actual relations, it seems. Many of them seem to have been performed from the Princeton WordNet to DanNet, which is not ideal.

SELECT id, child_synset_id, parent_synset_id, synset_relation_type_id
INTO OUTFILE '/var/lib/mysql-files/synset_relation.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM synset_relation;

synset_attributes: This is the table with links to Princeton and the CILI in the case of the Princeton words. In DanNet's case, the synset IDs have to be reconstructed based on removing 888 from the beginning and 000 from the end.

SELECT synset_id, princeton_id, ili_id
INTO OUTFILE '/var/lib/mysql-files/synset_attributes.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM synset_attributes;

application_localised_string: This seems to contain the relation names.

SELECT id, value
INTO OUTFILE '/var/lib/mysql-files/application_localised_string.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM application_localised_string;
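Files exported like this can be read back with Python's `csv` module. This is a sketch under assumptions: the sample rows are made up, `read_outfile` is a hypothetical helper, and I am assuming that MySQL writes NULL as an unquoted \N with these export options. Backslash-escaped quotes inside values are not handled here.

```python
import csv
import io


def read_outfile(lines, columns):
    """Yield one dict per row of a comma-separated, double-quoted export."""
    for row in csv.reader(lines, delimiter=",", quotechar='"'):
        # Assumption: an unquoted \N marks a NULL value.
        yield {col: (None if val == r"\N" else val) for col, val in zip(columns, row)}


# Hypothetical sample in the shape of synset_attributes.csv.
sample = io.StringIO(
    '"88830000",\\N,\\N\n'
    '"12345","placeholder-princeton-id","placeholder-ili-id"\n'
)
rows = list(read_outfile(sample, ["synset_id", "princeton_id", "ili_id"]))
print(rows)
```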

simongray · 2023-07-07T07:41:34Z

The new English link triples have been successfully imported as of eb47f48, so this is finally done!

It seems like trying to infer every triple now results in org.apache.jena.shared.JenaException: java.lang.OutOfMemoryError: Java heap space when creating the complete dataset, so I think I might have to give up on that for now. Perhaps I should make a script to fire up a local DanNet instance that people can query themselves.

simongray changed the title ~~Wordnet ID conversions~~ Princeton Wordnet ID conversions Nov 5, 2021

simongray mentioned this issue Nov 5, 2021

Convert Princeton mappings #15

Closed

simongray added the enhancement New feature or request label Mar 30, 2022

simongray added the dataset label Aug 17, 2022

simongray added a commit that referenced this issue Dec 19, 2022

#3 - import open English WordNet

2de5293

- add relevant schemas - move hash functions to separate ns - also include hash of prefix data

simongray added a commit that referenced this issue May 26, 2023

#3 mapping from oldschool WordNet sense IDs to OEWN synset IDs

6c650a7

simongray mentioned this issue May 30, 2023

Labels for OEWN senses, words, and synsets #99

Closed

simongray added a commit that referenced this issue May 30, 2023

#3 - add the remaining old English synonym links as ILI references

d2b64ba

simongray added a commit that referenced this issue May 30, 2023

#3 - section changes to better fit the OEWN

91bb9d3

simongray closed this as completed Jul 7, 2023
