Best practices and informed opinions on Event and Occurrence ID's in aligned, Darwin Core data #261

emiliom · 2024-03-29T20:45:22Z

This thread started on the Standardizing Marine Biological Data Slack on March 20, 2024. As it's of general interest, I'm moving it here so it's accessible to others more openly.

I'm curious to hear what heuristics or rules of thumb others are using to create ID's for the aligned data. I've settled on using UUID's for occurrences and semi-intelligible ID's for events. But even for events it gets a bit crazy because I'm using a hierarchical set of event types (cruise > station visit > sample) and have tried to include some of that hierarchy into the first two types, so ID's get long; for sampling events, the data generator uses unique sample ID's, so I've reused those. I also have used a dataset prefix for event ID's in a probably silly attempt to have the ID's be kind of globally unique or at least easily recognized as belonging to the same dataset. But that also leads to long ID's, and I'm not sure if it's worth it. Thoughts? I know @jdpye had thoughts on this b/c we exchanged a couple of messages on this Slack (now hidden) ...
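To make the scheme described above concrete, here is a minimal sketch of how hierarchical, dataset-prefixed event ID's might be assembled. Everything in it is an assumption for illustration: the `make_event_id` and `build_events` helpers, the `:` separator, the `DEMO` prefix, and the choice to prefix the provider's sample ID's are hypothetical, not the code actually used for the dataset discussed here.

```python
def make_event_id(dataset_prefix, *levels, sep=":"):
    """Join a dataset prefix and hierarchy levels into one event ID.

    Hypothetical helper: the separator and cleaning rules are assumptions.
    """
    parts = [str(p).strip().replace(" ", "_") for p in (dataset_prefix, *levels)]
    if not all(parts):
        raise ValueError("every ID component must be non-empty")
    return sep.join(parts)


def build_events(dataset_prefix, cruise, station, sample_ids):
    """Build cruise > station visit > sample event records with parent links."""
    cruise_id = make_event_id(dataset_prefix, cruise)
    station_id = make_event_id(dataset_prefix, cruise, station)
    events = [
        {"eventID": cruise_id, "parentEventID": None, "eventType": "cruise"},
        {"eventID": station_id, "parentEventID": cruise_id, "eventType": "station visit"},
    ]
    for sample_id in sample_ids:
        # Sample ID's from the data generator are already unique, so the
        # cruise/station hierarchy is left out to keep the ID short.
        events.append(
            {
                "eventID": make_event_id(dataset_prefix, sample_id),
                "parentEventID": station_id,
                "eventType": "sample",
            }
        )
    return events


events = build_events("DEMO", "CR2024", "ST05", ["S-0001", "S-0002"])
for event in events:
    print(event["eventID"], "<-", event["parentEventID"])
```

In this sketch the hierarchy lives in `parentEventID`, so only the first two levels carry it in the ID string itself; that is one way to limit ID length, not a recommendation.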

emiliom · 2024-03-29T20:51:08Z

Follow ups, in the order they were sent on Slack and starting with the person who sent it:

@ymgan: I feel you Emilio … I have had very long IDs too. This reminds me of the discussions on this thread: tdwg/dwc#491. Sorry that I don’t have any solutions for this.

@emiliom: Thanks for pointing to that discussion, @ymgan. It looks really helpful. I'm not expecting solutions here; just pearls of wisdom and a sense of what's been found to be most useful and practical.

@jdpye: My strong preference for meaningful, data-derived IDs comes from a few places: the history of practice at OTN for making meaningful ID fields, the tendency of researchers to use very generic internal IDs for the components of their studies, and my need to find and amend records throughout the pipeline when source data changes.

UUIDs for the sake of guaranteeing uniqueness feels like we are avoiding the work of defining the set of things that makes our record unique. There's no performance penalty for having long ID fields, and they save us far more often than they would ever hinder us as human operators. So it's true, I've never been convinced by the UUID advice.

tdwg/dwc#491 (comment) this guy knows what's up.

@albenson-usgs: Yeah no need for me to rehash what I already said in that DwC thread but just to say that I think this is a topic that is still very fraught and unresolved. My preference at this particular time is for human-resolvable IDs but I know that's not everyone's preference.

@timvdstap: For what it's worth, I'm on the same page as Abby and Jon!

@ymgan: +1 from me! If they have an occurrence table in their database and adding a UUID field is easy, then we go for that. However, there were times when data providers did not use an Occurrence table in their database, but rather constructed the occurrence view table by joining multiple tables. I couldn’t find a way to track this with UUIDs every time they update the dataset. In this case, we asked our data provider to use the columns that are least likely to change (NOT institutionCode coz institute could be renamed, NOT triplets) to create a composite identifier for occurrenceID. Not everyone’s preference either …
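As an illustration of the composite-identifier approach described above, here is a minimal sketch. The column names in `STABLE_COLUMNS`, the `:` separator, and both helper functions are hypothetical; which columns are "least likely to change" depends entirely on the dataset. The optional `derived_uuid` step uses Python's standard `uuid.uuid5`, which yields the same UUID for the same input string, for cases where a UUID-shaped value is wanted without losing reproducibility.

```python
import uuid

# Hypothetical choice of stable columns; adjust per dataset.
STABLE_COLUMNS = ("datasetCode", "sampleID", "scientificNameID")


def composite_occurrence_id(record, columns=STABLE_COLUMNS, sep=":"):
    """Build an occurrenceID from columns that are unlikely to change."""
    missing = [c for c in columns if record.get(c) in (None, "")]
    if missing:
        raise ValueError(f"cannot build occurrenceID, missing: {missing}")
    return sep.join(str(record[c]).strip() for c in columns)


def derived_uuid(composite_id, namespace=uuid.NAMESPACE_URL):
    """Deterministic (name-based) UUID: same composite ID, same UUID."""
    return str(uuid.uuid5(namespace, composite_id))


row_v1 = {
    "datasetCode": "DEMO",
    "sampleID": "S-0001",
    "scientificNameID": "urn:lsid:marinespecies.org:taxname:0000",  # placeholder value
    "institutionCode": "OLD-NAME",
}
# Same row in a later dataset version, after the institute was renamed.
row_v2 = dict(row_v1, institutionCode="NEW-NAME")

id_v1 = composite_occurrence_id(row_v1)
id_v2 = composite_occurrence_id(row_v2)
print(id_v1 == id_v2)  # the renamed institute does not affect the ID
print(derived_uuid(id_v1) == derived_uuid(id_v2))
```

Because the ID is recomputed from the data each time the occurrence view is rebuilt, nothing needs to be stored between dataset versions, at the cost of the ID changing if any of the chosen columns is ever corrected.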

emiliom · 2024-03-29T20:56:02Z

My follow up after the input received.

Thanks again for everyone's input! I've been trying to digest the input here and the discussions in tdwg/dwc #491. There are just too many relevant topics that come to mind, so I'll stop trying to compile "all" relevant threads and considerations, and will list what I have:

- I found it helpful to see what GBIF says about occurrenceID, in this paragraph. Though I'll reemphasize that I'm not interested solely in occurrenceID, but also in eventID. I didn't bother to look up what GBIF or OBIS say about eventID.
  - BTW, in addition to tdwg/dwc #491, I found this GBIF discourse exchange about occurrenceID relevant, too.
- In the eternal, existential battle between proponents of meaningful vs meaningless ID's, y'all are firmly in the meaningful camp. I typically lean that way too, though not always.
- There are competing goals and challenges in this space, including:
  - ID's generated for external aggregation systems like OBIS that data providers may not be in a position to integrate into their own data management systems (such as they may be!).
  - Persistence of ID's across dataset versions (and downstream, in publications using the dataset from OBIS/GBIF) is clearly very important!
- Is there a definable role for data "mediators" that perform the DwC alignment and submission to OBIS, that are separate from the data provider's team? If so, and when the conditions are right (eg, IOOS Regional Associations, possibly OTN), could they be the curators of mappings between data provider ID's (when they exist) and ID's provided to OBIS, regardless of whether they are UUID's or meaningful ID's? I'm not saying a data mediator is a must or always realistic, but when they exist, they may be able to play a distinctive role vis a vis ID's.
  - On yet another tangent, coincidentally today's ESIP weekly newsletter pointed to a nice writeup about "environmental data intermediaries", though the focus is very different.
- In the broader PID space, I think IGSN's (for physical "samples") are a very interesting case. The IGSN community developed doi-like dereferencing early on (eg, https://igsn.org/SIO000003); like doi's, IGSN strings are not horrendously long and random like UUID's; the IGSN strings encode some meaning (via conventions for short namespaces and "sub-namespaces"); and they have built a sustained community effort towards broad adoption. I'm not saying IGSN's per se are a solution here, but there's wisdom to be had there.
  - In addition to the website, here are two good publications about IGSN's and physical samples.
  - This article, "Sample Identifiers and Metadata to Support Data Management and Reuse in Multidisciplinary Ecosystem Sciences", is a nifty use case of putting PID's through their paces at the US Department of Energy’s (DOE’s) Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE). They settle on IGSN's, but it's helpful to see how they arrive at that decision, the pros and cons of other approaches, and how IGSN's worked out for them. Again, I'm not arguing for IGSN's, just pointing to an example of a group reasoning through what they need out of their (P)ID's and describing practical challenges and context.

Alright, enough on ID's! I already have work to do to lay out how my data-alignment code will need to be changed to ensure the ID's I generate on the first version submitted to OBIS are reused in future data-update versions.
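One way to reuse ID's across dataset versions, and to keep the provider-ID-to-published-ID mapping mentioned above, is a small persisted lookup. The sketch below is an assumption-laden illustration, not the alignment code referred to in this comment: the JSON file format, the function names, and the provider keys are all hypothetical.

```python
import json
import tempfile
import uuid
from pathlib import Path


def load_id_map(path):
    """Load the provider-key -> published-ID mapping, or start empty."""
    path = Path(path)
    return json.loads(path.read_text()) if path.exists() else {}


def save_id_map(id_map, path):
    """Persist the mapping so the next dataset version can reuse it."""
    Path(path).write_text(json.dumps(id_map, indent=2, sort_keys=True))


def assign_ids(provider_keys, id_map):
    """Reuse known ID's; mint a random UUID only for keys not seen before."""
    for key in provider_keys:
        if key not in id_map:
            id_map[key] = str(uuid.uuid4())
    return [id_map[key] for key in provider_keys]


with tempfile.TemporaryDirectory() as tmp:
    map_path = Path(tmp) / "occurrence_id_map.json"

    # First version submitted: every record gets a new ID.
    id_map = load_id_map(map_path)
    v1_ids = assign_ids(["rec-a", "rec-b"], id_map)
    save_id_map(id_map, map_path)

    # Later update: existing records keep their ID's, new ones get fresh ID's.
    id_map = load_id_map(map_path)
    v2_ids = assign_ids(["rec-a", "rec-b", "rec-c"], id_map)
    save_id_map(id_map, map_path)

print(v2_ids[:2] == v1_ids)
```

The same mapping works whether the published ID's are UUID's or meaningful strings; it only requires that the provider-side key stays stable between versions.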

emiliom · 2024-03-29T20:57:02Z

From @albenson-usgs: Great summary and resources Emilio! I would definitely advocate for putting this in the issues so the conversation can be found later.
I will say we did discuss some of this at the ESIP Biological Data Standards cluster in making the primer. Also, the ESIP Physical Samples cluster is very keen on IGSNs but the global TDWG community doesn't seem to be and I'm not entirely sure why. Alex Hardisty in the EU really wants the Digital Extended Specimens concept instead of IGSN and I haven't spent the time needed to figure out why (or at least that was my takeaway from some meeting I was in where I was asking Alex about IGSN).

emiliom · 2024-04-09T16:49:30Z

No surprise here, but there was already an issue on this topic in this repo, from 2021: #80

jdpye · 2024-04-09T17:04:58Z

Memory lanes upon memory lanes. Most of my advice from back then stands, and I'm particularly fond of my field-by-field explainer in #80 (comment)

emiliom · 2024-05-08T23:51:18Z

We had missed this resource from the OBIS Manual on "Constructing and using identifier codes"! https://manual.obis.org/identifiers.html
Thanks to the SMBD team at today's meeting for unearthing it. It looks pretty useful.
