Yesterday Julie McMurry and co-authors published a preprint 10 Simple rules for design, provision, and reuse of persistent identifiers for life science data. This is an important paper trying to address a fundamental problem: how can we make persistent identifiers both human-readable and machine-readable?
Don’t be fooled by the title (used frequently by PLOS Computational Biology) - the paper doesn’t describe simple rules that help the average life sciences researcher. Rather, the paper deals with rather complex issues, and has 36 authors.
There is general agreement that we need persistent identifiers for scholarly communication, and that also includes life sciences datasets, the focus of the paper. What is less clear is how to express these persistent identifiers. An identifier such as AB020317 - for the mouse p53 gene - is ambiguous. It is not clear without additional information that this is an identifier for the GenBank nucleotide database, rather than something completely different. One common approach to make this identifier unambiguous is to use URIs (Uniform Resource Identifiers), e.g. http://www.ncbi.nlm.nih.gov/nuccore/AB020317 in this case.
The paper doesn’t like this approach, and even states that “URIs are still among the most commonly used and most problematic identifiers in the bio-data ecosystem”. The text also states that “their length makes them unwieldy for humans working with the data or for referencing in publications or other text”, but doesn’t go into any detail why URIs are “problematic identifiers”, or why length is an issue in an online environment.
This is an important weakness of the paper, because the authors propose an alternative: CURIEs or compact URIs. CURIEs were proposed by the W3C a few years ago, as a way to make URIs more human-readable. The idea is simple, we use a namespace in addition to the local identifier, separated by a colon, e.g. Genbank:AB020317.
This approach has of course been common practice in the life sciences before CURIEs or even the WWW existed, and is still the most common approach how identifiers for life sciences data are referenced in the scholarly literature. Unfortunately there are important problems with CURIEs, most of them mentioned in the paper:
URIs don’t have the problems listed above: they resolve, are unique, and there is good understanding (and available tools) of how a valid URI should look like and how to extract URIs from text documents. That is why URIs are good representations of persistent identifiers.
Another problem I have with CURIEs: the idea doesn’t seem to have caught on from the initial work more than five years ago (background reading here). I’m not even sure what percentage of persistent identifier experts know about CURIEs.
My recommendation for life sciences data: express persistent identifiers as URIs. Now that can go into 10 simple rules for the average life sciences researcher.
P.S. This blog uses a tool I wrote two years ago that automatically turns CURIEs in the text into links.
The DataCite Technology Stack
DataCite is a DOI registration agency that enables the registration of scholarly content with a persistent identifier (DOI) and metadata. This content can then be searched for, reused, and connected to other scholarly resources. ...
2020 Strategic Priorities for Services and Infrastructure
In a blog post four weeks ago DataCite Executive Director Matt Buys talked about the DataCite strategic priorities for 2020 (Buys, 2020). In this post we want to talk a bit more about the strategic priorities for this year we have regarding services and infrastructure work: a) ...
Digging into Metadata using R
In the first post of this new blog a few weeks ago I talked about Data-Driven Development, and that service monitoring is an important aspect of this. The main service DataCite is providing is registration of digital object identifiers (DOIs) ...