Just like the rest of the internet, much of our scholarly infrastructure is built around the Hypertext Transfer Protocol (HTTP), increasingly HTTPS for security, and soon HTTP/2 for better performance. In this infrastructure Uniform Resource Locators (URLs) are essential to locate resources (sic) such as scholarly articles, datasets, researchers, organizations, or grants. Read this Thomson Reuters report for a good recent perspective on this topic. While this works for the most part, there are some issues with URLs - not specific to scholarly content, but particularly important here:
No. 1 is a problem relevant to all URLs, e.g. for web searches or for liking or commenting on a particular web page. Originally suggested by Google, canonical URLs are essential for services such as Facebook or Hypothes.is. They have been formalized in RFC 6596 and are commonly used.
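As a rough illustration of how a service might discover the canonical URL of a page, the sketch below reads the `<link rel="canonical">` element described in RFC 6596 using only the Python standard library. The class and function names and the example.org page are made up for this example, and real services will handle more cases (HTTP `Link` headers, relative URLs).

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, e.g. "canonical nofollow"
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical = attributes["href"]


def find_canonical_url(html):
    """Return the canonical URL declared in the page, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


# Hypothetical page: reachable under many URLs, but declaring one canonical URL
page = """<html><head>
<link rel="canonical" href="https://example.org/articles/42">
</head><body>...</body></html>"""

print(find_canonical_url(page))  # https://example.org/articles/42
```

A commenting or annotation service can then attach all activity to the canonical URL, regardless of which variant of the URL the user happened to visit.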
No. 2 can be a problem, in particular if we are not careful in designing appropriate URLs for landing pages (see next paragraph), but rather use something long and unreadable that also includes query parameters, etc. If we have no control over what the URL looks like, we can use URL shortener services such as bit.ly, which of course have become a common sight on the web. ShortDOIs are a URL shortener for DOIs, but they don’t seem to have gained much traction.
No. 3 is a particularly important issue, commonly referred to as link rot and described extensively for the scholarly literature, e.g. by Klein. There are several technical solutions to this problem. A common approach is to use a landing page for the resource that will never change (and follows the recommendations by Tim Berners-Lee for Cool URIs), and then use redirection to point to the current location of the resource. This is easy for changes of the URL path, using web server redirect rules. It gets more complicated if the server name also changes, in particular if it is the server holding the landing page. Thinking this through, you realize that the only way this can be done on a larger scale is via one or more centralized services that not only provide the technical infrastructure for a central redirection (or resolver) service, but also come with a social contract of rules that everyone submitting URLs to the service has to follow - a major difference to URL shorteners, which don’t solve the link rot problem.
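The core of such a redirection service can be sketched in a few lines. This is a toy model, not how any existing resolver is implemented: the `Resolver` class, the identifier `abc123`, and the example.org/example.net hosts are all invented for illustration. The point is that the identifier stays fixed while the target it redirects to can be updated.

```python
class Resolver:
    """Toy resolver: maps stable identifiers to the current resource location."""

    def __init__(self):
        self._targets = {}

    def register(self, identifier, target_url):
        # An identifier is minted once and never reassigned to another resource
        if identifier in self._targets:
            raise ValueError(f"{identifier} is already registered")
        self._targets[identifier] = target_url

    def update(self, identifier, new_target_url):
        # The resource moved: only the target changes, the identifier stays
        if identifier not in self._targets:
            raise KeyError(identifier)
        self._targets[identifier] = new_target_url

    def resolve(self, identifier):
        """Return (status, headers) roughly as an HTTP server would send them."""
        target = self._targets.get(identifier)
        if target is None:
            return 404, {}
        return 302, {"Location": target}


resolver = Resolver()
resolver.register("abc123", "http://old.example.org/papers/1")
print(resolver.resolve("abc123"))

# The publisher changes its server name; links to the identifier keep working
resolver.update("abc123", "https://new.example.net/articles/1")
print(resolver.resolve("abc123"))
```

The code is the easy part. The social contract mentioned above corresponds to the promise that someone actually calls `update` whenever a resource moves.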
The above is of course a description of the DOI service provided by CrossRef, DataCite, and others, as well as similar persistent identifier services. Unfortunately some persistent identifier services don’t do the above: they create and use persistent identifiers, but there is no central resolver service that maps these identifiers back to URLs. This breaks the integration with the bigger scholarly infrastructure based on URLs. One common example is nucleotide sequences such as U65091: there is no single corresponding URL because the sequence can be found in all three main nucleotide databases, e.g. at http://www.ncbi.nlm.nih.gov/nuccore/U65091. It would help to have a central resolver, e.g. http://nucleotide.org/U65091, that then redirects to one of the three databases based on geographical location or user preference.
There are also problems with DOIs. They use the Handle system to resolve the identifier to a location, and this system was built in the 1990s as infrastructure independent of URLs or DNS (Domain Name System), at a time when it wasn’t clear yet that URLs and associated standards would become ubiquitous. I don’t have numbers, but practically all DOIs are of course now resolved to URLs using the DOI proxy server at http://doi.org (preferred) or http://dx.doi.org. One main consequence of this history is that DOIs are frequently not written as URLs - e.g. doi:10.5555/24242424x instead of https://doi.org/10.5555/24242424x - again breaking the integration with the bigger scholarly infrastructure. The CrossRef DOI display guidelines clearly state that DOIs should be written as URLs in the online environment, which basically is whenever DOIs are used, as PDFs and even Word documents know how to handle URLs. Unfortunately this guideline is still frequently ignored. The above is of course also true for other persistent identifiers using the Handle system, e.g. ePIC.
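Converting the various ways a DOI is written into the URL form is straightforward, as the following sketch shows. The function name `doi_as_url` is made up for this example, the pattern is only a loose sanity check rather than a full validation of DOI syntax, and the sketch assumes the input is not already percent-encoded.

```python
import re
from urllib.parse import quote

# Loose sanity check (an assumption of this sketch): "10.", a numeric
# registrant prefix, a slash, and a non-empty suffix without whitespace
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is prefixed in the wild
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_as_url(value):
    """Return the DOI written as an https://doi.org/ URL."""
    doi = value.strip()
    lowered = doi.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI: {value!r}")
    # Percent-encode characters that are not safe in a URL path
    return "https://doi.org/" + quote(doi, safe="/")


print(doi_as_url("doi:10.5555/24242424x"))
print(doi_as_url("http://dx.doi.org/10.5555/24242424x"))
```

Both calls print https://doi.org/10.5555/24242424x, the form that browsers, PDFs, and word processors recognize as a link.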
The other problem with the DOI system is that it doesn’t address issue No. 4, i.e. provide a central metadata index for the resources that use the system. This job is left to the DOI registration agencies such as CrossRef and DataCite, who have implemented a central metadata store in different ways (e.g. using different metadata schemata), or not at all. This means that we have to look in several places to find all DOIs associated with author John Doe, published since 2012. Obviously we are used to looking up information in multiple places, but not being able to look up the metadata for a DOI without some extra work (finding out the registration agency for the DOI and then going to the respective metadata store) is a problem. One way around these problems is to use the DOI Content Negotiation Service.
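With content negotiation the client asks the DOI proxy for metadata instead of the landing page by sending an `Accept` header. A minimal sketch with the Python standard library follows. The helper name `metadata_request` is invented here, CSL JSON is just one of the media types the registration agencies document, and which formats are available for a given DOI depends on its registration agency. The request is only built, not sent, so the example runs without network access.

```python
from urllib.request import Request

# One of the documented media types for DOI content negotiation
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi, media_type=CSL_JSON):
    """Build a request asking the DOI proxy for metadata rather than the landing page."""
    return Request("https://doi.org/" + doi, headers={"Accept": media_type})


request = metadata_request("10.5555/12345678")
print(request.full_url)
print(request.get_header("Accept"))

# To actually fetch the metadata (needs network access):
# from urllib.request import urlopen
# with urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The client does not need to know which registration agency is responsible for the DOI, because the proxy redirects the request to the right metadata service.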
The persistent identifiers used in our scholarly infrastructure would benefit from a clearer focus on the problems they should solve, starting with No. 1-4 above. One problem is that we probably focus too much on the persistence problem, implied also by the term persistent identifier or PID. What we have neglected is the resolvability problem, i.e. making it as easy as possible to get from the persistent identifier to the resource and/or its metadata. Based on the Den Haag Manifesto and suggested by Todd Vision, we therefore proposed the term trusted identifier with the following characteristics in the conceptual model of interoperability for the ODIN Project:
While not directly relevant for resolving persistent identifiers as URLs, the last point is really important for any persistent identifier infrastructure, described in detail recently.
If I were to design a persistent identifier service today (as if we needed yet another persistent identifier service), I would build the system around a URL shortening service that I control. The URLs could look very similar to what we have with DOIs now, e.g. https://doi.org/10.5555/12345678, but it would be clear that persistent identifiers are URLs, not something separate. Plus we could take advantage of all the lessons learned with URL shorteners - and possibly even reuse their open source code - as they are much more widely used than scholarly persistent identifiers.
Update 6/4/15: added link to Thomson Reuters report on identifiers and open data.