Persistent identifiers, random strings, and checksums
Continuing the work on streamlining the DOI assignment for new blog posts in the Rogue Scholar science blog archive, I discovered and fixed a major bug. Rogue Scholar DOI strings are generated from a random number and include a checksum. It turns out that this checksum was wrong, affecting all Rogue Scholar posts generated with the commonmeta Go library between May 2024 and today.
When working as Technical Director at the DOI registration agency DataCite, I was always jealous of persistent identifier systems that put more thought into the structure of the identifier. DOIs are made up of a prefix that starts with 10.x (where x can be four or five digits), followed by a suffix that can basically be anything. For example this:
https://doi.org/10.1002/(SICI)1099-1409(199908/10)3:6/7<672::AID-JPP192>3.0.CO;2-8
This DOI suffix is actually a SICI (Serial Item and Contribution Identifier), adopted as a NISO standard in 1996 and used for some early DOIs (and withdrawn in 2012). There are three big problems with this DOI:
- some characters don't work well in URLs, in particular < and >, and need to be escaped,
- the identifier includes meaning (e.g. ISSN, publication date, and page numbers) that might change over time and is better stored in DOI metadata,
- the identifier is long and complicated, increasing the risk of transmission errors.
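To illustrate the first problem, here is a minimal Go sketch that builds a resolvable URL for a DOI using the standard library's net/url package. The helper name doiURL is hypothetical, and the percent-encoding shown is what Go's URL type applies to a path, not necessarily what every DOI resolver or publisher uses.

```go
package main

import (
	"fmt"
	"net/url"
)

// doiURL is a hypothetical helper that turns a DOI (prefix/suffix)
// into a doi.org URL. url.URL percent-encodes characters such as
// < and > that are not allowed unescaped in a URL path.
func doiURL(doi string) string {
	u := url.URL{Scheme: "https", Host: "doi.org", Path: "/" + doi}
	return u.String()
}

func main() {
	// The SICI-based DOI from the example above.
	fmt.Println(doiURL("10.1002/(SICI)1099-1409(199908/10)3:6/7<672::AID-JPP192>3.0.CO;2-8"))
}
```

The angle brackets come out as %3C and %3E, which makes an already long identifier even harder to read and to copy by hand.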
DOIs have been around for more than 20 years, so it is very hard, if not impossible, to change what a DOI suffix should look like. But any organization registering DOIs can follow best practices, and in 2016 I wrote about some of them in a post I called Cool DOIs.
Some of the issues have since been addressed, e.g. the use of HTTPS instead of HTTP and doi.org instead of dx.doi.org. Not including semantic info in the DOI string is a constant struggle; instead of SICIs we now see attempts to include the journal name, publication date, or version number in the identifier.
Using base32-encoding for DOI suffix generation starting with a large random number is something I came up with at the time. DOIs are case-insensitive, so base64-encoding, used e.g. in YouTube identifiers, wouldn't work. Base32 uses an alphabet of 32 digits, which makes the identifier more compact than a decimal identifier (an alphabet of only 10 digits) such as an ORCID ID. UUIDs such as 816c08de-750e-4d6d-a96f-f995dc1a366a are an alternative, but at the time I found them too long (32 hexadecimal digits) to be practical. And because DOIs use a prefix, an identifier based on a random string only needs to be unique within the DOIs of that organization, making the task easier.
There are several flavors of base32-encoding around; a popular one was developed by Douglas Crockford. He uses the alphabet (minus the letters I, L, O and U) and digits for the 32 keys and allows optional hyphens and checksums. One frequently used checksum is ISO 7064 mod 97-10. I wrote the base32-url Ruby gem to implement this in the Ruby programming language, and it is used in DataCite service infrastructure whenever a random DOI string is needed. The InvenioRDM project wrote a Python implementation, used to generate DOI suffixes for the repository software.
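The encoding can be sketched in a few lines of Go. This is a minimal illustration, not the code of the commonmeta, base32-url, or InvenioRDM libraries: the function names encode and checksum are my own for this example, and the alphabet is Crockford's (digits plus letters, omitting I, L, O and U). The mod 97-10 check value is computed as 98 minus (number times 100 modulo 97) and written as two decimal digits.

```go
package main

import (
	"fmt"
	"strings"
)

// Crockford base32 alphabet: digits plus letters, omitting i, l, o and u.
const alphabet = "0123456789abcdefghjkmnpqrstvwxyz"

// checksum returns the two-digit ISO 7064 mod 97-10 check value for n.
func checksum(n uint64) string {
	return fmt.Sprintf("%02d", 98-(n%97)*100%97)
}

// encode converts n to Crockford base32, appends the checksum, and
// inserts a hyphen every splitEvery characters (0 disables hyphens).
func encode(n uint64, splitEvery int) string {
	var digits []byte
	for v := n; v > 0; v /= 32 {
		digits = append([]byte{alphabet[v%32]}, digits...)
	}
	if len(digits) == 0 {
		digits = []byte{'0'}
	}
	s := string(digits) + checksum(n)
	if splitEvery <= 0 {
		return s
	}
	var parts []string
	for len(s) > splitEvery {
		parts = append(parts, s[:splitEvery])
		s = s[splitEvery:]
	}
	return strings.Join(append(parts, s), "-")
}

func main() {
	fmt.Println(encode(678542152703, 5)) // kqy47-4zz67
}
```

In practice the input would be a large random number; a fixed value is used here so the result is reproducible.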
When the Rogue Scholar science blog archive started registering DOIs for blog posts, it used the same Python library with the same configuration (10 characters with a hyphen after the fifth, the last two being a mod 97-10 checksum). In May 2024 I switched to using the commonmeta Go library I wrote for Rogue Scholar DOI registration. For this, I implemented Crockford base32 encoding in Go. Unfortunately I made one mistake and only fully realized it this week: the checksum was not generated properly. This means that checksum checks for those DOIs fail, but the DOIs work as expected in every other way. Today I released a fix for this bug, and all Rogue Scholar DOIs registered starting today have properly working checksums.
As a bonus I implemented a new decode command in the commonmeta library that can be used from the command line:
commonmeta decode https://doi.org/10.59350/kqy47-4zz67
This command returns 678542152703, the random number used to generate that DOI suffix. The checksum is 67, and if the checksum is wrong an error message is returned, as is the case for DOIs registered with older versions of the commonmeta Go library:
commonmeta decode https://doi.org/10.54900/d3ck1-skq19
This returns an error:
wrong checksum 19 for identifier d3ck1-skq19
Commonmeta also understands DOIs from other organizations using base32-encoding, e.g. repositories running the InvenioRDM software:
commonmeta decode https://doi.org/10.22002/fcw7t-a3s32
The DataCite and Make Data Count blogs:
commonmeta decode https://doi.org/10.60804/1nhk-qd72
The ROR organizational identifier also uses Crockford base32-encoding and mod 97-10 checksums (I came up with the identifier format in late 2018 when helping launch the initial ROR service):
commonmeta decode https://ror.org/00pd74e08
This command returns 23501966. ROR uses no hyphens and seven digits (where the first digit is always 0), followed by the two-digit checksum.
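The decoding step for DOI suffixes and ROR IDs can be sketched as follows. This is a simplified illustration under my own assumptions, not the commonmeta implementation: the function name decode is chosen for this example, it expects only the suffix or ROR ID rather than a full URL, and it skips Crockford's lenient handling of look-alike characters (such as reading o as 0).

```go
package main

import (
	"fmt"
	"strings"
)

// Crockford base32 alphabet: digits plus letters, omitting i, l, o and u.
const alphabet = "0123456789abcdefghjkmnpqrstvwxyz"

// decode strips hyphens, splits off the trailing two-digit checksum,
// decodes the rest as Crockford base32, and verifies the checksum
// using ISO 7064 mod 97-10. Identifiers longer than 12 base32
// characters would overflow uint64 and are not handled here.
func decode(id string) (uint64, error) {
	s := strings.ToLower(strings.ReplaceAll(id, "-", ""))
	if len(s) < 3 {
		return 0, fmt.Errorf("identifier %s is too short", id)
	}
	body, check := s[:len(s)-2], s[len(s)-2:]
	var n uint64
	for _, c := range body {
		i := strings.IndexRune(alphabet, c)
		if i < 0 {
			return 0, fmt.Errorf("invalid character %q in identifier %s", c, id)
		}
		n = n*32 + uint64(i)
	}
	if want := fmt.Sprintf("%02d", 98-(n%97)*100%97); want != check {
		return 0, fmt.Errorf("wrong checksum %s for identifier %s", check, id)
	}
	return n, nil
}

func main() {
	for _, id := range []string{"kqy47-4zz67", "00pd74e08", "d3ck1-skq19"} {
		n, err := decode(id)
		fmt.Println(id, n, err)
	}
}
```

The first two identifiers decode to 678542152703 and 23501966, and the third one is rejected because of its checksum.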
ORCID doesn't use base32-encoding. It uses mod 11-2 checksums and the same identifier format as ISNI (International Standard Name Identifier). The decode command checks the checksum and also looks at the number range reserved for ORCID:
commonmeta decode https://orcid.org/0000-0003-1419-240X
This made-up ORCID ID (my ORCID ID with a different checksum) results in this:
wrong checksum X for identifier 0000-0003-1419-240X
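The mod 11-2 check can be sketched in Go as below. This is a minimal illustration of ISO 7064 mod 11-2 as used for ORCID IDs, with function names of my own choosing; unlike the decode command it does not look at the number range reserved for ORCID.

```go
package main

import (
	"fmt"
	"strings"
)

// checkDigit computes the ISO 7064 mod 11-2 check character for the
// first 15 digits of an ORCID ID (hyphens already removed).
func checkDigit(base string) (byte, error) {
	total := 0
	for _, c := range base {
		if c < '0' || c > '9' {
			return 0, fmt.Errorf("invalid character %q", c)
		}
		total = (total + int(c-'0')) * 2
	}
	r := (12 - total%11) % 11
	if r == 10 {
		return 'X', nil
	}
	return byte('0' + r), nil
}

// validORCID reports whether the last character of id matches the
// check character computed from the preceding 15 digits.
func validORCID(id string) bool {
	s := strings.ToUpper(strings.ReplaceAll(id, "-", ""))
	if len(s) != 16 {
		return false
	}
	d, err := checkDigit(s[:15])
	return err == nil && d == s[15]
}

func main() {
	fmt.Println(validORCID("0000-0003-1419-240X")) // false
	fmt.Println(validORCID("0000-0003-1419-2405")) // true
}
```

For the first 15 digits of the made-up ID above, the computed check character is 5, which is why the X is rejected.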
Conclusions
Best practices for the structure of persistent identifiers have evolved over the years, and in this blog post I have just scratched the surface and looked at some PIDs relevant to Rogue Scholar, namely DOI, ROR, and ORCID. They all use (or can use) random strings and checksums, and I added a command to the commonmeta library and command-line tool to decode these random strings and use checksums to verify identifiers. I fixed a bug affecting checksums in the commonmeta Go library, and the library now works interchangeably with the Python and Ruby implementations of Crockford base32-encoding.
Going forward this will allow Rogue Scholar to be more flexible with DOI registration. Participating blogs can provide DOI strings in the metadata (ideally using the id/guid in the RSS/Atom/JSON Feed) and Rogue Scholar can check these strings for uniqueness, checksums, and the correct DOI prefix. The first blogs participating in Rogue Scholar have started to implement this approach, including of course this blog. Please reach out if you need help with this or have general questions or feedback regarding persistent identifiers and checksums.
References
Fenner, M. (2016, December 15). Cool DOIs. Front Matter. https://doi.org/10.53731/r79x921-97aq74v-ag5a2
Fenner, M. (2024, May 13). Going for DOI registration. Front Matter. https://doi.org/10.53731/43qt9-x6p52
Fenner, M. (2025). front-matter/commonmeta: V0.6.32 (Version v0.6.32) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.14671594
Fenner, M. (2025, January 13). Including DOIs in RSS Feeds: Implementation. Front Matter. https://doi.org/10.53731/m9d5v-xmr74