Talking about Talbot
Talbot is a Python package I started working on at the end of 2022 and plan to release to the Python Package Index (PyPI) in March. Talbot converts scholarly metadata in various formats, including Crossref, DataCite, Schema.org, BibTeX, RIS, and formatted citations; the complete list of supported formats is here. Talbot is a Python version of the Bolognese Ruby gem that I worked on with my DataCite colleagues starting in 2018. After leaving DataCite in 2021, I wrote a fork called Briard that added important metadata conversions, namely writing Crossref XML for DOI registration and reading/writing Citation File Format (CFF) for software metadata.
Talbot, Bolognese, and Briard are all names of dog breeds, the naming convention I have used for most of the Open Source software I have written since releasing Lagotto, a tool for tracking article-level metrics, in 2012.
I have two main use cases for Talbot (and Bolognese). The first is DOI content negotiation: using DOI metadata to generate metadata in other formats such as BibTeX, or a formatted citation in one of the thousands of available citation styles. The Python version will enhance the InvenioRDM Open Source repository platform, e.g. by adding RIS and Schema.org JSON-LD to the supported export formats. The second is supporting DOI registration via multiple input formats. Since 2021, for example, the Briard gem has allowed me to register DOIs for this blog as well as the Force11 Upstream blog using metadata in Schema.org format. With Talbot I want to enable Crossref DOI registration in the InvenioRDM platform for use cases where this makes sense, e.g. blog posts or preprints. Talbot will also help register DOIs from RSS feeds as part of the Rogue Scholar blog archive I am launching in Q2 2023.
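To make the content negotiation use case concrete, here is a minimal sketch of how such a request is put together: the DOI is resolved via doi.org and the desired output format is named in the `Accept` header. The helper `negotiation_request` is a hypothetical name for illustration and not part of Talbot, and `10.5555/12345678` is only a sample DOI. The request is built but not sent.

```python
from typing import Optional
from urllib.request import Request

# Media types commonly used with DOI content negotiation
BIBTEX = "application/x-bibtex"
CSL_JSON = "application/vnd.citationstyles.csl+json"
FORMATTED_CITATION = "text/x-bibliography"


def negotiation_request(
    doi: str, media_type: str, style: Optional[str] = None
) -> Request:
    """Build (but do not send) a content negotiation request for a DOI.

    Hypothetical helper for illustration only. DOIs containing special
    characters would additionally need URL-encoding.
    """
    accept = media_type
    if style is not None:
        # Formatted citations take the citation style as a parameter
        accept = f"{media_type}; style={style}"
    return Request(f"https://doi.org/{doi}", headers={"Accept": accept})


req = negotiation_request("10.5555/12345678", FORMATTED_CITATION, style="apa")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing the resulting object to `urllib.request.urlopen` would perform the actual lookup; the response body is then the metadata in the requested format.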
One lesson learned with Bolognese/Briard is that the platform/language matters. The InvenioRDM backend is written in Python (the frontend is in JavaScript/React). And while Bolognese/Briard can be used via the command line or in environments such as GitHub Actions that use Docker-based microservices, where the language doesn't really matter, having the scholarly metadata conversion available in a Python environment makes a huge difference. So I took the plunge of rewriting a fairly complex library in another language. I am fully aware that other languages are also used for writing scholarly infrastructure code, but for the next few years Python addresses my needs and is hopefully useful to other infrastructure projects.
While the overall architecture for the evolving Talbot library looks rather similar to Briard, I am making some changes based on my experience over the last five years of working on generic scholarly metadata conversions:
- JSON is the core serialization format. Metadata in XML format (e.g. DataCite, Crossref, JATS) is important, but is no longer used internally for validation in Talbot. I will instead migrate to JSON Schema for metadata validation. DataCite, Crossref, and InvenioRDM use Elasticsearch/OpenSearch, and thus JSON, to index metadata. DataCite XML is still widely used but has been deprecated for several years, as on submission the XML is converted to JSON internally.
- Type hints. Support for static typing is a trend in dynamic languages such as JavaScript (where TypeScript is very popular), Ruby (since Ruby 3.0), and also Python. Talbot uses type hints for linting, which helps with error checking.
- Support unstructured references. Before DataCite Metadata Schema 4.4 (released in April 2021), only references providing an identifier such as a DOI were supported. Crossref has always supported unstructured references, and an identifier isn't available unless content exists in digital form. In the first Talbot release, I take the "fallback solution" approach, providing unstructured metadata if a DOI or other persistent identifier for a reference doesn't exist.
- Author names are hard. One of the biggest challenges with scholarly metadata is author names. In formatted citations and BibTeX separate given and family names are important, and a single name field for both given and family names is a constant source of errors and frustrations. In Talbot I follow both Crossref and Citeproc JSON metadata in that you need either a single name or separate given and family names.
- Dates are hard. Dates are surprisingly hard in scholarly metadata. There are multiple kinds of dates that are not always used consistently, and incomplete dates such as year-only are very common. One approach to dealing with incomplete dates is encoding the parts year, month, and day separately, as done by Citeproc JSON and by Crossref in their REST API. The better solution is to use the ISO 8601 standard, which supports incomplete dates. Other challenges are approximate dates (e.g. circa 1650) and date ranges. These kinds of dates are supported via the Extended Date and Time Format (EDTF), but working with EDTF is hard in code.
- Idiosyncrasies and inconsistencies. There is always a balancing act between supporting a metadata standard thoughtfully and not getting lost in edge cases. DataCite metadata (via Dublin Core, on which it is based) makes it hard to work with some of the bibliographic metadata common for books, articles, and other textual resources, for example page numbers or the journal name. Crossref metadata has the tendency to treat things differently depending on the content type, e.g. the ISSN. After working on Bolognese for five years, I will make some changes to how to best support metadata across different formats. It is clear that there is no single overarching scholarly metadata format; the internal format used by Bolognese, Briard, and now Talbot is a pragmatic mix of the different implementations.
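Two of the points in the list above, author names and incomplete dates, can be illustrated with a short sketch. This is not Talbot's actual code: the function names are hypothetical, and the dictionary keys (`name`, `given`, `family`) are an assumption modelled on Crossref and Citeproc JSON. The date helpers convert between the separate year/month/day parts and an ISO 8601 string that may be incomplete.

```python
def normalize_author(author: dict) -> dict:
    """Require either a single "name" or separate "given"/"family" names.

    Hypothetical helper; separate names win if both forms are present.
    """
    family = author.get("family")
    if family:
        result = {"family": family}
        if author.get("given"):
            result["given"] = author["given"]
        return result
    if author.get("name"):
        return {"name": author["name"]}
    raise ValueError("author needs either 'name' or 'family'")


def date_parts_to_iso8601(date_parts: list) -> str:
    """Convert [year], [year, month] or [year, month, day] to ISO 8601."""
    if not 1 <= len(date_parts) <= 3:
        raise ValueError("expected one to three date parts")
    year, *rest = date_parts
    # Zero-pad so that [2023, 2] becomes "2023-02"
    return "-".join([f"{int(year):04d}"] + [f"{int(part):02d}" for part in rest])


def iso8601_to_date_parts(value: str) -> list:
    """Convert "2023", "2023-02" or "2023-02-14" to a list of integers.

    Any time component after "T" is dropped; EDTF features such as
    approximate dates or ranges are not handled here.
    """
    date_only = value.split("T")[0]
    return [int(part) for part in date_only.split("-")]


print(normalize_author({"given": "Jane", "family": "Doe"}))
print(date_parts_to_iso8601([2023, 2]))
print(iso8601_to_date_parts("2023-02-14"))
```

Keeping year-only and year-month values as plain strings avoids inventing a month or day that the source metadata never contained, which is what happens when incomplete dates are forced into a full date type.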