Rogue Scholar is learning a new language

Photo by Fab Lentz / Unsplash

The Rogue Scholar science blog archive is built with Open Source software, starting with Javascript in June 2023 and adding an API written in Python in October 2023. This week I am releasing the first version of commonmeta written in Go to simplify the Rogue Scholar backend.

Commonmeta is the scholarly metadata format that I started in March 2023 based on work going back to 2017, most notably a metadata conversion library I had written in Ruby that is heavily used internally at DataCite. One important use case for commonmeta is Rogue Scholar DOI registration with Crossref, and in 2023 I switched from Ruby to the commonmeta-py Python library, continuing a GitHub Actions workflow.

The second important commonmeta use case for Rogue Scholar is providing blog post metadata in a variety of formats, including bibtex and formatted citations, to help with importing science blog posts into reference managers and citing them in other scholarly works. This functionality is currently also provided by the Python library and is available via both the Rogue Scholar website and API.

The third important commonmeta use case for Rogue Scholar is automated reference extraction from blog posts and storing the references with the Crossref DOI metadata. As of today, 1,124 Rogue Scholar blog posts include references registered with Crossref. The references are stored in a dedicated Rogue Scholar database in commonmeta format and have been shown for each post via both the Rogue Scholar website and API since earlier this month. This is again done by the commonmeta Python library, in combination with the Rogue Scholar Python API.

Just as the Rogue Scholar Javascript frontend reached its limitations in late 2023 and led to the launch of a dedicated Rogue Scholar API written in Python in October 2023, the work on references stretches the capabilities of the current Rogue Scholar service.

One possible path forward would be to not implement references for scholarly blog posts, but references and the linking they provide are a core functionality of the scholarly record. And if not for the references, Rogue Scholar would run into challenges with other desired functionalities in the future, e.g. aggregation of blog posts by author and/or organization, using their ORCID and ROR identifiers.

The path forward I have decided to take is to improve the Rogue Scholar backend, which currently consists of database, storage, and authentication. And just as frontends are typically written in Javascript and APIs are commonly written in Python, a common language for cloud and network services is Go. Despite the initial extra effort, I bet that in the long run Go will make the needed backend work easier compared to Javascript and Python, and will simplify Rogue Scholar infrastructure.

It will take time to refactor the Rogue Scholar backend, but a good starting point is commonmeta metadata conversions, which are central to several Rogue Scholar functionalities, as described above. I have started to work on implementing commonmeta metadata conversions in Go, and today I am launching the first public version which converts Crossref and DataCite metadata into the commonmeta format, both for single DOIs and for lists of DOIs. Installing the commonmeta Go application means downloading a single 3 MB binary (versions for Linux, Windows or Mac) from GitHub, a very different experience from Javascript, Python (or Ruby). You can then use the command line to fetch the metadata for a particular DOI (Crossref or DataCite), e.g.

commonmeta convert 10.59350/s859z-eqy17
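Conceptually, a lookup like this starts by reducing the input to a bare DOI and then asking the Crossref or DataCite API for the record. The following Python sketch is not taken from commonmeta: `normalize_doi` and `metadata_urls` are hypothetical helper names, and the only things assumed are the public single-work endpoints of the Crossref and DataCite REST APIs.

```python
import re
from urllib.parse import quote

# A DOI, optionally preceded by a doi.org resolver URL or a "doi:" prefix.
DOI_PATTERN = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:)?(10\.\d{4,9}/\S+)$", re.IGNORECASE
)


def normalize_doi(value: str) -> str:
    """Return the bare, lowercased DOI (hypothetical helper, not commonmeta's)."""
    match = DOI_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not a DOI: {value!r}")
    # DOIs are case-insensitive, so lowercasing gives one canonical form.
    return match.group(1).lower()


def metadata_urls(value: str) -> dict:
    """Build the Crossref and DataCite REST API URLs for one DOI."""
    encoded = quote(normalize_doi(value), safe="/")
    return {
        "crossref": f"https://api.crossref.org/works/{encoded}",
        "datacite": f"https://api.datacite.org/dois/{encoded}",
    }


print(metadata_urls("https://doi.org/10.59350/s859z-eqy17"))
```

Only one of the two URLs will return a record for a given DOI, depending on which registration agency issued it.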

You can do the same with commonmeta-py, but you would have to worry about Python versions, virtual environments, and installing dependencies (similarly with commonmeta-ruby). Another advantage of Go is that it is really fast. This only shows when you convert lots of scholarly works. To try it out, fetch a random sample of 1,000 DOIs from Crossref:

commonmeta sample -n 1000

One use case for the commonmeta command-line application is the quick collection of scholarly metadata in bulk for coverage analysis, which is then typically done with notebooks written in Python, R, or Javascript. The following command will fetch all Rogue Scholar blog posts (up to 1,000 posts, Front Matter is Crossref member 31795) with funding information and ORCID IDs for at least one author, and store the results (79 blog posts) in funding.json:

commonmeta list -n 1000 --member 31795 --has-award --has-orcid > funding.json
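The flags in this command line up with filters in the public Crossref REST API. As a rough illustration, and without claiming that this is how commonmeta builds its requests internally, the same query can be written as a URL. `crossref_works_url` is a hypothetical helper; the `member`, `has-award`, and `has-orcid` filter names and the `rows` parameter are the ones the Crossref API uses.

```python
from urllib.parse import urlencode


def crossref_works_url(rows, member=None, has_award=False, has_orcid=False):
    """Build a Crossref /works query URL (hypothetical helper, for illustration)."""
    filters = []
    if member is not None:
        filters.append(f"member:{member}")
    if has_award:
        filters.append("has-award:true")
    if has_orcid:
        filters.append("has-orcid:true")

    params = {"rows": rows}
    if filters:
        # Crossref expects multiple filters as one comma-separated value.
        params["filter"] = ",".join(filters)
    # Keep ":" and "," readable instead of percent-encoding them.
    return "https://api.crossref.org/works?" + urlencode(params, safe=":,")


print(crossref_works_url(1000, member=31795, has_award=True, has_orcid=True))
```

The response to such a request is Crossref JSON, not commonmeta; the conversion into a common format is the part the command-line application adds.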

Combined with the Open Data provided by Crossref, DataCite, ORCID, ROR, and others via API and bulk download, this approach can dramatically change how we study research information on our path towards decentralized and federated research information sources.
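As a small illustration of the kind of coverage analysis mentioned above, the following Python sketch summarizes a file such as funding.json. The record layout is an assumption, not something confirmed here: it presumes a JSON array of records, each with a `contributors` list whose entries may carry an ORCID URL in `id`, and a `funding_references` list whose entries have a `funderName`. The function names are hypothetical, and the field names need adjusting if the actual output differs.

```python
import json
from collections import Counter


def load_records(path):
    """Load a JSON array of metadata records (assumed file layout)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def funder_counts(records):
    """Count how many records mention each funder name at least once."""
    counts = Counter()
    for record in records:
        # A set avoids counting a funder twice within the same record.
        names = {
            ref.get("funderName")
            for ref in record.get("funding_references", [])
            if ref.get("funderName")
        }
        counts.update(names)
    return counts


def orcid_share(records):
    """Fraction of all contributors that carry an ORCID iD in their `id`."""
    contributors = [c for r in records for c in r.get("contributors", [])]
    if not contributors:
        return 0.0
    with_orcid = [c for c in contributors if "orcid.org/" in str(c.get("id", ""))]
    return len(with_orcid) / len(contributors)
```

In a notebook, `funder_counts(load_records("funding.json")).most_common(10)` would then list the funders that appear in the most posts.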

Over the coming weeks and months, I will improve the functionality of the commonmeta Go application to match the features of commonmeta-py and commonmeta-ruby, in particular the bulk conversion into other formats, e.g. bibtex, schema.org, and formatted citations. For the datacite and crossrefxml formats, I will also enable DOI registration, so that Rogue Scholar DOI registrations and updates are handled with Go.

Go works well not only for command-line applications but also for network services. Building an HTTP server connected to a database around the commonmeta package is straightforward, and once the basic commonmeta features are implemented in Go, the next step will be to build a Rogue Scholar references server. Then (later in 2024 or 2025) the focus will shift to improving workflows around blog post metadata extraction and DOI registration. In Python, message queues are frequently implemented with Celery (and Redis or RabbitMQ); in Go there are several solid options.

Details of the suggested path will become clearer as I work on this in the coming months, including whether to use another language instead of Go. In 2024, Rust is probably the most interesting alternative, more so than C++, Haskell, or Java.

The approach taken for Rogue Scholar and the lessons learned are also relevant for other scholarly infrastructures that try to follow the Principles of Open Scholarly Infrastructure, including preprint servers, repositories, and journal publishing platforms. To paraphrase a famous quote from the 1953 paper on the structure of DNA, it has not escaped our notice that the specific approach we have postulated immediately suggests a possible architecture for Diamond Open Access scholarly infrastructure.

References

Fenner, M. (2023, June 5). Starting to register DOIs for all blog posts included in the Rogue Scholar. Front Matter. https://doi.org/10.53731/m9fs5-nap05

Fenner, M. (2023, October 9). Rogue Scholar has an API. Front Matter. https://doi.org/10.53731/ar11b-5ea39

Fenner, M. (2023, March 9). Announcing Commonmeta. Front Matter. https://doi.org/10.53731/cp7apdj-jk5f471

Fenner, M. (2017, April 28). A Content Negotiation Update. Front Matter. https://doi.org/10.53731/r7adm61-97aq74v-ag5kv

Fenner, M. (2024). Commonmeta-py [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.8340374

Fenner, M. (2024, January 16). Improving Rogue Scholar metadata conversions. Front Matter. https://doi.org/10.53731/n4vfp-nwb08

Fenner, M. (2024, April 10). Rogue Scholar references learn new tricks. Front Matter. https://doi.org/10.53731/mgmcc-yvt53

Bilder, G., Lin, J., & Neylon, C. (2020). The Principles of Open Scholarly Infrastructure. https://doi.org/10.24343/C34W2H

Fenner, M. (2024, April 8). Simplifying Rogue Scholar Infrastructure. Front Matter. https://doi.org/10.53731/ne1kh-9mn38

Babini, D., Garcia, A. B., Costas, R., Matas, L., Rafols, I., & Rovelli, L. (2024, April 22). Not only Open, but also Diverse and Inclusive: Towards Decentralised and Federated Research Information Sources. Leiden Madtrics. https://doi.org/10.59350/gmrzb-e2p83

Watson, J. D., & Crick, F. H. C. (1953). Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid. Nature, 171(4356), 737–738. https://doi.org/10.1038/171737a0

Copyright © 2024 Martin Fenner. Distributed under the terms of the Creative Commons Attribution 4.0 License.