In August 2021 I joined the InvenioRDM project to help develop and host a modern repository platform for scholarly content. Things didn't exactly go as planned at the beginning of 2022, and I spent five months in the hospital with serious personal health issues. Since returning home in early June, my health has improved considerably, and in September I was able to slowly start working again. This has worked well for me, so it is time to set goals for my work on the InvenioRDM project again.
Announcing the Front Matter Archive
Front Matter will be launching a repository to host the full-text content of scholarly blogs in the first quarter of 2023.
The starting point will be the over 450 blog posts that I have published over the last 15 years, but the archive is of course open to content from all scholarly blogs. The goal is both to offer a useful resource for the scholarly community and to learn something about hosting InvenioRDM. An archive of scholarly blog posts is a good starting point, as blogs tend to change technology or location every few years, and older content then often becomes unavailable. Issuing DOIs for blog posts is part of the solution (e.g. this makes citing the blog posts much easier), but a stable long-term archive of the content is equally important. This is a challenge that scholarly blog posts share with many other, often ephemeral, scholarly resources, e.g. data and software.
Setting up the InvenioRDM Software
The InvenioRDM open-source software has seen a lot of work in the last several years and is currently at version 9.1. Several project partners are working on releasing a production version within the next twelve months, and the Caltech Library was the first to do so last week. Repository software is a complex piece of infrastructure, and a lot of work is expected for the proper configuration of the Front Matter Archive, including customization of the theme, authentication, etc.
Running the Front Matter Archive in a Kubernetes Cluster
Over the last few years, Kubernetes has become the de facto standard for deploying containerized applications such as InvenioRDM. Unfortunately, running a Kubernetes cluster is not easy, but it will make things easier eventually (e.g. monitoring, scaling, and security), and it can be set up in a variety of environments, from private clouds to public cloud providers. Front Matter will be using a managed Kubernetes cluster provided by DigitalOcean and provisioned with Terraform, an open-source infrastructure-as-code tool. Kubernetes will not host everything. At least in the initial version, Front Matter will use hosted databases and hosted cloud object storage to store files and metadata, as Kubernetes (and containers) excel at running stateless applications, while stateful applications need more work.
Archiving Full-Text Blog Content
Front Matter will aim to archive blog posts via available APIs (as can be done with the Front Matter Blog), but we will have to take a flexible approach, as full-text content will be provided in a variety of formats, e.g. RSS feeds. The initial focus will be on the Ghost and WordPress blogging platforms. We will help with issuing DOIs for this content and help to archive associated images. We are only interested in content with an open license (CC-BY or CC0) to maximize the potential reuse of the blog archive.
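To make the RSS route a bit more concrete, here is a minimal sketch in Python of how full-text content could be pulled out of a feed and limited to openly licensed blogs. It is not the actual Front Matter ingestion code. The function name `extract_posts`, the sample feed, the example.org URLs, and the use of SPDX-style license identifiers are all assumptions made for illustration, and the sketch assumes the license is known per blog rather than read from the feed. It uses only the Python standard library and relies on the `content:encoded` element that many RSS feeds use for the full text.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module; full text usually sits in <content:encoded>.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"

# Licenses accepted by this sketch, written as SPDX identifiers (an assumption).
OPEN_LICENSES = {"CC-BY-4.0", "CC0-1.0"}

# Hypothetical feed used for illustration only.
SAMPLE_FEED = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://blog.example.org/first-post/</link>
      <pubDate>Mon, 03 Oct 2022 08:00:00 GMT</pubDate>
      <description>Short summary.</description>
      <content:encoded><![CDATA[<p>Full text of the first post.</p>]]></content:encoded>
    </item>
    <item>
      <title>Second post</title>
      <link>https://blog.example.org/second-post/</link>
      <pubDate>Tue, 04 Oct 2022 08:00:00 GMT</pubDate>
      <description>Summary only, no full text.</description>
    </item>
  </channel>
</rss>
"""


def extract_posts(feed_xml, license_id):
    """Return one dict per feed item, or an empty list if the license is not open."""
    if license_id not in OPEN_LICENSES:
        return []
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        full_text = item.findtext(CONTENT_NS + "encoded")
        posts.append(
            {
                "title": item.findtext("title", default=""),
                "url": item.findtext("link", default=""),
                "published": item.findtext("pubDate", default=""),
                # Fall back to the summary when the feed carries no full text.
                "html": full_text or item.findtext("description", default=""),
                "has_full_text": bool(full_text),
            }
        )
    return posts


if __name__ == "__main__":
    for post in extract_posts(SAMPLE_FEED, "CC-BY-4.0"):
        print(post["title"], post["url"], post["has_full_text"])
```

The `has_full_text` flag is there because some feeds only carry a summary. Those posts would have to be fetched another way, for example via the blogging platform's API.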
Future of the Archive
After the initial setup of the blog archive, the focus will be on expanding content and on providing added value. Please reach out via email, Discord, or comments if you have scholarly blog content you want archived by Front Matter, or have suggestions for added functionality. One thing I want to see improved in InvenioRDM is support for full-text HTML, in addition to embedding PDF and other formats. In the coming months, I will also work on the cost model. There is a moderate cost to archive a blog post (mostly for ingesting content), but these costs would add up if blog posts are added by the thousands.