While the launch of the Rogue Scholar blog archive is still a few months away (happening in the second quarter of this year), I want to give an update on the ongoing work.
The Rogue Scholar blog archive will improve science blogs in important ways, including full-text search, DOIs and metadata, and long-term archiving. The central piece of the underlying infrastructure is the InvenioRDM open source repository software. Front Matter is one of the organizations helping with InvenioRDM development. For the Rogue Scholar, the specific work needed includes the following:
Support for RSS Feeds
All blogs provide RSS feeds, which will be central to automatically fetching metadata and content for the Rogue Scholar. RSS is not built into InvenioRDM and is not needed by most organizations planning to run InvenioRDM. I will therefore build a separate service for this functionality, integrating with InvenioRDM via its REST API. For a blog to be archived and indexed in the Rogue Scholar, users will use this RSS service, providing basic information such as RSS feed URL, language, license, and contact person – basically the information collected for the Rogue Scholar waitlist (feel free to sign up your blog if you haven't already).
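To make the fetching step concrete, here is a minimal sketch of how blog and post metadata could be read from an RSS 2.0 feed using only the Python standard library. It is an illustration under stated assumptions, not the planned service: the function name `parse_feed`, the dictionary keys, and the sample feed are all hypothetical, and Atom feeds or malformed feeds are not handled.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed, used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Science Blog</title>
    <link>https://example.org/</link>
    <language>en</language>
    <item>
      <title>First post</title>
      <link>https://example.org/first-post</link>
      <pubDate>Mon, 06 Mar 2023 10:00:00 +0000</pubDate>
      <description>Summary of the first post.</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return (blog, posts) extracted from an RSS 2.0 document.

    The dictionary keys are assumptions made for this sketch, not a
    schema defined by the Rogue Scholar or InvenioRDM.
    """
    channel = ET.fromstring(xml_text).find("channel")
    blog = {
        "title": channel.findtext("title"),
        "home_page_url": channel.findtext("link"),
        "language": channel.findtext("language"),
    }
    posts = []
    for item in channel.iter("item"):
        raw_date = item.findtext("pubDate")
        # RSS 2.0 dates use the RFC 822 format; convert to an ISO 8601 date.
        published = (
            parsedate_to_datetime(raw_date).date().isoformat() if raw_date else None
        )
        posts.append(
            {
                "title": item.findtext("title"),
                "url": item.findtext("link"),
                "published": published,
                "summary": item.findtext("description"),
            }
        )
    return blog, posts


blog, posts = parse_feed(SAMPLE_FEED)
print(blog["title"], len(posts))
```

A real service would fetch the feed over HTTP on a schedule and then map each post to a record that is submitted via the InvenioRDM REST API; that mapping is left out here because it depends on decisions not covered in this post.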
Next Tuesday I will publish an OPML (Outline Processor Markup Language) file with all blogs on the Rogue Scholar waitlist. OPML is the standard for importing and exporting lists of blogs, e.g. when switching from one RSS reader to another. It is a natural fit for managing blogs in Rogue Scholar, and hopefully helps people sign up for interesting science blogs they want to read. If you are on the Rogue Scholar waitlist, please make sure your RSS Feed URL and Home Page URL are correct, and – if you haven't done so already – pick one (and only one) of the top-level categories from the OECD Fields of Science and Technology:
- Natural Sciences
- Engineering and Technology
- Medical and Health Sciences
- Agricultural Sciences
- Social Sciences
- Humanities
The OPML file (and your RSS reader if you import that file) will group science blogs into these categories. Many blogs fall into more than one category, but that isn't supported by OPML.
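The grouping described above can be sketched as follows: one top-level `outline` element per category, with one nested `outline` of type `rss` per blog. This is only an illustration using the Python standard library and the commonly used OPML 2.0 attributes (`text`, `title`, `type`, `xmlUrl`, `htmlUrl`). The function name `build_opml`, the input dictionary keys, and the example blogs are hypothetical and do not reflect the actual waitlist data.

```python
import xml.etree.ElementTree as ET

# Hypothetical example entries; the keys are assumptions for this sketch.
EXAMPLE_BLOGS = [
    {
        "title": "Blog A",
        "feed_url": "https://a.example.org/feed.xml",
        "home_page_url": "https://a.example.org/",
        "category": "Natural Sciences",
    },
    {
        "title": "Blog B",
        "feed_url": "https://b.example.org/feed.xml",
        "home_page_url": "https://b.example.org/",
        "category": "Humanities",
    },
    {
        "title": "Blog C",
        "feed_url": "https://c.example.org/feed.xml",
        "home_page_url": "https://c.example.org/",
        "category": "Natural Sciences",
    },
]


def build_opml(title, blogs):
    """Return an OPML 2.0 document (as a string) with blogs grouped by category."""
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")

    groups = {}  # category name -> its top-level outline element
    for blog in blogs:
        category = blog["category"]
        if category not in groups:
            groups[category] = ET.SubElement(
                body, "outline", text=category, title=category
            )
        # Each blog appears under exactly one category outline.
        ET.SubElement(
            groups[category],
            "outline",
            type="rss",
            text=blog["title"],
            title=blog["title"],
            xmlUrl=blog["feed_url"],
            htmlUrl=blog["home_page_url"],
        )
    return ET.tostring(opml, encoding="unicode")


print(build_opml("Science blogs", EXAMPLE_BLOGS))
```

Because each feed is nested under a single parent outline, this layout also shows why a blog ends up in exactly one category in the file.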
Hosting Rogue Scholar infrastructure
There are several ways to run the InvenioRDM repository software, depending on the resources available at the hosting organization and on the size and complexity of the repository. A small data repository for a university department has different needs than Zenodo, one of the most popular generalist repositories with almost three million records. The Rogue Scholar sits in the middle as a small to medium-sized repository, anticipating 2,000 to 20,000 blog posts twelve months after launch. InvenioRDM relies on Docker and Kubernetes for running production services. This makes sense for large instances such as Zenodo but adds unnecessary complexity to smaller instances such as the Rogue Scholar.
After much deliberation and discussion, I decided to use a different approach for the Rogue Scholar, which may also be of interest to other organizations planning to use InvenioRDM:
- Using virtual machines instead of Docker containers
- Automation of virtual machine building with Packer and Ansible
- Hosting of virtual machines by cloud provider DigitalOcean, fundamentally similar to hosting a WordPress or Ghost blog
- Making the automation generic to also work for other InvenioRDM instances, and other infrastructure providers, e.g. OpenStack
This will be the focus of my work in the next three months, and luckily I have learned a lot about infrastructure automation in my previous jobs at PLOS and DataCite.
Support for Crossref DOI registration
By default, InvenioRDM uses DataCite DOIs, but Rogue Scholar will use Crossref DOIs for blogs that don't already use DOIs. The Crossref pricing is much more favorable for startups such as Front Matter, and for annual DOI registration numbers that at least initially will be in the hundreds or low thousands. I spent a good part of January and February writing a Python scholarly metadata conversion library that I released two weeks ago (commonmeta-py). Among other things, commonmeta-py can read and write Crossref metadata and can enable Crossref DOI registrations in InvenioRDM – which is written in Python (and JavaScript for the frontend).
As always, reach out to me with questions and comments.