RSS, Atom, JSON Feed
As I discussed in a recent post, RSS is an essential building block for the upcoming Rogue Scholar Scholarly Blog Archive. RSS makes it easy to import blog posts (both metadata and content) automatically and is supported by all blogging platforms. This kind of automation is critical to keep the costs of running the Rogue Scholar low, allowing it to scale to cover a substantial number of science blog posts, and hopefully becoming an important Open Science resource.
But there are also challenges with using RSS:
- RSS is not a single standard but comes in multiple flavors: multiple versions of RSS, Atom, and the newer JSON Feed. Most libraries for consuming RSS (e.g. the Python feedparser) can handle RSS and Atom, but fewer tools (e.g. the Python feeder) also support JSON Feed.
- The Rogue Scholar will use the InvenioRDM open source platform, which uses OpenSearch to index content and metadata. OpenSearch, just like Elasticsearch on which it is based, works with JSON. Indexing and archiving science blogs should therefore start by converting RSS and Atom feeds to JSON, and JSON Feed, which has been mapped from RSS and Atom, is the obvious choice.
- Some blogs prefer to publish only summaries in their RSS feeds, and there have been many discussions on this topic over the years. Archiving full-text content is the primary goal of the Rogue Scholar, and it would complicate its operation if full-text content had to be retrieved by other means. The Rogue Scholar therefore needs one feed that provides the full-text content, but it doesn't have to be the default blog feed.
- Blogs, in particular personal blogs, may publish content that is outside the scope of the main science topics of the blog. Occasional out-of-scope posts, e.g. about major events such as job changes, sickness, or travel, are probably ok, and add a personal note. If out-of-scope posts are frequent, which has come up twice in initial Rogue Scholar discussions, it probably makes sense to provide a filtered RSS feed (e.g. using tags) with only a subset of posts.
- Describing a blog and its associated metadata (e.g. name, feed URL, language, license, contact) is not something that maps easily onto how InvenioRDM is modeled. The obvious choice would be communities, but they can also be seen as a higher level of aggregation, e.g. all blog posts about biodiversity independent of the blog source. For now I will work with communities and enhance the InvenioRDM functionality where it also makes sense for other InvenioRDM use cases, of course coordinating with the InvenioRDM community.
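To make the conversion, full-text, and filtering points above more concrete, here is a minimal Python sketch of mapping an RSS 2.0 feed to a JSON Feed 1.1 document. It is an illustration under stated assumptions, not the Rogue Scholar's actual code: the function name `rss_to_json_feed`, the sample feed, and the `example.org` URLs are made up, and only the standard library is used rather than a feed library such as feedparser. It assumes that full text, when present, is carried in `content:encoded`, and that tags are carried in `category` elements.

```python
import json
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Namespace of the RSS content module, where many feeds put the full text.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"

# Hypothetical sample feed: one full-text science post, one summary-only personal post.
SAMPLE_RSS = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example Science Blog</title>
    <link>https://blog.example.org/</link>
    <item>
      <title>Feeds for archiving</title>
      <link>https://blog.example.org/feeds</link>
      <guid>https://blog.example.org/feeds</guid>
      <pubDate>Mon, 06 Feb 2023 10:00:00 +0000</pubDate>
      <category>science</category>
      <description>Short summary.</description>
      <content:encoded><![CDATA[<p>Full text of the post.</p>]]></content:encoded>
    </item>
    <item>
      <title>Holiday photos</title>
      <link>https://blog.example.org/holiday</link>
      <guid>https://blog.example.org/holiday</guid>
      <pubDate>Tue, 07 Feb 2023 10:00:00 +0000</pubDate>
      <category>personal</category>
      <description>Only a summary here.</description>
    </item>
  </channel>
</rss>"""


def rss_to_json_feed(rss_xml, feed_url, tag=None):
    """Map an RSS 2.0 document to a JSON Feed 1.1 dict.

    If `tag` is given, only items carrying that category are kept.
    """
    channel = ET.fromstring(rss_xml).find("channel")
    items = []
    for item in channel.findall("item"):
        tags = [c.text for c in item.findall("category") if c.text]
        if tag is not None and tag not in tags:
            continue  # out-of-scope post, skip it
        link = item.findtext("link")
        entry = {
            "id": item.findtext("guid") or link,
            "url": link,
            "title": item.findtext("title"),
            "summary": item.findtext("description"),
            "tags": tags,
        }
        # Only full-text feeds provide this; summary-only items lack content_html.
        full_text = item.findtext(CONTENT_NS + "encoded")
        if full_text:
            entry["content_html"] = full_text
        published = item.findtext("pubDate")
        if published:
            # RSS uses RFC 822 dates, JSON Feed expects RFC 3339.
            entry["date_published"] = parsedate_to_datetime(published).isoformat()
        items.append(entry)
    return {
        "version": "https://jsonfeed.org/version/1.1",
        "title": channel.findtext("title"),
        "home_page_url": channel.findtext("link"),
        "feed_url": feed_url,
        "items": items,
    }


feed = rss_to_json_feed(SAMPLE_RSS, "https://blog.example.org/feed.xml", tag="science")
print(json.dumps(feed, indent=2))
```

Items that come out without `content_html` are the summary-only case described above and would need a different feed. The resulting dict is plain JSON, which is the kind of document a JSON-based index like OpenSearch works with.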
Two weeks ago I opened up the waitlist for the Rogue Scholar, and I am happy with the feedback I have received so far: sixteen submissions and a number of encouraging discussions. Consider adding your science blog to the waitlist, or learn more at the Rogue Scholar website. If you have questions, post them in the comments or join the Discord channel (renamed from Front Matter to Rogue Scholar).
It has not escaped our notice that the specific use of RSS we have postulated immediately suggests a possible mechanism for the archiving and DOI registration of other scholarly content.