Archiving individual science blog posts

Sometimes we want to preserve a blog post to read or reuse later. There are generic tools for that purpose, including read-it-later apps such as Instapaper or Pocket. For scholarly blog posts, the right place can also be a reference manager, which stores the content and the metadata, especially if a DOI was registered for each blog post.

Reference managers can store full-text documents and typically use PDF as the file format. Zotero 7 (currently in development) adds support for ePub documents. ePub is not as widely supported as PDF for scholarly journal articles, but it is popular for books and works better on smaller screens.

How can the Rogue Scholar science blog archive support the download of individual blog posts in a way that facilitates reading and reuse? Another important use case for blog authors is the migration to a different blogging platform (e.g. from WordPress to Hugo), and a generic import/export format would be very helpful for this.

The full-text content of blogs included in the Rogue Scholar is currently stored in HTML, as that is the format used to distribute blog posts via RSS feed, webpage, or newsletter. One challenge with HTML is that, unlike ePub and PDF, it is not typically used for standalone documents.

Scholarly metadata can be converted into different formats, for example with the Commonmeta libraries for Ruby and Python that I created in early 2023. In the same way, scholarly content can be converted into different formats using a variety of tools, most notably the open-source document converter Pandoc. Pandoc not only converts HTML documents to PDF and ePub, but also converts documents authored in Microsoft Word, LaTeX, or Markdown. Markdown is particularly important, as it is the authoring format used by blogs powered by static site generators, which make up about 25% of all blogs participating in the Rogue Scholar science blog archive. In the last few days, I have added Pandoc to the Rogue Scholar API to convert HTML to Markdown. I am still exploring, but you can see some Markdown-formatted blog posts in the content_text field.
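To make the HTML to Markdown conversion concrete, here is a minimal sketch that calls the Pandoc command-line tool from Python. It is not the code used in the Rogue Scholar API. The function names pandoc_command and convert are hypothetical, and the sketch assumes that pandoc is installed and available on the PATH.

```python
import shutil
import subprocess
from typing import List, Optional


def pandoc_command(source_format: str, target_format: str,
                   output_path: Optional[str] = None) -> List[str]:
    """Build a pandoc command line (hypothetical helper, reads from stdin)."""
    command = ["pandoc", "--from", source_format, "--to", target_format]
    if output_path is not None:
        # Binary formats such as epub need an output file instead of stdout.
        command += ["--output", output_path]
    return command


def convert(text: str, source_format: str = "html",
            target_format: str = "markdown",
            output_path: Optional[str] = None) -> str:
    """Pipe text through pandoc and return whatever it writes to stdout."""
    result = subprocess.run(
        pandoc_command(source_format, target_format, output_path),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc reports an error
    )
    return result.stdout


if __name__ == "__main__":
    html = "<h1>Hello</h1><p>A <em>short</em> blog post.</p>"
    if shutil.which("pandoc"):
        # Text formats such as Markdown come back on stdout.
        print(convert(html))
    else:
        print("pandoc was not found on the PATH")
```

Passing a target format of epub together with an output path, for example convert(html, "html", "epub", "post.epub"), would write an ePub file instead of returning text. Building the command in a separate function keeps the format choice easy to inspect without running Pandoc.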

Conclusions

  • Individual science blog posts sometimes need to be downloaded: to keep a local copy to read again later, to reference in another scholarly work, or to migrate to a different blogging platform if you are a blog author.
  • Conversion of content and metadata into different formats is relatively straightforward, and existing open-source tools are already integrated into the Rogue Scholar blog archive. DOI registration uses the commonmeta-ruby library, and basic Pandoc support has recently been added.
  • The best canonical format for science blog posts is not known yet. PDF is not part of standard workflows for blogging and is not an authoring (input) format. JATS is a standard format for scholarly articles but is not typically used by scientists for local files. HTML works better for collections of files. ePub and Markdown are interesting options but more exploration is needed.
Copyright © 2023 Martin Fenner. Distributed under the terms of the Creative Commons Attribution 4.0 License.