Authoring of scholarly articles has been a recurring theme on this blog since it started in 2008. Authoring is still in desperate need of improvement, and nobody has convincingly figured out how to solve this problem. Authoring involves several steps, and it helps to think about them separately:
- Writing. Manuscript writing, including formatting, collaborative authoring
- Submission. Formatting a manuscript according to a publisher’s author guidelines, and handing it over to a publishing platform
- Revision. Changes made to a manuscript in the peer review process, or after publication
Although authoring typically involves text, similar issues arise for other research outputs, e.g. research data. And these considerations are also relevant for other forms of publishing, whether it is self-publication on a blog or website, or publishing of preprints and white papers.
For me the main challenge in authoring is to go from human-readable unstructured content to highly structured machine-readable content. We could make authoring simpler either by forgoing any structure and publishing in any format we want, or by forcing authors to structure their manuscripts according to a very specific set of rules. The former doesn’t seem to be an option: not only do we have a set of community standards that have evolved over a very long time (research articles, for example, have title, authors, results, references, etc.), but giving up structure would also make it hard to find and reuse scholarly research by others.
The latter option is also not really viable since most researchers haven’t learned to produce their research outputs in machine-readable highly standardized formats. There are some exceptions, e.g. CONSORT and other reporting standards in clinical medicine or the semantic publishing in Crystallography, but for the most part research outputs are too diverse to easily find a format that works for all of them. The current trend is certainly towards machine-readable rather than towards human-readable, but there is still a significant gap - scholarly articles are transformed from documents in Microsoft Word (or sometimes LaTeX) format into XML (for most biomedical research that means JATS) using kludgy tools and lots of manual labor.
What solutions have been tried to overcome the limitations of our current authoring tools, and to make the process more enjoyable for authors and more productive for publishers?
1. Do the conversion manually, still a common workflow.
2. Tools for publishers such as eXtyles and Merops (both commercial), or the evolving open source mPach, that convert Microsoft Word documents into JATS XML and do a lot of automated checks along the way.
3. Tools for authors that directly generate JATS XML, either as a Microsoft Word plugin (the Article Authoring Add-In, not actively maintained), in the browser (e.g. Lemon8-XML, not actively maintained), or directly in a publishing platform such as Wordpress (Annotum).
4. Forget about XML and use HTML5 as the canonical file format, e.g. as Scholarly HTML or HTML5 specifications such as HTMLBook. Please read Molly Sharp’s blog post for background information about HTML as an alternative to XML.
5. Use file formats for authoring that are a better fit for the requirements of scholarly authors, in particular Scholarly Markdown.
6. Build online editors for scientific content that hide the underlying file format, and guide users towards a structured format, e.g. by not allowing input that doesn’t conform to specifications.
Solution 1 isn’t really an option, as it makes scholarly publishing unnecessarily slow and expensive. Typesetter Kaveh Bazergan went on record at the SpotOn London Conference 2012, saying that the current process is insane and that he wants to be “put out of business”.
Solution 2 is probably the workflow most commonly used by larger publishers today, but it is very much centered around converting Microsoft Word to XML. LaTeX is a popular authoring environment in some disciplines, but still requires work to convert documents into web-friendly formats such as HTML and XML.
Solutions 3 to 5 have never gained any significant traction. Overall the progress in this area has been modest at best, and the mainstream of authoring today isn’t too different from 20 years ago. Although I have gone on record saying that Scholarly Markdown has a lot of potential, the problem is much bigger than finding a single file format, and markdown will never be the solution for all authoring needs.
Solution 6 is an area where a lot of exciting development is currently happening; examples include Authorea, WriteLaTeX and ShareLaTeX. Although the future of scholarly authoring will certainly include online authoring tools (making it much easier to collaborate, one of the authoring pain points), we run the risk of locking users into one particular authoring environment.
Going Forward
How can we move forward? I would suggest the following:
- Publishers should accept manuscripts in any reasonable file format, which means at least Microsoft Word, Open Office, LaTeX, Markdown, HTML and PDF, but possibly more. This will create a lot of extra work for publishers, but will open the doors for innovation, both in the academic and commercial sector. We will never see significant progress in scholarly authoring tools if the submission step requires manuscripts to be in a single file format (Microsoft Word) - in particular since this file format is a general purpose word processing format and not something designed specifically for scholarly content. And we want researchers to spend their time doing research and writing up their research, not formatting documents.
- To handle this avalanche of unstructured documents, publishers need conversion tools that can transform all these documents into a format that can feed into their editorial and publishing workflows. A limited number of these tools exist already, but this will require a significant development effort. Again, opening up submissions to a variety of file formats will not only foster innovation in authoring tools, but also in document conversion tools.
- We should think beyond XML. Many of the workflows designed today center around conversions from one XML format to another, e.g. Microsoft Word to JATS or TEI (popular in the humanities), often using XSLT transforms. Not only is XML difficult for humans to read or edit, but the web and many of the technologies built around it are moving away from XML towards HTML5 and JSON. XML is fine as an important output format for publishing, but maybe not the best format to hold everything together.
- As we haven’t come up with a canonical file format for scholarly documents by now, we should give up that idea. XML is great for publisher workflows, but is not something humans can easily edit or read. PDF is still the most widely read format by humans, but is not a good intermediary format. LaTeX is too complex for authors outside of mathematics, physics and related fields, and is not built with web standards in mind. Markdown is promising, but doesn’t easily support highly structured content. And HTML5 and the related ePub are widely popular, but can be hard to edit without a visual editor, and currently don’t include enough standard metadata to support scholarly content out of the box.
- The focus should not be on canonical file formats for scholarly documents, but on tools that understand the manuscripts created by researchers and can transform them into something more structured. As we have learned from document conversion tools such as Pandoc, we can’t do this with a simple find and replace using regular expressions, but need a more structured approach. Pandoc takes the input document (markdown, LaTeX or HTML) apart and constructs an abstract syntax tree (AST) of the document, using a parsing expression grammar (PEG), which includes a set of parsing rules. Parsing expression grammars are fairly new, first described by Bryan Ford about 10 years ago, but in my mind are a very good fit for the formal grammar of scientific documents. It should be fairly straightforward to generate a variety of output formats from the AST (Pandoc can convert into more than 30 document formats); the hard part is the parsing of the input.
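To make the parse-once, write-many idea in the last point concrete, here is a minimal sketch in Python. It is not Pandoc and not a real PEG parser: the toy input syntax (blank-line separated blocks, `# ` for headings), the node classes and the function names are all assumptions made up for illustration, and the XML writer only imitates the flavour of JATS.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Heading:
    text: str


@dataclass
class Para:
    text: str


def parse(source):
    """Parse a toy markup into a list of AST nodes.

    Blocks are separated by blank lines; a block starting with '# '
    becomes a Heading, anything else becomes a Para.
    """
    nodes = []
    for block in source.strip().split("\n\n"):
        text = " ".join(line.strip() for line in block.splitlines()).strip()
        if text.startswith("# "):
            nodes.append(Heading(text[2:]))
        elif text:
            nodes.append(Para(text))
    return nodes


def to_html(nodes):
    """Writer 1: render the AST as HTML, one element per line."""
    tags = {Heading: "h1", Para: "p"}
    return "\n".join(
        f"<{tags[type(n)]}>{escape(n.text)}</{tags[type(n)]}>" for n in nodes
    )


def to_jats_like(nodes):
    """Writer 2: render the same AST as simplified, JATS-flavoured XML."""
    tags = {Heading: "title", Para: "p"}
    body = "".join(
        f"<{tags[type(n)]}>{escape(n.text)}</{tags[type(n)]}>" for n in nodes
    )
    return f"<sec>{body}</sec>"


tree = parse("# Results\n\nWe found a < b.")
print(to_html(tree))
print(to_jats_like(tree))
```

Both writers are a few lines each because all the decisions about structure were already made in `parse`. That is the asymmetry described above: adding an output format is cheap, while making the parser understand real manuscripts is where the work is.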
All this requires a lot of work. Pandoc is a good model to start, but is written in Haskell, a functional programming language that not many people are familiar with. For small changes Pandoc allows you to directly manipulate the AST (represented as JSON) using filters written in Haskell or Python. And custom writers for other document formats can be written using Lua, another interesting programming language that not many people know about. Lua is a fast and relatively easy to learn scripting language that can be easily embedded into other languages, and for similar reasons is also used to extend the functionality of Wikipedia. PEG parsers in other languages include Treetop (Ruby), PEG.js (JavaScript), and ANTLR, a popular parser generator that also includes PEG features.
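Working with the JSON form of the AST needs surprisingly little machinery. The sketch below uses only the Python standard library to walk a document tree and collect the target of every link. The sample document is hand-written to stand in for the output of `pandoc -t json`, and the layout of the `Link` element (attributes, link text, then `[url, title]`) is an assumption that holds for recent Pandoc versions but has changed in the past; the function names are my own.

```python
import json


def iter_elements(node):
    """Yield every {"t": ..., "c": ...} element in a Pandoc JSON tree."""
    if isinstance(node, dict):
        if "t" in node:
            yield node
        for value in node.values():
            yield from iter_elements(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_elements(item)


def link_targets(doc):
    """Return the URL of every Link element, in document order."""
    urls = []
    for element in iter_elements(doc):
        if element["t"] == "Link":
            # Assumed layout: [attributes, link text, [url, title]]
            _attr, _text, target = element["c"]
            urls.append(target[0])
    return urls


# Hand-written stand-in for the output of `pandoc -t json`.
SAMPLE = json.loads("""
{"pandoc-api-version": [1, 22],
 "meta": {},
 "blocks": [
   {"t": "Para", "c": [
     {"t": "Str", "c": "See"},
     {"t": "Space"},
     {"t": "Link", "c": [["", [], []],
                         [{"t": "Str", "c": "Pandoc"}],
                         ["https://pandoc.org", ""]]}
   ]}
 ]}
""")

print(link_targets(SAMPLE))
```

A real filter would read the JSON from standard input and write the (possibly modified) tree back to standard output, so that Pandoc can continue with the chosen writer.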
But I think the effort to build a solid open source conversion tool for scholarly documents is worth it, in particular for smaller publishers and publishing platforms who can’t afford the commercial Microsoft Word to JATS conversion tools. We shouldn’t take any shortcuts - e.g. by focussing on XML and XSLT transforms - and we can improve this tool over time, e.g. by starting with a few input and output formats. This tool will be valuable beyond authoring, as it can also be very helpful for converting published scholarly content into other formats such as ePub, and for text mining, which in many ways tries to solve the same problems. The Pandoc documentation includes an example of extracting all URLs out of a document, and this can be modified to extract other content. In case you wonder whether I gave up on the idea of Scholarly Markdown - not at all. To me this is a logical next step, opening up journal submission systems to Scholarly Markdown and other evolving file formats. And Pandoc, one of the most interesting tools in this space, is a markdown conversion tool at its heart. The next steps could be the following:
- write a custom writer in Lua that generates JATS output from Pandoc
- explore how difficult it would be to add Microsoft Word .docx as Pandoc input format
- develop Pandoc filters relevant for scholarly documents (e.g. auto-linking accession numbers of biomedical databases)
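As a rough idea of what the last of these steps could look like, the following Python sketch turns accession-like words in a Pandoc JSON document into links. Everything specific in it is an assumption for illustration: the pattern is a simplification loosely modelled on UniProt accessions, the resolver URL is a placeholder on example.org rather than a real service, and the function only looks at top-level paragraphs, not at lists, tables or other nested content.

```python
import re

# Assumption: simplified pattern loosely modelled on UniProt accessions,
# optionally followed by trailing punctuation.
ACCESSION = re.compile(r"^([OPQ][0-9][A-Z0-9]{3}[0-9])([.,;:)]*)$")
# Assumption: placeholder resolver URL, not a real service.
URL_TEMPLATE = "https://example.org/accession/{}"


def link_inlines(inlines):
    """Return a new inline list with accession-like words turned into links."""
    result = []
    for el in inlines:
        match = ACCESSION.match(el["c"]) if el["t"] == "Str" else None
        if match is None:
            result.append(el)
            continue
        accession, trailing = match.groups()
        result.append({
            "t": "Link",
            # Assumed layout: [attributes, link text, [url, title]]
            "c": [["", [], []],
                  [{"t": "Str", "c": accession}],
                  [URL_TEMPLATE.format(accession), ""]],
        })
        if trailing:
            # Keep punctuation outside the link.
            result.append({"t": "Str", "c": trailing})
    return result


def autolink(doc):
    """Link accessions in top-level Para/Plain blocks; returns a new document."""
    blocks = []
    for block in doc["blocks"]:
        if block["t"] in ("Para", "Plain"):
            block = {"t": block["t"], "c": link_inlines(block["c"])}
        blocks.append(block)
    return {**doc, "blocks": blocks}


# Hand-written stand-in for the JSON form of the sentence "Protein P12345."
SAMPLE = {
    "pandoc-api-version": [1, 22],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Str", "c": "Protein"},
        {"t": "Space"},
        {"t": "Str", "c": "P12345."},
    ]}],
}

linked = autolink(SAMPLE)
print([el["t"] for el in linked["blocks"][0]["c"]])
```

A production filter would need database-specific patterns and resolver URLs, and would have to guard against false positives, which is exactly the kind of domain knowledge a scholarly conversion tool could collect in one place.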