Authoring of scholarly articles has been a recurring theme on this blog since it started in 2008. Authoring is still in desperate need of improvement, and nobody has convincingly figured out how to solve this problem. Authoring involves several steps, and it helps to think about them separately:
Although authoring typically involves text, similar issues arise for other research outputs, e.g. research data. These considerations are also relevant for other forms of publishing, whether that is self-publication on a blog or website, or the publishing of preprints and white papers.
For me the main challenge in authoring is to go from human-readable unstructured content to highly structured machine-readable content. We could make authoring simpler either by forgoing any structure and just publishing in any format we want, or by forcing authors to structure their manuscripts according to a very specific set of rules. The former doesn’t seem to be an option: not only would it ignore community standards that have evolved over a very long time (research articles, for example, have title, authors, results, references, etc.), but it would also make it hard to find and reuse scholarly research by others.
The latter option is also not really viable, since most researchers haven’t learned to produce their research outputs in machine-readable, highly standardized formats. There are some exceptions, e.g. CONSORT and other reporting standards in clinical medicine, or semantic publishing in crystallography, but for the most part research outputs are too diverse to easily find a format that works for all of them. The current trend is certainly towards machine-readable rather than human-readable content, but there is still a significant gap - scholarly articles are transformed from documents in Microsoft Word (or sometimes LaTeX) format into XML (for most biomedical research that means JATS) using kludgy tools and lots of manual labor.
What solutions have been tried to overcome the limitations of our current authoring tools, and to make the process more enjoyable for authors and more productive for publishers?
Solution 1. isn’t really an option, as it makes scholarly publishing unnecessarily slow and expensive. Typesetter Kaveh Bazargan went on record at the SpotOn London Conference 2012, saying that the current process is insane and that he wants to be “put out of business”.
Solution 2. is probably the workflow most commonly used by larger publishers today, but it is very much centered around converting Microsoft Word to XML. LaTeX is a popular authoring environment in some disciplines, but it still requires work to convert documents into web-friendly formats such as HTML and XML.
Solutions 3. to 5. have never gained any significant traction. Overall the progress in this area has been modest at best, and mainstream authoring today isn’t too different from 20 years ago. Although I have gone on record as saying that Scholarly Markdown has a lot of potential, the problem is much bigger than finding a single file format, and markdown will never be the solution for all authoring needs.
Solution 6. is an area where a lot of exciting development is currently happening; examples include Authorea, writeLaTeX and ShareLaTeX. Although the future of scholarly authoring will certainly include online authoring tools (which make it much easier to collaborate, one of the authoring pain points), we run the risk of locking users into one particular authoring environment.
How can we move forward? I would suggest the following:
All this requires a lot of work. Pandoc is a good model to start with, but it is written in Haskell, a functional programming language that not many people are familiar with. For small changes Pandoc allows you to directly manipulate the AST (represented as JSON) using filters written in Haskell or Python. Custom writers for other document formats can be written in Lua, another interesting programming language that not many people know about. Lua is a fast and relatively easy to learn scripting language that can be easily embedded into other languages, and for similar reasons it is also used to extend the functionality of Wikipedia. PEG parsers in other languages include Treetop (Ruby), PEG.js (JavaScript), and ANTLR, a popular parser generator that also includes PEG features.
But I think the effort to build a solid open source conversion tool for scholarly documents is worth it, in particular for smaller publishers and publishing platforms who can’t afford the commercial Microsoft Word to JATS conversion tools. We shouldn’t take any shortcuts - e.g. by focussing on XML and XSLT transforms - and we can improve this tool over time, e.g. by starting with a few input and output formats. This tool will be valuable beyond authoring, as it can also be very helpful for converting published scholarly content into other formats such as ePub, and for text mining, which in many ways tries to solve the same problems. The Pandoc documentation includes an example of extracting all URLs out of a document, and this can be modified to extract other content. In case you wonder whether I gave up on the idea of Scholarly Markdown - not at all. To me this is a logical next step, opening up journal submission systems to Scholarly Markdown and other evolving file formats. And Pandoc, one of the most interesting tools in this space, is a markdown conversion tool at its heart. The next steps could be the following: