One of my personal highlights at last week's Research Data Alliance (RDA) 6th Plenary Meeting in Paris was the Data Packages Birds of a Feather (BoF) session, organized by Rufus Pollock from the Open Knowledge Foundation (OKFN). He highlighted the urgent need for packaging data in a standard format to facilitate reuse, and described the extensive work the OKFN has done on data packages. A particular focus is on packaging CSV files, the most widely used format for exchanging data.
I have been sold on the importance of CSV and the idea of packaging data in a standard format since attending CSVconf (co-organized by Rufus) in July 2014, and have written about this several times (Build Roads not Stagecoaches, Reference Lists and Tables of Content), most recently a few weeks ago (Using YAML frontmatter with CSV). In the RDA session I suggested two important improvements to the OKFN data package format:
- Single file. One very important aspect of packaging is to provide everything in a single file, generated by zipping a folder of multiple files. This pattern has become very common and is used, for example, for electronic books (`epub`), Microsoft Word documents (`docx`), or Google Chrome extensions (`crx`).
- Identifier. The OKFN data package spec uses a `name` attribute to uniquely identify the package. This approach falls short because enforcing the uniqueness of a human-readable identifier requires a registry, and this is not part of the data package spec. What is needed is a persistent identifier, and DataCite obviously has a lot of experience in this area.
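To make the two suggestions concrete, here is a minimal sketch in Python of what a single-file package could look like. It assumes a `datapackage.json` descriptor next to a CSV file, as in the OKFN data package format. The `identifier` field, its placeholder DOI value, the `pack_folder` helper and all file names are my own illustration and not part of any spec.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def pack_folder(folder: Path, target: Path) -> Path:
    """Zip every file under `folder` into the single file `target`."""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Store paths relative to the folder so the archive unpacks cleanly.
                archive.write(path, path.relative_to(folder).as_posix())
    return target


# Build a tiny example package in a temporary directory.
workdir = Path(tempfile.mkdtemp())
package = workdir / "example-package"
package.mkdir()
(package / "data.csv").write_text("year,count\n2014,3\n2015,5\n", encoding="utf-8")

descriptor = {
    "name": "example-package",
    # Hypothetical field: a persistent identifier such as a DOI (placeholder value).
    "identifier": "doi:10.1234/example",
    "resources": [{"path": "data.csv"}],
}
(package / "datapackage.json").write_text(
    json.dumps(descriptor, indent=2), encoding="utf-8"
)

# The zip file is written next to the folder, not inside it.
container = pack_folder(package, workdir / "example-package.zip")
with zipfile.ZipFile(container) as archive:
    names = sorted(archive.namelist())
print(names)
```

The point of the sketch is only that the descriptor and the data travel together in one file, and that the identifier lives inside the package rather than in a file name.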
Container-based digital infrastructure is a very hot topic thanks to Docker, and it has become clear that tools and workflows are at least as important as the spec for the container itself. If we want to move to scholarly infrastructure based on containers (a term I use here interchangeably with packages), we need registries for these containers that provide not only globally unique identifiers but also a central index for finding them: the Docker Hub for scholarly containers. DataCite provides persistent identifiers and standardized metadata for scholarly content, with a focus on research data, and is therefore in a perfect position to become such a registry. In the RDA session I therefore said:
I want DataCite to become a registry for data containers
While data, in particular in CSV format, are probably the first and most important use case, containers make sense for all scholarly content, and DataCite DOIs are used for text documents, images, software, etc. in addition to datasets. For this reason I prefer the term scholarly container over data container.
As important as the containers themselves are the tools and services that work with them, in particular for packing and unpacking. The `CSVY` format discussed here in a recent blog post could be used by an individual as an intermediate step towards a data container. I see tool support as the critical step that decides whether scholarly containers take off as a standard format. Karthik Ram from rOpenSci attended the session in Paris (and CSV.conf last year) and expressed great interest in adding support for scholarly containers to their suite of tools.
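As a rough idea of how little code the unpacking side might need, the following Python sketch reads a zipped container and loads every CSV resource listed in its descriptor. It assumes the container holds a `datapackage.json` with a `resources` list of `path` entries. The `read_container` function and the sample data are hypothetical and only serve the example; this is not how rOpenSci or any existing tool does it.

```python
import csv
import io
import json
import zipfile


def read_container(source):
    """Return the descriptor and the rows of every CSV resource it lists."""
    with zipfile.ZipFile(source) as archive:
        descriptor = json.loads(archive.read("datapackage.json"))
        tables = {}
        for resource in descriptor.get("resources", []):
            path = resource["path"]
            text = archive.read(path).decode("utf-8")
            # Each row becomes a dict keyed by the CSV header; values stay strings.
            tables[path] = list(csv.DictReader(io.StringIO(text)))
    return descriptor, tables


# Build a small container in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "datapackage.json",
        json.dumps({"name": "example-package", "resources": [{"path": "data.csv"}]}),
    )
    archive.writestr("data.csv", "year,count\n2014,3\n2015,5\n")
buffer.seek(0)

descriptor, tables = read_container(buffer)
print(descriptor["name"], tables["data.csv"])
```

A real tool would also need to validate the descriptor and resolve the package identifier against a registry, which is exactly the part that is still missing.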
Next Steps
- specify the work needed for DataCite to fully support scholarly containers
- work with Rufus and OKFN, e.g. on registry support and packaging into a single file
- work with the broader community on supporting scholarly containers: data repositories, reference managers, tools to analyze datasets, etc.
- propose a pre-conference workshop for the Force2016 conference in April 2016. This conference started out as Beyond the PDF in 2011, and scholarly containers are a perfect thematic fit.
Acknowledgments
This blog post was originally published on the DataCite Blog.