One of my personal highlights at last week's Research Data Alliance (RDA) 6th Plenary Meeting in Paris was the Data Packages Birds of a Feather (BoF), organized by Rufus Pollock from the Open Knowledge Foundation (OKFN). He highlighted the urgent need for packaging data in a standard format to facilitate reuse, and described the extensive work the OKFN has done on data packages. A particular focus is on packaging CSV files, the most widely used format for exchanging data.
I have been sold on the importance of CSV and the idea of packaging data in a standard format since attending CSVconf (co-organized by Rufus) in July 2014, and have written about this several times (Build Roads not Stagecoaches, Reference Lists and Tables of Content), most recently a few weeks ago (Using YAML frontmatter with CSV). In the RDA session I suggested two important improvements for the OKFN data package format:
epub), Microsoft Word documents (docx), or Google Chrome extensions (
name attribute to uniquely identify the package. This approach falls short because enforcing the uniqueness of a human-readable identifier requires a registry, and this is not part of the data package spec. What is needed is a persistent identifier, and DataCite obviously has a lot of experience in this area.
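To make the naming problem concrete, here is a minimal sketch in Python. It assumes a descriptor with `name` and `resources` keys; the `id` key, the helper names `make_descriptor` and `names_collide`, the file paths, and the DOI values are all placeholders chosen for illustration, not something the data package spec discussed here prescribes.

```python
def make_descriptor(name, csv_path, identifier=None):
    """Build a minimal data-package-style descriptor as a plain dict."""
    descriptor = {
        "name": name,
        "resources": [{"path": csv_path, "format": "csv"}],
    }
    if identifier is not None:
        # Hypothetical field for a persistent identifier such as a DOI.
        descriptor["id"] = identifier
    return descriptor


def names_collide(a, b):
    """True if two distinct packages share the same human-readable name."""
    return a["name"] == b["name"] and a.get("id") != b.get("id")


# Two unrelated authors pick the same name; nothing stops them without a registry.
package_a = make_descriptor("climate-data", "a/data.csv", "doi:10.5072/example-a")
package_b = make_descriptor("climate-data", "b/data.csv", "doi:10.5072/example-b")

collision = names_collide(package_a, package_b)
```

The two descriptors are indistinguishable by name alone, while the registry-issued identifier keeps them apart.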
Container-based digital infrastructure is a very hot topic thanks to Docker, and it has become clear that tools and workflows are at least as important as the spec for the container itself. If we want to move to scholarly infrastructure based on containers (used here interchangeably with packages), we need registries for these containers that provide not only globally unique identifiers but also a central index for finding them: the Docker Hub for scholarly containers. DataCite provides persistent identifiers and standardized metadata for scholarly content, with a focus on research data, and is therefore in a perfect position to become such a registry. In the RDA session I therefore said:
I want DataCite to become a registry for data containers
While data, in particular in CSV format, are probably the first and most important use case, containers make sense for all scholarly content, and DataCite DOIs are used for text documents, images, software, etc. in addition to datasets. For this reason I prefer the term scholarly container over data container.
As important as the containers themselves are tools and services that work with them, in particular for packing and unpacking. The CSVY format discussed here in a recent blog post could be used by an individual as an intermediate step towards a data container. I see tool support as the critical step that decides whether scholarly containers take off as a standard format. Karthik Ram from rOpenSci attended the session in Paris (and CSVconf last year) and expressed great interest in adding support for scholarly containers to their suite of tools.
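As a rough idea of what minimal tool support for the CSVY step could look like, the following Python sketch separates the YAML frontmatter from the CSV body. It assumes the frontmatter sits between two lines containing only `---`; the function name `split_csvy` and the sample content are made up for illustration, and the YAML is returned as raw text rather than parsed so that only the standard library is needed.

```python
import csv
import io


def split_csvy(text):
    """Split CSVY text into (frontmatter_text, csv_rows).

    Assumes the YAML frontmatter is delimited by lines containing only '---'.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        # No frontmatter: treat the whole text as plain CSV.
        return "", list(csv.reader(io.StringIO(text)))
    # Find the line that closes the frontmatter block.
    end = None
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            end = index
            break
    if end is None:
        raise ValueError("frontmatter is not terminated by '---'")
    frontmatter = "\n".join(lines[1:end])
    body = "\n".join(lines[end + 1:])
    return frontmatter, list(csv.reader(io.StringIO(body)))


# Made-up sample file content.
sample = """---
name: example-table
---
year,count
2014,3
2015,5
"""

frontmatter, rows = split_csvy(sample)
```

A packing tool could then write the metadata into a container descriptor and store the rows as a plain CSV file next to it.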
This blog post was originally published on the DataCite Blog.