One of my personal highlights at last week's Research Data Alliance (RDA) 6th Plenary Meeting in Paris was the Data Packages Birds of a Feather (BoF) session, organized by Rufus Pollock from the Open Knowledge Foundation (OKFN). He highlighted the urgent need for packaging data in a standard format to facilitate reuse, and described the extensive work the OKFN has done on data packages. A particular focus is on packaging CSV files, the most widely used format for exchanging data.
I have been sold on the importance of CSV and the idea of packaging data in a standard format since attending CSVconf (co-organized by Rufus) in July 2014, and have written about this several times (Build Roads not Stagecoaches, Reference Lists and Tables of Content), most recently a few weeks ago (Using YAML frontmatter with CSV). In the RDA session I suggested two important improvements for the OKFN data package format:
... epub), Microsoft Word documents (docx), or Google Chrome extensions (...
... name attribute to uniquely identify the package. This approach falls short because enforcing the uniqueness of a human-readable identifier requires a registry, and this is not part of the data package spec. What is needed is a persistent identifier, and DataCite has a lot of experience in this area.
Container-based digital infrastructure is a very hot topic thanks to Docker, and it has become clear that tools and workflows are at least as important as the spec for the container itself. If we want to move to scholarly infrastructure based on containers (used here interchangeably with packages), we need registries for these containers that provide not only globally unique identifiers, but also a central index for finding them - the Docker Hub for scholarly containers. DataCite provides persistent identifiers and standardized metadata for scholarly content, with a focus on research data, and is therefore in a perfect position to become such a registry. In the RDA session I therefore said:
I want DataCite to become a registry for data containers
While data, in particular in CSV format, are probably the first and most important use case, containers make sense for all scholarly content, and DataCite DOIs are used for text documents, images, software, etc. in addition to datasets. For this reason I prefer the term scholarly container over data container.
As important as the containers themselves are tools and services that work with them, in particular packing and unpacking. The CSVY format discussed here in a recent blog post could be used by an individual as an intermediate step towards a data container. I see tool support as the critical step that decides whether scholarly containers take off as a standard format. Karthik Ram from rOpenSci attended the session in Paris (and CSV.conf last year) and expressed great interest in adding support for scholarly containers in their suite of tools.
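To make the packing and unpacking step concrete, here is a minimal sketch in Python of what such a tool might do: bundle a CSV file and a datapackage.json descriptor into a single zip file, and read them back. The function names, the file name data.csv, and the use of an "id" field to hold a persistent identifier are assumptions made for illustration, not part of any agreed container format, and the DOI shown is a placeholder.

```python
import io
import json
import zipfile


def pack_container(csv_text, name, identifier):
    """Pack one CSV file and a descriptor into a single zip archive (as bytes)."""
    descriptor = {
        "name": name,
        # Assumption: the persistent identifier lives in an "id" field.
        # Where it would actually go is for the spec to decide.
        "id": identifier,
        "resources": [{"path": "data.csv", "format": "csv"}],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("datapackage.json", json.dumps(descriptor, indent=2))
        archive.writestr("data.csv", csv_text)
    return buffer.getvalue()


def unpack_container(container_bytes):
    """Return the descriptor and a {path: text} dict for each listed resource."""
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as archive:
        descriptor = json.loads(archive.read("datapackage.json"))
        files = {
            resource["path"]: archive.read(resource["path"]).decode("utf-8")
            for resource in descriptor["resources"]
        }
    return descriptor, files


# Hypothetical usage with a placeholder identifier (not a registered DOI).
container = pack_container("a,b\n1,2\n", "example", "10.1234/placeholder")
descriptor, files = unpack_container(container)
```

A registry would then only need the identifier and the descriptor metadata to index the container, while the zip file itself can live anywhere.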
This blog post was originally published on the DataCite Blog.