Data citation is core to DataCite's mission and DataCite is involved in several projects that try to facilitate data citation, including THOR, Data Citation Implementation Pilot (DCIP), Research Data Alliance (RDA), and COPDESS. The biggest roadblock for wider data citation adoption might be insufficient incentives for individual researchers, but another major challenge is that implementing data citation is still too complicated.
When we talk about data citation, we typically mean two related but different scenarios:
The first scenario is not conceptually different from an article citing another article, where the common practice is to put everything that is cited into the reference list.
The second scenario is probably more common, and it also requires more complex workflows, e.g. coordinating the issuing of persistent identifiers for article and data and linking them together via metadata. As a community we are still working on common practices for doing this. Assuming again that incentives are the biggest driver of change, I would argue that researchers, publishers, and funders are all interested in making this work, but that data repositories have the strongest motivation to improve the current situation. If this is true, then we should give data repositories a bigger role in the publication of data associated with an article.
While many publishers host supplementary information for articles, they leave the hosting of more complex research data to external data repositories specialized in this task. Properly referencing all associated data in the article is currently the job of the publisher, and I propose that we give more of this responsibility to the data repository. The data repository can create a data catalog card (with associated persistent identifier and metadata) that describes all data associated with an article. The data catalog card is a collection of metadata, and different from a data paper. The data described in the catalog card can be hosted in that repository or elsewhere.
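To make the idea more concrete, here is a minimal sketch of what the metadata in a data catalog card might look like, and how it could be turned into a short data availability statement. This is an illustration only, not a DataCite specification: the field names are loosely modeled on DataCite metadata (`relatedIdentifiers` with the relation types `HasPart` and `IsSupplementTo`), all DOIs are made-up placeholders, and the helper functions `related` and `data_availability_statement` are hypothetical.

```python
# Hypothetical data catalog card, loosely modeled on DataCite metadata fields.
# All identifiers are made-up placeholders, not registered DOIs.
card = {
    "identifier": "10.5072/example-card",
    "resourceTypeGeneral": "Collection",
    "title": "Data associated with an example article",
    "relatedIdentifiers": [
        # The article that the card belongs to.
        {"relationType": "IsSupplementTo",
         "relatedIdentifier": "10.5072/example-article"},
        # The datasets described by the card; they may be hosted
        # in the same repository or elsewhere.
        {"relationType": "HasPart",
         "relatedIdentifier": "10.5072/example-dataset-1"},
        {"relationType": "HasPart",
         "relatedIdentifier": "10.5072/example-dataset-2"},
    ],
}


def related(card, relation_type):
    """Return the identifiers linked from the card with the given relation type."""
    return [
        item["relatedIdentifier"]
        for item in card["relatedIdentifiers"]
        if item["relationType"] == relation_type
    ]


def data_availability_statement(card):
    """Format the card as a one-sentence data availability statement."""
    datasets = ", ".join(
        f"https://doi.org/{doi}" for doi in related(card, "HasPart")
    )
    return (
        "Data for this article are described at "
        f"https://doi.org/{card['identifier']} ({datasets})."
    )


print(data_availability_statement(card))
```

In this sketch the article only needs to know one identifier, that of the card, while the repository maintains the list of individual datasets behind it.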
The publisher then links to this data catalog card via the article metadata and can display the catalog card formatted as a data availability statement. The publisher could (and should) still link to individual data where appropriate, but the proposed solution helps solve several important issues:
Several general-purpose data repositories already provide most or all of this functionality. I am most familiar with Dryad, BioStudies (McEntyre, Sarkans, & Brazma, 2015) and Figshare (Hyndman, 2016). Data catalog cards probably work best for repositories that are flexible in the kinds of data they take, and for repositories that already have integrations with publishers. Not every data repository needs to support this functionality. Data catalog cards are also an opportunity for differentiation, e.g. by providing data curation, help with data review, etc.
My thinking about this topic was triggered by a conversation with Tim Clark in the context of the DCIP project. The guest post by Dan S. Katz (Katz, 2016) and the discussion around it was another important motivation, and a DataCite blog post from last August (Fenner, 2015) contains some of the ideas expressed here. Obviously this topic is of great interest to DataCite, as we hope that data catalog cards use DataCite DOIs, and that we can help both with making article/data publishing workflows easier, and with discovering data associated with an article.
This blog post was originally published on the DataCite Blog.
Fenner, M. (2015). Reference lists and tables of content. DataCite Blog. Retrieved from https://blog.datacite.org/reference-lists-and-tables-of-content
Hyndman, A. (2016). Unveiling figshare 'collections' - a new way to group content. Figshare Blog. Retrieved from https://figshare.com/blog/Unveiling_figshare_Collections_a_new_way_to_group_content/202
Katz, D. S. (2016). To better understand research communication, we need a groid (group object identifier). DataCite Blog. Retrieved from https://blog.datacite.org/to-better-understand-research-communication-we-need-a-groid-group-object-identifier
McEntyre, J., Sarkans, U., & Brazma, A. (2015). The BioStudies database. Molecular Systems Biology, 11(12), 847–847. https://doi.org/10.15252/msb.20156658