Data catalog cards: simplifying article/data linking

Data citation is core to DataCite's mission and DataCite is involved in several projects that try to facilitate data citation, including THOR, Data Citation Implementation Pilot (DCIP), Research Data Alliance (RDA), and COPDESS. The biggest roadblock for wider data citation adoption might be insufficient incentives for individual researchers, but another major challenge is that implementing data citation is still too complicated.

Citation needed. By User:Tfinc (Own work) CC BY-SA 3.0, via Wikimedia Commons

When we talk about data citation, we typically mean two related, but different scenarios:

  1. an article or other scholarly work cites an already published dataset.
  2. the data and related metadata underlying the findings reported in a submitted manuscript are deposited in an appropriate public repository, as required for example by the PLOS data availability statement.

The first scenario is not conceptually different from an article citing another article, where the common practice is to put everything that is cited into the reference list.

The second scenario is probably more common, and it also requires more complex workflows, e.g. coordinating the issuing of persistent identifiers for article and data and linking them together via metadata. We as a community are still working on common practices for doing this. Assuming again that incentives are the biggest driver of change, I would argue that researchers, publishers, and funders are all interested in making this work, but that data repositories have the strongest motivation to improve the current situation. If this is true, then we should give data repositories a bigger role in the publication of data associated with an article.

While many publishers host supplementary information for articles, they leave the hosting of more complex research data to external data repositories specialized in this task. Properly referencing all associated data in the article is currently the job of the publisher, and I propose that we give more of this responsibility to the data repository. The data repository can create a data catalog card (with associated persistent identifier and metadata) that describes all data associated with an article. The data catalog card is a collection of metadata, and different from a data paper. The data described in the catalog card can be hosted in that repository or elsewhere.
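To make the idea more concrete, here is a minimal sketch of what a data catalog card could look like as a metadata record. This is an illustration only, not a proposed schema: the class names, field names, repository names, and all identifiers are made up for this example, and a real card would presumably follow an established metadata schema instead.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetEntry:
    """One dataset described by the card; it may be hosted in any repository."""
    identifier: str        # DOI, URL, or repository accession number
    identifier_type: str   # e.g. "DOI" or "Accession" (hypothetical vocabulary)
    repository: str        # where the data is actually hosted


@dataclass
class DataCatalogCard:
    """Metadata-only record listing all data associated with one article."""
    card_id: str           # persistent identifier of the card itself
    article_id: str        # persistent identifier of the article
    datasets: List[DatasetEntry] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the card so that it can be exchanged between systems."""
        return json.dumps(asdict(self), indent=2)


# Placeholder identifiers, not real DOIs or accession numbers.
card = DataCatalogCard(
    card_id="10.5555/card.example",
    article_id="10.5555/article.example",
    datasets=[
        # Data hosted by the repository that issued the card.
        DatasetEntry("10.5555/dataset.example", "DOI", "Repository A"),
        # Data hosted elsewhere and known only by an accession number.
        DatasetEntry("ACC0001", "Accession", "Repository B"),
    ],
)

print(card.to_json())
```

The card holds no data itself: it has its own persistent identifier, points to the article, and lists datasets regardless of where they are hosted.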

The publisher then links to this data catalog card via the article metadata and can display the catalog card formatted as a data availability statement. The publisher could (and should) still link to individual data where appropriate, but the proposed solution helps solve several important issues:

  • the data catalog card simplifies manuscript submission for publishers
  • the data catalog card provides a machine-readable representation of the data availability statement that publishers are increasingly requiring
  • the publisher doesn't need to provide machine-readable metadata for all data used in an article, but can reference the data catalog card. Accession numbers that are not globally unique can be used in the article if they are properly referenced in the data catalog card. This facilitates the transition from current practices
  • some articles refer to thousands of datasets (e.g. genomics papers), and this number of links is difficult to describe in the traditional article format (e.g. JATS)
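The points above can be sketched in a few lines of code: a publisher takes a machine-readable card, formats it as a data availability statement, and looks up accession numbers mentioned in the article. This is a hedged sketch under assumptions. The dictionary layout, the function names `render_statement` and `resolve_accession`, and all identifiers are hypothetical, and the wording of the generated statement is only one possibility.

```python
from typing import Optional


def render_statement(card: dict) -> str:
    """Format a card as a human-readable data availability statement."""
    by_repo = {}
    for dataset in card["datasets"]:
        # Group identifiers by hosting repository to keep the text short.
        by_repo.setdefault(dataset["repository"], []).append(dataset["identifier"])
    parts = [
        f"{repo} ({', '.join(ids)})" for repo, ids in sorted(by_repo.items())
    ]
    return (
        "All data underlying this article are described in data catalog card "
        f"{card['card_id']}: " + "; ".join(parts) + "."
    )


def resolve_accession(card: dict, accession: str) -> Optional[dict]:
    """Find the card entry for an accession number that is not globally unique."""
    for dataset in card["datasets"]:
        if dataset["identifier"] == accession:
            return dataset
    return None


# Placeholder card; identifiers and repository names are made up.
card = {
    "card_id": "10.5555/card.example",
    "article_id": "10.5555/article.example",
    "datasets": [
        {"identifier": "10.5555/dataset.example", "repository": "Repository A"},
        {"identifier": "ACC0001", "repository": "Repository B"},
        {"identifier": "ACC0002", "repository": "Repository B"},
    ],
}

print(render_statement(card))
print(resolve_accession(card, "ACC0002"))
```

Because the article metadata only needs the identifier of the card, the same approach works whether the card lists three datasets or several thousand.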

Several general-purpose data repositories already provide most or all of this functionality; I am most familiar with Dryad, BioStudies (McEntyre, Sarkans, & Brazma, 2015) and Figshare (Hyndman, 2016). Data catalog cards probably work best for repositories that are flexible in the kinds of data they take, and for repositories that already have integrations with publishers. Not every data repository needs to support this functionality. Data catalog cards are also an opportunity for differentiation, e.g. by providing data curation, help with data review, etc.

My thinking about this topic was triggered by a conversation with Tim Clark in the context of the DCIP project. The guest post by Dan S. Katz (Katz, 2016) and the discussion around it were another important motivation, and a DataCite blog post from last August (Fenner, 2015) contains some of the ideas expressed here. Obviously this topic is of great interest to DataCite, as we hope that data catalog cards will use DataCite DOIs, and that we can help both with making article/data publishing workflows easier and with discovering data associated with an article.

Acknowledgments

This blog post was originally published on the DataCite Blog. 

References

Fenner M. Reference Lists and Tables of Content. Published online August 15, 2015. doi:10.53731/r795v41-97aq74v-ag4cd

Unveiling figshare “Collections” - a new way to group content. Accessed July 2, 2023. https://figshare.com/blog/Unveiling_figshare_Collections_a_new_way_to_group_content/202

Katz DS. To better understand research communication, we need a GROUPID (group object identifier). Published online April 17, 2016. doi:10.5438/SHR4-2BS2

McEntyre J, Sarkans U, Brazma A. The BioStudies database. Molecular Systems Biology. 2015;11(12):847. doi:10.15252/msb.20156658