Persistent identifiers (PIDs) do more than uniquely identify a publication, dataset, or person: their metadata can also provide unambiguous links between persistent identifiers of the same type, e.g. journal articles citing other journal articles, or of different types, e.g. linking a researcher and the datasets they produced.
Work is needed to connect existing persistent identifiers to each other in standardized ways, e.g. to the outputs associated with a particular researcher, repository, institution or funder, for discovery and impact assessment. Some of the more complex but still important use cases can’t be addressed by simply collecting and aggregating links between two persistent identifiers, including
To address these use cases we need a more complex model to describe the resources that are identified by PIDs, and the connections between them: a graph. In graph terms, the resources identified by PIDs are the nodes of this graph, and the connections between PIDs are the edges.
Fig 1. A schematic representation of the PID graph with digital objects connected by PIDs, showing three use cases: A: Different versions of software code, B: Datasets hosted by a particular repository, C: All digital objects connected to a research object.
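The node-and-edge model can be illustrated with a few lines of Python. This is only a minimal sketch: the class name `PidGraph`, its methods, and the placeholder identifiers are assumptions made for illustration, not part of any FREYA or DataCite service. It treats connections as undirected and uses a breadth-first traversal to collect everything connected to one research object, roughly use case C in the figure.

```python
from collections import defaultdict, deque


class PidGraph:
    """Toy undirected graph: nodes are PIDs, edges are connections between PIDs."""

    def __init__(self):
        self._neighbors = defaultdict(set)

    def connect(self, pid_a, pid_b):
        """Record a connection (edge) between two PIDs (nodes)."""
        self._neighbors[pid_a].add(pid_b)
        self._neighbors[pid_b].add(pid_a)

    def connected_to(self, start):
        """Return every PID reachable from `start`, excluding `start` itself."""
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            # .get() avoids creating empty entries for unknown PIDs
            for neighbor in self._neighbors.get(current, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        seen.discard(start)
        return seen


# Placeholder identifiers, invented for this example
graph = PidGraph()
graph.connect("article-1", "dataset-1")
graph.connect("dataset-1", "software-1")
graph.connect("article-2", "dataset-2")
print(sorted(graph.connected_to("article-1")))
```

Collecting links pairwise would only reveal that `article-1` is connected to `dataset-1`; the traversal also finds `software-1`, which is two connections away.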
Using a graph makes it easier to describe these more complex use cases and relationships, and this approach has been frequently applied to similar questions in the past. FREYA builds on the expertise of, and close collaboration with, the Research Graph team and adopts the outputs of the Research Data Alliance DDRI Working Group to transform PID connections into an improved graph of research objects. The project takes advantage of best practices in graph modelling and distributed network analysis. We call this the PID Graph.
Before starting work on implementing the PID Graph, the FREYA partners collected user stories from their communities relevant to the PID Graph work. We used GitHub issues in a public repository for this activity and then met in person in August 2018 to clarify, group and prioritize these user stories. In total, we identified 48 user stories, described here. The main outcomes of the August 2018 workshop were:
After identifying and describing the most relevant use cases, summarized above, we started the implementation work for the FREYA PID Graph. Our goal was to implement the PID Graph as a standard production service rather than a research activity or pilot service, so scalability and maintainability are of utmost importance. We learned a lot from the extensive experience gained in the Research Graph initiative and decided to build the PID Graph using a set of federated RESTful JSON APIs that describe the resources (nodes) and connections (edges) in the graph. The PID Graph will not be a single service but will be federated between FREYA PID providers, FREYA disciplinary partners, and organizations outside of FREYA. All FREYA PID providers use RESTful JSON APIs to provide PID metadata, so this approach aligns with the extensive existing infrastructure.
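One way to picture the federated design from a client's point of view is as a merge over several providers, each answering for the connections it knows about. The sketch below is a thought experiment under stated assumptions, not the FREYA architecture: the `Edge` type, the `Provider` signature, the function `federated_edges`, and the two stub providers are all hypothetical, and a real provider would be an HTTP call to a RESTful JSON API rather than a local function.

```python
from typing import Callable, Iterable, NamedTuple, Set


class Edge(NamedTuple):
    """One connection in the graph: subject PID, relation label, object PID."""
    subj: str
    relation: str
    obj: str


# Hypothetical provider interface: given a PID, return the edges that touch it.
Provider = Callable[[str], Iterable[Edge]]


def federated_edges(pid: str, providers: Iterable[Provider]) -> Set[Edge]:
    """Ask every provider about `pid` and merge the answers, dropping duplicates."""
    merged: Set[Edge] = set()
    for fetch in providers:
        merged.update(fetch(pid))
    return merged


# Stub providers standing in for remote APIs (placeholder data)
def provider_a(pid: str) -> Iterable[Edge]:
    return [Edge("article-1", "references", pid)]


def provider_b(pid: str) -> Iterable[Edge]:
    return [Edge("article-1", "references", pid), Edge("person-1", "is-author-of", pid)]


edges = federated_edges("dataset-1", [provider_a, provider_b])
print(len(edges))
```

Because edges are plain tuples, the same connection reported by two providers collapses into a single entry in the merged set.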
The first working PID Graph implementation is provided by DataCite, extending the existing Event Data Service (Dasler & Cousijn, 2018), a collaboration between Crossref and DataCite. Event Data is a service that provides connections (here called events) between PIDs and other resources, with an initial focus on social media mentions and data citations. Since the August 2018 workshop, the initial PID Graph work by DataCite has added the following functionality to DataCite Event Data:
Include not only metadata about connections but also metadata about the resources identified by PIDs. This dramatically simplifies the API calls needed to construct a PID Graph. We do this by optionally including the metadata for the subj and obj (the resources linked via the event) in Event Data via an extra query parameter: https://api.datacite.org/events?include=subj,obj
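As a small illustration, the query URL can be assembled with the Python standard library. The endpoint and the `include` parameter come from the text above; the helper name `events_url` and its keyword-filter convenience are assumptions made for this sketch. The comma is marked as safe so the URL keeps the same shape as the one shown in the text.

```python
from urllib.parse import urlencode

EVENTS_ENDPOINT = "https://api.datacite.org/events"


def events_url(include=("subj", "obj"), **filters):
    """Build an Event Data query URL (hypothetical helper, not an official client).

    `include` lists the linked resources whose metadata should be embedded;
    any extra keyword arguments are passed through as query parameters.
    """
    params = {}
    if include:
        params["include"] = ",".join(include)
    params.update(filters)
    if not params:
        return EVENTS_ENDPOINT
    # safe="," keeps the comma unescaped in the include list
    return f"{EVENTS_ENDPOINT}?{urlencode(params, safe=',')}"


print(events_url())
```

The resulting URL can be fetched with any HTTP client; no registration is needed according to the service description.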
Including the metadata for subj and obj also enables queries based on resource metadata, e.g. query by type of content that is connected: https://api.datacite.org/events?include=subj,obj&citationType=ScholarlyArticle-SoftwareSourceCode
This query today returns 1,078 events connecting scholarly articles and software, including 834 from journal articles referencing software via Crossref metadata and 210 from software referencing journal articles via DataCite metadata.
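A breakdown like the one above can be computed by tallying events per source. The following is a hedged sketch: it assumes each event is a JSON object with an `attributes` member containing a `source-id`, and the sample events and their source labels are hand-written placeholders rather than real API output.

```python
from collections import Counter


def count_by_source(events):
    """Count events per `source-id` (assumed attribute name)."""
    return Counter(event["attributes"]["source-id"] for event in events)


# Hand-written placeholder events; identifiers and source labels are invented
sample_events = [
    {"attributes": {"source-id": "crossref", "subj-id": "article-1", "obj-id": "software-1"}},
    {"attributes": {"source-id": "crossref", "subj-id": "article-2", "obj-id": "software-1"}},
    {"attributes": {"source-id": "datacite-crossref", "subj-id": "software-2", "obj-id": "article-3"}},
]

for source, total in count_by_source(sample_events).most_common():
    print(source, total)
```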
Metadata for resources contain a lot of information about connected PIDs. We can take advantage of this by including the information in DataCite Event Data, allowing queries that in effect connect two PIDs via an intermediary resource and two connections. Specifically, we include these relations and associated PIDs:
These connected PIDs can then act as a proxy in PID Graph queries, as demonstrated in this example:
The query today returns one data citation of the dataset identified by the DOI, and eight data files that are part of this dataset. If someone decides to cite one of these data files instead of the dataset (following principle 8, Specificity and Verifiability, of the Joint Declaration of Data Citation Principles (Data Citation Synthesis Group, 2014)), that data citation would also be included in the DataCite Event Data response.
Similarly, the citation of a specific version of a dataset would be included if querying for the parent version of the dataset. Examples for funding and authorship are given in the next paragraph.
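The proxy idea, connecting two PIDs via an intermediary resource and two connections, can be sketched as a one-hop rollup over a list of edges. Everything here is illustrative: the function name `citations_with_proxies`, the relation labels `is-part-of`, `is-version-of` and `references`, and the placeholder identifiers are assumptions chosen for the example, not a description of how DataCite Event Data is implemented.

```python
def citations_with_proxies(
    pid,
    edges,
    proxy_relations=("is-part-of", "is-version-of"),
    citing_relation="references",
):
    """Return PIDs citing `pid` directly or via a part or version of it.

    `edges` is a list of (subject, relation, object) triples.
    """
    # Step 1: the resource itself plus its parts and versions (the proxies)
    targets = {pid}
    targets |= {
        subj for subj, relation, obj in edges
        if relation in proxy_relations and obj == pid
    }
    # Step 2: anything that cites one of those targets
    return {
        subj for subj, relation, obj in edges
        if relation == citing_relation and obj in targets
    }


# Placeholder edges, invented for this example
edges = [
    ("file-1", "is-part-of", "dataset-1"),
    ("file-2", "is-part-of", "dataset-1"),
    ("article-a", "references", "dataset-1"),
    ("article-b", "references", "file-1"),
    ("article-c", "references", "dataset-2"),
]
print(sorted(citations_with_proxies("dataset-1", edges)))
```

Here `article-b` cites only a data file, yet it is counted for `dataset-1` because the file is part of that dataset.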
The initial focus in Event Data was on social media mentions and data citations. DataCite has added author-resource links and funder-resource links, using ORCID and Crossref Funder ID as PIDs, respectively. DataCite also includes dataset usage statistics, as part of the work in the Make Data Count project (Lowenberg, Budden, & Cruse, 2018). This enables the following two use cases:
The aim is for any interested parties within and beyond FREYA to implement PID Graph services, meaning that we have to figure out how best to coordinate and enable this federated PID Graph. And of course, there are initiatives outside of FREYA taking similar approaches and addressing similar use cases. These include:
To coordinate these activities we have organized a Birds of a Feather session at the RDA Plenary in Philadelphia next week (Wednesday at 2:30 PM): Research Data Graph.
The initial implementation of the PID Graph in DataCite Event Data contains 5.38 million events as of today. More work is needed to convert existing events to the new format (we expect a total of 25 million events with the current data source), improve documentation, and build visualizations and other frontend services that make it easier to show the PID Graph information we already have. But if you can’t wait and are not afraid of working with JSON REST APIs, feel free to explore DataCite Event Data, which is a free service with no registration required, by starting with the documentation.
Please reach out to us via the PID Forum if you are interested in learning more about the PID Graph, want to see your data in the PID Graph, or are working on a related project and want to coordinate. And of course, join us for the RDA Plenary session next week in Philadelphia if you plan to attend the conference.
Dasler, R., & Cousijn, H. (2018). Are your data being used? Event data has the answer! DataCite Blog. https://doi.org/10.5438/S6D3-K860
Data Citation Synthesis Group. (2014). Joint declaration of data citation principles. Force11. https://doi.org/10.25490/a97f-egyk
Ioannidis, A., & Gonzalez Lopez, J. B. (2019). Asclepias: Flower power for software citation. https://doi.org/10.5281/ZENODO.2548643
Lowenberg, D., Budden, A., & Cruse, P. (2018). It’s time to make your data count! https://doi.org/10.5438/PRE3-2F25
Manghi, P., & Bardi, A. (2019). The OpenAIRE research graph - opportunities and challenges for science. https://doi.org/10.5281/ZENODO.2600275