In May, the Make Data Count team announced that we have received additional funding from the Alfred P. Sloan Foundation for work on the Make Data Count (MDC) initiative. This will enable DataCite to do additional work in two important areas: a bibliometrics dashboard and a log processing service.
In this blog post, we want to provide more technical details about the upcoming work on the bibliometrics dashboard; the log processing service will be the topic of a future blog post. The bibliometrics dashboard will be based on several important infrastructure pieces that DataCite has built over the past few years, which are briefly described again below.
In the MDC initiative we track data citations in the scholarly literature, focussing on datasets registered with DataCite and publications registered with Crossref.
We use the joint Crossref/DataCite Event Data service to exchange information about connections between publications and datasets, contributed via Crossref and DataCite members through the metadata they register. These connections are also made available via a Scholix-compliant REST API. In the previous MDC project, the Event Data service was expanded to include data usage stats and make retrieving information easier for DataCite members.
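To make the idea of publication-to-dataset connections more concrete, here is a minimal sketch of counting citing publications per dataset from link records in a subject / relation / object shape. The field names (`subj_id`, `relation`, `obj_id`), the set of relation types treated as citations, and all DOIs are assumptions made for illustration; they are not the actual Event Data or Scholix response format.

```python
from collections import Counter

# Hypothetical link records; field names and DOIs are illustrative only.
links = [
    {"subj_id": "https://doi.org/10.1234/article-1", "relation": "cites",
     "obj_id": "https://doi.org/10.5555/dataset-a"},
    {"subj_id": "https://doi.org/10.1234/article-2", "relation": "references",
     "obj_id": "https://doi.org/10.5555/dataset-a"},
    # The same link may be contributed by more than one party.
    {"subj_id": "https://doi.org/10.1234/article-1", "relation": "cites",
     "obj_id": "https://doi.org/10.5555/dataset-a"},
    {"subj_id": "https://doi.org/10.1234/article-3", "relation": "is-supplemented-by",
     "obj_id": "https://doi.org/10.5555/dataset-b"},
    {"subj_id": "https://doi.org/10.5555/dataset-a", "relation": "is-version-of",
     "obj_id": "https://doi.org/10.5555/dataset-c"},
]

# Assumption: these relation types are the ones counted as data citations.
CITATION_RELATIONS = {"cites", "references", "is-supplemented-by"}


def count_citations(records):
    """Count distinct citing subjects per cited object."""
    seen = set()
    counts = Counter()
    for record in records:
        if record["relation"] not in CITATION_RELATIONS:
            continue
        pair = (record["subj_id"], record["obj_id"])
        if pair in seen:  # skip links reported more than once
            continue
        seen.add(pair)
        counts[record["obj_id"]] += 1
    return counts


citation_counts = count_citations(links)
```

Deduplicating on the (subject, object) pair matters because the same connection can arrive from both the publication side and the dataset side.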
DataCite members and repositories upload monthly reports about data usage to DataCite using a standard format (COUNTER Code of Practice for Research Data Usage Metrics) and protocol (SUSHI). The COUNTER Code of Practice for Research Data Usage Metrics and the DataCite usage reports API were developed in the previous MDC project.
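As a rough sketch of what working with monthly usage data can look like, the following sums usage counts per dataset and metric across months. The flat record layout is a simplification we assume for illustration, not the structure of an actual usage report, and the DOIs and numbers are made up; the metric names follow the investigations/requests distinction made in the Code of Practice.

```python
from collections import defaultdict

# Hypothetical, flattened usage records (one row per dataset, month, and metric).
monthly_usage = [
    {"doi": "10.5555/dataset-a", "month": "2020-01",
     "metric": "total-dataset-investigations", "count": 40},
    {"doi": "10.5555/dataset-a", "month": "2020-01",
     "metric": "total-dataset-requests", "count": 12},
    {"doi": "10.5555/dataset-a", "month": "2020-02",
     "metric": "total-dataset-investigations", "count": 25},
    {"doi": "10.5555/dataset-a", "month": "2020-02",
     "metric": "total-dataset-requests", "count": 8},
    {"doi": "10.5555/dataset-b", "month": "2020-02",
     "metric": "total-dataset-requests", "count": 3},
]


def total_usage(records):
    """Sum counts per DOI and metric over all months."""
    totals = defaultdict(lambda: defaultdict(int))
    for record in records:
        totals[record["doi"]][record["metric"]] += record["count"]
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {doi: dict(metrics) for doi, metrics in totals.items()}


usage_totals = total_usage(monthly_usage)
```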
The DataCite GraphQL API [Fenner (2020)] built in the EC-funded FREYA project brings together all of the above information in a single API that supports the complex queries typically needed for retrieving aggregated data citation information.
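A minimal sketch of such a query, using only the Python standard library, could look like the following. The endpoint URL, the `dataset` field and its `citationCount`, `viewCount`, and `downloadCount` attributes are our assumptions about the schema and should be checked against the GraphQL API documentation; the DOI and the sample response are placeholders, not real data.

```python
import json
import urllib.request

# Assumed endpoint; verify against the GraphQL API documentation.
GRAPHQL_URL = "https://api.datacite.org/graphql"

# Assumed schema: a `dataset` field that exposes citation and usage counts.
QUERY = """
query ($id: ID!) {
  dataset(id: $id) {
    id
    citationCount
    viewCount
    downloadCount
  }
}
"""

METRIC_FIELDS = ("citationCount", "viewCount", "downloadCount")


def build_request(doi):
    """Build a POST request that carries the query and its variables as JSON."""
    payload = json.dumps({"query": QUERY, "variables": {"id": doi}}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch(doi):
    """Send the query over the network and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(doi), timeout=30) as response:
        return json.load(response)


def extract_metrics(response):
    """Pull the metric counts out of a response, defaulting missing values to 0."""
    dataset = (response.get("data") or {}).get("dataset") or {}
    return {field: dataset.get(field) or 0 for field in METRIC_FIELDS}


# Illustrative response in the shape the query above would return.
sample_response = {
    "data": {"dataset": {"id": "https://doi.org/10.5555/dataset-a",
                         "citationCount": 4, "viewCount": 65, "downloadCount": 20}}
}
sample_metrics = extract_metrics(sample_response)
```

Calling `fetch("https://doi.org/10.5555/dataset-a")` would send the request; it is left out of the sketch so that the example runs without network access.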
We use Jupyter notebooks to analyze and visualize the information made available in the GraphQL API [Fenner (2019)], and have developed documentation, demos, and training material with our partners in the FREYA project.
Common DOI Search is a service currently under development by DataCite, with help from Crossref and others in the FREYA project, and a first version is planned for release in August. It will provide a single search interface for all scholarly DOIs, regardless of the DOI registration agency (DataCite, Crossref, etc.) they were registered with. All DOIs in Common DOI Search are indexed in a single Elasticsearch cluster using the same DataCite metadata schema.
Also as part of the PARSEC project, we have built the Researcher Profile that, using the researcher's ORCID ID, brings all academic outputs and their metrics for a given researcher into a single dashboard. This work serves as a blueprint for other aggregations (e.g. by research organization) in the bibliometrics dashboard.
All the services described above are required building blocks for the bibliometrics dashboard we will start working on in August. What the dashboard will add is better insight into the data citation information we have collected, primarily helping the bibliometricians in the project, but also available to other users. We will use Jupyter notebooks for exploratory analyses and to address very specific research questions, and data visualizations in the bibliometrics dashboard to address the most common questions, such as the growth of data citations over time.
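As an example of the kind of aggregation behind a "growth over time" view, the sketch below turns a list of citation years into a cumulative count per year, filling in years without citations. The input list is made up for illustration, and the function name is our own.

```python
from collections import Counter
from itertools import accumulate


def cumulative_by_year(citation_years):
    """Return (year, cumulative citation count) pairs for a gap-free range of years."""
    if not citation_years:
        return []
    counts = Counter(citation_years)
    years = range(min(counts), max(counts) + 1)
    # Years without any citation contribute 0 and keep the running total flat.
    per_year = [counts.get(year, 0) for year in years]
    return list(zip(years, accumulate(per_year)))


# Hypothetical publication years of the works citing one dataset.
growth = cumulative_by_year([2016, 2018, 2018, 2019])
```

Filling the gap years explicitly keeps the resulting line chart from skipping over periods in which a dataset was not cited.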
We have picked the popular Vega library for our data visualizations. Vega is not only widely used and very flexible, but also available in versions for Jupyter notebooks (Altair) and React (React-Vega).
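One reason this portability works is that a chart is described by a plain JSON specification. The sketch below builds such a specification as a Python dictionary, assuming Vega-Lite (the higher-level grammar that Altair generates and that compiles to Vega). The schema version, field names, and data values are illustrative assumptions, not the dashboard's actual charts.

```python
import json


def citations_chart_spec(rows):
    """Build a Vega-Lite line chart spec for cumulative citations per year."""
    return {
        # Assumed schema version; adjust to the version in use.
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "description": "Cumulative data citations over time (illustrative).",
        "data": {"values": rows},
        "mark": "line",
        "encoding": {
            "x": {"field": "year", "type": "ordinal", "title": "Year"},
            "y": {"field": "citations", "type": "quantitative",
                  "title": "Cumulative data citations"},
        },
    }


# Hypothetical data points.
rows = [
    {"year": 2016, "citations": 1},
    {"year": 2017, "citations": 1},
    {"year": 2018, "citations": 3},
    {"year": 2019, "citations": 4},
]

spec = citations_chart_spec(rows)
spec_json = json.dumps(spec, indent=2)  # the same JSON can be shared between tools
```

Because the specification is just JSON, the same document could in principle be rendered in a notebook or handed to a web component, which is the property that makes a shared visualization grammar attractive here.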
DataCite members and the repositories they work with contribute to the bibliometrics dashboard in important ways: by registering content with a DOI and standard metadata that facilitates citation, by including references in the metadata, and by submitting data repository usage stats. The bibliometrics dashboard will increase our understanding of data citation and data usage stats through the bibliometrics work, and will also provide aggregations of interest to our members that were not available before, for example data citations and data usage over time, by discipline, or by repository. This information will be displayed in the bibliometrics dashboard and will be available via Jupyter notebooks and the GraphQL API.
This blog post was originally published on the DataCite Blog.
Fenner, M. (2019). Using Jupyter Notebooks with GraphQL and the PID Graph. https://doi.org/10.5438/HWAW-XE52
Fenner, M. (2020). Powering the PID Graph: Announcing the DataCite GraphQL API. https://doi.org/10.5438/YFCK-MV39