This week I start as the new DataCite Technical Director. While I get up to speed with existing DataCite services and infrastructure, and we start to launch new services (e.g. this blog), this is also a good time to communicate the overall approach I am taking. I like to call it Data-Driven Development, or DDD, as we all love acronyms.
Data-Driven Development and related terms are in use in several contexts, in particular economics and programming. The term sounds similar to test-driven development and behavior-driven development, two related software development processes. Business intelligence and data science are of course closely related. My definition is as follows:
We develop and maintain our services based on data.
This shouldn't come as a surprise, as DataCite's mission is "Helping you to find, access and reuse data". And my last job at the Open Access publisher PLOS was all about collecting and presenting data about the reuse of scholarly articles (citations, downloads, social media mentions, etc.). But here I mean data in a much broader sense.
While the overall strategic direction is determined by the Board together with the DataCite working groups and members, we can collect data that help with decisions in product development, for example
Compared with the next two sections, tools for data-driven product development are less commonplace (unless I missed them, in which case please provide feedback).
The data generated during software development are increasingly made available through automated tools. We can
Any web-based service can and should be monitored for
We don't want to stop at collecting all these data; we also need a strategy for providing them to the DataCite Board, DataCite working groups, DataCite members and data centers, DataCite staff, and everyone else who cares about these data. The default should be open; exceptions are mostly data that would raise privacy or security concerns, e.g. IP addresses in usage stats. Most of the services mentioned in this post are open for everyone to look at.
Good data-driven development should not only collect lots of data and make them available, but also aggregate the information in meaningful ways. Service monitoring is a good example: staff need to understand exactly what is going on, but the typical DataCite user only cares about whether all services are running as expected. A status dashboard would be a good solution here.
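To make the aggregation idea concrete, here is a minimal sketch of how detailed per-service checks could be collapsed into the single line a status dashboard shows. Everything in it is an assumption made for illustration: the `Status` levels, the `Check` record, the function names, and the service names are hypothetical and do not describe any existing DataCite system.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical severity levels; higher means worse."""
    OPERATIONAL = 0
    DEGRADED = 1
    DOWN = 2


@dataclass
class Check:
    """One monitoring result, with the detail that staff care about."""
    service: str
    status: Status
    response_ms: float


def overall_status(checks):
    """Return the worst status across all checks (OPERATIONAL if there are none)."""
    return max((c.status for c in checks), default=Status.OPERATIONAL)


def summary(checks):
    """Collapse the detailed checks into the one line a typical user needs."""
    worst = overall_status(checks)
    if worst is Status.OPERATIONAL:
        return "All services are running as expected."
    affected = sorted(c.service for c in checks if c.status != Status.OPERATIONAL)
    return f"{worst.name.capitalize()}: {', '.join(affected)}"


# Illustrative input only; the service names and timings are made up.
checks = [
    Check("search", Status.OPERATIONAL, 120.0),
    Check("api", Status.DEGRADED, 950.0),
]
print(summary(checks))
```

Staff would keep working with the full list of `Check` records, while the dashboard would show only the output of `summary`.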
The data we are generating also need to be put into a broader context. We need
Of course I am aware that this is an ambitious agenda, in particular since DataCite is a small non-profit that has limited staff and financial resources. But I don't think that data-driven development should be left to for-profit organizations and/or to organizations of a certain size. There are several things DataCite can do:
This blog post was originally published on the DataCite Blog.
Unknown. (1931). Hannover, blick auf hannover. ETH-Bibliothek Zürich, Bildarchiv. https://doi.org/10.3932/ETHZ-A-000159123