Today we are announcing our first new functionality of 2019, a much improved search for DataCite DOIs and metadata. While the DataCite Search user interface has not changed, changes under the hood bring many important improvements and are our biggest changes to search since 2012.
Newly registered (and tagged findable) DOIs now appear in the DataCite Search index within a few minutes, compared with a lag of up to 12 hours previously. The same is true for metadata updates and for DOIs removed from the public search index (by changing the DOI state from findable to registered). Faster indexing is particularly important when related content is published at the same time, e.g. a dataset with a DataCite DOI associated with a journal article with a Crossref DOI.
This faster indexing makes it possible for members and clients to also use the search index in DOI Fabrica, enabling the same advanced search functionality available in DataCite Search, but also including DOIs in draft or registered state. Our previous Solr search index could not be used in DOI Fabrica, because the indexing delay meant that users would not see newly created or updated DOIs. Search in DOI Fabrica makes it much easier to manage DOIs and associated metadata, e.g. by filtering for DOIs in draft state or finding DOIs using the retired metadata schemata 2.1 and 2.2. And it is the first time that we provide DOI registration and search in a single user interface; this kind of simplification is one of our themes for 2019 [Dasler (2018)].
Our new search index covers all metadata and allows specific searches of every metadata field. For example geoLocationPlace:
The supported search syntax is very similar to what was available before, and uses the Elasticsearch Query String Syntax. You can, for example, specify field names and use wildcards, regular expressions, ranges, and boolean operators, e.g. use
creators.affiliation:stanford +creators.affiliation:ucsf to find the 174 DOIs with collaborators from both institutions.
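A query like this has to be URL-encoded before it is sent to an API, because characters such as ":" and "+" have special meaning in URLs. The following is a minimal sketch, not an official client: the helper name build_search_url is hypothetical, and both the https://api.datacite.org/dois endpoint and the "query" parameter name are assumptions rather than details stated in this post.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(query, base_url="https://api.datacite.org/dois", page_size=25):
    """Return a search URL with the query string safely URL-encoded.

    The default base_url and the "query" parameter name are assumptions
    made for illustration; check the API documentation before relying on them.
    """
    params = {"query": query, "page[size]": page_size}
    # urlencode escapes ":", "+", "[" and "]" and turns spaces into "+",
    # so the server receives the query exactly as it was written.
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_search_url("creators.affiliation:stanford +creators.affiliation:ucsf")
    print(url)
    # Decoding the URL again recovers the original query unchanged.
    print(parse_qs(urlsplit(url).query)["query"][0])
```

Encoding through urlencode rather than concatenating strings by hand avoids a common pitfall: an unescaped "+" in a URL is decoded as a space, which would silently change the meaning of the query.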
One important limitation of our previous search index, and a common issue with many search implementations, was the deep paging problem, making it hard if not impossible to fetch a very large number of results. Our new search index supports cursor-based pagination that overcomes this problem, allowing users to, for example, harvest all DOI metadata from a particular member. This is done in the REST API, specifying a larger number of records per page – e.g.
https://api.datacite.org/providers/caltech/dois?page[size]=1000 – and then using the URL provided via
links.next in the API response for the next query.
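The harvesting loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than an official client: the function names fetch_json and harvest are hypothetical, and the assumption that records are listed under a top-level "data" key is ours; only the links.next field and the example URL come from this post.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def harvest(start_url, fetch=fetch_json, max_pages=None):
    """Yield records page by page, following links.next until it is absent.

    Assumes each response carries its records under "data" (an assumption)
    and the URL of the next page under links.next (as described above).
    """
    url = start_url
    pages = 0
    while url:
        payload = fetch(url)
        yield from payload.get("data", [])
        pages += 1
        # Optional safety limit so a test run does not harvest everything.
        if max_pages is not None and pages >= max_pages:
            break
        # A missing or empty links.next ends the loop.
        url = (payload.get("links") or {}).get("next")


if __name__ == "__main__":
    start = "https://api.datacite.org/providers/caltech/dois?page[size]=1000"
    for record in harvest(start, max_pages=2):
        print(record.get("id"))
```

Passing the fetch function as a parameter keeps the loop independent of the network, so it can be tried against canned responses before being pointed at the live API.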
The above changes were made possible by updating our search index service from an old version of Solr (4.0) to a recent version of Elasticsearch (6.3). We switched to Elasticsearch because it works better with our new JSON-based workflow – see our December blog post about JSON [Fenner (2018)] – and because we can use a hosted service tightly integrated with the rest of our infrastructure, thereby reducing the support effort needed.
Not all DataCite services have been switched to the new search index: the Stats Portal and OAI-PMH service continue to use the old Solr search index for now and will be migrated within the next three months.
In the coming weeks and months we will also provide better documentation, improve performance, and fix any bugs we encounter. We will also work with our members to better understand what kind of queries they are most interested in, and how we can better support these queries in the search interface.
This blog post was originally published on the DataCite Blog.
Dasler, R. (2018). DataCite 2018 wrap-up and 2019 preview. https://doi.org/10.5438/BCKB-QY95
Fenner, M. (2018). Introducing datacite json. https://doi.org/10.5438/1PCA-1Y05