DataCite is a DOI registration agency that enables the registration of scholarly content with a persistent identifier (DOI) and metadata. This content can then be searched for, reused, and connected to other scholarly resources. But how does the underlying infrastructure enable this? In this blog post, I describe what we have built to make this work. It is a fairly technical post, as I have tried to go a little deeper into the details.
DataCite is a small nonprofit organization (currently 12 team members, including three in the development team), and the team is fully remote. All our infrastructure runs in the cloud, most of it on Amazon Web Services (AWS), with the servers that store data located in Ireland. We have automated the operation of our services as much as possible, following DevOps best practices. Because of the small team size, we have no separate software development and system administration teams, and DevOps allows us to integrate these roles tightly. Two important automation tools we use are GitHub Actions for Continuous Integration/Continuous Deployment (CI/CD) and Terraform for managing “Infrastructure as Code”. While we have been using Terraform since 2016, we only recently started migrating our CI/CD workflows from Travis CI (we expect to finish the migration in the next few months), mainly because GitHub Actions comes with many ready-to-use actions for some of the more complex parts of our deployment pipeline. All DataCite software is available with an open license in a public GitHub repository, and that includes our GitHub Actions workflows and Terraform configurations. You can, for example, find the code for our REST API here, and the corresponding GitHub Actions workflows here.
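To make the “Infrastructure as Code” idea a little more concrete, here is a toy sketch in Python of the principle behind such tools: you declare the state you want, and the tool works out which changes are needed to get there. This is not how Terraform is implemented, and it is not taken from the DataCite configurations; the resource names and attributes (`api-service`, `database`, `old-worker`) are made up for illustration.

```python
from typing import Dict, List, Tuple

# A resource is modelled as a flat mapping of attribute names to values.
Resource = Dict[str, str]


def plan(desired: Dict[str, Resource], actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return the (action, resource_name) pairs needed to move `actual` to `desired`."""
    actions: List[Tuple[str, str]] = []
    # Anything declared but missing is created; anything that differs is updated.
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif desired[name] != actual[name]:
            actions.append(("update", name))
    # Anything that exists but is no longer declared is deleted.
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical desired state, as it might be declared in version control.
desired = {
    "api-service": {"image": "example/api:2.0", "count": "2"},
    "database": {"engine": "mysql"},
}

# Hypothetical state currently running in the cloud.
actual = {
    "api-service": {"image": "example/api:1.9", "count": "2"},
    "old-worker": {"image": "example/worker:1.0"},
}

changes = plan(desired, actual)
for action, name in changes:
    print(f"{action}: {name}")
```

Because the desired state lives in a repository, a change to the infrastructure can be reviewed and deployed like any other code change, which is what makes this approach workable for a small team.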
The DataCite backend uses services that store data (files or databases), and we use managed AWS services for those, e.g. RDS to manage our MySQL relational databases. Our APIs all run as stateless Docker containers, and we use the Amazon Elastic Container Service (ECS), in combination with Amazon Application Load Balancers, to manage them. The adoption of Docker containers was the biggest change in our infrastructure between 2016 and 2019, and we have developed a lot of expertise in this area. Going forward we will switch to Kubernetes (Amazon Elastic Kubernetes Service, EKS) at some point, as it has become the de facto standard for container management in the cloud and provides additional functionality in a widely used open source platform. In 2015 all backend services were written in Java. Over the last six years, as we upgraded our services one after another, this has changed to backend services written in Ruby and Python. This might change again going forward: we have, for example, started to use open source software for collecting usage stats that is written in Elixir. While we have to be careful as a small development team not to spread our expertise too thin, we need to be open to new technologies, and a grant-funded project that can be based on existing open source software
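As an illustration of what “stateless” means for a containerized API behind a load balancer, here is a minimal sketch using only the Python standard library. It is not DataCite code (our services are written in Ruby and Python with full web frameworks); the `/heartbeat` path, the `SERVICE_NAME` environment variable and port 8080 are assumptions chosen for the example. The key point is that the response depends only on the request and on configuration passed in from outside, so any container can answer any request and containers can be replaced at will.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Mapping, Tuple


def respond(path: str, env: Mapping[str, str]) -> Tuple[int, dict]:
    """Map a request path to (status, JSON body) using only the request and config."""
    if path == "/heartbeat":
        # Hypothetical health-check path that a load balancer could poll.
        return 200, {"status": "ok"}
    if path == "/":
        # Configuration comes from the environment, not from local files or memory.
        return 200, {"service": env.get("SERVICE_NAME", "example-api")}
    return 404, {"error": "not found"}


class Handler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around respond(); keeps no state between requests."""

    def do_GET(self) -> None:
        status, body = respond(self.path, os.environ)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the server; in a container this would be the entry point."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Calling `serve()` starts the service. Anything that must survive a restart, such as database rows or files, belongs in a managed data service rather than in the container itself.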
The following picture puts everything I talked about together into a single view (obviously omitting a lot of detail):
Please feel free to reach out to me if you have any questions about the DataCite technology stack. If you are now interested in working for the DataCite development team, you can find more information about an open position here.
This blog post was originally published on the DataCite Blog.