Working with the Research Organization Registry (ROR) Data Dump

The commonmeta Go library has seen a major update this week that dramatically simplifies working with the Research Organization Registry (ROR) data dump, including conversion to other serialization formats (e.g. JSON Lines) and metadata formats (InvenioRDM), and integrated affiliation matching.

File Download

ROR metadata are updated regularly (typically about once a month) and made available as a file download via the Zenodo repository under a Creative Commons Zero waiver. The single file is a compressed zip archive with the metadata in JSON and CSV formats, each for v1 and v2 of the ROR schema.

There are two challenges with the ROR data dump file download. The first is that while there is a stable DOI for the latest version of the data, that DOI resolves to the dataset landing page, and there is no easy way to get from there to the file download URL when automating downloads of new versions.

The other challenge is that the compressed zip archive contains four archived files that can't be downloaded individually. The complete archive is 58.6 MB, whereas the zipped v2 JSON would be 17.7 MB (the uncompressed file is 256.9 MB).

In the commonmeta library, the download URL of the most recent ROR data dump and the file names in that zip archive are hard-coded. commonmeta can automatically fetch the full zip archive and selectively extract the v2 JSON.
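
Selective extraction of one file from a zip archive can be sketched in Go with the standard library. This is a minimal illustration, not the code commonmeta uses: the function names and the file name suffix `_schema_v2.json` are assumptions made for the example, and `buildZip` exists only so the extraction can be tried without downloading anything.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// extractBySuffix returns the content of the first file in the zip archive
// whose name ends with suffix. Only that one entry is decompressed.
// (Hypothetical helper, not part of commonmeta.)
func extractBySuffix(archive []byte, suffix string) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return nil, err
	}
	for _, f := range zr.File {
		if !strings.HasSuffix(f.Name, suffix) {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		defer rc.Close()
		return io.ReadAll(rc)
	}
	return nil, fmt.Errorf("no file ending in %q found in archive", suffix)
}

// buildZip creates a small in-memory zip archive from name/content pairs.
// It is only here to exercise extractBySuffix in a self-contained way.
func buildZip(files map[string]string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for name, content := range files {
		w, err := zw.Create(name)
		if err != nil {
			return nil, err
		}
		if _, err := w.Write([]byte(content)); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```

The zip format keeps a directory of entries at the end of the archive, so a single entry can be decompressed without unpacking the others, although the full archive still has to be downloaded first.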

When using commands that require ROR metadata, commonmeta looks for a data dump in zipped Avro format (more on Avro below) in the folder where the command is run. If that file isn't found, commonmeta looks for a v2 JSON file from the data dump, and if that file isn't found either, fetches the data dump from Zenodo, extracts the v2 JSON file, and generates the compressed Avro file. For example, a local lookup of an organization via its ROR identifier:

commonmeta convert https://ror.org/04jvcky17

Running this command transparently downloads, extracts, and converts the latest ROR data dump and looks up the metadata for https://ror.org/04jvcky17 (Newport Festivals Foundation). This takes only a few seconds (depending on your network connection), and going forward uses ROR data stored locally.
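
The lookup order described above (zipped Avro first, then v2 JSON, then download) can be sketched as a small decision function. The type, constant, and function names are hypothetical, and the file paths are passed in by the caller because the actual file names commonmeta looks for are not given here.

```go
package main

import "os"

// Source describes where the ROR data will be read from.
type Source string

const (
	SourceAvro     Source = "avro"     // zipped Avro file is already present
	SourceJSON     Source = "json"     // extracted v2 JSON file is already present
	SourceDownload Source = "download" // nothing local, fetch the data dump
)

// fileExists reports whether path exists and is a regular file.
func fileExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.Mode().IsRegular()
}

// chooseSource applies the fallback order: prefer the zipped Avro file,
// then the v2 JSON file, and only download when neither is available.
// The exists function is injected so the logic can be tried without
// touching the file system; pass fileExists for real use.
func chooseSource(exists func(string) bool, avroPath, jsonPath string) Source {
	switch {
	case exists(avroPath):
		return SourceAvro
	case exists(jsonPath):
		return SourceJSON
	default:
		return SourceDownload
	}
}
```

A call such as `chooseSource(fileExists, avroPath, jsonPath)` would then decide which of the three paths to take before any data is read.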

File Formats

commonmeta can automatically convert the JSON data of all 115K ROR records into other serialization formats, currently JSON Lines, YAML, CSV, and Avro, and optionally compress them as a zip archive, for example:

commonmeta list --from ror --file mydata.csv.zip

This command takes a few seconds to generate a compressed CSV file (almost) identical to the CSV provided in the ROR data dump. Whereas JSON, JSON Lines, and YAML are straightforward to work with, the CSV format has limitations and shows only a (large) subset of the metadata. Avro is a special case: it is focused on efficient storage and transmission over network connections, but is not human-readable. Avro uses a schema in JSON format, which has advantages over the schema-less JSON, YAML, and CSV, including data validation and smaller file sizes. These are the file sizes for the ROR data dump (the JSON file is smaller than the original file because null values were omitted):

Format	Size (MB)	Size ZIP (MB)
JSON	182.3	17.0
JSON Lines	108.1	14.5
YAML	125.5	15.3
CSV	33.5	10.3
Avro	41.6	13.2

Other criteria besides file size are the speed of reading and writing files in this format, readability by humans, and supported data types. CSV is very readable, but is only useful as an output format, as data types other than text and numbers are not supported. Avro generates the smallest files among the formats that retain all metadata (only CSV, which holds a subset, is smaller), but is more complicated to work with as it requires a schema. YAML is very human-readable, whereas JSON Lines works well over network connections as it is easier to stream than JSON.
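
To make the difference between JSON and JSON Lines concrete, here is a minimal sketch that converts a JSON array into JSON Lines using only the Go standard library. The function name is hypothetical and this is not how commonmeta implements the conversion; it also loads the whole array into memory, which is fine for an illustration but not ideal for very large files.

```go
package main

import (
	"bytes"
	"encoding/json"
)

// toJSONLines converts a JSON array into JSON Lines: one compact JSON
// value per line, each terminated by a newline.
func toJSONLines(data []byte) ([]byte, error) {
	// json.RawMessage keeps each array element as raw bytes, so the
	// records do not need to match any particular Go struct.
	var records []json.RawMessage
	if err := json.Unmarshal(data, &records); err != nil {
		return nil, err
	}
	var out bytes.Buffer
	for _, rec := range records {
		// Compact strips insignificant whitespace, so each record
		// ends up on a single line.
		if err := json.Compact(&out, rec); err != nil {
			return nil, err
		}
		out.WriteByte('\n')
	}
	return out.Bytes(), nil
}
```

Because every line is a complete JSON value, a consumer can start processing records before the whole file has arrived, which is what makes the format easier to stream.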

commonmeta allows working with all these formats for ROR data (CSV only as output, as it doesn't include all metadata), so you can for example give a JSON Lines file to a colleague and she can use it as commonmeta input:

commonmeta list mydata.jsonl --from ror --file mydata2.csv
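
Reading a JSON Lines file record by record could look like the following sketch. The `Record` type holds only the `id` field for brevity, and all names are made up for the example rather than taken from commonmeta.

```go
package main

import (
	"encoding/json"
	"errors"
	"io"
	"strings"
)

// Record holds only the field this sketch needs; a real ROR record
// has many more fields, which the decoder silently ignores here.
type Record struct {
	ID string `json:"id"`
}

// readJSONLines decodes one record at a time from r and passes it to
// handle, so memory use stays flat regardless of the file size.
func readJSONLines(r io.Reader, handle func(Record) error) error {
	dec := json.NewDecoder(r)
	for {
		var rec Record
		err := dec.Decode(&rec)
		if errors.Is(err, io.EOF) {
			return nil // clean end of input
		}
		if err != nil {
			return err
		}
		if err := handle(rec); err != nil {
			return err
		}
	}
}

// readJSONLinesString is a convenience wrapper for trying the reader
// on an in-memory string instead of a file.
func readJSONLinesString(s string, handle func(Record) error) error {
	return readJSONLines(strings.NewReader(s), handle)
}
```

For a file on disk, the `*os.File` returned by `os.Open` can be passed directly as the reader.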

Metadata formats and filtering

The ROR schema describes the metadata needed for the affiliation use case in scholarly works. A related metadata schema is used by the InvenioRDM repository platform that the Rogue Scholar blogging platform also uses. Here metadata vocabularies are typically described in YAML and use a subset of the ROR metadata, both for author affiliations and for funders. One small twist is that one metadata field is different: affiliations support acronyms for names, whereas funders support the country code. commonmeta can handle this and also generate a subset of organizations that are of type funder:

commonmeta list --from ror --to inveniordm --file funders.yaml

One challenge in the current version of the InvenioRDM platform (v12.0) is that importing large vocabularies (e.g. all ROR data) is slow and error-prone. In a previous blog post I suggested importing only the affiliations needed, but that workflow is slow and complicated. The upcoming version v13.0 of InvenioRDM has better handling of large vocabulary imports. In the meantime, commonmeta supports the generation of smaller vocabularies that can be imported in batches, e.g.

commonmeta list --from ror --to inveniordm --file affiliations_ror.yaml -n 10000 --page 1

This command generates a YAML file in a format InvenioRDM understands, containing only 10,000 organizations. By installing commonmeta on the InvenioRDM server, running this command repeatedly (incrementing --page each time), and importing the YAML in batches, we can overcome the limitations of the v12.0 vocabulary import.
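
The arithmetic behind this kind of batching is simple enough to sketch. The functions below are hypothetical and assume 1-based page numbers, as suggested by the use of --page 1 above; the total of 115,000 records in the usage note is a rounded figure for illustration.

```go
package main

// pageBounds returns the half-open index range [start, end) of the
// records that belong to the given 1-based page. An out-of-range page
// yields an empty range (start == end).
func pageBounds(total, pageSize, page int) (start, end int) {
	if total <= 0 || pageSize <= 0 || page < 1 {
		return 0, 0
	}
	start = (page - 1) * pageSize
	if start >= total {
		return total, total
	}
	end = start + pageSize
	if end > total {
		end = total // the last page may be shorter
	}
	return start, end
}

// pageCount returns how many pages are needed to cover all records.
func pageCount(total, pageSize int) int {
	if total <= 0 || pageSize <= 0 {
		return 0
	}
	return (total + pageSize - 1) / pageSize // ceiling division
}
```

With 115,000 records and a page size of 10,000, `pageCount` gives 12 batches, and `pageBounds(115000, 10000, 12)` covers the final 5,000 records.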

commonmeta currently supports two filters to generate subsets of the ROR data: by organization type (e.g. funder or university) and/or by country. Please reach out if you are interested in other metadata formats for organizations and/or filters.
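
Filtering by organization type and/or country, as described above, amounts to a single pass over the records. The sketch below uses a deliberately simplified `Organization` struct with hypothetical field names; the real ROR v2 records nest this information more deeply, and this is not commonmeta's code.

```go
package main

import "strings"

// Organization is a simplified stand-in for a ROR record.
type Organization struct {
	ID          string
	Types       []string // e.g. "funder"
	CountryCode string   // two-letter country code
}

// hasType reports whether org lists the given type.
func hasType(org Organization, orgType string) bool {
	for _, t := range org.Types {
		if t == orgType {
			return true
		}
	}
	return false
}

// filterOrganizations keeps the organizations that match both criteria.
// An empty orgType or country means "do not filter on this field", so
// the two filters can be used separately or together.
func filterOrganizations(orgs []Organization, orgType, country string) []Organization {
	var out []Organization
	for _, org := range orgs {
		if orgType != "" && !hasType(org, orgType) {
			continue
		}
		if country != "" && !strings.EqualFold(org.CountryCode, country) {
			continue
		}
		out = append(out, org)
	}
	return out
}
```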

Queries

commonmeta supports simple queries of the local ROR data by ROR ID or external ID (Crossref Funder ID, GRID, ISNI, Wikidata):

commonmeta convert Q7713086

More complex queries are currently not possible with local data, but commonmeta integrates with the ROR API to support affiliation matching:

commonmeta match --from ror "The Alfred Hospital"

This will return a single ROR record if a match with a score of 0.9 or higher is found by the ROR API. The affiliation matching can be combined with looking up metadata from Crossref, DataCite, or InvenioRDM, a core functionality of commonmeta. If affiliation names but no ROR IDs are provided, commonmeta can automatically merge the information found by affiliation matching. To indicate that the metadata was found by ROR and not the publisher, commonmeta adds an assertedBy field with the value ror to the response. Affiliation identifiers provided by the publisher will have a publisher value in the response. The following query returns a random sample of 50 publications from Crossref member 31795 (Front Matter) with affiliation matching applied and the results stored in commonmeta schema format:

commonmeta list --from crossref --member 31795 --sample=true -n 50 --match=true --file matching.json
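
The matching rule described above (accept a match only at a score of 0.9 or higher, and record who asserted the identifier) can be sketched as follows. The structs and function names are hypothetical simplifications, the candidate list stands in for what an affiliation matching service returns, and the exact merging logic in commonmeta may differ.

```go
package main

// Candidate is a simplified match result: an identifier and a score.
type Candidate struct {
	ID    string
	Score float64
}

// Affiliation is a simplified affiliation with provenance information.
type Affiliation struct {
	Name       string
	ID         string
	AssertedBy string // "publisher" or "ror", empty if no identifier
}

// bestMatch returns the highest-scoring candidate whose score is at
// least threshold, and false if no candidate qualifies.
func bestMatch(candidates []Candidate, threshold float64) (Candidate, bool) {
	var best Candidate
	found := false
	for _, c := range candidates {
		if c.Score < threshold {
			continue
		}
		if !found || c.Score > best.Score {
			best = c
			found = true
		}
	}
	return best, found
}

// applyMatch keeps an identifier supplied by the publisher and only
// fills in a matched identifier when none was provided.
func applyMatch(aff Affiliation, candidates []Candidate, threshold float64) Affiliation {
	if aff.ID != "" {
		aff.AssertedBy = "publisher"
		return aff
	}
	if best, ok := bestMatch(candidates, threshold); ok {
		aff.ID = best.ID
		aff.AssertedBy = "ror"
	}
	return aff
}
```

Keeping the provenance in a separate field means downstream users can decide for themselves whether to trust identifiers that were inferred rather than supplied.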

Conclusions

The work on using Go to read, format, and integrate ROR metadata was inspired by a session at the recent InvenioRDM Partner meeting in Hamburg. InvenioRDM is written in Python and JavaScript/React, but Go is a great alternative for simple installations (commonmeta is a single 5 MB binary) and performance-critical functions (e.g. converting the ROR data dump into the InvenioRDM YAML format). Rust is another language that people use in these situations, and more work is needed to compare the relative strengths and weaknesses.

More work is also needed to decide on the best file format for storing scholarly metadata at scale. Avro looks promising, but needs to be compared with JSON in more detail. Another interesting newer format is Parquet. It is a column-oriented file format in contrast to the formats described here, which are row-based. This makes some things harder but other things much easier, and this becomes more critical as the number of metadata records grows from 100K (ROR) to the millions (DataCite, Crossref).

Finally, this work demonstrates that a good proportion of metadata work can be done locally, working with data dumps rather than high frequencies of API calls.

Update (April 24): I refactored the commonmeta Go library (version v0.21.0 and later) to a) embed all ROR metadata (no need to download them, which is also faster, but increases the size of the binary by 15 MB), and b) use a different data structure (a map instead of a slice, or in Python terms a dict instead of a list). This makes lookup by identifier (one of the important use cases) faster, e.g. when fetching affiliation information of scholarly works in bulk.
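
The difference between the two data structures can be illustrated with a short sketch. The names are hypothetical and the struct is reduced to two fields; the point is only that a slice has to be scanned record by record, whereas a map keyed by identifier finds a record in a single lookup.

```go
package main

// Organization is a simplified stand-in for a ROR record.
type Organization struct {
	ID   string
	Name string
}

// findInSlice scans the slice until it finds a matching ID, which takes
// time proportional to the number of records.
func findInSlice(orgs []Organization, id string) (Organization, bool) {
	for _, org := range orgs {
		if org.ID == id {
			return org, true
		}
	}
	return Organization{}, false
}

// buildIndex converts the slice into a map keyed by ID. Building it
// costs one pass, after which each lookup takes constant time on average.
func buildIndex(orgs []Organization) map[string]Organization {
	index := make(map[string]Organization, len(orgs))
	for _, org := range orgs {
		index[org.ID] = org
	}
	return index
}
```

After `index := buildIndex(orgs)`, a lookup is simply `org, ok := index[id]`, which pays off when many identifiers are resolved in bulk.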

References

Research Organization Registry. (2025). ROR Data (Version v1.63) [Dataset]. Zenodo. https://doi.org/10.5281/ZENODO.6347574

Fenner, M. (2025). front-matter/commonmeta: V0.19.4 (Version v0.19.4) [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.15256488

Fenner, M. (2025, April 7). Where I simplified ROR affiliation metadata handling. Front Matter. https://doi.org/10.53731/ymbv8-7jm78

Fenner, M. (2025, March 19). Rogue Scholar meets the InvenioRDM community. Front Matter. https://doi.org/10.53731/1aw0b-pr243
