Last week I wrote about software.lagotto.io, an instance of the lagotto open source software collecting metrics for the about 1,400 software repositories included in Sciencetoolbox. In this post I want to report the first results analyzing the data.
If you want to follow along, please go to https://github.com/mfenner/software-analysis, this repository holds all the data, as well as the R code used for analysis. A special thanks goes to Scott Chamberlain who greatly helped me by tweaking the alm R package to support URLs instead of DOIs as identifiers.
The first step in the analysis is to get an overview of the external sources citing or discussing the software package:
This is basically the same figure as in the previous post, but with two differences: I have added a Nature.com OpenSearch data source, and I have found an additional 64 repositories cited in scholarly articles via an Europe PMC full-text Search that also includes the reference lists (thanks to Jo McEntyre).
I am not sure why we are not picking up any Wikipedia citations, and have to take a closer look. The ORCID source also needs tweaking, and there are some issues with the Wordpress.com data that I have to look into as well. Citations in the scholarly literature are obviously the most interesting data, and we have three Github repos with more than 25 citations, including https://github.com/najoshi/sickle with 54 citations. As most repositories in our sample are cited only once if at all, a correlation with Github stars and forks is not useful. Sickle is popular on Github (52 stars and 32 forks), but it is not clear that this activity is correlated to citations (e.g. because there are more citations than stars).
The vast majority of software repos in this analysis are hosted by Github, so we have the numbers of stars and forks for those. It is interesting, although probably not very surprising, that the number of Github stargazers and forks is highly correlated:
We can find Facebook activity (likes, comments or shares) for one third of the repositories. There is a reasonably good correlation between Facebook activity and number of Github forks:
One interesting analysis would be to look at the repositories that have been forked much more often relative to their Facebook activity, e.g. Impala with 1,207 Github stars and 458 forks, but only 5 Facebook shares. One limitation of the analysis is that we are not tracking Facebook (or other social media) activity for all forks of a repo.
We found Reddit discussions mentioning one of the repositories in 7% of cases. Once we have a larger sample size it would be interesting to correlate this activity with Github stars and forks, similar to what we did for Facebook. By far the most popular repository from our sample on Reddit is Bitcoin, followed by JQuery. Twitter activity is notoriously difficult to collect since Twitter doesn’t keep tweets very long, hence probably the low numbers compared to Facebook and Reddit.
Feel free to play with the data and scripts provided at https://github.com/mfenner/software-analysis, my next step is probably to include a much larger number of software repositories.
It has not escaped our notice that the kind of analysis described above could be applied to any software repository, not just scientific software.
Nine simple ways to make it easier to (re)use your data
This paper in markdown format was written by Ethan White et al. The markdown file and the associated bibliogaphy and figure files are available from the Github repository of the paper.I used this version, an earlier version was published as PeerJ Preprint....
Random notes from the altmetrics12 conference
Last week I attended the altmetrics12 workshop in Chicago. You can read all 11 abstracts here, and the conference had good Twitter coverage (using the hashtag #altmetrics12), at least until Twitter had a total blackout around 12 PM our time.All but two...
Introducing the PID Graph
Persistent identifiers (PIDs) are not only important to uniquely identify a publication, dataset, or person, but the metadata for these persistent identifiers can provide unambiguous linking between persistent identifiers of the same type, e.g....