https://github.com/mpadge/ros-pkg-authors
Analysis of authorial roles of rOpenSci packages
- Host: GitHub
- URL: https://github.com/mpadge/ros-pkg-authors
- Owner: mpadge
- Created: 2019-09-26T11:20:44.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-05-21T19:24:05.000Z (about 5 years ago)
- Last Synced: 2025-02-14T13:23:25.514Z (3 months ago)
- Language: R
- Size: 966 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.Rmd
- Authors: authors-per-time-interval-1.png
---
title: "Authorial contributions of rOpenSci packages"
author: "Mark Padgham"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: false
    toc_float: false
    number_sections: true
    theme: flatly
    pandoc_args: [
      "--number-offset=1"
    ]
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
    out.width = "100%",
    collapse = TRUE,
    comment = "#>",
    fig.path = "",
    echo = TRUE
)
```

A response to the [call for
help](https://github.com/ropenscilabs/annual-report-help) contributing to the
rOpenSci 2019 Annual Report, via [issue
#4](https://github.com/ropenscilabs/annual-report-help/issues/4), which aimed
to quantify "Number of authors per package". From the original description by
[\@sckott](https://github.com/sckott):

> Does the number of maintainers per package change through time?
>
> In the interest of software sustainability, ideally each package would have
> more than one maintainer, but this is relatively rare. We'd like to see the
> number of maintainers per package increase over time, but number of maintainers
> is hard to address without detailed knowledge of each repo. As a proxy, we
> could count up all authors regardless of their role (not counting reviewers,
> funders).
>
> Question becomes: Does the number of authors per package change (increase) through time?

Suggested approaches were based on extracting "official" authors from package
DESCRIPTION files, but numbers of authors in this context are unavoidably
cumulative, and can only increase over time because authors are generally
*not* removed once added. Addressing the issue through official
DESCRIPTION files would thus require comparison of these rates of increase with
some kind of neutral, expected value, which seems impracticable, so alternative
approaches are pursued here.

The analyses mostly work via functions defined in several
`functions-*.R` files, loaded here along with the necessary libraries.

```{r libs, message = FALSE}
library (jsonlite)
library (dplyr)
library (magrittr)
library (ggplot2)
library (cranlogs)
source ("functions-repos.R")
source ("functions-extract.R")
source ("functions-analyse.R")
```

The functions mostly use the GitHub GraphQL API to extract the full commit
histories of all rOpenSci package repos, and of RStudio packages granted the
honour of being listed in their ["official" hex sticker
page](https://github.com/rstudio/hex-stickers/tree/master/PNG). The extraction
of commit histories from the GitHub GraphQL API requires a client to be
established with the following code:
```{r gh-cli}
token <- Sys.getenv("GITHUB_GRAPHQL_TOKEN") # or whatever
gh_cli <- ghql::GraphqlClient$new (
    url = "https://api.github.com/graphql",
    headers = list (Authorization = paste0 ("Bearer ", token))
)
```

## Get commit histories from GitHub

The objects of this analysis are the entire commit histories (for the default
branch of a repository) of rOpenSci and RStudio repositories. The first
functions obtain all associated repositories as a `data.frame` with columns for
both repository name and associated GitHub organization. The second column is
necessary because not all repositories are hosted directly on
[`github/ropensci`](https://github.com/ropensci) or
[`github/rstudio`](https://github.com/rstudio). These data are extracted with
the functions `get_ros_repos()` and `get_rstudio_repos()`. Commit histories can
then be extracted by submitting the resultant `data.frame` objects to the
single function, `get_all_commits()`. This function takes about 20 minutes
to run for each of rOpenSci and RStudio, so the resultant data are saved to
enable immediate re-loading in all subsequent analyses.

```{r summary-fakey, eval = FALSE, echo = TRUE}
repos <- get_ros_repos ()
system.time (
    dat_ros <- get_all_commits (gh_cli, repos)
)
names (dat_ros) <- repos
saveRDS (dat_ros, "results-ropensci.Rds")

# Re-assign `repos` so that the RStudio repositories are the ones queried
repos <- get_rstudio_repos ()
system.time (
    dat_rst <- get_all_commits (gh_cli, repos)
)
names (dat_rst) <- repos
saveRDS (dat_rst, "results-rstudio.Rds")
```

## Proportion of contributions from non-primary contributors

All of the following analyses focus on "non-primary contributors", which are
simply contributors other than the statistically dominant contributor. The
first metric analysed here is the proportion of non-primary contributions, via
the `prop_np_commits()` function, which by default yields quarterly-aggregates
of contributions.
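
As a rough illustration of this metric, the sketch below computes the proportion of non-primary commits from a toy commit log. It is not the code behind `prop_np_commits()`, which lives in `functions-analyse.R` and is not reproduced here; the helper name `np_proportion_sketch()`, the `author` and `date` columns, and the data themselves are all assumptions made for the example.

```r
# Hypothetical commit log: one row per commit (column names are assumptions)
commits <- data.frame (
    author = c ("a", "a", "a", "b", "a", "c", "b", "a"),
    date = as.Date (c ("2019-01-05", "2019-01-20", "2019-02-11", "2019-03-02",
                       "2019-04-09", "2019-05-17", "2019-06-01", "2019-06-23")),
    stringsAsFactors = FALSE
)

np_proportion_sketch <- function (commits, quarterly = TRUE) {
    # Primary contributor = the author with the most commits overall
    primary <- names (which.max (table (commits$author)))
    is_np <- commits$author != primary
    if (!quarterly)
        return (mean (is_np))
    # Label each commit with its calendar quarter, e.g. "2019-Q1"
    qtr <- paste0 (format (commits$date, "%Y"), "-", quarters (commits$date))
    # Proportion of non-primary commits within each quarter
    tapply (is_np, qtr, mean)
}

np_proportion_sketch (commits)                     # 0.25 for 2019-Q1, 0.5 for 2019-Q2
np_proportion_sketch (commits, quarterly = FALSE)  # 0.375 over the whole history
```

Here author `a` is the primary contributor, so only commits by `b` and `c` count towards the proportions.
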
```{r prop_np_commits}
np_commits_rst <- prop_np_commits (readRDS ("results-rstudio.Rds"))
np_commits_rst$org <- "RStudio"
np_commits_ros <- prop_np_commits (readRDS ("results-ropensci.Rds"))
np_commits_ros$org <- "rOpenSci"
results <- bind_rows (np_commits_ros, np_commits_rst) %>%
rename (non_primary = n)
ggplot (results, aes (date, non_primary)) +
geom_point (colour = "#9239F6") +
geom_smooth (colour = "#FF0076", method = "lm", formula = y ~ x) +
facet_wrap (.~org) +
theme (axis.title.y = element_text (angle = 90))
```
RStudio packages have a significantly higher mean proportion of non-primary
contributions:
```{r prop_np_commits-t-test, echo = FALSE}
tt <- t.test (np_commits_rst$n, np_commits_ros$n)
message ("mean (RStudio) = ", signif (tt$estimate [1], 3),
"; mean (rOpenSci) = ", signif (tt$estimate [2], 3),
" [T = ", signif (tt$statistic, 3),
", df = ", signif (tt$parameter, 5),
", p = ", signif (tt$p.value, 2), "]")
```
```{r prop_np_commits-lm, echo = FALSE}
lm_rst <- summary (lm (np_commits_rst$n ~ np_commits_rst$date))
t_rst <- signif (lm_rst$coefficients [2, 3], 2)
p_rst <- signif (lm_rst$coefficients [2, 4], 2)
lm_ros <- summary (lm (np_commits_ros$n ~ np_commits_ros$date))
t_ros <- signif (lm_ros$coefficients [2, 3], 2)
p_ros <- signif (lm_ros$coefficients [2, 4], 2)
```
The figure also suggests that proportions of non-primary contributions
for RStudio packages decreased between the years 2010 and 2019,
although this decrease was not significant (T = `r t_rst`; p = `r p_rst`).
There was no significant change for rOpenSci (T = `r t_ros`; p =
`r p_ros`).

## Effect of package prominence

Packages that are more prominent may attract more non-primary contributions,
and so the preceding results may to some extent merely reflect differences in
package prominence. (We use the term "prominence" here in lieu of "popularity",
with due connotation that prominence is an attribute that can be actively
manipulated; in particular, RStudio has a commercial budget not available to
rOpenSci, and which is able to be directed towards increasing the prominence of
their packages.) Prominence is quantified here by the total number of package
downloads divided by the time elapsed since a package's first release. For
that, we use the [`cranlogs` package](https://cranlogs.r-pkg.org/). Note that
not all rOpenSci packages have been released on CRAN, and so prominence metrics
will only exist for those which have. The extraction of downloads can take
quite some time, so we save the results for subsequent analyses.

```{r cran-downloads, eval = FALSE}
pkgs_rst <- names (readRDS ("results-rstudio.Rds"))
x <- cran_downloads (pkgs_rst, from = "1997-04-01")
saveRDS (x, file = "cran-rstudio.Rds")
pkgs_ros <- names (readRDS ("results-ropensci.Rds"))
x <- cran_downloads (pkgs_ros, from = "1997-04-01")
saveRDS (x, file = "cran-ropensci.Rds")
```
Each of these is a single `data.frame` with columns for `date` (daily values from
the specified `from` date), `count` of daily downloads, and `package` naming
each requested package. The following code then converts these daily
downloads for each package into a single measure of average daily downloads
over the entire lifetime of a package, counted from the first day with a
non-zero download count.

```{r process-downloads}
x <- readRDS ("cran-rstudio.Rds")
p_rst <- unlist (lapply (split (x, as.factor (x$package)), function (i) {
    first <- which (i$count > 0) [1]
    sum (i$count) / (nrow (i) - first + 1) }))
x <- readRDS ("cran-ropensci.Rds")
p_ros <- unlist (lapply (split (x, as.factor (x$package)), function (i) {
    first <- which (i$count > 0) [1]
    sum (i$count) / (nrow (i) - first + 1) }))
```

That of course raises the question of what those "prominence" scores look like.
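
Before looking at the real values, the averaging step above can be checked on a toy example. This is only a sketch: the package names and download counts are invented, and only the column layout (`date`, `count`, `package`) follows the description of the `cran_downloads()` output given above.

```r
# Toy daily download counts in the shape described above; values are invented
toy <- data.frame (
    date = rep (as.Date ("2019-01-01") + 0:5, times = 2),
    count = c (0, 0, 4, 6, 2, 8,    # "pkgA": first downloaded on day 3
               3, 1, 0, 2, 4, 2),   # "pkgB": downloaded from day 1
    package = rep (c ("pkgA", "pkgB"), each = 6),
    stringsAsFactors = FALSE
)

toy_prominence <- vapply (split (toy, toy$package), function (i) {
    first <- which (i$count > 0) [1]          # first day with any downloads
    sum (i$count) / (nrow (i) - first + 1)    # average over days since then
}, numeric (1))

toy_prominence   # pkgA: 20 / 4 = 5; pkgB: 12 / 6 = 2
```

Days before the first recorded download are excluded from the denominator, so a package is not penalised for the period before its release.
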
```{r prominence}
dat_rst <- data.frame (package = names (p_rst),
prominence = as.numeric (p_rst),
org = "RStudio",
stringsAsFactors = FALSE)
dat_ros <- data.frame (package = names (p_ros),
prominence = as.numeric (p_ros),
org = "rOpenSci",
stringsAsFactors = FALSE)
dat <- rbind (dat_rst, dat_ros)
dat$log_prominence <- log10 (dat$prominence)
ggplot (dat, aes (x = org, y = log_prominence, fill = org)) +
geom_violin (alpha = 0.7) +
theme (axis.title.y = element_text (angle = 90))
```

Perhaps unsurprisingly, RStudio packages are enormously more prominent
than rOpenSci packages (noting that the scale is logarithmic). Does this
prominence affect the proportions of non-primary contributions? For that we
need a single measure of the proportion of commits aggregated over the entire
history of each repo, extracted here with a `quarterly = FALSE` argument.

```{r prom-non-primary}
np_commits_rst <- prop_np_commits (readRDS ("results-rstudio.Rds"), quarterly = FALSE)
np_commits_rst$org <- "RStudio"
np_commits_ros <- prop_np_commits (readRDS ("results-ropensci.Rds"), quarterly = FALSE)
np_commits_ros$org <- "rOpenSci"
np_commits <- dplyr::bind_rows (np_commits_ros, np_commits_rst) %>%
    dplyr::rename (package = repo)

dat <- dplyr::left_join (dat, np_commits, by = c ("package", "org")) %>%
    rename (non_primary = n)
ggplot (dat, aes (x = log_prominence, y = non_primary, colour = org)) +
geom_point () +
geom_smooth (method = "lm", formula = y ~ x) +
theme (axis.title.y = element_text (angle = 90))
```
```{r prom-non-primary-stats, echo = FALSE}
# The `n` column was renamed to `non_primary` above, so refer to it by its
# full name rather than relying on partial matching of `$n`
dat_ros <- dat [dat$org == "rOpenSci", ]
lm_ros <- summary (lm (dat_ros$non_primary ~ dat_ros$log_prominence))
dat_rst <- dat [dat$org == "RStudio", ]
lm_rst <- summary (lm (dat_rst$non_primary ~ dat_rst$log_prominence))

t_rst <- signif (lm_rst$coefficients [2, 3], 2)
p_rst <- signif (lm_rst$coefficients [2, 4], 2)
t_ros <- signif (lm_ros$coefficients [2, 3], 2)
p_ros <- signif (lm_ros$coefficients [2, 4], 2)
```

The two organizations follow categorically different trajectories. More
prominent RStudio packages attract significantly greater proportions of
non-primary contributions (T = `r t_rst`, p = `r p_rst`),
while more prominent rOpenSci packages tend to attract *lower* proportions of
non-primary contributions, and so become more dominated by singular primary
contributors, although this effect is not significant (T = `r t_ros`, p =
`r p_ros`).

## Temporal patterns of non-primary contributions

We now delve into more detailed analyses of the git commit histories, through
analysing both numbers of commits and numbers of lines of code committed. We
quantify numbers of distinct contributors, through aggregating numbers of both
commits and lines of code over a defined time period (fixed at 3 months
throughout all of the following, although this could easily be modified) and
grouping by unique contributor. Contributions from the primary contributor are
removed from the analysis, so as to count only contributions from people
other than the primary author. The numbers are then converted to relative
amounts for each time period, sorted in decreasing order, and then converted to
a linear rate of decrease per additional unique contributor. This property,
referred to from here on as the "non-primary contribution rate", is
negative, but will approach zero in the ideal situation of all contributors to
a package having equal contributions. The more negative the non-primary
contribution rate, the more a package is dominated by a single contributor.
This metric is derived for each package for each quarter in which sufficient
data are available.

```{r authors-per-time-interval}
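# Illustrative sketch only: this is NOT the code behind `stats_commits()` or
# `stats_lines()`, which are defined in the sourced function files. It shows
# one possible reading of the "non-primary contribution rate" for a single
# quarter. The name `npc_rate_sketch()` and the toy inputs are assumptions.
npc_rate_sketch <- function (n) {
    # `n`: commits (or lines) per non-primary contributor within one quarter
    if (length (n) < 2)
        return (NA_real_)                   # a slope needs >= 2 contributors
    rel <- sort (n / sum (n), decreasing = TRUE)   # relative, largest first
    idx <- seq_along (rel)                         # contributor rank 1, 2, ...
    unname (stats::coef (stats::lm (rel ~ idx)) [2])  # slope per contributor
}
rate_equal <- npc_rate_sketch (c (5, 5, 5, 5))   # equal shares: slope of zero
rate_skewed <- npc_rate_sketch (c (17, 2, 1))    # one dominant: slope of -0.4
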
dat <- readRDS ("results-rstudio.Rds")
commits_rst <- stats_commits (dat)
lines_rst <- stats_lines (dat)
np_commits_rst <- prop_np_commits (dat)
dat <- readRDS ("results-ropensci.Rds")
commits_ros <- stats_commits (dat)
lines_ros <- stats_lines (dat)
np_commits_ros <- prop_np_commits (dat)

commits_rst$org <- "RStudio"
lines_rst$org <- "RStudio"
commits_ros$org <- "rOpenSci"
lines_ros$org <- "rOpenSci"

mean (commits_ros$slope, na.rm = TRUE); mean (commits_rst$slope, na.rm = TRUE)
mean (lines_ros$slope, na.rm = TRUE); mean (lines_rst$slope, na.rm = TRUE)

results <- rbind (commits_rst,
                  commits_ros,
                  lines_rst,
                  lines_ros) %>%
    dplyr::filter (!is.na (slope))

ggplot (results, aes (date, slope)) +
geom_point (colour = "#9239F6") +
geom_smooth (colour = "#FF0076", method = "lm") +
facet_wrap (.~var + org) +
ylab ("Non-primary contribution rate") +
theme (axis.title.y = element_text (angle = 90))
```

Non-primary contributions in terms of both commits and lines of code have thus
increased over time in both organizations, with values being clearly higher for
RStudio than rOpenSci.

## rOpenSci package categories

We now repeat the above analyses for sub-sets of packages within categories
designed by rOpenSci. These categories are provided by the `get_ros_repos()`
function, and are summarised thus:

```{r ros-categories}
pkgs <- get_ros_repos ()
tab <- table (pkgs$category)
knitr::kable (data.frame (name = names (tab),
num_packages = as.integer (tab)))
```
```{r ros-baseline-slope, echo = FALSE}
mod_rst <- summary (lm (commits_rst$slope ~ commits_rst$date))
slope_rst <- signif (mod_rst$coefficients [2, 1], 3)
npcr_rst <- signif (mean (commits_rst$slope, na.rm = TRUE), 3)
mod_ros <- summary (lm (commits_ros$slope ~ commits_ros$date))
slope_ros <- signif (mod_ros$coefficients [2, 1], 3)
npcr_ros <- signif (mean (commits_ros$slope, na.rm = TRUE), 3)
```

We now repeat the analysis immediately above of relative rates of non-primary
contributions over time for packages within each of these categories. The
baseline for comparison formed from all packages considered together has a mean
non-primary contribution rate of `r npcr_ros`, with an increase per year of
`r slope_ros`. The equivalent RStudio values are a mean of `r npcr_rst` with an
increase per year of `r slope_rst`.

```{r ros-baseline-slope-extra, echo = FALSE}
# precise values to use below
npcr_ros <- mean (commits_ros$slope, na.rm = TRUE)
slope_ros <- mod_ros$coefficients [2, 1]
npcr_rst <- mean (commits_rst$slope, na.rm = TRUE)
slope_rst <- mod_rst$coefficients [2, 1]
```

```{r one-category}
category_stats <- function (pkgs, commits, category = "data-access")
{
    cat_pkgs <- pkgs$repo [which (pkgs$category %in% category)]
    # Use the `commits` argument rather than the global `commits_ros`
    cat_commits <- commits [commits$repo %in% cat_pkgs, ] %>%
        filter (!is.na (slope))
    slope <- mn <- NA
    if (nrow (cat_commits) > 1)
    {
        mod <- summary (lm (cat_commits$slope ~ cat_commits$date))
        slope <- mod$coefficients [2, 1]
        mn <- mean (cat_commits$slope, na.rm = TRUE)
    }
    c (mean_slope = mn, change = slope)
}
res <- t (vapply (unique (pkgs$category [!is.na (pkgs$category)]), function (i)
                  category_stats (pkgs, commits_ros, i), numeric (2)))
res <- data.frame (category = c ("RStudio-all", "rOpenSci", rownames (res)),
                   npcr = c (npcr_rst, npcr_ros, res [, 1]),
                   change = c (slope_rst, slope_ros, res [, 2]),
                   stringsAsFactors = FALSE) %>%
    filter (!is.na (change))
# Leave RStudio and rOpenSci at the top, and sort all other rows
index <- c (1, 2, 2 + order (res$npcr [3:nrow (res)], decreasing = TRUE))
knitr::kable (res [index, ], digits = c (0, 3, 3), row.names = FALSE)
```
```{r category-summaries, echo = FALSE}
cats <- res$category [index] [3:nrow (res)]
```

The image-processing category is the only category that outperforms
RStudio, both in terms of overall non-primary commits (through having the
smallest-magnitude, least negative, non-primary contribution rate of all), and
in that tendency increasing more strongly over time (`change` =
`r signif (res$change[res$category=="image-processing"], 2)`). The following
three categories (`r paste0 (cats [2:4], collapse = ", ")`) all have mean
non-primary contribution rates that are relatively close to zero, yet these
become more negative over time, indicating *decreasing* degrees of community
engagement in the code of these packages. The next category of `r cats [5]` has
the second-highest rate of increase in engagement over time (`change` =
`r signif (res$change[res$category==cats[5]], 2)`). The highest rate of
increase in engagement comes from the geospatial category, to which most of my
packages belong. So at least I am potentially part of one small yet positive
contribution to the broader rOpenSci community.