https://github.com/uscbiostats/software-dev
Coding Standards for the USC Biostats group
https://github.com/uscbiostats/software-dev
hpc r reproducibility reproducible-research standards
Last synced: 5 days ago
JSON representation
Coding Standards for the USC Biostats group
- Host: GitHub
- URL: https://github.com/uscbiostats/software-dev
- Owner: USCbiostats
- Created: 2017-01-13T00:00:44.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2025-03-30T11:36:25.000Z (6 months ago)
- Last Synced: 2025-04-05T09:05:42.241Z (6 months ago)
- Topics: hpc, r, reproducibility, reproducible-research, standards
- Language: HTML
- Size: 74.3 MB
- Stars: 36
- Watchers: 20
- Forks: 8
- Open Issues: 1
-
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
README
---
output:
rmarkdown::github_document:
html_preview: false
---# Software Development Standards 
This project's main contents are located in the project's [Wiki](wiki#welcome-to-the-software-development-wiki).
# USCbiostats R packages
```{r setup, include=FALSE}
library(httr)
library(stringr)
library(knitr)
library(scholar) # <--- The key difference
``````{r, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
``````{r, include=FALSE}
# We'll assume `packages.csv` has columns:
# name, repo, on_bioc, scholar_id, pubid, google_scholar, description
# Lines starting with '#' in CSV are ignored.pkgs <- read.csv("packages.csv", comment.char = "#", stringsAsFactors = FALSE)
# If on_bioc does not exist, create it
if (!"on_bioc" %in% names(pkgs)) {
pkgs$on_bioc <- FALSE
} else {
# Convert text "TRUE"/"FALSE" to logical
pkgs$on_bioc <- ifelse(pkgs$on_bioc %in% c("TRUE","True","true"), TRUE, FALSE)
}# Check CRAN status
pkgs$on_cran <- TRUE
for (i in seq_len(nrow(pkgs))) {
nm <- pkgs$name[i]
url <- sprintf("https://cran.r-project.org/package=%s", nm)
resp <- tryCatch(GET(url), error = function(e) e)
if (inherits(resp,"error") || status_code(resp) != 200) {
pkgs$on_cran[i] <- FALSE
}
}# Sort packages by name
pkgs <- pkgs[order(pkgs$name), , drop=FALSE]
pkgs <- pkgs[!(is.na(pkgs$name) | pkgs$name == ""), ]# Build the data frame that will become our final table
dat <- data.frame(
Name = character(nrow(pkgs)),
Description = character(nrow(pkgs)),
Citations = character(nrow(pkgs)), # will fill in
stringsAsFactors = FALSE
)for (i in seq_len(nrow(pkgs))) {
nm <- pkgs$name[i]
repo_url <- if (!is.na(pkgs$repo[i]) && nzchar(pkgs$repo[i])) {
pkgs$repo[i]
} else {
paste0("https://github.com/USCbiostats/", nm)
}
# The clickable package name
dat$Name[i] <- sprintf("[**%s**](%s)", nm, repo_url)
desc_txt <- pkgs$description[i] # base description
# If on CRAN, add badges
if (pkgs$on_cran[i]) {
desc_txt <- paste(
desc_txt,
sprintf("[](https://CRAN.R-project.org/package=%1$s)", nm),
sprintf("[](https://CRAN.R-project.org/package=%1$s)", nm)
)
}
# If on Bioc, add a badge
if (pkgs$on_bioc[i]) {
desc_txt <- paste(
desc_txt,
# Build status shield
sprintf(
"[](https://bioconductor.org/packages/release/bioc/html/%1$s.html)",
nm
),
# Downloads rank shield
sprintf(
"[](https://bioconductor.org/packages/release/bioc/html/%1$s.html)",
nm)
)
}
dat$Description[i] <- desc_txt
}# Initialize Citations
dat$Citations <- ""
``````{r, include=FALSE}
# -----------------------------
# 1) Scholar approach:
# -----------------------------
get_scholar_citation_count <- function(sid, pubid, pkg_name) {
# If there's a specific publication ID
if (!is.na(pubid) && nzchar(pubid)) {
# Use get_article_cite_history() + sum the 'cites'
article_hist <- tryCatch(
get_article_cite_history(sid, pubid),
error = function(e) NULL
)
if (is.data.frame(article_hist) && nrow(article_hist) > 0 && "cites" %in% names(article_hist)) {
return(sum(article_hist$cites))
} else {
return(NA_integer_)
}
} else {
# Otherwise, fallback to the fuzzy match on package name in get_publications()
pubs <- tryCatch(
get_publications(sid),
error = function(e) NULL
)
if (!is.null(pubs) && is.data.frame(pubs) && nrow(pubs) > 0) {
idx <- which(grepl(pkg_name, pubs$title, ignore.case=TRUE))
if (length(idx) > 0) {
# Return the first match's total cites
return(pubs$cites[idx[1]])
}
}
return(NA_integer_)
}
}# -----------------------------
# 2) Old HTML scraping approach:
# -----------------------------
# We'll define a function that tries to parse a Google Scholar URL (like ?cites=...)
# using readLines or GET+iconv, then run a regex to find "XXX results" lines.
# If found, return XXX as integer. Otherwise NA.
get_html_scrape_citation_count <- function(gs_url) {
if (is.na(gs_url) || !nzchar(gs_url)) {
return(NA_integer_)
}
# We'll fetch as raw and convert.
page_txt <- tryCatch({
resp <- httr::GET(gs_url)
if (httr::status_code(resp) != 200) {
stop("HTTP status not 200")
}
raw_ct <- httr::content(resp, as="raw")
txt <- iconv(rawToChar(raw_ct, multiple=TRUE), from="UTF-8", to="UTF-8", sub="byte")
txt
}, error = function(e) {
return(NULL)
})
if (is.null(page_txt)) {
return(NA_integer_)
}
# We'll split into lines
lines <- strsplit(page_txt, "\n", fixed=TRUE)[[1]]
# Remove some tags. (Might or might not help.)
lines <- gsub("\\<[[:alnum:]_/-]+\\>", "", lines, perl=TRUE)
# The old code used a regex looking for something like "123 results (0.23 sec)"
# e.g. "([0-9,]+)[\\s\\n]+results?[\\s\\n]+\\([\\s\\n]*[0-9]+"
# But Scholar might say "About 123 results..."
# So we can attempt a simpler approach:
# "About X results" or "X results"
re <- "About\\s+([0-9,]+)\\s+results\\s*(\\([^)]*\\))?|
([0-9,]+)\\s+results\\s*(\\([^)]*\\))?"
# We'll try both capturing groups
m <- regexpr(re, lines, perl=TRUE, ignore.case=TRUE)
# Find the first line that matches
line_idx <- which(m != -1)
if (length(line_idx) == 0) {
return(NA_integer_)
}
# We'll just pick the first match
line_of_interest <- lines[line_idx[1]]
# Extract the numeric portion
# We'll do two sub captures, so:
match_txt <- regmatches(line_of_interest, m[1])
# We'll use a simpler approach with stringr if you prefer:
library(stringr)
# This pattern tries to find numbers in the text
nums_found <- str_extract_all(line_of_interest, "[0-9,]+")[[1]]
if (length(nums_found) == 0) {
return(NA_integer_)
}
# Convert e.g. "1,234" -> 1234
cites_int <- as.integer(gsub("[^0-9]", "", nums_found[1]))
cites_int
}tot_citations <- 0L
# Now we loop over each package row
for (i in seq_len(nrow(pkgs))) {
pkg_name <- pkgs$name[i]
sid <- if ("scholar_id" %in% names(pkgs)) pkgs$scholar_id[i] else NA_character_
pubid <- if ("pubid" %in% names(pkgs)) pkgs$pubid[i] else NA_character_
old_link <- if ("google_scholar" %in% names(pkgs)) pkgs$google_scholar[i] else NA_character_
cval <- NA_integer_
# 1) Try scholar approach if sid is not empty
if (!is.na(sid) && nzchar(sid)) {
cval <- get_scholar_citation_count(sid, pubid, pkg_name)
if (!is.na(cval) && cval >= 0) {
# If we got a valid integer from Scholar
if (!is.na(pubid) && nzchar(pubid)) {
# We have a link to the actual publication
dat$Citations[i] <- sprintf("[%d](%s)", cval, old_link)} else {
# We only have the count, no direct pub link
dat$Citations[i] <- as.character(cval)
}
tot_citations <- tot_citations + cval
next # Done with this package
}
}
# 2) Fallback: old HTML approach using google_scholar column
cval_html <- get_html_scrape_citation_count(old_link)
if (!is.na(cval_html) && cval_html >= 0) {
dat$Citations[i] <- sprintf("[%d](%s)", cval_html, old_link)
tot_citations <- tot_citations + cval_html
}
}
``````{r printing, echo = FALSE}
knitr::kable(dat, row.names = FALSE)
```As of `r Sys.Date()`, the packages listed here have been cited **`r tot_citations`** times
(source: Google Scholar).To update this list, modify the file [packages.csv](packages.csv). The
`README.md` file is updated automatically using GitHub Actions, so there's no
need to "manually" recompile the README file after updating the list.# Coding Standards
1. [Coding Standards](wiki#coding-standards)
2. [Software Thinking](wiki/coding-standards.md#software-thinking)
3. [Development Workflow](wiki/coding-standards.md#development-workflow)
4. [Misc](wiki/coding-standards.md#misc)We do have some direct guidelines developed as issue templates [here](templates).
# Bioghost Server
1. [Introduction](wiki/Bioghost-server.md#introduction)
2. [Setup](wiki/Bioghost-server.md#setup)
3. [Cheat Sheet](wiki/Bioghost-server.md#cheat-sheet)# HPC in R
* [Parallel computing in R](wiki/HPC-in-R.md#parallel-computing-in-r)
* [parallel](wiki/HPC-in-R.md#parallel)
* [iterators+foreach](wiki/HPC-in-R.md#foreach)
* [RcppArmadillo + OpenMP](wiki/HPC-in-R.md#rcpparmadillo-and-openmp)# Happy Scientist Seminars
The Happy Scientist Seminars are educational seminars sponsored by Core D of IMAGE, the Biostats Program Project award. This series, the "Happy Scientist" seminar series, is aimed at providing educational material for members of Biostats, both students and faculty, about a variety of tools and methods that might prove useful to them. If you have any suggestions for subjects that you would like to learn about in future, please send email to Kim Siegmund at (kims@usc.edu). Our agenda will be driven by your specific interests as far as is possible.
A list of past seminars with material can be found [here](/happy_scientist/).