{"id":19644837,"url":"https://github.com/uscbiostats/software-dev","last_synced_at":"2025-10-05T00:04:09.272Z","repository":{"id":47551569,"uuid":"78798165","full_name":"USCbiostats/software-dev","owner":"USCbiostats","description":"Coding Standards for the USC Biostats group","archived":false,"fork":false,"pushed_at":"2025-03-30T11:36:25.000Z","size":77927,"stargazers_count":36,"open_issues_count":1,"forks_count":8,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-04-05T09:05:42.241Z","etag":null,"topics":["hpc","r","reproducibility","reproducible-research","standards"],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/USCbiostats.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-01-13T00:00:44.000Z","updated_at":"2025-03-30T11:36:31.000Z","dependencies_parsed_at":"2025-03-28T18:39:31.185Z","dependency_job_id":null,"html_url":"https://github.com/USCbiostats/software-dev","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/USCbiostats%2Fsoftware-dev","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/USCbiostats%2Fsoftware-dev/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/USCbiostats%2Fsoftware-dev/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/USCbiostats%2Fsoftware-dev/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/own
ers/USCbiostats","download_url":"https://codeload.github.com/USCbiostats/software-dev/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251319757,"owners_count":21570451,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hpc","r","reproducibility","reproducible-research","standards"],"created_at":"2024-11-11T14:30:13.860Z","updated_at":"2025-10-05T00:04:09.266Z","avatar_url":"https://github.com/USCbiostats.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput:\n    rmarkdown::github_document:\n      html_preview: false\n---\n\n# Software Development Standards ![GitHub last commit](https://img.shields.io/github/last-commit/USCbiostats/software-dev)\n\nThis project's main contents are located in the project's [Wiki](wiki#welcome-to-the-software-development-wiki).\n\n# USCbiostats R packages\n\n```{r setup, include=FALSE}\nlibrary(httr)\nlibrary(stringr)\nlibrary(knitr)\nlibrary(scholar)   # \u003c--- The key difference\n```\n\n\n```{r, include=FALSE}\nknitr::opts_chunk$set(warning = FALSE, message = FALSE)\n```\n\n```{r, include=FALSE}\n# We'll assume `packages.csv` has columns:\n# name, repo, on_bioc, scholar_id, pubid, google_scholar, description\n# Lines starting with '#' in CSV are ignored.\n\npkgs \u003c- read.csv(\"packages.csv\", comment.char = \"#\", stringsAsFactors = FALSE)\n\n# If on_bioc does not exist, create it\nif (!\"on_bioc\" %in% names(pkgs)) {\n  pkgs$on_bioc \u003c- FALSE\n} else {\n  # Convert text \"TRUE\"/\"FALSE\" to logical\n  pkgs$on_bioc 
\u003c- ifelse(pkgs$on_bioc %in% c(\"TRUE\",\"True\",\"true\"), TRUE, FALSE)\n}\n\n# Check CRAN status\npkgs$on_cran \u003c- TRUE\nfor (i in seq_len(nrow(pkgs))) {\n  nm \u003c- pkgs$name[i]\n  url \u003c- sprintf(\"https://cran.r-project.org/package=%s\", nm)\n  resp \u003c- tryCatch(GET(url), error = function(e) e)\n  if (inherits(resp,\"error\") || status_code(resp) != 200) {\n    pkgs$on_cran[i] \u003c- FALSE\n  }\n}\n\n# Sort packages by name\npkgs \u003c- pkgs[order(pkgs$name), , drop=FALSE]\npkgs \u003c- pkgs[!(is.na(pkgs$name) | pkgs$name == \"\"), ]\n\n# Build the data frame that will become our final table\ndat \u003c- data.frame(\n  Name        = character(nrow(pkgs)),\n  Description = character(nrow(pkgs)),\n  Citations   = character(nrow(pkgs)),  # will fill in\n  stringsAsFactors = FALSE\n)\n\nfor (i in seq_len(nrow(pkgs))) {\n  nm       \u003c- pkgs$name[i]\n  repo_url \u003c- if (!is.na(pkgs$repo[i]) \u0026\u0026 nzchar(pkgs$repo[i])) {\n    pkgs$repo[i]\n  } else {\n    paste0(\"https://github.com/USCbiostats/\", nm)\n  }\n  # The clickable package name\n  dat$Name[i] \u003c- sprintf(\"[**%s**](%s)\", nm, repo_url)\n  \n  desc_txt \u003c- pkgs$description[i]  # base description\n  \n  # If on CRAN, add badges\n  if (pkgs$on_cran[i]) {\n    desc_txt \u003c- paste(\n      desc_txt,\n      sprintf(\"[![CRAN status](https://www.r-pkg.org/badges/version/%1$s)](https://CRAN.R-project.org/package=%1$s)\", nm),\n      sprintf(\"[![CRAN downloads](https://cranlogs.r-pkg.org/badges/grand-total/%1$s)](https://CRAN.R-project.org/package=%1$s)\", nm)\n    )\n  }\n  \n  # If on Bioc, add a badge\n  if (pkgs$on_bioc[i]) {\n  desc_txt \u003c- paste(\n    desc_txt,\n    # Build status shield\n    sprintf(\n      \"[![BioC build status](https://bioconductor.org/shields/build/release/bioc/%s.svg)](https://bioconductor.org/packages/release/bioc/html/%1$s.html)\",\n      nm\n    ),\n    # Downloads rank shield\n    sprintf(\n      \"[![BioC 
downloads](https://bioconductor.org/shields/downloads/release/%s.svg)](https://bioconductor.org/packages/release/bioc/html/%1$s.html)\",\n      nm)\n    )\n  }\n  \n  dat$Description[i] \u003c- desc_txt\n}\n\n# Initialize Citations\ndat$Citations \u003c- \"\"\n```\n\n```{r, include=FALSE}\n# -----------------------------\n# 1) Scholar approach:\n# -----------------------------\nget_scholar_citation_count \u003c- function(sid, pubid, pkg_name) {\n  # If there's a specific publication ID\n  if (!is.na(pubid) \u0026\u0026 nzchar(pubid)) {\n    # Use get_article_cite_history() + sum the 'cites'\n    article_hist \u003c- tryCatch(\n      get_article_cite_history(sid, pubid),\n      error = function(e) NULL\n    )\n    if (is.data.frame(article_hist) \u0026\u0026 nrow(article_hist) \u003e 0 \u0026\u0026 \"cites\" %in% names(article_hist)) {\n      return(sum(article_hist$cites))\n    } else {\n      return(NA_integer_)\n    }\n  } else {\n    # Otherwise, fallback to the fuzzy match on package name in get_publications()\n    pubs \u003c- tryCatch(\n      get_publications(sid),\n      error = function(e) NULL\n    )\n    if (!is.null(pubs) \u0026\u0026 is.data.frame(pubs) \u0026\u0026 nrow(pubs) \u003e 0) {\n      idx \u003c- which(grepl(pkg_name, pubs$title, ignore.case=TRUE))\n      if (length(idx) \u003e 0) {\n        # Return the first match's total cites\n        return(pubs$cites[idx[1]])\n      }\n    }\n    return(NA_integer_)\n  }\n}\n\n# -----------------------------\n# 2) Old HTML scraping approach:\n# -----------------------------\n# We'll define a function that tries to parse a Google Scholar URL (like ?cites=...)\n# using readLines or GET+iconv, then run a regex to find \"XXX results\" lines. \n# If found, return XXX as integer. Otherwise NA.\nget_html_scrape_citation_count \u003c- function(gs_url) {\n  \n  if (is.na(gs_url) || !nzchar(gs_url)) {\n    return(NA_integer_)\n  }\n  \n  # We'll fetch as raw and convert. 
\n  page_txt \u003c- tryCatch({\n    resp \u003c- httr::GET(gs_url)\n    if (httr::status_code(resp) != 200) {\n      stop(\"HTTP status not 200\")\n    }\n    raw_ct \u003c- httr::content(resp, as=\"raw\")\n    # rawToChar() without multiple=TRUE yields one string for the whole page\n    # (multiple=TRUE would return a vector of single bytes instead)\n    txt    \u003c- iconv(rawToChar(raw_ct), from=\"UTF-8\", to=\"UTF-8\", sub=\"byte\")\n    txt\n  }, error = function(e) {\n    return(NULL)\n  })\n  \n  if (is.null(page_txt)) {\n    return(NA_integer_)\n  }\n  \n  # We'll split into lines\n  lines \u003c- strsplit(page_txt, \"\\n\", fixed=TRUE)[[1]]\n  \n  # Remove some tags. (Might or might not help.)\n  lines \u003c- gsub(\"\\\\\u003c[[:alnum:]_/-]+\\\\\u003e\", \"\", lines, perl=TRUE)\n  \n  # The old code used a regex looking for something like \"123 results (0.23 sec)\"\n  # e.g. \"([0-9,]+)[\\\\s\\\\n]+results?[\\\\s\\\\n]+\\\\([\\\\s\\\\n]*[0-9]+\" \n  # But Scholar might say \"About 123 results...\"\n  # So we match either form with one single-line pattern:\n  # \"About X results\" or \"X results\"\n  re \u003c- \"(?:About\\\\s+)?([0-9,]+)\\\\s+results\"\n  m  \u003c- regexpr(re, lines, perl=TRUE, ignore.case=TRUE)\n  # Find the first line that matches\n  line_idx \u003c- which(m != -1)\n  if (length(line_idx) == 0) {\n    return(NA_integer_)\n  }\n  \n  # Take the matched text itself (e.g. \"About 1,234 results\") from the first\n  # matching line, so unrelated numbers elsewhere on that line are ignored\n  match_txt \u003c- regmatches(lines, m)[1]\n  \n  # Extract the numeric portion of the match\n  nums_found \u003c- stringr::str_extract_all(match_txt, \"[0-9,]+\")[[1]]\n  if (length(nums_found) == 0) {\n    return(NA_integer_)\n  }\n  \n  # Convert e.g. 
\"1,234\" -\u003e 1234\n  cites_int \u003c- as.integer(gsub(\"[^0-9]\", \"\", nums_found[1]))\n  cites_int\n}\n\ntot_citations \u003c- 0L\n\n# Now we loop over each package row\nfor (i in seq_len(nrow(pkgs))) {\n  \n  pkg_name  \u003c- pkgs$name[i]\n  sid       \u003c- if (\"scholar_id\" %in% names(pkgs)) pkgs$scholar_id[i] else NA_character_\n  pubid     \u003c- if (\"pubid\"      %in% names(pkgs)) pkgs$pubid[i]      else NA_character_\n  old_link  \u003c- if (\"google_scholar\" %in% names(pkgs)) pkgs$google_scholar[i] else NA_character_\n  \n  cval \u003c- NA_integer_\n  \n  # 1) Try scholar approach if sid is not empty\n  if (!is.na(sid) \u0026\u0026 nzchar(sid)) {\n    cval \u003c- get_scholar_citation_count(sid, pubid, pkg_name)\n    if (!is.na(cval) \u0026\u0026 cval \u003e= 0) {\n      # If we got a valid integer from Scholar\n      if (!is.na(pubid) \u0026\u0026 nzchar(pubid)) {\n        # We have a link to the actual publication\n      dat$Citations[i] \u003c- sprintf(\"[%d](%s)\", cval, old_link)\n\n      } else {\n        # We only have the count, no direct pub link\n        dat$Citations[i] \u003c- as.character(cval)\n      }\n      tot_citations \u003c- tot_citations + cval\n      next  # Done with this package\n    }\n  }\n  \n  # 2) Fallback: old HTML approach using google_scholar column\n  cval_html \u003c- get_html_scrape_citation_count(old_link)\n  if (!is.na(cval_html) \u0026\u0026 cval_html \u003e= 0) {\n    dat$Citations[i] \u003c- sprintf(\"[%d](%s)\", cval_html, old_link)\n    tot_citations \u003c- tot_citations + cval_html\n  }\n}\n```\n\n\n```{r printing, echo = FALSE}\nknitr::kable(dat, row.names = FALSE)\n```\n\nAs of `r Sys.Date()`, the packages listed here have been cited **`r tot_citations`** times\n(source: Google Scholar).\n\nTo update this list, modify the file [packages.csv](packages.csv). 
The\n`README.md` file is updated automatically using GitHub Actions, so there's no\nneed to \"manually\" recompile the README file after updating the list. \n\n\n# Coding Standards\n\n1.  [Coding Standards](wiki#coding-standards)\n2.  [Software Thinking](wiki/coding-standards.md#software-thinking)\n3.  [Development Workflow](wiki/coding-standards.md#development-workflow)\n4.  [Misc](wiki/coding-standards.md#misc)\n\nSome direct guidelines are also available as issue templates [here](templates). \n\n# Bioghost Server\n\n1.  [Introduction](wiki/Bioghost-server.md#introduction)\n2.  [Setup](wiki/Bioghost-server.md#setup)\n3.  [Cheat Sheet](wiki/Bioghost-server.md#cheat-sheet)\n\n# HPC in R\n    \n*   [Parallel computing in R](wiki/HPC-in-R.md#parallel-computing-in-r)  \n*   [parallel](wiki/HPC-in-R.md#parallel)\n*   [iterators+foreach](wiki/HPC-in-R.md#foreach)\n*   [RcppArmadillo + OpenMP](wiki/HPC-in-R.md#rcpparmadillo-and-openmp)\n\n# Happy Scientist Seminars\n\nThe Happy Scientist Seminars are educational seminars sponsored by Core D of IMAGE, the Biostats Program Project award. The series aims to provide educational material for members of Biostats, both students and faculty, about a variety of tools and methods that might prove useful to them. If you have suggestions for subjects you would like to learn about in the future, please email Kim Siegmund (kims@usc.edu). Our agenda will be driven by your specific interests as far as possible. \n\nA list of past seminars with material can be found [here](/happy_scientist/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuscbiostats%2Fsoftware-dev","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fuscbiostats%2Fsoftware-dev","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuscbiostats%2Fsoftware-dev/lists"}