{"id":35125166,"url":"https://github.com/djvill/slac-fairness","last_synced_at":"2026-05-21T09:04:18.287Z","repository":{"id":176390919,"uuid":"657352453","full_name":"djvill/SLAC-Fairness","owner":"djvill","description":"Tools to assess fairness and mitigate unfairness in sociolinguistic auto-coding","archived":false,"fork":false,"pushed_at":"2024-04-08T17:33:52.000Z","size":33178,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-04-08T21:18:49.322Z","etag":null,"topics":["machine-learning","ml-fairness","research-methods","sociolinguistics"],"latest_commit_sha":null,"homepage":"https://djvill.github.io/SLAC-Fairness/","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/djvill.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-06-22T22:00:20.000Z","updated_at":"2023-06-26T20:12:29.000Z","dependencies_parsed_at":"2023-11-07T02:23:41.198Z","dependency_job_id":null,"html_url":"https://github.com/djvill/SLAC-Fairness","commit_stats":null,"previous_names":["djvill/slac-fairness"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/djvill/SLAC-Fairness","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djvill%2FSLAC-Fairness","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djvill%2FSLAC-Fairness/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djvill%2FSLAC-Fairness/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djvill%2FSLAC-Fai
rness/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/djvill","download_url":"https://codeload.github.com/djvill/SLAC-Fairness/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djvill%2FSLAC-Fairness/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33295263,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-21T02:57:32.698Z","status":"ssl_error","status_checked_at":"2026-05-21T02:57:31.990Z","response_time":62,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","ml-fairness","research-methods","sociolinguistics"],"created_at":"2025-12-28T02:02:46.105Z","updated_at":"2026-05-21T09:04:18.280Z","avatar_url":"https://github.com/djvill.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# `SLAC-Fairness`: Tools to assess fairness and mitigate unfairness in sociolinguistic auto-coding\n\n_Dan Villarreal (Department of Linguistics, University of Pittsburgh)_\n\n![](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)\n\nThis work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).\n\n\n## Introduction\n\nThis GitHub repository is a companion to the paper \"Sociolinguistic auto-coding has fairness problems too: Measuring 
and mitigating overlearning bias\", published open-access in _Linguistics Vanguard_ in 2024: \u003chttps://doi.org/10.1515/lingvan-2022-0114\u003e.\nIn the paper, I investigate **sociolinguistic auto-coding (SLAC)** through the lens of **machine-learning fairness**.\nJust as some algorithms produce biased predictions by _overlearning_ group characteristics, I find that the same is true for SLAC.\nAs a result, I attempt **unfairness mitigation strategies (UMSs)** as techniques for removing gender bias in auto-coding predictions (without harming overall auto-coding performance too badly).\n\n\n**_Repository navigation:_**\n\n- [_Repository homepage_](https://djvill.github.io/SLAC-Fairness)\n- [_Repository code_](https://github.com/djvill/SLAC-Fairness)\n- [_Analysis walkthrough_](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough)\n- [_Unfairness mitigation strategy descriptions_](https://djvill.github.io/SLAC-Fairness/UMS-Info)\n\n\n### If you're new to sociolinguistic auto-coding (SLAC)\n\nSociolinguistic auto-coding is a machine-learning method for classifying variable linguistic data (often phonological data), such as the alternation between _park_ \u0026 \"_pahk_\" or _working_ \u0026 _workin'_.\n\nYou can learn more about SLAC by reading the following resources.\n\n- Dan Villarreal et al.'s 2020 _Laboratory Phonology_ article [\"From categories to gradience: Auto-coding sociophonetic variation with random forests\"](https://doi.org/10.5334/labphon.216)\n- Tyler Kendall et al.'s 2021 _Frontiers in AI_ article [\"Considering performance in the automated and manual coding of sociolinguistic variables: Lessons from variable (ING)\"](https://doi.org/10.3389/frai.2021.648543)\n- Dan Villarreal et al.'s 2019 tutorial [\"How to train your classifier\"](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html)\n  - Explains the R code powering my implementation of SLAC\n\n\n## What's the point of this repository?\n\nFirst, you can 
**reproduce** the analysis I performed for the _Linguistics Vanguard_ paper, using the same data and code that I did.\nSimply follow the analysis walkthrough [tutorial](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html).\n\nSecond, you can also [adapt this code](#adapting-this-code-to-your-own-projects) to your own projects.\nYou might want to use it if you want to (1) **assess fairness** for a [pre-existing auto-coder](#assessing-fairness-for-a-pre-existing-auto-coder) and/or (2) create a **fair auto-coder** by [testing unfairness mitigation strategies](#testing-unfairness-mitigation-strategies) on your data.\n\nFinally, I invite [comments, critiques, and questions](#auditing-this-code-to-critique-andor-suggest-changes) about this code.\nI've made this code available for transparency's sake, so please don't hesitate to reach out!\n\n\n## What's in this repository?\n\nThe files in this repository fall into a few categories.\nClick the links below to jump to the relevant subsection:\n\n- Info written for humans\n  - `README.md`: What you're reading now\n  - `Analysis-Walkthrough.Rmd` \u0026 `.html`: Tutorial for replicating analysis in paper\n  - `UMS-Info.Rmd` \u0026 `.html`: Descriptions of unfairness mitigation strategies\n- [Input data](#input-data)\n  - `Input-Data/`\n- [Outputs](#outputs)\n  - `Outputs/`\n- [Code that does stuff](#code)\n  - `R-Scripts/`\n  - `Shell-Scripts/`\n- [Code/info pertaining to the repository itself](#codeinfo-pertaining-to-the-repository-itself)\n  - `.gitignore`\n  - `LICENSE.md`\n  - `renv/`\n  - `renv.lock`\n  - `.Rprofile`\n  - `_includes/`\n\nYou can browse files [here](https://github.com/djvill/SLAC-Fairness).\n\n\n### A quick note on the two-computer setup\n\nThis repository's structure reflects the two-computer setup I used to run this analysis.\nI generated and measured auto-coders on a more powerful system that is not quite as user-friendly (Pitt's [CRC](https://crc.pitt.edu/)), then analyzed the metrics on 
my less-powerful-but-user-friendlier laptop.\n(It's perfectly fine to use a one-computer setup if you don't have access to high-performance computing;\nthe code [will take longer to run](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html#running-time) on a less-powerful machine, but it still might be faster/easier than the HPC learning curve!)\nIn the rest of this section, I'll refer to this two-computer split several times.\n\n\n### Input data\n\nContents:\n\n- `Input-Data/`\n  - `LabPhonClassifier.Rds`: Pre-existing auto-coder to analyze for fairness. This auto-coder is the same as in [\"How to train\"](https://github.com/nzilbb/How-to-Train-Your-Classifier/blob/main/LabPhonClassifier.Rds), but with a `Gender` column added to the auto-coder's `trainingData` element.\n  - `trainingData.Rds`: /r/ data for generating auto-coders that use unfairness mitigation strategies, also available [here](https://github.com/nzilbb/How-to-Train-Your-Classifier/blob/crc-version/Data/trainingData.Rds). This is the result of [step 1 in \"How to train\"](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#step-1). A dataset with more tokens (but less acoustic information) is available [here](https://github.com/nzilbb/Sld-R-Data).\n  - `meanPitches.csv`: Pitch data to use for UMS 3.1 (normalizing speaker pitch). I measured pitch (F0) for word-initial /r/ tokens and calculated each speaker's average minimum and maximum pitch.\n  - `UMS-List.txt`: Tab-separated file matching UMS codes to descriptions. 
This is also used by the R scripts to define the set of acceptable UMS codes.\n\nThe /r/ and pitch data comes from Southland New Zealand English, historically New Zealand's only regional variety, which is characterized by variable rhoticity.\nThe New Zealand Institute of Language, Brain and Behaviour maintains a corpus of sociolinguistic interviews with Southland English speakers totaling over 83 hours of data.\nThis corpus is hosted in an instance of [LaBB-CAT](https://labbcat.canterbury.ac.nz/);\nthe data files were downloaded from LaBB-CAT, with subsequent data-wrangling in R (including speaker anonymization).\n\n_Skip ahead for info on using your own [auto-coder](#assessing-fairness-for-a-pre-existing-auto-coder) and [training data](#using-your-own-training-data), or modifying the set of [UMSs](#adding-andor-subtracting-umss)._\n\n\n### Outputs\n\nContents:\n\n- `Outputs/`\n  - `Autocoders-to-Keep/`\n    - \"Final\" auto-coders (saved as `.Rds` files). Unlike the temporary auto-coders, this folder is version-controlled (see [info on `.gitignore`](#codeinfo-pertaining-to-the-repository-itself)), so it's useful for selectively saving auto-coders we want to hold onto.\n  - `Shell-Scripts/`\n    - Text files (saved with the `.out` file extension) that record any output of [shell scripts](#shell-scripts), including errors. Useful for diagnosing issues with the code if something goes wrong.\n  - `Performance/`\n    - Tabular data (saved as `.csv` files) with metrics of auto-coders' performance (e.g., overall accuracy) and fairness (e.g., accuracy for women's vs. men's tokens). 
These files bridge the [two-computer split](#a-quick-note-on-the-two-computer-setup): we extract metrics on a more powerful system (see [walkthrough](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html#baseline-metrics)) so we can analyze them on a user-friendlier computer.\n  - `Diagnostic-Files/`: Temporary files that are useful only in the moment (e.g., peeking \"under the hood\" to diagnose a code issue if something goes wrong) and/or too large to [share between computers](#a-quick-note-on-the-two-computer-setup). Most files are [`.gitignore`d](#codeinfo-pertaining-to-the-repository-itself), save for empty `dummy_file`s that exist only so the [empty folders can be shared to GitHub](https://stackoverflow.com/a/8418403).\n    - `Model-Status/`\n      - Temporary files (with extension `.tmp`) that are created during [optimization for performance](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html#baseline) to signal which auto-coders are completed or running.\n    - `Temp-Autocoders/`\n      - Auto-coders run with different UMSs, for which we want to measure performance and fairness but which we don't need to version-control\n  - `Other/`: Files mostly meant for passing info between scripts\n    - `Var-Imp*.csv`: Data on variable importance for [\"precursor\" UMSs](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html#meas-precursor) used for UMSs 2.1.x.\n    - `Best-Params*.csv`: Optimal hyperparameters from [hyperparameter-tuning](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#step-4) runs.\n    - `Drop-Log*.csv`: Log files for [outlier-dropping](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#step-5) runs.\n\nAll these outputs were generated using the data and code in this repository.\nYou may want to create a 'clean' version of the repository without any of these outputs, to see if your system replicates the outputs I got.\n\n\n### 
Code\n\nContents:\n\n- `R-Scripts/`: Scripts that do the heavy lifting of running auto-coders and facilitating analysis.\n- `Shell-Scripts/`: Scripts meant for the user to run; these scripts call the R scripts and collate their outputs.\n\n\nThe division of labor has two benefits.\nFirst, it makes the code more modular, so a larger process isn't completely lost if just one part fails.\nSecond, many high-performance computing environments don't allow users to run code on demand; instead, users submit job requests, packaged into shell scripts, to a workload management system (aka job queue).\nThese shell scripts are written to be compatible with [Slurm](https://slurm.schedmd.com/), the job queue used by Pitt's [CRC](https://crc.pitt.edu/) clusters.\nIf your computing environment _doesn't_ require submitting job requests, the shell scripts should still run as-is.\nYou also have the option of forgoing the shell scripts and running the R scripts directly.\n\n\n\n\n#### R scripts\n\nThese include 'main scripts' that run auto-coders and 'helper scripts' that define functionality shared among the main scripts.\n\nMain scripts:\n\n- `R-Scripts/`\n  - `Run-UMS.R`: Generates a single auto-coder according to an unfairness mitigation strategy.\n  - `Hyperparam-Tuning.R`: Subjects an auto-coder to [hyperparameter tuning](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#step-4), one stage of optimizing an auto-coder for performance. 
(Note: This tunes only what \"How to train\" calls [`ranger` parameters](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#ranger-parameters-mtry-splitrule-min.node.size) because I'm now less sure that the other hyperparameters are appropriate for tuning.)\n  - `Outlier-Dropping.R`: Subjects an auto-coder to [outlier dropping](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#step-5), one stage of optimizing an auto-coder for performance.\n\nThe main scripts are written to be called from a command-line client like Bash, using the command `Rscript`.\n(To use `Rscript`, R needs to be in your [PATH](#running-this-code-on-your-own-machine).)\nFor example, if you navigate Bash to `R-Scripts/`, you can run `Rscript Run-UMS.R --ums 0.0`.\nThese scripts take several arguments (like `--ums`);\nto see the full list of arguments, run `Rscript \u003cscript-name\u003e --help` from the command line.\nIf you prefer working exclusively in R, you can use `rscript()` from [`callr`](https://cran.r-project.org/package=callr) to call these scripts from within R (e.g., `callr::rscript(\"Run-UMS.R\", c(\"--ums\", \"0.0\"), wd=\"R-Scripts/\", stdout=\"../Shell-Scripts/Output/Run-UMS_UMS0.0.out\", stderr=\"2\u003e\u00261\")`).\n\n\nHelper scripts:\n\n- `R-Scripts/`\n  - `UMS-Utils.R`: Contains utility functions for generating and analyzing auto-coders that utilize UMSs. 
The most important functions are:\n    - `umsData()`: Reshape data for auto-coder by applying UMS, only keeping necessary columns, and optionally dropping outliers\n    - `umsFormula()`: Specify model formula based on UMS\n    - `cls_fairness()`: Investigate auto-coder fairness (see [walkthrough](#rq2-ums-utils))\n    - `cls_summary()`: Generate one-row dataframe of fairness/performance metrics (see [walkthrough](#rq2-ums-utils))\n  - `Rscript-Opts.R`: Defines command-line options for how main scripts should run.\n  - `Session-Info.R`: Combines and prints R session info from the outputs of multiple scripts. Meant to be used in shell scripts.\n\n\n#### Shell scripts\n\nContents:\n\n- `Shell-Scripts/`\n  - `Run-UMS.sh`: Generates a single auto-coder according to a UMS, and optionally optimizes it for performance. This flexible lower-level script is useful for exploratory analysis.\n  - `Baseline.sh`: Wrapper script that calls `Run-UMS.sh` for [baseline auto-coder](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html#baseline) (mostly exists to override the default [outfile](#outputs) name)\n  - `UMS-Round1.sh`: Generates auto-coders according to UMSs whose codes start with 0, 1, 2, or 3 (save for UMS 0.0, the baseline).\n  - `UMS-Round2.sh`: Generates auto-coders according to UMSs whose codes start with 4.\n\nThe shell scripts are written to be called from Bash, using the commands `bash` (to run directly) or `sbatch` (to submit to a Slurm job queue; [see above](#code)).\nFor example, if you navigate Bash to `Shell-Scripts/`, you can run `sbatch Baseline.sh`.\n`Run-UMS.sh` takes two arguments: a UMS numerical code (e.g., `sbatch Run-UMS.sh 4.2.1`) and an optional `-o` flag to [optimize the auto-coder for performance](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html#baseline).\nAll other shell scripts hard-code these options (as well as other options passed to the R scripts), but these can be adjusted as needed (under the heading `##EDITABLE 
PARAMETERS`).\n\n\nThe shell scripts also collate output from the R scripts;\nI recommend saving this output to a text file.\nThese scripts include a Slurm command that automatically writes script output to a corresponding `.out` file in `Outputs/Shell-Scripts/`.\nIf you're not using Slurm, you can append a command that tells Bash where to send outputs, including errors (e.g., `bash Baseline.sh \u0026\u003e ../Outputs/Shell-Scripts/Baseline.out`);\nif you omit this part of the command (e.g., `bash Baseline.sh`), the output will simply print in Bash.\n\n\nCRC's cluster uses [Lmod](http://lmod.readthedocs.org) to make modules (like R) available to shell scripts via the `module load` command.\nYour system may not need to load modules explicitly, or may use different commands to load R.\n\n\n### Code/info pertaining to the repository itself\n\nContents:\n\n- `.gitignore`: Tells Git which files/folders to exclude from being version-controlled (and being shared to GitHub or [between computers](#a-quick-note-on-the-two-computer-setup)). Because the auto-coders are huge files, I exclude `Outputs/Diagnostic-Files/Temp-Autocoders/` from version-control and just [pull out fairness/performance data](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html#baseline-measure) instead. If there's any I want to keep, I save them to the non-ignored folder `Outputs/Autocoders-to-Keep/`.\n- `LICENSE.md`: Tells you what you're permitted to do with this code.\n- `renv/`: Set up by the [`renv` package](https://rstudio.github.io/renv/) to ensure our code behaves the same regardless of package updates. See more info [below](#renv).\n- `renv.lock`: Set up by `renv` to store [info about package versions](https://rstudio.github.io/renv/articles/lockfile.html).\n- `.Rprofile`: Contains R code to run at the start of any R session in this repository. In this case, this code was set up by `renv` to run a script that loads the package versions recorded in `renv.lock`. 
If you want to disable `renv`, simply delete this file.\n- `_includes/`: Contains code that is inserted into the `\u003chead\u003e` element of this site's webpages. The site uses [GitHub Pages](https://pages.github.com/) and the [Jekyll](https://jekyllrb.com/) theme [Primer](https://github.com/pages-themes/primer) to render its pages. This is not necessary for any of the code's core functionality.\n\n\n## Running this code on your own machine\n\nTo run this code on your own machine, you'll need a suitable computing environment and software.\nAll required and recommended software is free and open-source.\nThis code was originally run using high-performance computing resources provided by the University of Pittsburgh's [Center for Research Computing (CRC)](https://crc.pitt.edu/), in particular its [shared memory parallel cluster](https://crc.pitt.edu/resources/h2p-user-guide/node-configuration).\nYou _can_ run this code on a normal desktop or laptop---it just might take a while!\nYou'll also need at least 400 MB of free disk space.\nSee the [walkthrough](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html#script-info) for more information about machine specs, running time, and disk space used.\n\n\nRequired software:\n\n- The statistical computing language [R](https://cloud.r-project.org/) (version \u003e= 4.3.0)\n  - Since these scripts call R from the command line, R must be in your PATH (directions for [Windows](https://info201.github.io/r-intro#windows-command-line), [macOS](https://www.architectryan.com/2012/10/02/add-to-the-path-on-mac-os-x-mountain-lion/#.Uydjga1dXDg), [Unix](https://unix.stackexchange.com/a/26059))\n    - To check, run `Rscript -e R.version.string` at the command line. If you see your R version, then R is in your PATH; if you get the error `Rscript: command not found`, R is not.\n- R packages:\n  - `tidyverse` (v. \u003e= 2.0.0)\n  - `magrittr` (v. \u003e= 2.0.3)\n  - `caret` (v. \u003e= 6.0-94)\n  - `ranger` (v. 
\u003e= 0.15.1)\n  - `ROCR` (v. \u003e= 1.0-11)\n  - `foreach` (v. \u003e= 1.5.2)\n  - `doParallel` (v. \u003e= 1.0.17)\n  - `optparse` (v. \u003e= 1.7.3)\n  - `this.path` (v. \u003e= 2.0.0) \n  - `benchmarkme` (v. \u003e= 1.0.8)\n  - `rmarkdown` (v. \u003e= 2.22)\n  - `knitr` (v. \u003e= 1.43)\n  - `renv` (v. \u003e= 0.17.3)\n  - These packages will install dependencies that you don't need to install directly. See full R session info [here](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html#R-session-info)\n- The command-line client [Bash](https://www.gnu.org/software/bash/) (v. \u003e= 5.0.0)\n  - If you install Git (recommended), Bash is included in the install\n- The document converter [Pandoc](https://pandoc.org/) (v. \u003e= 2.19)\n  - If you install RStudio (recommended), Pandoc is included in the install\n\nPlease note that R and its packages are continually updated, so in the future the code may not work as expected (or at all!).\nIf you hit a brick wall, don't hesitate to [reach out](#auditing-this-code-to-critique-andor-suggest-changes)!\n\n\nI also recommend using Git and GitHub to create your own shareable version of the code; \ndoing so will help me effectively troubleshoot any issues you have. \nIn particular:\n\n1. [Download Git](https://git-scm.com/downloads) onto your computer.\n    - See this [Git tutorial](https://github.com/djvill/LSA2019-Reproducible-Research) if you've never used it before.\n1. Sign up for a free [GitHub account](https://github.com/join)\n1. [Fork this repository](https://github.com/djvill/SLAC-Fairness/fork) (keep the same repository name), and clone it onto your computer.\n1. Test out the code on your own system: Edit the code, create commits, push your commits to your remote fork.\n    - You may want to create a 'clean' version of the repository without any of the [generated outputs](#outputs), to see if your system replicates my outputs.\n1. 
[Reach out](#auditing-this-code-to-critique-andor-suggest-changes)!\n\n\nFinally, I recommend using the integrated development environment [RStudio](https://www.rstudio.com/products/rstudio/download/).\nWhile it doesn't change how the code in this repository works, RStudio makes R code easier to understand, write, edit, and debug.\n\n\n### `renv`\n\nThis repository uses the [`renv`](https://rstudio.github.io/renv/) package to ensure that updates to R packages don't break the code.\nIn effect, `renv` freezes your environment in time by preserving the package versions the code was originally run on.\nThis is great from a reproducibility perspective, but it entails some extra machinery before you can run the code.\nFor all the examples below, you need to load this repo in R or RStudio by setting your working directory somewhere inside the repo.\n\n\nBefore you can run any of this code, run `renv::restore()`.\nThis will download the packages at the correct versions to an `renv` cache on your system.\nThen you should be able to run this code on your machine.\n\n\nOf course, using old versions of these packages means you won't be able to benefit from any package updates since this repo was published.\nIf you want to use new package versions, you have to register them with `renv`.\nIf you're using R 4.3.x (the version used for this code), run `renv::update()`;\nif you're using R \u003e= 4.4, run `renv::init()` and select option 2.\nTo update `renv` itself, run `renv::upgrade()`.\nBe aware that after updating, the code may not work as expected thanks to changes to the packages it relies on.\nIf you're satisfied with how the code runs, you can register the updated versions with `renv::snapshot()`.\n\n\nIf you want to use a package that's not registered with `renv`, use `renv::record()`.\n\n\nFinally, if you're finding this all too much of a hassle, you can skip using `renv` altogether;\njust delete `.Rprofile` and restart R/RStudio.\n\n\n\n## Adapting this code to your own projects\n\nHow much you want to 
adapt this code is really up to you.\nYou might want to 'carbon-copy' this analysis on your own project, but in all likelihood your project will dictate that you make some changes to better fit your project's needs.\nJust as [\"training an auto-coder is not a one-size-fits-all process\"](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#introduction), so too is auto-coding fairness.\nFor example, if you are confident that your predictor data (e.g., acoustic measures) does not suffer from measurement error, you can skip the time-consuming step of [accounting for outliers](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#do_we_even_need_to_mark_outliers).\nIn some cases, this code might not necessarily work for your project; \nfor example, this code only handles fairness across two groups, and it only handles binary classification (two categories).\n\nBelow, you can read about:\n\n- Assessing fairness for a [pre-existing auto-coder](#assessing-fairness-for-a-pre-existing-auto-coder)\n- Creating a fair auto-coder by [testing unfairness mitigation strategies](#testing-unfairness-mitigation-strategies)\n  - Using your own [training data](#using-your-own-training-data)\n  - Adding and/or subtracting [UMSs](#adding-andor-subtracting-umss)\n\n\nIn addition, I strongly recommend making the data you use for this task publicly available if possible, since open data helps advance science (see Villarreal \u0026 Collister \"Open methods in linguistics\", in press for Oxford collection _Decolonizing linguistics_).\nHowever, if you do so, make sure what you share conforms to the ethics/IRB agreement(s) in place when the data was collected (if applicable).\n\nFinally, if there's anything in this code that you can't figure out or isn't working for you, please **don't hesitate to [reach out](#auditing-this-code-to-critique-andor-suggest-changes)!**\nPlease note that there is no warranty for this 
code.\n\n\n### Assessing fairness for a pre-existing auto-coder\n\nThis is one possible goal of your analysis, mirroring the _Linguistics Vanguard_ paper's [RQ2](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html#RQ2).\nTo analyze your auto-coder, it needs to have been generated by `caret::train()`.\nThe auto-coder's `trainingData` element also needs a column with group data (e.g., which tokens belong to female vs. male speakers).\nIf you use the scripts in this repository, that's taken care of for you; \n`umsData()` retains the group column in the training dataframe passed to `train()`, and `umsFormula()` excludes the group column from the predictor set.\nHowever, if you didn't use these scripts to run your auto-coder, you'll need to either manually add the group column to the `trainingData` element, or just re-run your auto-coder using these scripts.\n\n\nIf your auto-coder conforms to these requirements, you can use functions from `R-Scripts/UMS-Utils.R` to analyze fairness.\nSee the [walkthrough](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html#rq2-ums-utils) for examples of how to use this code.\n\n\n### Testing unfairness mitigation strategies\n\nThis is the other possible goal of your analysis, mirroring the _Linguistics Vanguard_ paper's [RQ3](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html#RQ3).\n\n\n#### Using your own training data\n\nYou'll need your own training data (in place of `trainingData.Rds`), and you may need normalization data depending on which UMSs you want to try.\n\nFormatting requirements for training data:\n\n- Tabular data (data stored in rows and columns), saved in a `.csv` or `.Rds` file\n- Each row represents a single token of some categorical linguistic variable\n- At least some of the tokens have been coded into classes (in `trainingData.Rds`, these are tokens for which the column `HowCoded==\"Hand\"`)\n- Columns needed for auto-coder:\n  - 1 column with variant labels for 
already-coded tokens and blanks/`NA`s for uncoded tokens (`Rpresent` in `trainingData.Rds`)\n    - Currently, this code only handles binary classification (two categories, not counting `NA`s)\n  - 1 column with the group that you're assessing fairness for (`Gender` in `trainingData.Rds`)\n    - Currently, this code only handles two-group fairness\n  - Multiple columns that contain predictors that the auto-coder will use for coding (in `trainingData.Rds`, 180 columns from `tokenDur` to `absSlopeF0`, inclusive)\n    - Can be any data type\n\n\nIf you want to perform any speaker normalization (either as a preprocessing step or as UMS 3.1), you'll also need:\n\n- In your training data, a `Speaker` column\n- An additional data file with normalization baselines (like `meanPitches.csv`) :\n  - One row per speaker, with every speaker in your training data\n  - A `Speaker` column\n  - A column for each baseline measure you want to use for normalization (`MinPitch` in `meanPitches.csv` is used as baseline for the `F0min` measure in the training data, `MaxPitch` for the `F0max` measure)\n\n\nIn addition to those requirements, here are some data formatting recommendations (you don't _have_ to format your data this way, but if not you'll need to tinker with the code some more).\nSome of these pertain to the [data preprocessing step in \"How to train\"](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#step-1):\n\n- If you suspect that your predictors have [measurement error](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#do_we_even_need_to_mark_outliers) and you want to take advantage of the outlier-dropping script, then you need to mark measurement outliers. 
Outliers for predictors `X` \u0026 `Y` (for example) should be marked as `TRUE` in columns `X_Outlier` \u0026 `Y_Outlier`.\n- The code drops rows that have `NA`s for the dependent variable, group, and predictor columns (and outlier columns, if applicable). Depending on how many missing measurements you have, you might want to consider imputing measurements and/or thinning your predictor set.\n- You might want to add a `HowCoded` column to easily separate hand-coded and auto-coded tokens.\n- Thanks to [pre-processing](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#step-1), `trainingData.Rds` reflects normalized measurements for formant timepoints but not pitch, so UMS 3.1 involves pitch normalization. You might decide to fold _all_ normalization into pre-processing and skip normalization as a UMS.\n- You may want to anonymize your speakers, especially if you choose to make your data open, as in `trainingData.Rds`, but this is strictly optional.\n\n\nFinally, as a general note, you may have to tweak the [R scripts](#r-scripts) a little bit to accommodate your data.\nFor example:\n\n- These scripts assume the training data file is an `.Rds` file. If it's a `.csv`, you'll have to tweak the code\n- If your columns have different names than the ones in `trainingData.Rds` (e.g., if your dependent variable isn't `Rpresent`), you'll need to find-and-replace column names in the scripts. 
Alternatively, you can pass column names as arguments to `UMS-Utils.R` functions (e.g., `umsData(myData, dependent=ING, group=Ethnicity)`).\n- If you don't have a `HowCoded` column, you'll need to modify the lines of code that refer to that column.\n- Depending on the size of your predictor set, you'll want to change the default value of `mtry` (the number of predictors attempted at each split) in `Rscript-Opts.R` and `Hyperparam-Tuning.R`; a typical value is the square root of the number of predictors, rounded down to the nearest integer.\n- If you rename or reorganize folders or files, you'll need to change the code to account for that.\n\n\n#### Adding and/or subtracting UMSs\n\nDepending on your groups, your predictor set, and/or your dependent variable, you might want to add or subtract UMSs.\nFor example, if you already have equal token counts for women vs. men, UMS 1.1 (downsample men to equalize token counts by gender) wouldn't apply.\n\n\nTo add a new UMS:\n\n1. Pick a new [UMS code](https://djvill.github.io/SLAC-Fairness/UMS-Info.html#ums-codes)\n    - Don't use a code that's already been defined (it just creates unnecessary complications)\n    - If it's a combination UMS, the code should start with `4`\n1. Add the code and description to `Input-Data/UMS-List.txt`\n1. Modify `umsData()` in `R-Scripts/UMS-Utils.R`\n    - Single UMSs: Add a new `} else if (UMS==\"\u003cnew-UMS\u003e\") {` block to the `implementUMS()` subroutine\n    - Combination UMSs: Add code to interpret the second \u0026 third digits near the bottom of `umsData()`\n    - Note that `umsData()` uses [`tidyselect` semantics](https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html) for several arguments (`dependent`, `group`, `predictors`, \u0026 `dropCols`). 
If you're using any of these column names in a `dplyr` function, wrap them in double-braces (e.g., `data %\u003e% select({{dependent}}, {{group}})`); if you need a column name as a string, use the deparse-substitute trick (e.g., `depName \u003c- deparse(substitute(dependent))`)\n1. If using a shell script to run multiple UMSs in a single round, edit the script so the UMS code is matched by the `pattern` regex and not by `excl` (e.g., to include UMS 5.1, use `pattern=^[0-35]`)\n\nYou only need to subtract a UMS explicitly if you're using a shell script to run multiple UMSs in a single round.\nTo subtract a UMS, use the `excl` regex to exclude it (e.g., to exclude UMSs 1.4 and 2.2, use `excl=\"^1.4|2.2\"`).\nNo need to modify `umsData()`, since the code will just skip over that UMS in the chain of `else if {}` statements.\n\nNote that the existing UMS list is actually more general than its descriptions suggest.\nFor example, UMSs 1.3.1 and 1.3.2 both achieve equal /r/ base rates by gender, by downsampling either women's Absent (1.3.1) or men's Present (1.3.2).\nHowever, `umsData()` actually translates this into \"downsample one of the classes from the smaller group\" vs. 
\"the bigger group\", automatically detecting which class to downsample from which group.\nTry plugging your data into `umsData()` to see whether the existing code affects your data the way you expect.\n\n\n## Auditing this code to critique and/or suggest changes\n\nReaders are more than welcome to critique this code!\nWhile I think much of this code is pretty solid, there are no doubt some bugs here and there, some inefficient code implementations, and/or some tortured data-scientific reasoning.\nYou can [raise GitHub issues](https://github.com/djvill/SLAC-Fairness/issues), [start discussions](https://github.com/djvill/SLAC-Fairness/discussions), or [send me an email](mailto:d.vill@pitt.edu?subject=[SLAC-Fairness]%20Auditing%20code).\n\n**Please don't be afraid to suggest changes, report bugs, or ask questions---I want this code to be useful for you, and there are no bad questions!**\n\n\n## Citing this repository\n\nIf you use this repository, **please cite it**!\nStudies show that research [software](https://doi.org/10.1002/asi.23538) and [data](https://doi.org/10.1371/journal.pone.0136631) are under-cited, which makes it hard for contributors to gauge usage or get credit.\nHere's a recommended citation:\n\n\u003e Villarreal, Dan. 2023. `SLAC-Fairness`: Tools to assess fairness and mitigate unfairness in sociolinguistic auto-coding. 
Available at https://djvill.github.io/SLAC-Fairness/.\n\n\n## Acknowledgements\n\nI would like to thank Chris Bartlett, the Southland Oral History Project (Invercargill City Libraries and Archives), and the speakers for sharing their data and their voices.\nThanks are also due to Lynn Clark, Jen Hay, Kevin Watson, and the New Zealand Institute of Language, Brain and Behaviour for supporting this research.\nValuable feedback was provided by audiences at NWAV 49, the Penn Linguistics Conference, Pitt Computer Science, and the Michigan State SocioLab.\nOther resources were provided by a Royal Society of New Zealand Marsden Research Grant (16-UOC-058) and the University of Pittsburgh Center for Research Computing (specifically, the H2P cluster supported by NSF award number OAC-2117681).\nAny errors are mine entirely.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdjvill%2Fslac-fairness","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdjvill%2Fslac-fairness","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdjvill%2Fslac-fairness/lists"}