{"id":22538178,"url":"https://github.com/cfpb/proxy-methodology","last_synced_at":"2026-02-17T17:32:41.669Z","repository":{"id":20164477,"uuid":"23435142","full_name":"cfpb/proxy-methodology","owner":"cfpb","description":null,"archived":false,"fork":false,"pushed_at":"2024-04-11T17:19:32.000Z","size":87994,"stargazers_count":74,"open_issues_count":1,"forks_count":42,"subscribers_count":24,"default_branch":"master","last_synced_at":"2025-10-07T20:59:26.900Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Stata","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cfpb.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-08-28T16:31:53.000Z","updated_at":"2025-08-05T14:51:57.000Z","dependencies_parsed_at":"2025-02-02T07:39:04.186Z","dependency_job_id":null,"html_url":"https://github.com/cfpb/proxy-methodology","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cfpb/proxy-methodology","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cfpb%2Fproxy-methodology","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cfpb%2Fproxy-methodology/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cfpb%2Fproxy-methodology/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cfpb%2Fproxy-methodology/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cfpb","download_url":"https://codeload.github.com/
cfpb/proxy-methodology/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cfpb%2Fproxy-methodology/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29551257,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-17T14:33:00.708Z","status":"ssl_error","status_checked_at":"2026-02-17T14:32:58.657Z","response_time":100,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-07T11:10:40.146Z","updated_at":"2026-02-17T17:32:41.637Z","avatar_url":"https://github.com/cfpb.png","language":"Stata","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BISG_RACE_ETHNICITY\n\nIn conducting fair lending analysis in both supervisory and enforcement\ncontexts, the Bureau’s Office of Research (OR) and Division of Supervision,\nEnforcement, and Fair Lending (SEFL) rely on a Bayesian Improved Surname\nGeocoding (BISG) proxy method, which combines geography- and surname-based\ninformation into a single proxy probability for race and ethnicity used in fair\nlending analysis conducted for non-mortgage products.\nThis document describes the steps needed to build the BISG proxies.\n\nThe methodology described here is an example of a proxy methodology that\nOR and SEFL use, although we may alter this methodology in particular 
analyses,\ndepending on the circumstances involved.\nIn addition, the proxy method may be revised as we become aware of enhancements\nthat would increase accuracy and performance.\nFor more details, see [“Using Publicly Available Information to Proxy for\nUnidentified Race and Ethnicity: A Methodology and Assessment”][paper].\n\nIncluded is a series of Stata scripts and subroutines that prepare the\npublicly available census geography and surname data and that construct the\nsurname-only, geography-only, and BISG proxies for race and ethnicity.\nThe scripts, subroutines, and data provided here do not contain directly\nidentifiable personal information or other confidential information,\nsuch as confidential supervisory information.\n\nPlease note that all scripts and subroutines are written for execution in\nStata 12 on a Linux platform and may need to be modified for other environments.\nUsers must define a number of parameters, including file paths and arguments for subroutines.\nThe scripts that define the subroutines also identify and describe arguments, as required.\n\nUsers must supply their own application- or individual-level data,\nand any geocoding of those data must occur prior to the execution of the\nscript sequence: **this code assumes that the input application- or\nindividual-level data are already geocoded with census block group,\ncensus tract, and 5-digit ZIP code.**\n\nAn example is included to show the user how to execute\nthe proxy-building code sequence.\nIt relies on a set of fictitious data constructed by `create_test_data.do` from\nthe publicly available census surname list and geography data.\nIt is provided to illustrate how `main.do` is set up to run the proxy-building\ncode and does not reflect any particular individual’s or\ninstitution’s information.\n\nA control script, `/scripts/main.do`, is included to step through the process below.\nThe user will need to change paths and define parameters as 
required.\n\n1. Geocode the data in a geocoding software package (for example, ArcGIS)\n   to obtain tract and block group identifiers for each record.\n1. Build name and geography proxies from Census files included in `/input_files`:\n   1. Census surname list:\n      1. `/scripts/surname_creation_lower.do`—takes .csv file of census surnames,\n         formats surnames to be read as all lower case,\n         and imputes any suppressed values.\n         File created by `surname_creation_lower.do`:\n         1. `/input_files/created/census_surnames_lower.dta`\n      1. In order to prepare the user-defined datasets for use with the Census surname list,\n         basic cleaning of surnames using regular expressions and other forms of\n         name standardization is required.\n         This script exists at: `/scripts/surname_parser.do`.\n         File created by `surname_parser.do` in user-defined directory:\n         1. ``` `dir'/proxy_name.dta ```\n   1. Census geographies:\n      1. `/scripts/create_attr_over18_all_geo_entities.do`—uses the base information,\n         for individuals age 18 and older, from the Census flat files for\n         block group, tract, and ZIP code[\u003csup\u003e1\u003c/sup\u003e](#fn-1) and allocates\n         \"Some Other Race\"[\u003csup\u003e2\u003c/sup\u003e](#fn-2) to each group in proportion.\n         It creates three files (one each for block group, tract, and ZIP code)\n         with geo probabilities for use in proxy:\n         1. `/input_files/created/blkgrp_attr_over18.dta`\n         1. `/input_files/created/tract_attr_over18.dta`\n         1. `/input_files/created/zip_attr_over18.dta`\n1. Calculate the BISG probabilities following the methodology described in\n   [“Using Publicly Available Information to Proxy for Unidentified Race and Ethnicity:\n   A Methodology and Assessment”][paper].\n   1. 
`/scripts/geo_name_merger_all_entities_over18.do`—this program\n      creates three files (one each for block group, tract, and ZIP code)\n      with BISG probabilities in user-defined directory:\n      1. ```/`maindir'/`inst_name'_proxied_blkgrp.dta```\n      1. ```/`maindir'/`inst_name'_proxied_tract.dta```\n      1. ```/`maindir'/`inst_name'_proxied_zip.dta```\n1. The final step is to merge together the block group, tract, and ZIP code-based BISG proxies\n   and choose the most precise proxy given the precision of geocoding,\n   e.g. block group (if available), then tract (if available), or ZIP code\n   (if block group and tract unavailable) using:\n   1. `/scripts/combine_probs.do`\n      File created by combine_probs.do in user-defined directory:\n      1. ```/`maindir'/`inst_name'_`file'proxied_final.dta```\n\nPlease direct all questions, comments, and suggestions to:\nCFPB_proxy_methodology_comments@cfpb.gov.\n\n---\n\n\u003ca aria-hidden=\"true\" href=\"#fn-1\" class=\"anchor\" name=\"user-content-fn-1\"\u003e\u003cspan class=\"octicon octicon-link\"\u003e\u003c/span\u003e\u003c/a\u003e\n\u003csup\u003e1\u003c/sup\u003e\n    When referring to ZIP code demographics, we match the institution-based\n    ZIP code information to ZIP Code Tabulation Areas (ZCTAs) as defined by\n    the U.S. Census Bureau.\n\n\u003ca aria-hidden=\"true\" href=\"#fn-2\" class=\"anchor\" name=\"user-content-fn-2\"\u003e\u003cspan class=\"octicon octicon-link\"\u003e\u003c/span\u003e\u003c/a\u003e\n\u003csup\u003e2\u003c/sup\u003e\n    In the 2010 SF1, the U.S. 
Census Bureau produced tabulations that report\n    counts of Hispanics and non-Hispanics by race.\n    These tabulations include a “Some Other Race” category.\n    We reallocate the “Some Other Race” counts to each of the remaining six\n    race and ethnicity categories using an Iterative Proportional Fitting\n    procedure to make geography-based demographic categories consistent with\n    those on the census surname list.\n\n\n---\n\n## Update to proxy methodology – April 2017\n\nIn the summer 2014 edition of _Supervisory Highlights_,[\u003csup\u003e3\u003c/sup\u003e](#fn-3)\nthe Bureau previously reported that examination teams use a\nBayesian Improved Surname Geocoding (BISG) proxy methodology for race and\nethnicity in their fair lending analysis of non-mortgage credit products.\nThe BISG methodology relies on the distribution of race and ethnicity based on\nplace of residence and surname, which is publicly available information from\nthe Census Bureau. The method involves constructing a probability of assignment to race\nand ethnicity based on demographic information associated with surname and then\nupdating this probability using the demographic characteristics of the census\nblock group associated with place of residence. The updating is performed\nthrough the application of a Bayesian algorithm, which yields an integrated\nprobability that can be used to proxy for an individual’s race and\nethnicity.[\u003csup\u003e4\u003c/sup\u003e](#fn-4)\n\nThrough March of 2017, examination teams had relied on the surname list derived\nfrom the 2000 Decennial Census of the Population in their construction of the\nBISG proxy for race and ethnicity.[\u003csup\u003e5\u003c/sup\u003e](#fn-5) In December 2016, the\nU.S. Census Bureau released a list of the most frequently occurring surnames\nbased on data derived from the 2010 Decennial Census of the Population. 
The updated\n2010 list generally uses the same definitions and formats as the list based on\nthe 2000 Census but includes updated values for total counts and race and\nethnicity shares associated with each surname.[\u003csup\u003e6\u003c/sup\u003e](#fn-6) In total,\nthe new surname list provides information on the 162,253 surnames that appear\nat least 100 times in the 2010 Census, covering approximately 90% of the\npopulation.[\u003csup\u003e7\u003c/sup\u003e](#fn-7) While 146,516 names appear on both the 2000\nand 2010 surname lists, the 2010 list contains 15,737 names that do not appear\non the 2000 list, whereas the 2000 list contains 5,155 names that do not appear\non the 2010 list.[\u003csup\u003e8\u003c/sup\u003e](#fn-8)\n\nAs of April 2017, examination teams are relying on an updated proxy methodology\nthat reflects the newly available surname data from the Census Bureau. Our\nupdated proxy methodology relies on the race and ethnicity shares for the\n162,253 names that appear on the 2010 list and supplements this list with the\nrace and ethnicity shares for the 5,155 names that appear on the 2000 list but\nnot on the 2010 list, resulting in a list of 167,409 surnames in\ntotal.[\u003csup\u003e9\u003c/sup\u003e](#fn-9)\n\nThe updated name list, statistical software code written in Stata, and other\npublicly available data used to build the BISG proxy are now available\nin this repository.\n\nPlease direct all questions, comments, and suggestions to:\nCFPB_proxy_methodology_comments@cfpb.gov.\n\n---\n\n## Update to proxy methodology – April 2024\n\nAs of April 2024, examination teams performing fair lending analysis \nare relying on the newly available demographic data from the Census Bureau. 
The\nnew demographic data is derived from the 2020 Census Demographic and Housing Characteristics\n(DHC) File.[\u003csup\u003e10\u003c/sup\u003e](#fn-10) To derive the new demographic files, the Bureau pulled DHC Table P11 at\nthe Block Group, Census Tract, and ZIP Code levels through the Census API.[\u003csup\u003e11\u003c/sup\u003e](#fn-11)\nThis product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau.\n\nThe updated demographic data and modified versions of the BISG proxy code using the new 2020 demographic\ndata are now available in the 2024-update folder in this repository.  The repository also continues\nto contain all of the previous code and data for users who would prefer to continue to generate proxies\nusing the 2010 demographic data.\n\nPlease direct all questions, comments, and suggestions to:\nCFPB_proxy_methodology_comments@cfpb.gov.\n\n---\n\n\u003ca aria-hidden=\"true\" href=\"#fn-3\" class=\"anchor\" name=\"user-content-fn-3\"\u003e\u003cspan class=\"octicon octicon-link\"\u003e\u003c/span\u003e\u003c/a\u003e\n\u003csup\u003e3\u003c/sup\u003e\n    See Consumer Financial Protection Bureau,\n    [_Supervisory Highlights: Summer 2014_](http://files.consumerfinance.gov/f/201409_cfpb_supervisory-highlights_auto-lending_summer-2014.pdf)\n    (Sept. 17, 2014).  \n\n\u003ca aria-hidden=\"true\" href=\"#fn-4\" class=\"anchor\" name=\"user-content-fn-4\"\u003e\u003cspan class=\"octicon octicon-link\"\u003e\u003c/span\u003e\u003c/a\u003e\n\u003csup\u003e4\u003c/sup\u003e\n    For more information on the methodology, see Consumer Financial Protection Bureau,\n    [_Using publicly available information to proxy for unidentified race and ethnicity_][paper]\n    (Sept. 
2014).\n\n\u003ca aria-hidden=\"true\" href=\"#fn-5\" class=\"anchor\" name=\"user-content-fn-5\"\u003e\u003cspan class=\"octicon octicon-link\"\u003e\u003c/span\u003e\u003c/a\u003e\n\u003csup\u003e5\u003c/sup\u003e\n    See _id_.\n\n\u003ca aria-hidden=\"true\" href=\"#fn-6\" class=\"anchor\" name=\"user-content-fn-6\"\u003e\u003cspan class=\"octicon octicon-link\"\u003e\u003c/span\u003e\u003c/a\u003e\n\u003csup\u003e6\u003c/sup\u003e\n    For more details on the updated 2010 surname list, including revisions to\n    the 2000 methodology and programming, see Joshua Comenetz,\n    [_Frequently Occurring Surnames in the 2010 Census_](http://www2.census.gov/topics/genealogy/2010surnames/surnames.pdf)\n    (Oct. 2016).\n\n\u003ca aria-hidden=\"true\" href=\"#fn-7\" class=\"anchor\" name=\"user-content-fn-7\"\u003e\u003cspan class=\"octicon octicon-link\"\u003e\u003c/span\u003e\u003c/a\u003e\n\u003csup\u003e7\u003c/sup\u003e\n    The surname data are available on the Census Bureau’s website, see\n    [_Frequently Occurring Surnames from the 2010 Census_](https://www.census.gov/topics/population/genealogy/data/2010_surnames.html)\n    (last revised Dec. 
27, 2016).\n\n\u003ca aria-hidden=\"true\" href=\"#fn-8\" class=\"anchor\" name=\"user-content-fn-8\"\u003e\u003cspan class=\"octicon octicon-link\"\u003e\u003c/span\u003e\u003c/a\u003e\n\u003csup\u003e8\u003c/sup\u003e\n    Names must appear at least 100 times in the 2010 Decennial Census\n    in order to be included on the surname list.\n\n\u003ca aria-hidden=\"true\" href=\"#fn-9\" class=\"anchor\" name=\"user-content-fn-9\"\u003e\u003cspan class=\"octicon octicon-link\"\u003e\u003c/span\u003e\u003c/a\u003e\n\u003csup\u003e9\u003c/sup\u003e\n    Although these names are not on the 2010 list, and thus likely no longer\n    meet the 100-name threshold, we chose to include them so as to incorporate\n    as much available surname information as possible into the proxy.\n\n\u003ca aria-hidden=\"true\" href=\"#fn-10\" class=\"anchor\" name=\"user-content-fn-10\"\u003e\u003cspan class=\"octicon octicon-link\"\u003e\u003c/span\u003e\u003c/a\u003e\n\u003csup\u003e10\u003c/sup\u003e\n     See\n    [2020 Census Demographic and Housing Characteristics File (DHC)](https://www.census.gov/data/tables/2023/dec/2020-census-dhc.html)\n    (last revised Sep. 27, 2023).\n\n\u003ca aria-hidden=\"true\" href=\"#fn-11\" class=\"anchor\" name=\"user-content-fn-11\"\u003e\u003cspan class=\"octicon octicon-link\"\u003e\u003c/span\u003e\u003c/a\u003e\n\u003csup\u003e11\u003c/sup\u003e\n     See\n    [2020 DHC Table P11](https://data.census.gov/table?q=p11\u0026g=010XX00US$1500000\u0026y=2020\u0026d=DEC%20Demographic%20and%20Housing%20Characteristics).\n\n[paper]: http://www.consumerfinance.gov/reports/using-publicly-available-information-to-proxy-for-unidentified-race-and-ethnicity/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcfpb%2Fproxy-methodology","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcfpb%2Fproxy-methodology","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcfpb%2Fproxy-methodology/lists"}