{"id":17101339,"url":"https://github.com/hackerb9/ssa-baby-names","last_synced_at":"2026-03-12T15:01:16.771Z","repository":{"id":113746466,"uuid":"442931521","full_name":"hackerb9/ssa-baby-names","owner":"hackerb9","description":"List of baby names from 1880 to the present from the US Social Security Administration","archived":false,"fork":false,"pushed_at":"2022-01-04T23:45:15.000Z","size":26952,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-09T17:13:53.708Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-2.1","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hackerb9.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-30T01:13:38.000Z","updated_at":"2024-12-26T18:00:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"9e738544-612a-4983-81b0-0ff2e585828d","html_url":"https://github.com/hackerb9/ssa-baby-names","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hackerb9/ssa-baby-names","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fssa-baby-names","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fssa-baby-names/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fssa-baby-names/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fssa-baby-names/manifests","owner_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/owners/hackerb9","download_url":"https://codeload.github.com/hackerb9/ssa-baby-names/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Fssa-baby-names/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30429287,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-12T14:34:45.044Z","status":"ssl_error","status_checked_at":"2026-03-12T14:09:33.793Z","response_time":114,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-14T15:24:50.635Z","updated_at":"2026-03-12T15:01:16.732Z","avatar_url":"https://github.com/hackerb9.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ssa-baby-names\n\nList of baby names from 1880 to the present from the US Social\nSecurity Administration\n\n## Overview\n\nThis is a corpus of all the first names that were given to five or\nmore babies in the United States for every year from 1880 to the\npresent. It includes scripts for retrieving and processing the\ndatabase from https://www.ssa.gov/oact/babynames/names.zip.\n\n\n## Data Files\n\n* [alldata.txt](alldata.txt): All the Social Security Administration\n  birthname data merged into a single file, instead of having a\n  separate yob*.txt file for each year. Sorted by year then sex then\n  popularity. 
Format is comma-separated values, delimited by newlines:\n\n  \u003e \u0026lt;_name_\u003e,\u0026lt;_sex_\u003e,\u0026lt;_occurrences_\u003e,\u0026lt;_year_\u003e\n\n  Note: SSA data only includes names given to at least five babies in\n  a single year and does not include single letter names.\n\n  2,020,863 entries, 33 MiB.\n\n* [maxoccurances.txt](maxoccurances.txt): The maximum number of\n  occurrences of each name (merged sex) and the year that maximum was\n  found. (In case of ties, the earlier year is listed).\n    \n  \u003e \u0026lt;_name_\u003e,\u0026lt;_sex_\u003e,\u0026lt;_max occurrences_\u003e,\u0026lt;_max year_\u003e\n\n  File is sorted by most babies named in a single year.\n  Each name is listed only once, under the sex that had the most\n  occurrences. For example, `Charlie` occurs as male, but not female.\n\n* [maxoccurbysex.txt](maxoccurbysex.txt): The maximum number of\n  occurrences of each name (separated by sex) and the year that maximum\n  was found. (In case of ties, the earlier year is listed).\n    \n  \u003e \u0026lt;_name_\u003e,\u0026lt;_sex_\u003e,\u0026lt;_max occurrences_\u003e,\u0026lt;_max year_\u003e\n\n  \u003cdetails\u003e\n  \n  File is sorted by number of occurrences, from most to least.\n\n  The top five entries:\n\n      Linda,F,99693,1947\n      James,M,94764,1947\n      Michael,M,92718,1957\n      Robert,M,91647,1947\n      John,M,88319,1947\n\n  Names are counted separately by sex, for example:\n\n      Charlie,M,2891,1919\n      Charlie,F,2219,2020\n\n  \u003c/details\u003e\n\n  111,472 entries, 2 MiB.\n  \n* [allnames.txt](allnames.txt): A simple list of every name seen since\n  1880. Sexes are merged, so each name is listed only once. Sorted by\n  most babies in a single year to the least.\n\n  Currently there are 100,364 names, from Aaban to Zzyzx.\n\n* [identifiers.txt](identifiers.txt): Based on allnames.txt, a list of\n  over 100k valid, unique, memorable identifiers. 
Every line contains\n  a sequence of 2 to 15 lowercase ASCII characters. Most will be\n  recognized by English speakers as given names.\n\n  \u003cdetails\u003e\n\n  This differs from [allnames.txt](allnames.txt) in two ways:\n\n  1. The \"NATO Phonetic Alphabet\" has been prepended.\n  2. Names are lowercase.\n\n  While this file could be handy for many things, the idea is that a\n  project which is trying to deobfuscate code can simply grab an\n  identifier from this list to use as a function name. This will\n  hopefully make reading such source code easier.\n\n  Note that no attempt has been made to remove homophones (names that\n  are spelled differently but sound similar). If this turns out to be\n  a problem in practice, we can turn to phonetic algorithms, like\n  Soundex, to keep only one. (More popular spellings appear earlier in\n  the list, so we'd keep \"Aaron\", name number 159, and toss \"Erin\",\n  name number 161).\n\n  \u003c/details\u003e\n\n* [atleast1000.txt](atleast1000.txt),\n  [atleast500.txt](atleast500.txt), [atleast100.txt](atleast100.txt):\n  Simple list, restricted to names that have been given to at least\n  _x_ number of babies in a single year. Sorted from greatest number\n  of occurrences to least.\n\n* [raw-data](raw-data): A directory for the unzipped data files from\n  the SSA's [names.zip](raw-data/names.zip). The files are named by\n  year of birth, for example\n  [raw-data/yob2020.txt](raw-data/yob2020.txt). The format is\n  described by the SSA like so:\n\n  \u003e National Data on the relative frequency of given names in the population\n  \u003e of U.S. births where the individual has a Social Security Number\n  \u003e \n  \u003e (Tabulated based on Social Security records as of March 7, 2021)\n  \u003e \n  \u003e For each year of birth YYYY after 1879, we created a comma-delimited file\n  \u003e called yobYYYY.txt. 
Each record in the individual annual files has the\n  \u003e format \"name,sex,number,\" where name is 2 to 15 characters, sex is M\n  \u003e (male) or F (female) and \"number\" is the number of occurrences of the\n  \u003e name. Each file is sorted first on sex and then on number of occurrences\n  \u003e in descending order. When there is a tie on the number of occurrences,\n  \u003e names are listed in alphabetical order. This sorting makes it easy to\n  \u003e determine a name's rank. The first record for each sex has rank 1, the\n  \u003e second record for each sex has rank 2, and so forth. To safeguard\n  \u003e privacy, we restrict our list of names to those with at least 5\n  \u003e occurrences.\n\n## Scripts\n\n### bin/getlatest.sh\n\n[bin/getlatest.sh](bin/getlatest.sh) downloads the corpus of baby\nnames from the US Social Security Administration, if it has not\nalready been downloaded. It is essentially just\n\n    wget https://www.ssa.gov/oact/babynames/names.zip\n\nbut with lots of error checking.\n\n### bin/processnames.sh\n\n[bin/processnames.sh](bin/processnames.sh) reads the raw data and creates\nthe more easily digested text files, such as [allnames.txt](allnames.txt).\n\n#### Future ideas\n\n* Perhaps sort files by _total_ usage of a name instead of by max\n  number of babies named in a year. This would be different in that\n  names which are only briefly popular would be pushed lower in the\n  list and names which have no stand out years but are perennial\n  favorites will move up. \n\n* Create shorter lists by restricting names to the top _x_ percent.\n\n* Merge names that sound the same onto a single line. This could be\n  done using Knuth's soundex or other phonetic algorithm. (See James\n  Turk's [Jellyfish library](https://github.com/jamesturk/jellyfish).)\n\n## Munging\n\nNote that the database SSA provides has the data munged in various\nways which they have not documented. 
This appears to include:\n\n* One letter names are not recorded\n* ASCII alphabetic characters only (no é, ñ, etc)\n* First letter is capital and the rest lowercase\n* Names longer than fifteen letters are truncated\n* Spaces and hyphens removed (\"Anna Marie\" → \"Annamarie\")\n\nTake the name _María de los Ángeles_, which suffers from all of the\nabove restrictions and is rendered in the database as:\n`Mariadelosangel` and `Mariadelosang`.\n\n### How have munging policies changed over time?\n\nIt is not clear what the SSA policies are concerning the name format\nand how they have changed over the years. Some open questions:\n\n* Is there a better database that the SSA has available? \n\n* The SSA surely has a document describing the precise formatting of names.\n  Where is it?\n\n* Are there other munging operations occurring that are not listed here?\n\n* What is the order of the operations? For example, the name _María\n  de los Ángeles_ is rendered in three different ways in the database:\n\n  \u003cdetails\u003e\u003csummary\u003e$ grep Mariadelos raw-data/yob*\u003c/summary\u003e\n\n  ```grep\n  raw-data/yob1982.txt:Mariadelosangel,F,6\n  raw-data/yob1986.txt:Mariadelosangel,F,8\n  raw-data/yob1987.txt:Mariadelosangel,F,7\n  raw-data/yob1988.txt:Mariadelosangel,F,7\n  raw-data/yob1989.txt:Mariadelosang,F,6\n  raw-data/yob1994.txt:Mariadelos,F,5\n  raw-data/yob1995.txt:Mariadelosang,F,6\n  raw-data/yob1996.txt:Mariadelosang,F,6\n  raw-data/yob1997.txt:Mariadelosang,F,14\n  raw-data/yob1998.txt:Mariadelosang,F,8\n  raw-data/yob1999.txt:Mariadelosang,F,6\n  raw-data/yob2000.txt:Mariadelosang,F,14\n  raw-data/yob2001.txt:Mariadelosang,F,13\n  raw-data/yob2002.txt:Mariadelosang,F,12\n  raw-data/yob2003.txt:Mariadelosang,F,12\n  raw-data/yob2004.txt:Mariadelosang,F,8\n  raw-data/yob2005.txt:Mariadelosang,F,12\n  raw-data/yob2006.txt:Mariadelosang,F,8\n  raw-data/yob2007.txt:Mariadelosang,F,11\n  raw-data/yob2008.txt:Mariadelosang,F,6\n  
raw-data/yob2010.txt:Mariadelosang,F,6\n  ```\n\n  \u003c/details\u003e\n\n  1. `Mariadelosangel` (1982 to 1988)\n  2. `Mariadelosang  ` (1989 and 1995 to 2010)\n  3. `Mariadelos     ` (1994)\n\nIt looks like something odd changed in 1989. One hypothesis that could\n  explain that change is that the transliteration to alphabetic ASCII is\n  happening after the restriction of fifteen characters. Perhaps, because\n  _María de los Ángeles_ has two diacritics, the maximum length is shortened\n  by two.\n\n      María de los Ángeles\n      Mari´a de los A´ngeles\n      Mari´adelosa´ngeles\n      Mari´adelosa´ng          \u003c-- truncate to fifteen characters\n      Mariadelosang\n\n  Supporting evidence: _María del Rosario_ is recorded as\n  `Mariadelrosario` in the 1980s, but is only seen as `Mariadelrosari`\n  after that. It makes sense it would lose only one character because\n  it has only one diacritical mark.\n\n      $ grep Mariadelros raw-data/yob*\n      raw-data/yob1981.txt:Mariadelrosario,F,5\n      raw-data/yob1986.txt:Mariadelrosario,F,6\n      raw-data/yob1987.txt:Mariadelrosario,F,5\n      raw-data/yob1998.txt:Mariadelrosari,F,5\n      raw-data/yob2001.txt:Mariadelrosari,F,6\n      raw-data/yob2002.txt:Mariadelrosari,F,5\n      raw-data/yob2003.txt:Mariadelrosari,F,5\n      raw-data/yob2006.txt:Mariadelrosari,F,5\n      raw-data/yob2007.txt:Mariadelrosari,F,6\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackerb9%2Fssa-baby-names","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhackerb9%2Fssa-baby-names","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackerb9%2Fssa-baby-names/lists"}