https://github.com/aramshiva/babies
👶 A parser for every name listed on a Social Security Card between 1880-2023
babies data datagov db graphs mysql names social-security social-security-data sql statistics stats
Last synced: 10 months ago
- Host: GitHub
- URL: https://github.com/aramshiva/babies
- Owner: aramshiva
- Archived: true
- Created: 2024-06-12T05:26:21.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-17T04:01:57.000Z (about 1 year ago)
- Last Synced: 2025-08-19T21:54:17.183Z (10 months ago)
- Topics: babies, data, datagov, db, graphs, mysql, names, social-security, social-security-data, sql, statistics, stats
- Language: Python
- Homepage:
- Size: 8.39 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
> [!WARNING]
> As of March 16th 2025, this repo is no longer maintained. It has been merged into the [`names` repo](https://github.com/aramshiva/names), in the `sql` folder.
> [!NOTE]
> This does **not** include any social security numbers. The only data stored is the name, frequency, sex, and year of birth.
> This **is** public data provided by the Social Security Administration.
# Babies
### A parser for every name listed on a Social Security card between 1880 and 2023.
*(Tabulated based on Social Security records as of March 3, 2024)*
Your first question is probably "why?" To that I ask: why not?
This data is pulled from the [US Social Security Administration's Baby Names from Social Security Card Applications - National Dataset](https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data).
This script will insert the data into a MySQL database with the following schema:
```
name VARCHAR(255),
sex CHAR(1),
amount INT,
year INT
```
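As a rough illustration of that schema, here is a minimal sketch that creates a table with the same four columns and inserts a couple of rows. It makes several assumptions: the table name `names` is hypothetical (the schema above only lists the columns), the rows are made-up sample values rather than real SSA counts, and Python's built-in `sqlite3` is swapped in for MySQL so the example runs without a database server. With a MySQL driver the placeholders would typically be `%s` instead of `?`.

```python
import sqlite3

# Hypothetical table name; the README only specifies the four columns.
TABLE = "names"


def create_table(conn):
    """Create a table with the four columns from the schema above."""
    conn.execute(
        f"""
        CREATE TABLE IF NOT EXISTS {TABLE} (
            name VARCHAR(255),
            sex CHAR(1),
            amount INT,
            year INT
        )
        """
    )


def insert_rows(conn, rows):
    """Insert (name, sex, amount, year) tuples in one batch."""
    conn.executemany(
        f"INSERT INTO {TABLE} (name, sex, amount, year) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()


# sqlite3 in-memory database as a stand-in for MySQL (an assumption).
conn = sqlite3.connect(":memory:")
create_table(conn)

# Made-up sample rows for illustration only.
insert_rows(conn, [("Alice", "F", 12, 1900), ("Bob", "M", 7, 1900)])

# Total occurrences recorded for one year.
total = conn.execute(
    f"SELECT SUM(amount) FROM {TABLE} WHERE year = ?", (1900,)
).fetchone()[0]
print(total)
```

Batching the inserts with `executemany` is one reasonable choice for a dataset of roughly two million rows, though the actual script may load the data differently.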
### Some things to note:
- As of 2024 there are around 2,117,219 rows in the database.
- The data is stored in a folder called "names" in the same directory as this script.
- To protect privacy, the SSA only lists names with at least 5 occurrences for a given sex and year, so rarer names do not appear in the data.
- The sex is a single character, either "M" or "F" for Male or Female.
- The year is the year the person was born, NOT registered.
- The raw data is a folder. For each year of birth YYYY after 1879, the SSA provides a comma-delimited file called `yobYYYY.txt`.
Each record in the individual annual files has the format `name,sex,number`, where name is 2 to 15
characters, sex is M (male) or F (female), and number is the number of occurrences of the name.
Each file is sorted first on sex and then on number of occurrences in descending order. When there is
a tie on the number of occurrences, names are listed in alphabetical order. This sorting makes it easy to
determine a name's rank: the first record for each sex has rank 1, the second record for each sex has
rank 2, and so forth.
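
The file format and ranking rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in `main.py`: the function names are hypothetical, and the sample records use made-up counts. It takes the year from the `yobYYYY.txt` filename and derives a name's rank from its position among records of the same sex, relying on the sort order the files already have.

```python
import csv
import io
import re


def year_from_filename(filename):
    """Extract YYYY from a filename of the form yobYYYY.txt."""
    match = re.fullmatch(r"yob(\d{4})\.txt", filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return int(match.group(1))


def parse_year_file(filename, text):
    """Turn 'name,sex,number' records into (name, sex, amount, year) tuples."""
    year = year_from_filename(filename)
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        name, sex, number = record[:3]
        rows.append((name, sex, int(number), year))
    return rows


def rank_of(rows, name, sex):
    """Return the 1-based position of a name among records of the same sex.

    Assumes rows keep the file's order (descending count, ties alphabetical).
    """
    same_sex = [row for row in rows if row[1] == sex]
    for position, row in enumerate(same_sex, start=1):
        if row[0] == name:
            return position
    return None  # name not listed for that sex


# Made-up sample content in the documented format (not real SSA counts).
sample = "Mary,F,20\nAnna,F,10\nJohn,M,15\nWilliam,M,15\n"
rows = parse_year_file("yob1880.txt", sample)
print(rank_of(rows, "William", "M"))
```

Because rank here is just the position in the file, two names tied on count still get different ranks, with the alphabetically earlier name ranked first.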
### Want to run it yourself?
- Fill in the `.env` (use `.env.example` as a guide)
- Run `python3 main.py`
- Boom! Your MySQL database is now populated with the data, in a table with 4 columns: `name, sex, amount, year`