https://github.com/fsouza99/american-baby-names

An exploration of public data about names given to American newborns, seeking to answer some questions through PostgreSQL's capabilities.

data-science plpgsql postgresql public-data sql

Host: GitHub
URL: https://github.com/fsouza99/american-baby-names
Owner: fsouza99
Created: 2024-03-31T15:50:52.000Z (10 months ago)
Default Branch: main
Last Pushed: 2024-03-31T16:03:17.000Z (10 months ago)
Last Synced: 2024-11-18T05:39:13.584Z (about 2 months ago)
Topics: data-science, plpgsql, postgresql, public-data, sql
Language: PLpgSQL
Homepage:
Size: 500 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

## American Baby Names

### Intro

This project uses PostgreSQL to analyze a dataset of American baby names, seeking the following information:

- *Names with high usage for over 100 years.*
- *Trends in different time periods.*
- *The top 10 male/female names.*
- *The most popular male/female name starting with a certain letter since some year.*
- *The most popular male/female names by year.*
- *The most popular male/female name for the largest number of years.*

The idea came from a DataCamp [article](https://www.datacamp.com/blog/sql-projects-for-all-levels) about good projects for practicing PostgreSQL.

### Dataset

The dataset was obtained from the Social Security Administration's [website](https://www.ssa.gov/oact/babynames/limits.html) on March 23, 2024.

The downloaded archive contains one text file per year from 1880 to 2022. Each file has a CSV-like layout with three columns (name, sex, and number of occurrences) and no header row:

Mary,F,7065
Anna,F,2604
Emma,F,2003
...

## Project

### A Python helper

We run some preparatory Python routines before getting into SQL. They are all available in the *helper.py* script.

As noted above, the dataset gives us one text file for every year in the 1880-2022 period. We put everything into a single CSV file by running the *assemble()* function.
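
The merge step can be pictured with a minimal sketch. This is not the actual *assemble()* from *helper.py*: the signature, the `yobYYYY.txt` file-name pattern, and the output column order (year, name, sex, count) are all assumptions made for illustration.

```python
import csv
import re
from pathlib import Path

# Assumption: yearly files are named like "yob1880.txt"; adjust the pattern otherwise.
YEAR_FILE = re.compile(r"yob(\d{4})\.txt$")


def assemble(raw_dir, out_path):
    """Merge every yearly file in raw_dir into one CSV with a leading year column.

    Hypothetical sketch; returns the number of rows written.
    """
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in sorted(Path(raw_dir).iterdir()):
            match = YEAR_FILE.search(path.name)
            if match is None:
                continue  # skip anything that is not a yearly file
            year = int(match.group(1))
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.reader(src):
                    if not row:
                        continue  # tolerate blank lines
                    name, sex, count = row
                    writer.writerow([year, name, sex, int(count)])
                    written += 1
    return written
```

Adding the year as a column is what lets a single table hold all files, since the year is otherwise only encoded in each file's name.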

Since we're investigating the most frequently used American names, the least used ones can be discarded without harming our goals. This undersampling can be done with either of two functions defined in *helper.py*.
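
The two functions themselves are not reproduced here, so the following is only a sketch of two plausible undersampling strategies, not the ones *helper.py* necessarily implements. The function names and the `(year, name, sex, count)` row layout are hypothetical.

```python
from collections import defaultdict


def keep_min_count(rows, min_count):
    """Strategy 1 (assumed): keep rows whose yearly count reaches min_count."""
    return [row for row in rows if row[3] >= min_count]


def keep_top_per_year(rows, top_n):
    """Strategy 2 (assumed): keep the top_n most used names per (year, sex)."""
    groups = defaultdict(list)
    for row in rows:
        year, _name, sex, _count = row
        groups[(year, sex)].append(row)
    kept = []
    for key in sorted(groups):
        ranked = sorted(groups[key], key=lambda r: r[3], reverse=True)
        kept.extend(ranked[:top_n])
    return kept
```

An absolute threshold keeps more rows in years with more births, while a per-year cutoff keeps the sample size constant across years; which one suits the analysis depends on the question being asked.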

Only the data used in the tests below was uploaded to the *data* folder. If you download the complete dataset and want to run *helper.py*, first place all of its files in *data/raw*.

### Creating database objects

Once the data for the study has been defined, we create the database by running *createdb*, a PostgreSQL command:

> createdb american_names

After connecting to the database from the project's root directory, we move into the *sql* folder and run the available scripts to create the database objects:

> american_names=# \cd sql
> american_names=# \i creation.sql
CREATE TYPE
CREATE TABLE
COPY 104189
> american_names=# \i analysis.sql
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION

The 104,189 copied rows come from *data/rel_common_names.csv*, generated by *helper.py*.

### Answering questions about the dataset

The functions in *sql/analysis.sql* offer answers to our previous questions about the data.

#### *Names with high usage for over 100 years*

We can see that 45 names had at least 1000 occurrences in every year between 1922 and 2022:

> american_names=# select count(*) from query_recurrent_names(1000, 1922::smallint, 2022::smallint);
count
-------
45
(1 row)

Five of these names are:

> american_names=# select * from query_recurrent_names(1000, 1922::smallint, 2022::smallint) limit 5;
recurrent_name
----------------
Andrew
Anna
Anthony
Benjamin
Calvin
(5 rows)
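
The condition behind this query ("at least N occurrences in every year of a range") can be restated in a few lines of Python. This is a sketch of the logic only, not the body of *query_recurrent_names*; in particular, summing male and female occurrences of the same name is an assumption, as is the `(year, name, sex, count)` row layout.

```python
from collections import defaultdict


def recurrent_names(rows, min_count, first_year, last_year):
    """Names with at least min_count occurrences in every year of the range."""
    totals = defaultdict(int)  # (name, year) -> occurrences, both sexes summed (assumed)
    for year, name, _sex, count in rows:
        if first_year <= year <= last_year:
            totals[(name, year)] += count
    qualifying_years = defaultdict(int)  # name -> number of years reaching min_count
    for (name, _year), total in totals.items():
        if total >= min_count:
            qualifying_years[name] += 1
    span = last_year - first_year + 1
    return sorted(n for n, years in qualifying_years.items() if years == span)
```

A name qualifies only when its number of qualifying years equals the length of the range, so a single weak or missing year is enough to exclude it.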

#### *Trends in different time periods*

We can query the names with the highest linear growth over a time period. So, how about the 21st century?

Let's see the 2011-2020 decade:

This function considers only the names that appear in every year of the specified time period.
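
One way to read "linear growth" is as the least-squares slope of yearly count against year, which is the quantity PostgreSQL's `regr_slope` aggregate computes. The sketch below assumes that interpretation; the function names, the `(year, name, sex, count)` row layout, and the choice to track each (name, sex) pair separately are hypothetical and not taken from *analysis.sql*.

```python
from collections import defaultdict


def growth_slope(yearly_counts):
    """Least-squares slope of count versus year, in occurrences per year."""
    n = len(yearly_counts)
    if n < 2:
        raise ValueError("need at least two years to fit a line")
    years = list(yearly_counts)
    mean_x = sum(years) / n
    mean_y = sum(yearly_counts.values()) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (yearly_counts[x] - mean_y) for x in years)
    return sxy / sxx


def fastest_growing(rows, first_year, last_year, limit=5):
    """Rank (name, sex) pairs by slope, keeping only those present every year."""
    series = defaultdict(dict)  # (name, sex) -> {year: count}
    for year, name, sex, count in rows:
        if first_year <= year <= last_year:
            series[(name, sex)][year] = count
    span = last_year - first_year + 1
    slopes = {
        key: growth_slope(counts)
        for key, counts in series.items()
        if len(counts) == span  # drop names missing from any year of the period
    }
    return sorted(slopes.items(), key=lambda item: item[1], reverse=True)[:limit]
```

Requiring a value for every year keeps the fitted slopes comparable, since a name with gaps would otherwise be fitted on fewer points than the rest.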

#### *The top 10 male/female names*

Top 10 male names:

The female counterpart can be obtained by running the same function with 'F' as the argument instead of 'M'.
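
As an illustration of the ranking involved, here is a small Python sketch. It assumes the top 10 is ordered by total occurrences over all years in the table, which may differ from what the SQL function does; the function name and the `(year, name, sex, count)` row layout are hypothetical.

```python
from collections import Counter


def top_names(rows, sex, limit=10):
    """Names of the given sex ranked by total occurrences (assumed criterion)."""
    totals = Counter()
    for _year, name, row_sex, count in rows:
        if row_sex == sex:
            totals[name] += count
    return totals.most_common(limit)
```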

#### *The number of male/female names starting with some letter since any year*

The number of male names starting with "Y" in the 21st century:

> american_names=# select * from special_start_count('M', 'Y', 2001::smallint);
special_start_count
---------------------
52
(1 row)
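
The same count can be expressed in Python for comparison. The argument order mirrors the call above, but the body is a guess at the logic rather than a port of the SQL: it assumes the result is the number of distinct names, not the number of rows or babies, and that rows look like `(year, name, sex, count)`.

```python
def special_start_count(rows, sex, letter, since_year):
    """Count distinct names of one sex starting with letter from since_year on."""
    prefix = letter.upper()
    names = {
        name
        for year, name, row_sex, _count in rows
        if row_sex == sex and year >= since_year and name.upper().startswith(prefix)
    }
    return len(names)
```

Collecting names in a set means a name used in several years is counted once.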

#### *The most popular male/female names by year*

The most popular female names in the 1980s:

Amongst the male names, Michael dominated the entire decade.
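
Finding each year's leader amounts to a per-year maximum. A minimal Python sketch follows, with a hypothetical function name and the assumed `(year, name, sex, count)` row layout; when two names tie, the one seen first is kept, which is an arbitrary choice.

```python
def yearly_leaders(rows, sex, first_year, last_year):
    """Map each year in the range to its most used name of the given sex."""
    best = {}  # year -> (name, count) of the current leader
    for year, name, row_sex, count in rows:
        if row_sex != sex or not (first_year <= year <= last_year):
            continue
        if year not in best or count > best[year][1]:
            best[year] = (name, count)
    return {year: best[year][0] for year in sorted(best)}
```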

#### *The most popular male/female name for the largest number of years*

John and Michael are the male names that most often appeared as annual leaders:

And Mary is by far the female champion by this metric: