Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/soodoku/biocong
Biographical data on members of congress (105th --- 115th).
https://github.com/soodoku/biocong
biography congress
Last synced: 13 days ago
JSON representation
Biographical data on members of congress (105th --- 115th).
- Host: GitHub
- URL: https://github.com/soodoku/biocong
- Owner: soodoku
- Created: 2020-12-31T04:36:32.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2021-07-17T14:33:15.000Z (over 3 years ago)
- Last Synced: 2024-10-11T12:17:24.373Z (27 days ago)
- Topics: biography, congress
- Language: Jupyter Notebook
- Homepage:
- Size: 41.1 MB
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## Congressional Biographies
## 97th --- 104th Congress
We use text from the [pdfs](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NZPPJM) (downloaded from Google Books, from where these are freely available) and then parse the text.
### Scripts
1. [parse](scripts/parse-biocong-from-text-all.ipynb)
2. [clean](scripts/clean-biocong-from-text-all.ipynb)## 105th --- 115th Congress
We scrape congressional biographies for 105th to the 115th Congress from the [Congressional Directory](https://www.govinfo.gov/app/collection/cdir/). We download the biographical files, e.g., https://www.govinfo.gov/content/pkg/CDIR-2018-10-29/html/CDIR-2018-10-29-STATISTICALINFORMATION-2.htm and parse them to extract information such as birthdate, number of children, education, etc.
### Scripts
1. [Scrapes the Congressional Directory](scripts/biocong.ipynb) produces [biocong.csv](data/biocong.csv), [biocong-browsepath.csv](data/biocong-browsepath.csv), and [html files (tar.gz)](data/cong_bio_1997_2018.tar.gz)
2. [Download Congressional Biographies Using the API](scripts/biocong-api.ipynb) provides the script for downloading the data using the API. (It produces incomplete data so we don't use this script.)
3. [Parse](scripts/03_parse-biocong.ipynb) iterates through biocong-browsepath.csv and parses the [html files (tar.gz)](data/cong_bio_1997_2018.tar.gz) and produces [biocong-parsed.csv](data/biocong-parsed.csv)
4. [Clean](scripts/04_clean-biocong.ipynb) takes biocong-parsed.csv produces [biocong-cleaned.csv](data/biocong-cleaned.csv)### Data
The final dataset---[biocong-cleaned.csv](data/biocong-cleaned.csv)---has the following columns:
```
'level', 'docCount', 'browsePath', 'title', 'lastpage', 'granuleid', 'packageid', 'pdffile', 'pdf', 'text',
'agencyLevel', 'nodeStatus', 'textfile', 'htmlfile', 'browseline1', 'processingcode', 'nodetype', 'index.1',
'publishdate', 'part', 'forGpo', 'hasChildren', 'hasParents', 'rootNode', 'documentResults', 'hasDocumentResults',
'collectionCode', 'searchPath', 'isContentArea', 'pageSize', 'pageNumber', 'count', 'digitizedFR', 'section',
'firstpage', 'congress', 'biography', 'name', 'party', 'location', 'born_in', 'birthdate', 'education', 'professional',
'married', 'children', 'committees', 'url', 'n_children'
```