https://github.com/nguyenda18/portland-jail-data-crawler
Scraper used for recording changes to the Portland jail database.
- Host: GitHub
- URL: https://github.com/nguyenda18/portland-jail-data-crawler
- Owner: NguyenDa18
- Created: 2022-07-20T05:17:34.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2025-04-23T03:43:31.000Z (28 days ago)
- Last Synced: 2025-04-23T04:11:54.639Z (28 days ago)
- Topics: dataframe, datasette, python, python3, scrapy, scrapy-crawler, scrapy-spider
- Language: Jupyter Notebook
- Size: 37.8 MB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Multnomah County Jail Crawler
[![CI](https://github.com/NguyenDa18/PDX-Jail-Data-Crawler/actions/workflows/main.yml/badge.svg)](https://github.com/NguyenDa18/PDX-Jail-Data-Crawler/actions/workflows/main.yml)

## Purpose
Crawl the bookings in the PDX jail database for data analysis and data-transparency purposes. Data files are updated by scheduled jobs, courtesy of GitHub Actions.
- Visit the Multnomah County online inmate data website, using the URL that lists all inmates in custody: [Link](https://apps.mcso.us/PAID/Home/SearchResults) (see the spider sketch after this list)
- Scrape inmate names and booking dates, and update the `csvs/inmate_bookings.csv` file
- Visit each inmate's link and update `csvs/inmate_details.csv` with inmate details and totals for each type of charge against them
- Update `csvs/inmate_charges.csv` with the list of charges for all inmates
- Update JSON files in the `counts` folder with daily counts for each category
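
To make the flow above concrete, here is a minimal Scrapy spider sketch. The CSS selectors and field names are illustrative assumptions, not the repo's actual code; the project's real spider lives at `inmates_spider/inmates_spider/spiders/inmates.py`.

```python
# A minimal sketch of the crawl described above. The CSS selectors and
# field names are assumptions made for illustration only.
import scrapy

class InmatesSpider(scrapy.Spider):
    name = "inmates"
    start_urls = ["https://apps.mcso.us/PAID/Home/SearchResults"]

    def parse(self, response):
        # Each table row on the results page lists one inmate in custody.
        for row in response.css("table tr"):
            detail_url = row.css("a::attr(href)").get()
            if not detail_url:
                continue  # skip header/empty rows
            yield response.follow(
                detail_url,
                callback=self.parse_inmate,
                cb_kwargs={
                    "name": row.css("a::text").get(default="").strip(),
                    "booking_date": row.css("td:last-child::text").get(default="").strip(),
                },
            )

    def parse_inmate(self, response, name, booking_date):
        # The real spider also collects inmate details and per-charge totals here.
        yield {"name": name, "booking_date": booking_date, "detail_url": response.url}
```

Scrapy schedules the follow-up requests concurrently, which is presumably part of the "Optimizing crawling" enhancement noted below.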
### Scraper Details
- Located at `inmates_spider/inmates_spider/spiders/inmates.py`
- Generates a DataFrame of inmates and booking dates and updates `csvs/inmate_bookings.csv`, sorted by booking date in descending order (see the pandas sketch after this list)
- Follows each inmate's URL, generates metadata for each inmate, and updates the `inmates_charges` MongoDB database with charge-totals data
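
A hedged sketch of the CSV update step described above, assuming hypothetical column names `name` and `booking_date`:

```python
# Merge freshly scraped rows into the bookings CSV, newest first.
# Column names here are assumptions, not the repo's actual schema.
import pandas as pd

def update_bookings(new_rows, path="csvs/inmate_bookings.csv"):
    new_df = pd.DataFrame(new_rows)
    try:
        combined = pd.concat([pd.read_csv(path), new_df], ignore_index=True)
    except FileNotFoundError:
        combined = new_df  # first run: no existing CSV yet
    # Drop rows already recorded on a previous run.
    combined = combined.drop_duplicates(subset=["name", "booking_date"])
    # Sort by booking date in descending order, as described above.
    combined["booking_date"] = pd.to_datetime(combined["booking_date"])
    combined = combined.sort_values("booking_date", ascending=False)
    combined.to_csv(path, index=False)
```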
## Using
- BeautifulSoup
- Pandas
- GitHub Actions (for the cron job that runs the scraper)
- MongoDB (via the pymongo Python package; an upsert sketch follows this list)
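
A hedged sketch of the MongoDB step: the connection string, collection name, and document shape are all assumptions made for illustration.

```python
# Upsert per-inmate charge totals into the inmates_charges database.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
charges = client["inmates_charges"]["charge_totals"]  # assumed collection name

def upsert_charge_totals(inmate_name, totals):
    """Store one document per inmate, e.g. totals={"felony": 2, "misdemeanor": 1}."""
    charges.update_one(
        {"name": inmate_name},        # match on inmate name
        {"$set": {"totals": totals}},
        upsert=True,                  # insert if the inmate is not yet recorded
    )
```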
## Enhancements
- [X] Storing data in a database
- [X] Optimizing crawling
- [X] Using Scrapy Spider instead of BeautifulSoup
- [ ] Creating UI for viewing data
- [ ] Send a notification when a "red flag" is released
## Running It Yourself
**Prerequisite**: Python 3 must be installed.
1. Clone the repo
2. Create and activate a virtual environment:
```
python3 -m venv venv  # create the venv first if it does not already exist
source venv/bin/activate
```
3. Install dependencies into the virtual environment:
```
pip install -r requirements.txt
```
4. The easiest way to experiment is with Jupyter Notebook:
```
jupyter notebook
```
Then run the experimental code in `Sandbox Notebook.ipynb`.
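
If you would rather run the spider directly than through the notebook, Scrapy projects are typically run with `scrapy crawl <spider-name>` from inside the `inmates_spider` project directory; the spider's name is defined in `inmates.py` (the sketches above assume it is `inmates`).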