https://github.com/nguyenda18/portland-jail-data-crawler
Scraper used for recording changes to the Portland jail database.
- Host: GitHub
- URL: https://github.com/nguyenda18/portland-jail-data-crawler
- Owner: NguyenDa18
- Created: 2022-07-20T05:17:34.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2025-04-23T03:43:31.000Z (28 days ago)
- Last Synced: 2025-04-23T04:11:54.639Z (28 days ago)
- Topics: dataframe, datasette, python, python3, scrapy, scrapy-crawler, scrapy-spider
- Language: Jupyter Notebook
- Size: 37.8 MB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Multnomah County Jail Crawler
[![CI](https://github.com/NguyenDa18/PDX-Jail-Data-Crawler/actions/workflows/main.yml/badge.svg)](https://github.com/NguyenDa18/PDX-Jail-Data-Crawler/actions/workflows/main.yml)

## Purpose
Crawl the bookings in the PDX jail database for data analysis and data-transparency purposes. Data files are updated by scheduled jobs, courtesy of GitHub Actions.
- Visit the Multnomah County online inmate data website, using the URL that lists all inmates in custody: [Link](https://apps.mcso.us/PAID/Home/SearchResults) (see the spider sketch after this list)
- Scrape inmate names and booking dates, and update the `csvs/inmate_bookings.csv` file
- Visit each inmate's link and update `csvs/inmate_details.csv` with inmate details and totals for each type of charge against them
- Update `csvs/inmate_charges.csv` with the list of charges for all inmates
- Update JSON files in the `counts` folder with daily counts for each category
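
To make the flow above concrete, here is a minimal Scrapy spider sketch. The CSS selectors and field names are illustrative assumptions, not the repo's actual code; the project's real spider lives at `inmates_spider/inmates_spider/spiders/inmates.py`.

```python
# A minimal sketch of the crawl described above. The CSS selectors and
# field names are assumptions made for illustration only.
import scrapy

class InmatesSpider(scrapy.Spider):
    name = "inmates"
    start_urls = ["https://apps.mcso.us/PAID/Home/SearchResults"]

    def parse(self, response):
        # Each table row on the results page lists one inmate in custody.
        for row in response.css("table tr"):
            detail_url = row.css("a::attr(href)").get()
            if not detail_url:
                continue  # skip header/empty rows
            yield response.follow(
                detail_url,
                callback=self.parse_inmate,
                cb_kwargs={
                    "name": row.css("a::text").get(default="").strip(),
                    "booking_date": row.css("td:last-child::text").get(default="").strip(),
                },
            )

    def parse_inmate(self, response, name, booking_date):
        # The real spider also collects inmate details and per-charge totals here.
        yield {"name": name, "booking_date": booking_date, "detail_url": response.url}
```

Scrapy schedules the follow-up requests concurrently, which is presumably part of the "Optimizing crawling" enhancement noted below.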
### Scraper Details
- Located at `inmates_spider/inmates_spider/spiders/inmates.py`
- Generates a DataFrame of inmates and booking dates and updates `csvs/inmate_bookings.csv`, sorted by booking date in descending order (see the pandas sketch after this list)
- Follows each inmate's URL, generates metadata for each inmate, and updates the `inmates_charges` MongoDB database with charge-totals data
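
A hedged sketch of the CSV update step described above, assuming hypothetical column names `name` and `booking_date`:

```python
# Merge freshly scraped rows into the bookings CSV, newest first.
# Column names here are assumptions, not the repo's actual schema.
import pandas as pd

def update_bookings(new_rows, path="csvs/inmate_bookings.csv"):
    new_df = pd.DataFrame(new_rows)
    try:
        combined = pd.concat([pd.read_csv(path), new_df], ignore_index=True)
    except FileNotFoundError:
        combined = new_df  # first run: no existing CSV yet
    # Drop rows already recorded on a previous run.
    combined = combined.drop_duplicates(subset=["name", "booking_date"])
    # Sort by booking date in descending order, as described above.
    combined["booking_date"] = pd.to_datetime(combined["booking_date"])
    combined = combined.sort_values("booking_date", ascending=False)
    combined.to_csv(path, index=False)
```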
## Using
- BeautifulSoup
- Pandas
- GitHub Actions (for the cron job that runs the scraper)
- MongoDB (via the pymongo Python package; an upsert sketch follows this list)
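
A hedged sketch of the MongoDB step: the connection string, collection name, and document shape are all assumptions made for illustration.

```python
# Upsert per-inmate charge totals into the inmates_charges database.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
charges = client["inmates_charges"]["charge_totals"]  # assumed collection name

def upsert_charge_totals(inmate_name, totals):
    """Store one document per inmate, e.g. totals={"felony": 2, "misdemeanor": 1}."""
    charges.update_one(
        {"name": inmate_name},        # match on inmate name
        {"$set": {"totals": totals}},
        upsert=True,                  # insert if the inmate is not yet recorded
    )
```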
## Enhancements
- [X] Storing data in a database
- [X] Optimizing crawling
- [X] Using Scrapy Spider instead of BeautifulSoup
- [ ] Creating UI for viewing data
- [ ] Send a notification when a "red flag" is released
## Running It Yourself
**Prerequisite**: Python 3 must be installed.
1. Clone the repo
2. Create and activate a virtual environment:
```
python3 -m venv venv  # create the venv first if it does not already exist
source venv/bin/activate
```
3. Install dependencies into the virtual environment:
```
pip install -r requirements.txt
```
4. The easiest way to experiment is with Jupyter Notebook:
```
jupyter notebook
```
Then run the experimental code in `Sandbox Notebook.ipynb`.
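
If you would rather run the spider directly than through the notebook, Scrapy projects are typically run with `scrapy crawl <spider-name>` from inside the `inmates_spider` project directory; the spider's name is defined in `inmates.py` (the sketches above assume it is `inmates`).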