https://github.com/titaniumbones/download-exec-orders
- Host: GitHub
- URL: https://github.com/titaniumbones/download-exec-orders
- Owner: titaniumbones
- Created: 2025-01-30T00:32:30.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-06T20:22:38.000Z (over 1 year ago)
- Last Synced: 2025-04-04T22:11:35.447Z (about 1 year ago)
- Language: Python
- Size: 127 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
# Federal Register Executive Orders Text Extractor
This script downloads executive orders from the Federal Register, extracts their text content, and creates a consolidated JSON file containing both the metadata and full text of each order.
## Description
The main script (`download-and-update-json.py`) performs the following operations:
1. Downloads a JSON file containing metadata about executive orders from the Federal Register API
2. Downloads PDF versions of each executive order
3. Extracts text content from the PDFs
4. Creates a new JSON file that includes both the original metadata and extracted text content
5. Saves individual text files for each executive order
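Step 4 above can be sketched as a small merge function. This is a minimal illustration rather than the script's actual code: it assumes the metadata JSON holds its documents under a `results` list keyed by `document_number`, and the `text` key and the `attach_text` and `save_json` helpers are hypothetical names chosen for the example.

```python
import json


def attach_text(metadata, texts, text_key="text"):
    """Return a copy of the metadata with extracted text added to each document.

    `texts` maps document numbers to extracted text. Documents without
    extracted text get None. Key names are assumptions for illustration.
    """
    enriched = []
    for doc in metadata.get("results", []):
        merged = dict(doc)  # shallow copy so the original metadata is untouched
        merged[text_key] = texts.get(doc.get("document_number"))
        enriched.append(merged)
    return {**metadata, "results": enriched}


def save_json(data, path):
    """Write the data to disk as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, ensure_ascii=False, indent=2)
```

Copying each record keeps the original metadata intact, which matches the way the download script keeps `documents.json` alongside `documents_with_text.json`.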
Because there is a delay between the signing of an order and its publication in the Federal Register, it can be convenient to scrape the orders from another source. [The American Presidency Project](https://www.presidency.ucsb.edu/documents/app-categories/written-presidential-orders/presidential/executive-orders?items_per_page=40&field_docs_start_date_time_value[value][date]=2025) posts Executive Orders to its website quickly and provides a convenient listing. `alternative-scraper.py` crawls that listing and scrapes each Executive Order, storing the results in a simple, ad hoc JSON schema. The download script is preferred for projects that depend on authoritative data sources; the scraper is useful for timely or urgent analysis and response. The scraper performs the following tasks:
1. Find links to individual EOs
2. Crawl to and parse the individual pages
3. Record results in `documents_scraped.json`
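The first scraper task, finding links to individual EOs on the listing page, might look roughly like the sketch below. The project lists `beautifulsoup4` as its HTML dependency; this example swaps in the standard library's `html.parser` so it runs without extra packages. The `/documents/executive-order` link prefix and the `LinkCollector` and `find_order_links` names are assumptions made for illustration, not taken from the scraper itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.presidency.ucsb.edu"


class LinkCollector(HTMLParser):
    """Collect unique absolute URLs from <a> tags whose href starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix):
            url = urljoin(BASE_URL, href)
            if url not in self.links:  # keep first occurrence, preserve order
                self.links.append(url)


def find_order_links(listing_html, prefix="/documents/executive-order"):
    """Return EO page URLs found in the listing HTML (prefix is an assumption)."""
    collector = LinkCollector(prefix)
    collector.feed(listing_html)
    return collector.links
```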
## Prerequisites
- Python 3.6 or higher
- pip (Python package installer)
## Installation
1. Clone this repository or download the files
2. Install the required packages:
```bash
pip install -r requirements.txt
```
## Project Structure
```
exec-orders/
├── alternative-scraper.py
├── download-and-update-json.py
├── requirements.txt
├── pdfs/                         # Created during execution
├── texts/                        # Created during execution
├── documents.json                # Created during execution
├── documents_with_text.json      # Created during execution
└── executive_orders_2025.json    # Created during execution of scraper
```
## Usage
Run the download script from the command line:
```bash
python download-and-update-json.py
```
Or alternatively run the scraper:
```bash
python alternative-scraper.py
```
## Output
The `download-and-update-json.py` script creates several directories and files:
- `pdfs/`: Contains downloaded PDF files of executive orders
- `texts/`: Contains extracted text files (one per executive order)
- `documents.json`: Original JSON data from the Federal Register API
- `documents_with_text.json`: Enhanced JSON file including extracted text content
`alternative-scraper.py`, by contrast, creates only the final output file:
- `documents_scraped.json`: simple JSON file recording date, president, title, and text content
## File Naming Convention
- PDF files are named using the document number: `{document_number}.pdf`
- Text files are named using both document number and title: `{document_number}_{title}.txt`
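As an illustration of the two naming patterns, the sketch below builds both paths. The README only states the patterns themselves; replacing unsafe characters with underscores and truncating long titles are assumptions added so the example yields valid filenames, and `pdf_path` and `text_path` are hypothetical helper names.

```python
import re
from pathlib import Path


def pdf_path(document_number, pdf_dir="pdfs"):
    """Build the PDF path: {document_number}.pdf"""
    return Path(pdf_dir) / f"{document_number}.pdf"


def text_path(document_number, title, text_dir="texts", max_title_len=80):
    """Build the text path: {document_number}_{title}.txt

    Runs of characters other than letters, digits, underscores, and hyphens
    are replaced with a single underscore (an assumed sanitization rule).
    """
    safe_title = re.sub(r"[^\w\-]+", "_", title).strip("_")[:max_title_len]
    return Path(text_dir) / f"{document_number}_{safe_title}.txt"
```

Including the title makes the text files easier to browse by hand, while the document number keeps the names unique.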
## Error Handling
The download script includes error handling for:
- Failed JSON downloads
- Failed PDF downloads
- Failed text extraction
- Failed file writing operations
If any individual document fails to process, the script will continue with the remaining documents.
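The continue-on-failure behavior can be sketched as a loop that catches errors per document and keeps simple counters. This is a hedged illustration, not the script's real implementation: `process_all` is a hypothetical name, and `process_one` stands in for whatever downloads one PDF, extracts its text, and returns the number of pages read.

```python
def process_all(documents, process_one):
    """Process each document, skipping failures, and return summary counters.

    `process_one(doc)` is assumed to return the number of pages processed
    and to raise an exception if download, extraction, or writing fails.
    """
    summary = {"total": 0, "succeeded": 0, "failed": 0, "pages": 0}
    for doc in documents:
        summary["total"] += 1
        try:
            summary["pages"] += process_one(doc)
            summary["succeeded"] += 1
        except Exception as exc:  # one bad document must not stop the run
            summary["failed"] += 1
            print(f"Skipping {doc.get('document_number', '?')}: {exc}")
    return summary
```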
*TODO*: document error handling in scraper.
## Dependencies
- `requests`: For downloading JSON and PDF files
- `PyPDF2`: For extracting text from PDF files
- `beautifulsoup4`: For scraping content from HTML pages
See `requirements.txt` for specific version requirements.
## Success Metrics
The download script provides a summary of:
- Total PDFs processed
- Successfully processed PDFs
- Failed PDFs
- Total pages processed
## Note
The download script is specifically designed to work with the Federal Register API and assumes a specific format for the input JSON data. The target URL is configured to fetch executive orders signed by Donald Trump starting on January 20, 2025.
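To retarget the download script, the query URL would need to change. The sketch below shows one way such a URL could be assembled. The parameter names reflect my understanding of the public Federal Register API and should be checked against its documentation; the URL actually configured in the script may use different parameters, and `build_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

API_BASE = "https://www.federalregister.gov/api/v1/documents.json"


def build_query_url(president="donald-trump",
                    signed_on_or_after="2025-01-20",
                    per_page=1000):
    """Build a query URL for executive orders (parameter names are assumptions)."""
    params = {
        "conditions[presidential_document_type]": "executive_order",
        "conditions[president]": president,
        "conditions[signing_date][gte]": signed_on_or_after,
        "per_page": per_page,
    }
    # urlencode percent-encodes the square brackets in the parameter names
    return f"{API_BASE}?{urlencode(params)}"
```

Changing the `president` or `signed_on_or_after` arguments would, under these assumptions, point the same pipeline at a different set of orders.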