https://github.com/paceaux/mechon-mamre-scraper
Python scraper that converts mechon-mamre.org into JSON
- Host: GitHub
- URL: https://github.com/paceaux/mechon-mamre-scraper
- Owner: paceaux
- Created: 2019-08-14T14:07:40.000Z
- Default Branch: master
- Last Pushed: 2019-08-14T14:08:17.000Z
- Last Synced: 2024-05-02T00:40:25.835Z
- Language: Python
- Size: 6.84 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Mechon-Mamre HTML to JSON Converter
A utility to convert [Mechon-Mamre](https://www.mechon-mamre.org/p/pt/pt0.htm) content from HTML to JSON, useful for building an API. **This project is not endorsed by Mechon-Mamre**.
## Prerequisites
- **Python 3.x**
- **Beautiful Soup** and **Requests** - for parsing HTML content and making HTTP requests. Install both via `pip`:
```bash
pip install beautifulsoup4 requests
```
## Command Line Usage
### 1. Convert a Single Book to JSON
To create a JSON file for a single book:
```bash
python bookScraper.py -u https://mechon-mamre.org/p/pt/pt0101.htm
```
This command finds all chapters in the specified book and generates a single JSON file containing the book's content.
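As a rough illustration of the chapter-discovery step, the sketch below collects chapter links from a book page using only the standard library. It is not the code in `bookScraper.py`: the names `LinkCollector` and `find_chapter_urls` are hypothetical, and it assumes that chapter pages of one book share the first four characters of the file name (for example `pt01` in `pt0101.htm`), which may not hold for every book.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_chapter_urls(book_url, html):
    """Return absolute URLs of links that appear to belong to the same book.

    Assumption (not confirmed by the project): chapter files share the first
    four characters of the book page's file name, e.g. "pt01".
    """
    book_prefix = book_url.rsplit("/", 1)[-1][:4]
    collector = LinkCollector()
    collector.feed(html)

    urls = []
    for href in collector.hrefs:
        file_name = href.rsplit("/", 1)[-1]
        if file_name.startswith(book_prefix) and file_name.endswith(".htm"):
            absolute = urljoin(book_url, href)
            if absolute not in urls:  # keep first occurrence, preserve order
                urls.append(absolute)
    return urls
```

Each returned URL could then be fetched and parsed in turn, with the results merged into one JSON document for the book.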
### 2. Generate a JSON List of All Books
To create a JSON file that lists all books in the Tanakh:
```bash
python tanakScraper.py -u https://www.mechon-mamre.org/p/pt/pt0.htm
```
### 3. Generate JSON Files for Selected or All Books from the Book List
To scrape books from the Tanakh JSON list and create individual JSON files:
- Use `-g` to specify the group (`torah`, `prophets`, or `writings`).
- Use `-b` to specify specific books (comma-separated).
- Use `-a` to scrape *all* books.
#### Scrape Specific Books
```bash
python scrapeAllBooks.py -g prophets -b Zephaniah,Haggai
```
This example scrapes and saves JSON files for *Zephaniah* and *Haggai* from the *prophets* group.
#### Scrape All Books
```bash
python scrapeAllBooks.py -g writings -a
```
This command scrapes and saves JSON files for *all books* in the *writings* group.
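The flags described in this section could be parsed along the following lines. This is a minimal sketch with `argparse`, not the parser used by `scrapeAllBooks.py`: the function names and the `dest` attribute names are hypothetical, and treating `-g` as required and `-b`/`-a` as mutually exclusive is an assumption based on the two examples above.

```python
import argparse

GROUPS = ["torah", "prophets", "writings"]


def parse_book_list(value):
    """Split a comma-separated string such as "Zephaniah,Haggai" into names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def build_parser():
    parser = argparse.ArgumentParser(
        description="Scrape books from the Tanakh JSON list (illustrative sketch)."
    )
    # Assumption: a group is always required.
    parser.add_argument("-g", dest="group", choices=GROUPS, required=True,
                        help="group to scrape: torah, prophets, or writings")
    # Assumption: exactly one of -b or -a must be given.
    selection = parser.add_mutually_exclusive_group(required=True)
    selection.add_argument("-b", dest="books", type=parse_book_list,
                           help="comma-separated list of book names")
    selection.add_argument("-a", dest="all_books", action="store_true",
                           help="scrape every book in the group")
    return parser
```

Calling `build_parser().parse_args(["-g", "prophets", "-b", "Zephaniah,Haggai"])` yields a namespace whose `books` attribute is a list of the two names, which a caller could use to filter the group's entries in the Tanakh JSON list.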
## File Structure
The script saves HTML files to a `data/html` directory to prevent re-downloading content on repeated runs. This caching speeds up the process and reduces unnecessary server requests.
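The caching behaviour described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's own code: the function names are hypothetical, the cache file is named after the last segment of the URL, and the standard library's `urllib` stands in for whatever HTTP client the scripts use.

```python
from pathlib import Path
from urllib.request import urlopen

CACHE_DIR = Path("data/html")  # directory named in this README


def download(url):
    """Fetch the raw bytes of a page over HTTP."""
    with urlopen(url) as response:
        return response.read()


def get_html(url, cache_dir=CACHE_DIR, fetch=download):
    """Return the page bytes, reading from the cache when a copy exists.

    Assumption: the cache file name is the last path segment of the URL,
    e.g. "pt0101.htm".
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached_file = cache_dir / url.rsplit("/", 1)[-1]
    if cached_file.exists():
        return cached_file.read_bytes()  # cache hit: no network request
    content = fetch(url)
    cached_file.write_bytes(content)  # cache miss: store for later runs
    return content
```

Storing raw bytes rather than decoded text leaves the choice of character encoding to the parsing step. Deleting a file from the cache directory forces a fresh download on the next run.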
## Important Notes
- **Copyright**: Mechon-Mamre states that their content is copyrighted with all rights reserved. This project aims to respect these rights, and permission has been sought to perform this scraping; however, no response has been received.
- **Use Responsibly**: This tool is intended for educational and non-commercial use. Please ensure your usage aligns with Mechon-Mamre’s terms.
---
**Disclaimer**: This utility is independently created and is not affiliated with or endorsed by Mechon-Mamre.