https://github.com/kdroidfilter/sefariaexport
An automated, reproducible pipeline to build Sefaria exports from a MongoDB dump using the official Sefaria-Project exporter, and publish the resulting archives as GitHub Releases.
- Host: GitHub
- URL: https://github.com/kdroidfilter/sefariaexport
- Owner: kdroidFilter
- License: agpl-3.0
- Created: 2025-11-06T00:23:39.000Z (8 months ago)
- Default Branch: master
- Last Pushed: 2026-02-01T23:00:43.000Z (5 months ago)
- Last Synced: 2026-02-01T23:18:07.547Z (5 months ago)
- Language: Shell
- Homepage:
- Size: 41.9 MB
- Stars: 18
- Watchers: 0
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
SefariaExport
==============
An automated, reproducible pipeline to build Sefaria exports from a MongoDB dump using the official Sefaria-Project exporter, and publish the resulting archives as GitHub Releases.
This repository is a collection of small, composable Bash and Python scripts that:
- Prepare a build environment (tools, Python, MongoDB Database Tools)
- Download a small sample MongoDB dump for quick end-to-end runs
- Clone the upstream `Sefaria-Project` repository and install its dependencies
- Restore the database, run the exporters, verify results
- Package, post-process, and split the archives
- Optionally create a GitHub Release and upload the generated assets
Contents
--------
- Top-level scripts `01_...` to `21_...`: each implements one step of the pipeline, and they are designed to be run sequentially.
- Supporting Python utilities:
  - `configure_local_settings.py`
  - `ensure_history_collection.py`
  - `run_exports.py`
  - `check_export_module.py`
- GitHub Actions workflow: `.github/workflows/release.yml` for CI-driven builds and releases.
Prerequisites (local)
---------------------
You can run the pipeline on Linux or macOS. The GitHub Actions workflow shows a fully automated reference run. For a local run, install or ensure access to:
- Bash and coreutils
- Python 3.9 (to mirror CI) with `pip`
- Git, curl, unzip, jq
- MongoDB Database Tools (for `mongorestore`)
- A running MongoDB instance on `localhost:27017`
  - Quick start with Docker: `docker run --rm -p 27017:27017 --name mongo mongo:7`
The scripts will attempt to install/prepare some tools automatically, but having the above ready smooths the process.
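Before starting, it can help to confirm which of the tools listed above are already on your `PATH`. The following is a minimal sketch, not part of the repository; the function name `missing_tools` is hypothetical.

```bash
# Hypothetical helper (not shipped with this repository):
# print the name of every given command that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example usage, with the tool list taken from the prerequisites above:
#   missing="$(missing_tools bash python3 git curl unzip jq mongorestore)"
#   [ -z "$missing" ] || echo "Missing tools: $missing" >&2
```

An empty result means every listed command was found; it does not check versions (for example, that `python3` is 3.9).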
Quick Start (local)
-------------------
The scripts are designed to be executed in order. A minimal local end-to-end run using the small sample dump looks like this:
1) Compute a timestamp used for naming artifacts
```
bash 01_compute_timestamp.sh
```
2) Install base tools (curl, jq, unzip, etc.)
```
bash 02_install_base_tools.sh
```
3) Install MongoDB Database Tools (mongorestore)
```
bash 03_install_mongo_tools.sh
```
4) Download a small MongoDB dump suitable for quick tests
```
bash 04_download_small_dump.sh
```
5) Clone the upstream Sefaria codebase
```
bash 05_clone_sefaria_project.sh
```
6) Install build dependencies and Python requirements
```
bash 06_install_build_deps.sh
bash 07_pip_install_requirements.sh
```
7) Fallback build for Google RE2 (only if needed by your environment)
```
bash 08_fallback_built_google_re2.sh
```
8) Prepare local project settings and export directories
```
bash 09_create_exports_dir.sh
bash 10_create_local_settings.sh
```
9) Ensure MongoDB is up, then restore the sample dump
```
bash 11_wait_for_mongodb.sh
bash 12_restore_db_from_dump.sh
```
10) Sanity-check exporter module, run exports, verify outputs
```
bash 13_check_export_module.sh
bash 14_run_exports.sh
bash 15_verify_exports.sh
```
11) (Optional) Drop the database to free space
```
bash 16_drop_db.sh
```
12) Build and post-process archives
```
bash 17_build_combined_archive.sh
# Optional content processing helpers:
bash 17a_remove_english_in_exports.sh
bash 17b_flatten_hebrew_in_exports.sh
bash 18_split_archive.sh
```
13) (Optional) Create a GitHub Release and upload assets
```
bash 19_ensure_gh_cli.sh
bash 20_create_or_update_release.sh
bash 21_upload_release_assets.sh
```
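Step 12 ends by splitting the combined archive into parts. The README does not state how `18_split_archive.sh` names or sizes the parts, so the following is only a generic sketch of splitting a file into fixed-size pieces and reassembling it; the function names and the `.part-` suffix are assumptions for illustration.

```bash
# Hypothetical helpers illustrating a generic split/reassemble round trip.
# The actual part naming used by 18_split_archive.sh may differ.

# Split FILE into SIZE-byte parts named FILE.part-aa, FILE.part-ab, ...
split_archive() {
  local file="$1" size="$2"
  split -b "$size" "$file" "$file.part-"
}

# Concatenate all parts sharing PREFIX, in name order, into OUT.
join_archive() {
  local prefix="$1" out="$2"
  cat "$prefix".part-* > "$out"
}

# Example usage (file names are placeholders):
#   split_archive combined.tar.zst 1900m
#   join_archive combined.tar.zst restored.tar.zst
```

Because `split` assigns suffixes in lexical order, a plain `cat` over the glob restores the original byte sequence.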
Notes
- The scripts are idempotent where practical; if something fails, re-running from the last successful step is typically fine.
- By default, scripts assume `localhost:27017` for MongoDB. Adjust environment variables as needed if your setup differs.
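Since the steps run in numeric order and can usually be resumed, a small driver can save typing. This is a sketch under assumptions: `list_steps` and `run_steps` are hypothetical names, not scripts from this repository, and the driver runs every numbered script it finds, including the optional ones (dropping the database, creating a release), so pick the start point and the directory contents with care.

```bash
# Hypothetical driver (not shipped with this repository).

# Print the numbered step scripts found in DIR, in byte-wise sorted order,
# keeping only names that sort at or after the optional START prefix.
list_steps() {
  local dir="${1:-.}" start="${2:-}"
  find "$dir" -maxdepth 1 -type f -name '[0-9][0-9]*_*.sh' -exec basename {} \; \
    | LC_ALL=C sort \
    | LC_ALL=C awk -v start="$start" '($0 "") >= (start "")'
}

# Run the selected steps one by one and stop at the first failure.
# Step names contain no whitespace, so word splitting is safe here.
run_steps() {
  local dir="${1:-.}" start="${2:-}" step
  for step in $(list_steps "$dir" "$start"); do
    echo "==> $step"
    bash "$dir/$step" || { echo "Failed at $step" >&2; return 1; }
  done
}

# Example usage: resume from step 12 in the current directory.
#   run_steps . 12
```

Sorting with `LC_ALL=C` keeps `17_...` ahead of `17a_...` and `17b_...`, which some locale collations would otherwise reorder.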
Environment variables
---------------------
Some scripts accept environment variables to tweak behavior. Common ones include:
- `PYTHON_VERSION` – Pin a Python version (the CI uses 3.9)
- `MONGODB_URI` – Override the default MongoDB connection string (e.g., `mongodb://localhost:27017`)
- `GITHUB_TOKEN` – Personal Access Token with `repo` scope, required for release steps when running locally
- `RELEASE_TAG` / `RELEASE_NAME` – Override the computed tag/name for releases
Refer to each script for any additional, script-specific variables.
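How each script consumes these variables is not documented here. As an illustration only, the sketch below shows one way a script could fall back to the default connection string and derive a `host:port` pair from `MONGODB_URI`; the function name `mongo_host_port` is hypothetical, and multi-host (replica set) URIs are not handled.

```bash
# Hypothetical helper: print "host:port" for a single-host MongoDB URI.
# Uses the first argument, then $MONGODB_URI, then the documented default.
mongo_host_port() {
  local uri="${1:-${MONGODB_URI:-mongodb://localhost:27017}}"
  local rest="${uri#*://}"   # drop the scheme
  rest="${rest##*@}"         # drop credentials, if any
  rest="${rest%%/*}"         # drop the database path
  rest="${rest%%\?*}"        # drop query options
  case "$rest" in
    *:*) printf '%s\n' "$rest" ;;
    *)   printf '%s:27017\n' "$rest" ;;  # 27017 is MongoDB's default port
  esac
}

# Example usage:
#   MONGODB_URI=mongodb://db.internal:27018 mongo_host_port
```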
Running in GitHub Actions
-------------------------
The workflow at `.github/workflows/release.yml` provides a full CI pipeline that:
- Spins up a MongoDB service
- Runs the numbered scripts in sequence
- Packages artifacts
- Creates/updates a release and uploads artifacts
Trigger it manually (`workflow_dispatch`) or add schedules or other trigger conditions as needed. The workflow relies on the default token permissions being sufficient to create releases (write access to repository contents); otherwise, supply a token with those rights.
Troubleshooting
---------------
- MongoDB connection errors: ensure MongoDB is listening on `localhost:27017` and reachable. If using Docker, check the container logs and port mapping.
- `mongorestore` not found: re-run `03_install_mongo_tools.sh` or install MongoDB Database Tools from MongoDB’s official distribution.
- Python build issues (e.g., `re2`): run `08_fallback_built_google_re2.sh` to build a compatible wheel as a fallback.
- Exporter module not found: run `05_clone_sefaria_project.sh` and `07_pip_install_requirements.sh` again, then `13_check_export_module.sh`.
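To narrow down connection errors, it can be useful to test whether anything is accepting TCP connections on the MongoDB port. The sketch below is not the repository's `11_wait_for_mongodb.sh`; the function names are hypothetical, and it relies on Bash's `/dev/tcp` redirection, so it needs `bash` rather than a plain POSIX shell.

```bash
# Hypothetical helpers (Bash only; not the repository's own wait script).

# Succeed if a TCP connection to HOST:PORT can be opened.
port_open() {
  local host="$1" port="$2"
  (exec 3<>"/dev/tcp/$host/$port") >/dev/null 2>&1
}

# Poll HOST:PORT up to TRIES times, sleeping DELAY seconds between attempts.
wait_for_port() {
  local host="${1:-localhost}" port="${2:-27017}" tries="${3:-30}" delay="${4:-2}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    port_open "$host" "$port" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage:
#   wait_for_port localhost 27017 || echo "Nothing is listening on 27017" >&2
```

A successful connection only shows that the port is open; it does not prove that MongoDB is ready to serve queries.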
Project goals and scope
-----------------------
This repository focuses on orchestration and reproducibility of Sefaria exports. It does not modify Sefaria content or implement the exporter itself; those come from the upstream `Sefaria-Project`.
License
-------
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See `LICENSE` for details.