# Bible Scraper


Scrape bible from multiple resources






# :notebook_with_decorative_cover: Table of Contents

- [About the Project](#star2-about-the-project)
  - [Features](#dart-features)
  - [Environment Variables](#key-environment-variables)
- [Getting Started](#toolbox-getting-started)
  - [Prerequisites](#bangbang-prerequisites)
  - [Run Locally](#running-run-locally)
- [Usage](#eyes-usage)
  - [Scripts](#scripts)
    - [Scrape Bible](#scrape-bible)
    - [Inject FTS Content](#inject-fts-content)
    - [Others](#others)
  - [Storage](#storage)
  - [Implemented Features](#implemented-features)
  - [FTS Content Structure](#fts-content-structure)
  - [Notes](#notes)
    - [Bible Version Denominations](#bible-version-denominations)
    - [Bible Old Testament Books Comparison](#bible-old-testament-books-comparison)
    - [Missing Verses](#missing-verses)
- [Contributing](#wave-contributing)
  - [Code of Conduct](#scroll-code-of-conduct)
- [License](#warning-license)
- [Contact](#handshake-contact)
- [Acknowledgements](#gem-acknowledgements)

## :star2: About the Project

### :dart: Features

- Scrape bible from:
  - [biblegateway.com](https://www.biblegateway.com/).
  - [bible.com](https://www.bible.com/).
  - [ktcgkpv.org](https://ktcgkpv.org/).
- Currently supports:
  - Verses (with poetry).
  - Footnotes.
  - Headings.
  - References.
  - Psalm metadata (like author, title, etc.).
- Progress logging.
- Save to Postgres & SQLite database.

### :key: Environment Variables

To run this project, you will need to add the following environment variables to
your `.env` file:

- **App configs:**

  `DB_URL`: Database connection URL (Postgres or SQLite). Examples:
  - Postgres: `postgres://postgres:postgres@localhost:5432/bible`
  - SQLite: `file:../../dumps/ktcgkpv_org.sqlite3?connection_limit=1&socket_timeout=10`

  `LOG_LEVEL`: Log level.

Example:

```
# .env
DB_URL="postgres://postgres:postgres@localhost:65439/bible"
LOG_LEVEL=info
```

You can also check out the file `.env.example` to see all required environment
variables.
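
As a rough illustration of how a script might consume these two variables, the sketch below reads `DB_URL` and `LOG_LEVEL` from an env-like object and infers the database kind from the URL scheme shown in the examples. The names `loadConfig`, `detectDbKind`, and `AppConfig`, as well as the `info` default, are assumptions made for this example and are not taken from the project's code.

```ts
// Hypothetical helper names; only DB_URL and LOG_LEVEL come from the README.
type DbKind = 'postgres' | 'sqlite';

interface AppConfig {
  dbUrl: string;
  dbKind: DbKind;
  logLevel: string;
}

// Infer the database kind from the URL scheme used in the examples above.
function detectDbKind(dbUrl: string): DbKind {
  if (dbUrl.startsWith('postgres://') || dbUrl.startsWith('postgresql://')) {
    return 'postgres';
  }
  if (dbUrl.startsWith('file:')) {
    return 'sqlite';
  }
  throw new Error(`Unsupported DB_URL scheme: ${dbUrl}`);
}

// Read both variables from an env-like object and fail fast if DB_URL is missing.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const dbUrl = env.DB_URL;
  if (!dbUrl) {
    throw new Error('DB_URL is not set');
  }
  return {
    dbUrl,
    dbKind: detectDbKind(dbUrl),
    // Assumption: fall back to "info" when LOG_LEVEL is unset.
    logLevel: env.LOG_LEVEL ?? 'info',
  };
}
```

In a Node.js script this could be called as `loadConfig(process.env)` once the `.env` file has been loaded.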

## :toolbox: Getting Started

### :bangbang: Prerequisites

This project uses [pnpm](https://pnpm.io/) as package manager:

```bash
npm install --global pnpm
```

Playwright is also required. Run the following command to download its browser
binaries:

```bash
npx playwright install
```

### :running: Run Locally

Clone the project:

```bash
git clone https://github.com/v-bible/bible-scraper.git
```

Go to the project directory:

```bash
cd bible-scraper
```

Install dependencies:

```bash
pnpm install
```

Set up the Postgres database using Docker Compose:

```bash
docker-compose up -d
```

Migrate the database:

- SQLite:

```bash
pnpm prisma:migrate:sqlite
```

- Postgres:

```bash
pnpm prisma:migrate:pg
```

Generate Prisma client:

- SQLite:

```bash
pnpm prisma:generate --schema ./prisma/sqlite/schema.prisma
```

- Postgres:

```bash
pnpm prisma:generate --schema ./prisma/pg/schema.prisma
```

## :eyes: Usage

### Scripts

#### Scrape Bible

> [!NOTE]
> To prevent the error `net::ERR_NETWORK_CHANGED`, you can temporarily disable
> IPv6 on your network adapter:
>
> ```bash
> sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
> sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
> ```

- Scrape bible (from [biblegateway.com](https://www.biblegateway.com/)):

```bash
npx tsx ./src/biblegateway.com/main.ts
```

- Scrape bible (from [bible.com](https://www.bible.com/)):

```bash
npx tsx ./src/bible.com/main.ts
```

> [!NOTE]
> The `bible.com` script does not use the **local** version code, which may
> vary between languages. For example, in Vietnamese, version `"VCB"` has the
> local code `"KTHD"`.

- Scrape bible (from [ktcgkpv.org](https://ktcgkpv.org/bible?version=1)):

```bash
npx tsx ./src/ktcgkpv.org/main.ts
```

#### Inject FTS Content

Inject FTS content for SQLite database:

```bash
npx tsx ./src/scripts/inject-fts.ts
```

- Source DB: Defined from `DB_URL` environment variable for Prisma.
- Target DB: Defined in the script.

> [!NOTE]
> For table fields, please refer to
> [`prisma/sqlite/schema.prisma`](./prisma/sqlite/schema.prisma) and [FTS
> Content Structure](#fts-content-structure).

#### Others

- Scrape liturgical resources for **Ordinary Time** (Weekdays & Sundays) from
[catholic-resources.org](https://catholic-resources.org/):

> The Lectionary for Mass - Second USA Edition
> (Sunday Volume, 1998; Weekday Volumes, 2002)

```bash
npx tsx ./src/catholic-resources/main.ts
```

> [!NOTE]
> The script `get-ordinary-time.ts` logs **mismatched** Gospel readings for
> Weekday OT between Year I & II. You can see them in
> [`dumps/catholic-resources/note-ot.txt`](./dumps/catholic-resources/note-ot.txt).

> [!NOTE]
> You can update `SOURCE_DB` and `TARGET_DB` in the script to change the source
> & destination database.

### Storage

Scraped data is stored in a Hugging Face
[dataset](https://huggingface.co/datasets/v-bible/catholic-resources).

### Implemented Features

Comparing the scraped data from different sources:

| **Features** | **biblegateway.com** | **bible.com** | **ktcgkpv.org** |
|---------------------------------|----------------------|---------------|-----------------|
| Verse | ✔️ | ✔️ | ✔️ |
| Poetry | ✔️ | ✔️ | ✔️ |
| Footnote | ✔️ | ✔️ | ✔️ |
| Cross Reference | ✔️ | ✔️ | ✔️ |
| Psalm Metadata | ✔️ | ✔️ | ✔️ |
| Words of Jesus (red letter) | ✔️ | ✔️ | ❌ |
| Proper Names (name translation) | ❌ | ❌ | ✔️ |

### FTS Content Structure

The FTS content structure is as follows:

```ts
{
  objectId: string; // Unique identifier for the content
  content: string; // The text content to be indexed
  sortOrder: number; // Sort order for the content
  bookCode: string; // Code of the book (e.g., "gen" for Genesis)
  bookName: string; // Name of the book (e.g., "Genesis")
  testament: string; // Testament type (e.g., "ot", "nt")
  chapterNumber: number; // Chapter number
  chapterId: string; // Unique identifier for the chapter
  type: 'verse' | 'footnote' | 'heading' | 'psalm_metadata' | 'words_of_jesus'; // Type of content
}
```
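
To make the shape above more concrete, here is a minimal sketch that gives it a named type, builds two sample records, and filters them by `type`. The names `FtsContent`, `FtsContentType`, and `filterByType`, along with the sample identifiers, are made up for illustration and may not match the project's code or data.

```ts
// Illustrative type names; the field list mirrors the structure above.
type FtsContentType =
  | 'verse'
  | 'footnote'
  | 'heading'
  | 'psalm_metadata'
  | 'words_of_jesus';

interface FtsContent {
  objectId: string;
  content: string;
  sortOrder: number;
  bookCode: string;
  bookName: string;
  testament: string;
  chapterNumber: number;
  chapterId: string;
  type: FtsContentType;
}

// Sample records with made-up identifiers, for illustration only.
const sampleRows: FtsContent[] = [
  {
    objectId: 'sample-heading-1',
    content: 'The Creation',
    sortOrder: 0,
    bookCode: 'gen',
    bookName: 'Genesis',
    testament: 'ot',
    chapterNumber: 1,
    chapterId: 'sample-chapter-gen-1',
    type: 'heading',
  },
  {
    objectId: 'sample-verse-1',
    content: 'In the beginning God created the heavens and the earth.',
    sortOrder: 1,
    bookCode: 'gen',
    bookName: 'Genesis',
    testament: 'ot',
    chapterNumber: 1,
    chapterId: 'sample-chapter-gen-1',
    type: 'verse',
  },
];

// Keep only records of one content type, e.g. to search verses but not headings.
function filterByType(items: FtsContent[], type: FtsContentType): FtsContent[] {
  return items.filter((item) => item.type === type);
}

const versesOnly = filterByType(sampleRows, 'verse');
console.log(versesOnly.map((row) => row.objectId)); // → [ 'sample-verse-1' ]
```

Restricting `type` to a union makes a misspelled content type a compile-time error instead of a silent mismatch at query time.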

### Notes

#### Bible Version Denominations

| Version Code | Source | Name | Denomination |
| ------------ | ---------------- | ------------------------------------ | ------------ |
| KT2011 | ktcgkpv.org | KPA : ấn bản KT 2011 | Catholic |
| BD2011 | bible.com | Kinh Thánh Tiếng Việt, Bản Dịch 2011 | Protestant |
| BD2011 | biblegateway.com | Bản Dịch 2011 (BD2011) | Protestant |

#### Bible Old Testament Books Comparison

| Hebrew Bible | Greek Bible (Septuagint) | Catholic Old Testament | Protestant Old Testament |
|---|---|---|---|
| **I. Law (Torah)**<br>1. Sáng Thế<br>2. Xuất Hành<br>3. Lêvi<br>4. Dân Số<br>5. Đệ Nhị Luật | **I. Pentateuch**<br>1. Sáng Thế<br>2. Xuất Hành<br>3. Lêvi<br>4. Dân Số<br>5. Đệ Nhị Luật | **I. Pentateuch**<br>1. Sáng Thế<br>2. Xuất Hành<br>3. Lêvi<br>4. Dân Số<br>5. Đệ Nhị Luật | **I. Pentateuch**<br>1. Sáng Thế<br>2. Xuất Hành<br>3. Lêvi<br>4. Dân Số<br>5. Đệ Nhị Luật |
| **II. Prophets**<br>- Former Prophets<br>6. Giôsuê<br>7. Thẩm phán<br>8. 1 & 2 Samuel<br>9. 1 & 2 Vua<br>- Latter Prophets<br>10. Isaia<br>11. Giêrêmia<br>12. Êzêkiel<br>13. Mười hai ngôn sứ | **II. History**<br>6. Giôsuê<br>7. Thẩm phán<br>8. Ruth<br>9. 1 & 2 Samuel<br>10. 1 & 2 Vua<br>11. 1 & 2 Sử biên niên<br>12. Ezra – Nêhêmia<br>13. Ester<br>14. Giuđitha<br>15. Tôbit<br>16. 1 & 2 Maccabê | **II. History**<br>6. Giôsuê<br>7. Thẩm phán<br>8. Ruth<br>9. Samuel 1<br>10. Samuel 2<br>11. Vua 1<br>12. Vua 2<br>13. Sử biên niên 1<br>14. Sử biên niên 2<br>15. Ezra<br>16. Nêhêmia<br>17. Tobia\*<br>18. Giuđitha\*<br>19. Ester<br>20. Maccabê 1\*<br>21. Maccabê 2\* | **II. History**<br>6. Giôsuê<br>7. Thẩm phán<br>8. Ruth<br>9. Samuel 1<br>10. Samuel 2<br>11. Vua 1<br>12. Vua 2<br>13. Sử biên niên 1<br>14. Sử biên niên 2<br>15. Ezra<br>16. Nêhêmia<br>17. Ester |
| **III. Other Books**<br>14. Thánh vịnh<br>15. Giob<br>16. Châm ngôn<br>17. Ruth<br>18. Diễm ca<br>19. Giảng viên<br>20. Ai ca<br>21. Ester<br>22. Đaniel<br>23. Ezra – Nêhêmia<br>24. 1 & 2 Sử biên niên | **III. Teaching – Wisdom**<br>17. Thánh vịnh<br>18. Châm ngôn<br>19. Giảng viên<br>20. Diễm ca<br>21. Giob<br>22. Khôn ngoan<br>23. Huấn ca | **III. Teaching – Wisdom**<br>22. Giob<br>23. Thánh vịnh<br>24. Châm ngôn<br>25. Giảng viên<br>26. Diễm ca<br>27. Khôn ngoan\*<br>28. Huấn ca\* | **III. Teaching – Wisdom**<br>18. Giob<br>19. Thánh vịnh<br>20. Châm ngôn<br>21. Giảng viên<br>22. Diễm ca |
| | **IV. Prophets**<br>24. Ôsê<br>25. Amos<br>26. Mica<br>27. Giôel<br>28. Abđia<br>29. Giôna<br>30. Nahum<br>31. Habacuc<br>32. Sôphônia<br>33. Aggai<br>34. Zacaria<br>35. Malaki<br>36. Isaia<br>37. Giêrêmia<br>38. Baruc<br>39. Ai ca<br>40. Thư của Giêrêmia<br>41. Êzêkiel<br>42. Đaniel | **IV. Prophets**<br>29. Isaia<br>30. Giêrêmia<br>31. Ai ca<br>32. Baruc\*<br>33. Êzêkiel<br>34. Đaniel<br>35. Ôsê<br>36. Giôel<br>37. Amos<br>38. Abđia<br>39. Giôna<br>40. Mica<br>41. Nahum<br>42. Habacuc<br>43. Sôphônia<br>44. Aggai<br>45. Zacaria<br>46. Malaki | **IV. Prophets**<br>23. Isaia<br>24. Giêrêmia<br>25. Ai ca<br>26. Êzêkiel<br>27. Đaniel<br>28. Ôsê<br>29. Giôel<br>30. Amos<br>31. Abđia<br>32. Giôna<br>33. Mica<br>34. Nahum<br>35. Habacuc<br>36. Sôphônia<br>37. Aggai<br>38. Zacaria<br>39. Malaki |

> [!NOTE]
> Source: Stephen L. Harris, _Understanding the Bible_, 1997.

> [!NOTE]
> Books marked with `*` are not included in the Protestant Old Testament.

#### Missing Verses

- Version: KT2011 - (ktcgkpv.org)

| Book | Book Code | Missing Verses | Notes |
| --------- | --------- | ---------------------------------------------- | ---------------- |
| Tô-bi-a | tb | chapter 9: 4 | Corrected: 3-4 |
| Tô-bi-a | tb | chapter 14: 9 | Corrected: 8-9 |
| Châm ngôn | cn | chapter 14: 32 | Intended |
| Huấn ca | hc | chapter 1: 5, 7, 21 | Intended |
| Huấn ca | hc | chapter 3: 19, 25 | Intended |
| Huấn ca | hc | chapter 10: 21 | Intended |
| Huấn ca | hc | chapter 11: 15, 16 | Intended |
| Huấn ca | hc | chapter 13: 14 | Intended |
| Huấn ca | hc | chapter 16: 15, 16 | Intended |
| Huấn ca | hc | chapter 17: 5, 9, 16, 18, 21 | Intended |
| Huấn ca | hc | chapter 18: 3 | Intended |
| Huấn ca | hc | chapter 19: 18, 19, 21 | Intended |
| Huấn ca | hc | chapter 22: 7, 8 | Intended |
| Huấn ca | hc | chapter 24: 18, 24 | Intended |
| Huấn ca | hc | chapter 25: 12 | Intended |
| Huấn ca | hc | chapter 26: 19, 20, 21, 22, 23, 24, 25, 26, 27 | Intended |
| Gio-an | ga | chapter 7: 38 | Corrected: 37-38 |

> [!NOTE]
> Merged verses are stored under the first verse number, with the full range as
> the label. For example, `tb 9: 3-4` is stored with number `3` and label `3-4`,
> and `ga 7: 37-38` is stored with number `37` and label `37-38`.
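
The number/label convention in the note above can be sketched as follows. This is only an illustration under stated assumptions: `StoredVerse`, `toStoredVerse`, and `coveredNumbers` are hypothetical names, and the real scraper may parse labels differently.

```ts
// Hypothetical shape: the note only states that a merged verse keeps the
// first verse number as `number` and the full range as `label`.
interface StoredVerse {
  number: number;
  label: string;
}

// Convert a scraped verse label such as "3-4" or "7" into the stored form.
function toStoredVerse(rawLabel: string): StoredVerse {
  const label = rawLabel.trim();
  const match = /^(\d+)(?:-(\d+))?$/.exec(label);
  if (!match) {
    throw new Error(`Unrecognized verse label: ${rawLabel}`);
  }
  return { number: Number(match[1]), label };
}

// List every verse number a stored verse covers, e.g. "37-38" covers 37 and 38.
function coveredNumbers(verse: StoredVerse): number[] {
  const [start, end = start] = verse.label.split('-').map(Number);
  const result: number[] = [];
  for (let n = start; n <= end; n += 1) {
    result.push(n);
  }
  return result;
}

const merged = toStoredVerse('3-4');
console.log(merged); // → { number: 3, label: '3-4' }
console.log(coveredNumbers(merged)); // → [ 3, 4 ]
```

Keeping the range in `label` means a lookup for verse 4 can still be resolved to the record stored under number 3.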

- Version: BD2011 - (biblegateway.com)

| Book | Book Code | Missing Verses | Notes |
| ---- | --------- | ----------------- | ----------------------- |
| Mác | mark | chapter 9: 45, 47 | Corrected: 45-46, 47-48 |

- Version: BD2011 - (bible.com)

| Book | Book Code | Missing Verses | Notes |
| ---- | --------- | ----------------- | ----------------------- |
| Mác | mrk | chapter 9: 45, 47 | Corrected: 45-46, 47-48 |
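
The README does not say how the gaps listed in these tables were found. One plausible approach, shown here purely as a sketch, is to compare the verse numbers scraped for a chapter against the continuous range from 1 to the highest number seen. The function name `findMissingVerses` is an assumption for this example.

```ts
// Return the verse numbers absent from 1..max, given the numbers scraped for one chapter.
// Limitation: verses missing after the highest scraped number cannot be detected this way.
function findMissingVerses(scraped: number[]): number[] {
  if (scraped.length === 0) {
    return [];
  }
  const seen = new Set(scraped);
  const max = Math.max(...scraped);
  const missing: number[] = [];
  for (let n = 1; n <= max; n += 1) {
    if (!seen.has(n)) {
      missing.push(n);
    }
  }
  return missing;
}

console.log(findMissingVerses([1, 2, 3, 5, 6])); // → [ 4 ]
```

A reported gap still needs manual review, since it may be a merged verse on the source page or an omission the translation makes on purpose.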

## :wave: Contributing



Contributions are always welcome!

Please read the [contribution guidelines](./CONTRIBUTING.md).

### :scroll: Code of Conduct

Please read the [Code of Conduct](./CODE_OF_CONDUCT.md).

## :warning: License

This project is licensed under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** License.

[![License: CC BY-NC-SA 4.0](https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png)](https://creativecommons.org/licenses/by-nc-sa/4.0/).

See the **[LICENSE.md](./LICENSE.md)** file for full details.

## :handshake: Contact

Duong Vinh - [@duckymomo20012](https://twitter.com/duckymomo20012) -
tienvinh.duong4@gmail.com

Project Link: [https://github.com/v-bible/bible-scraper](https://github.com/v-bible/bible-scraper).

## :gem: Acknowledgements

Here are useful resources and libraries that we have used in this project:

- [bible.com](https://www.bible.com/): bible.com website.
- [biblegateway.com](https://www.biblegateway.com/): biblegateway.com website.
- [ktcgkpv.org](https://ktcgkpv.org/): Nhóm Phiên Dịch Các Giờ Kinh Phụng Vụ
website.
- [The Lectionary for Mass (1998/2002 USA
Edition)](https://catholic-resources.org/Lectionary/1998USL.htm): compiled by
Felix Just, S.J., Ph.D.