https://github.com/v-bible/crawler

A collection of web crawlers to crawl Catholic resources in Vietnamese language
catholic corpus-linguistics crawler nlp playwright
Last synced: 2 months ago
Host: GitHub
URL: https://github.com/v-bible/crawler
Owner: v-bible
License: other
Created: 2025-06-07T01:57:19.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2026-03-26T09:44:52.000Z (3 months ago)
Last Synced: 2026-03-27T00:50:27.303Z (3 months ago)
Topics: catholic, corpus-linguistics, crawler, nlp, playwright
Language: TypeScript
Homepage: https://huggingface.co/datasets/v-bible/catholic-resources
Size: 915 KB
Stars: 0
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md
- Agents: AGENTS.md

README

          


  
# Crawler

A collection of web crawlers to crawl Catholic resources in Vietnamese language

  




  

    

  

  

    

  

  

    

  

  

    

  

  

    

  

  

    

  






  






# :notebook_with_decorative_cover: Table of Contents

- [About the Project](#star2-about-the-project)

  - [Environment Variables](#key-environment-variables)

- [Getting Started](#toolbox-getting-started)

  - [Prerequisites](#bangbang-prerequisites)

  - [Run Locally](#running-run-locally)

- [Usage](#eyes-usage)

  - [CLI Usage](#cli-usage)

  - [Library Usage](#library-usage)

  - [Category Guidelines](#category-guidelines)

    - [Category ID](#category-id)

    - [Folder Structure](#folder-structure)

    - [Category References](#category-references)

  - [Document Metadata](#document-metadata)

  - [Named Entity Recognition (NER)](#named-entity-recognition-ner)

    - [Entity Label Categories](#entity-label-categories)

    - [Setup Label Tools](#setup-label-tools)

    - [Getting API Token](#getting-api-token)

    - [Label Procedure](#label-procedure)

- [Contributing](#wave-contributing)

  - [Code of Conduct](#scroll-code-of-conduct)

- [License](#warning-license)

- [Contact](#handshake-contact)

## :star2: About the Project

### :key: Environment Variables

To run this project, you will need to add the following environment variables to
your `.env` file:

- **App configs:**
  - `LOG_LEVEL`: Log level.
- **Label Studio configs:**
  - `LABEL_STUDIO_URL`: URL of the Label Studio instance. E.g.:
    `http://localhost:8080`.
  - `LABEL_STUDIO_LEGACY_TOKEN`: Legacy token for the Label Studio API. You can
    generate it on the Label Studio settings page.
  - `LABEL_STUDIO_PROJECT_TITLE`: Title of the Label Studio project. The
    `src/ner-processing/import-ner-task.ts` and
    `src/ner-processing/export-ner-task.ts` scripts use it to import and export
    NER tasks.

> [!NOTE]
> These Label Studio environment variables are only required for the
> `ner-processing` scripts to connect to the Label Studio instance.

E.g.:

```
# .env
LOG_LEVEL=info
LABEL_STUDIO_URL=http://localhost:8080
LABEL_STUDIO_LEGACY_TOKEN=eyJhb***
LABEL_STUDIO_PROJECT_TITLE=v-bible
```

You can also check out the file `.env.example` to see all required environment
variables.
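
As a rough illustration of how a script might check these variables before connecting to Label Studio, here is a minimal sketch. The `readLabelStudioConfig` helper and the `LabelStudioConfig` type are hypothetical names, not part of this repository; only the variable names come from the list above.

```ts
// Hypothetical helper for illustration; not taken from the repository.
type Env = Record<string, string | undefined>;

interface LabelStudioConfig {
  url: string;
  legacyToken: string;
  projectTitle: string;
}

// Variable names as documented above.
const REQUIRED_KEYS = [
  'LABEL_STUDIO_URL',
  'LABEL_STUDIO_LEGACY_TOKEN',
  'LABEL_STUDIO_PROJECT_TITLE',
] as const;

function readLabelStudioConfig(env: Env): LabelStudioConfig {
  // Collect every missing or empty variable so the error lists them all at once.
  const missing = REQUIRED_KEYS.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    url: env['LABEL_STUDIO_URL'] as string,
    legacyToken: env['LABEL_STUDIO_LEGACY_TOKEN'] as string,
    projectTitle: env['LABEL_STUDIO_PROJECT_TITLE'] as string,
  };
}
```

In a Node.js script, this could be called as `readLabelStudioConfig(process.env)` once the `.env` file has been loaded.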

## :toolbox: Getting Started

### :bangbang: Prerequisites

- This project uses [pnpm](https://pnpm.io/) as its package manager:

  ```bash
  npm install --global pnpm
  ```

- Playwright: Run the following command to download new browser binaries:

  ```bash
  npx playwright install
  ```

- `asdf` environment: Please set up `asdf` to install the corresponding
  dependencies specified in the `.tool-versions` file.
  - nodejs: `https://github.com/asdf-vm/asdf-nodejs.git`.

### :running: Run Locally

Clone the project:

```bash
git clone https://github.com/v-bible/crawler.git
```

Go to the project directory:

```bash
cd crawler
```

Install dependencies:

```bash
pnpm install
```

## :eyes: Usage

> [!NOTE]
> **This package can be used both as a CLI tool and as a library.**
>
> - **CLI**: Run commands directly from the terminal for convenient web crawling
> - **Library**: Import and use crawler functions in your TypeScript/JavaScript code

### CLI Usage

The crawler provides a command-line interface for crawling websites easily.

**Basic usage:**

```bash
# Crawl a specific site
crawler crawl --site thanhlinh.net

# Crawl all sites
crawler crawl --site all

# Crawl with verbose logging
crawler crawl --site augustino.net --verbose

# Crawl with custom timeout (in milliseconds)
crawler crawl --site conggiao.org --timeout 900000
```

**Available sites:**

- `augustino.net`

- `conggiao.org`

- `dongten.net`

- `hdgmvietnam.com`

- `ktcgkpv.org`

- `rongmotamhon.net`

- `thanhlinh.net`

- `all` - Crawl all available sites

**Command flags:**

```
FLAGS
     [--site]                 Site to crawl. Available: augustino.net, conggiao.org,
                              dongten.net, hdgmvietnam.com, ktcgkpv.org, rongmotamhon.net,
                              thanhlinh.net, or 'all' for all sites [default = all]
     [--timeout]              Timeout in milliseconds for each crawl operation
     [--verbose/--noVerbose]  Enable verbose logging
  -h  --help                  Print help information and exit
```

**Bash completion:**

Install bash completion to get auto-completion for commands and flags:

```bash
# Install bash completion
crawler install

# Uninstall bash completion
crawler uninstall
```

### Library Usage

You can also use the crawler programmatically by importing the site modules directly:

```ts
import { crawler as thanhlinhCrawler } from './src/sites/thanhlinh.net/main';

// Run the crawler
await thanhlinhCrawler.run();
```

**Direct execution:**

Each site crawler can also be run directly using tsx:

```bash
npx tsx src/sites/thanhlinh.net/main.ts
```

This provides flexibility to run crawlers either through the CLI or programmatically.

### Category Guidelines

#### Category ID

- **Sentence ID**: Each sentence ID **MUST** have the following format:

  ```
  <domain><subDomain><genre>_fff.ccc.ppp.ss
  ```

  - `domain`: Domain code. **Format**: 1 character, in uppercase. E.g.: `R`.
  - `subDomain`: Subdomain code. **Format**: 1 character, in uppercase. E.g.:
    `C`.
  - `genre`: Genre code. **Format**: 1 character, in uppercase. E.g.: `D`.
  - `documentNumber` (`fff`): Document number of `genre`. **Format**: 3 digits,
    starting from `001`.
  - `chapterNumber` (`ccc`): Chapter number of `documentNumber`. **Format**: 3
    digits, starting from `001`.
  - `pageNumber` (`ppp`):
    - For text-based data, it is the paragraph number of `chapterNumber`.
    - For OCR data, it is the page number of `chapterNumber`.
    - **Format**: 3 digits, starting from `001`.
  - `sentenceNumber` (`ss`): Sentence number of `pageNumber`. **Format**: 2
    digits, starting from `01`.

- **File ID**: Each file ID **MUST** have the following format:

  ```
  <domain><subDomain><genre>_fff.ccc.xml
  ```

  - `domain`: Domain code. **Format**: 1 character, in uppercase. E.g.: `R`.
  - `subDomain`: Subdomain code. **Format**: 1 character, in uppercase. E.g.:
    `C`.
  - `genre`: Genre code. **Format**: 1 character, in uppercase. E.g.: `D`.
  - `documentNumber` (`fff`): Document number of `genre`. **Format**: 3 digits,
    starting from `001`.
  - `chapterNumber` (`ccc`): Chapter number of `documentNumber`. **Format**: 3
    digits, starting from `001`.
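
The ID formats above are strict enough to check with a regular expression. The following is a minimal sketch of parsing a sentence ID and deriving the matching file ID. `SentenceId`, `parseSentenceId`, and `toFileId` are hypothetical names chosen for illustration and are not taken from the repository; the rules (three uppercase code letters, zero-padded numbers starting from 1) follow the descriptions above.

```ts
// Hypothetical types and helpers for illustration; not taken from the repository.
interface SentenceId {
  domain: string;
  subDomain: string;
  genre: string;
  documentNumber: number;
  chapterNumber: number;
  pageNumber: number;
  sentenceNumber: number;
}

// Three uppercase code letters, "_", then fff.ccc.ppp.ss
const SENTENCE_ID_PATTERN =
  /^([A-Z])([A-Z])([A-Z])_(\d{3})\.(\d{3})\.(\d{3})\.(\d{2})$/;

function parseSentenceId(id: string): SentenceId | null {
  const match = SENTENCE_ID_PATTERN.exec(id);
  if (match === null) {
    return null;
  }
  const [, domain, subDomain, genre, doc, chapter, page, sentence] = match;
  const parsed: SentenceId = {
    domain,
    subDomain,
    genre,
    documentNumber: Number(doc),
    chapterNumber: Number(chapter),
    pageNumber: Number(page),
    sentenceNumber: Number(sentence),
  };
  // Numbering starts from 001 (or 01), so a zero in any position is rejected.
  const numbers = [
    parsed.documentNumber,
    parsed.chapterNumber,
    parsed.pageNumber,
    parsed.sentenceNumber,
  ];
  return numbers.every((n) => n >= 1) ? parsed : null;
}

function toFileId(sentenceId: string): string | null {
  if (parseSentenceId(sentenceId) === null) {
    return null;
  }
  // Keep the code prefix with "fff" and the "ccc" part; drop "ppp.ss".
  return `${sentenceId.split('.').slice(0, 2).join('.')}.xml`;
}
```

For example, `parseSentenceId('RCN_001.001.001.01')` yields domain `R`, subdomain `C`, and genre `N`, and `toFileId` maps the same ID to `RCN_001.001.xml`.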

#### Folder Structure

> [!NOTE]
> Data is stored in the Hugging Face dataset
> [catholic-resources](https://huggingface.co/datasets/v-bible/catholic-resources).

> [!NOTE]
> Data is stored as folders of `<genre>`s instead of
> `corpus/<domain>/<subDomain>/<genre>`, because this repository only stores
> the **Catholic resources** (`RC`).

```
catholic-resources
└── corpus
    └── <genre>
        └── <domain><subDomain><genre>_fff (<documentTitle>)
            ├── <domain><subDomain><genre>_fff.ccc.xml
            ├── <domain><subDomain><genre>_fff.ccc.json
            ├── <domain><subDomain><genre>_fff.ccc.md
            └── ...
```

- `documentTitle`: The title of the document, which is used to identify the
  document.

#### Category References

> [!NOTE]
> Any changes to the category references should be reflected in the
> [`src/mapping.ts`](./src/mapping.ts) file.

- Domains:

| code | category | vietnamese |
| :--: | :------: | :--------: |
|  R   | religion |  Tôn giáo  |

- Subdomains:

| code | category | vietnamese |
| :--: | :------: | :--------: |
|  C   | catholic | Công Giáo  |

- Genres:

> [!NOTE]
> Genres with no category are **reserved** for future use.

| code |           category            |        vietnamese        |
| :--: | :---------------------------: | :----------------------: |
|  A   |     advent contemplation      |    Suy niệm Mùa Vọng     |
|  B   |                               |                          |
|  C   |          catechesis           |    Giáo lý/Giáo huấn     |
|  D   |        church document        |    Văn kiện Giáo Hội     |
|  E   |      exegesis/commentary      |    Chú giải/Bình luận    |
|  F   |      lent contemplation       |    Suy niệm Mùa Chay     |
|  G   |     easter contemplation      |  Suy niệm Mùa Phục Sinh  |
|  H   |       ot contemplation        | Suy niệm Mùa Thường Niên |
|  I   |      other contemplation      |      Suy niệm khác       |
|  J   |                               |                          |
|  K   |                               |                          |
|  L   |          liturgical           |         Phụng vụ         |
|  M   |            memoir             |          Hồi ký          |
|  N   |         new testament         |    Kinh Thánh Tân Ước    |
|  O   |         old testament         |    Kinh Thánh Cựu Ước    |
|  P   |            prayer             |        Cầu nguyện        |
|  Q   |                               |                          |
|  R   |                               |                          |
|  S   | saint/beatification biography | Tiểu sử Thánh/Chân phước |
|  T   |           theology            |         Thần học         |
|  U   |                               |                          |
|  V   |                               |                          |
|  W   |                               |                          |
|  X   |    christmas contemplation    | Suy niệm Mùa Giáng Sinh  |
|  Y   |          philosophy           |        Triết học         |
|  Z   |            others             |           Khác           |

- Tags:

> [!NOTE]
> Tags are used to further classify the genres. Currently, they are not used to
> construct the sentence ID. However, this information is stored in the metadata
> of the sentence.

Tag references

| code | category | vietnamese |
| :--: | :------: | :--------: |
|      | apostolic constitution | Tông hiến (Văn kiện Giáo Hội) |
|      | encyclical letter | Thông điệp (Văn kiện Giáo Hội) |
|      | apostolic letter | Tông thư (Văn kiện Giáo Hội) |
|      | declarations | Tuyên ngôn (Văn kiện Giáo Hội) |
|      | motu proprio | Tài liệu dưới dạng tự sắc (Văn kiện Giáo Hội) |
|      | apostolic exhortations | Tông huấn (Văn kiện Giáo Hội) |
|      | note document | Ghi chú (Văn kiện Giáo Hội) |
|      | urbi et orbi message | Sứ điệp Giáng Sinh/Phục Sinh (Văn kiện Giáo Hội) |
|      | constitution | Hiến chế (Văn kiện Giáo Hội) |
|      | decrees | Sắc lệnh (Văn kiện Giáo Hội) |
|      | instrumentum laboris | Tài liệu làm việc (Văn kiện Giáo Hội) |
|      | synod of bishops note | Ghi chú của Thượng Hội đồng Giám mục (Văn kiện Giáo Hội) |
|      | letters | Thư (Văn kiện Giáo Hội) |
|      | messages | Sứ điệp (Văn kiện Giáo Hội) |
|      | bible pentateuch division | Ngũ Thư (Kinh Thánh Cựu Ước) |
|      | bible historical books division | Lịch Sử (Kinh Thánh Cựu Ước) |
|      | bible poetic/wisdom books division | Giáo huấn - Khôn ngoan (Kinh Thánh Cựu Ước) |
|      | bible prophetic books division | Ngôn sứ - Tiên tri (Kinh Thánh Cựu Ước) |
|      | bible gospels division | Sách Phúc Âm - Tin Mừng (Kinh Thánh Tân Ước) |
|      | bible acts division | Sách Công vụ Tông đồ (Kinh Thánh Tân Ước) |
|      | bible pauline letters division | Các thư mục vụ của Thánh Phao-lô (Kinh Thánh Tân Ước) |
|      | bible general epistles division | Các thư chung (Kinh Thánh Tân Ước) |
|      | bible revelation division | Sách Khải Huyền (Kinh Thánh Tân Ước) |
|      | morning and evening prayers | Các kinh đọc sáng tối ngày thường và Chúa Nhật (Cầu nguyện) |
|      | offertory prayer, prayers of preparation for holy communion and prayers of thanksgiving | Kinh dâng lễ, những kinh dọn mình chịu lễ và những kinh cám ơn (Cầu nguyện) |
|      | the stations of the cross prayers | Kinh ngắm Đàng Thánh giá và ít nhiều kinh khác (Cầu nguyện) |
|      | rosary prayers | Phép lần hạt Mân Côi (Cầu nguyện) |
|      | prayers | Kinh cầu (Cầu nguyện) |
|      | daily prayers | Kinh đọc hàng ngày (Cầu nguyện) |
|      | maria | Mẹ Maria |
|      | advent | Mùa Vọng |
|      | christmas | Mùa Giáng Sinh |
|      | lent | Mùa Chay |
|      | triduum | Mùa Chay - Tuần Thánh |
|      | easter | Mùa Phục Sinh |
|      | ot | Mùa Thường niên |
|      | celebrations | Lễ lớn |
|      | jubilee year | Năm Thánh |

### Document Metadata

Document metadata is stored in [data/main.tsv](./data/main.tsv), which is
downloaded from the Google Sheets document [\[NLP\] Danh sách tài liệu Công
giáo](https://docs.google.com/spreadsheets/d/1YETFmWnGOM1E2Z0pLMkxkqHczyzCmIbqptQ25VFuBGM/edit?usp=sharing)
and is updated periodically.

> [!IMPORTANT]
> Export the file in `TSV` format only; do not export it as `CSV`.
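
To illustrate how a tab-separated export can be read, here is a minimal sketch of a TSV parser that turns each row into an object keyed by the header row. The `parseTsv` function is a hypothetical helper, not the loader used by this project, and it assumes that cells contain no tab or newline characters and that no quoting is involved. The column names in the usage note are placeholders, since the actual columns of `data/main.tsv` are not listed here.

```ts
// Hypothetical helper for illustration; not taken from the repository.
// Assumes no quoting and no tab or newline characters inside a cell.
function parseTsv(content: string): Record<string, string>[] {
  const lines = content.split(/\r?\n/).filter((line) => line.length > 0);
  const headerLine = lines[0];
  if (headerLine === undefined) {
    return [];
  }
  const headers = headerLine.split('\t');
  return lines.slice(1).map((line) => {
    const cells = line.split('\t');
    const row: Record<string, string> = {};
    headers.forEach((header, index) => {
      // Missing trailing cells become empty strings.
      row[header] = cells[index] ?? '';
    });
    return row;
  });
}
```

Because only tab characters separate cells, a value such as `A, B` in a placeholder `title` column stays in one field without any quote handling.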

### Named Entity Recognition (NER)

#### Entity Label Categories

> [!NOTE]
> Any changes to the category references should be reflected in the
> [`src/lib/ner/mapping.ts`](./src/lib/ner/mapping.ts) file.

| label |   category   |              examples              |          vietnamese examples           |
| :---: | :----------: | :--------------------------------: | :------------------------------------: |
|  PER  |    person    | Jesus, Mary, Peter, Paul, John,... | Giêsu, Maria, Phêrô, Phaolô, Gioan,... |
|  LOC  |   location   |   Jerusalem, Rome, Bethlehem,...   |      Giêrusalem, Rôma, Bêlem,...       |
|  ORG  | organization |    Vatican, Catholic Church,...    |    Vatican, Giáo Hội Công Giáo,...     |
| TITLE |    title     |     Pope, Bishop, Cardinal,...     |    Giáo hoàng, Giám mục, Hồng y,...    |
|  TME  |     time     |    Sunday, Monday, January,...     |  Chúa Nhật, Thứ Hai, Tháng Giêng,...   |
|  NUM  |    number    |         1, 2, 3, 4, 5,...          |           1, 2, 3, 4, 5,...            |

#### Setup Label Tools

Please use [Label Studio](https://labelstud.io/) to label the NER data. Refer to
the [v-bible/nlp-label-studio](https://github.com/v-bible/nlp-label-studio)
repository for instructions on setting up Label Studio.

#### Getting API Token

To get the Label Studio legacy API token, go to
`http://localhost:8080/organization` > `API Tokens Settings` > check
`Legacy Tokens` > `Save`.
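
As a sketch of how the legacy token might then be used, the snippet below builds a request for the Label Studio project list. It assumes the `Authorization: Token <token>` header scheme that Label Studio documents for legacy tokens and the `/api/projects` endpoint. `buildProjectsRequest` is a hypothetical helper, not a function from this repository, and the `ner-processing` scripts may connect differently.

```ts
// Hypothetical helper for illustration; not taken from the repository.
interface ApiRequest {
  url: string;
  headers: Record<string, string>;
}

function buildProjectsRequest(baseUrl: string, legacyToken: string): ApiRequest {
  // Drop trailing slashes so the path is joined with exactly one "/".
  const trimmed = baseUrl.replace(/\/+$/, '');
  return {
    url: `${trimmed}/api/projects`,
    // Assumption: legacy tokens use the "Token" scheme rather than "Bearer".
    headers: { Authorization: `Token ${legacyToken}` },
  };
}
```

The result can be passed to an HTTP client, for example `fetch(request.url, { headers: request.headers })`, to check that the token is accepted.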

#### Label Procedure

> [!NOTE]
> Instead of labeling all the sentences in the corpus at once, we should **chunk
> the corpus into genres** and then label each genre separately. This helps
> reduce the complexity of the labeling process and makes it easier to manage.

The label procedure is as follows:

1.  Extract NER tasks:
    - Use the script
      [`src/ner-processing/extract-ner-task.ts`](./src/ner-processing/extract-ner-task.ts)
      to extract NER tasks from the corpus data tree.
    - The script reads the JSON corpus data tree from `dist/corpus` **by genre**
      and writes the output to `dist/task-data`.
    - The output structure:

      ```
      dist/task-data
      └── <genre>
          ├── <domain><subDomain><genre>_fff.ccc.json
          └── ...
      ```

    - The data is stored in JSON format, which is compatible with Label Studio.
      It may contain annotated data from previous labeling sessions; these will
      be imported as ground truth data.

    - Sample data:

      ```json
      [
        {
          "data": {
            "text": "Đây là gia phả Đức Giê-su Ki-tô, con cháu vua Đa-vít, con cháu tổ phụ Áp-ra-ham :",
            "documentId": "RCN_001",
            "chapterId": "RCN_001.001",
            "sentenceId": "RCN_001.001.001.01",
            "sentenceType": "single",
            "title": "Phúc Âm theo Thánh Mát-thêu",
            "genreCode": "N"
          },
          "annotations": [
            {
              "result": [
                {
                  "value": {
                    "start": 15,
                    "end": 31,
                    "text": "Đức Giê-su Ki-tô",
                    "labels": ["PER"]
                  },
                  "from_name": "label",
                  "to_name": "text",
                  "type": "labels"
                },
                {
                  "value": {
                    "start": 42,
                    "end": 52,
                    "text": "vua Đa-vít",
                    "labels": ["PER"]
                  },
                  "from_name": "label",
                  "to_name": "text",
                  "type": "labels"
                },
                {
                  "value": {
                    "start": 63,
                    "end": 79,
                    "text": "tổ phụ Áp-ra-ham",
                    "labels": ["PER"]
                  },
                  "from_name": "label",
                  "to_name": "text",
                  "type": "labels"
                }
              ]
            }
          ]
        }
      ]
      ```

> [!NOTE]
> If the task has not been labeled yet, don't add any annotations to the
> task data; otherwise, they will be treated as ground truth data.

2.  Import NER tasks to Label Studio:
    - Import the NER tasks by creating a new project and selecting the
      `dist/task-data` folder as the data source, or (**recommended**) use the
      script
      [`src/ner-processing/import-ner-task.ts`](./src/ner-processing/import-ner-task.ts)
      to import them through the Label Studio API.

3.  Label NER tasks:
    - Use Label Studio to label the NER tasks. The labeling interface is
      configured in the project settings, which are described in the
      [v-bible/nlp-label-studio](https://github.com/v-bible/nlp-label-studio)
      repository.

4.  Export NER labels:
    - Use the script
      [`src/ner-processing/export-ner-task.ts`](./src/ner-processing/export-ner-task.ts)
      to export the NER tasks from Label Studio to `dist/task-data`.

5.  Inject annotations into the data tree:
    - Use the script
      [`src/ner-processing/inject-annotation.ts`](./src/ner-processing/inject-annotation.ts)
      to inject the annotations from `dist/task-data` into the corpus data tree
      in `dist/corpus`.
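
Because the sample task data in step 1 stores each annotation as a `start`/`end` offset pair plus the covered `text`, a simple consistency check before importing or injecting can catch shifted offsets and unknown labels. The following is a minimal sketch. `findSpanErrors` and `SpanValue` are hypothetical names, not part of the `ner-processing` scripts, and the check assumes offsets are counted the same way as `String.prototype.slice` counts them (UTF-16 code units).

```ts
// Hypothetical helper for illustration; not taken from the repository.
// Labels as listed in the entity label table.
const NER_LABELS: readonly string[] = ['PER', 'LOC', 'ORG', 'TITLE', 'TME', 'NUM'];

interface SpanValue {
  start: number;
  end: number;
  text: string;
  labels: string[];
}

function findSpanErrors(text: string, spans: SpanValue[]): string[] {
  const errors: string[] = [];
  spans.forEach((span, index) => {
    // The substring at [start, end) must equal the annotated text.
    const actual = text.slice(span.start, span.end);
    if (actual !== span.text) {
      errors.push(`span ${index}: expected "${span.text}" but found "${actual}"`);
    }
    span.labels.forEach((label) => {
      if (NER_LABELS.indexOf(label) === -1) {
        errors.push(`span ${index}: unknown label "${label}"`);
      }
    });
  });
  return errors;
}
```

An empty result means every span matches its text and uses a known label; each `value` object under `annotations[].result[]` in the task data can be passed in as a `SpanValue`.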

## :wave: Contributing



  



Contributions are always welcome!

Please read the [contribution guidelines](./CONTRIBUTING.md).

### :scroll: Code of Conduct

Please read the [Code of Conduct](./CODE_OF_CONDUCT.md).

## :warning: License

This project is licensed under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** License.

[![License: CC BY-NC-SA 4.0](https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png)](https://creativecommons.org/licenses/by-nc-sa/4.0/).

See the **[LICENSE.md](./LICENSE.md)** file for full details.

## :handshake: Contact

Duong Vinh - [@duckymomo20012](https://twitter.com/duckymomo20012) -
tienvinh.duong4@gmail.com

Project Link: [https://github.com/v-bible/crawler](https://github.com/v-bible/crawler).