{"id":49177242,"url":"https://github.com/v-bible/crawler","last_synced_at":"2026-04-22T23:36:20.475Z","repository":{"id":298138060,"uuid":"997699584","full_name":"v-bible/crawler","owner":"v-bible","description":"A collection of web crawlers to crawl Catholic resources in Vietnamese language","archived":false,"fork":false,"pushed_at":"2026-03-26T09:44:52.000Z","size":937,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-27T00:50:27.303Z","etag":null,"topics":["catholic","corpus-linguistics","crawler","nlp","playwright"],"latest_commit_sha":null,"homepage":"https://huggingface.co/datasets/v-bible/catholic-resources","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/v-bible.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-06-07T01:57:19.000Z","updated_at":"2026-03-26T04:12:18.000Z","dependencies_parsed_at":"2025-06-09T16:34:38.339Z","dependency_job_id":"8a0fa6bc-1e4f-4fba-9475-9388556f831b","html_url":"https://github.com/v-bible/crawler","commit_stats":null,"previous_names":["v-bible/nlp","v-bible/crawler"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/v-bible/crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/v-bible%2Fcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/v-bible%2Fcrawler/tags","re
leases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/v-bible%2Fcrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/v-bible%2Fcrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/v-bible","download_url":"https://codeload.github.com/v-bible/crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/v-bible%2Fcrawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32159959,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-22T17:06:48.269Z","status":"ssl_error","status_checked_at":"2026-04-22T17:06:19.037Z","response_time":58,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["catholic","corpus-linguistics","crawler","nlp","playwright"],"created_at":"2026-04-22T23:36:18.626Z","updated_at":"2026-04-22T23:36:20.074Z","avatar_url":"https://github.com/v-bible.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n  \u003ch1\u003eCrawler\u003c/h1\u003e\n\n  \u003cp\u003e\n    A collection of web crawlers to crawl Catholic resources in Vietnamese\n    language\n  \u003c/p\u003e\n\n\u003c!-- Badges --\u003e\n\u003cp\u003e\n  \u003ca 
href=\"https://github.com/v-bible/crawler/graphs/contributors\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/contributors/v-bible/crawler\" alt=\"contributors\" /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/last-commit/v-bible/crawler\" alt=\"last update\" /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/v-bible/crawler/network/members\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/forks/v-bible/crawler\" alt=\"forks\" /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/v-bible/crawler/stargazers\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/stars/v-bible/crawler\" alt=\"stars\" /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/v-bible/crawler/issues/\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/issues/v-bible/crawler\" alt=\"open issues\" /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/v-bible/crawler/blob/main/LICENSE.md\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/license/v-bible/crawler.svg\" alt=\"license\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003ch4\u003e\n    \u003ca href=\"https://github.com/v-bible/crawler/\"\u003eView Demo\u003c/a\u003e\n  \u003cspan\u003e · \u003c/span\u003e\n    \u003ca href=\"https://github.com/v-bible/crawler\"\u003eDocumentation\u003c/a\u003e\n  \u003cspan\u003e · \u003c/span\u003e\n    \u003ca href=\"https://github.com/v-bible/crawler/issues/\"\u003eReport Bug\u003c/a\u003e\n  \u003cspan\u003e · \u003c/span\u003e\n    \u003ca href=\"https://github.com/v-bible/crawler/issues/\"\u003eRequest Feature\u003c/a\u003e\n  \u003c/h4\u003e\n\u003c/div\u003e\n\n\u003cbr /\u003e\n\n\u003c!-- Table of Contents --\u003e\n\n# :notebook_with_decorative_cover: Table of Contents\n\n- [About the Project](#star2-about-the-project)\n  - [Environment Variables](#key-environment-variables)\n- [Getting Started](#toolbox-getting-started)\n  - 
[Prerequisites](#bangbang-prerequisites)\n  - [Run Locally](#running-run-locally)\n- [Usage](#eyes-usage)\n  - [CLI Usage](#cli-usage)\n  - [Library Usage](#library-usage)\n  - [Category Guidelines](#category-guidelines)\n    - [Category ID](#category-id)\n    - [Folder Structure](#folder-structure)\n    - [Category References](#category-references)\n  - [Document Metadata](#document-metadata)\n  - [Named Entity Recognition (NER)](#named-entity-recognition-ner)\n    - [Entity Label Categories](#entity-label-categories)\n    - [Setup Label Tools](#setup-label-tools)\n    - [Getting API Token](#getting-api-token)\n    - [Label Procedure](#label-procedure)\n- [Contributing](#wave-contributing)\n  - [Code of Conduct](#scroll-code-of-conduct)\n- [License](#warning-license)\n- [Contact](#handshake-contact)\n\n\u003c!-- About the Project --\u003e\n\n## :star2: About the Project\n\n\u003c!-- Env Variables --\u003e\n\n### :key: Environment Variables\n\nTo run this project, you will need to add the following environment variables to\nyour `.env` file:\n\n- **App configs:**\n  - `LOG_LEVEL`: Log level.\n\n- **Label Studio configs:**\n  - `LABEL_STUDIO_URL`: URL of the Label Studio instance. E.g.:\n    `http://localhost:8080`.\n  - `LABEL_STUDIO_LEGACY_TOKEN`: Legacy token for Label Studio API. You can\n    generate it in the Label Studio settings page.\n  - `LABEL_STUDIO_PROJECT_TITLE`: Title of the Label Studio project. 
This is used to\n    import and export NER tasks by the `src/ner-processing/import-ner-task.ts` and\n    `src/ner-processing/export-ner-task.ts` scripts.\n\n\u003e [!NOTE]\n\u003e These Label Studio variables are only required for the `ner-processing` scripts\n\u003e to connect to a Label Studio instance.\n\nE.g.:\n\n```\n# .env\nLOG_LEVEL=info\n\nLABEL_STUDIO_URL=http://localhost:8080\nLABEL_STUDIO_LEGACY_TOKEN=eyJhb***\nLABEL_STUDIO_PROJECT_TITLE=v-bible\n```\n\nYou can also check out the file `.env.example` to see all required environment\nvariables.\n\n\u003c!-- Getting Started --\u003e\n\n## :toolbox: Getting Started\n\n\u003c!-- Prerequisites --\u003e\n\n### :bangbang: Prerequisites\n\n- This project uses [pnpm](https://pnpm.io/) as its package manager:\n\n  ```bash\n  npm install --global pnpm\n  ```\n\n- Playwright: Run the following command to download new browser binaries:\n\n  ```bash\n  npx playwright install\n  ```\n\n- `asdf` environment: Please set up `asdf` to install the dependencies\n  specified in the `.tool-versions` file.\n  - nodejs: `https://github.com/asdf-vm/asdf-nodejs.git`.\n\n\u003c!-- Run Locally --\u003e\n\n### :running: Run Locally\n\nClone the project:\n\n```bash\ngit clone https://github.com/v-bible/crawler.git\n```\n\nGo to the project directory:\n\n```bash\ncd crawler\n```\n\nInstall dependencies:\n\n```bash\npnpm install\n```\n\n\u003c!-- Usage --\u003e\n\n## :eyes: Usage\n\n\u003e [!NOTE]\n\u003e **This package can be used both as a CLI tool and as a library.**\n\u003e\n\u003e - **CLI**: Run commands directly from the terminal for convenient web crawling\n\u003e - **Library**: Import and use crawler functions in your TypeScript/JavaScript code\n\n### CLI Usage\n\nThe crawler provides a command-line interface for crawling websites easily.\n\n**Basic usage:**\n\n```bash\n# Crawl a specific site\ncrawler crawl --site thanhlinh.net\n\n# Crawl all sites\ncrawler crawl --site all\n\n# Crawl with verbose logging\ncrawler crawl --site 
augustino.net --verbose\n\n# Crawl with custom timeout (in milliseconds)\ncrawler crawl --site conggiao.org --timeout 900000\n```\n\n**Available sites:**\n\n- `augustino.net`\n- `conggiao.org`\n- `dongten.net`\n- `hdgmvietnam.com`\n- `ktcgkpv.org`\n- `rongmotamhon.net`\n- `thanhlinh.net`\n- `all` - Crawl all available sites\n\n**Command flags:**\n\n```\nFLAGS\n     [--site]                 Site to crawl. Available: augustino.net, conggiao.org,\n                              dongten.net, hdgmvietnam.com, ktcgkpv.org, rongmotamhon.net,\n                              thanhlinh.net, or 'all' for all sites [default = all]\n     [--timeout]              Timeout in milliseconds for each crawl operation\n     [--verbose/--noVerbose]  Enable verbose logging\n  -h  --help                  Print help information and exit\n```\n\n**Bash completion:**\n\nInstall bash completion to get auto-completion for commands and flags:\n\n```bash\n# Install bash completion\ncrawler install\n\n# Uninstall bash completion\ncrawler uninstall\n```\n\n### Library Usage\n\nYou can also use the crawler programmatically by importing the site modules directly:\n\n```ts\nimport { crawler as thanhlinhCrawler } from './src/sites/thanhlinh.net/main';\n\n// Run the crawler\nawait thanhlinhCrawler.run();\n```\n\n**Direct execution:**\n\nEach site crawler can also be run directly using tsx:\n\n```bash\nnpx tsx src/sites/thanhlinh.net/main.ts\n```\n\nThis provides flexibility to run crawlers either through the CLI or programmatically.\n\n### Category Guidelines\n\n#### Category ID\n\n- **Sentence ID**: Each sentence ID **MUST** have the following format:\n\n  ```\n  \u003cdomain\u003e\u003csubDomain\u003e\u003cgenre\u003e_fff.ccc.ppp.ss\n  ```\n\n  - `domain`: Domain code. **Format**: 1 character, in uppercase. E.g: `R`.\n  - `subDomain`: Subdomain code. **Format**: 1 character, in uppercase. E.g:\n    `C`.\n  - `genre`: Genre code. **Format**: 1 character, in uppercase. 
E.g: `D`.\n  - `documentNumber` (`fff`): Document number of `genre`. **Format**: 3 digits,\n    starting from `001`.\n  - `chapterNumber` (`ccc`): Chapter number of `documentNumber`. **Format**: 3\n    digits, starting from `001`.\n  - `pageNumber` (`ppp`):\n    - For text based data it is the paragraph number of `chapterNumber`.\n    - For OCR data it is the page number of `chapterNumber`.\n    - **Format**: 3 digits, starting from `001`.\n  - `sentenceNumber` (`ss`): Sentence number of `pageNumber`. **Format**: 2\n    digits, starting from `01`.\n\n- **File ID**: Each file ID **MUST** have the following format:\n\n  ```\n  \u003cdomain\u003e\u003csubDomain\u003e\u003cgenre\u003e_fff.ccc.xml\n  ```\n\n  - `domain`: domain code. **Format**: 1 character, in uppercase. E.g: `R`.\n  - `subDomain`: subdomain code. **Format**: 1 character, in uppercase. E.g:\n    `C`.\n  - `genre`: genre code. **Format**: 1 character, in uppercase. E.g: `D`.\n  - `documentNumber` (`fff`): document number of `genre`. **Format**: 3 digits,\n    starting from `001`.\n  - `chapterNumber` (`ccc`): chapter number of `documentNumber`. 
**Format**: 3\n    digits, starting from `001`.\n\n#### Folder Structure\n\n\u003e [!NOTE]\n\u003e Data is stored in the Hugging Face dataset:\n\u003e [catholic-resources](https://huggingface.co/datasets/v-bible/catholic-resources).\n\n\u003e [!NOTE]\n\u003e Data is stored as folders of `\u003cgenre\u003e`s instead of\n\u003e `corpus/\u003cdomain\u003e/\u003csubDomain\u003e/\u003cgenre\u003e`, because this repository only stores\n\u003e **Catholic resources** (`RC`).\n\n```\ncatholic-resources\n└── corpus\n    └── \u003cgenre\u003e\n        └── \u003cdomain\u003e\u003csubDomain\u003e\u003cgenre\u003e_fff (\u003cdocumentTitle\u003e)\n            ├── \u003cdomain\u003e\u003csubDomain\u003e\u003cgenre\u003e_fff.ccc.xml\n            ├── \u003cdomain\u003e\u003csubDomain\u003e\u003cgenre\u003e_fff.ccc.json\n            ├── \u003cdomain\u003e\u003csubDomain\u003e\u003cgenre\u003e_fff.ccc.md\n            └── ...\n```\n\n- `documentTitle`: The title of the document, which is used to identify the\n  document.\n\n#### Category References\n\n\u003e [!NOTE]\n\u003e Any changes to the category references should be reflected in the\n\u003e [`src/mapping.ts`](./src/mapping.ts) file.\n\n- Domains:\n\n| code | category | vietnamese |\n| :--: | :------: | :--------: |\n|  R   | religion |  Tôn giáo  |\n\n- Subdomains:\n\n| code | category | vietnamese |\n| :--: | :------: | :--------: |\n|  C   | catholic | Công Giáo  |\n\n- Genres:\n\n\u003e [!NOTE]\n\u003e Genres with no category are **reserved** for future use.\n\n| code |           category            |        vietnamese        |\n| :--: | :---------------------------: | :----------------------: |\n|  A   |     advent contemplation      |    Suy niệm Mùa Vọng     |\n|  B   |                               |                          |\n|  C   |          catechesis           |    Giáo lý/Giáo huấn     |\n|  D   |        church document        |    Văn kiện Giáo Hội     |\n|  E   |      exegesis/commentary      |    Chú 
giải/Bình luận    |\n|  F   |      lent contemplation       |    Suy niệm Mùa Chay     |\n|  G   |     easter contemplation      |  Suy niệm Mùa Phục Sinh  |\n|  H   |       ot contemplation        | Suy niệm Mùa Thường Niên |\n|  I   |      other contemplation      |      Suy niệm khác       |\n|  J   |                               |                          |\n|  K   |                               |                          |\n|  L   |          liturgical           |         Phụng vụ         |\n|  M   |            memoir             |          Hồi ký          |\n|  N   |         new testament         |    Kinh Thánh Tân Ước    |\n|  O   |         old testament         |    Kinh Thánh Cựu Ước    |\n|  P   |            prayer             |        Cầu nguyện        |\n|  Q   |                               |                          |\n|  R   |                               |                          |\n|  S   | saint/beatification biography | Tiểu sử Thánh/Chân phước |\n|  T   |           theology            |         Thần học         |\n|  U   |                               |                          |\n|  V   |                               |                          |\n|  W   |                               |                          |\n|  X   |    christmas contemplation    | Suy niệm Mùa Giáng Sinh  |\n|  Y   |          philosophy           |        Triết học         |\n|  Z   |            others             |           Khác           |\n\n- Tags:\n\n\u003e [!NOTE]\n\u003e Tags are used to further classify the genres. Currently, they are not used to\n\u003e construct the sentence ID. 
However, this information is stored in the metadata\n\u003e of the sentence.\n\n\u003cdetails\u003e\n\u003csummary\u003eTag references\u003c/summary\u003e\n\n| code |                                        category                                         |                                 vietnamese                                  |\n| :--: | :-------------------------------------------------------------------------------------: | :-------------------------------------------------------------------------: |\n|      |                                 apostolic constitution                                  |                        Tông hiến (Văn kiện Giáo Hội)                        |\n|      |                                    encyclical letter                                    |                       Thông điệp (Văn kiện Giáo Hội)                        |\n|      |                                    apostolic letter                                     |                        Tông thư (Văn kiện Giáo Hội)                         |\n|      |                                      declarations                                       |                       Tuyên ngôn (Văn kiện Giáo Hội)                        |\n|      |                                      motu proprio                                       |                Tài liệu dưới dạng tự sắc (Văn kiện Giáo Hội)                |\n|      |                                 apostolic exhortations                                  |                        Tông huấn (Văn kiện Giáo Hội)                        |\n|      |                                      note document                                      |                         Ghi chú (Văn kiện Giáo Hội)                         |\n|      |                                  urbi et orbi message                                   |              Sứ điệp Giáng Sinh/Phục Sinh (Văn kiện Giáo Hội)               |\n|      |                                      constitution    
                                   |                        Hiến chế (Văn kiện Giáo Hội)                         |\n|      |                                         decrees                                         |                        Sắc lệnh (Văn kiện Giáo Hội)                         |\n|      |                                  instrumentum laboris                                   |                    Tài liệu làm việc (Văn kiện Giáo Hội)                    |\n|      |                                  synod of bishops note                                  |          Ghi chú của Thượng Hội đồng Giám mục (Văn kiện Giáo Hội)           |\n|      |                                         letters                                         |                           Thư (Văn kiện Giáo Hội)                           |\n|      |                                        messages                                         |                         Sứ điệp (Văn kiện Giáo Hội)                         |\n|      |                                bible pentateuch division                                |                        Ngũ Thư (Kinh Thánh Cựu Ước)                         |\n|      |                             bible historical books division                             |                        Lịch Sử (Kinh Thánh Cựu Ước)                         |\n|      |                           bible poetic/wisdom books division                            |                 Giáo huấn - Khôn ngoan (Kinh Thánh Cựu Ước)                 |\n|      |                             bible prophetic books division                              |                   Ngôn sứ - Tiên tri (Kinh Thánh Cựu Ước)                   |\n|      |                                 bible gospels division                                  |                Sách Phúc Âm - Tin Mừng (Kinh Thánh Tân Ước)                 |\n|      |                                   bible acts division                                   |      
            Sách Công vụ Tông đồ (Kinh Thánh Tân Ước)                  |\n|      |                             bible pauline letters division                              |            Các thư mục vụ của Thánh Phao-lô (Kinh Thánh Tân Ước)            |\n|      |                             bible general epistles division                             |                     Các thư chung (Kinh Thánh Tân Ước)                      |\n|      |                                bible revelation division                                |                    Sách Khải Huyền (Kinh Thánh Tân Ước)                     |\n|      |                               morning and evening prayers                               |         Các kinh đọc sáng tối ngày thường và Chúa Nhật (Cầu nguyện)         |\n|      | offertory prayer, prayers of preparation for holy communion and prayers of thanksgiving | Kinh dâng lễ, những kinh dọn mình chịu lễ và những kinh cám ơn (Cầu nguyện) |\n|      |                            the stations of the cross prayers                            |         Kinh ngắm Đàng Thánh giá và ít nhiều kinh khác (Cầu nguyện)         |\n|      |                                     rosary prayers                                      |                      Phép lần hạt Mân Côi (Cầu nguyện)                      |\n|      |                                         prayers                                         |                            Kinh cầu (Cầu nguyện)                            |\n|      |                                      daily prayers                                      |                       Kinh đọc hàng ngày (Cầu nguyện)                       |\n|      |                                          maria                                          |                                  Mẹ Maria                                   |\n|      |                                         advent                                          |                                  Mùa Vọng      
                             |\n|      |                                        christmas                                        |                               Mùa Giáng Sinh                                |\n|      |                                          lent                                           |                                  Mùa Chay                                   |\n|      |                                         triduum                                         |                            Mùa Chay - Tuần Thánh                            |\n|      |                                         easter                                          |                                Mùa Phục Sinh                                |\n|      |                                           ot                                            |                               Mùa Thường niên                               |\n|      |                                      celebrations                                       |                                   Lễ lớn                                    |\n|      |                                      jubilee year                                       |                                  Năm Thánh                                  |\n\n\u003c/details\u003e\n\n### Document Metadata\n\nDocument Metadata is stored in the [data/main.tsv](./data/main.tsv), which is\ndownloaded from Google Sheets [\\[NLP\\] Danh sách tài liệu Công\ngiáo](https://docs.google.com/spreadsheets/d/1YETFmWnGOM1E2Z0pLMkxkqHczyzCmIbqptQ25VFuBGM/edit?usp=sharing)\nand is updated periodically.\n\n\u003e [!IMPORTANT]\n\u003e Only export file to `TSV` format, do not export to `CSV` format.\n\n### Named Entity Recognition (NER)\n\n#### Entity Label Categories\n\n\u003e [!NOTE]\n\u003e Any changes to the category references should be reflected in the\n\u003e [`src/lib/ner/mapping.ts`](./src/lib/ner/mapping.ts) file.\n\n| label |   category   |              examples              
|          vietnamese examples           |\n| :---: | :----------: | :--------------------------------: | :------------------------------------: |\n|  PER  |    person    | Jesus, Mary, Peter, Paul, John,... | Giêsu, Maria, Phêrô, Phaolô, Gioan,... |\n|  LOC  |   location   |   Jerusalem, Rome, Bethlehem,...   |      Giêrusalem, Rôma, Bêlem,...       |\n|  ORG  | organization |    Vatican, Catholic Church,...    |    Vatican, Giáo Hội Công Giáo,...     |\n| TITLE |    title     |     Pope, Bishop, Cardinal,...     |    Giáo hoàng, Giám mục, Hồng y,...    |\n|  TME  |     time     |    Sunday, Monday, January,...     |  Chúa Nhật, Thứ Hai, Tháng Giêng,...   |\n|  NUM  |    number    |         1, 2, 3, 4, 5,...          |           1, 2, 3, 4, 5,...            |\n\n#### Setup Label Tools\n\nPlease use [Label Studio](https://labelstud.io/) to label the NER data. Please\nrefer to the\n[v-bible/nlp-label-studio](https://github.com/v-bible/nlp-label-studio) repository\nfor instructions on setting up Label Studio.\n\n#### Getting API Token\n\nTo get a Label Studio legacy API token, go to\n`http://localhost:8080/organization` \u003e `API Tokens Settings` \u003e Check `Legacy\nTokens` \u003e `Save`.\n\n#### Label Procedure\n\n\u003e [!NOTE]\n\u003e Instead of labeling all the sentences in the corpus at once, **chunk\n\u003e the corpus into genres** and label each genre separately. This reduces\n\u003e the complexity of the labeling process and makes it easier to manage.\n\nThe labeling procedure is as follows:\n\n1.  
Extract NER tasks:\n    - Use the script\n      [`src/ner-processing/extract-ner-task.ts`](./src/ner-processing/extract-ner-task.ts) to\n      extract NER tasks from the corpus data tree.\n\n    - It reads the JSON corpus data tree from `dist/corpus` **by genre** and writes\n      the output to `dist/task-data`.\n\n    - The output structure:\n\n      ```\n      dist/task-data\n      └── \u003cgenre\u003e\n          ├── \u003cdomain\u003e\u003csubDomain\u003e\u003cgenre\u003e_fff.ccc.json\n          └── ...\n      ```\n\n    - The data is stored in JSON format, which is compatible with Label Studio.\n      It may contain annotated data from previous labeling sessions; these will\n      be imported as ground truth data.\n\n    - Sample data:\n\n      ```json\n      [\n        {\n          \"data\": {\n            \"text\": \"Đây là gia phả Đức Giê-su Ki-tô, con cháu vua Đa-vít, con cháu tổ phụ Áp-ra-ham :\",\n            \"documentId\": \"RCN_001\",\n            \"chapterId\": \"RCN_001.001\",\n            \"sentenceId\": \"RCN_001.001.001.01\",\n            \"sentenceType\": \"single\",\n            \"title\": \"Phúc Âm theo Thánh Mát-thêu\",\n            \"genreCode\": \"N\"\n          },\n          \"annotations\": [\n            {\n              \"result\": [\n                {\n                  \"value\": {\n                    \"start\": 15,\n                    \"end\": 31,\n                    \"text\": \"Đức Giê-su Ki-tô\",\n                    \"labels\": [\"PER\"]\n                  },\n                  \"from_name\": \"label\",\n                  \"to_name\": \"text\",\n                  \"type\": \"labels\"\n                },\n                {\n                  \"value\": {\n                    \"start\": 42,\n                    \"end\": 52,\n                    \"text\": \"vua Đa-vít\",\n                    \"labels\": [\"PER\"]\n                  },\n                  \"from_name\": \"label\",\n                  \"to_name\": \"text\",\n                  
\"type\": \"labels\"\n                },\n                {\n                  \"value\": {\n                    \"start\": 63,\n                    \"end\": 79,\n                    \"text\": \"tổ phụ Áp-ra-ham\",\n                    \"labels\": [\"PER\"]\n                  },\n                  \"from_name\": \"label\",\n                  \"to_name\": \"text\",\n                  \"type\": \"labels\"\n                }\n              ]\n            }\n          ]\n        }\n      ]\n      ```\n\n\u003e [!NOTE]\n\u003e If the task has not been labeled yet, don't add any annotations to the\n\u003e task data, else it will be considered as ground truth data.\n\n2.  Import NER tasks to Label Studio:\n    - Import the NER tasks by creating a new project and selecting the `dist/task-data`\n      folder as the data source, or use script\n      [`src/ner-processing/import-ner-task.ts`](./src/ner-processing/import-ner-task.ts)\n      (**recommended**)\n      to import NER tasks to Label Studio using Label Studio API.\n\n3.  Label NER tasks:\n    - Use Label Studio to label the NER tasks. The labeling interface is\n      configured in the project settings, which is described in the\n      [v-bible/nlp-label-studio](https://github.com/v-bible/nlp-label-studio)\n      repository.\n\n4.  Export NER labels:\n    - Use script\n      [`src/ner-processing/export-ner-task.ts`](./src/ner-processing/export-ner-task.ts)\n      to export the NER tasks from Label Studio to `dist/task-data`.\n\n5.  
Inject annotations to data tree:\n    - Use script\n      [`src/ner-processing/inject-annotation.ts`](./src/ner-processing/inject-annotation.ts)\n      to inject the annotations from `dist/task-data` to the corpus data tree in\n      `dist/corpus`.\n\n\u003c!-- Contributing --\u003e\n\n## :wave: Contributing\n\n\u003ca href=\"https://github.com/v-bible/crawler/graphs/contributors\"\u003e\n  \u003cimg src=\"https://contrib.rocks/image?repo=v-bible/crawler\" /\u003e\n\u003c/a\u003e\n\nContributions are always welcome!\n\nPlease read the [contribution guidelines](./CONTRIBUTING.md).\n\n\u003c!-- Code of Conduct --\u003e\n\n### :scroll: Code of Conduct\n\nPlease read the [Code of Conduct](./CODE_OF_CONDUCT.md).\n\n\u003c!-- License --\u003e\n\n## :warning: License\n\nThis project is licensed under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** License.\n\n[![License: CC BY-NC-SA 4.0](https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png)](https://creativecommons.org/licenses/by-nc-sa/4.0/).\n\nSee the **[LICENSE.md](./LICENSE.md)** file for full details.\n\n\u003c!-- Contact --\u003e\n\n## :handshake: Contact\n\nDuong Vinh - [@duckymomo20012](https://twitter.com/duckymomo20012) -\ntienvinh.duong4@gmail.com\n\nProject Link: [https://github.com/v-bible/crawler](https://github.com/v-bible/crawler).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fv-bible%2Fcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fv-bible%2Fcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fv-bible%2Fcrawler/lists"}