{"id":15563718,"url":"https://github.com/glutexo/onigumo","last_synced_at":"2025-10-07T11:30:54.256Z","repository":{"id":42045923,"uuid":"201752196","full_name":"Glutexo/onigumo","owner":"Glutexo","description":"Parallel web scraping framework","archived":false,"fork":false,"pushed_at":"2025-01-04T19:46:26.000Z","size":344,"stargazers_count":3,"open_issues_count":63,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-01-19T20:59:52.862Z","etag":null,"topics":["crawler"],"latest_commit_sha":null,"homepage":"","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Glutexo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-11T10:48:54.000Z","updated_at":"2024-12-17T16:51:08.000Z","dependencies_parsed_at":"2023-02-19T01:30:48.587Z","dependency_job_id":"f84bea65-09ef-4857-aa76-acbad1c1cbfc","html_url":"https://github.com/Glutexo/onigumo","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Glutexo%2Fonigumo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Glutexo%2Fonigumo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Glutexo%2Fonigumo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Glutexo%2Fonigumo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Glutexo","download_url":"https://codeload.github.com/Glutexo/onigumo/tar.gz/refs/heads/maste
r","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235621547,"owners_count":19019520,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler"],"created_at":"2024-10-02T16:25:07.553Z","updated_at":"2025-10-07T11:30:53.964Z","avatar_url":"https://github.com/Glutexo.png","language":"Elixir","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Onigumo #\n\n## About ##\n\nOnigumo is yet another web-crawler. It “crawls” websites or webapps, storing their data in a structured form suitable for further machine processing.\n\n## Architecture ##\n\nThe crawling part of Onigumo is composed of three sequentially interconnected components:\n\n* [the Operator](#operator),\n* [the Downloader](#downloader),\n* [the Parser](#parser).\n\nThe flowcharts below illustrate the flow of data between those parts:\n\n```mermaid\nflowchart LR\n    subgraph Crawling\n        direction BT\n        spider_parser(🕷️ PARSER)\n        spider_operator(🕷️ OPERATOR)\n        onigumo_downloader[DOWNLOADER]\n    end\n\n    start([START]) --\u003e onigumo_feeder[FEEDER]\n\n    onigumo_feeder -- .raw --\u003e Crawling\n    onigumo_feeder -- .urls --\u003e Crawling\n    onigumo_feeder -- .json --\u003e Crawling\n\n    Crawling --\u003e spider_materializer(🕷️ MATERIALIZER)\n\n    spider_materializer --\u003e done([END])\n\n    spider_operator -. \"\u003chash\u003e.urls\" .-\u003e onigumo_downloader\n    onigumo_downloader -. \"\u003chash\u003e.raw\" .-\u003e spider_parser\n    spider_parser -. 
\"\u003chash\u003e.json\" .-\u003e spider_operator\n```\n\n```mermaid\nflowchart LR\n    subgraph \"🕷️ Spider\"\n        direction TB\n        spider_parser(PARSER)\n        spider_operator(OPERATOR)\n        spider_materializer(MATERIALIZER)\n    end\n\n    subgraph Onigumo\n        onigumo_feeder[FEEDER]\n        onigumo_downloader[DOWNLOADER]\n    end\n\n    onigumo_feeder -- .json --\u003e spider_operator\n    onigumo_feeder -- .urls --\u003e onigumo_downloader\n    onigumo_feeder -- .raw --\u003e spider_parser\n\n    spider_parser -. \"\u003chash\u003e.json\" .-\u003e spider_operator\n    onigumo_downloader -. \"\u003chash\u003e.raw\" .-\u003e spider_parser\n    spider_operator -. \"\u003chash\u003e.urls\" .-\u003e onigumo_downloader\n\n    spider_operator ---\u003e spider_materializer\n```\n\n### Operator ###\n\nThe Operator determines URL addresses for the Downloader. A Spider is responsible for adding the URLs, which it gets from the structured form of the data provided by the Parser.\n\nThe Operator’s job is to:\n\n1. initialize a Spider,\n2. extract new URLs from structured data,\n3. insert those URLs onto the Downloader queue.\n\n### Downloader ###\n\nThe Downloader fetches and saves the contents and metadata from the unprocessed URL addresses.\n\nThe Downloader’s job is to:\n\n1. read URLs for download,\n2. check for already downloaded URLs,\n3. fetch the URLs’ contents along with their metadata,\n4. save the downloaded data.\n\n### Parser ###\n\nThe Parser processes the downloaded content and metadata into a structured form.\n\nThe Parser’s job is to:\n\n1. check for downloaded URLs awaiting processing,\n2. process the content and metadata of the downloaded URLs into a structured form,\n3. save the structured data.\n\n## Applications (Spiders) ##\n\nA spider extracts the required information from the structured data.\n\nThe nature of the output data or information depends on the user’s needs and on the form of the web content. 
It is impossible to create a universal spider that meets every requirement arising from the combination of the two. For this reason, you need to write your own spider.\n\n### Materializer ###\n\n## Usage ##\n\n## Credits ##\n\n© [Glutexo](https://github.com/Glutexo), [nappex](https://github.com/nappex) 2019 – 2022\n\nLicensed under the [MIT license](LICENSE.txt).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglutexo%2Fonigumo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fglutexo%2Fonigumo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglutexo%2Fonigumo/lists"}