Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Parallel web scraping framework
- Host: GitHub
- URL: https://github.com/glutexo/onigumo
- Owner: Glutexo
- License: mit
- Created: 2019-08-11T10:48:54.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-11-09T20:17:24.000Z (about 2 months ago)
- Last Synced: 2024-12-16T20:39:58.900Z (15 days ago)
- Topics: crawler
- Language: Elixir
- Homepage:
- Size: 335 KB
- Stars: 3
- Watchers: 4
- Forks: 1
- Open Issues: 61
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# Onigumo #
## About ##
Onigumo is yet another web crawler. It “crawls” websites or web apps, storing their data in a structured form suitable for further machine processing.
## Architecture ##
The crawling part of Onigumo is composed of three sequentially interconnected components:
* [the Operator](#operator),
* [the Downloader](#downloader),
* [the Parser](#parser).

The flowcharts below illustrate the flow of data between those parts:
```mermaid
flowchart LR
    subgraph Crawling
        direction BT
        spider_parser(🕷️ PARSER)
        spider_operator(🕷️ OPERATOR)
        onigumo_downloader[DOWNLOADER]
    end

    start([START]) --> onigumo_feeder[FEEDER]
    onigumo_feeder -- .raw --> Crawling
    onigumo_feeder -- .urls --> Crawling
    onigumo_feeder -- .json --> Crawling
    Crawling --> spider_materializer(🕷️ MATERIALIZER)
    spider_materializer --> done([END])

    spider_operator -. ".urls" .-> onigumo_downloader
    onigumo_downloader -. ".raw" .-> spider_parser
    spider_parser -. ".json" .-> spider_operator
```

```mermaid
flowchart LR
    subgraph "🕷️ Spider"
        direction TB
        spider_parser(PARSER)
        spider_operator(OPERATOR)
        spider_materializer(MATERIALIZER)
    end

    subgraph Onigumo
        onigumo_feeder[FEEDER]
        onigumo_downloader[DOWNLOADER]
    end

    onigumo_feeder -- .json --> spider_operator
    onigumo_feeder -- .urls --> onigumo_downloader
    onigumo_feeder -- .raw --> spider_parser

    spider_parser -. ".json" .-> spider_operator
    onigumo_downloader -. ".raw" .-> spider_parser
    spider_operator -. ".urls" .-> onigumo_downloader
    spider_operator ---> spider_materializer
```

### Operator ###
The Operator determines the URL addresses for the Downloader to fetch. A Spider is responsible for adding those URLs, which it extracts from the structured data provided by the Parser.
The Operator’s job is to:
1. initialize a Spider,
2. extract new URLs from structured data,
3. insert those URLs into the Downloader queue.
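To make these steps concrete, here is a minimal, hypothetical sketch of one Operator pass in Elixir. The `MySpider`-style module (see the spiders section below), the `Jason` JSON library, and the line-per-URL queue file are illustrative assumptions, not Onigumo's actual API:

```elixir
# Hypothetical sketch, not Onigumo's real Operator. Assumes a spider
# module exposing extract_urls/1 and a line-per-URL queue file.
defmodule Operator do
  @urls_file "queue.urls"

  def run(spider, json_path) do
    json_path
    |> File.read!()
    |> Jason.decode!()           # structured data written by the Parser
    |> spider.extract_urls()     # the spider picks the URLs to crawl next
    |> Enum.each(&enqueue/1)
  end

  # Append a URL to the Downloader queue file.
  defp enqueue(url), do: File.write!(@urls_file, url <> "\n", [:append])
end
```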
### Downloader ###

The Downloader fetches and saves the contents and metadata of the unprocessed URL addresses.
The Downloader’s job is to:
1. read the URLs to download,
2. check for already downloaded URLs,
3. fetch each URL's contents along with its metadata,
4. save the downloaded data.
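Again purely as an illustration: a Downloader pass over the queue file from the previous sketch, assuming the `Req` HTTP client (Onigumo's real implementation may use a different client and file layout):

```elixir
# Hypothetical sketch, not Onigumo's real Downloader.
defmodule Downloader do
  def run(urls_file) do
    urls_file
    |> File.stream!()                  # one URL per line
    |> Stream.map(&String.trim/1)
    |> Stream.reject(&downloaded?/1)   # skip already downloaded URLs
    |> Enum.each(&fetch_and_save/1)
  end

  defp downloaded?(url), do: File.exists?(raw_path(url))

  # Fetch one URL and save its body; metadata is omitted for brevity.
  defp fetch_and_save(url) do
    response = Req.get!(url)
    File.write!(raw_path(url), response.body)
  end

  # Derive a flat ".raw" file name from the URL; purely illustrative.
  defp raw_path(url), do: Base.url_encode64(url, padding: false) <> ".raw"
end
```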
### Parser ###

The Parser processes the downloaded contents and metadata into a structured form.

The Parser's job is to:
1. check for downloaded URLs awaiting processing,
2. process the contents and metadata of the downloaded URLs into a structured form,
3. save the structured data.
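A sketch of such a Parser pass over the ".raw" files from the previous example, assuming `Floki` for HTML parsing and `Jason` for JSON encoding (both are assumptions, not confirmed by Onigumo's codebase):

```elixir
# Hypothetical sketch, not Onigumo's real Parser.
defmodule Parser do
  def run(raw_dir) do
    raw_dir
    |> Path.join("*.raw")
    |> Path.wildcard()              # all downloaded, unprocessed files
    |> Enum.each(&parse_and_save/1)
  end

  defp parse_and_save(raw_path) do
    {:ok, document} = raw_path |> File.read!() |> Floki.parse_document()

    # Reduce the raw HTML to a structured form.
    structured = %{
      title: document |> Floki.find("title") |> Floki.text(),
      links: Floki.attribute(document, "a", "href")
    }

    File.write!(Path.rootname(raw_path) <> ".json", Jason.encode!(structured))
  end
end
```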
## Applications (spiders) ##

A spider extracts the needed information from the structured data.

What the output data or information looks like depends on the user's needs as well as on the form of the web content. It is impossible to create a universal spider satisfying all requirements for every combination of the two, which is why you need to write your own spider.
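As a hypothetical example only, such a spider might provide the callbacks the sketches above rely on:

```elixir
# Hypothetical spider; Onigumo's actual spider contract may differ.
defmodule MySpider do
  # Pick follow-up URLs for the Operator out of the structured data.
  def extract_urls(%{"links" => links}) do
    Enum.filter(links, &String.starts_with?(&1, "https://example.com/"))
  end

  # Materialize the final output from the structured data.
  def materialize(structured), do: structured["title"]
end
```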
### Materializer ###
## Usage ##
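A sketch of one crawl cycle, chaining the illustrative modules from the sections above (same assumptions as before; this is not Onigumo's actual entry point):

```elixir
# Illustrative wiring of the sketched components; not a real Onigumo run.
Operator.run(MySpider, "page.json")   # 1. extract new URLs into queue.urls
Downloader.run("queue.urls")          # 2. fetch queued URLs into *.raw files
Parser.run(".")                       # 3. process *.raw files into *.json
```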
## Credits ##
© [Glutexo](https://github.com/Glutexo), [nappex](https://github.com/nappex) 2019 – 2022
Licensed under the [MIT license](LICENSE.txt).