# Onigumo #

## About ##

Onigumo is yet another web crawler. It “crawls” websites or web apps, storing their data in a structured form suitable for further machine processing.

## Architecture ##

The crawling part of Onigumo is composed of three sequentially interconnected components:

* [the Operator](#operator),
* [the Downloader](#downloader),
* [the Parser](#parser).

The flowcharts below illustrate the flow of data between those parts:

```mermaid
flowchart LR
subgraph Crawling
direction BT
spider_parser(🕷️ PARSER)
spider_operator(🕷️ OPERATOR)
onigumo_downloader[DOWNLOADER]
end

start([START]) --> onigumo_feeder[FEEDER]

onigumo_feeder -- .raw --> Crawling
onigumo_feeder -- .urls --> Crawling
onigumo_feeder -- .json --> Crawling

Crawling --> spider_materializer(🕷️ MATERIALIZER)

spider_materializer --> done([END])

spider_operator -. ".urls" .-> onigumo_downloader
onigumo_downloader -. ".raw" .-> spider_parser
spider_parser -. ".json" .-> spider_operator
```

```mermaid
flowchart LR
subgraph "🕷️ Spider"
direction TB
spider_parser(PARSER)
spider_operator(OPERATOR)
spider_materializer(MATERIALIZER)
end

subgraph Onigumo
onigumo_feeder[FEEDER]
onigumo_downloader[DOWNLOADER]
end

onigumo_feeder -- .json --> spider_operator
onigumo_feeder -- .urls --> onigumo_downloader
onigumo_feeder -- .raw --> spider_parser

spider_parser -. ".json" .-> spider_operator
onigumo_downloader -. ".raw" .-> spider_parser
spider_operator -. ".urls" .-> onigumo_downloader

spider_operator ---> spider_materializer
```

### Operator ###

The Operator determines the URL addresses for the Downloader to fetch. A Spider is responsible for adding those URLs, which it extracts from the structured data produced by the Parser.

The Operator’s job (see the sketch after this list) is to:

1. initialize a Spider,
2. extract new URLs from the structured data,
3. insert those URLs into the Downloader’s queue.
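
A minimal sketch of these steps in Python follows. It is illustrative only, not Onigumo’s actual API: it assumes the Parser appends one JSON record per line to a `.json` file, each carrying a `links` list, and that the Downloader queue is a plain `.urls` file with one URL per line.

```python
import json
from pathlib import Path

def operate(json_path: Path, urls_path: Path) -> int:
    """Extract new URLs from parsed records and enqueue them.

    Hypothetical sketch: each line of the .json file is assumed to be
    one record with a "links" list; the .urls queue file holds one URL
    per line. Returns the number of URLs enqueued.
    """
    seen = set(urls_path.read_text().splitlines()) if urls_path.exists() else set()
    new_urls = []
    for line in json_path.read_text().splitlines():
        if not line.strip():
            continue
        for url in json.loads(line).get("links", []):
            if url not in seen:
                seen.add(url)
                new_urls.append(url)
    with urls_path.open("a") as queue:
        queue.writelines(url + "\n" for url in new_urls)
    return len(new_urls)
```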

### Downloader ###

The Downloader fetches the contents and metadata of the not-yet-processed URL addresses and saves them.

The Downloader’s job (see the sketch after this list) is to:

1. read the URLs queued for download,
2. skip URLs that have already been downloaded,
3. fetch each URL’s contents along with its metadata,
4. save the downloaded data.
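
A minimal sketch of these four steps, under the same assumed file conventions rather than Onigumo’s real API; storing each body under a hash-derived name makes step 2 a simple file-existence check:

```python
import hashlib
import json
import urllib.request
from pathlib import Path

def download(urls_path: Path, out_dir: Path) -> None:
    """Fetch every not-yet-downloaded URL from the queue.

    Hypothetical sketch: the queue is a .urls file with one URL per
    line; each body is saved as <sha256>.raw with its metadata in a
    matching .meta.json file.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    for url in urls_path.read_text().splitlines():      # 1. read queued URLs
        if not url:
            continue
        name = hashlib.sha256(url.encode()).hexdigest()
        raw_file = out_dir / f"{name}.raw"
        if raw_file.exists():                           # 2. already downloaded
            continue
        with urllib.request.urlopen(url) as response:   # 3. fetch contents
            raw_file.write_bytes(response.read())       # 4. save the data
            meta = {"url": url, "status": response.status,
                    "headers": dict(response.headers.items())}
        (out_dir / f"{name}.meta.json").write_text(json.dumps(meta))
```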

### Parser ###

The Parser processes the downloaded contents and metadata into a structured form.

The Parser’s job (see the sketch after this list) is to:

1. check which downloaded URLs are awaiting processing,
2. process the contents and metadata of the downloaded URLs into a structured form,
3. save the structured data.
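
A minimal sketch of these steps, keeping the file conventions assumed above; collecting links stands in for whatever structured data a real parser would produce:

```python
import json
from html.parser import HTMLParser
from pathlib import Path

class LinkCollector(HTMLParser):
    """Collect href attributes of <a> tags; a stand-in for whatever
    structure a real parser would extract."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def parse(raw_dir: Path, json_path: Path) -> None:
    """Process downloaded .raw files into structured .json records.

    Hypothetical sketch: every .raw file without a record yet (step 1)
    is parsed (step 2) and appended as one JSON object per line (step 3).
    """
    done = set()
    if json_path.exists():
        done = {json.loads(line)["source"]
                for line in json_path.read_text().splitlines() if line.strip()}
    with json_path.open("a") as out:
        for raw_file in sorted(raw_dir.glob("*.raw")):
            if raw_file.name in done:
                continue
            collector = LinkCollector()
            collector.feed(raw_file.read_text(errors="replace"))
            record = {"source": raw_file.name, "links": collector.links}
            out.write(json.dumps(record) + "\n")
```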

## Applications (Spiders) ##

A Spider extracts the needed information from the structured form of the data.

The nature of the output data and information depends on the user’s needs as well as on the shape of the web content. No universal spider can satisfy every combination of the two, which is why you have to write your own.
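
Purely as an illustration (the record fields follow the Parser sketch above; none of this is Onigumo’s actual spider contract), a spider-specific step might reduce the structured records to exactly the information one user needs:

```python
import json
from pathlib import Path

def extract_link_report(json_path: Path, report_path: Path) -> None:
    """Spider-specific extraction: reduce the structured records to the
    one thing this particular spider cares about; here, a line per page
    with its outbound-link count. Another spider would want entirely
    different fields, which is why no universal spider exists.
    """
    with report_path.open("w") as out:
        for line in json_path.read_text().splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            out.write(f"{record['source']}\t{len(record['links'])}\n")
```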

### Materializer ###

## Usage ##

## Credits ##

© [Glutexo](https://github.com/Glutexo), [nappex](https://github.com/nappex) 2019 – 2022

Licensed under the [MIT license](LICENSE.txt).