# **wordacy** crawler

### Crawler
Crawler - searches the specified source web page and collects all URLs found on it, saving the generated URL list in JSON format.
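
What this step amounts to can be sketched with only the Python standard library; the `LinkCollector` class and the URLs below are hypothetical, and `Crawler2`'s actual implementation may differ:

```python
import json
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect every href found in <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)

collector = LinkCollector()
page = urlopen("https://name.com/sub/page").read().decode("utf-8", "ignore")
collector.feed(page)

# save the generated URL list in JSON format, keyed by domain
# (".new" mirrors the output format shown further below)
with open("urls.json", "w", encoding="utf-8") as f:
    json.dump({"https://name.com": collector.urls, ".new": []}, f, indent=2)
```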

### Tokenizer
Tokenizer - splits sentences into separate words.
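
The Tokenizer's API is not shown in this README, so the split below is only an illustrative sketch using a simple regular expression:

```python
import re

def tokenize(sentence: str) -> list[str]:
    # split on word boundaries, dropping punctuation and whitespace
    return re.findall(r"[A-Za-z0-9']+", sentence)

print(tokenize("Crawler - search and collect all URLs."))
# ['Crawler', 'search', 'and', 'collect', 'all', 'URLs']
```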

### Analyzer
Analyzer - analyzes the input page by the content of DOM elements such as keywords, lists, p tags, a tags, span tags, and h1-h6 headers.
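
As an illustration of the kind of DOM inspection described above, the sketch below uses BeautifulSoup; this is an assumption about the approach, not the Analyzer's actual internals:

```python
from bs4 import BeautifulSoup

with open("filename.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# <meta name="keywords" content="..."> if the page provides one
meta = soup.find("meta", attrs={"name": "keywords"})
keywords = meta.get("content", "").split(",") if meta else []

# text of h1-h6 headers, grouped by level
headings = {f"h{i}": [h.get_text(strip=True) for h in soup.find_all(f"h{i}")]
            for i in range(1, 7)}

paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
spans = [s.get_text(strip=True) for s in soup.find_all("span")]
```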

### Params:
Parameter **Crawler2**(**recursive**=False): crawl only the links found on the specified URLs.

Parameter **Crawler2**(**recursive**=True): crawl recursively, traversing all links from each domain.
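
A minimal recursive run, reusing only the calls shown in the examples below, might look like this:

```python
# recursive mode: traverse every link discovered on each visited page
crawler = Crawler2(recursive=True)
crawler.enqueue_url("https://name-1.com/")
crawler.run()
crawler.save_json()
```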

## Examples:
### 1) Crawl web pages:
```python
crawler = Crawler2(recursive=False)

# open the URL-list file "urls.json"
crawler.open_json("urls.json")

# seed the crawl queue with the starting pages
crawler.enqueue_url("https://name-1.com/sub/page")
crawler.enqueue_url("https://name-2.com/sub/page")
crawler.enqueue_url("https://name-3.com/sub/page")

# set a path filter for each domain
crawler.set_filter("https://name-1.com", ["/privacy-policy"])
crawler.set_filter("https://name-2.com", ["/privacy-policy"])
crawler.set_filter("https://name-3.com", ["/privacy-policy"])

# start crawling
crawler.run()

# save "urls.json"
crawler.save_json()
```
### 2) Crawl a local HTML file:
Extract all links for a specified domain from a local file:
```python
crawler = Crawler2()
crawler.extract_from_file("filename.html", "https://domain.com/", ["/privacy-policy"])

# save the output file as ./storage/domain.com.json
crawler.save_json()
```
**Crawler2** writes the output JSON file as a URL list:
```json
{
  "https://name.com": [
    "https://name.com/url_1/",
    "https://name.com/url_2/"
  ],
  ".new": []
}
```
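
The saved list can be read back with the standard library; note that the meaning of the `.new` key is an assumption based on the empty entry above:

```python
import json

with open("urls.json", encoding="utf-8") as f:
    data = json.load(f)

for domain, urls in data.items():
    if domain == ".new":  # assumed: newly found URLs pending a crawl
        continue
    print(domain, "->", len(urls), "urls")
```
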
### 3) Smart analysis of content from URLs:

```python
analyzer = Analyzer()
analyzer.open_json("some.json")

# learn from a local template file
analyzer.learn_file("template/template.html")

# analyze pages at the given URLs (placeholder addresses shown)
analyzer.learn_url("https://name-1.com/sub/page")
analyzer.learn_url("https://name-2.com/sub/page")
analyzer.learn_url("https://name-3.com/sub/page")

analyzer.save_json()
```

Format of the output JSON file:
```json
{
  "urls": {},
  "keywords": [],
  "data": {},
  "headings": {}
}
```