Wordacy-crawler (html-crawler)
- Host: GitHub
- URL: https://github.com/diixo/wordacy-crawler
- Owner: diixo
- Created: 2022-10-27T00:08:36.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-23T02:43:27.000Z (8 months ago)
- Last Synced: 2024-05-23T03:37:36.562Z (8 months ago)
- Language: HTML
- Homepage:
- Size: 36.9 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# **wordacy** crawler
### Crawler
Crawler searches for and collects all URLs found on a specified source web page, and saves the generated URL list in JSON format.

### Tokenizer
Tokenizer splits sentences into separate words.
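The Tokenizer API itself is not shown in this README; the snippet below is only a standard-library illustration of the kind of splitting it describes:

```python
import re

# illustrative only: split a sentence into separate words,
# the kind of operation the Tokenizer performs
sentence = "Crawler collects URLs from a web page."
words = re.findall(r"[A-Za-z0-9']+", sentence)
print(words)  # ['Crawler', 'collects', 'URLs', 'from', 'a', 'web', 'page']
```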
### Analyzer
Analyzer analyzes an input page by the content of DOM elements such as keywords, lists, p-tags, a-tags, span-tags, and h1-h6 headings.

### Params:
Parameter **Crawler2** (**recursive**=False): crawl only the links from the specified URLs.
Parameter **Crawler2** (**recursive**=True): crawl with recursive traversal (all links from each domain).
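A minimal sketch of the recursive mode, reusing the same calls shown in the examples below (the URL and filter values are placeholders):

```python
# recursive mode: follow every link discovered on each crawled domain
crawler = Crawler2(recursive=True)
crawler.open_json("urls.json")
crawler.enqueue_url("https://name-1.com/")
crawler.set_filter("https://name-1.com", ["/privacy-policy"])
crawler.run()
crawler.save_json()
```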
## Examples:
### 1) Crawl a web page:
```python
crawler = Crawler2(recursive=False)
crawler.open_json("urls.json")
crawler.enqueue_url("https://name-1.com/sub/page")
crawler.enqueue_url("https://name-2.com/sub/page")
crawler.enqueue_url("https://name-3.com/sub/page")
crawler.set_filter("https://name-1.com", ["/privacy-policy"])
crawler.set_filter("https://name-2.com", ["/privacy-policy"])
crawler.set_filter("https://name-3.com", ["/privacy-policy"])
crawler.run()
# save "urls.json"
crawler.save_json()
```
### 2) Crawl a local HTML file:
Extract all links for a specified domain from a local file:
```python
crawler = Crawler2()
crawler.extract_from_file("filename.html", "https://domain.com/", ["/privacy-policy"])
# save the output file as ./storage/domain.com.json
crawler.save_json()
```
**Crawler2** writes the output JSON file as a URL list:
```json
{
"https://name.com": [
"https://name.com/url_1/",
"https://name.com/url_2/"
],
".new": []
}
```
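Assuming the layout above, the saved list can be read back with the standard json module (the file name follows example 1 and is illustrative):

```python
import json

# load the URL list written by Crawler2.save_json()
with open("urls.json", encoding="utf-8") as fh:
    url_index = json.load(fh)

# each top-level key is a domain (the ".new" bucket appears in the sample output);
# each value is the list of collected URLs for that entry
for domain, urls in url_index.items():
    print(domain, len(urls), "urls")
```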
### 3) Smart analysis of the content from URLs:
```python
analyzer = Analyzer()
analyzer.open_json("some.json")
analyzer.learn_file("template/template.html")
analyzer.learn_url(url_1)
analyzer.learn_url(url_2)
analyzer.learn_url(url_3)
analyzer.save_json()
```
Format of the output JSON file:
```json
{
"urls": {},
"keywords": [],
"data": {},
"headings": {}
}
```