https://github.com/altescy/mincrawler
A minimal web crawler.
configurable crawler python scraping
- Host: GitHub
- URL: https://github.com/altescy/mincrawler
- Owner: altescy
- License: mit
- Created: 2020-05-29T13:32:15.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-01-03T05:34:50.000Z (almost 4 years ago)
- Last Synced: 2025-01-26T11:42:17.337Z (9 months ago)
- Topics: configurable, crawler, python, scraping
- Language: Python
- Size: 180 KB
- Stars: 1
- Watchers: 3
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
mincrawler
==========

### Installation
```
pip install git+https://github.com/altescy/mincrawler
```

### Usage
```
$ cat config.jsonnet
local storage = {
  "@type": "tinydb",
  "path": "storage.json",
};
local tweet_collection = "tweets";

{
  "@type": "basic",
  "crawler": {
    "@type": "twitter_tweet_search",
    "client_id": std.extVar("TWITTER_CLIENT_ID"),
    "client_secret": std.extVar("TWITTER_CLIENT_SECRET"),
    "token": std.extVar("TWITTER_TOKEN"),
    "token_secret": std.extVar("TWITTER_TOKEN_SECRET"),
    "q": "python",
    "lang": "en",
    "count": 10,
    "max_requests": 1,
  },
  "pipeline": {
    "@type": "basic",
    "stages": [
      {
        "@type": "drop_duplicate",
        "storage": storage,
        "collection": tweet_collection,
      },
      {
        "@type": "store_item",
        "storage": storage,
        "collection": tweet_collection,
      },
    ]
  }
}
$ mincrawler config.jsonnet
$ cat storage.json | jq ".tweets | .[].content.text"
```
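
The stored tweets can also be read from Python instead of `jq`. The following is a minimal sketch that uses only the standard library and is not part of mincrawler itself. It assumes the layout implied by the `jq` filter above: a top-level `tweets` key mapping document IDs to items, each with a `content.text` field. The helper names `load_storage` and `tweet_texts` are hypothetical.

```python
import json
from pathlib import Path


def load_storage(path="storage.json"):
    """Load the JSON file written by the storage backend (path is an assumption)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def tweet_texts(storage, collection="tweets"):
    """Return the text of every stored item in a collection.

    Mirrors the jq filter `.tweets | .[].content.text`, assuming the
    collection is a JSON object keyed by document ID.
    """
    items = storage.get(collection, {})
    # Iterate over the object's values, as jq's `.[]` does for objects.
    return [item["content"]["text"] for item in items.values()]
```

After a crawl, `tweet_texts(load_storage())` returns the tweet texts as a list, and a missing collection yields an empty list.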