Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/diegov/searchbox
Personal crawling and indexing
- Host: GitHub
- URL: https://github.com/diegov/searchbox
- Owner: diegov
- License: gpl-3.0
- Created: 2020-03-20T18:52:06.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-06-09T12:16:56.000Z (5 months ago)
- Last Synced: 2024-10-20T07:26:28.730Z (23 days ago)
- Language: Python
- Size: 105 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Intro
======
Simple crawling and indexing of personal bookmarks with an Elasticsearch backend.
Both indexing and search are WIP and need lots of work.

Sources:
- [x] Github stars
- [x] Pocket bookmarks
- [ ] Saved Reddit posts
- [x] Liked tweets

Configuration
==============
The application expects a `~/.config/searchbox/secrets.py` file with the following structure:
```python
github = {
    # Users whose stars will be crawled
    "users_to_crawl": ["user1"],
    # Username for API access
    "username": "apiuser1",
    "personal_access_token": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}

pocket = {
    "consumer_key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "access_token": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}

elastic = {
    "url": "https://xxxxxxxx",
    # Optional
    "username": "xxxxxxx",
    # Optional
    "password": "xxxxxxx"
}
```
The `elastic` section is required; `github` or `pocket` can be omitted if the corresponding spider won't be used.
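For example, as a minimal sketch (placeholder values), a `secrets.py` that only configures the Elasticsearch backend could look like this:

```python
# Only the required `elastic` section; the `github` and `pocket`
# sections are omitted because those spiders won't be run.
elastic = {
    "url": "https://xxxxxxxx"
}
```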
Unfortunately, each service requires its own type of API access (registering app keys, OAuth, etc.), so:

Pocket
-------
It requires an application to be registered at https://getpocket.com/developer/apps/, which will give you the `consumer_key`.
After that we have to do the OAuth dance as specified at https://getpocket.com/developer/docs/authentication, which will result in an authenticated `access_token`.
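As a rough sketch (not part of the project's code; the endpoints follow the Pocket docs linked above, and `CONSUMER_KEY`/`REDIRECT_URI` are placeholders), the dance can be driven by a one-off Python script:

```python
import requests

CONSUMER_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # from your registered app
REDIRECT_URI = "http://localhost/callback"       # placeholder redirect
HEADERS = {"X-Accept": "application/json"}

# Step 1: obtain a request token ("code").
resp = requests.post(
    "https://getpocket.com/v3/oauth/request",
    json={"consumer_key": CONSUMER_KEY, "redirect_uri": REDIRECT_URI},
    headers=HEADERS,
)
code = resp.json()["code"]

# Step 2: authorise the app in the browser.
print("Visit: https://getpocket.com/auth/authorize"
      f"?request_token={code}&redirect_uri={REDIRECT_URI}")
input("Press Enter once you've authorised the app... ")

# Step 3: swap the request token for the `access_token` used in secrets.py.
resp = requests.post(
    "https://getpocket.com/v3/oauth/authorize",
    json={"consumer_key": CONSUMER_KEY, "code": code},
    headers=HEADERS,
)
print("access_token:", resp.json()["access_token"])
```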
Github
-------
This is much simpler; see https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line
Elastic
--------
See the details from your Elasticsearch provider, local deployment, container, etc.
If the Elasticsearch configuration does not automatically create indexes, you'll need to create one called `scrapy` before indexing.
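As a minimal sketch, assuming the standard Elasticsearch REST API (host and credentials below are placeholders matching `secrets.py`), a single PUT creates the index:

```python
import requests

# Create an empty `scrapy` index with default settings and mappings.
requests.put(
    "https://your-elasticsearch-host/scrapy",  # placeholder host
    auth=("username", "password"),             # omit if the cluster is open
)
```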
Running
========
From the root of the project:
```sh
python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt
bin/crawl
```

There are two spiders, `github_stars` and `pocket`. `crawl` is a shell script that runs both, but they can also be run individually, e.g.:
```sh
scrapy crawl github_stars
```

At the end of this, some data should be stored in Elasticsearch. There's a simple test script that will query the results:
```sh
bin/query q python
```
Will return the first 30 results matching `python`.
```sh
bin/query q python performance
```
Will return the first 30 results matching `python` AND `performance`.
```sh
bin/query reset-index
```
Will delete all the data in the Elasticsearch index and re-create it.
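For reference, a rough sketch of what such a reset amounts to at the Elasticsearch REST level (same placeholder host and credentials as above; the actual `bin/query` script may do this differently):

```python
import requests

BASE = "https://your-elasticsearch-host"  # placeholder host
AUTH = ("username", "password")           # omit if the cluster is open

# Drop the existing `scrapy` index (a 404 just means it wasn't there)...
requests.delete(f"{BASE}/scrapy", auth=AUTH)
# ...then re-create it empty.
requests.put(f"{BASE}/scrapy", auth=AUTH)
```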