Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Kukei-eu/spider
- Host: GitHub
- URL: https://github.com/Kukei-eu/spider
- Owner: Kukei-eu
- License: gpl-3.0
- Created: 2023-11-28T18:34:24.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-07-23T09:35:57.000Z (4 months ago)
- Last Synced: 2024-08-02T16:42:31.119Z (3 months ago)
- Language: JavaScript
- Size: 431 KB
- Stars: 2
- Watchers: 0
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# Kukei.eu crawler
## What is it
This is a crawler behind [kukei.eu](https://kukei.eu) website.
It's used to crawl the websites registered in the [index-sources.js](./index-sources.js) file. It's written in JavaScript with the Node.js runtime in mind.
## What is kukei.eu
Full information about the website can be found [here](https://kukei.eu/about). Long story short, Kukei.eu is a curated search engine focused on providing good assistance for *web developers*.
If you already use it and want to contribute, the best way to start is to add your blog, or the blog of someone you love to read, to the [index-sources.js](./index-sources.js) file via a PR.
All PRs will be merged unless the author of this website finds the blog to have low-quality content (e.g. Medium-style articles full of buzzwords and little actual substance).
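As a purely hypothetical illustration (the authoritative format is whatever [index-sources.js](./index-sources.js) actually contains), an entry added in such a PR might look roughly like this:
```js
// Hypothetical shape of an index-sources.js entry; check the real file for the actual format.
export default [
  {
    name: 'Example Dev Blog',        // illustrative only
    url: 'https://blog.example.com', // the site the crawler should start from
  },
  // ...existing sources
];
```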
## Configuration
To run it you need access to a MeiliSearch instance and a MongoDB instance.
To configure it, create a `.env` file with the following content:
```bash
MEILI_MASTER_KEY=
MEILI_HOST=https://example.com
MONGO_URI=mongodb+srv://user:[email protected]/?retryWrites=true&w=majority
MONGO_DATABASE=
# Also used as the base prefix for collections, e.g. `sources` also creates a `sources-links` collection.
MONGO_COLLECTION=
# Can be empty; if filled, it will be used to create proper MeiliSearch index names. See `src/manage.js`.
MEILI_INDEX_PREFIX=
```

### Mongo settings
Not automated yet in `src/manage.js`.
To make the Mongo queries work properly you need to set up a few indexes:
- on `MONGO_COLLECTION`: a unique index on `url`
- the same for the corresponding `-links` collection

The reason is that the crawler relies heavily on `upsert` in many processes.
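Until this is automated, the indexes can be created by hand. Below is a minimal sketch (not part of the repo) that assumes the official `mongodb` Node.js driver and the environment variables from the `.env` file above:
```js
// create-indexes.mjs: one-off helper sketch; assumes the official `mongodb` driver is installed.
import { MongoClient } from 'mongodb';

const client = new MongoClient(process.env.MONGO_URI);

try {
  await client.connect();
  const db = client.db(process.env.MONGO_DATABASE);

  // Unique index on `url` for the main collection...
  await db
    .collection(process.env.MONGO_COLLECTION)
    .createIndex({ url: 1 }, { unique: true });

  // ...and the same for the derived `-links` collection.
  await db
    .collection(`${process.env.MONGO_COLLECTION}-links`)
    .createIndex({ url: 1 }, { unique: true });
} finally {
  await client.close();
}
```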
## How to run it
You can run it out of the box as long as you have [MeiliSearch](https://www.meilisearch.com) access (self-hosted or cloud).
To run it:
```bash
yarn install &&
yarn crawl:roots
```

This will do the initial crawling of all the sources. Then, to start continuous crawling of the found pages, run the `yarn crawl:auto` process.
This process runs for at most the number of milliseconds configured in the `PROCESS_TIME_TO_LIVE_MS` environment variable (default: 10 minutes).
It is not guaranteed to run for exactly that long; it simply will not start another crawling iteration once that time has passed.
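The pattern is roughly the following (an illustrative sketch, not the repo's actual code; `crawlNextBatch` is a hypothetical placeholder):
```js
// Illustrative time-budget loop; crawlNextBatch() is a hypothetical placeholder.
const TTL_MS = Number(process.env.PROCESS_TIME_TO_LIVE_MS ?? 10 * 60 * 1000);
const startedAt = Date.now();

while (Date.now() - startedAt < TTL_MS) {
  // The budget is only checked before an iteration starts, so a slow
  // iteration can push the total runtime past TTL_MS.
  await crawlNextBatch();
}
```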
## How to contribute
If you want to join the project by contributing bug fixes or new features, please first reach out via GitHub issues to discuss the feature you want to add.