Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ahmia/ahmia-crawler
Collection of crawlers used by the ahmia search engine
- Host: GitHub
- URL: https://github.com/ahmia/ahmia-crawler
- Owner: ahmia
- License: BSD-3-Clause
- Created: 2016-05-23T10:07:41.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2024-07-23T12:31:05.000Z (6 months ago)
- Last Synced: 2024-08-04T23:11:07.349Z (6 months ago)
- Language: Python
- Homepage:
- Size: 291 KB
- Stars: 141
- Watchers: 13
- Forks: 44
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-security-collection - **51** stars
README
[![Ahmia](https://raw.githubusercontent.com/ahmia/ahmia-site/master/ahmia-logotype.png)](https://ahmia.fi/)
Ahmia is a search engine for `.onion` domains on the Tor anonymity
network. It is led by [Juha Nurmi](https://github.com/juhanurmi) and is based
in Finland. This repository contains the crawlers used by the [Ahmia](https://ahmia.fi/) search engine.

# Prerequisites

[Ahmia-index](https://github.com/ahmia/ahmia-index) should be installed and running.

# Installation guide
## Install requirements in a virtual environment
```sh
python3 -m virtualenv venv3
source venv3/bin/activate
pip install -r requirements.txt
```

## Running your own Python HTTP proxy

See the fleet installation instructions
[here](https://github.com/ahmia/ahmia-crawler/tree/master/torfleet).

## Configuration
`ahmia/ahmia/example.env` contains some default values that should work out of the box.
Copy this to `.env` to create your own instance of the environment settings:
```sh
cp ahmia/ahmia/example.env ahmia/ahmia/.env
```

# Usage
To run the crawler continuously:
```sh
source venv3/bin/activate
./run.sh &> crawler.log
```

# Specific run examples
```sh
scrapy crawl ahmia-tor -s DEPTH_LIMIT=1 -s LOG_LEVEL=DEBUG
# or
scrapy crawl ahmia-tor -s DEPTH_LIMIT=1 -O items.json:json
# or
scrapy crawl ahmia-tor -s DEPTH_LIMIT=3
```

# Crontabs
```sh
# Every day
PATH=/usr/local/bin:/usr/bin:/bin:/home/juha/.local/bin
30 06 * * * cd /home/juha/ahmia-crawler/ && bash run_daily.sh > ./daily.log 2>&1
# First day of each month
PATH=/usr/local/bin:/usr/bin:/bin:/home/juha/.local/bin
30 01 01 * * cd /home/juha/ahmia-crawler/ && bash run.sh > ./monthly.log 2>&1
```
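The `-O items.json:json` run above uses Scrapy's standard feed export, which writes the scraped items as a single JSON array. A minimal, self-contained sketch of post-processing such a feed; the field names (`url`, `title`) are illustrative assumptions, not the crawler's actual item schema:

```python
import json

# Illustrative stand-in for a feed produced by
# `scrapy crawl ahmia-tor ... -O items.json:json`.
# The keys below are assumptions, not ahmia-crawler's real item fields.
sample_feed = """[
    {"url": "http://example.onion/", "title": "Example"},
    {"url": "http://example.onion/about", "title": "About"}
]"""

# A Scrapy JSON feed is one top-level array of item objects.
items = json.loads(sample_feed)
print(f"{len(items)} items crawled")
for item in items:
    print(item["url"])
```

With a real feed file, replace `sample_feed` with `open("items.json").read()`; note that `-O` overwrites the file on each run, whereas Scrapy's `-o` flag appends.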