https://github.com/simon987/od-database

Distributed crawler, database and web frontend for public directories indexing

Host: GitHub
URL: https://github.com/simon987/od-database
Owner: simon987
License: mit
Created: 2018-06-02T21:26:58.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2020-01-31T16:26:49.000Z (over 6 years ago)
Last Synced: 2025-03-24T13:09:34.672Z (over 1 year ago)
Topics: bootstrap, elasticsearch, scraping
Language: Python
Homepage:
Size: 1.88 MB
Stars: 139
Watchers: 13
Forks: 24
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# OD-Database

OD-Database is a web-crawling project that aims to index a very large number of file links and their basic metadata from open directories (misconfigured Apache/Nginx/FTP servers, or more often, mirrors of various public services).

Each crawler instance fetches tasks from the central server and pushes the results once a task is completed. A single instance can crawl hundreds of websites at the same time (both FTP and HTTP(S)), and the central server is capable of ingesting thousands of new documents per second.
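The fetch/crawl/push cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `TaskServer`, `Task`, `crawl` and `run_crawler` are hypothetical names, the server is an in-memory stand-in, and none of this reflects the project's actual API or wire protocol.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical task shape: which website to crawl and where it lives.
    website_id: int
    url: str


@dataclass
class TaskServer:
    """In-memory stand-in for the central server (hypothetical interface)."""
    pending: deque = field(default_factory=deque)
    results: dict = field(default_factory=dict)

    def fetch_task(self):
        # Hand out the next queued task, or None when the queue is empty.
        return self.pending.popleft() if self.pending else None

    def push_result(self, task, files):
        # Store the file list a crawler produced for this website.
        self.results[task.website_id] = files


def crawl(task):
    # Placeholder: a real crawler would walk the directory listing here.
    return [{"path": task.url + "example.txt", "size": 0}]


def run_crawler(server):
    """Fetch tasks until none remain, pushing each result once it is complete."""
    completed = 0
    while (task := server.fetch_task()) is not None:
        server.push_result(task, crawl(task))
        completed += 1
    return completed


server = TaskServer(
    pending=deque([Task(1, "http://a.example/"), Task(2, "ftp://b.example/")])
)
print(run_crawler(server))
```

In this sketch a single loop handles one task at a time; a real crawler instance would presumably run many such tasks concurrently to reach hundreds of websites at once.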

The data is indexed into Elasticsearch and made available via the web frontend (currently hosted at https://od-db.the-eye.eu/). There are currently ~1.93 billion files indexed (about 300 GB of raw data in total). The raw data is available as a CSV file [here](https://od-db.the-eye.eu/dl).
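As a rough illustration of how file records could be batched for Elasticsearch, the sketch below builds a request body in the newline-delimited JSON format accepted by the Elasticsearch bulk API. The index name `od-database` and the document fields (`website_id`, `path`, `name`, `ext`, `size`) are assumptions made for the example, not the project's actual mapping, and `build_bulk_body` is a hypothetical helper.

```python
import json


def build_bulk_body(index, docs):
    """Serialize documents into the NDJSON body used by the Elasticsearch bulk API."""
    lines = []
    for doc in docs:
        # Action line: tells Elasticsearch to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical file records; the field names are illustrative only.
docs = [
    {"website_id": 1, "path": "pub/iso", "name": "image", "ext": "iso", "size": 1024},
    {"website_id": 1, "path": "pub/doc", "name": "readme", "ext": "txt", "size": 12},
]
body = build_bulk_body("od-database", docs)
print(body)
```

Sending many documents per request in this way, rather than one request per file, is the usual approach when ingest rates reach thousands of documents per second.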

![2018-09-20-194116_1127x639_scrot](https://user-images.githubusercontent.com/7120851/45852325-281cca00-bd0d-11e8-9fed-49a54518e972.png)

### Contributing
Suggestions, concerns, and PRs are welcome.

## Installation (Docker)
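```bash
# Clone the repository together with its submodules
git clone --recursive https://github.com/simon987/od-database
cd od-database
# Create the local data directories before starting the services
mkdir oddb_pg_data/ tt_pg_data/ es_data/ wsb_data/
# Start the services defined in the compose file
docker-compose up
```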

## Architecture

![diag](high_level_diagram.png)

## Running the crawl server
The Python crawler that was part of this project is discontinued;
[the Go implementation](https://github.com/terorie/od-database-crawler) is currently in use.
