https://github.com/stupidcucumber/elephant-crawler

System for mining texts from websites.


Host: GitHub
URL: https://github.com/stupidcucumber/elephant-crawler
Owner: stupidcucumber
License: mit
Created: 2024-10-18T17:23:00.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-11-27T10:27:04.000Z (over 1 year ago)
Last Synced: 2025-01-19T08:15:31.402Z (over 1 year ago)
Topics: data, data-mining-python, python
Language: Python
Homepage:
Size: 111 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# elephant-crawler
![Python Version from PEP 621 TOML](https://img.shields.io/python/required-version-toml?tomlFilePath=https%3A%2F%2Fraw.githubusercontent.com%2Fstupidcucumber%2Felephant-crowler%2Frefs%2Fheads%2Fmain%2Fpyproject.toml)

![Logotype](./assets/logo.png)

## Development
To start contributing to this repository:

1. Install requirements:
```
python -m pip install -r requirements.dev.txt
```

2. Install pre-commit hook:
```
pre-commit install
```

You're good to go!

### Architecture
![Architecture](./assets/architecture.png)

1. DB stores all data extracted from the texts.
2. Core-API gives external services access to the database.
3. Crawler-SVC starts all

## Deployment

The only thing you need to do is run:
```
docker-compose up --build
```

Once the containers are running, all scraped texts are available at this endpoint:
```
http://localhost:8081/scrapped-texts
```
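
As a minimal sketch of reading that endpoint from Python: the URL comes from this README, but the response format is not documented here, so the snippet assumes the body is JSON. The function name `fetch_scrapped_texts` and its `opener` parameter are hypothetical helpers for illustration, not part of this project.

```python
import json
from urllib.request import urlopen

# Endpoint taken from this README.
SCRAPPED_TEXTS_URL = "http://localhost:8081/scrapped-texts"


def fetch_scrapped_texts(url=SCRAPPED_TEXTS_URL, timeout=10.0, opener=urlopen):
    """Return the decoded response body.

    Assumption: the endpoint responds with UTF-8 encoded JSON.
    `opener` defaults to urllib's urlopen and can be swapped out in tests.
    """
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the stack from `docker-compose up --build` running, calling `fetch_scrapped_texts()` should return whatever the service serves at that endpoint. If the response turns out not to be JSON, `json.loads` raises an error and the decoding step needs adjusting.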
