https://github.com/dapalex/dbfeeder
An all-in-one solution to crawl, scrape, and populate a DB
- Host: GitHub
- URL: https://github.com/dapalex/dbfeeder
- Owner: dapalex
- License: apache-2.0
- Created: 2023-09-17T03:10:19.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-28T06:32:17.000Z (11 months ago)
- Last Synced: 2024-10-12T13:34:25.052Z (3 months ago)
- Language: C#
- Size: 470 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# DBFeeder
_Both the development and the documentation are still in progress._
## Introduction
DBFeeder is an all-in-one solution that crawls and scrapes information from the web and then populates a relational database.
## Using DBFeeder
### Configuration
The solution can be configured following the steps below:
1) Create JSON configuration files for the crawler (instructions [here](https://github.com/dapalex/DBFeeder/blob/main/CrawlerService/configs/README.md))
2) Create JSON configuration files for the scraper (instructions [here](https://github.com/dapalex/DBFeeder/blob/main/ScraperService/configs/README.md))
3) Define the entities (EF Core) using Devart Entity Developer (instructions [here](https://github.com/dapalex/DBFeeder/blob/main/DBFeederEntity/README.md))
4) Update the [`docker-compose.yml`](https://github.com/dapalex/DBFeeder/blob/main/docker-compose.yml) file to create a DAC (Data Access Command) service for each entity created
### Launching the solution
The solution is run through the `docker-compose.yml` file:
#### Build
```bash
docker compose build
```
#### Launch
```bash
docker compose up
```
## Execution workflow
Retrieving all the information for a single entity involves the following phases:
- The Crawler extracts the target URL
- The Scraper extracts information from the target URL
- The Data Access Command generates the entity and populates the corresponding table

## Architecture
The solution is composed of the following Docker images:
- Crawler: a container from an image of a .NET 7 worker service running multithreaded, 1 task for each source/configuration
- Scraper: a container running multiple .NET 7 worker service processes, 1 process for each source
- DataAccessCommand: 1 container for each entity/DB table

![image](https://github.com/dapalex/DBFeeder/blob/main/Docs/DBFeeder%20Architecture.png)
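
As a rough illustration of how the Crawler and Scraper containers listed above could hand work to each other, here is a minimal in-process sketch. It is written in Python for brevity and is not the project's .NET code: `queue.Queue` stands in for a message broker such as RabbitMQ, the `SOURCES` and `PAGES` dictionaries replace the JSON configurations and real HTTP downloads, and every name in it is hypothetical. It runs one crawler thread per source but, for simplicity, a single scraper thread rather than one process per source.

```python
import queue
import threading
from html.parser import HTMLParser

# Hypothetical stand-ins for crawler configs: source name -> HTML source page.
SOURCES = {
    "books": '<a href="/book/1">One</a><a href="/book/2">Two</a>',
    "films": '<a href="/film/9">Nine</a>',
}

# Hypothetical target pages that a real scraper would download.
PAGES = {
    "/book/1": "<h1>Book One</h1>",
    "/book/2": "<h1>Book Two</h1>",
    "/film/9": "<h1>Film Nine</h1>",
}


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


class HeadingParser(HTMLParser):
    """Collects the text found inside <h1> tags."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.text += data


def crawl(source, html, url_queue):
    """Crawler task: extract target URLs from one source page."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    for url in parser.links:
        url_queue.put((source, url))


def scrape(url_queue, record_queue):
    """Scraper task: turn each crawled URL into a record until the sentinel arrives."""
    while True:
        item = url_queue.get()
        if item is None:  # sentinel: no more URLs will arrive
            break
        source, url = item
        parser = HeadingParser()
        parser.feed(PAGES[url])
        parser.close()
        record_queue.put({"source": source, "url": url, "title": parser.text})


def run_pipeline():
    """Run one crawler thread per source plus one scraper; return records sorted by URL."""
    url_queue, record_queue = queue.Queue(), queue.Queue()
    scraper = threading.Thread(target=scrape, args=(url_queue, record_queue))
    crawlers = [
        threading.Thread(target=crawl, args=(name, html, url_queue))
        for name, html in SOURCES.items()
    ]
    scraper.start()
    for thread in crawlers:
        thread.start()
    for thread in crawlers:
        thread.join()
    url_queue.put(None)  # all crawlers are done, tell the scraper to stop
    scraper.join()
    records = []
    while not record_queue.empty():
        records.append(record_queue.get())
    return sorted(records, key=lambda r: r["url"])


if __name__ == "__main__":
    for record in run_pipeline():
        print(record)
```

Because the stages only share queues, each one can be scaled or restarted independently of the others. That property is the main reason to put a broker between containers.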
Stack:
- Docker
- .NET 7
- RabbitMQ
- EF Core
- SQLite

Design goals:
- maximize throughput
- allow scalability
- improve efficiency
- ensure robustness (needs more work)
- allow reusability

A simplified CQRS pattern has been applied, consisting of a single DB and one DAC service for each table.
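
To make the "single DB, one DAC service per table" idea concrete, here is a small sketch of the command (write) side only. It is written in Python with the standard `sqlite3` module rather than EF Core. The `ENTITIES` mapping, the `DataAccessCommand` class, and the `dispatch` helper are illustrative assumptions, not the project's actual types.

```python
import sqlite3

# Hypothetical entities: table name -> column names.
# In DBFeeder these would be EF Core entities instead.
ENTITIES = {
    "book": ("url", "title"),
    "film": ("url", "title"),
}


class DataAccessCommand:
    """Write-only handler bound to exactly one table of the shared database."""

    def __init__(self, conn, table, columns):
        self.conn = conn
        self.table = table
        self.columns = columns
        # Table and column names come from the trusted ENTITIES mapping above,
        # so interpolating them into the SQL text is acceptable in this sketch.
        column_defs = ", ".join(f"{c} TEXT" for c in columns)
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({column_defs})")

    def handle(self, message):
        """Insert one scraped record; values are passed as bound parameters."""
        names = ", ".join(self.columns)
        marks = ", ".join("?" for _ in self.columns)
        self.conn.execute(
            f"INSERT INTO {self.table} ({names}) VALUES ({marks})",
            [message[c] for c in self.columns],
        )
        self.conn.commit()


def build_dacs(conn):
    """Create one DAC per entity, all sharing the same database connection."""
    return {
        table: DataAccessCommand(conn, table, columns)
        for table, columns in ENTITIES.items()
    }


def dispatch(dacs, entity, message):
    """Route a scraped record to the DAC that owns the entity's table."""
    dacs[entity].handle(message)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    dacs = build_dacs(conn)
    dispatch(dacs, "book", {"url": "/book/1", "title": "Book One"})
    dispatch(dacs, "film", {"url": "/film/9", "title": "Film Nine"})
    print(conn.execute("SELECT url, title FROM book").fetchall())
```

In the real solution each DAC would run in its own container, so adding an entity means adding one more service rather than touching the existing writers.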
## Services Overview
### Crawler
In charge of retrieving URLs from an HTML source page.
More information [here](https://github.com/dapalex/DBFeeder/blob/main/CrawlerService)
### Scraper
In charge of retrieving, from the crawled URLs, the information used to populate the database.
More information [here](https://github.com/dapalex/DBFeeder/blob/main/ScraperService)
### Data Access Command
In charge of populating the database with the scraped information.
More information [here](https://github.com/dapalex/DBFeeder/blob/main/DACService)
## Services instantiation
![image](https://github.com/dapalex/DBFeeder/blob/main/Docs/DBFeeder%20Creation%20Workflow.png)
## Last words
This repo is dedicated to Peter, a friend who gave me the chance to learn how life can be enjoyable.