Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
- Host: GitHub
- URL: https://github.com/suryadev99/data-pipeline-news-api
- Owner: suryadev99
- License: mit
- Created: 2024-07-16T12:08:24.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-09-12T17:44:02.000Z (about 2 months ago)
- Last Synced: 2024-09-13T06:57:52.726Z (about 2 months ago)
- Language: Python
- Size: 1.9 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# End to End Data Pipeline for News API
[![Python 3.8](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org/downloads/release/python-380/)

**End to End Data Pipeline for News API** is an implementation of a data pipeline that consumes the latest news from RSS feeds and makes it available to users via a handy API. The pipeline infrastructure is built from popular open-source projects.

**Access the latest news and headlines in one place.** :muscle:
## Table of Contents
* [Architecture diagram](#architecture-diagram)
* [How it works](#how-it-works)
* [Data scraping](#data-scraping)
* [Data flow](#data-flow)
* [Data access](#data-access)
* [Prerequisites](#prerequisites)
* [Running project](#running-project)
* [Testing](#testing)
* [API service](#api-service)
* [References](#references)
* [Contributions](#contributions)
* [License](#license)
* [Contact](#contact)

## Architecture diagram
![MVP Architecture](./images/architecture_diagram.png)
## How it works
#### Data Scraping
An Airflow DAG is responsible for executing the Python scraping modules.
It runs every X minutes, producing micro-batches.
- The first task updates the **proxypool**. Using proxies in combination with rotating user agents can get the scraper past most anti-scraping measures and prevent it from being detected.
- The second task extracts news from the RSS feeds listed in the configuration file, validates their quality, and sends the records to **Kafka topic A**. The extraction uses validated proxies from the **proxypool**. A sketch of the DAG follows this list.
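For illustration only, a minimal sketch of what such a two-task DAG could look like. The helper modules (`proxypool`, `rss_scraper`), the task and topic names, and the 15-minute interval are hypothetical placeholders, not the repository's actual code:

```python
# Hypothetical sketch of the scraping DAG described above.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from proxypool import refresh_proxies          # hypothetical helper module
from rss_scraper import scrape_feeds_to_kafka  # hypothetical helper module

with DAG(
    dag_id="news_scraping",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=15),  # "every X minutes" micro-batches
    catchup=False,
) as dag:
    # Task 1: validate and refresh the proxy pool used by the scrapers.
    update_proxypool = PythonOperator(
        task_id="update_proxypool",
        python_callable=refresh_proxies,
    )

    # Task 2: pull the RSS feeds through validated proxies (with rotating
    # user agents), validate the records, and produce them to Kafka topic A.
    scrape_rss = PythonOperator(
        task_id="scrape_rss_feeds",
        python_callable=scrape_feeds_to_kafka,
        op_kwargs={"topic": "topic-a"},
    )

    update_proxypool >> scrape_rss
```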
#### Data flow
- Kafka Connect **Mongo Sink** consumes data from **Kafka topic A** and stores news in MongoDB, upserting on the **_id** field.
- **Debezium MongoDB Source** tracks a MongoDB replica set for document changes in databases and collections, recording those changes as events in **Kafka topic B**.
- Kafka Connect **Elasticsearch Sink** consumes data from **Kafka topic B** and upserts news into Elasticsearch. Replicating data between topics **A** and **B** keeps MongoDB and Elasticsearch synchronized. The Command Query Responsibility Segregation (CQRS) pattern allows separate models for updating and reading information.
- Kafka Connect **S3-Minio Sink** consumes records from **Kafka topic B** and stores them in MinIO (a high-performance object store) to ensure data persistence. A connector-registration sketch follows this list.
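Each sink and source above is a Kafka Connect connector registered through Connect's REST API. A rough sketch of registering the Mongo sink from Python; the hostnames, topic, database, and collection names here are assumptions, not the project's actual configuration:

```python
# Hypothetical registration of the Mongo sink via the Kafka Connect REST API.
import requests

connector = {
    "name": "mongo-sink",
    "config": {
        # Official MongoDB sink connector class.
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "topics": "topic-a",                        # Kafka topic A
        "connection.uri": "mongodb://mongo:27017",  # assumed service name
        "database": "news",                         # assumed database name
        "collection": "articles",                   # assumed collection name
    },
}

# POST /connectors creates the connector; Connect listens on 8083 by default.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
```

The Debezium MongoDB source, the Elasticsearch sink, and the S3-MinIO sink are registered the same way, each with its own connector class and configuration.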
#### Data access
- Data gathered by the previous steps can be accessed through the [API service](api) using public endpoints.

## Prerequisites
Software required to run the project. Install:
- [Docker](https://docs.docker.com/get-docker/) - allocate at least 8 GB of memory to Docker.
- [Python 3.8+ (pip)](https://www.python.org/)
- [docker-compose](https://docs.docker.com/compose/install/)

## Running project
The `manage.sh` script, a wrapper around `docker-compose`, works as the project's management tool.

- Build the project infrastructure
```sh
./manage.sh up
```
- Stop the project infrastructure
```sh
./manage.sh stop
```
- Delete the project infrastructure
```sh
./manage.sh down
```

## Testing
The `run_tests.sh` script executes unit tests against the Airflow scraping modules and the Django REST Framework applications.

```sh
./run_tests.sh
```

## API service
Read the detailed [documentation](api) on how to interact with the data collected by the pipeline using the **search** endpoints.

Example searches (a Python sketch follows the examples):
- see all news
```
http://127.0.0.1:5000/api/v1/news/
```
- search across the default `search_fields` (title and description) for news containing the phrase `Virat Kohli`
```
http://127.0.0.1:5000/api/v1/news/?search=Virat%20Kohli
```
- find news containing the `Kohli` phrase in their titles
```
http://127.0.0.1:5000/api/v1/news/?search=title|Kohli
```
- see all the English news containing the `Kohli` phrase
```
http://127.0.0.1:5000/api/v1/news/?search=kohli&language=en
```
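The same searches can be issued programmatically. A quick sketch, assuming the stack is running locally on port 5000 as above and that the endpoint returns a Django REST Framework-style paginated payload (an assumption, not confirmed by the docs quoted here):

```python
# Hypothetical client call against the search endpoint shown above.
import requests

resp = requests.get(
    "http://127.0.0.1:5000/api/v1/news/",
    params={"search": "title|Kohli", "language": "en"},
)
resp.raise_for_status()
# "results" is the usual DRF pagination key; adjust if the payload differs.
for article in resp.json().get("results", []):
    print(article.get("title"))
```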
## Contributions
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

1. Fork the Project
2. Create your Feature Branch (`git checkout -b new_feature`)
3. Commit your Changes (`git commit -m 'Add some features'`)
4. Push to the Branch (`git push origin new_feature`)
5. Open a Pull Request

## License
Distributed under the MIT License. See [LICENSE](LICENSE) for more information.

## Contact
Please feel free to contact me if you have any questions; I would love to hear your comments.

[LinkedIn](https://www.linkedin.com/in/surya-dev-01410072/) | [X](https://twitter.com/suryadev_99)