End to End Data Pipeline for News API using Kafka, Airflow, MongoDB, Elasticsearch and MinIO
- Host: GitHub
- URL: https://github.com/suryadev99/data-pipeline-news-api
- Owner: suryadev99
- License: MIT
- Created: 2024-07-16T12:08:24.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-07T11:16:02.000Z (10 months ago)
- Last Synced: 2025-01-08T18:29:49.391Z (10 months ago)
- Topics: airflow, dag, docker, elasticsearch, kafka, minio, mongodb
- Language: Python
- Homepage:
- Size: 1.9 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# End to End Data Pipeline for News API
**End to End Data Pipeline for News API** is an implementation of a data pipeline that consumes the latest news from RSS feeds and makes it available to users via a handy API.
The pipeline infrastructure is built using popular, open-source projects.
**Access the latest news and headlines in one place.** :muscle:
## Table of Contents
* [Architecture diagram](#architecture-diagram)
* [How it works](#how-it-works)
* [Data scraping](#data-scraping)
* [Data flow](#data-flow)
* [Data access](#data-access)
* [Prerequisites](#prerequisites)
* [Running project](#running-project)
* [Testing](#testing)
* [API service](#api-service)
* [References](#references)
* [Contributions](#contributions)
* [License](#license)
* [Contact](#contact)
## Architecture diagram

## How it works
#### Data Scraping
An Airflow DAG is responsible for executing the Python scraping modules.
It runs periodically every X minutes, producing micro-batches.
- The first task updates the **proxypool**. Using proxies in combination with rotating user agents helps the scraper get past most anti-scraping measures and avoid being detected.
- The second task extracts news from the RSS feeds listed in the configuration file, validates their quality, and sends the data to **Kafka topic A**. The extraction process uses validated proxies from the **proxypool**. A minimal DAG sketch is shown after this list.
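The layout below is only a rough sketch of that two-task structure, not the repository's actual code; the task names, callables, schedule, and any Kafka details are illustrative assumptions.
```python
# Minimal, illustrative two-task DAG: refresh the proxy pool, then scrape RSS feeds.
# Task names, schedule, and callables are assumptions, not the repository's modules.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def update_proxypool(**context):
    """Refresh the pool of validated proxies used by the scraper (placeholder)."""


def scrape_rss_feeds(**context):
    """Pull items from the configured RSS feeds, validate them, and publish
    the resulting records to Kafka topic A (placeholder)."""


with DAG(
    dag_id="news_scraping",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=30),  # "every X minutes" micro-batches
    catchup=False,
) as dag:
    refresh_proxies = PythonOperator(
        task_id="update_proxypool",
        python_callable=update_proxypool,
    )
    scrape = PythonOperator(
        task_id="scrape_rss_feeds",
        python_callable=scrape_rss_feeds,
    )

    # Proxies must be refreshed before the scraping task runs.
    refresh_proxies >> scrape
```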
#### Data flow
- Kafka Connect **Mongo Sink** consumes data from **Kafka topic A** and stores news in MongoDB, upserting on the **_id** field.
- **Debezium MongoDB Source** tracks a MongoDB replica set for document changes in databases and collections, recording those changes as events in **Kafka topic B**.
- Kafka Connect **Elasticsearch Sink** consumes data from **Kafka topic B** and upserts news in Elasticsearch. Replicating data between topics **A** and **B** keeps MongoDB and Elasticsearch synchronized. The Command Query Responsibility Segregation (CQRS) pattern allows separate models for updating and reading information.
- Kafka Connect **S3-Minio Sink** consumes records from **Kafka topic B** and stores them in MinIO (high-performance object storage) to ensure data persistence. A sketch of how such a connector is registered follows this list.
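Kafka Connect connectors like the sinks above are typically registered by POSTing a JSON configuration to the Connect REST API. The snippet below is only a sketch of that pattern for a MongoDB sink; the endpoint, topic name, connection URI, and database/collection names are assumptions and do not come from this repository's configuration.
```python
# Illustrative only: register a MongoDB sink connector through the Kafka Connect
# REST API. Endpoint, topic, and connection settings are assumed values.
import requests

CONNECT_URL = "http://localhost:8083/connectors"  # assumed Connect REST endpoint

mongo_sink = {
    "name": "mongo-sink-news",
    "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "topics": "topic-a",                        # assumed name for Kafka topic A
        "connection.uri": "mongodb://mongo:27017",  # assumed MongoDB address
        "database": "news",
        "collection": "articles",
        # Upsert keyed on the record's _id, mirroring the behaviour described above.
        "document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.ProvidedInValueStrategy",
        "writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.ReplaceOneDefaultStrategy",
    },
}

resp = requests.post(CONNECT_URL, json=mongo_sink)
resp.raise_for_status()
print("Created connector:", resp.json()["name"])
```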
#### Data access
- Data gathered in the previous steps can be easily accessed through the [API service](api) using public endpoints.
## Prerequisites
Software required to run the project. Install:
- [Docker](https://docs.docker.com/get-docker/) - allocate at least 8 GB of memory to Docker.
- [Python 3.8+ (pip)](https://www.python.org/)
- [docker-compose](https://docs.docker.com/compose/install/)
## Running project
The `manage.sh` script, a wrapper around `docker-compose`, works as the project's management tool.
- Build project infrastructure
```sh
./manage.sh up
```
- Stop project infrastructure
```sh
./manage.sh stop
```
- Delete project infrastructure
```sh
./manage.sh down
```
## Testing
The `run_tests.sh` script executes unit tests against the Airflow scraping modules and the Django REST Framework applications.
```sh
./run_tests.sh
```
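The repository's actual test modules are not shown here; as a rough illustration of the kind of unit test such a script might run, a pytest-style check of a hypothetical RSS-item validator could look like this:
```python
# Illustrative pytest sketch; validate_item is a hypothetical helper, not the
# repository's actual scraping API.
import pytest


def validate_item(item: dict) -> bool:
    """Hypothetical validator: a news item needs a non-empty title and link."""
    return bool(item.get("title")) and bool(item.get("link"))


@pytest.mark.parametrize(
    "item,expected",
    [
        ({"title": "Kohli hits a century", "link": "https://example.com/1"}, True),
        ({"title": "", "link": "https://example.com/2"}, False),
        ({"title": "No link"}, False),
    ],
)
def test_validate_item(item, expected):
    assert validate_item(item) is expected
```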
## API service
Read the detailed [documentation](api) on how to interact with the data collected by the pipeline using the **search** endpoints; a small Python client sketch follows the examples below.
Example searches:
- see all news
```
http://127.0.0.1:5000/api/v1/news/
```
- with `search_fields` set to title and description, see all news containing the phrase `Virat Kohli`
```
http://127.0.0.1:5000/api/v1/news/?search=Virat%20Kohli
```
- find news containing the `Kohli` phrase in their titles
```
http://127.0.0.1:5000/api/v1/news/?search=title|Kohli
```
- see all the English news containing the phrase `Kohli`
```
http://127.0.0.1:5000/api/v1/news/?search=kohli&language=en
```
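The same endpoints can be queried programmatically. The snippet below is a small sketch that assumes the service is running locally on port 5000, as in the examples above; the `results` key assumes Django REST Framework-style pagination and may differ from the actual response shape.
```python
# Small client sketch for the search endpoint shown above.
import requests

BASE_URL = "http://127.0.0.1:5000/api/v1/news/"

# All English news whose indexed fields contain "kohli".
resp = requests.get(BASE_URL, params={"search": "kohli", "language": "en"})
resp.raise_for_status()

# "results" assumes DRF-style paginated output; adjust to the actual payload.
for article in resp.json().get("results", []):
    print(article)
```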
## Contributions
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
1. Fork the Project
2. Create your Feature Branch (`git checkout -b new_feature`)
3. Commit your Changes (`git commit -m 'Add some features'`)
4. Push to the Branch (`git push origin new_feature`)
5. Open a Pull Request
## License
Distributed under the MIT License. See [LICENSE](LICENSE) for more information.
## Contact
Please feel free to contact me if you have any questions; I would love to hear your comments.
[LinkedIn](https://www.linkedin.com/in/surya-dev-01410072/) · [@X](https://twitter.com/suryadev_99)