End to End Data Pipeline for News API using Kafka, Airflow, MongoDB, Elasticsearch and MinIO
- Host: GitHub
- URL: https://github.com/suryadev99/data-pipeline-news-api
- Owner: suryadev99
- License: MIT
- Created: 2024-07-16T12:08:24.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-07T11:16:02.000Z (10 months ago)
- Last Synced: 2025-01-08T18:29:49.391Z (10 months ago)
- Topics: airflow, dag, docker, elasticsearch, kafka, minio, mongodb
- Language: Python
- Homepage:
- Size: 1.9 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# End to End Data Pipeline for News API
**End to End Data Pipeline for News API** is an implementation of a data pipeline that consumes the latest news from RSS feeds and makes it available to users via a handy API.
The pipeline infrastructure is built using popular, open-source projects.
**Access the latest news and headlines in one place.** :muscle:
## Table of Contents
* [Architecture diagram](#architecture-diagram)
* [How it works](#how-it-works)
* [Data scraping](#data-scraping)
* [Data flow](#data-flow)
* [Data access](#data-access)
* [Prerequisites](#prerequisites)
* [Running project](#running-project)
* [Testing](#testing)
* [API service](#api-service)
* [References](#references)
* [Contributions](#contributions)
* [License](#license)
* [Contact](#contact)
## Architecture diagram

## How it works
#### Data Scraping
An Airflow DAG is responsible for executing the Python scraping modules.
It runs periodically every X minutes, producing micro-batches.
- The first task updates the **proxypool**. Using proxies in combination with rotating user agents helps the scraper get past most anti-scraping measures and avoid being detected.
- The second task extracts news from the RSS feeds listed in the configuration file, validates their quality, and sends the data to **Kafka topic A**. The extraction process uses validated proxies from the **proxypool**. A minimal DAG sketch is shown after this list.
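The layout below is only a rough sketch of that two-task structure, not the repository's actual code; the task names, callables, schedule, and any Kafka details are illustrative assumptions.
```python
# Minimal, illustrative two-task DAG: refresh the proxy pool, then scrape RSS feeds.
# Task names, schedule, and callables are assumptions, not the repository's modules.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def update_proxypool(**context):
    """Refresh the pool of validated proxies used by the scraper (placeholder)."""


def scrape_rss_feeds(**context):
    """Pull items from the configured RSS feeds, validate them, and publish
    the resulting records to Kafka topic A (placeholder)."""


with DAG(
    dag_id="news_scraping",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=30),  # "every X minutes" micro-batches
    catchup=False,
) as dag:
    refresh_proxies = PythonOperator(
        task_id="update_proxypool",
        python_callable=update_proxypool,
    )
    scrape = PythonOperator(
        task_id="scrape_rss_feeds",
        python_callable=scrape_rss_feeds,
    )

    # Proxies must be refreshed before the scraping task runs.
    refresh_proxies >> scrape
```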
#### Data flow
- Kafka Connect **Mongo Sink** consumes data from **Kafka topic A** and stores news in MongoDB, upserting on the **_id** field.
- **Debezium MongoDB Source** tracks a MongoDB replica set for document changes in databases and collections, recording those changes as events in **Kafka topic B**.
- Kafka Connect **Elasticsearch Sink** consumes data from **Kafka topic B** and upserts news in Elasticsearch. Replicating data between topics **A** and **B** keeps MongoDB and Elasticsearch synchronized. The Command Query Responsibility Segregation (CQRS) pattern allows separate models for updating and reading information.
- Kafka Connect **S3-Minio Sink** consumes records from **Kafka topic B** and stores them in MinIO (high-performance object storage) to ensure data persistence. A sketch of how such a connector is registered follows this list.
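Kafka Connect connectors like the sinks above are typically registered by POSTing a JSON configuration to the Connect REST API. The snippet below is only a sketch of that pattern for a MongoDB sink; the endpoint, topic name, connection URI, and database/collection names are assumptions and do not come from this repository's configuration.
```python
# Illustrative only: register a MongoDB sink connector through the Kafka Connect
# REST API. Endpoint, topic, and connection settings are assumed values.
import requests

CONNECT_URL = "http://localhost:8083/connectors"  # assumed Connect REST endpoint

mongo_sink = {
    "name": "mongo-sink-news",
    "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "topics": "topic-a",                        # assumed name for Kafka topic A
        "connection.uri": "mongodb://mongo:27017",  # assumed MongoDB address
        "database": "news",
        "collection": "articles",
        # Upsert keyed on the record's _id, mirroring the behaviour described above.
        "document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.ProvidedInValueStrategy",
        "writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.ReplaceOneDefaultStrategy",
    },
}

resp = requests.post(CONNECT_URL, json=mongo_sink)
resp.raise_for_status()
print("Created connector:", resp.json()["name"])
```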
#### Data access
- Data gathered in the previous steps can be easily accessed through the [API service](api) using public endpoints.
## Prerequisites
Software required to run the project. Install:
- [Docker](https://docs.docker.com/get-docker/) - allocate at least 8 GB of memory to Docker.
- [Python 3.8+ (pip)](https://www.python.org/)
- [docker-compose](https://docs.docker.com/compose/install/)
## Running project
The `manage.sh` script, a wrapper around `docker-compose`, works as the project's management tool.
- Build project infrastructure
```sh
./manage.sh up
```
- Stop project infrastructure
```sh
./manage.sh stop
```
- Delete project infrastructure
```sh
./manage.sh down
```
## Testing
The `run_tests.sh` script executes unit tests against the Airflow scraping modules and the Django REST Framework applications.
```sh
./run_tests.sh
```
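The repository's actual test modules are not shown here; as a rough illustration of the kind of unit test such a script might run, a pytest-style check of a hypothetical RSS-item validator could look like this:
```python
# Illustrative pytest sketch; validate_item is a hypothetical helper, not the
# repository's actual scraping API.
import pytest


def validate_item(item: dict) -> bool:
    """Hypothetical validator: a news item needs a non-empty title and link."""
    return bool(item.get("title")) and bool(item.get("link"))


@pytest.mark.parametrize(
    "item,expected",
    [
        ({"title": "Kohli hits a century", "link": "https://example.com/1"}, True),
        ({"title": "", "link": "https://example.com/2"}, False),
        ({"title": "No link"}, False),
    ],
)
def test_validate_item(item, expected):
    assert validate_item(item) is expected
```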
## API service
Read the detailed [documentation](api) on how to interact with the data collected by the pipeline using the **search** endpoints; a small Python client sketch follows the examples below.
Example searches:
- see all news
```
http://127.0.0.1:5000/api/v1/news/
```
- with `search_fields` set to title and description, see all news containing the phrase `Virat Kohli`
```
http://127.0.0.1:5000/api/v1/news/?search=Virat%20Kohli
```
- find news containing the `Kohli` phrase in their titles
```
http://127.0.0.1:5000/api/v1/news/?search=title|Kohli
```
- see all the English news containing the phrase `Kohli`
```
http://127.0.0.1:5000/api/v1/news/?search=kohli&language=en
```
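The same endpoints can be queried programmatically. The snippet below is a small sketch that assumes the service is running locally on port 5000, as in the examples above; the `results` key assumes Django REST Framework-style pagination and may differ from the actual response shape.
```python
# Small client sketch for the search endpoint shown above.
import requests

BASE_URL = "http://127.0.0.1:5000/api/v1/news/"

# All English news whose indexed fields contain "kohli".
resp = requests.get(BASE_URL, params={"search": "kohli", "language": "en"})
resp.raise_for_status()

# "results" assumes DRF-style paginated output; adjust to the actual payload.
for article in resp.json().get("results", []):
    print(article)
```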
## Contributions
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
1. Fork the Project
2. Create your Feature Branch (`git checkout -b new_feature`)
3. Commit your Changes (`git commit -m 'Add some features'`)
4. Push to the Branch (`git push origin new_feature`)
5. Open a Pull Request
## License
Distributed under the MIT License. See [LICENSE](LICENSE) for more information.
## Contact
Please feel free to contact me if you have any questions; I would love to hear your comments.
[LinkedIn](https://www.linkedin.com/in/surya-dev-01410072/) · [@X](https://twitter.com/suryadev_99)