Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tashi-2004/apache-kafka-and-frequent-item-sets
This Bash script 🌟 automates the setup and execution of a data processing pipeline 🔄 using Apache Kafka 📦 and Python scripts 🐍, ensuring fault tolerance 🛡️ and streamlined management 🚀 of Kafka-based data pipelines 📊💡.
- Host: GitHub
- URL: https://github.com/tashi-2004/apache-kafka-and-frequent-item-sets
- Owner: tashi-2004
- Created: 2024-04-24T07:10:53.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-07-26T00:57:54.000Z (5 months ago)
- Last Synced: 2024-07-26T02:37:51.339Z (5 months ago)
- Topics: kafka, kafka-consumer, kafka-producer, kafka-streams, mongodb, mongodbcompass
- Language: Python
- Homepage:
- Size: 67.4 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Apache-Kafka-and-Frequent-Itemsets
## Overview
This project implements a streaming pipeline for processing and analyzing the Amazon Metadata dataset using techniques such as sampling, preprocessing, frequent itemset mining, and database integration.
## Why this project?
This project is designed to:
- Process and analyze the Amazon Metadata dataset in real-time using a streaming pipeline
- Discover frequent itemsets in the dataset using algorithms like Apriori and PCY
- Optimize data processing using the Bloom Filter data structure (see the sketch after this list)
- Store and visualize the results in a MongoDB Compass database
- Handle large datasets and scale the processing using Kafka
- Perform real-time analytics and gain insights from the dataset
- Preprocess the dataset to prepare it for analysis
- Use multiple consumer applications to perform different tasks and analyses on the data stream
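One of the goals above is stream pre-filtering with a Bloom filter. As a point of reference, here is a minimal sketch of that data structure (the class, sizes, and product IDs are illustrative assumptions; the repository's `Bloomfilter.py` may be organized differently):

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: a bit array plus k salted hashes.

    It can report false positives but never false negatives, which
    makes it a cheap pre-filter for a high-volume stream.
    """

    def __init__(self, size=10_000, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# Example: skip product IDs already seen on the stream.
seen = BloomFilter()
seen.add("B00005N5PF")
print("B00005N5PF" in seen)  # True
print("B000FI73MA" in seen)  # False (with high probability)
```

The size/number-of-hashes trade-off controls the false-positive rate; larger bit arrays and more hashes reduce collisions at the cost of memory and CPU.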
## Files
- `sampling.py`: Python script for sampling the Amazon Metadata dataset.
- `pre-processing.py`: Python script for preprocessing the sampled dataset.
- `producer.py`: Python script for the producer application in the streaming pipeline setup.
- `consumer1.py`, `consumer2.py`, `consumer3.py`: Python scripts for the consumer applications subscribing to the producer's data stream.
- `Apriori.py`: Python script implementing the Apriori algorithm for frequent itemset mining.
- `PCY.py`: Python script implementing the PCY algorithm for frequent itemset mining.
- `Bloomfilter.py`: Python script for implementing the Bloom Filter data structure.
- `Database_Apriori.py`: Python script for writing Apriori results to the MongoDB database.
- `Database_PCY.py`: Python script for writing PCY results to the MongoDB database.
- `Database_Bloomfilter.py`: Python script for writing Bloom Filter results to the MongoDB database.
## Description
### Sampling and Preprocessing
The `sampling.py` script samples the Amazon Metadata dataset, and `pre-processing.py` then preprocesses the sampled data.
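For illustration, here is one way such a sample can be taken without loading the full dataset into memory (a reservoir-sampling sketch; the file name and the JSON-lines layout are assumptions, and the repository's `sampling.py` may use a different approach):

```python
import json
import random

def reservoir_sample(path, k, seed=42):
    """Uniformly sample k records from a large JSON-lines file."""
    random.seed(seed)
    sample = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            if i < k:
                sample.append(record)      # fill the reservoir first
            else:
                j = random.randint(0, i)   # replace with decreasing odds
                if j < k:
                    sample[j] = record
    return sample

if __name__ == "__main__":
    # Path is hypothetical; point it at your copy of the dataset.
    for record in reservoir_sample("amazon_metadata.json", k=1000):
        print(record.get("asin"))
```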
### Streaming Pipeline Setup
The `producer.py` script generates a data stream, while `consumer1.py`, `consumer2.py`, and `consumer3.py` subscribe to this stream and perform their respective tasks.
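A hedged sketch of how producer and consumer scripts are commonly wired up with the `kafka-python` client (the broker address, topic name, and record fields below are illustrative, not taken from the repository):

```python
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "amazon-metadata"  # hypothetical topic name

# Producer side: serialize each preprocessed record as JSON and publish it.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"asin": "B00005N5PF", "related": ["B000FI73MA"]})
producer.flush()

# Consumer side: each consumer script deserializes records and hands them
# to its own analysis (Apriori, PCY, or the Bloom filter).
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=10_000,  # stop iterating after 10s of silence
)
for message in consumer:
    print(message.value)
```

Running the three consumers in separate processes (optionally in distinct consumer groups) is what lets each one analyze the full stream independently.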
### Frequent Itemset Mining
`Apriori.py` and `PCY.py` implement two different algorithms for frequent itemset mining, while `Bloomfilter.py` provides a Bloom filter for efficient data processing.
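To make the mining step concrete, here is a minimal, self-contained Apriori sketch over in-memory baskets; the repository's `Apriori.py` presumably adapts this to the streamed records and its own support threshold:

```python
from itertools import combinations
from collections import Counter

def apriori(transactions, min_support):
    """Return all frequent itemsets (frozensets) with their counts.

    Classic level-wise Apriori: a k-itemset can only be frequent if
    every (k-1)-subset of it is frequent, so candidates are pruned
    before counting.
    """
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items.
    counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([i]): c for i, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        items = sorted({i for s in frequent for i in s})
        # Candidate generation with subset pruning.
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in result
                             for sub in combinations(c, k - 1))]
        # Pass k: count each surviving candidate per containing basket.
        counts = Counter(c for c in candidates
                         for t in transactions if c <= t)
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(baskets, min_support=3))
```

PCY follows the same level-wise shape, but its first pass additionally hashes every pair into a bucket array; the second pass counts only pairs that fall into frequent buckets, which cuts the number of candidates that must be kept in memory.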
### Database Integration
The `Database_Apriori.py`, `Database_PCY.py`, and `Database_Bloomfilter.py` scripts connect to a MongoDB database and store the results of the analysis.
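A minimal sketch of such an integration with `pymongo` (the connection string, database, and collection names are assumptions, not taken from the repository):

```python
from pymongo import MongoClient

# Connection string, database, and collection names are hypothetical.
client = MongoClient("mongodb://localhost:27017")
collection = client["streaming_pipeline"]["apriori_results"]

# Store each frequent itemset together with its support count.
results = {frozenset({"a", "b"}): 3, frozenset({"a", "c"}): 3}
collection.insert_many([
    {"itemset": sorted(itemset), "support": count}
    for itemset, count in results.items()
])

print(collection.count_documents({}))  # verify the write
```

Once inserted, these documents are what MongoDB Compass displays when you connect it to the same database.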
### MongoDB Compass
After the database integration step, the output written by `Database_Apriori.py`, `Database_PCY.py`, and `Database_Bloomfilter.py` can be browsed in MongoDB Compass.
### Analysis
This project performs simple analysis on the dataset, including:
- Frequent itemset mining using the Apriori and PCY algorithms
## Usage
1. Run `sampling.py` followed by `pre-processing.py` to sample and preprocess the dataset.
2. Run `producer.py` to start the data stream.
3. Run `consumer1.py`, `consumer2.py`, and `consumer3.py` to subscribe to the data stream and perform analysis.
4. Optionally, run `Apriori.py`, `PCY.py`, and `Bloomfilter.py` for additional analysis.
5. Run `Database_Apriori.py`, `Database_PCY.py`, and `Database_Bloomfilter.py` to integrate the results with the MongoDB database.
## Requirements
- Python 3.x
- Kafka
- MongoDB Compass & MongoDB Connector
- Other dependencies as specified in the code
## Note
- Remove the numbers at the start of the file names (e.g., rename `(3) producer.py` to `producer.py`); they are included only for clarity.
- Run the `BONUS.sh` script from the terminal with `./BONUS.sh`.
## Contributors
- M.Tashfeen Abbasi
- [Laiba Mazhar](https://github.com/laiba-mazhar)
- [Rafia Khan](https://github.com/rakhan2)

Thank you!