https://github.com/mohammad-malik/amazon-frequent-items-kafka
This repository implements frequent-itemset mining using the A-Priori and PCY algorithms on Apache Kafka. It uses a 15GB .json file sampled from the 100+GB Amazon_Reviews_Metadata dataset, and was developed as an assignment for the course Fundamentals of Big Data Analytics (DS2004).
- Host: GitHub
- URL: https://github.com/mohammad-malik/amazon-frequent-items-kafka
- Owner: mohammad-malik
- License: mit
- Created: 2024-04-13T09:58:36.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-11-23T10:02:12.000Z (2 months ago)
- Last Synced: 2024-12-01T06:40:47.700Z (2 months ago)
- Topics: a-priori-algorithm, amazon, metadata, pcy, python, similarity-search
- Language: Python
- Homepage:
- Size: 493 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Amazon Metadata Streaming Data Pipeline and Itemset Mining
#### This repository houses an implementation of finding frequent items utilising the A-Priori and PCY algorithms on Apache Kafka.
Using a 15GB .json file sampled from the 100+GB Amazon_Reviews dataset, it was developed as part of an assignment for the course Fundamentals of Big Data Analytics (DS2004).

### The project leverages:
1. Apache Kafka for robust real-time data streaming.
2. (Optional) Use of Azure VMs and Blobs, providing a scalable solution for large datasets.

## Repository Structure:
```
├── preprocessing.py             # Script for preprocessing data locally
├── sampling.py                  # Script used to randomly sample the original 100+GB dataset down to 15GB
├── preprocessing_for_azure.py   # Script for preprocessing and loading data to Azure Blob Storage
├── blob_to_kafka_producer.py    # Script for streaming data from Azure Blob to Kafka
├── consumer1.py                 # Kafka consumer implementing the Apriori algorithm
├── consumer2.py                 # Kafka consumer implementing the PCY algorithm
├── consumer3.py                 # Kafka consumer for anomaly detection
├── producer_for_1_2.py          # Kafka producer for the Apriori and PCY consumers
└── producer_for_3.py            # Kafka producer for the anomaly detection consumer
```

## Setup Instructions
### 1. Data Preparation
The first step is to download and preprocess the Amazon Metadata dataset.
- `preprocessing_for_azure.py` if using Azure,
- `preprocessing.py` if not.
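`sampling.py` is not reproduced here, but the sampling step that cut the 100+GB dataset down to a manageable size can be sketched as a single streaming pass (the function name, sampling rate, and seed below are illustrative assumptions, not the repository's actual code):

```python
import random

def sample_jsonl(in_path, out_path, keep_prob=0.15, seed=42):
    """Stream a JSON-lines file and keep each record with probability
    keep_prob, so the full dataset never has to fit in memory."""
    rng = random.Random(seed)
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if rng.random() < keep_prob:
                dst.write(line)
                kept += 1
    return kept
```

Reading line by line keeps memory usage flat regardless of input size, which matters when the source file is larger than RAM.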
### 2. Streaming Pipeline
Next, set up Kafka (and, optionally, Azure Blob Storage).
### 3. Consumer Applications
Then deploy the consumer scripts.
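As a rough sketch of what a consumer like `consumer1.py` might do over one window of baskets, here is a single Apriori-style pass for frequent pairs (the example data and support threshold are illustrative assumptions):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """One Apriori pass: count single items first, then count only
    pairs whose both members are already frequent."""
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    pair_counts = Counter()
    for basket in baskets:
        candidates = sorted(set(basket) & frequent_items)
        for pair in combinations(candidates, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}

baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
# Each pair appears in exactly 3 of the 5 baskets, so all three survive.
print(frequent_pairs(baskets, 3))
```

Pruning pairs down to combinations of already-frequent items is the core Apriori idea: a pair cannot be frequent unless both of its items are.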
## Technologies and Challenges:
### Used Technologies:
### Streaming Challenges and Solutions:
- Sliding Window Approach
- Approximation Techniques
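PCY's bucket hashing is one such approximation technique: on the first pass, every pair is hashed into a fixed-size bucket array, and only pairs landing in a frequent bucket become candidates for the exact second-pass count. A minimal sketch with integer item IDs (the bucket count, hash function, and data are illustrative assumptions, not the code in `consumer2.py`):

```python
from itertools import combinations

def pcy_first_pass(baskets, num_buckets=11):
    """PCY first pass: instead of one counter per pair, hash each pair
    into a small bucket array. A pair can only be frequent if its
    bucket total reaches the support threshold, so cold buckets prune
    the candidate space before the exact second pass."""
    buckets = [0] * num_buckets
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            buckets[(a * 31 + b) % num_buckets] += 1
    return buckets

def is_candidate(pair, buckets, min_support):
    """A pair survives the first pass only if its bucket is frequent."""
    a, b = sorted(pair)
    return buckets[(a * 31 + b) % len(buckets)] >= min_support
```

The bucket array is the memory win: it is a fixed size regardless of how many distinct pairs the stream contains.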
## Why This Implementation with Kafka and Sliding Window Approach?
This project leverages Apache Kafka and a sliding window approach for real-time data processing due to several key advantages:
### Scalability of Kafka:
Kafka's distributed architecture allows for horizontal scaling by adding more nodes to the cluster. This ensures the system can handle ever-increasing data volumes in e-commerce scenarios without performance degradation.
### Real-time Processing with Sliding Window:
Traditional batch processing wouldn't be suitable for real-time analytics. The sliding window approach, implemented within Kafka consumers, enables processing data chunks (windows) as they arrive in the stream. This provides near real-time insights without waiting for the entire dataset.
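The sliding window described above can be sketched with a bounded deque of recent batches, so the running counts always reflect only the latest N batches (the class name, window size, and batch shape are assumptions for illustration):

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Maintain item counts over the last `window_size` batches:
    when a new batch arrives and the window is full, the oldest
    batch's counts are subtracted before the new ones are added."""

    def __init__(self, window_size):
        self.batches = deque(maxlen=window_size)
        self.counts = Counter()

    def add_batch(self, batch):
        if len(self.batches) == self.batches.maxlen:
            self.counts.subtract(self.batches[0])  # expire the oldest batch
        batch_counts = Counter(batch)
        self.batches.append(batch_counts)  # deque drops the expired batch
        self.counts.update(batch_counts)
        self.counts += Counter()  # discard zero and negative entries
```

Because expired counts are subtracted incrementally, each new batch costs work proportional to the batch size, not to the whole window.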
### Low Latency with Kafka:
Kafka's high throughput and low latency are crucial for e-commerce applications. With minimal delays in data processing, businesses can gain quicker insights into customer behavior and product trends, allowing for faster decision-making.
While Azure Blob Storage provides excellent cloud storage for the preprocessed data, and Azure VMs allow for easier clustering, it's Kafka that facilitates the real-time processing aspects crucial for this assignment's goals. The combination of Kafka's streaming capabilities and the sliding window approach within consumers unlocks the power of real-time analytics for e-commerce data.
## Team:
- **Manal Aamir**: [GitHub](https://github.com/manal-aamir)
- **Mohammad Malik**: [GitHub](https://github.com/mohammad-malik)
- **Aqsa Fayaz**: [GitHub](https://github.com/Aqsa-Fayaz)