# Amazon Metadata Streaming Data Pipeline and Itemset Mining

#### This repository houses an implementation of frequent itemset mining using the A-Priori and PCY algorithms on Apache Kafka.
Using a 15GB .json file sampled from the 100+GB Amazon_Reviews dataset, it was developed as part of an assignment for the course Fundamentals of Big Data Analytics (DS2004).

### The project leverages:
1. Apache Kafka for robust real-time data streaming.
2. Azure VMs and Blob Storage (optional), providing a scalable solution for large datasets.

## Repository Structure:
```
├── preprocessing.py            # Preprocesses data locally
├── preprocessing_for_azure.py  # Preprocesses data and loads it to Azure Blob Storage
├── sampling.py                 # Randomly samples the original 100+GB dataset down to 15GB
├── blob_to_kafka_producer.py   # Streams data from Azure Blob to Kafka
├── producer_for_1_2.py         # Kafka producer for the Apriori and PCY consumers
├── producer_for_3.py           # Kafka producer for the anomaly detection consumer
├── consumer1.py                # Kafka consumer implementing the Apriori algorithm
├── consumer2.py                # Kafka consumer implementing the PCY algorithm
└── consumer3.py                # Kafka consumer for anomaly detection
```

## Setup Instructions
### 1. Data Preparation
The first step is to download and preprocess the Amazon Metadata dataset.

- Download the dataset from the provided Amazon link, then preprocess it using EITHER of:
  - preprocessing_for_azure.py if using Azure,
  - preprocessing.py if not.
- Upload the preprocessed data to Azure Blob Storage, setting the blob container and connection string in the script. (Skip this step if not using Azure.)
- The original dataset's size necessitated sampling for efficient analysis; we sampled at random to ensure a good mix of metadata (see the sketch below).
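
A minimal sketch of how such line-level sampling might look, assuming the dataset is newline-delimited JSON (one record per line); the file names and keep rate below are illustrative placeholders, not the repository's actual parameters:

```python
import json
import random

# Hypothetical paths and rate -- sampling.py may use different values.
SOURCE = "Amazon_Reviews_Metadata.json"
SAMPLE = "amazon_metadata_sample.json"
KEEP_RATE = 0.15  # roughly 15GB out of 100+GB

random.seed(42)  # fixed seed for a reproducible sample

with open(SOURCE, "r", encoding="utf-8") as src, \
     open(SAMPLE, "w", encoding="utf-8") as dst:
    for line in src:
        # Keep each record independently with probability KEEP_RATE, so the
        # sample preserves the original mix of metadata without ever loading
        # the 100+GB file into memory.
        if random.random() < KEEP_RATE:
            try:
                record = json.loads(line)  # drop malformed lines early
            except json.JSONDecodeError:
                continue
            dst.write(json.dumps(record) + "\n")
```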

### 2. Streaming Pipeline
Next up is setting up Kafka (and optionally Azure Blob Storage):

- Deploy Apache Kafka and ensure the Kafka brokers are accessible.
- Modify blob_to_kafka_producer.py with your Azure Blob Storage connection details and Kafka bootstrap servers.
- Run blob_to_kafka_producer.py to stream data from Azure Blob Storage to Kafka (a sketch of this step follows the list).
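
The streaming step might look roughly like the sketch below, using the `azure-storage-blob` and `kafka-python` packages; the connection string, container, and topic names are placeholders to substitute with your own:

```python
import json

from azure.storage.blob import BlobServiceClient
from kafka import KafkaProducer

# Placeholder configuration -- substitute your own values.
AZURE_CONNECTION_STRING = "<your-azure-blob-connection-string>"
CONTAINER_NAME = "amazon-metadata"
KAFKA_BOOTSTRAP = "localhost:9092"
TOPIC = "amazon-metadata"

blob_service = BlobServiceClient.from_connection_string(AZURE_CONNECTION_STRING)
container = blob_service.get_container_client(CONTAINER_NAME)

producer = KafkaProducer(
    bootstrap_servers=KAFKA_BOOTSTRAP,
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

for blob in container.list_blobs():
    # Download each preprocessed blob and forward its records to Kafka,
    # assuming one JSON record per line.
    text = container.download_blob(blob.name).readall().decode("utf-8")
    for line in text.splitlines():
        if line.strip():
            producer.send(TOPIC, json.loads(line))

producer.flush()  # make sure everything is delivered before exiting
```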
### 3. Consumer Applications
Then deploy the consumer scripts:

- consumer1.py: Consumes data for frequent itemset mining using Apriori; adjust the Kafka topic and MongoDB details (a wiring sketch follows the list).
- consumer2.py: Same setup as the Apriori consumer, but implements the PCY algorithm.
- consumer3.py: Implements anomaly detection; configure it for the relevant Kafka topic.
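
As a rough illustration (not the repository's exact code), a consumer such as consumer1.py might be wired like this with `kafka-python`: baskets accumulate into a window, and once it fills, an Apriori-style two-pass count reports the frequent pairs. The topic name, the `also_buy` basket field, window size, and support threshold are all assumptions:

```python
import json
from collections import Counter
from itertools import combinations

from kafka import KafkaConsumer

# Placeholder settings -- adjust the topic, servers, and thresholds.
consumer = KafkaConsumer(
    "amazon-metadata",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

WINDOW_SIZE = 1000   # baskets per window
MIN_SUPPORT = 10     # minimum in-window count to be "frequent"

window = []

for message in consumer:
    record = message.value
    # Assumed schema: each record carries a list of related product IDs
    # (here the field "also_buy") that we treat as a basket.
    basket = sorted(set(record.get("also_buy", [])))
    if len(basket) > 1:
        window.append(basket)

    if len(window) >= WINDOW_SIZE:
        # Apriori pass 1: count single items in the window.
        item_counts = Counter(item for b in window for item in b)
        frequent = {i for i, c in item_counts.items() if c >= MIN_SUPPORT}

        # Apriori pass 2: count only pairs whose items are both frequent.
        pair_counts = Counter()
        for b in window:
            candidates = [i for i in b if i in frequent]
            pair_counts.update(combinations(candidates, 2))

        frequent_pairs = {p: c for p, c in pair_counts.items() if c >= MIN_SUPPORT}
        print(f"Window complete: {len(frequent_pairs)} frequent pairs found")
        window.clear()  # tumbling window; a sliding window would evict gradually
```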

## Technologies and Challenges:
### Used Technologies:

- Azure Blob Storage: For storing and serving the preprocessed large-scale dataset.
- Apache Kafka: Utilized for robust real-time data streaming.
- Python: Scripting language for data processing and mining algorithms.
- MongoDB (optional): Recommended for storing consumer application outputs for persistent analysis.

### Streaming Challenges and Solutions:

- Sliding Window Approach
- Approximation Techniques (e.g. PCY's bucket hashing; see the sketch below)
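
For instance, PCY's hash-bucket trick (the basis of consumer2.py) bounds memory by approximation: on the first pass every pair is hashed into a fixed number of buckets, and on the second pass only pairs that land in frequent buckets are counted exactly. A minimal sketch, with an arbitrarily chosen bucket count and support threshold:

```python
from collections import Counter
from itertools import combinations

NUM_BUCKETS = 100_003   # fixed memory regardless of how many distinct pairs exist
MIN_SUPPORT = 3         # illustrative threshold

def bucket_of(pair):
    # Hash a pair of items into one of NUM_BUCKETS buckets.
    return hash(pair) % NUM_BUCKETS

def pcy_frequent_pairs(baskets):
    # Pass 1: count single items and hash every pair into a bucket.
    item_counts = Counter()
    bucket_counts = [0] * NUM_BUCKETS
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[bucket_of(pair)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= MIN_SUPPORT}
    # The bitmap: one flag per bucket, set if the bucket's total reached support.
    frequent_bucket = [c >= MIN_SUPPORT for c in bucket_counts]

    # Pass 2: count only candidate pairs -- both items frequent AND the
    # pair hashing to a frequent bucket. Everything else is pruned.
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            if (pair[0] in frequent_items and pair[1] in frequent_items
                    and frequent_bucket[bucket_of(pair)]):
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= MIN_SUPPORT}

# Tiny usage example with toy baskets.
baskets = [["a", "b", "c"], ["a", "b"], ["a", "b", "d"], ["b", "c"]]
print(pcy_frequent_pairs(baskets))  # {('a', 'b'): 3}
```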

## Why This Implementation with Kafka and Sliding Window Approach?

This project leverages Apache Kafka and a sliding window approach for real-time data processing due to several key advantages:

### Scalability of Kafka:
Kafka's distributed architecture allows for horizontal scaling by adding more nodes to the cluster. This ensures the system can handle ever-increasing data volumes in e-commerce scenarios without performance degradation.

### Real-time Processing with Sliding Window:
Traditional batch processing wouldn't be suitable for real-time analytics. The sliding window approach, implemented within the Kafka consumers, enables processing data chunks (windows) as they arrive in the stream. This provides near real-time insights without waiting for the entire dataset.
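
As a sketch of the mechanism (the window length and the pair-counting payload are illustrative), a consumer can keep the last N baskets in a deque and maintain running pair counts incrementally, adding the newest basket's pairs and subtracting the evicted basket's:

```python
from collections import Counter, deque
from itertools import combinations

WINDOW_SIZE = 500  # illustrative window length, in baskets

window = deque()
pair_counts = Counter()

def slide(basket):
    """Add one basket to the window, evicting the oldest when full."""
    if len(window) == WINDOW_SIZE:
        oldest = window.popleft()
        # Decrement counts from the evicted basket so the totals always
        # reflect only the last WINDOW_SIZE baskets seen on the stream.
        for pair in combinations(sorted(oldest), 2):
            pair_counts[pair] -= 1
            if pair_counts[pair] <= 0:
                del pair_counts[pair]
    window.append(basket)
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Feeding baskets one at a time keeps pair_counts continuously up to date,
# so frequent pairs can be reported at any moment without a full recount.
slide(["a", "b", "c"])
slide(["a", "b"])
print(pair_counts.most_common(3))
```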

### Low Latency with Kafka:
Kafka's high throughput and low latency are crucial for e-commerce applications. With minimal delays in data processing, businesses can gain quicker insights into customer behavior and product trends, allowing for faster decision-making.



While Azure Blob Storage provides excellent cloud storage for the preprocessed data, and Azure VMs allow for easier clustering, it's Kafka that facilitates the real-time processing aspects crucial for this assignment's goals. The combination of Kafka's streaming capabilities and the sliding window approach within the consumers unlocks the power of real-time analytics for e-commerce data.

## Team:
- **Manal Aamir**: [GitHub](https://github.com/manal-aamir)
- **Mohammad Malik**: [GitHub](https://github.com/mohammad-malik)
- **Aqsa Fayaz**: [GitHub](https://github.com/Aqsa-Fayaz)