https://github.com/aryan4codes/stockio
StockIO is a real-time data streaming solution designed to process and analyze stock market data using Apache Kafka and AWS services.
- Host: GitHub
- URL: https://github.com/aryan4codes/stockio
- Owner: aryan4codes
- Created: 2024-07-23T20:40:39.000Z
- Default Branch: main
- Last Pushed: 2024-11-02T09:18:05.000Z
- Last Synced: 2025-02-08T17:30:14.169Z
- Topics: apache-kafka, aws-athena, aws-ec2, aws-glue, aws-s3
- Language: Jupyter Notebook
- Homepage:
- Size: 2.62 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
# StockIO: Real-Time Stock Market Data Streaming and Analysis with Kafka and AWS
*StockIO is a real-time data streaming solution designed to process and analyze stock market data using Apache Kafka and AWS services.*

## Project Overview
StockIO is a real-time streaming application that simulates stock market data and processes it with Apache Kafka and several AWS services:
- Kafka runs on an AWS EC2 instance.
- The processed data is stored in Amazon S3.
- The stored data is cataloged with AWS Glue and queried with Amazon Athena.

## Data
1. **Ensure the stock market dataset is available:**
   - The project uses the Kaggle Stock Market Dataset; a copy is provided in the `/data` folder.
   - The dataset has the following columns:
     - `Index`
     - `Date`
     - `Open`
     - `High`
     - `Low`
     - `Close`
     - `Adj Close`
     - `Volume`
     - `CloseUSD`
2. **Implement the sleep function:**
   - To simulate real-time data flow into Kafka, the producer script includes a sleep call that introduces a delay between sending entries, mimicking live streaming (see the sketch after this list).
3. **Execute the producer script:**
   - Run the script that sends data to the Kafka topic, with the sleep function applied.
4. **Execute the consumer script:**
   - Run the script that reads data from the Kafka topic and stores it in S3.
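
The sketch below illustrates steps 1–3: it loads the dataset with pandas and streams one record at a time to Kafka, sleeping between sends. This is a minimal sketch, not the repository's actual script; the CSV file name, topic name, and broker address are placeholders to replace with your own values.

```python
# producer.py - a minimal sketch; the CSV path, topic name, and broker
# address below are assumptions, not taken from the repository.
import json
import time

import pandas as pd
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="<EC2-public-DNS>:9092",  # assumed broker address
    # default=str keeps numpy values JSON-serializable
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

df = pd.read_csv("data/stocks.csv")  # hypothetical file name in /data

while True:
    # Send one random row at a time to mimic a live market feed.
    record = df.sample(1).to_dict(orient="records")[0]
    producer.send("stock-topic", value=record)  # assumed topic name
    time.sleep(1)  # the "sleep function" from step 2
```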

## Architecture
The project architecture is designed to handle real-time stock market data and process it efficiently using the following components:
1. **Producer**: Simulates stock market data and sends it to a Kafka topic.
2. **Kafka**: Acts as the message broker to handle the stream of data.
3. **Consumer**: Reads data from the Kafka topic and stores it in Amazon S3.
4. **AWS S3**: Stores the processed stock market data.
5. **AWS Glue**: Crawls the data in S3 to create a metadata catalog.
6. **Amazon Athena**: Queries and analyzes the data stored in S3.

## How to Run the Project
### Set up Kafka on AWS EC2
1. **Launch an EC2 instance and install Kafka:**
   - Follow the instructions provided by Kafka to install it on your EC2 instance.
2. **Start the Kafka server:**
   - Start the server with a command such as `bin/kafka-server-start.sh config/server.properties`, then create the data topic (see the sketch below).
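
Once the broker is up, the topic the producer and consumer will use has to exist. A minimal sketch with kafka-python's admin client follows; the topic name and broker address are assumptions.

```python
# create_topic.py - a minimal sketch; topic name and broker address are assumptions.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="<EC2-public-DNS>:9092")

# A single-broker EC2 setup can only support replication factor 1.
admin.create_topics([
    NewTopic(name="stock-topic", num_partitions=1, replication_factor=1)
])
admin.close()
```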

### Run the Producer
1. **Ensure the stock market dataset is available:**
   - Make sure you have access to the dataset required for the producer script (see the Data section above).
2. **Execute the producer script:**
   - Run the script that sends data to the Kafka topic (a sketch is given in the Data section).

### Run the Consumer
1. **Execute the consumer script:**
   - Run the script that reads data from the Kafka topic and stores it in S3, as sketched below.
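
A minimal sketch of the consumer side, assuming kafka-python for reading and s3fs for writing; the bucket name, topic name, and broker address are placeholders, not values from the repository.

```python
# consumer.py - a minimal sketch; bucket, topic, and broker names are assumptions.
import json

import s3fs
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "stock-topic",                              # assumed topic name
    bootstrap_servers="<EC2-public-DNS>:9092",  # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

fs = s3fs.S3FileSystem()  # uses the default AWS credential chain

for count, message in enumerate(consumer):
    # Write each record as its own JSON object in the S3 bucket.
    with fs.open(f"s3://stockio-bucket/stock_market_{count}.json", "w") as f:
        json.dump(message.value, f)
```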

### Set up AWS Glue
1. **Create a Glue crawler:**
   - Configure the Glue crawler to crawl the S3 bucket and create a metadata catalog, as sketched below.
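
The crawler can be created in the AWS console or programmatically. Below is a minimal boto3 sketch; the crawler name, IAM role ARN, database name, and bucket path are all assumptions to replace with your own values.

```python
# glue_crawler.py - a minimal sketch; crawler name, role ARN, database name,
# and S3 path are assumptions, not taken from the repository.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="stockio-crawler",
    Role="arn:aws:iam::<account-id>:role/<glue-service-role>",  # assumed role
    DatabaseName="stockio_db",                                  # assumed database
    Targets={"S3Targets": [{"Path": "s3://stockio-bucket/"}]},  # assumed bucket
)
glue.start_crawler(Name="stockio-crawler")
```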

### Query Data with Amazon Athena
1. **Use Athena to query the data:**
   - Once the crawler has populated the catalog, use Amazon Athena to run SQL queries on the data stored in S3, as sketched below.
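
Queries can be run from the Athena console or programmatically. A minimal boto3 sketch follows; the database name, table name, and results location are assumptions.

```python
# athena_query.py - a minimal sketch; database, table, and output location
# are assumptions, not taken from the repository.
import time

import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT * FROM stockio_db.stock_market LIMIT 10;",  # assumed table
    QueryExecutionContext={"Database": "stockio_db"},               # assumed database
    ResultConfiguration={"OutputLocation": "s3://stockio-bucket/athena-results/"},
)

# Poll until the query finishes, then print the result rows.
qid = resp["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=qid)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```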

### Dependencies
- `pandas`
- `kafka-python`
- `s3fs`
- `boto3`
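
Assuming a standard Python 3 environment, all four packages can be installed in one step with `pip install pandas kafka-python s3fs boto3`.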