https://github.com/vasanthakumar70/project_stockmarket
This project uses PySpark to create an ETL pipeline. It extracts stock market data from the Alpha Vantage API, transforms it, and then loads it into a SQL Server database for analysis.
etl-pipeline json mssql pyspark python
- Host: GitHub
- URL: https://github.com/vasanthakumar70/project_stockmarket
- Owner: vasanthakumar70
- Created: 2024-10-16T17:27:19.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-17T10:37:22.000Z (over 1 year ago)
- Last Synced: 2025-02-05T11:37:29.067Z (over 1 year ago)
- Topics: etl-pipeline, json, mssql, pyspark, python
- Language: Python
- Homepage:
- Size: 28.3 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Project_stockmarket
This project shows how to fetch stock market data from an API, process it with PySpark, and save the processed data in a SQL Server database.
## Project Summary
This project aims to simplify gathering stock data for many companies and to make that data easier to store in a relational database and analyze. The solution uses:
- **PySpark** for parallel data processing
- **Alpha Vantage API** for stock market data
- **SQL Server** for storing and managing structured data
The project is designed to run on a set schedule (e.g., daily) so that the database stays up to date with the latest stock market data.
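
As a rough illustration of gathering data for several companies, the sketch below builds one Alpha Vantage request URL per ticker. It assumes the public `https://www.alphavantage.co/query` endpoint and the `TIME_SERIES_DAILY` function; the project itself may go through RapidAPI instead, and `build_daily_url` and the ticker list are hypothetical names used only for this example.

```python
from urllib.parse import urlencode

# Assumption: the public Alpha Vantage endpoint, not necessarily the one the project calls.
BASE_URL = "https://www.alphavantage.co/query"


def build_daily_url(symbol: str, api_key: str) -> str:
    """Return the request URL for one symbol's daily price series."""
    params = {
        "function": "TIME_SERIES_DAILY",
        "symbol": symbol,
        "apikey": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Hypothetical watchlist; in the real script the key would come from .env.
urls = [build_daily_url(symbol, "demo") for symbol in ("IBM", "MSFT")]
```

Each URL can then be fetched with any HTTP client, one request per company.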
## Highlights
- **Data Extraction**: Fetches stock prices and volumes using the Alpha Vantage API.
- **Data Transformation**: Converts the raw JSON data into structured fields: date, opening price, closing price, highest price, lowest price, and volume.
- **Data Loading**: Saves the processed data into the SQL Server database.
- **Logging**: Logs each run to track successes and failures.
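
The transformation step can be pictured with a small, Spark-free sketch. It assumes the JSON layout that Alpha Vantage returns for `TIME_SERIES_DAILY` (a `Meta Data` object plus a `Time Series (Daily)` object keyed by date); `flatten_daily_series` and the sample values are made up for illustration and are not taken from the project's code.

```python
from typing import Any, Dict, List


def flatten_daily_series(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one Alpha Vantage daily response into a list of flat rows."""
    symbol = payload["Meta Data"]["2. Symbol"]
    series = payload["Time Series (Daily)"]
    rows = []
    # Sort by date so the output order does not depend on the API's ordering.
    for day, values in sorted(series.items()):
        rows.append({
            "symbol": symbol,
            "date": day,
            "open": float(values["1. open"]),
            "high": float(values["2. high"]),
            "low": float(values["3. low"]),
            "close": float(values["4. close"]),
            "volume": int(values["5. volume"]),
        })
    return rows


# Invented sample payload in the assumed response shape.
sample = {
    "Meta Data": {"2. Symbol": "IBM"},
    "Time Series (Daily)": {
        "2024-10-16": {"1. open": "232.00", "2. high": "234.50", "3. low": "231.00",
                       "4. close": "233.25", "5. volume": "3100000"},
        "2024-10-15": {"1. open": "230.00", "2. high": "233.00", "3. low": "229.50",
                       "4. close": "232.50", "5. volume": "2900000"},
    },
}
rows = flatten_daily_series(sample)
```

Rows in this shape could then be handed to Spark, for example through `spark.createDataFrame(rows)`, before being written to the database.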
### Process:
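
One way to picture the overall flow is the minimal sketch below: each symbol is extracted, transformed, and loaded in turn, and every success or failure is logged. This is not the project's actual `etl_process.py`; `run_etl` and the stub functions are hypothetical, and the real script would plug in the API call, the PySpark transformation, and the JDBC write.

```python
import logging
from typing import Any, Callable, Dict, Iterable, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl_process")


def run_etl(
    symbols: Iterable[str],
    extract: Callable[[str], Any],
    transform: Callable[[Any], List[dict]],
    load: Callable[[List[dict]], None],
) -> Dict[str, bool]:
    """Run extract -> transform -> load per symbol and record the outcome."""
    results = {}
    for symbol in symbols:
        try:
            rows = transform(extract(symbol))
            load(rows)
        except Exception:
            # One failing symbol is logged and skipped; the others still run.
            logger.exception("ETL failed for %s", symbol)
            results[symbol] = False
        else:
            logger.info("Loaded %d rows for %s", len(rows), symbol)
            results[symbol] = True
    return results


# Stub stages standing in for the API call and the database write.
def fake_extract(symbol: str) -> dict:
    if symbol == "BAD":
        raise ValueError("no data returned")
    return {"symbol": symbol, "prices": [1.0, 2.0]}


def fake_transform(raw: dict) -> List[dict]:
    return [{"symbol": raw["symbol"], "close": price} for price in raw["prices"]]


stored: List[dict] = []
status = run_etl(["IBM", "BAD"], fake_extract, fake_transform, stored.extend)
```

To write to a log file such as `etl_process.log`, a `logging.FileHandler` could be attached to the logger instead of relying on `basicConfig`.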

## Project Structure
```
.
├── etl_process.py # Main Python script
├── .env # Environment variables (API key, database credentials, etc.)
├── etl_process.log # Log file
├── requirements.txt # Required Python packages
├── README.md # This readme file
└── sqljdbc_12.8/ # JDBC driver for connecting PySpark to SQL Server
```
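
To show how the `.env` values and the JDBC driver might fit together, here is a hedged sketch that builds a SQL Server JDBC URL and connection properties from a mapping of settings. The variable names (`DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`) are assumptions, not the names the project actually uses, and `trustServerCertificate=true` is a convenience for local development only.

```python
from typing import Dict, Mapping, Tuple


def jdbc_settings(env: Mapping[str, str]) -> Tuple[str, Dict[str, str]]:
    """Build the JDBC URL and properties for a SQL Server connection."""
    url = (
        f"jdbc:sqlserver://{env['DB_HOST']}:{env.get('DB_PORT', '1433')};"
        f"databaseName={env['DB_NAME']};encrypt=true;trustServerCertificate=true"
    )
    properties = {
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        # Driver class shipped in the Microsoft JDBC driver jar.
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }
    return url, properties


# Hypothetical settings; the real script would pass os.environ after loading .env.
example_env = {
    "DB_HOST": "localhost",
    "DB_NAME": "stocks",
    "DB_USER": "etl_user",
    "DB_PASSWORD": "change-me",
}
url, props = jdbc_settings(example_env)
```

With PySpark, these values would typically be passed to `df.write.jdbc(url=url, table=..., mode="append", properties=props)`, with the driver jar from `sqljdbc_12.8/` made available to Spark (for instance through the `spark.jars` setting).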
### References & Downloads
- **Alpha Vantage API**: [RapidAPI](https://rapidapi.com/alphavantage/api/alpha-vantage/playground/)
- **Apache Spark**: [Spark Downloads](https://spark.apache.org/downloads.html)
- **Java Development Kit (JDK)**: [Oracle JDK 11 Downloads](https://www.oracle.com/java/technologies/javase/jdk11-archive-downloads.html)
- **Hadoop Winutils**: [Winutils GitHub Repository](https://github.com/cdarlint/winutils)