https://github.com/touradbaba/web_scraping
This repository contains a Python tool for scraping, storing, and visualizing Yahoo Finance stock data. Features automated scraping with GitHub Actions, data storage in HDF5 format, and data visualization using Matplotlib and Seaborn.
https://github.com/touradbaba/web_scraping
beautifulsoup data-visualization github-actions hdf5 python web-scraping
Last synced: 4 months ago
JSON representation
This repository contains a Python tool for scraping, storing, and visualizing Yahoo Finance stock data. Features automated scraping with GitHub Actions, data storage in HDF5 format, and data visualization using Matplotlib and Seaborn.
- Host: GitHub
- URL: https://github.com/touradbaba/web_scraping
- Owner: TouradBaba
- License: mit
- Created: 2024-08-20T22:40:16.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-11-16T18:24:51.000Z (over 1 year ago)
- Last Synced: 2025-05-01T14:36:38.504Z (about 1 year ago)
- Topics: beautifulsoup, data-visualization, github-actions, hdf5, python, web-scraping
- Language: Python
- Homepage:
- Size: 1.19 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Web Scraping Tool
## Introduction
This repository contains a Python tool designed for collecting, storing, and visualizing stock data from Yahoo Finance. This project integrates web scraping techniques, efficient data storage using the HDF5 format, and data visualization. Automation of the scraping process is achieved through GitHub Actions and Windows Task Scheduler.
## Features
- **Web Scraping**: Extracts stock data such as price, price change, and percentage change from Yahoo Finance.
- **Data Storage**: Utilizes HDF5 format for efficient storage and retrieval of large datasets.
- **Visualization**: Provides data visualizations using Matplotlib and Seaborn to analyze stock trends.
- **Automation**: Implements GitHub Actions and Windows Task Scheduler for scheduled data collection.
## Table of Contents
1. [Overview](#Overview)
2. [Python Script](#python-script)
3. [Data Storage in HDF5 Format](#data-storage-in-hdf5-format)
4. [Visualization](#Visualization)
5. [Repository Structure](#repository-structure)
6. [Getting Started](#getting-started)
7. [Clarification on Data Collection](#clarification-on-data-collection)
## Overview
This project involves scraping stock data from Yahoo Finance pages, processing the information, and visualizing it to discern trends and patterns.
### Objectives
- Develop a web scraping script to extract stock data.
- Store the data in an HDF5 format for efficient access.
- Create visualizations to analyze trends in stock data.
- Automate the scraping process using GitHub Actions and Windows Task Scheduler.
## Python Script
### Overview of the Scraping Approach
This repository contains two methods for scraping stock data from Yahoo Finance:
1. Using `yfinance` to retrieve stock data directly via ticker symbols.
2. Using BeautifulSoup to scrape data directly from Yahoo Finance web pages.
### Detailed Explanation of the Scripts
- **`yfinance` Integration**: This method uses `yfinance` to pull stock data using ticker symbols directly. It is a straightforward and efficient way to retrieve financial data, especially when working with well-known stocks.
- **BeautifulSoup-Based Scraper (`scraper.py`)**: This script is designed to scrape stock data directly from Yahoo Finance pages using BeautifulSoup. It parses HTML to extract stock prices, price changes, and percentage changes.
> **Note:** The `scraper.py` script demonstrates how to scrape data from Yahoo Finance using BeautifulSoup, but the repository primarily relies on `yfinance` for data collection. The scraped data is stored in an HDF5 file named `data.h5`.
### Automating the Scraping Process
#### Using GitHub Actions
The `auto_scrap.yml` file configures GitHub Actions to automate the scraping task at regular intervals.
#### Alternative Automation Method: Windows Task Scheduler
A batch file (`scraping.bat`) is provided to schedule scraping tasks on Windows systems.
## Data Storage in HDF5 Format
### Rationale for Using HDF5
HDF5 is chosen for its capability to efficiently handle large datasets, providing fast access and storage capabilities.
### Data Model Design
- **Structure of the HDF5 File**: Organizes data by date and company, with each dataset containing metadata and attributes.
- **Metadata and Attributes**: Stores additional information about the data for context and retrieval.
- **Storing Time Series Data**: Efficiently manages and retrieves time series data for analysis.
## Visualization
The `visualization.ipynb` notebook offers visual insights into the stock data, generating plots for stock prices, price changes, and percentage changes using Matplotlib and Seaborn. The visualizations include:
1. **Daily Stock Price Visualization**

2. **Hourly Stock Price Visualization**

### Interpretation of the Visualizations
1-Daily Visualizations:
- Consistent Patterns Across Companies: The visualizations for the daily data on short periods show that the three companies exhibit similar trends over time. This could indicate that the companies are reacting to broader market forces in a similar manner. For example, a general upward trend might suggest positive market conditions affecting all three companies, while synchronized drops might point to external factors impacting the entire sector.
- Smooth Temporal Changes: The daily data often shows smoother trends, with gradual changes rather than abrupt fluctuations. This could be due to the aggregation of data over a full day, which tends to average out short-term noise and highlights more significant movements.
2-Hourly Visualizations:
- Variation in Intra-Day Patterns: Unlike the daily visualizations on short periods, the hourly data reveals more distinct patterns for each company. This suggests that while the companies may follow similar daily trends, their intra-day price movements can differ due to factors like company-specific news, trading volumes, or other microeconomic events.
- More Volatile Changes: The hourly plots often display more volatility, with sharp increases or decreases in stock prices within short time frames. This reflects the dynamic nature of stock prices during trading hours, where prices can react rapidly to new information or market sentiment.
3-Conclusion:
- The daily visualizations provide a broad overview of how the companies' stocks perform over time, showing similar trends across companies.
- The hourly visualizations, however, reveal the uniqueness in how each company's stock behaves within a day, capturing the more nuanced and immediate reactions to market events.
These insights demonstrate the importance of visualizing data at different temporal resolutions, as daily and hourly views can offer complementary perspectives on stock performance. The daily data helps to identify overall trends, while the hourly data provides a closer look at the company's behavior within those trends.
## Repository Structure
- **.github/workflows**: Contains the `auto_scrap.yml` file for automating the scraping process using GitHub Actions.
- **Images**: Contains images from the `visualization.ipynb` notebook, such as daily and hourly stock price visualizations.
- **scraping_scripts**: Contains scripts for scraping stock data using `yfinance` and BeautifulSoup.
- **URLs**: Includes the `URLs.ipynb` notebook to store the URLs in a CSV file.
- **Tickers**: Includes the `Tickers.ipynb` notebook to store the tickers in a CSV file.
- **scraped_data**: Contains the HDF5 file with scraped data from Yahoo Finance, organized by date and company.
- **.gitattributes**: Defines attributes for the repository.
- **.gitignore**: Specifies files and directories to be ignored by version control.
- **read_hdf5_file.py**: Script for reading data stored in HDF5 format.
- **requirements.txt**: Lists the dependencies required for the project.
- **scraping.bat**: Batch file for automating the scraping process on Windows systems using Task Scheduler.
- **visualization.ipynb**: Jupyter Notebook for visualizing the scraped stock data.
## Getting Started
### Prerequisites
- Python 3.x (The project was developed using Python 3.11).
- Required libraries listed in `requirements.txt`.
### Installation
1. Clone the repository: `https://github.com/TouradBaba/web_scraping.git`
2. Install dependencies: `pip install -r requirements.txt`
3. Navigate to the repository directory: `cd `
## Clarification on Data Collection
The `scraper.py` script demonstrates how to scrape data from Yahoo Finance using BeautifulSoup, but the repository primarily relies on `yfinance` for data collection. The scraped data is stored in an HDF5 file named `data.h5`.