Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/collaborative-ai/earthquake-forecasting
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/collaborative-ai/earthquake-forecasting
- Owner: Collaborative-AI
- Created: 2021-01-16T16:20:39.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2024-08-10T02:19:51.000Z (5 months ago)
- Last Synced: 2024-08-10T18:27:41.301Z (4 months ago)
- Language: Python
- Size: 291 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# SmartQuake
SmartQuake is a research project aimed at predicting earthquakes on a global scale using modern machine learning techniques. It compiles 14 global earthquake catalogs into a single robust dataset that can be leveraged for future earthquake prediction research.
---
## Table of Contents
1. [Dataset Scraping](#dataset-scraping)
2. [Data Processing](#data-processing)
3. [Merging](#merging)
4. [Dataset Checkpoints](#dataset-checkpoints)

---
![SmartQuake pipeline](src/smartquake_pipeline.png)
# Dataset Scraping
### Overview
This step involves scraping earthquake data from various sources, including text files, web pages, and PDFs. It utilizes **BeautifulSoup** for web scraping, **Pandas** for data manipulation, and **Tabula** for PDF data extraction.
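For orientation, the snippet below is a minimal sketch of the kind of scraping these libraries enable. It is **not** the project's `Scraper` class: the URL, table layout, and column names are placeholders, and it assumes the source page publishes its catalog as a plain HTML table.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical catalog page; the real sources are configured in src/main.py.
URL = "http://example.com/earthquake-catalog"

response = requests.get(URL, timeout=30)
response.raise_for_status()

# Parse the page and pull rows out of the first HTML table (assumed to exist).
soup = BeautifulSoup(response.text, "html.parser")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in soup.find("table").find_all("tr")
]
rows = [r for r in rows if r]  # drop the header row and any empty rows

# Assemble into a DataFrame with placeholder column names and save it.
df = pd.DataFrame(rows, columns=["Date", "Magnitude", "Latitude", "Longitude"])
df.to_csv("example_catalog.csv", index=False)
```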
### Installation
1. From Google Drive, download the raw datasets and place them under the `src/scraper/.../raw` folder.
2. Install the required dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Run the scraping script:
   ```bash
   python src/main.py
   ```
   The scraped datasets will be saved under `src/scraper/.../clean`.
### Usage
1. **Initialization**: Create an instance of the `Scraper` class with parameters:
- `input_path`: Path to the input file (for text and PDF sources).
- `output_path`: Path to save the output CSV.
- `url`: Webpage URL to scrape (for web sources).
- `start_time` and `end_time`: Date range for filtering data.
- `header`: Column names for the output CSV.
- `separator`: Character separating fields in text files (default is a space).
2. **Scraping**:
- `find_quakes_txt(num_skips=0)`: For text files. `num_skips` skips initial lines.
- `find_quakes_web()`: For web pages. Scrapes data based on the body tag and predefined header.
3. **Example**:
```python
scraper = Scraper(
    input_path='input.txt',
    output_path='output.csv',
    url='http://example.com',
    header=['Date', 'Magnitude', 'Location'],
)
scraper.find_quakes_txt(num_skips=1)
scraper.find_quakes_web()
```

---
# Data Processing
### Overview
Data processing is the second step in the SmartQuake data pipeline. After the datasets are scraped, each one is converted into a standardized format so that all datasets share the same columns.
### Data Standardization
Processed earthquake CSV files will contain the following columns:
1. **Timestamp**: Stored as a `pd.Timestamp` string in UTC (format: YYYY-MM-DD HH:MM:SS.millisecond+00:00).
2. **Magnitude**: Moment magnitude (Mw).
3. **Latitude**: Range within [-90, 90].
4. **Longitude**: Range within [-180, 180].
5. **Depth**: Depth in kilometers (optional; may be `None` for older records).

All datasets are sorted chronologically and contain no duplicates.
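As an illustration of this target format (not the actual `DataProcessor` implementation), the sketch below standardizes a small made-up catalog: timestamps are parsed to UTC, coordinates are checked against their valid ranges, and the result is sorted and de-duplicated.

```python
import pandas as pd

# Illustrative raw rows (values are made up); real inputs come from src/scraper/.../clean.
raw = pd.DataFrame({
    "Timestamp": ["1950-08-15 14:09:30", "1950-08-15 14:09:30", "2011-03-11 05:46:24"],
    "Magnitude": [6.0, 6.0, 7.1],
    "Latitude": [28.4, 28.4, 38.3],
    "Longitude": [96.4, 96.4, 142.4],
    "Depth": [None, None, 29.0],
})

# Parse timestamps as UTC so they serialize as "YYYY-MM-DD HH:MM:SS...+00:00".
raw["Timestamp"] = pd.to_datetime(raw["Timestamp"], utc=True)

# Keep only physically valid coordinates.
raw = raw[raw["Latitude"].between(-90, 90) & raw["Longitude"].between(-180, 180)]

# Sort chronologically and drop exact duplicate rows, as the processed CSVs require.
processed = raw.sort_values("Timestamp").drop_duplicates().reset_index(drop=True)
processed.to_csv("example_processed.csv", index=False)
```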
### File Organization
- **data_processor.py**: Contains the `DataProcessor` class for standardized processing.
- **run_processor.py**: Runs the `DataProcessor` on all scraped datasets.
- **processed/**: Folder containing the processed output datasets (CSVs).

### Running Data Processing
1. Ensure that all `clean` datasets exist in the `src/scraper/.../clean` folder.
2. Verify that the `processed/` folder exists in `data_processing/`.
3. Run `run_processor.py`:
   ```bash
   python data_processing/run_processor.py
   ```
4. After completion, check for the processed CSVs in the `processed/` folder before proceeding to the merging step.
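Conceptually, `run_processor.py` is a driver that walks every scraped catalog and writes a standardized copy into `processed/`. The sketch below mimics that flow with a placeholder `standardize` function standing in for the real `DataProcessor` class, whose actual interface lives in `data_processing/data_processor.py`.

```python
from pathlib import Path
import pandas as pd

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for DataProcessor: UTC timestamps, chronological order, no duplicates."""
    df["Timestamp"] = pd.to_datetime(df["Timestamp"], utc=True)
    return df.sort_values("Timestamp").drop_duplicates().reset_index(drop=True)

clean_root = Path("src/scraper")                   # scraped catalogs sit in .../clean/ subfolders
processed_dir = Path("data_processing/processed")  # must already exist (step 2 above)

for clean_csv in clean_root.glob("**/clean/*.csv"):
    out_path = processed_dir / clean_csv.name
    standardize(pd.read_csv(clean_csv)).to_csv(out_path, index=False)
    print(f"Processed {clean_csv.name} -> {out_path}")
```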
---
# Merging
The merging process combines all processed datasets into a single file for machine learning model input. This step preserves the same columns and ensures chronological order without duplicates.
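In pandas terms, the idea is roughly as follows; this is a sketch, not the exact logic of `merge.py` and `helper.py`: load every processed catalog, concatenate, and enforce chronological order with exact duplicates removed.

```python
from pathlib import Path
import pandas as pd

processed_dir = Path("data_processing/processed")

# Load every processed catalog; each already uses the five standardized columns.
frames = [pd.read_csv(path) for path in sorted(processed_dir.glob("*.csv"))]

# Concatenate, restore chronological order, and drop exact duplicate events.
merged = pd.concat(frames, ignore_index=True)
merged["Timestamp"] = pd.to_datetime(merged["Timestamp"], utc=True)
merged = merged.sort_values("Timestamp").drop_duplicates().reset_index(drop=True)

merged.to_csv("Various-Catalogs.csv", index=False)
```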
### File Organization
- **helper.py**: Contains helper functions for merging.
- **merge.py**: Merges non-USGS/SAGE datasets into `Various-Catalogs.csv`.
- **usgs_pre_1950/**: Folder containing scripts for USGS data processing and merging.
- **final/**: Folder containing `usgs_sage_various_merge.py`, which merges all datasets into `Completed-Merge.csv`.

### Running Merge
1. **Compile Processed Datasets**: Ensure all processed datasets are in `data_processing/processed/` (excluding USGS/SAGE datasets).
2. **First Merge**: Run `merge.py` to create `Various-Catalogs.csv`.
3. **USGS Data Processing**:
- Download USGS and SAGE datasets from Google Drive and place them in `merge/usgs_pre_1950`.
- Run `preprocess.py` and `merge.py` to create `USGS_SAGE_Merged.csv`.
4. **Final Merge**: Run `usgs_sage_various_merge.py` to merge all datasets into `Completed-Merge.csv`.

---
# Dataset Checkpoints
Access the dataset at various stages of acquisition:
- [Scraper](https://drive.google.com/drive/folders/1okZ_2QW58CqQwPA8JIDIwmaBbcOtLIMp?usp=sharing)
- [Preprocessed](https://drive.google.com/drive/folders/1CnXaP9KgUxgQrrreYt3s1MSbCJKHmCQy?usp=sharing)
- [Merged (finalized) dataset](https://drive.google.com/drive/folders/1GUvjtBC2jBqHQGbAVy4rqSAd9fXe3sxT?usp=sharing)