
# SmartQuake

SmartQuake is a research project aimed at predicting earthquakes on a global scale using modern machine learning techniques. It compiles 14 global earthquake catalogs into a single robust dataset that can be leveraged for future earthquake-prediction research.

---

## Table of Contents

1. [Dataset Scraping](#dataset-scraping)
2. [Data Processing](#data-processing)
3. [Merging](#merging)
4. [Dataset Checkpoints](#dataset-checkpoints)

---

![SmartQuake data pipeline](src/smartquake_pipeline.png)

# Dataset Scraping

### Overview

This step involves scraping earthquake data from various sources, including text files, web pages, and PDFs. It utilizes **BeautifulSoup** for web scraping, **Pandas** for data manipulation, and **Tabula** for PDF data extraction.
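
The snippet below is a minimal, hypothetical illustration of those three extraction paths; the file names, URL, and `skiprows` value are placeholders, and the project's `Scraper` class wraps this kind of logic rather than exposing it directly:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabula import read_pdf  # provided by the tabula-py package

# Web page: fetch the HTML and split whitespace-separated rows out of the body.
html = requests.get('http://example.com/catalog').text
soup = BeautifulSoup(html, 'html.parser')
rows = [line.split() for line in soup.body.get_text().splitlines() if line.strip()]

# PDF: Tabula returns a list of DataFrames, one per table it detects.
pdf_tables = read_pdf('catalog.pdf', pages='all')

# Plain text: fixed-separator files load directly with pandas.
txt_df = pd.read_csv('catalog.txt', sep=' ', skiprows=1, header=None)
```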

### Installation

1. Download the raw datasets from Google Drive (see [Dataset Checkpoints](#dataset-checkpoints)) and place them under the `src/scraper/.../raw` folder.
2. Install the required dependencies:

```bash
pip install -r requirements.txt
```

3. Run the scraping script:

```bash
python src/main.py
```

The scraped datasets will be saved under `src/scraper/.../clean`.

### Usage

1. **Initialization**: Create an instance of the `Scraper` class with parameters:
- `input_path`: Path to the input file (for text and PDF sources).
- `output_path`: Path to save the output CSV.
- `url`: Webpage URL to scrape (for web sources).
- `start_time` and `end_time`: Date range for filtering data.
- `header`: Column names for the output CSV.
- `separator`: Character for separating data in text files (default is space).

2. **Scraping**:
- `find_quakes_txt(num_skips=0)`: for text files; `num_skips` sets how many leading lines (e.g., header rows) to skip.
- `find_quakes_web()`: for web pages; extracts data from the page body using the predefined header.

3. **Example**:

```python
scraper = Scraper(
    input_path='input.txt',
    output_path='output.csv',
    url='http://example.com',
    header=['Date', 'Magnitude', 'Location'],
)
scraper.find_quakes_txt(num_skips=1)  # parse a text file, skipping its header line
scraper.find_quakes_web()             # scrape the configured web page
```

---

# Data Processing

### Overview

Data processing is the second step in the SmartQuake data pipeline. After scraping, each dataset is converted into a standardized format in which all datasets share the same columns.

### Data Standardization

Processed earthquake CSV files will contain the following columns:

1. **Timestamp**: Stored as a `pd.Timestamp` string in UTC (format: `YYYY-MM-DD HH:MM:SS.sss+00:00`, where `.sss` is millisecond precision).
2. **Magnitude**: Moment magnitude (Mw).
3. **Latitude**: Range within [-90, 90].
4. **Longitude**: Range within [-180, 180].
5. **Depth**: Depth in kilometers (optional, may be `None` for older records).

All datasets are sorted chronologically and contain no duplicates.
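
As a rough sketch of what this standardization amounts to for a single catalog, assuming a raw DataFrame `raw` whose source column names (`time`, `mag`, `lat`, `lon`, `depth`) are hypothetical:

```python
import pandas as pd

def standardize(raw: pd.DataFrame) -> pd.DataFrame:
    """Map a raw catalog onto the five standard columns described above."""
    df = pd.DataFrame({
        'Timestamp': pd.to_datetime(raw['time'], utc=True),  # normalize to UTC
        'Magnitude': raw['mag'].astype(float),                # assumed already Mw
        'Latitude': raw['lat'].astype(float),
        'Longitude': raw['lon'].astype(float),
        'Depth': raw.get('depth'),                            # may be missing for old records
    })
    # Enforce the documented coordinate ranges.
    df = df[df['Latitude'].between(-90, 90) & df['Longitude'].between(-180, 180)]
    # Chronological order, no duplicates.
    return df.sort_values('Timestamp').drop_duplicates().reset_index(drop=True)
```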

### File Organization

- **data_processor.py**: Contains the `DataProcessor` class for standardized processing.
- **run_processor.py**: Runs the `DataProcessor` on all scraped datasets.
- **processed/**: Folder containing the processed output datasets (CSVs).

### Running Data Processing

1. Ensure that all `clean` datasets exist in the `src/scraper/.../clean` folder.
2. Verify that the `processed/` folder exists in `data_processing/`.
3. Run `run_processor.py`:

```bash
python data_processing/run_processor.py
```

4. After completion, check that the processed CSVs are present in the `processed/` folder before proceeding to the merging step; a quick sanity check is sketched below.
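
An illustrative sanity check (not part of the repository) that the processed files match the schema above:

```python
import glob
import pandas as pd

EXPECTED = ['Timestamp', 'Magnitude', 'Latitude', 'Longitude', 'Depth']

for path in glob.glob('data_processing/processed/*.csv'):
    df = pd.read_csv(path)
    assert list(df.columns) == EXPECTED, f'unexpected columns in {path}'
    ts = pd.to_datetime(df['Timestamp'], utc=True)
    assert ts.is_monotonic_increasing, f'{path} is not sorted chronologically'
    assert not df.duplicated().any(), f'{path} contains duplicates'
```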

---

# Merging

The merging process combines all processed datasets into a single file for machine learning model input. This step preserves the same columns and ensures chronological order without duplicates.
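
Conceptually, the merge reduces to concatenating the processed catalogs, re-sorting by time, and dropping duplicates; a minimal sketch under that assumption (the paths and output name mirror the steps below but are placeholders here):

```python
import glob
import pandas as pd

frames = [pd.read_csv(p) for p in glob.glob('data_processing/processed/*.csv')]
merged = pd.concat(frames, ignore_index=True)
merged['Timestamp'] = pd.to_datetime(merged['Timestamp'], utc=True)
merged = merged.sort_values('Timestamp').drop_duplicates().reset_index(drop=True)
merged.to_csv('Completed-Merge.csv', index=False)
```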

### File Organization

- **helper.py**: Contains helper functions for merging.
- **merge.py**: Merges non-USGS/SAGE datasets into `Various-Catalogs.csv`.
- **usgs_pre_1950/**: Folder containing scripts for USGS data processing and merging.
- **final/**: Folder containing `usgs_sage_various_merge.py`, which merges all datasets into `Completed-Merge.csv`.

### Running Merge

1. **Compile Processed Datasets**: Ensure all processed datasets are in `data_processing/processed/` (excluding USGS/SAGE datasets).
2. **First Merge**: Run `merge.py` to create `Various-Catalogs.csv`.
3. **USGS Data Processing**:
- Download USGS and SAGE datasets from Google Drive and place them in `merge/usgs_pre_1950`.
- Run `preprocess.py` and then `merge.py` in that folder to create `USGS_SAGE_Merged.csv`.
4. **Final Merge**: Run `usgs_sage_various_merge.py` to merge all datasets into `Completed-Merge.csv`.

---

# Dataset Checkpoints

Access the dataset at various stages of acquisition:

- [Scraper](https://drive.google.com/drive/folders/1okZ_2QW58CqQwPA8JIDIwmaBbcOtLIMp?usp=sharing)
- [Preprocessed](https://drive.google.com/drive/folders/1CnXaP9KgUxgQrrreYt3s1MSbCJKHmCQy?usp=sharing)
- [Merged (finalized) dataset](https://drive.google.com/drive/folders/1GUvjtBC2jBqHQGbAVy4rqSAd9fXe3sxT?usp=sharing)