https://github.com/frocode/news-data-pipeline_airflow_aws
- Host: GitHub
- URL: https://github.com/frocode/news-data-pipeline_airflow_aws
- Owner: FroCode
- Created: 2024-07-31T23:29:59.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-08T17:25:00.000Z (over 1 year ago)
- Last Synced: 2026-01-02T12:37:32.072Z (4 months ago)
- Language: Python
- Size: 49 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: news.csv
README
# News API ETL Pipeline
## Overview
This repository contains a simple ETL (Extract, Transform, Load) pipeline for fetching top headlines from the News API and uploading the data to an AWS S3 bucket. It uses Python and Apache Airflow for orchestration.
## Files
- **`news_etl.py`**: Contains the ETL process for fetching news data and uploading it to S3.
- **`news_etl_dag.py`**: Defines the Airflow DAG for scheduling and running the ETL process.
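For orientation, the core ETL logic could look roughly like the sketch below. The function name `run_news_etl`, the `country=us` query parameter, and the overall structure are assumptions for illustration; the bucket name `reddits-data`, the output file `news.csv`, and the `api_key` variable follow the rest of this README.

```python
from io import StringIO

import boto3
import pandas as pd
import requests


def run_news_etl():
    # Illustrative placeholder; replace with your News API key (see Setup step 5).
    api_key = 'your_news_api_key'
    url = f'https://newsapi.org/v2/top-headlines?country=us&apiKey={api_key}'

    # Extract: fetch top headlines from the News API.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    articles = response.json().get('articles', [])

    # Load: flatten the articles into a DataFrame and write it to S3 as CSV.
    df = pd.DataFrame(articles)
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)

    s3 = boto3.client('s3')
    s3.put_object(Bucket='reddits-data', Key='news.csv', Body=csv_buffer.getvalue())


if __name__ == '__main__':
    run_news_etl()
```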
## Prerequisites
- Python 3.7 or higher
- Apache Airflow
- `requests` library
- `pandas` library
- `boto3` library (for S3 operations)
- AWS credentials configured for `boto3`
## Setup
1. **Clone the Repository:**
```bash
git clone https://github.com/frocode/news-data-pipeline_airflow_aws.git
cd news-data-pipeline_airflow_aws
```
2. **Install Dependencies:**
It is recommended to use a virtual environment. Install the required Python libraries using:
```bash
pip install requests pandas boto3 apache-airflow
```
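Airflow's own documentation recommends installing with a constraints file so that its many dependencies are pinned to compatible versions. For example (the Airflow and Python versions here are illustrative; pick ones matching your environment):
```bash
pip install "apache-airflow==2.9.3" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.9.3/constraints-3.8.txt"
```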
3. **Configure Airflow:**
- Set up Apache Airflow by following the [Airflow documentation](https://airflow.apache.org/docs/apache-airflow/stable/start.html).
- Place the `news_etl_dag.py` file in your Airflow DAGs folder (usually located at `~/airflow/dags`); a rough sketch of what this DAG might contain is shown after this list.
4. **Configure AWS Credentials:**
Ensure that your AWS credentials are configured. You can set them up using the AWS CLI:
```bash
aws configure
```
Alternatively, set up environment variables:
```bash
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
```
5. **Modify API Key:**
Replace the placeholder API key in `news_etl.py` with your actual News API key:
```python
api_key = 'your_news_api_key'
```
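As a reference for step 3, a minimal `news_etl_dag.py` could wire the ETL function into a daily DAG roughly as follows. The DAG id `news_etl_dag` and the daily schedule come from this README; the `run_news_etl` callable, owner, retries, and start date are illustrative assumptions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from news_etl import run_news_etl  # the ETL function sketched under Files

default_args = {
    'owner': 'airflow',                 # assumed; adjust to taste
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    dag_id='news_etl_dag',
    default_args=default_args,
    description='Fetch top headlines from the News API and upload them to S3',
    schedule_interval='@daily',         # the pipeline runs daily; newer Airflow versions use `schedule=`
    start_date=datetime(2024, 7, 31),   # illustrative start date
    catchup=False,
) as dag:
    run_etl = PythonOperator(
        task_id='run_news_etl',
        python_callable=run_news_etl,
    )
```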
## Usage
1. **Running the ETL Process Manually:**
You can run the ETL process directly using Python:
```bash
python news_etl.py
```
2. **Scheduling with Airflow:**
- Start the Airflow web server and scheduler:
```bash
airflow webserver --port 8080
airflow scheduler
```
- Access the Airflow web interface at `http://localhost:8080`.
- Trigger the `news_etl_dag` manually or wait for it to run according to the schedule (daily).
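Once the scheduler is running, you can also trigger the DAG from the command line (assuming the DAG id is `news_etl_dag` as above):
```bash
airflow dags trigger news_etl_dag
```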
## S3 Bucket Configuration
- Ensure you have an S3 bucket named `reddits-data` (or adjust the bucket name in `news_etl.py` accordingly).
- The pipeline saves the news data as `news.csv` in the S3 bucket.
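If the bucket does not exist yet, you can create it with the AWS CLI before the first run (the bucket name follows this README; the region is only an example):
```bash
aws s3 mb s3://reddits-data --region us-east-1
```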
## Error Handling
- **API Errors:** Ensure your News API key is valid and has not expired. Check the HTTP response codes in the logs for detailed error messages.
- **S3 Upload Errors:** Verify that your AWS credentials have the necessary permissions to upload files to the S3 bucket.
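As a sketch of what defensive handling might look like inside `news_etl.py`, the snippet below wraps the two failure points in standard `requests` and `botocore` exceptions. The function names and structure are assumptions; only the bucket and key names come from this README.

```python
import boto3
import requests
from botocore.exceptions import ClientError


def fetch_headlines(url: str) -> dict:
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # raises on 4xx/5xx, e.g. 401 for an invalid or expired key
        return response.json()
    except requests.exceptions.RequestException as exc:
        raise RuntimeError(f'News API request failed: {exc}') from exc


def upload_csv(body: str, bucket: str = 'reddits-data', key: str = 'news.csv') -> None:
    try:
        boto3.client('s3').put_object(Bucket=bucket, Key=key, Body=body)
    except ClientError as exc:
        # Usually a missing bucket or insufficient IAM permissions.
        raise RuntimeError(f'S3 upload failed: {exc}') from exc
```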
