https://github.com/ekoepplin/dlt-data-dumper
Illustrates how to use dlt (data load tool) with NewsAPI (source), AWS S3 (destination), and DuckDB (local testing)
- Host: GitHub
- URL: https://github.com/ekoepplin/dlt-data-dumper
- Owner: ekoepplin
- Created: 2024-10-21T12:07:01.000Z (7 months ago)
- Default Branch: master
- Last Pushed: 2024-12-16T20:22:05.000Z (5 months ago)
- Last Synced: 2024-12-16T20:39:49.971Z (5 months ago)
- Topics: dlt, dlthub, duckdb, github-actions, newsapi-org
- Language: Python
- Homepage:
- Size: 123 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# NewsAPI Data Ingestion with DLT
This project demonstrates how to ingest data from the NewsAPI using the Data Load Tool (DLT) library. It fetches top headlines, article searches, and news sources, then loads them into a DuckDB database.
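As a rough sketch of the kind of request involved: NewsAPI's v2 `top-headlines` endpoint takes the API key plus query parameters such as `country` and `pageSize`. The helper below is illustrative, not code from this repository:

```python
from urllib.parse import urlencode

API_BASE = "https://newsapi.org/v2"

def build_top_headlines_url(api_key: str, country: str = "us", page_size: int = 100) -> str:
    # NewsAPI's /v2/top-headlines endpoint; the key can alternatively be sent
    # as an X-Api-Key header instead of a query parameter.
    params = {"country": country, "pageSize": page_size, "apiKey": api_key}
    return f"{API_BASE}/top-headlines?{urlencode(params)}"

url = build_top_headlines_url("demo-key")
print(url)
```

A dlt resource in the pipeline would then page through such URLs and yield the JSON articles for loading.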
## Features
- Fetch top headlines from NewsAPI
- Search for articles on specific topics
- Retrieve news sources information
- Load data into DuckDB using DLT
- Jupyter notebook for interactive data exploration

## Prerequisites
- Python 3.11.8
- Pipenv for dependency management

## Installation
1. Clone this repository:
```
git clone https://github.com/ekoepplin/dlt-data-dumper
cd dlt-data-dumper
```
2. Install dependencies using Pipenv:
```
pipenv install
```
3. Activate the virtual environment:
```
pipenv shell
```
4. Set up your NewsAPI key:
- Sign up for a free API key at [https://newsapi.org/](https://newsapi.org/)
- Set the API key as an environment variable:
```
export NEWS_API_KEY=your_api_key_here
```

## Usage
### Running the Pipeline
To run the data ingestion pipeline, use the `newsapi_pipeline.py` script. This script supports various command-line options for different execution modes:
1. Normal mode:
```
python newsapi_pipeline.py
```
2. Test mode (uses DuckDB instead of filesystem):
```
python newsapi_pipeline.py --test
```
3. Full refresh (replaces existing data instead of appending):
```
python newsapi_pipeline.py --full-refresh
```
4. Custom log level:
```
python newsapi_pipeline.py --log-level DEBUG
```
You can also combine these options as needed. For example:
```
python newsapi_pipeline.py --test --full-refresh --log-level DEBUG
```
This script will fetch data from NewsAPI and load it into either filesystem-based storage (default) or a DuckDB database (in test mode).
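The flag handling described above can be sketched with `argparse`. The option names match the commands shown; the mapping onto a destination and write disposition is an assumption about how the script might wire them together, not the repository's actual code:

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="NewsAPI ingestion pipeline")
    parser.add_argument("--test", action="store_true",
                        help="load into DuckDB instead of the filesystem destination")
    parser.add_argument("--full-refresh", action="store_true",
                        help="replace existing data instead of appending")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser.parse_args(argv)

args = parse_args(["--test", "--full-refresh", "--log-level", "DEBUG"])

# Map the flags onto pipeline settings (hypothetical wiring):
destination = "duckdb" if args.test else "filesystem"
write_disposition = "replace" if args.full_refresh else "append"
print(destination, write_disposition, args.log_level)
```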
### Exploring Data with Jupyter Notebook
1. Start Jupyter Notebook:
```
jupyter notebook
```
2. Open the `eda-newsapi.ipynb` notebook in your browser.
3. Run the cells to fetch data and perform exploratory data analysis.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is open-source and available under the [MIT License](LICENSE).