https://github.com/ekoepplin/dlt-data-dumper
Illustrates how to use dlt (data load tool) with NewsAPI (source), AWS S3 (destination), and DuckDB (local testing)
- Host: GitHub
- URL: https://github.com/ekoepplin/dlt-data-dumper
- Owner: ekoepplin
- Created: 2024-10-21T12:07:01.000Z (7 months ago)
- Default Branch: master
- Last Pushed: 2024-12-16T20:22:05.000Z (5 months ago)
- Last Synced: 2024-12-16T20:39:49.971Z (5 months ago)
- Topics: dlt, dlthub, duckdb, github-actions, newsapi-org
- Language: Python
- Homepage:
- Size: 123 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# NewsAPI Data Ingestion with DLT
This project demonstrates how to ingest data from the NewsAPI using the Data Load Tool (DLT) library. It fetches top headlines, article searches, and news sources, then loads them into a DuckDB database.
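As a rough sketch of the kind of request involved: NewsAPI's v2 `top-headlines` endpoint takes the API key plus query parameters such as `country` and `pageSize`. The helper below is illustrative, not code from this repository:

```python
from urllib.parse import urlencode

API_BASE = "https://newsapi.org/v2"

def build_top_headlines_url(api_key: str, country: str = "us", page_size: int = 100) -> str:
    # NewsAPI's /v2/top-headlines endpoint; the key can alternatively be sent
    # as an X-Api-Key header instead of a query parameter.
    params = {"country": country, "pageSize": page_size, "apiKey": api_key}
    return f"{API_BASE}/top-headlines?{urlencode(params)}"

url = build_top_headlines_url("demo-key")
print(url)
```

A dlt resource in the pipeline would then page through such URLs and yield the JSON articles for loading.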
## Features
- Fetch top headlines from NewsAPI
- Search for articles on specific topics
- Retrieve news sources information
- Load data into DuckDB using DLT
- Jupyter notebook for interactive data exploration

## Prerequisites
- Python 3.11.8
- Pipenv for dependency management

## Installation
1. Clone this repository:
```
git clone https://github.com/ekoepplin/dlt-data-dumper
cd dlt-data-dumper
```
2. Install dependencies using Pipenv:
```
pipenv install
```
3. Activate the virtual environment:
```
pipenv shell
```
4. Set up your NewsAPI key:
- Sign up for a free API key at [https://newsapi.org/](https://newsapi.org/)
- Set the API key as an environment variable:
```
export NEWS_API_KEY=your_api_key_here
```

## Usage
### Running the Pipeline
To run the data ingestion pipeline, use the `newsapi_pipeline.py` script. This script supports various command-line options for different execution modes:
1. Normal mode:
```
python newsapi_pipeline.py
```
2. Test mode (uses DuckDB instead of filesystem):
```
python newsapi_pipeline.py --test
```
3. Full refresh (replaces existing data instead of appending):
```
python newsapi_pipeline.py --full-refresh
```
4. Custom log level:
```
python newsapi_pipeline.py --log-level DEBUG
```
You can also combine these options as needed. For example:
```
python newsapi_pipeline.py --test --full-refresh --log-level DEBUG
```
This script will fetch data from NewsAPI and load it into either filesystem-based storage (default) or a DuckDB database (in test mode).
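The flag handling described above can be sketched with `argparse`. The option names match the commands shown; the mapping onto a destination and write disposition is an assumption about how the script might wire them together, not the repository's actual code:

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="NewsAPI ingestion pipeline")
    parser.add_argument("--test", action="store_true",
                        help="load into DuckDB instead of the filesystem destination")
    parser.add_argument("--full-refresh", action="store_true",
                        help="replace existing data instead of appending")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser.parse_args(argv)

args = parse_args(["--test", "--full-refresh", "--log-level", "DEBUG"])

# Map the flags onto pipeline settings (hypothetical wiring):
destination = "duckdb" if args.test else "filesystem"
write_disposition = "replace" if args.full_refresh else "append"
print(destination, write_disposition, args.log_level)
```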
### Exploring Data with Jupyter Notebook
1. Start Jupyter Notebook:
```
jupyter notebook
```
2. Open the `eda-newsapi.ipynb` notebook in your browser.
3. Run the cells to fetch data and perform exploratory data analysis.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is open-source and available under the [MIT License](LICENSE).