https://github.com/victoriacheng15/articles-extractor
Python application automating web scraping of articles (titles, links, dates, authors) from websites. Deploy via manual runs, cron/Docker, or GitHub Actions. Organizes data into Google Sheets.
- Host: GitHub
- URL: https://github.com/victoriacheng15/articles-extractor
- Owner: victoriacheng15
- License: mit
- Created: 2024-02-04T20:28:44.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-06T14:44:33.000Z (about 2 months ago)
- Last Synced: 2025-03-30T09:11:18.611Z (26 days ago)
- Topics: bash, docker, github-actions, google-sheets, google-sheets-api, python3, raspberry-pi
- Language: Python
- Homepage:
- Size: 78.1 KB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Article Extractor
A Python application that automatically scrapes articles from freeCodeCamp, Substack, and other sources, then organizes them in Google Sheets.
## Getting Started
Please refer to the [Wiki](https://github.com/victoriacheng15/articles-extractor/wiki) for setup and deployment instructions.
## Tech Stacks
Python, Bash, Docker, GitHub Actions, Google Sheets API, Raspberry Pi
## Key Features
- **Efficient scraping** using Python generators to minimize memory usage
- **Automated scheduling** via GitHub Actions (daily runs)
- **Cross-platform** - Runs on Raspberry Pi, cloud, or local machines
- **Extensible architecture** for adding new content sources (sketched below)
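
The repository's actual module layout lives in the Wiki rather than here, so the following is only a minimal sketch of how a plug-in source architecture of this kind is commonly shaped; the `Article`, `Source`, and `FreeCodeCampSource` names are hypothetical, not the project's real identifiers.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass
class Article:
    title: str
    link: str
    date: str
    author: str


class Source(Protocol):
    """Anything that can yield Article objects, one at a time."""

    def fetch_articles(self) -> Iterator[Article]: ...


class FreeCodeCampSource:
    """One concrete source; supporting a new site means adding a class like this."""

    def fetch_articles(self) -> Iterator[Article]:
        # Real scraping (HTTP fetch + parsing) would go here; using `yield`
        # keeps the pipeline lazy so each article flows onward immediately.
        yield Article("Example post", "https://example.com/post", "2024-01-01", "Jane Doe")


def run(sources: list[Source]) -> Iterator[Article]:
    """Drain every registered source; new sources plug in without changing this loop."""
    for source in sources:
        yield from source.fetch_articles()
```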
## What I have learned

I discovered how Python generators can streamline workflows that depend on sequential completion. In my original approach, I collected all articles in a list (`all_articles`) before processing them, which forced the script to wait until every scrape finished before sending anything to Google Sheets. After refactoring to generators, each article is processed immediately after it is scraped, eliminating the need to store everything upfront; a sketch of the refactor appears after this list. This taught me two key things:
- Natural Sequencing: Generators inherently wait for one action (like scraping an article) to complete before yielding the result and moving to the next. This ensured data flowed smoothly into Google Sheets without manual batching.
- Responsive Execution: Unlike lists, generators don't hold all items in memory at once. While my primary goal wasn't memory optimization, I noticed the script felt more responsive: articles appeared in Sheets incrementally, and an interruption didn't throw away work already done.

The change simplified my code by removing temporary storage and made the process feel more deliberate, as if guiding each article step by step from source to destination.
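
A minimal sketch of that refactor, with stand-ins for the real scraping and Google Sheets calls (only `all_articles` is the project's own name; `scrape`, `append_to_sheet`, and the URLs are hypothetical):

```python
from typing import Iterator


def scrape(url: str) -> dict:
    """Stand-in for the real scraper (HTTP fetch + parsing in the actual project)."""
    return {"title": f"Article from {url}", "link": url}


def append_to_sheet(article: dict) -> None:
    """Stand-in for the Google Sheets API append call."""
    print("wrote:", article["title"])


def run_batched(urls: list[str]) -> None:
    """Before: collect every result in a list (the original all_articles
    approach); nothing reaches Sheets until every scrape has finished."""
    all_articles = [scrape(url) for url in urls]
    for article in all_articles:
        append_to_sheet(article)


def scrape_all(urls: list[str]) -> Iterator[dict]:
    """After: yield each article the moment it is scraped, so the caller
    can write it to Sheets immediately and no temporary list is kept."""
    for url in urls:
        yield scrape(url)


def run_streaming(urls: list[str]) -> None:
    for article in scrape_all(urls):
        append_to_sheet(article)  # runs once per article, incrementally


if __name__ == "__main__":
    run_streaming(["https://example.com/a", "https://example.com/b"])
```

The streaming version also degrades gracefully: every article yielded before a crash or interruption has already been written to the sheet.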