https://github.com/samhernandezcdev/hn-web-scraper
๐ฐ A Python-based web scraper that collects the relevant posts from Hacker News
https://github.com/samhernandezcdev/hn-web-scraper
hackernews python scraper selenium
Last synced: about 2 months ago
JSON representation
๐ฐ A Python-based web scraper that collects the relevant posts from Hacker News
- Host: GitHub
- URL: https://github.com/samhernandezcdev/hn-web-scraper
- Owner: samhernandezcdev
- Created: 2025-04-16T21:21:53.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-16T21:53:37.000Z (about 1 year ago)
- Last Synced: 2025-04-17T09:31:33.439Z (about 1 year ago)
- Topics: hackernews, python, scraper, selenium
- Language: Python
- Homepage:
- Size: 3.91 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# ๐ฐ HackerNews Web Scraper
A Python-based web scraper that collects the relevant posts from [Hacker News](https://news.ycombinator.com/) using `Selenium`, `BeautifulSoup`, and `Requests`. Compiles detailed results, it implements a two-phase system: first it identifies relevant URLs and then performs deep scraping with metadata, saving results in HTML, SQLite and/or CSV.
---
## ๐ Features (Planned)
- Scanning: Search with Selenium in real time.
- Extraction: Complete download of the content of each link.
- Interactive CLI for customized searches.
- Support for automatic pagination and result limiting.
- Download of full HTML with Selenium.
- Scrapes top stories from Hacker News.
- Parses HTML efficiently using BeautifulSoup.
- Can be extended for pagination or deeper content.
- Includes a test suite to validate core functionality.
- Audible notification at the end of the process.
---
## ๐งช Tests
- Fixtures for temporary databases.
- HTTP response mocking.
- Scraping validation and URL processing.
- Execute tests with:
```bash
> pytest
```
## ๐ฆ Installation
### 1. Clone the repository
```bash
> git clone https://github.com/samhernandezcdev/hn-web-scraper.git
> cd hn-web-scraper
```
### 2. Create and activate a virtual environment
On macOS/Linux:
```bash
> python -m venv venv
> source venv/Scripts/activate
```
On Windows:
```bash
> python -m venv venv
> venv\Scripts\activate
```
### 3. Install dependencies
```bash
> pip install -r requirements.txt
```
> Make sure you have Chrome and **ChromeDriver** installed and available in your system PATH.
## ๐ป How to run it?
```bash
> python project.py
```
### Example:
```markdown
> Enter your search term: "Python Web Scraping"
> How many elements do you want to scrape?: 50
> [PROCESSING] Searching in HackerNews...
> [PAGINATION] Page 1/3 | URLs found: 15/50
> [RECOPILING] Scrapping individual URLs...
> [PROGRESS] โโโโโโโโโโโโโโโโโโโโ 70% | 35/50
> [STORING] Saving the files...
- 50 records in db/2025-04-16_hn_scrape.db
- HTMLs in scrapes/2025-04-16_Python_Web_Scraping/
> [COMPLETED] 3.2 MB of data collected (00:02:17).
๐ (sound notification).
```