https://github.com/studyresearchprojects/hacker-news-scraper
A example web scraper for Hacker News exposed through a REST API
https://github.com/studyresearchprojects/hacker-news-scraper
Last synced: 6 months ago
JSON representation
A example web scraper for Hacker News exposed through a REST API
- Host: GitHub
- URL: https://github.com/studyresearchprojects/hacker-news-scraper
- Owner: StudyResearchProjects
- Created: 2021-09-16T01:19:49.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-10-08T21:36:17.000Z (over 4 years ago)
- Last Synced: 2025-08-06T00:42:55.785Z (6 months ago)
- Language: Python
- Homepage:
- Size: 8.79 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
hacker-news-scraper
A example web scraper for Hacker News exposed through a REST API
## Production
```bash
docker image build -t hacker-news-scrapper .
```
```bash
docker run -d -p 5000:5000 hacker-news-scrapper
```
## Development
### Execute
```bash
docker-compose -f ./docker-compose.dev.yml up --build
```
Then visit:
```
http://0.0.0.0:5000
```
### Shutdown
Focus the terminal where the session is running and excute `Ctrl + C`.
Then execute:
```bash
docker-compose -f ./docker-compose.dev.yml down
```
### HTTP Server
An HTTP Server acts as the interface to consume this application.
Endpoints are enumerated in the following table:
Method | URI | Description
--- | --- | ---
`GET` | `/crawl` | Executes the `HackerNewsBotSpider` and retrieves the state
`GET` | `/results` | Retrieves the results from the `/crawl` process if available
`GET` | `/context` | Retrieves the current state for relevant values
In order to execute any of these HTTP requests you must first follow the
[Execute](#execute) section and an HTTP client such as _cURL_.
Example of a cURL call to this API while running.
```bash
curl http://0.0.0.0:5000/crawl
```
### Scraper
Scrapy is used as _web crawler_/_scraper_ solution to retrieve posts from
Hacker News in this project.
The _Scrapy_ project is stored under `src/scraper` and contains the:
`HackerNewsBotSpider`.
In order to use _Scrapy_ shell you must [execute the docker-compose](#execute)
and use `docker exec` to SSH into the running container.
1. Execute `docker ps` to gather container details
```bash
docker ps
```
2. Copy the relevant `CONTAINER ID` to your clipboard
3. Execute `docker exec -it bash`
> At this point you will be using the container's BASH instance.
4. Change directory to `src/scraper` and then execute the _Scrapy_ shell
```bash
scrapy shell
```
With the Scrapy shell you will be able to debug and test CSS selectors to
gather data from the website in question.
### Running _Scrapy_ Spider
As mentioned above, a `HackerNewsBotSpider` is available, the purpose of this
spider is to retrieve post details from Hacker News.
Following the ["Scraper" instructions](#scraper) to the third step, run
```bash
scrapy crawl hacker_news_bot
```
instead of:
```bash
scrapy shell
```
To have the bot executed. This command should output the scraped data as part of
the debug output in the terminal.
### Dependencies
- **Flask**: Lightweight WSGI (Web Server Gateway Interface) web application
framework
- **crochet**: Makes it easier to use Twisted from regular blocking code
- **Scrapy**: DOM Scraper useful for crawling the web
- **Twisted**: Multi-purpose event based framework