https://github.com/mdsrosa/hackernews_scrapy

HackerNews Scrapy that crawls Python news.

Host: GitHub
URL: https://github.com/mdsrosa/hackernews_scrapy
Owner: mdsrosa
License: mit
Created: 2015-11-08T21:55:48.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2015-11-25T19:20:24.000Z (over 9 years ago)
Last Synced: 2025-01-21T19:26:47.655Z (6 months ago)
Language: Python
Homepage:
Size: 13.7 KB
Stars: 1
Watchers: 3
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# HackerNews Scrapy

This is a Scrapy project that scrapes [Hacker News](https://news.ycombinator.com/) for Python-related articles.

# Items

The items scraped by this project are articles. The item is defined in the class:
```
hackernews_scrapy.items.ArticleItem
```
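
The fields of the item are not listed here. As a rough sketch, assuming an article carries a `title` and a `url` (both hypothetical names), the item could be modelled as below. The real class presumably subclasses `scrapy.Item`; a plain dataclass is used here only so the snippet runs without Scrapy installed.

```python
from dataclasses import asdict, dataclass


@dataclass
class ArticleItem:
    """Hypothetical stand-in for hackernews_scrapy.items.ArticleItem."""

    title: str  # assumed field: headline text shown on Hacker News
    url: str    # assumed field: link the headline points to


# Example values are made up for illustration.
article = ArticleItem(title="Python 3.5 released", url="https://example.com/python-3-5")
print(asdict(article))
```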

# Crawl Spider

This project contains a `CrawlSpider` called `pythonhackernews` that you can see by running:

`scrapy list`

## Crawl Spider: pythonhackernews

The `pythonhackernews` crawl spider scrapes Hacker News (news.ycombinator.com) for Python-related articles.

This spider doesn't crawl the entire news.ycombinator.com site; by default it visits only the first 9 pages.

So a regular run (with `scrapy crawl pythonhackernews`) scrapes only those 9 pages.
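
To illustrate the page bound, here is a minimal sketch that builds the listing URLs for the first 9 pages. The `news?p=N` URL pattern, the constant names, and the helper `listing_urls` are assumptions made for this example. The actual spider is a `CrawlSpider` and may instead follow pagination links through its rules.

```python
BASE_URL = "https://news.ycombinator.com/news"  # assumed listing URL
MAX_PAGES = 9  # default page limit described above


def listing_urls(max_pages=MAX_PAGES):
    """Return the listing URLs for pages 1..max_pages (inclusive)."""
    return [f"{BASE_URL}?p={page}" for page in range(1, max_pages + 1)]


for url in listing_urls():
    print(url)
```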

# Pipelines

This project uses two pipelines: `ValidateArticlePipeline` and `MongoDBPipeline`.
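
In Scrapy, pipelines are enabled through the `ITEM_PIPELINES` setting, where lower numbers run first. A sketch of what the entry in `settings.py` might look like follows; the priority values 100 and 200 are assumptions, chosen only so that validation runs before storage.

```python
# settings.py (sketch): lower numbers run first, so validation precedes storage.
ITEM_PIPELINES = {
    "hackernews_scrapy.pipelines.ValidateArticlePipeline": 100,  # assumed priority
    "hackernews_scrapy.pipelines.MongoDBPipeline": 200,  # assumed priority
}
```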

The `ValidateArticlePipeline` can be found in this class:
```
hackernews_scrapy.pipelines.ValidateArticlePipeline
```
This pipeline keeps only articles whose title contains 'Python' or 'python' and drops the rest.
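
A minimal sketch of such a title check, assuming the pipeline is meant to keep Python-titled articles and drop everything else. The `title` field name is hypothetical, and a local `DropItem` stands in for `scrapy.exceptions.DropItem` so the snippet runs without Scrapy installed. This is not the project's actual code.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class ValidateArticlePipeline:
    """Sketch: keep only items whose title mentions Python."""

    def process_item(self, item, spider):
        title = item.get("title", "")  # "title" is an assumed field name
        if "Python" in title or "python" in title:
            return item
        # Raising DropItem tells Scrapy to stop processing this item.
        raise DropItem(f"Not a Python article: {title!r}")


pipeline = ValidateArticlePipeline()
kept = pipeline.process_item({"title": "Why Python is fun"}, spider=None)
print(kept["title"])
```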

The `MongoDBPipeline` can be found in this class:
```
hackernews_scrapy.pipelines.MongoDBPipeline
```

This pipeline saves each article to the MongoDB database if it is not already stored there.
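
One way to get "save only if not already saved" is an upsert with `$setOnInsert`, keyed on a unique field. The sketch below assumes the article's `url` is that key (a hypothetical choice) and takes the collection as a constructor argument. `InMemoryCollection` is a tiny stand-in that mimics only the `update_one` call, so the example runs without a MongoDB server; with pymongo you would pass a real collection instead.

```python
class InMemoryCollection:
    """Stand-in exposing the one pymongo Collection method the sketch uses."""

    def __init__(self):
        self.docs = {}

    def update_one(self, filter, update, upsert=False):
        key = filter["url"]
        # Mimic $setOnInsert: write fields only when a new document is created.
        if key not in self.docs and upsert:
            self.docs[key] = dict(update["$setOnInsert"])


class MongoDBPipeline:
    """Sketch: insert an article only if its URL is not stored yet."""

    def __init__(self, collection):
        self.collection = collection  # a pymongo collection in real use

    def process_item(self, item, spider):
        self.collection.update_one(
            {"url": item["url"]},  # "url" as the unique key is an assumption
            {"$setOnInsert": dict(item)},
            upsert=True,
        )
        return item


collection = InMemoryCollection()
pipeline = MongoDBPipeline(collection)
pipeline.process_item({"title": "Python tips", "url": "https://example.com/a"}, None)
pipeline.process_item({"title": "Python tips (repost)", "url": "https://example.com/a"}, None)
print(len(collection.docs))
```

The second call leaves the stored document untouched because its URL is already present.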
