https://github.com/mdsrosa/hackernews_scrapy
HackerNews Scrapy that crawls Python news.
- Host: GitHub
- URL: https://github.com/mdsrosa/hackernews_scrapy
- Owner: mdsrosa
- License: mit
- Created: 2015-11-08T21:55:48.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2015-11-25T19:20:24.000Z (about 9 years ago)
- Last Synced: 2024-11-21T00:44:36.955Z (2 months ago)
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 1
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# HackerNews Scrapy
This is a Scrapy project that scrapes [Hacker News](https://news.ycombinator.com/) for Python-related articles.
# Items
The items scraped by this project are articles. The item is defined in the class:
```
hackernews_scrapy.items.ArticleItem
```
# Crawl Spider
This project contains a `CrawlSpider` called `pythonhackernews` that you can see by running:
`scrapy list`
## Crawl Spider: pythonhackernews
The `pythonhackernews` crawl spider scrapes Hacker News (news.ycombinator.com) for Python-related articles.
This spider doesn't crawl the entire news.ycombinator.com site, only the first 9 pages by default.
So a regular run (with `scrapy crawl pythonhackernews`) scrapes only those 9 pages.
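The page limit can be pictured with a small sketch. It assumes Hacker News's `news?p=N` listing pagination; `MAX_PAGES` and `listing_urls` are hypothetical names used only for illustration, and the spider's actual link-following rules may be implemented differently.

```python
# Sketch only: MAX_PAGES and listing_urls are hypothetical, not taken from the project.
BASE_URL = "https://news.ycombinator.com/news"
MAX_PAGES = 9  # default page limit stated in this README


def listing_urls(max_pages=MAX_PAGES):
    """Return the listing-page URLs a page-limited crawl would visit."""
    return [f"{BASE_URL}?p={page}" for page in range(1, max_pages + 1)]


if __name__ == "__main__":
    for url in listing_urls():
        print(url)
```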
# Pipelines
This project uses two pipelines: `ValidateArticlePipeline` and `MongoDBPipeline`.
The `ValidateArticlePipeline` can be found in this class:
```
hackernews_scrapy.pipelines.ValidateArticlePipeline
```
This pipeline filters articles by title, keeping only those that contain 'Python' or 'python'.
The `MongoDBPipeline` can be found in this class:
```
hackernews_scrapy.pipelines.MongoDBPipeline
```
This pipeline saves each article in the MongoDB database if it is not already saved.
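The two stages can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the class names, the `title` and `url` item fields, and the use of the URL as the uniqueness key are all assumptions. A local `DropItem` exception stands in for `scrapy.exceptions.DropItem`, and a dict stands in for the MongoDB collection so the sketch runs without Scrapy or a database. It also assumes the validation stage keeps Python-titled articles and drops the rest.

```python
# Sketch only: plain-Python stand-ins for the two pipeline stages.


class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class ValidateArticleSketch:
    """Drops any article whose title contains neither 'Python' nor 'python'."""

    def process_item(self, item, spider=None):
        title = item.get("title", "")  # 'title' is an assumed field name
        if "Python" not in title and "python" not in title:
            raise DropItem(f"Not a Python article: {title!r}")
        return item


class DedupStoreSketch:
    """Saves each article once; a dict stands in for the MongoDB collection."""

    def __init__(self):
        self.saved = {}  # keyed by URL, assumed here to identify an article

    def process_item(self, item, spider=None):
        if item["url"] not in self.saved:
            self.saved[item["url"]] = dict(item)
        return item


def run_pipelines(items):
    """Pass items through validation, then storage, skipping dropped items."""
    validate, store = ValidateArticleSketch(), DedupStoreSketch()
    for item in items:
        try:
            store.process_item(validate.process_item(item))
        except DropItem:
            continue
    return list(store.saved.values())
```

Running `run_pipelines` on a list that contains a non-Python title and a repeated URL returns only the distinct Python articles, in the order they were first seen.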