Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kiote/dirbot
scraping of thelocals.ru with scrapy
- Host: GitHub
- URL: https://github.com/kiote/dirbot
- Owner: kiote
- Created: 2015-06-12T09:49:36.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2015-06-17T14:00:16.000Z (over 9 years ago)
- Last Synced: 2023-03-10T19:26:01.855Z (over 1 year ago)
- Language: Julia
- Size: 125 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.rst
README
======

dirbot
======

This is a Scrapy project to scrape websites from public web directories.

This project is only meant for educational purposes.

Items
=====

The items scraped by this project are websites, and the item is defined in the
class::

    dirbot.items.Website

See the source code for more details.
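
The README doesn't list the item's fields; a minimal sketch of what the
``Website`` definition might look like follows, assuming the usual
name/url/description fields of a directory-scraping item (these field names
are an assumption, not copied from the project)::

    # dirbot/items.py -- illustrative sketch, not the project's actual code
    from scrapy.item import Item, Field


    class Website(Item):
        # assumed fields; the real class may define different ones
        name = Field()
        url = Field()
        description = Field()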

Spiders
=======

This project contains one spider called ``dmoz`` that you can see by running::

    scrapy list

Spider: dmoz
------------

The ``dmoz`` spider scrapes the Open Directory Project (dmoz.org), and it's
based on the dmoz spider described in the `Scrapy tutorial`_.

This spider doesn't crawl the entire dmoz.org site but only a few pages by
default (defined in the ``start_urls`` attribute). These pages are:

* http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
* http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/

So, if you run the spider as-is (with ``scrapy crawl dmoz``) it will scrape
only those two pages.

.. _Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
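
For illustration, a minimal sketch of how such a spider could be written with
recent Scrapy APIs follows; the selectors and item fields are assumptions for
the sketch, not the project's actual code::

    # dirbot/spiders/dmoz.py -- illustrative sketch, not the project's actual code
    from scrapy.spiders import Spider

    from dirbot.items import Website


    class DmozSpider(Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]

        def parse(self, response):
            # assume each directory entry is a list item with a link and a text blurb
            for site in response.css("ul li"):
                item = Website()
                item["name"] = site.css("a::text").get()
                item["url"] = site.css("a::attr(href)").get()
                item["description"] = site.css("::text").get()
                yield item

Running ``scrapy crawl dmoz`` would then yield one ``Website`` item per
directory entry on those two pages.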

Pipelines
=========

This project uses a pipeline to filter out websites containing certain
forbidden words in their description. This pipeline is defined in the class::

    dirbot.pipelines.FilterWordsPipeline
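
For illustration, a minimal sketch of such a filtering pipeline follows; the
forbidden-word list and the exact matching logic are assumptions, not the
project's actual code::

    # dirbot/pipelines.py -- illustrative sketch, not the project's actual code
    from scrapy.exceptions import DropItem


    class FilterWordsPipeline:
        """Drop any scraped website whose description contains a forbidden word."""

        # assumed example words; the real pipeline defines its own list
        words_to_filter = ["politics", "religion", "sex"]

        def process_item(self, item, spider):
            description = (item.get("description") or "").lower()
            for word in self.words_to_filter:
                if word in description:
                    raise DropItem(f"Contains forbidden word: {word}")
            return item

For the pipeline to run, it has to be registered in the project's
``ITEM_PIPELINES`` setting in ``settings.py``.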