https://github.com/rmax/dirbot-mysql
Scrapy project based on dirbot to show how to use Twisted's adbapi to store the scraped data in MySQL.
- Host: GitHub
- URL: https://github.com/rmax/dirbot-mysql
- Owner: rmax
- Created: 2010-04-15T22:11:44.000Z (about 16 years ago)
- Default Branch: master
- Last Pushed: 2013-10-24T23:07:04.000Z (over 12 years ago)
- Last Synced: 2025-02-27T10:39:28.662Z (over 1 year ago)
- Language: Python
- Homepage:
- Size: 125 KB
- Stars: 117
- Watchers: 11
- Forks: 54
- Open Issues: 3
Metadata Files:
- Readme: README.rst
======
dirbot
======
This is a Scrapy project to scrape websites from public web directories.
This project is only meant for educational purposes.
Items
=====
The items scraped by this project are websites, and the item is defined in the
class::

    dirbot.items.Website

See the source code for more details.
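
For orientation, an item class of this kind might look roughly like the sketch below. Only the ``description`` field is implied by this README (the filtering pipeline reads it); the ``name`` and ``url`` fields are assumptions, and the ``ImportError`` fallback exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.item import Item, Field
except ImportError:
    # Fallback so the sketch also runs where Scrapy is not installed:
    # a plain dict behaves closely enough for this illustration.
    Item, Field = dict, dict


class Website(Item):
    """One scraped website entry (field names are assumptions)."""

    name = Field()
    url = Field()
    description = Field()


# Items are populated and read like dictionaries.
site = Website(
    name="Example site",
    url="http://example.com/",
    description="A placeholder entry",
)
```

The authoritative field list is whatever ``dirbot.items.Website`` declares in the source.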
Spiders
=======
This project contains one spider, ``dmoz``, which you can list by running::

    scrapy list

Spider: dmoz
------------
The ``dmoz`` spider scrapes the Open Directory Project (dmoz.org) and is
based on the dmoz spider described in the `Scrapy tutorial`_.
This spider doesn't crawl the entire dmoz.org site but only a few pages by
default (defined in the ``start_urls`` attribute). These pages are:
* http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
* http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
So, if you run the spider in the usual way (``scrapy crawl dmoz``), it will
scrape only those two pages.
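
As a rough illustration of how the two pages are wired in: Scrapy reads a spider's initial requests from its ``start_urls`` attribute. The sketch below is not the project's spider; the ``parse`` body is a placeholder, and the ``ImportError`` fallback is there only so the snippet can be loaded without Scrapy.

```python
try:
    from scrapy import Spider
except ImportError:
    # Fallback so the class can still be defined without Scrapy installed.
    Spider = object


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    # Only these two pages are requested by default.
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Placeholder: the real spider extracts website entries here.
        yield {"url": response.url}
```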
.. _Scrapy tutorial: http://doc.scrapy.org/intro/tutorial.html
Pipelines
=========
Filtering by words
------------------
A pipeline to filter out websites containing certain forbidden words in their
description. This pipeline is defined in the class::

    dirbot.pipelines.FilterWordsPipeline

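
A minimal sketch of such a filter, assuming the usual Scrapy pipeline shape (a ``process_item(item, spider)`` method that returns the item or raises ``DropItem``). The word list is hypothetical, and the local ``DropItem`` class is only a stand-in used when Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class FilterWordsPipeline:
    """Drop items whose description contains a forbidden word."""

    # Hypothetical list; the project's actual words live in its source.
    words_to_filter = ["politics", "religion"]

    def process_item(self, item, spider):
        description = str(item.get("description", "")).lower()
        for word in self.words_to_filter:
            if word in description:
                raise DropItem("Contains forbidden word: %s" % word)
        # Returning the item passes it on to the next pipeline.
        return item
```

The check is a case-insensitive substring match, so a forbidden word also matches inside longer words.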
Requiring certain item fields
-----------------------------
A pipeline to discard items that lack certain fields. This pipeline is
defined in the class::

    dirbot.pipelines.RequiredFieldsPipeline

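
The same pipeline shape can express this check. In the sketch below the list of required fields is an assumption, and the local ``DropItem`` class is again only a stand-in for when Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class RequiredFieldsPipeline:
    """Drop items that are missing (or have empty) required fields."""

    # Assumed field names; check the project source for the real ones.
    required_fields = ("name", "description", "url")

    def process_item(self, item, spider):
        for field in self.required_fields:
            if not item.get(field):
                raise DropItem("Field '%s' missing" % field)
        return item
```

Note that this version treats an empty value the same as a missing one.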
Storing items in a MySQL database
---------------------------------
A pipeline to store (insert or update) scraped items in a MySQL database. This
pipeline is defined in the class::

    dirbot.pipelines.MySQLStorePipeline

The database schema is defined in ``db/mysql.sql`` and the settings file
contains the default ``MYSQL_*`` settings values. The scraped items will be
stored in the ``website`` database table.
.. note::

   It is *required* to have set up the database schema *before* running the
   spider.
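
To make the "insert or update" step concrete, here is a small sketch of that decision. It deliberately swaps in Python's built-in ``sqlite3`` and runs synchronously so it works without a MySQL server; the project itself runs its queries through Twisted's adbapi against MySQL. The column names and the use of ``url`` as the lookup key are assumptions, not the schema from ``db/mysql.sql``.

```python
import sqlite3


def upsert_website(conn, item):
    """Insert the item, or update the stored row if its URL already exists."""
    row = conn.execute(
        "SELECT 1 FROM website WHERE url = ?", (item["url"],)
    ).fetchone()
    if row:
        conn.execute(
            "UPDATE website SET name = ?, description = ? WHERE url = ?",
            (item["name"], item["description"], item["url"]),
        )
        action = "updated"
    else:
        conn.execute(
            "INSERT INTO website (url, name, description) VALUES (?, ?, ?)",
            (item["url"], item["name"], item["description"]),
        )
        action = "inserted"
    conn.commit()
    return action


# The table must exist before any item is stored (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE website (url TEXT PRIMARY KEY, name TEXT, description TEXT)"
)
```

With adbapi, a function of this shape would typically be handed to ``ConnectionPool.runInteraction``, which passes in a transaction object and commits on success, so the explicit ``commit()`` call would not be needed there.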