A simple crawler for Stack Overflow's newest questions.
- Host: GitHub
- URL: https://github.com/lunarwhite/hello-scrapy-mongo
- Owner: lunarwhite
- License: MIT
- Created: 2021-11-25T02:47:47.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2024-05-15T03:39:40.000Z (6 months ago)
- Last Synced: 2024-05-15T20:45:12.923Z (6 months ago)
- Topics: crwal, demo, mongodb, scraping-websites, scrapy
- Language: Python
- Size: 45.9 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# hello-scrapy-mongo
A simple crawler demo that scrapes the newest questions from [stackoverflow.com](https://stackoverflow.com/), built with Python Scrapy and backed by a MongoDB item pipeline.
Explore more details in this [blog post](https://lunarwhite.notion.site/Sraping-Website-using-Scrapy-and-MongoDB-800f14544de14cf7bc684191e8052198).
## Setup env
### MongoDB
- install
```shell
# Import the public key used by the package management system
wget -qO - https://www.mongodb.org/static/pgp/server-5.0.asc | sudo apt-key add -
# Create a list file for MongoDB
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-5.0.list
# Reload local package database
sudo apt-get update
# Install the MongoDB packages
sudo apt-get install -y mongodb-org
```
- run/stop
```shell
# Start MongoDB
sudo systemctl start mongod
# Verify that MongoDB has started successfully
sudo systemctl status mongod
# Stop MongoDB
sudo systemctl stop mongod
# Restart MongoDB
sudo systemctl restart mongod
# Begin using MongoDB
mongosh
```
### Python
- check version
```shell
python3 -V
```
- create virtual env
```shell
python3 -m venv .venv
source .venv/bin/activate
# deactivate
```
- pkg: `scrapy`
```shell
pip install scrapy
```
- pkg: `pymongo` (a quick connectivity check follows this list)
```shell
python3 -m pip install 'pymongo[srv]'
```
- freeze dependencies
```shell
pip freeze > requirements.txt
```
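- verify connection: with `mongod` running and both packages installed, a quick PyMongo sanity check (a minimal sketch; `check_mongo.py` is just a hypothetical scratch file, not part of the project)
```python
# check_mongo.py -- hypothetical scratch file, not part of the project
import pymongo

client = pymongo.MongoClient("localhost", 27017)
# The ping command round-trips to the server and raises if it is unreachable
client.admin.command("ping")
print("MongoDB is up, server version:", client.server_info()["version"])
```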
## Run demo
### Init project
- start project
```shell
scrapy startproject hello
```
- project layout
```shell
.
├── scrapy.cfg            # config file
└── hello
    ├── __init__.py
    ├── items.py          # data structure (item definitions)
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders           # directory for your spiders
        ├── __init__.py
        └── hello_spider.py
```
### Define item
- example
```python
# items.py
import scrapy


class HelloItem(scrapy.Item):
    # fields populated by the spider below
    title = scrapy.Field()
    url = scrapy.Field()
```
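- note: Scrapy items behave like dicts over a fixed set of declared fields, which is what lets the pipeline later call `dict(item)`; a quick illustration (not part of the project files)
```python
from hello.items import HelloItem

item = HelloItem(title="Example question", url="https://stackoverflow.com/q/1")
print(item["title"])   # dict-style field access
print(dict(item))      # plain-dict conversion, used by the pipeline later
# item["votes"] = 3    # would raise KeyError: only declared fields are allowed
```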
### Create spider
- example
```python
# hello_spider.py
import scrapy

from hello.items import HelloItem


class HelloSpider(scrapy.Spider):
    name = "hello"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        # Each question summary holds one title link
        questions = response.xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = HelloItem()
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').get()
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').get()
            yield item
```
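- tip: selectors are easiest to debug interactively with `scrapy shell` before wiring them into the spider; a sketch of such a session (Stack Overflow's markup may have changed since this demo was written)
```python
# Run `scrapy shell "https://stackoverflow.com/questions?pagesize=50&sort=newest"`;
# inside the shell, `response` is prepopulated with the fetched page:
response.xpath('//div[@class="summary"]/h3/a/text()').get()        # first title
response.xpath('//div[@class="summary"]/h3/a/@href').getall()[:3]  # first few links
```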
### Use pipeline
- enable the pipeline in `settings.py`; the trailing `300` sets its order (pipelines run from lower to higher values in the 0-1000 range)
```python
ITEM_PIPELINES = {'hello.pipelines.MongoDBPipeline': 300}
MONGODB_HOST = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "hello"
MONGODB_COLLECTION = "questions"
```
- example
```python
# pipelines.py
import pymongo
from scrapy.exceptions import DropItem
from scrapy.utils.project import get_project_settings


class MongoDBPipeline:
    def __init__(self):
        settings = get_project_settings()
        connection = pymongo.MongoClient(
            settings['MONGODB_HOST'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Drop items with a missing or empty field instead of storing them
        for field in item:
            if not item[field]:
                raise DropItem(f"Missing {field}!")
        # insert_one replaces the long-deprecated Collection.insert
        self.collection.insert_one(dict(item))
        return item
```
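- alternative: Scrapy's item-pipeline docs also show receiving settings through a `from_crawler` classmethod instead of the global `get_project_settings()` call; a minimal sketch reusing the same setting names (`process_item` would be unchanged from above)
```python
# Hypothetical variant of pipelines.py using from_crawler
import pymongo


class MongoDBPipeline:
    def __init__(self, host, port, db, collection):
        client = pymongo.MongoClient(host, port)
        self.collection = client[db][collection]

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the running crawler, whose settings
        # include everything defined in settings.py
        s = crawler.settings
        return cls(s["MONGODB_HOST"], s["MONGODB_PORT"],
                   s["MONGODB_DB"], s["MONGODB_COLLECTION"])
```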
### Deploy
- run crawler
```shell
cd hello/
scrapy crawl hello
```
- inspect the stored data
```shell
mongosh
show dbs
use hello
show collections
db.questions.find().pretty()
```
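- the same check can be scripted with PyMongo (a minimal sketch; assumes the crawl above has already stored documents in the default local instance)
```python
# read_back.py -- hypothetical scratch file, not part of the project
import pymongo

client = pymongo.MongoClient("localhost", 27017)
collection = client["hello"]["questions"]

print(collection.count_documents({}))        # number of stored questions
for doc in collection.find().limit(5):
    print(doc["title"], "->", doc["url"])
```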
## Misc
### Tool chain
- database: MongoDB
- lang/framework: Python Scrapy

### References
- https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
- https://docs.mongodb.com/drivers/pymongo/
- https://www.runoob.com/mongodb/mongodb-tutorial.html
- https://docs.scrapy.org/en/latest/topics/settings.html
- https://docs.scrapy.org/en/latest/topics/items.html
- https://doc.scrapy.org/en/latest/topics/item-pipeline.html
- https://docs.scrapy.org/en/latest/topics/spiders.html