https://github.com/lunarwhite/hello-scrapy-mongo

# hello-scrapy-mongo

A simple crawler demo that scrapes the newest questions from [stackoverflow.com](https://stackoverflow.com/) with Python Scrapy and stores them in MongoDB through an item pipeline.

Explore more details in this [blog post](https://lunarwhite.notion.site/Sraping-Website-using-Scrapy-and-MongoDB-800f14544de14cf7bc684191e8052198).

## Setup env

### MongoDB

- install
```shell
# Import the public key used by the package management system
wget -qO - https://www.mongodb.org/static/pgp/server-5.0.asc | sudo apt-key add -

# Create a list file for MongoDB
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-5.0.list

# Reload local package database
sudo apt-get update

# Install the MongoDB packages
sudo apt-get install -y mongodb-org
```

- run/stop
```shell
# Start MongoDB
sudo systemctl start mongod

# Verify that MongoDB has started successfully
sudo systemctl status mongod

# Stop MongoDB
sudo systemctl stop mongod

# Restart MongoDB
sudo systemctl restart mongod

# Begin using MongoDB
mongosh
```

### Python

- check version
```shell
python3 -V
```
- create virtual env
```shell
python3 -m venv .venv
source .venv/bin/activate

# Run `deactivate` to leave the virtual env
```
- pkg: `scrapy`
```shell
pip install scrapy
```
- pkg: `pymongo`
```shell
python3 -m pip install 'pymongo[srv]'
```
- freeze dependencies
```shell
pip freeze > requirements.txt
```

## Run demo

### Init project

- start project
```shell
scrapy startproject hello
```
- project layout
```shell
.
├── scrapy.cfg            # config file
└── hello
    ├── __init__.py
    ├── items.py          # data structure
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders           # directory for your spiders
        ├── __init__.py
        └── hello_spider.py
```

### Define item

- example
```python
# items.py

import scrapy

class HelloItem(scrapy.Item):
    """Fields scraped for each question."""
    title = scrapy.Field()
    url = scrapy.Field()
```
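
Scrapy 2.2 and later (through the `itemadapter` package) also accept plain dataclass objects as items. As a hedged sketch, the same two fields could be declared with the standard library alone. `QuestionItem` is a hypothetical name and the sample values are made up; neither is part of the original project.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class QuestionItem:
    """Hypothetical dataclass equivalent of HelloItem (not in the original repo)."""
    title: Optional[str] = None
    url: Optional[str] = None


# Build an item and convert it to a plain dict, the shape a pipeline would store
example = QuestionItem(
    title="Example question",
    url="/questions/1/example-question",
)
example_doc = asdict(example)
```

`scrapy.Item` remains the form used throughout this demo; the dataclass variant only shows that the item is little more than a named pair of fields.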

### Create spider

- example
```python
# hello_spider.py

import scrapy
from hello.items import HelloItem

class HelloSpider(scrapy.Spider):
    name = "hello"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "https://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        # Select the <h3> of every question summary block
        questions = response.xpath('//div[@class="summary"]/h3')

        for question in questions:
            item = HelloItem()
            # .get() returns None instead of raising when nothing matches
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').get()
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').get()
            yield item
```
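
The `href` values collected by the spider are assumed here to be site-relative paths rather than full URLs. A minimal sketch of turning them into absolute links with the standard library follows. `BASE_URL` and `absolute_url` are illustrative names, and the sample path is made up. Inside a spider, Scrapy's `response.urljoin()` does the same job.

```python
from urllib.parse import urljoin

# The start URL from the spider, used as the base for relative links
BASE_URL = "https://stackoverflow.com/questions?pagesize=50&sort=newest"


def absolute_url(href: str, base: str = BASE_URL) -> str:
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


# A relative path (hypothetical) is resolved against the site root
sample = absolute_url("/questions/12345/example-question")

# An already absolute URL is returned unchanged
unchanged = absolute_url("https://stackoverflow.com/questions/1/other")
```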

### Use pipeline

- add rules in `settings.py`
```python
ITEM_PIPELINES = {'hello.pipelines.MongoDBPipeline': 300}

MONGODB_HOST = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "hello"
MONGODB_COLLECTION = "questions"
```
- example
```python
# pipelines.py

import pymongo
from scrapy.exceptions import DropItem
from scrapy.utils.project import get_project_settings

class MongoDBPipeline:

    def __init__(self):
        # Read the MONGODB_* values defined in settings.py
        settings = get_project_settings()
        connection = pymongo.MongoClient(
            settings['MONGODB_HOST'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Drop the item if any field value is empty
        for field, value in item.items():
            if not value:
                raise DropItem("Missing {0}!".format(field))
        # insert_one() replaces Collection.insert(), which PyMongo 4 removed
        self.collection.insert_one(dict(item))
        return item
```
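
The validation step in `process_item` can be tried without Scrapy or a running MongoDB instance. The sketch below isolates the check in a hypothetical helper, `missing_fields`, that works on a plain dict. It assumes that any falsy value (`None`, an empty string) counts as missing. The sample items are made up.

```python
from typing import Any, Dict, List


def missing_fields(item: Dict[str, Any]) -> List[str]:
    """Return the names of fields whose values are empty (hypothetical helper)."""
    return [field for field, value in item.items() if not value]


complete = {"title": "Example question", "url": "/questions/1/example"}
incomplete = {"title": "Example question", "url": None}

print(missing_fields(complete))    # []
print(missing_fields(incomplete))  # ['url']
```

A pipeline could call such a helper and raise `DropItem` whenever the returned list is not empty, which also makes it possible to report all missing fields at once.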

### Deploy

- run crawler
```shell
cd hello/
scrapy crawl hello
```
- inspect the stored data
```shell
mongosh
show dbs
use hello

show collections
db.questions.find().pretty()
```

## Misc

### Tool chain

- database: MongoDB
- lang/framework: Python Scrapy

### References

- https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
- https://docs.mongodb.com/drivers/pymongo/
- https://www.runoob.com/mongodb/mongodb-tutorial.html
- https://docs.scrapy.org/en/latest/topics/settings.html
- https://docs.scrapy.org/en/latest/topics/items.html
- https://doc.scrapy.org/en/latest/topics/item-pipeline.html
- https://docs.scrapy.org/en/latest/topics/spiders.html