A simple crawler for Stack Overflow's newest questions.
- Host: GitHub
- URL: https://github.com/lunarwhite/hello-scrapy-mongo
- Owner: lunarwhite
- License: MIT
- Created: 2021-11-25T02:47:47.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2024-05-15T03:39:40.000Z (6 months ago)
- Last Synced: 2024-05-15T20:45:12.923Z (6 months ago)
- Topics: crwal, demo, mongodb, scraping-websites, scrapy
- Language: Python
- Size: 45.9 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# hello-scrapy-mongo
A simple crawler demo that scrapes the newest questions from [stackoverflow.com](https://stackoverflow.com/), built with Python Scrapy and backed by a MongoDB item pipeline.
Explore more details in this [blog post](https://lunarwhite.notion.site/Sraping-Website-using-Scrapy-and-MongoDB-800f14544de14cf7bc684191e8052198).
## Setup env
### MongoDB
- install
```shell
# Import the public key used by the package management system
wget -qO - https://www.mongodb.org/static/pgp/server-5.0.asc | sudo apt-key add -
# Create a list file for MongoDB
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-5.0.list
# Reload local package database
sudo apt-get update
# Install the MongoDB packages
sudo apt-get install -y mongodb-org
```
- run/stop
```shell
# Start MongoDB
sudo systemctl start mongod
# Verify that MongoDB has started successfully
sudo systemctl status mongod
# Stop MongoDB
sudo systemctl stop mongod
# Restart MongoDB
sudo systemctl restart mongod
# Begin using MongoDB
mongosh
```
### Python
- check version
```shell
python3 -V
```
- create virtual env
```shell
python3 -m venv .venv
source .venv/bin/activate
# deactivate
```
- pkg: `scrapy`
```shell
pip install scrapy
```
- pkg: `pymongo` (a quick connectivity check follows this list)
```shell
python3 -m pip install 'pymongo[srv]'
```
- freeze dependencies
```shell
pip freeze > requirements.txt
```
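- verify connection: with `mongod` running and both packages installed, a quick PyMongo sanity check (a minimal sketch; `check_mongo.py` is just a hypothetical scratch file, not part of the project)
```python
# check_mongo.py -- hypothetical scratch file, not part of the project
import pymongo

client = pymongo.MongoClient("localhost", 27017)
# The ping command round-trips to the server and raises if it is unreachable
client.admin.command("ping")
print("MongoDB is up, server version:", client.server_info()["version"])
```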
## Run demo
### Init project
- start project
```shell
scrapy startproject hello
```
- project layout
```shell
.
├── scrapy.cfg            # config file
└── hello
    ├── __init__.py
    ├── items.py          # data structure (item definitions)
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders           # directory for your spiders
        ├── __init__.py
        └── hello_spider.py
```
### Define item
- example
```python
# items.py
import scrapy


class HelloItem(scrapy.Item):
    # fields populated by the spider below
    title = scrapy.Field()
    url = scrapy.Field()
```
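- note: Scrapy items behave like dicts over a fixed set of declared fields, which is what lets the pipeline later call `dict(item)`; a quick illustration (not part of the project files)
```python
from hello.items import HelloItem

item = HelloItem(title="Example question", url="https://stackoverflow.com/q/1")
print(item["title"])   # dict-style field access
print(dict(item))      # plain-dict conversion, used by the pipeline later
# item["votes"] = 3    # would raise KeyError: only declared fields are allowed
```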
### Create spider
- example
```python
# hello_spider.py
import scrapy

from hello.items import HelloItem


class HelloSpider(scrapy.Spider):
    name = "hello"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        # Each question summary holds one title link
        questions = response.xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = HelloItem()
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').get()
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').get()
            yield item
```
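- tip: selectors are easiest to debug interactively with `scrapy shell` before wiring them into the spider; a sketch of such a session (Stack Overflow's markup may have changed since this demo was written)
```python
# Run `scrapy shell "https://stackoverflow.com/questions?pagesize=50&sort=newest"`;
# inside the shell, `response` is prepopulated with the fetched page:
response.xpath('//div[@class="summary"]/h3/a/text()').get()        # first title
response.xpath('//div[@class="summary"]/h3/a/@href').getall()[:3]  # first few links
```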
### Use pipeline
- enable the pipeline in `settings.py`; the trailing `300` sets its order (pipelines run from lower to higher values in the 0-1000 range)
```python
ITEM_PIPELINES = {'hello.pipelines.MongoDBPipeline': 300}
MONGODB_HOST = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "hello"
MONGODB_COLLECTION = "questions"
```
- example
```python
# pipelines.py
import pymongo
from scrapy.exceptions import DropItem
from scrapy.utils.project import get_project_settings


class MongoDBPipeline:
    def __init__(self):
        settings = get_project_settings()
        connection = pymongo.MongoClient(
            settings['MONGODB_HOST'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Drop items with a missing or empty field instead of storing them
        for field in item:
            if not item[field]:
                raise DropItem(f"Missing {field}!")
        # insert_one replaces the long-deprecated Collection.insert
        self.collection.insert_one(dict(item))
        return item
```
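- alternative: Scrapy's item-pipeline docs also show receiving settings through a `from_crawler` classmethod instead of the global `get_project_settings()` call; a minimal sketch reusing the same setting names (`process_item` would be unchanged from above)
```python
# Hypothetical variant of pipelines.py using from_crawler
import pymongo


class MongoDBPipeline:
    def __init__(self, host, port, db, collection):
        client = pymongo.MongoClient(host, port)
        self.collection = client[db][collection]

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the running crawler, whose settings
        # include everything defined in settings.py
        s = crawler.settings
        return cls(s["MONGODB_HOST"], s["MONGODB_PORT"],
                   s["MONGODB_DB"], s["MONGODB_COLLECTION"])
```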
### Deploy
- run crawler
```shell
cd hello/
scrapy crawl hello
```
- inspect the stored data
```shell
mongosh
show dbs
use hello
show collections
db.questions.find().pretty()
```
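- the same check can be scripted with PyMongo (a minimal sketch; assumes the crawl above has already stored documents in the default local instance)
```python
# read_back.py -- hypothetical scratch file, not part of the project
import pymongo

client = pymongo.MongoClient("localhost", 27017)
collection = client["hello"]["questions"]

print(collection.count_documents({}))        # number of stored questions
for doc in collection.find().limit(5):
    print(doc["title"], "->", doc["url"])
```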
## Misc
### Tool chain
- database: MongoDB
- lang/framework: Python Scrapy

### References
- https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
- https://docs.mongodb.com/drivers/pymongo/
- https://www.runoob.com/mongodb/mongodb-tutorial.html
- https://docs.scrapy.org/en/latest/topics/settings.html
- https://docs.scrapy.org/en/latest/topics/items.html
- https://doc.scrapy.org/en/latest/topics/item-pipeline.html
- https://docs.scrapy.org/en/latest/topics/spiders.html