# scrapy-tutorial

A learning project from the book 'Scrapy一本就精通'.

## environment
- [macOS 10.14.6](https://www.apple.com/tw/macos/mojave/)
- [PyCharm 2019.2.3](https://www.jetbrains.com/pycharm/)
- [Python 3.7.4](https://www.python.org/)
- [Scrapy 1.6.0](https://github.com/scrapy/scrapy)

## [Scrapy](https://scrapy.org/)
```shell
# install
$ pipenv install scrapy

# create a new project
$ scrapy startproject <project_name>
# create a spider file
$ scrapy genspider <spider_name> <domain>

# show the project layout
$ brew install tree
$ tree .
Scrapy/
├── Scrapy
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── spiders.py
└── scrapy.cfg

# run the spider (-t: output format, -o: output file path)
$ scrapy crawl books -t csv -o books.csv --nolog
# %(name)s: spider name, %(time)s: file creation time
$ scrapy crawl books -o 'export_data/%(name)s/%(time)s.csv'
# print the exported rows, skipping the CSV header
$ sed -n '2,$p' books.csv | cat -n
# show the first 5 lines of the file
$ head -5 books.csv

# open the scrapy shell on a page
$ scrapy shell <url>
# inside the shell: open the fetched response in a browser
>>> view(response)
```
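
The `scrapy crawl books` commands above assume a spider named `books`; a minimal sketch of such a spider for http://books.toscrape.com (field names and selectors are illustrative, not necessarily this repo's exact code):

```python
# spiders/books.py -- minimal sketch of the 'books' spider
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # each book on a listing page sits in an <article class="product_pod">
        for book in response.css('article.product_pod'):
            yield {
                'name': book.css('h3 a::attr(title)').extract_first(),
                'price': book.css('p.price_color::text').extract_first(),
            }
        # follow the pagination link if there is one
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            yield response.follow(next_url, callback=self.parse)
```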

## selector
- [XPath](https://www.w3.org/TR/xpath/all/)
- [XPath Syntax](https://www.w3schools.com/xml/xpath_syntax.asp)
- [CSS](https://www.w3.org/TR/selectors-3/)
- [CSS Selector](https://www.w3schools.com/cssref/css_selectors.asp)
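
Both syntaxes can be tried directly in the scrapy shell; a quick comparison (the page and selectors are illustrative):

```python
# inside `scrapy shell 'http://books.toscrape.com/'`
# XPath
response.xpath('//article[@class="product_pod"]/h3/a/@title').extract_first()
response.xpath('//p[@class="price_color"]/text()').extract()
# CSS
response.css('article.product_pod h3 a::attr(title)').extract_first()
response.css('p.price_color::text').extract()
```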

## [tesseract](https://github.com/tesseract-ocr/tesseract)
```shell
$ brew install tesseract
$ brew install tesseract-lang # language packs
```
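
A minimal OCR sketch using the `pytesseract` wrapper (assuming `pipenv install pytesseract pillow`; the image path and language are placeholders):

```python
from PIL import Image

import pytesseract

# recognise the text in a (placeholder) captcha image
image = Image.open('captcha.png')
text = pytesseract.image_to_string(image, lang='eng')
print(text)
```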

## Browser Cookies Middleware
```python
# scrapy shell
from scrapy import Request
url = 'https://github.com/settings/profile'
fetch(Request(url, meta={'cookiejar': 'chrome'}))
view(response)
```
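
`meta={'cookiejar': 'chrome'}` only works if a downloader middleware fills that cookie jar from the local browser. A sketch of such a middleware built on the `browsercookie` package (this mirrors the general recipe, not necessarily the book's exact implementation; `scrapy_tutorial` is a placeholder module path):

```python
# middlewares.py -- sketch of a browser-cookies downloader middleware
import browsercookie
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


class BrowserCookiesMiddleware(CookiesMiddleware):
    def __init__(self, debug=False):
        super().__init__(debug)
        self.load_browser_cookies()

    def load_browser_cookies(self):
        # the jar keyed 'chrome' holds cookies read from the local Chrome profile
        jar = self.jars['chrome']
        for cookie in browsercookie.chrome():
            jar.set_cookie(cookie)
        # likewise for Firefox
        jar = self.jars['firefox']
        for cookie in browsercookie.firefox():
            jar.set_cookie(cookie)
```
```python
# settings.py -- replace the built-in cookies middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'scrapy_tutorial.middlewares.BrowserCookiesMiddleware': 701,
}
```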

## [Splash](https://splash.readthedocs.io/en/stable/)
```shell
$ docker pull scrapinghub/splash
$ docker run --name mysplash -p 8050:8050 -p 8051:8051 -d scrapinghub/splash
```
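
To call Splash from a spider, the usual route is the scrapy-splash package (`pipenv install scrapy-splash`); the settings below follow its documented setup, and the example URL is just a JavaScript-rendered demo page:

```python
# settings.py -- standard scrapy-splash configuration
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```
```python
# in a spider: render the page through Splash before parsing it
from scrapy_splash import SplashRequest

def start_requests(self):
    yield SplashRequest('http://quotes.toscrape.com/js/',
                        callback=self.parse, args={'wait': 0.5})
```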

## HTTP Proxy test
```shell
# HTTP proxy environment variable setup
$ export http_proxy="http://username:password@proxy_ip:proxy_port"
$ export https_proxy="https://username:password@proxy_ip:proxy_port"
$ scrapy shell
```
```python
import json
import base64
from scrapy import Request

url = 'https://httpbin.org/ip'
proxy = 'http://proxy_ip:proxy_port'
user = 'username'
password = 'password'
auth = '{}:{}'.format(user, password).encode('utf8')
request = Request(url, meta={'proxy': proxy})
request.headers['Proxy-Authorization'] = b'Basic ' + base64.b64encode(auth)
fetch(request)
json.loads(response.text)
```

## free Proxy
- [Proxy List](http://proxy-list.org/english/index.php)
- [Free Proxy List](https://free-proxy-list.net/)
- [西刺免費代理IP](https://www.xicidaili.com/)
- [Proxy 360](http://www.proxy360.cn/default.aspx)
- [快代理](https://www.kuaidaili.com/)
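
One common way to use such lists is a downloader middleware that picks a random proxy per request; a sketch (the proxy addresses are placeholders, and `scrapy_tutorial` is a placeholder module path):

```python
# middlewares.py -- sketch of a random-proxy downloader middleware
import random


class RandomProxyMiddleware:
    # fill this list from the proxy sites above
    PROXIES = [
        'http://1.2.3.4:8080',
        'http://5.6.7.8:3128',
    ]

    def process_request(self, request, spider):
        # attach a randomly chosen proxy to every outgoing request
        request.meta['proxy'] = random.choice(self.PROXIES)
```
```python
# settings.py -- run before the built-in HttpProxyMiddleware (priority 750)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_tutorial.middlewares.RandomProxyMiddleware': 745,
}
```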

## SQLite
```shell
# open (or create) the database file
$ sqlite3 scrapy.db
# leave the sqlite3 prompt
sqlite> .exit
```
```sqlite
CREATE TABLE books (
    upc CHAR(16) NOT NULL PRIMARY KEY,
    name VARCHAR(256) NOT NULL,
    price VARCHAR(16) NOT NULL,
    review_rating INT,
    review_num INT,
    stock INT
);
SELECT * FROM books;
```
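
A sketch of an item pipeline that writes into the `books` table above (item field names follow the table schema; not necessarily this repo's exact code). Enable it with `ITEM_PIPELINES = {'scrapy_tutorial.pipelines.SQLitePipeline': 300}`, where the module path is a placeholder:

```python
# pipelines.py -- minimal sketch of a SQLite item pipeline
import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        self.db = sqlite3.connect('scrapy.db')
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.commit()
        self.db.close()

    def process_item(self, item, spider):
        self.cursor.execute(
            'INSERT INTO books VALUES (?, ?, ?, ?, ?, ?)',
            (item['upc'], item['name'], item['price'],
             item['review_rating'], item['review_num'], item['stock']))
        return item
```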

## [MySQL(docker)](https://hub.docker.com/_/mysql)
### [mysqlclient](https://github.com/PyMySQL/mysqlclient-python)
```shell
# start a MySQL container
$ docker run --name mymysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -p 3306:3306 -d mysql:5.7.26
# enter the container and open a MySQL client
$ docker exec -it mymysql bash
$ mysql -h 127.0.0.1 -u root -p
# install the Python driver on the host
$ brew install mysql-connector-c
$ pip install mysqlclient
```
```mysql
CREATE DATABASE scrapy_db CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';
USE scrapy_db;
CREATE TABLE books (
    upc CHAR(16) NOT NULL PRIMARY KEY,
    name VARCHAR(256) NOT NULL,
    price VARCHAR(16) NOT NULL,
    review_rating INT,
    review_num INT,
    stock INT
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SELECT * FROM books;
```
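
The same pipeline pattern works for MySQL via mysqlclient; a sketch using the credentials from the docker command above (not necessarily this repo's exact code):

```python
# pipelines.py -- minimal sketch of a MySQL item pipeline (mysqlclient)
import MySQLdb


class MySQLPipeline:
    def open_spider(self, spider):
        self.db = MySQLdb.connect(host='127.0.0.1', port=3306,
                                  user='root', passwd='my-secret-pw',
                                  db='scrapy_db', charset='utf8')
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.commit()
        self.db.close()

    def process_item(self, item, spider):
        self.cursor.execute(
            'INSERT INTO books VALUES (%s, %s, %s, %s, %s, %s)',
            (item['upc'], item['name'], item['price'],
             item['review_rating'], item['review_num'], item['stock']))
        return item
```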

## [MongoDB(docker)](https://hub.docker.com/_/mongo)
```shell
# set up the MongoDB docker container
$ docker images
$ docker rmi <image_id> # remove an old image if necessary
$ docker pull mongo
$ docker run --name mymongo -p 27017:27017 -d mongo
$ docker exec -it mymongo bash
$ mongo

# inside the mongo shell
> use scrapy_data
> db.getCollectionNames()
> db.books.count()
> db.books.find()
> db.books.drop()
```
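
A sketch of the corresponding pipeline with pymongo (`pipenv install pymongo`), using the `scrapy_data` database and `books` collection from the commands above:

```python
# pipelines.py -- minimal sketch of a MongoDB item pipeline (pymongo)
import pymongo


class MongoDBPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.client['scrapy_data']['books']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # store each scraped item as one document
        self.collection.insert_one(dict(item))
        return item
```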

## [redis](https://redis.io/)
### [redis(docker)](https://hub.docker.com/_/redis/)
```shell
$ docker run --name myredis -p 6379:6379 -d redis
$ docker exec -it myredis bash
$ redis-cli
```
### redis server(Ubuntu)
```shell
# install redis-server
$ sudo apt-get install redis-server
$ sudo service redis-server start
$ sudo service redis-server restart
$ sudo service redis-server stop

# check the listening service
$ sudo apt install net-tools
$ netstat -ntl

# configure connection restrictions
$ sudo vi /etc/redis/redis.conf
# accept requests from any IP:
# change `bind 127.0.0.1` to `bind 0.0.0.0`

# get the machine's IP
$ ifconfig
# use -h to connect to the redis host
$ redis-cli -h <host_ip>
# test that the connection works
$ PING
# set the start URL for scrapy-redis
$ lpush books:start_urls 'http://books.toscrape.com/'
```
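
The `lpush books:start_urls ...` command feeds a scrapy-redis spider; a sketch of the settings and spider that consume it (assuming `pipenv install scrapy-redis`; the redis host is a placeholder):

```python
# settings.py -- standard scrapy-redis configuration
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
SCHEDULER_PERSIST = True
REDIS_URL = 'redis://<host_ip>:6379'  # fill in the redis host
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
```
```python
# spiders/books.py -- a spider that reads its start URLs from redis
from scrapy_redis.spiders import RedisSpider


class BooksSpider(RedisSpider):
    name = 'books'
    redis_key = 'books:start_urls'

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'name': book.css('h3 a::attr(title)').extract_first(),
                'price': book.css('p.price_color::text').extract_first(),
            }
```
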
### redis-cli
```shell
# KEYS
$ KEYS *     # get all keys in redis
$ KEYS key:* # get all keys matching the pattern key:*

# String
$ SET key value # set the value of string key
$ GET key       # get the value of string key
$ DEL key       # delete key

# List
$ LPUSH key [value]    # push one or more values onto the left end of list key
$ RPUSH key [value]    # push one or more values onto the right end of list key
$ LPOP key             # pop a value from the left end of list key
$ RPOP key             # pop a value from the right end of list key
$ LINDEX key index     # get the value at position index of list key
$ LRANGE key start end # get the values between positions start and end of list key
$ LLEN key             # get the length of list key

# Hash
$ HSET key field value # set field of hash key to value
$ HDEL key [field]     # delete one or more fields from hash key
$ HGET key field       # get the value of field in hash key
$ HGETALL key          # get all fields and values of hash key

# Set
$ SADD key [member]    # add one or more members to set key
$ SREM key [member]    # remove one or more members from set key
$ SMEMBERS key         # get all members of set key
$ SCARD key            # get the number of members of set key
$ SISMEMBER key member # check whether member belongs to set key

# ZSet (sorted set)
$ ZADD key [score member]   # add one or more members with scores to sorted set key
$ ZREM key [member]         # remove one or more members from sorted set key
$ ZRANGE key start stop     # get the members between positions start and stop of sorted set key
$ ZRANGEBYSCORE key min max # get the members with scores between min and max of sorted set key
```
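
The same commands are available from Python through the `redis` package (`pipenv install redis`); for example, reading back the items that scrapy-redis's RedisPipeline pushed onto `books:items` (the key name follows scrapy-redis's default `<spider>:items` convention):

```python
import json

import redis

r = redis.StrictRedis(host='localhost', port=6379)
# pop scraped items one by one from the list the RedisPipeline writes to
while r.llen('books:items') > 0:
    data = r.lpop('books:items')
    print(json.loads(data.decode('utf8')))
```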