{"id":20244668,"url":"https://github.com/billy0402/scrapy-tutorial","last_synced_at":"2026-04-13T03:03:44.681Z","repository":{"id":165279296,"uuid":"211459237","full_name":"billy0402/scrapy-tutorial","owner":"billy0402","description":"A learning project from the book 'Scrapy一本就精通'.","archived":false,"fork":false,"pushed_at":"2019-09-28T07:06:50.000Z","size":568,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-07-19T05:30:22.061Z","etag":null,"topics":["course","crawler","docker","mongodb","mysql","proxy","python","redis","scrapy","splash","sqlite","ubuntu"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/billy0402.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-09-28T07:05:37.000Z","updated_at":"2024-04-13T15:53:07.000Z","dependencies_parsed_at":null,"dependency_job_id":"5071fd43-49cc-4d6a-b53f-d19684602e8f","html_url":"https://github.com/billy0402/scrapy-tutorial","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/billy0402/scrapy-tutorial","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/billy0402%2Fscrapy-tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/billy0402%2Fscrapy-tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/billy0402%2Fscrapy-tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/billy0402%2Fscrapy-tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/billy0402","download_url":"https://codeload.github.com/billy0402/scrapy-tutorial/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/billy0402%2Fscrapy-tutorial/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267442478,"owners_count":24087805,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-27T02:00:11.917Z","response_time":82,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["course","crawler","docker","mongodb","mysql","proxy","python","redis","scrapy","splash","sqlite","ubuntu"],"created_at":"2024-11-14T09:16:42.895Z","updated_at":"2026-04-13T03:03:44.616Z","avatar_url":"https://github.com/billy0402.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# scrapy-tutorial\n\n## environment\n- [macOS 10.14.6](https://www.apple.com/tw/macos/mojave/)\n- [PyCharm 2019.2.3](https://www.jetbrains.com/pycharm/)\n- [Python 3.7.4](https://www.python.org/)\n- [Scrapy 1.6.0](https://github.com/scrapy/scrapy)\n\n## [Scrapy](https://scrapy.org/)\n```shell\n# install\n$ pipenv install scrapy\n\n# create new project\n$ scrapy startproject \u003cproject name\u003e\n# create spider file\n$ scrapy genspider \u003cspider name\u003e \u003cdomain\u003e\n\n# tree project\n$ 
pipenv install tree\n$ tree .\nScrapy/\n├── \u003cproject name\u003e\n│   ├── __init__.py\n│   ├── items.py\n│   ├── middlewares.py\n│   ├── pipelines.py\n│   ├── settings.py\n│   └── spiders\n│       ├── __init__.py\n│       └── spiders.py\n└── scrapy.cfg\n\n# run the crawler (-t: output file type, -o: output file path)\n$ scrapy crawl books -t csv -o books.csv --nolog\n# %(name)s: spider name, %(time)s: file creation time\n$ scrapy crawl books -o 'export_data/%(name)s/%(time)s.csv'\n# show the result rows (header skipped) with line numbers\n$ sed -n '2,$p' books.csv | cat -n\n# show the first 5 lines of the file\n$ head -5 books.csv\n\n# open scrapy shell\n$ scrapy shell \u003curl\u003e\n# open the response in a browser (typed at the scrapy shell prompt)\n$ view(response)\n```\n\n## selector\n- [XPath](https://www.w3.org/TR/xpath/all/)\n- [XPath Syntax](https://www.w3schools.com/xml/xpath_syntax.asp)\n- [CSS](https://www.w3.org/TR/selectors-3/)\n- [CSS Selector](https://www.w3schools.com/cssref/css_selectors.asp)\n\n## [tesseract](https://github.com/tesseract-ocr/tesseract)\n```shell\n$ brew install tesseract\n$ brew install tesseract-lang # language packs\n```\n\n## Browser Cookies Middleware\n```python\n# scrapy shell\nfrom scrapy import Request\nurl = 'https://github.com/settings/profile'\nfetch(Request(url, meta={'cookiejar': 'chrome'}))\nview(response)\n```\n\n## [Splash](https://splash.readthedocs.io/en/stable/)\n```shell\n$ docker pull scrapinghub/splash\n$ docker run --name mysplash -p 8050:8050 -p 8051:8051 -d scrapinghub/splash\n```\n\n## HTTP Proxy test\n```shell\n# HTTP proxy environment variable setup (set one or both)\n$ export http_proxy=\"http://username:password@proxy_ip:proxy_port\"\n$ export https_proxy=\"https://username:password@proxy_ip:proxy_port\"\n$ scrapy shell\n```\n```python\nimport json\nimport base64\nfrom scrapy import Request\n\nurl = 'https://httpbin.org/ip'\nproxy = 'proxy_ip:proxy_port'\nuser = 'username'\npassword = 'password'\nauth = '{}:{}'.format(user, password).encode('utf8')\nrequest = Request(url, meta={'proxy': proxy})\nrequest.headers['Proxy-Authorization'] = b'Basic ' + 
base64.b64encode(auth)\nfetch(request)\njson.loads(response.text)\n```\n\n## free Proxy\n- [Proxy List](http://proxy-list.org/english/index.php)\n- [Free Proxy List](https://free-proxy-list.net/)\n- [西刺免費代理IP](https://www.xicidaili.com/)\n- [Proxy 360](http://www.proxy360.cn/default.aspx)\n- [快代理](https://www.kuaidaili.com/)\n\n## SQLite\n```shell\n$ sqlite3 scrapy.db\n$ .exit\n```\n```sqlite\nCREATE TABLE books (\n    upc           CHAR(16) NOT NULL PRIMARY KEY,\n    name          VARCHAR(256) NOT NULL,\n    price         VARCHAR(16) NOT NULL,\n    review_rating INT,\n    review_num    INT,\n    stock         INT\n);\nSELECT * FROM books;\n```\n\n## [MySQL(docker)](https://hub.docker.com/_/mysql)\n### [mysqlclient](https://github.com/PyMySQL/mysqlclient-python)\n```shell\n$ docker run --name mymysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -p 3306:3306 -d mysql:5.7.26\n$ docker exec -it mymysql bash\n$ mysql -h 127.0.0.1 -u root -p\n$ brew install mysql-connector-c\n$ pip install mysqlclient\n```\n```mysql\nCREATE DATABASE scrapy_db CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';\nUSE scrapy_db;\nCREATE TABLE books (\n    upc           CHAR(16) NOT NULL PRIMARY KEY,\n    name          VARCHAR(256) NOT NULL,\n    price         VARCHAR(16) NOT NULL,\n    review_rating INT,\n    review_num    INT,\n    stock         INT\n) ENGINE=InnoDB DEFAULT CHARSET=utf8;\nSELECT * FROM books;\n```\n\n## [MongoDB(docker)](https://hub.docker.com/_/mongo)\n```shell\n# setup MongoDB docker\n$ docker images\n$ docker rmi \u003cIMAGE ID\u003e\n$ docker pull mongo\n$ docker run --name mymongo -p 27017:27017 -d mongo\n$ docker exec -it mymongo bash\n$ mongo\n\n# MongoDB command\n$ use scrapy_data\n$ db.getCollectionNames()\n$ db.books.count()\n$ db.books.find()\n$ db.books.drop()\n```\n\n## [redis](https://redis.io/)\n### [redis(docker)](https://hub.docker.com/_/redis/)\n```shell\n$ docker run --name myredis -p 6379:6379 -d redis\n$ docker exec -it myredis bash\n$ redis-cli\n```\n### redis 
server(Ubuntu)\n```shell\n# install redis-server\n$ sudo apt-get install redis-server\n$ sudo service redis-server start\n$ sudo service redis-server restart\n$ sudo service redis-server stop\n\n# check which services are listening\n$ sudo apt install net-tools\n$ netstat -ntl\n\n# configure connection restrictions\n$ sudo vi /etc/redis/redis.conf\n# accept requests from any IP\n# bind 127.0.0.1 \u003e bind 0.0.0.0\n\n# get the host IP\n$ ifconfig\n# use the -h option to specify the host IP\n$ redis-cli -h \u003chost ip\u003e\n# test whether the database connection succeeds\n$ PING\n# set the start URL for scrapy-redis\n$ lpush books:start_urls 'http://books.toscrape.com/'\n```\n### redis-cli\n```shell\n# Keys\n$ KEYS * # get all keys in redis\n$ KEYS key:* # get all keys that start with key: (for example, books:*)\n\n# String\n$ SET key value # set the value of string key\n$ GET key # get the value of string key\n$ DEL key # delete string key\n\n# List\n$ LPUSH key [value] # insert one or more values at the left end of list key\n$ RPUSH key [value] # insert one or more values at the right end of list key\n$ LPOP key # pop one value from the left end of list key\n$ RPOP key # pop one value from the right end of list key\n$ LINDEX key index # get the value at position index in list key\n$ LRANGE key start end # get the values of list key at positions start through end\n$ LLEN key # get the length of list key\n\n# Hash\n$ HSET key field value # set field in hash key to value\n$ HDEL key [field] # delete one or more fields from hash key\n$ HGET key field # get the value of field in hash key\n$ HGETALL key # get all fields and values in hash key\n$ HGETALL key:field # get all fields and values of the hash stored under the key named key:field\n\n# Set\n$ SADD key [member] # add one or more members to set key\n$ SREM key [member] # remove one or more members from set key\n$ SMEMBERS key # get all members of set key\n$ SCARD key # get the number of members in set key\n$ SISMEMBER key member # check whether member belongs to set key\n\n# ZSet (sorted set)\n$ ZADD key [score member] # add one or more members to sorted set key\n$ ZREM key [member] # remove one or more members from sorted set key\n$ ZRANGE key start stop # get all members of sorted set key at positions start through stop\n$ ZRANGEBYSCORE key min max # get all members of sorted set key with scores from min to max\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbilly0402%2Fscrapy-tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbilly0402%2Fscrapy-tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbilly0402%2Fscrapy-tutorial/lists"}