Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jwlin/ptt-web-crawler
PTT web crawler
- Host: GitHub
- URL: https://github.com/jwlin/ptt-web-crawler
- Owner: jwlin
- License: mit
- Created: 2015-02-17T02:24:52.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2024-03-31T16:17:47.000Z (11 months ago)
- Last Synced: 2024-08-01T08:09:48.746Z (7 months ago)
- Language: Python
- Homepage:
- Size: 77.1 KB
- Stars: 431
- Watchers: 25
- Forks: 221
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ptt-web-crawler (PTT web crawler) [Build Status](https://travis-ci.org/jwlin/ptt-web-crawler)
### [English Readme](#english_desc)
### [Live demo](http://app.castman.net/ptt-web-crawler)
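### [Scrapy version](https://github.com/afunTW/ptt-web-crawler) by afunTW

### Features
* Crawls a single article or a range of articles
* Strips whitespace, blank lines, and special characters from the data
* Outputs JSON
* Supports Python 2.7 and 3.4-3.6

### Output JSON format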
```
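{
    "article_id": article ID,
    "article_title": article title,
    "author": author,
    "board": board name,
    "content": article content,
    "date": posting time,
    "ip": poster's IP address,
    "message_count": { # comment counts
        "all": total number of comments,
        "boo": number of boos (噓),
        "count": pushes minus boos,
        "neutral": number of neutral (→) comments,
        "push": number of pushes (推)
    },
    "messages": [ # comment contents
        {
            "push_content": comment content,
            "push_ipdatetime": comment time and IP address,
            "push_tag": 推 / 噓 / → (push / boo / neutral),
            "push_userid": commenter's user ID
        },
        ...
    ]
}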
```

### Parameters
```commandline
python crawler.py -b BOARD_NAME -i START_INDEX END_INDEX  (a negative index counts pages back from the last page)
python crawler.py -b BOARD_NAME -a ARTICLE_ID
```

### Example
Crawl the PublicServan board from page 100 (https://www.ptt.cc/bbs/PublicServan/index100.html)
through page 200 (https://www.ptt.cc/bbs/PublicServan/index200.html)
and write the output to `PublicServan-100-200.json`.

* Run the script directly
```commandline
cd PttWebCrawler
python crawler.py -b PublicServan -i 100 200
```
* Call the package
```commandline
python setup.py install
python -m PttWebCrawler -b PublicServan -i 100 200
```
* Use as a library
```python
from PttWebCrawler.crawler import *

c = PttWebCrawler(as_lib=True)
c.parse_articles(100, 200, 'PublicServan')
```

### Tests
```commandline
python test.py
```

***
ptt-web-crawler is a crawler for the web version of PTT, the largest online community in Taiwan.

    usage: python crawler.py [-h] -b BOARD_NAME (-i START_INDEX END_INDEX | -a ARTICLE_ID) [-v]

    optional arguments:
      -h, --help                  show this help message and exit
      -b BOARD_NAME               Board name
      -i START_INDEX END_INDEX    Start and end index
      -a ARTICLE_ID               Article ID
      -v, --version               show program's version number and exit

The output is written to `BOARD_NAME-START_INDEX-END_INDEX.json` (or `BOARD_NAME-ID.json`).
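
Each article record in the output follows the JSON format shown earlier. As a rough sketch of post-processing one such record, the Python 3 snippet below rebuilds the `message_count` fields from the `messages` list. The helper name `recount_messages` and the sample record are hypothetical, and treating `all` as the length of `messages` is an assumption; how records are wrapped inside the output file is not specified here, so the sketch works on a single article dict only.

```python
from collections import Counter

# Tag strings as listed in the output format: push, boo, neutral.
PUSH, BOO, NEUTRAL = "推", "噓", "→"


def recount_messages(article):
    """Rebuild the message_count fields from an article's messages list.

    Hypothetical helper, not part of ptt-web-crawler.
    """
    messages = article["messages"]
    tags = Counter(m["push_tag"].strip() for m in messages)
    push, boo, neutral = tags[PUSH], tags[BOO], tags[NEUTRAL]
    return {
        "all": len(messages),   # assumption: every comment is counted
        "boo": boo,
        "count": push - boo,    # pushes minus boos, as documented
        "neutral": neutral,
        "push": push,
    }


# Hand-written sample record for illustration (not real crawler output).
sample = {
    "article_id": "M.0000000000.A.000",
    "messages": [
        {"push_tag": "推", "push_userid": "user_a",
         "push_content": "nice", "push_ipdatetime": "01/01 12:00"},
        {"push_tag": "推", "push_userid": "user_b",
         "push_content": "agreed", "push_ipdatetime": "01/01 12:01"},
        {"push_tag": "噓", "push_userid": "user_c",
         "push_content": "no", "push_ipdatetime": "01/01 12:02"},
        {"push_tag": "→", "push_userid": "user_d",
         "push_content": "hmm", "push_ipdatetime": "01/01 12:03"},
    ],
}

summary = recount_messages(sample)
print(summary)
```

Comparing the recomputed dict with the stored `message_count` is one way to sanity-check a crawl before further analysis.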