Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Dump messages from 2ch with some preprocessing for ML analysis
- Host: GitHub
- URL: https://github.com/zenixls2/2chpreprocess
- Owner: zenixls2
- License: MIT
- Created: 2017-07-10T12:25:45.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-07-18T06:12:41.000Z (over 7 years ago)
- Last Synced: 2024-10-15T09:51:07.178Z (3 months ago)
- Topics: 2ch, crawler, python
- Language: Python
- Size: 17.6 KB
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
README
## 2ch preprocessor
This is a 2ch crawler and preprocessor for producing a chatbot-rnn-readable format.
The crawler part is mostly complete, but the preprocessor part is still under development.
- TODO ITEMS:
  * Normalize kanji, hiragana, and katakana to romaji (the Latin alphabet) to reduce the vocabulary size (see the romanization sketch after this list)
  * Truncate useless ANSI art
  * Create a postprocessor to convert romaji back to kanji, hiragana, or katakana
  * Preprocess data into the dialog format defined in chatbot-rnn
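The normalization step is not implemented yet; below is a minimal sketch of what it could look like, assuming the third-party pykakasi library (not a dependency of this project) for kana/kanji-to-romaji conversion:

```python
# Hypothetical romanization helper; pykakasi is an assumption, not part
# of this project's requirements (pip install pykakasi).
import pykakasi

kks = pykakasi.kakasi()

def to_romaji(text: str) -> str:
    """Convert kanji/hiragana/katakana to Hepburn romaji."""
    # convert() splits the text into segments, each carrying a
    # 'hepburn' romanization of that segment.
    return " ".join(seg["hepburn"] for seg in kks.convert(text))
```

A postprocessor going the other way (romaji back to kana/kanji, the third TODO item) is much harder, since romanization is lossy.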
### Features

- Ignore some functional boards, and crawl through all the threads on 2ch
- Checkpoint recovery (the bug should be fixed now)
- Multiple workers for crawling (see the worker sketch after this list)
- Save results to a sqlite database
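The README does not document how the workers are implemented; as a rough illustration, thread fetches can be fanned out across workers with the standard library alone. `fetch_thread` and `thread_urls` below are hypothetical placeholders, not the project's actual code:

```python
# Hedged sketch of multi-worker crawling using only the standard library.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch_thread(url: str) -> bytes:
    """Download one thread page."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()

def crawl(thread_urls, workers=4):
    # One slot per -w/--worker; results arrive as pages finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(fetch_thread, thread_urls)
```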
### Installation

```bash
# Ubuntu
sudo apt-get install sqlite3
# Mac
brew install sqlite3

git clone https://github.com/zenixls2/2chpreprocess
cd 2chpreprocess
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```

### Execution
usage: `main.py [-h] [-t] [-r] [-p] [-w WORKER]`

2ch data preprocessor & crawler

optional arguments:
* `-h, --help`: show this help message and exit
* `-t, --topic`: get topic links
* `-r, --rerun`: ignore cached results and rerun
* `-p, --process`: process the generated topic links
* `-w WORKER, --worker WORKER`: number of workers
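These options map onto a standard argparse setup. The sketch below is reconstructed from the help output above, not taken from the project's source, so the actual `main.py` may differ in details such as the default worker count:

```python
# Hedged reconstruction of the CLI from its --help output.
import argparse

parser = argparse.ArgumentParser(description="2ch data preprocessor & crawler")
parser.add_argument("-t", "--topic", action="store_true", help="get topic links")
parser.add_argument("-r", "--rerun", action="store_true",
                    help="ignore cached results and rerun")
parser.add_argument("-p", "--process", action="store_true",
                    help="process the generated topic links")
parser.add_argument("-w", "--worker", type=int, default=1,  # default is an assumption
                    help="number of workers")
args = parser.parse_args()
```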
First, crawl out all topics:

```bash
source venv/bin/activate
python main.py -t
```

Then crawl through all threads:
```bash
source venv/bin/activate
python main.py -p -w ${WORKER}
```
Notice that the user can stop the crawler at any time.
The process can be resumed from the `checkpoint.pkl` file stored in `save`.
The results are saved in the `save` directory by default.
You can use `-r` to ignore the checkpoint and re-run from scratch.
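The layout of `checkpoint.pkl` is not documented; a plausible minimal sketch with pickle, assuming the checkpoint simply records which threads have already been crawled (the `done` set is a hypothetical field, not the project's actual format):

```python
# Hedged sketch of pickle-based checkpoint recovery.
import os
import pickle

CHECKPOINT = os.path.join("save", "checkpoint.pkl")

def load_checkpoint() -> set:
    """Return the set of already-crawled thread ids, or an empty set."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return set()

def save_checkpoint(done: set) -> None:
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(done, f)
```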
### About the Crawling Result

There should be an `output.db` in your `save` directory once you execute with the `-p` option. The schema of the sqlite3 database is defined as follows:
```yaml
messages:
  - name (unicode)      # category name
  - id (int)            # floor id / index id within each thread
  - thread_id (text)    # the topic's thread id
  - message (text)      # the message content for that index
```
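In SQL terms, that corresponds to a table roughly like the one below. This is a sketch inferred from the listed columns; the project's actual `CREATE TABLE` statement may differ:

```python
# Hedged reconstruction of the table definition from the schema above.
import os
import sqlite3

os.makedirs("save", exist_ok=True)
conn = sqlite3.connect("save/output.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS messages (
        name      TEXT,     -- category name
        id        INTEGER,  -- floor/index id within a thread
        thread_id TEXT,     -- the topic's thread id
        message   TEXT      -- message content for that index
    )
    """
)
conn.commit()
```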
You can use the sqlite3 shell to access it:
```bash
$ sqlite3 output.db
SQLite version 3.16.0 2016-11-04 19:09:39
Enter ".help" for usage hints.
sqlite> select * from messages limit 2;
趣味一般|1|1112706898|トイレでするより、外で立ちションをするほうが、 解放感があって気持ちがいいですよね。
趣味一般|2|1112706898|に(´・ω・2)
```
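To consume the data from Python instead of the shell, for example when assembling the chatbot-rnn dialog format mentioned in the TODOs, a minimal sketch follows. The pairing rule (consecutive messages in a thread as prompt/reply) is an assumption about what the dialog format means here; check chatbot-rnn's exact input format before relying on it:

```python
# Hedged sketch: walk messages thread by thread and emit consecutive
# message pairs as (prompt, reply) dialog candidates.
import sqlite3

conn = sqlite3.connect("save/output.db")
rows = conn.execute(
    "SELECT thread_id, message FROM messages ORDER BY thread_id, id"
)

prev_thread, prev_msg = None, None
for thread_id, message in rows:
    if thread_id == prev_thread:
        print((prev_msg, message))  # one (prompt, reply) candidate
    prev_thread, prev_msg = thread_id, message
```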