Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Dump messages from 2ch with some preprocessing for ML analysis
- Host: GitHub
- URL: https://github.com/zenixls2/2chpreprocess
- Owner: zenixls2
- License: MIT
- Created: 2017-07-10T12:25:45.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-07-18T06:12:41.000Z (over 7 years ago)
- Last Synced: 2024-10-15T09:51:07.178Z (3 months ago)
- Topics: 2ch, crawler, python
- Language: Python
- Size: 17.6 KB
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
README
## 2ch preprocessor
This is a 2ch crawler and preprocessor for producing a chatbot-rnn-readable format.
The crawler part is mostly complete, but the preprocessor part is still under development.
- TODO ITEMS:
  * Normalize kanji, hiragana, and katakana to romaji (the Latin alphabet) to reduce the vocabulary size (see the romanization sketch after this list)
  * Truncate useless ANSI art
  * Create a postprocessor to convert romaji back to kanji, hiragana, or katakana
  * Preprocess data into the dialog format defined in chatbot-rnn
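The normalization step is not implemented yet; below is a minimal sketch of what it could look like, assuming the third-party pykakasi library (not a dependency of this project) for kana/kanji-to-romaji conversion:

```python
# Hypothetical romanization helper; pykakasi is an assumption, not part
# of this project's requirements (pip install pykakasi).
import pykakasi

kks = pykakasi.kakasi()

def to_romaji(text: str) -> str:
    """Convert kanji/hiragana/katakana to Hepburn romaji."""
    # convert() splits the text into segments, each carrying a
    # 'hepburn' romanization of that segment.
    return " ".join(seg["hepburn"] for seg in kks.convert(text))
```

A postprocessor going the other way (romaji back to kana/kanji, the third TODO item) is much harder, since romanization is lossy.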
### Features

- Ignore some functional boards, and crawl through all the threads on 2ch
- Checkpoint recovery (the bug should be fixed now)
- Multiple workers for crawling (see the worker sketch after this list)
- Save results to a sqlite database
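The README does not document how the workers are implemented; as a rough illustration, thread fetches can be fanned out across workers with the standard library alone. `fetch_thread` and `thread_urls` below are hypothetical placeholders, not the project's actual code:

```python
# Hedged sketch of multi-worker crawling using only the standard library.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch_thread(url: str) -> bytes:
    """Download one thread page."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()

def crawl(thread_urls, workers=4):
    # One slot per -w/--worker; results arrive as pages finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(fetch_thread, thread_urls)
```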
### Installation

```bash
# Ubuntu
sudo apt-get install sqlite3
# Mac
brew install sqlite3

git clone https://github.com/zenixls2/2chpreprocess
cd 2chpreprocess
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```

### Execution
usage: `main.py [-h] [-t] [-r] [-p] [-w WORKER]`

2ch data preprocessor & crawler

optional arguments:
* `-h, --help`: show this help message and exit
* `-t, --topic`: get topic links
* `-r, --rerun`: ignore cached results and rerun
* `-p, --process`: process the generated topic links
* `-w WORKER, --worker WORKER`: number of workers
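These options map onto a standard argparse setup. The sketch below is reconstructed from the help output above, not taken from the project's source, so the actual `main.py` may differ in details such as the default worker count:

```python
# Hedged reconstruction of the CLI from its --help output.
import argparse

parser = argparse.ArgumentParser(description="2ch data preprocessor & crawler")
parser.add_argument("-t", "--topic", action="store_true", help="get topic links")
parser.add_argument("-r", "--rerun", action="store_true",
                    help="ignore cached results and rerun")
parser.add_argument("-p", "--process", action="store_true",
                    help="process the generated topic links")
parser.add_argument("-w", "--worker", type=int, default=1,  # default is an assumption
                    help="number of workers")
args = parser.parse_args()
```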
First, crawl out all topics:

```bash
source venv/bin/activate
python main.py -t
```

Then crawl through all threads:
```bash
source venv/bin/activate
python main.py -p -w ${WORKER}
```
Notice that the user can stop the crawler at any time.
The process can be resumed from the `checkpoint.pkl` file stored in `save`.
The results are saved in the `save` directory by default.
You can use `-r` to ignore the checkpoint and re-run from scratch.
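The layout of `checkpoint.pkl` is not documented; a plausible minimal sketch with pickle, assuming the checkpoint simply records which threads have already been crawled (the `done` set is a hypothetical field, not the project's actual format):

```python
# Hedged sketch of pickle-based checkpoint recovery.
import os
import pickle

CHECKPOINT = os.path.join("save", "checkpoint.pkl")

def load_checkpoint() -> set:
    """Return the set of already-crawled thread ids, or an empty set."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return set()

def save_checkpoint(done: set) -> None:
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(done, f)
```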
### About the Crawling Result

There should be an `output.db` in your `save` directory once you execute with the `-p` option. The schema of the sqlite3 database is defined as follows:
```yaml
messages:
  - name (unicode)      # category name
  - id (int)            # floor id / index id within each thread
  - thread_id (text)    # the topic's thread id
  - message (text)      # the message content for that index
```
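In SQL terms, that corresponds to a table roughly like the one below. This is a sketch inferred from the listed columns; the project's actual `CREATE TABLE` statement may differ:

```python
# Hedged reconstruction of the table definition from the schema above.
import os
import sqlite3

os.makedirs("save", exist_ok=True)
conn = sqlite3.connect("save/output.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS messages (
        name      TEXT,     -- category name
        id        INTEGER,  -- floor/index id within a thread
        thread_id TEXT,     -- the topic's thread id
        message   TEXT      -- message content for that index
    )
    """
)
conn.commit()
```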
You can use the sqlite3 shell to access it:
```bash
$ sqlite3 output.db
SQLite version 3.16.0 2016-11-04 19:09:39
Enter ".help" for usage hints.
sqlite> select * from messages limit 2;
趣味一般|1|1112706898|トイレでするより、外で立ちションをするほうが、 解放感があって気持ちがいいですよね。
趣味一般|2|1112706898|に(´・ω・2)
```
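To consume the data from Python instead of the shell, for example when assembling the chatbot-rnn dialog format mentioned in the TODOs, a minimal sketch follows. The pairing rule (consecutive messages in a thread as prompt/reply) is an assumption about what the dialog format means here; check chatbot-rnn's exact input format before relying on it:

```python
# Hedged sketch: walk messages thread by thread and emit consecutive
# message pairs as (prompt, reply) dialog candidates.
import sqlite3

conn = sqlite3.connect("save/output.db")
rows = conn.execute(
    "SELECT thread_id, message FROM messages ORDER BY thread_id, id"
)

prev_thread, prev_msg = None, None
for thread_id, message in rows:
    if thread_id == prev_thread:
        print((prev_msg, message))  # one (prompt, reply) candidate
    prev_thread, prev_msg = thread_id, message
```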