https://github.com/xiaohan2012/q-crawler

Reinforcement based focused crawler

Host: GitHub
URL: https://github.com/xiaohan2012/q-crawler
Owner: xiaohan2012
Created: 2014-02-21T14:46:37.000Z
Default Branch: master
Last Pushed: 2014-05-09T21:17:27.000Z
Last Synced: 2024-04-14T18:06:59.690Z
Language: Python
Homepage:
Size: 3.25 MB
Stars: 5
Watchers: 2
Forks: 1
Open Issues: 3
Metadata Files:
- Readme: README.md

README

Q-crawler
===========

[![Build Status](https://travis-ci.org/xiaohan2012/q-crawler.png?branch=master)](https://travis-ci.org/xiaohan2012/q-crawler)

# Preparation

## Virtual environment
Run
```
cd q-crawler
virtualenv venv
```
to set up the virtual environment.

Run
```
source venv/bin/activate
```
to ensure the virtual environment is activated.

## Dependency resolving
```
pip install -r requirements.txt
```
Be patient. It might take several minutes.

You may encounter the error `/bin/sh: xslt-config: not found`. See this [post](http://stackoverflow.com/questions/5178416/pip-install-lxml-error) for a solution.

If you encounter `ffi.h not found`, try [this](http://stackoverflow.com/questions/12982486/glib-compile-error-ffi-h-but-libffi-is-installed/17518165#17518165).

# Usage

## Run the demo

To see how the RL-based crawler compares to the baseline crawler (the ordinary one), run the following commands:

```
cd src/spider
./ctrl.sh  # run the crawler; feel free to have a cup of coffee while it crawls :)
python gen_html_data.py
```

Finally, open `comparison.html` in a web browser to see the performance comparison.

Crawling may take 15 to 20 minutes, depending on the speed of your Internet connection.

To speed up the process, lower the number of URLs to crawl (10000 by default). See [configuration](https://github.com/xiaohan2012/q-crawler/#configuration).

## Training

```
cd src
python classifier_util.py train
```

The trained classifier is pickled and saved to `data/classifier.pickle`.
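
Once training has finished, the pickled classifier can be loaded back with the standard `pickle` module. This is a minimal sketch rather than code from the repository: the helper name `load_classifier` is hypothetical, the default path assumes the current directory is the repository root, and the methods available on the loaded object depend on the classifier class used by `classifier_util.py`, which is not described here. Only unpickle files you trust, and prefer the same Python version that produced the file.

```python
import os
import pickle


def load_classifier(path="data/classifier.pickle"):
    """Return the object stored in the pickle file at `path` (hypothetical helper)."""
    with open(path, "rb") as f:  # pickle files must be opened in binary mode
        return pickle.load(f)


# Load the classifier only if training has already produced the file.
if os.path.exists("data/classifier.pickle"):
    classifier = load_classifier()
    print(type(classifier))
```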

## Crawling

```
cd src/spider
scrapy crawl apprentice
scrapy crawl baseline
```

## Performance monitoring

```
cd src/spider
python gen_html_data.py
```

Open `comparison.html` in a modern web browser (tested with Firefox 24.4.0).

An example performance plot is available [here](http://www.cs.helsinki.fi/u/hxiao/rl-project/comparison.html).

## Training data preprocessing

Merge the positive and negative training samples into two separate files, one per class. Each line represents one training sample: it consists of the sample's tokens and ends with the sample's class label (`pos` or `neg`).

Put both files under the `data` directory. Name the positive sample file `pos` and the negative sample file `neg`.

See the example files for [negative samples](https://raw.githubusercontent.com/xiaohan2012/q-crawler/master/data/neg) and [positive samples](https://raw.githubusercontent.com/xiaohan2012/q-crawler/master/data/pos).
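
The line format described above can be read with a few lines of Python. The following is a minimal sketch, not code from the repository: the helper names `parse_sample` and `load_samples` are hypothetical, and it assumes that tokens are separated by whitespace, which this README does not state explicitly.

```python
def parse_sample(line):
    """Split one line into (tokens, label); the label is the last field."""
    fields = line.split()  # assumption: tokens are whitespace-separated
    if not fields or fields[-1] not in ("pos", "neg"):
        raise ValueError("line must end with 'pos' or 'neg': %r" % line)
    return fields[:-1], fields[-1]


def load_samples(path):
    """Parse every non-blank line of a sample file such as data/pos."""
    with open(path) as f:
        return [parse_sample(line) for line in f if line.strip()]


# Made-up sample line, for illustration only.
tokens, label = parse_sample("machine learning course homepage pos")
print(tokens, label)
```

Rejecting lines that do not end in `pos` or `neg` catches malformed samples before they reach training.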

## Configuration

1. Maximum number of crawled URLs: change the value of `CLOSESPIDER_ITEMCOUNT` in [settings.py](https://github.com/xiaohan2012/q-crawler/blob/master/src/spider/spider/settings.py).
2. Starting URLs: change the value of `START_URLS` in [settings.py](https://github.com/xiaohan2012/q-crawler/blob/master/src/spider/spider/settings.py).
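
As an illustration of the two settings listed above, a shortened crawl might use values like the following. The number and the URL are placeholders chosen for this sketch. `CLOSESPIDER_ITEMCOUNT` is a standard Scrapy setting, while `START_URLS` appears to be specific to this project and is assumed here to be a list of URL strings.

```python
# Excerpt in the style of src/spider/spider/settings.py, with illustrative values.

# Stop the spider after this many items (10000 by default according to this README).
CLOSESPIDER_ITEMCOUNT = 1000

# Assumption: a list of seed URLs; example.com is a placeholder.
START_URLS = [
    "http://example.com/",
]
```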

# Contact
xiaohan2012 at gmail.com
