{"id":26933099,"url":"https://github.com/xiaohan2012/q-crawler","last_synced_at":"2025-04-02T09:17:33.576Z","repository":{"id":14349566,"uuid":"17059153","full_name":"xiaohan2012/q-crawler","owner":"xiaohan2012","description":"Reinforcement based focused crawler","archived":false,"fork":false,"pushed_at":"2014-05-09T21:17:27.000Z","size":3404,"stargazers_count":5,"open_issues_count":3,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-04-14T18:06:59.690Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xiaohan2012.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-02-21T14:46:37.000Z","updated_at":"2024-04-14T18:06:59.691Z","dependencies_parsed_at":"2022-09-03T08:32:35.568Z","dependency_job_id":null,"html_url":"https://github.com/xiaohan2012/q-crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohan2012%2Fq-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohan2012%2Fq-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohan2012%2Fq-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohan2012%2Fq-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xiaohan2012","download_url":"https://codeload.github.com/xiaohan2012/q-crawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories
_count":246785481,"owners_count":20833498,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-02T09:17:32.994Z","updated_at":"2025-04-02T09:17:33.565Z","avatar_url":"https://github.com/xiaohan2012.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Q-crawler\n===========\n\n[![Build Status](https://travis-ci.org/xiaohan2012/q-crawler.png?branch=master)](https://travis-ci.org/xiaohan2012/q-crawler)\n\n#Preparation\n\n##Virtual environment\nRun\n```\ncd q-crawler\nvirtualenv venv\n```\nto set up the virtual environment.\n\nRun\n```\nsource venv/bin/activate\n```\nto activate the virtual environment.\n\n##Dependency resolving\n```\npip install -r requirements.txt\n```\nBe patient: installation may take several minutes.\n\nYou may encounter the error `/bin/sh: xslt-config: not found`. 
Please see this [post](http://stackoverflow.com/questions/5178416/pip-install-lxml-error) for a solution.\n\nIf you encounter `ffi.h not found`, try [this](http://stackoverflow.com/questions/12982486/glib-compile-error-ffi-h-but-libffi-is-installed/17518165#17518165).\n\n#Usage\n##Run the demo\n\nTo see how the RL-based crawler compares to the baseline (ordinary) crawler, run the following commands:\n\n```\ncd src/spider\n./ctrl.sh  # run the crawler; feel free to have a cup of coffee while it crawls :)\npython gen_html_data.py\n```\n\nFinally, open the `comparison.html` file in a web browser to see the performance comparison.\n\nCrawling takes roughly 15 to 20 minutes, depending on the speed of your Internet connection.\n\nTo speed up the process, reduce the number of URLs to be crawled (default: 10000). See [configuration](https://github.com/xiaohan2012/q-crawler/#configuration).\n\n\n##Training\n\n```\ncd src\npython classifier_util.py train\n```\n\nThe trained classifier is pickled and saved to `data/classifier.pickle`.\n\n##Crawling\n\n```\ncd src/spider\nscrapy crawl apprentice\nscrapy crawl baseline\n```\n\n##Performance monitoring\n\n```\ncd src/spider\npython gen_html_data.py\n```\n\nOpen `comparison.html` in a modern web browser (tested with Firefox 24.4.0).\n\nAn example performance plot is available [here](http://www.cs.helsinki.fi/u/hxiao/rl-project/comparison.html).\n\n##Training data preprocessing\n\nMerge the positive and negative training samples into two separate files, one per class. Each line represents one training sample: the sample's tokens followed by its class label (`pos` or `neg`).\n\nPut both files under the `data` directory. 
Name the positive sample file `pos` and the negative sample file `neg`.\n\nSee the example files for [negative samples](https://raw.githubusercontent.com/xiaohan2012/q-crawler/master/data/neg) and [positive samples](https://raw.githubusercontent.com/xiaohan2012/q-crawler/master/data/pos).\n\n##Configuration\n\n1. Maximum number of crawled URLs: change the value of `CLOSESPIDER_ITEMCOUNT` in [this](https://github.com/xiaohan2012/q-crawler/blob/master/src/spider/spider/settings.py) file.\n2. Starting URLs: change the value of `START_URLS` in [this](https://github.com/xiaohan2012/q-crawler/blob/master/src/spider/spider/settings.py) file.\n\n#Contact\nxiaohan2012 at gmail.com\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxiaohan2012%2Fq-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxiaohan2012%2Fq-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxiaohan2012%2Fq-crawler/lists"}