{"id":13813514,"url":"https://github.com/scrapy/scrapy-bench","last_synced_at":"2025-04-14T23:34:44.611Z","repository":{"id":54321187,"uuid":"92761583","full_name":"scrapy/scrapy-bench","owner":"scrapy","description":" A CLI for benchmarking Scrapy.","archived":false,"fork":false,"pushed_at":"2021-02-24T08:42:29.000Z","size":9172,"stargazers_count":31,"open_issues_count":8,"forks_count":15,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-28T11:39:39.155Z","etag":null,"topics":["benchmark-suite","command-line-tool","python","scrapy","scrapy-bench","web-crawler"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scrapy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-05-29T17:34:17.000Z","updated_at":"2024-11-22T07:39:49.000Z","dependencies_parsed_at":"2022-08-13T12:00:35.942Z","dependency_job_id":null,"html_url":"https://github.com/scrapy/scrapy-bench","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy%2Fscrapy-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy%2Fscrapy-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy%2Fscrapy-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy%2Fscrapy-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scrapy","download_url":"https://codeload.github.com/scrapy/scrapy-bench/tar.gz/refs/heads/master","host":{"name":"GitHub"
,"url":"https://github.com","kind":"github","repositories_count":248980300,"owners_count":21193131,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark-suite","command-line-tool","python","scrapy","scrapy-bench","web-crawler"],"created_at":"2024-08-04T04:01:19.991Z","updated_at":"2025-04-14T23:34:44.595Z","avatar_url":"https://github.com/scrapy.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Benchmarking CLI for Scrapy\n(The project is still in development.)\n\n\u003eA command-line interface for benchmarking Scrapy that reflects real-world usage.\n\n## Why?\n\n* Currently, the existing `scrapy bench` command just spawns a spider that aggressively crawls randomly generated links at high speed.\n* The speed thus obtained, which may be useful for comparisons, does not actually reflect a real-world scenario.\n* The actual speed varies with the Python version and the Scrapy version.\n\n### Current Features\n* Spawns a CPU-intensive spider which follows a fixed number of links in a static snapshot of the site [Books to Scrape](http://books.toscrape.com/index.html).\n* Follows a real-world scenario where various details about the books are extracted and stored in a `.csv` file.\n* A broad crawl benchmark that uses 1000 copies of the site [Books to Scrape](http://books.toscrape.com/index.html) which are dynamically generated using `twisted`. 
The server file is available [here](https://github.com/scrapy/scrapy-bench/blob/master/server.py).\n* A micro benchmark that tests the LinkExtractor() function by extracting links from a collection of HTML pages.\n* A micro benchmark that tests extraction using CSS from a collection of HTML pages.\n* A micro benchmark that tests extraction using XPath from a collection of HTML pages.\n* Profiling of the benchmarkers with **vmprof**, with the results uploaded to the vmprof website.\n\n### Options\n* `--n-runs` option for running the spider more than once to improve precision.\n* `--only_result` option for viewing the results only.\n* `--upload_result` option to upload the results to a local codespeed instance for better comparison.\n\n### Spider settings\n* `SCRAPY_BENCH_RANDOM_PAYLOAD_SIZE`: Adds a random payload with the given size (in bytes).\n\n## Setup\n\n### Setup server for Ubuntu\n\n* First, download the static snapshot of the website [Books to Scrape](http://books.toscrape.com/index.html). This can be done using `wget`:\n\n        wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \\\n            http://books.toscrape.com/index.html\n\n* Then link the whole downloaded directory into the folder `/var/www/html`:\n\n        sudo ln -s `pwd`/books.toscrape.com/ /var/www/html/\n\n* `nginx` is required to serve the website, so it must be installed and configured. 
If it is, you should be able to see the site [here](http://localhost/books.toscrape.com/index.html).\n* If not, install it with the following steps:\n\n        sudo apt-get update\n        sudo apt-get install nginx\n\n* For the broad crawl, use the `server.py` file to serve sites from the local copy of [Books to Scrape](http://books.toscrape.com/index.html), which should already be in `/var/www/html`.\n\n### Setup server using docker\n\n* Build the server image using docker\n\n        docker build -t scrapy-bench-server -f docker/Dockerfile .\n\n* Run the docker container\n\n        docker run --rm -ti --network=host scrapy-bench-server\n\n* Now you have [nginx](http://localhost:8000/index.html) and [server.py](http://localhost:8880/index.html) running\n\n### Client setup\n\n* Add the following entries to the `/etc/hosts` file:\n\n\t  127.0.0.1    domain1\n\t  127.0.0.1    domain2\n\t  127.0.0.1    domain3\n\t  127.0.0.1    domain4\n\t  127.0.0.1    domain5\n\t  127.0.0.1    domain6\n\t  127.0.0.1    domain7\n\t  127.0.0.1    domain8\n\t  ....................\n\t  127.0.0.1    domain1000\n\n* This points the sites `http://domain1:8880/index.html` to the original site served at `http://localhost:8880/index.html`.\n\n\n`sites.tar.gz` contains 130 HTML files, which were downloaded using `download.py` from the top sites in the `Alexa top sites` list.\n\n`bookfiles.tar.gz` contains 200 HTML files, which were downloaded using `download.py` from the website [Books to Scrape](http://books.toscrape.com/index.html).\n\nThe spider `download.py` dumps the response body as unicode to the files. The list of top sites was taken from [here](http://s3.amazonaws.com/alexa-static/top-1m.csv.zip).\n\n* Do the following to complete the installation:\n\n      git clone https://github.com/scrapy/scrapy-bench.git\n      cd scrapy-bench/\n      virtualenv env\n      . env/bin/activate\n      pip install --editable .\n\n## Usage\n\n\tUsage: scrapy-bench [OPTIONS] COMMAND1 [ARGS]... 
[COMMAND2 [ARGS]...]...\n\n\t  A benchmark suite for Scrapy.\n\n\tOptions:\n\t  --n-runs INTEGER  Take multiple readings for the benchmark.\n\t  --only_result     Display the results only.\n\t  --upload_result   Upload the results to local codespeed\n\t  --book_url TEXT   Use with bookworm command. The url to book.toscrape.com on\n\t                    your local machine\n\n\t  --vmprof          Profling benchmarkers with Vmprof and upload the result to\n\t                    the web\n\n\t  -s, --set TEXT    Settings to be passed to the Scrapy command. Use with the\n\t                    bookworm/broadworm commands.\n\n\t  --help            Show this message and exit.\n\n\tCommands:\n\t  bookworm         Spider to scrape locally hosted site\n\t  broadworm        Broad crawl spider to scrape locally hosted sites\n\t  cssbench         Micro-benchmark for extraction using css\n\t  csv              Visit URLs from a CSV file\n\t  itemloader       Item loader benchmarker\n\t  linkextractor    Micro-benchmark for LinkExtractor()\n\t  urlparseprofile  Urlparse benchmarker\n\t  xpathbench       Micro-benchmark for extraction using xpath\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapy%2Fscrapy-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrapy%2Fscrapy-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapy%2Fscrapy-bench/lists"}