{"id":13464645,"url":"https://github.com/rivermont/spidy","last_synced_at":"2026-01-16T22:19:00.524Z","repository":{"id":21152969,"uuid":"91849845","full_name":"rivermont/spidy","owner":"rivermont","description":"The simple, easy to use command line web crawler.","archived":false,"fork":false,"pushed_at":"2024-08-08T14:25:58.000Z","size":85756,"stargazers_count":346,"open_issues_count":13,"forks_count":68,"subscribers_count":23,"default_branch":"master","last_synced_at":"2025-03-04T14:49:14.114Z","etag":null,"topics":["crawler","crawling","python","python3","web-crawler","web-spider"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rivermont.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-05-19T22:25:27.000Z","updated_at":"2025-02-23T10:32:43.000Z","dependencies_parsed_at":"2024-01-17T02:15:59.086Z","dependency_job_id":"c6cdcc2f-768d-4dd4-9bcc-816c4604e0bc","html_url":"https://github.com/rivermont/spidy","commit_stats":{"total_commits":541,"total_committers":20,"mean_commits":27.05,"dds":"0.11829944547134941","last_synced_commit":"15d4e8c58db0061d78fb8066b4de00a463017be1"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rivermont%2Fspidy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rivermont%2Fspidy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rivermont%2Fspidy/releases","manifests_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rivermont%2Fspidy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rivermont","download_url":"https://codeload.github.com/rivermont/spidy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245454074,"owners_count":20617971,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawling","python","python3","web-crawler","web-spider"],"created_at":"2024-07-31T14:00:47.918Z","updated_at":"2026-01-16T22:19:00.490Z","avatar_url":"https://github.com/rivermont.png","language":"Python","funding_links":[],"categories":["All","Python"],"sub_categories":[],"readme":"# spidy Web Crawler [![Mentioned in awesome-crawler](https://awesome.re/mentioned-badge.svg)](https://github.com/BruceDone/awesome-crawler)\n\nSpidy (/spˈɪdi/) is the simple, easy to use command line web crawler.\u003cbr\u003e\nGiven a list of web links, it uses Python [`requests`](http://docs.python-requests.org) to query the webpages, and [`lxml`](http://lxml.de/index.html) to extract all links from the page.\u003cbr\u003e\nPretty simple!\n\n[![spidy Logo](https://raw.githubusercontent.com/rivermont/spidy/master/media/spidy_logo.png)](https://github.com/rivermont/spidy#contributors)\n\n![Version: 1.6.5](https://img.shields.io/badge/version-1.6.5-brightgreen.svg)\n[![Release: 1.4.0](https://img.shields.io/github/release/rivermont/spidy.svg)](https://github.com/rivermont/spidy/releases)\n[![License: GPL 
v3](https://img.shields.io/badge/license-GPLv3.0-blue.svg)](http://www.gnu.org/licenses/gpl-3.0)\n[![Python 3.3+](https://img.shields.io/badge/python-3.3+-brightgreen.svg)](https://docs.python.org/3/)\n![All Platforms!](https://img.shields.io/badge/Windows,%20OS/X,%20Linux-%20%20-brightgreen.svg)\n![Open Source Love](https://badges.frapsoft.com/os/v1/open-source.png?v=103)\n\u003cbr\u003e\n![Lines of Code: 1553](https://img.shields.io/badge/lines%20of%20code-1553-brightgreen.svg)\n![Lines of Docs: 605](https://img.shields.io/badge/lines%20of%20docs-605-orange.svg)\n[![Last Commit](https://img.shields.io/github/last-commit/rivermont/spidy.svg)](https://github.com/rivermont/spidy/graphs/punch-card)\n[![Travis CI Status](https://img.shields.io/travis/com/rivermont/spidy)](https://travis-ci.com/github/rivermont/spidy)\n[![PyPI Wheel](https://img.shields.io/pypi/wheel/spidy-web-crawler.svg)](https://pypi.org/project/spidy-web-crawler/)\n[![PyPI Status](https://img.shields.io/pypi/status/spidy-web-crawler.svg)](https://pypi.org/project/spidy-web-crawler/)\n\u003cbr\u003e\n[![Contributors](https://img.shields.io/github/contributors/rivermont/spidy.svg)](https://github.com/rivermont/spidy/graphs/contributors)\n[![Forks](https://img.shields.io/github/forks/rivermont/spidy.svg?style=social\u0026label=Forks)](https://github.com/rivermont/spidy/network)\n[![Stars](https://img.shields.io/github/stars/rivermont/spidy.svg?style=social\u0026label=Stars)](https://github.com/rivermont/spidy/stargazers)\n\nCreated by [rivermont](https://github.com/rivermont) (/rɪvɜːrmɒnt/) and [FalconWarriorr](https://github.com/FalconWarriorr) (/fælcʌnraɪjɔːr/), and developed with help from [these awesome people](https://github.com/rivermont/spidy#contributors).\u003cbr\u003e\nLooking for technical documentation? Check out [`DOCS.md`](https://github.com/rivermont/spidy/blob/master/spidy/docs/DOCS.md)\u003cbr\u003e\nLooking to contribute to this project? 
Have a look at [`CONTRIBUTING.md`](https://github.com/rivermont/spidy/blob/master/spidy/docs/CONTRIBUTING.md), then check out the docs.\n\n***\n\n# 🎉 New Features!\n\n### Multithreading\nCrawl all the things! Run separate threads to work on multiple pages at the same time.\u003cbr\u003e\nSuch fast. Very wow.\n\n### PyPI\nInstall spidy with one line: `pip install spidy-web-crawler`!\n\n### Automatic Testing with Travis CI\n\n### Release v1.4.0 - #[31663d3](https://github.com/rivermont/spidy/commit/31663d34ceeba66ea9de9819b6da555492ed6a80)\n[spidy Web Crawler Release 1.4](https://github.com/rivermont/spidy/releases/tag/1.4.0)\n\n\n# Contents\n\n  - [spidy Web Crawler](https://github.com/rivermont/spidy#spidy-web-crawler)\n  - [New Features!](https://github.com/rivermont/spidy#-new-features)\n  - [Contents](https://github.com/rivermont/spidy#contents)\n  - [How it Works](https://github.com/rivermont/spidy#how-it-works)\n  - [Why It's Different](https://github.com/rivermont/spidy#why-its-different)\n  - [Features](https://github.com/rivermont/spidy#features)\n  - [Tutorial](https://github.com/rivermont/spidy#tutorial)\n    - [Using with Docker](https://github.com/rivermont/spidy#using-with-docker)\n    - [Installing from PyPI](https://github.com/rivermont/spidy#installing-from-pypi)\n    - [Installing from Source Code](https://github.com/rivermont/spidy#installing-from-source-code)\n      - [Python Installation](https://github.com/rivermont/spidy#python-installation)\n        - [Windows and Mac](https://github.com/rivermont/spidy#windows-and-mac)\n          - [Anaconda](https://github.com/rivermont/spidy#anaconda)\n          - [Python Base](https://github.com/rivermont/spidy#python-base)\n        - [Linux](https://github.com/rivermont/spidy#linux)\n      - [Crawler Installation](https://github.com/rivermont/spidy#crawler-installation)\n      - [Launching](https://github.com/rivermont/spidy#launching)\n      - [Running](https://github.com/rivermont/spidy#running)\n      
  - [Config](https://github.com/rivermont/spidy#config)\n        - [Start](https://github.com/rivermont/spidy#start)\n        - [Autosave](https://github.com/rivermont/spidy#autosave)\n        - [Force Quit](https://github.com/rivermont/spidy#force-quit)\n  - [How Can I Support This?](https://github.com/rivermont/spidy#how-can-i-support-this)\n  - [Contributors](https://github.com/rivermont/spidy#contributors)\n  - [License](https://github.com/rivermont/spidy#license)\n\n\n# How it Works\nSpidy has two working lists, `TODO` and `DONE`.\u003cbr\u003e\n`TODO` is the list of URLs it hasn't yet visited.\u003cbr\u003e\n`DONE` is the list of URLs it has already been to.\u003cbr\u003e\nThe crawler visits each page in `TODO`, scrapes the DOM of the page for links, and adds those back into `TODO`.\u003cbr\u003e\nIt can also save each page, because datahoarding 😜.\n\n\n# Why It's Different\nWhat sets spidy apart from other web crawling solutions written in Python?\n\nMost of the other options out there are not web crawlers themselves, but frameworks and libraries with which one can build and deploy a web spider, such as Scrapy and BeautifulSoup.\nScrapy is a web crawling framework, written in Python, created specifically for downloading, cleaning, and saving data from the web. BeautifulSoup is a parsing library that lets a programmer pull specific elements out of a webpage, but BeautifulSoup alone is not enough because you have to fetch the webpage in the first place.\n\nWith spidy, everything runs right out of the box.\nSpidy is a web crawler that is easy to use and runs from the command line. Give it the URL of a webpage and it starts crawling away! A very simple and effective way of fetching stuff off of the web. 
\n\n\n\n# Features\nWe built a lot of the functionality in spidy by watching the console scroll by and going, \"Hey, we should add that!\"\u003cbr\u003e\nHere are some features we figure are worth noting.\n\n  - Error Handling: We have tried to recognize all of the errors spidy runs into and create custom error messages and logging for each. There is a set cap so that after accumulating too many errors the crawler will stop itself.\n  - Cross-Platform compatibility: spidy will work on all three major operating systems, Windows, Mac OS/X, and Linux!\n  - Frequent Timestamp Logging: Spidy logs almost every action it takes to both the console and one of two log files.\n  - Browser Spoofing: Make requests using User Agents from 4 popular web browsers, use a custom spidy bot one, or create your own!\n  - Portability: Move spidy's folder and its contents somewhere else and it will run right where it left off. *Note*: This only works if you run it from source code.\n  - User-Friendly Logs: Both the console and log file messages are simple and easy to interpret, but packed with information.\n  - Webpage saving: Spidy downloads each page that it runs into, regardless of file type. 
The crawler uses the HTTP `Content-Type` header returned with most files to determine the file type.\n  - File Zipping: When autosaving, spidy can archive the contents of the `saved/` directory to a `.zip` file, and then clear `saved/`.\n\n\n# Tutorial\n\n## Using with Docker\nSpidy can be easily run in a Docker container.\u003cbr\u003e\n\n- First, build the [`Dockerfile`](dockerfile): `docker build -t spidy .`\n  - Verify that the Docker image has been created: `docker images`\n- Then, run it: `docker run --rm -it -v $PWD:/data spidy`\n  - `--rm` tells Docker to clean up after itself by removing the container once it exits.\n  - `-it` tells Docker to run the container interactively and allocate a pseudo-TTY.\n  - `-v $PWD:/data` tells Docker to mount the current working directory as the `/data` directory inside the container. This is needed if you want Spidy's files (e.g. `crawler_done.txt`, `crawler_words.txt`, `crawler_todo.txt`) written back to your host filesystem.\n\n### Spidy Docker Demo\n\n![Spidy Docker Demo](media/spidy_docker_demo.gif)\n\n## Installing from PyPI\nSpidy can be found on the Python Package Index as `spidy-web-crawler`.\u003cbr\u003e\nYou can install it from your package manager of choice and simply run the `spidy` command.\u003cbr\u003e\nThe working files will be found in your home directory.\n\n## Installing from Source Code\nAlternatively, you can download the source code and run it.\n\n### Python Installation\nThe way that you will run spidy depends on the way you have Python installed.\u003cbr\u003e\n\n#### Windows and Mac\n\nThere are many different versions of [Python](https://www.python.org/about/), and hundreds of different installations for each of them.\u003cbr\u003e\nSpidy is developed for Python v3.5.2, but should run without errors in other versions of Python 3.\n\n##### Anaconda\nWe recommend the [Anaconda distribution](https://www.continuum.io/downloads).\u003cbr\u003e\nIt comes pre-packaged with lots of goodies, including `lxml`, which is 
required for spidy to run and not included in the standard Python distribution.\n\n##### Python Base\nYou can also just install [default Python](https://www.python.org/downloads/), and install the external libraries separately.\u003cbr\u003e\nThis can be done with `pip`:\n\n    pip install -r requirements.txt\n\n#### Linux\nPython 3 should come preinstalled with most flavors of Linux, but if not, simply run:\n\n    sudo apt update\n    sudo apt install python3 python3-lxml python3-requests\n\nThen `cd` into the crawler's directory and run `python3 crawler.py`.\n\n### Crawler Installation\nIf you have git or GitHub Desktop installed, you can clone the repository [from here](https://github.com/rivermont/spidy.git). If not, download [the latest source code](https://github.com/rivermont/spidy/archive/master.zip) or grab the [latest release](https://github.com/rivermont/spidy/releases).\n\n### Launching\n\nUse `cd` to navigate to the directory that spidy is located in, then run:\n\n    python crawler.py\n\n![](https://raw.githubusercontent.com/rivermont/spidy/master/media/run.gif)\n\n### Running\nSpidy logs a lot of information to the command line throughout its life.\u003cbr\u003e\nOnce started, a bunch of `[INIT]` lines will print.\u003cbr\u003e\nThese announce where spidy is in its initialization process.\u003cbr\u003e\n\n#### Config\nOn startup, spidy asks for input on the parameters it will run with.\u003cbr\u003e\nHowever, you can also use one of the configuration files, or even create your own.\n\nTo use spidy with a configuration file, input the name of the file when the crawler asks.\n\nThe config files included with spidy are:\n\n  - *`blank.txt`*: Template for creating your own configurations.\n  - `default.cfg`: The default version.\n  - `heavy.cfg`: Run spidy with all of its features enabled.\n  - `infinite.cfg`: The default config, but it never stops itself.\n  - `light.cfg`: Disable most features; only crawls pages for links.\n  - `rivermont.cfg`: 
My personal favorite settings.\n  - `rivermont-infinite.cfg`: My favorite, never-ending configuration.\n\n#### Start\nSample start log.\n\n![](https://raw.githubusercontent.com/rivermont/spidy/master/media/start.png)\n\n#### Autosave\nSample log after hitting the autosave cap.\n\n![](https://raw.githubusercontent.com/rivermont/spidy/master/media/log.png)\n\n#### Force Quit\nSample log after performing a `^C` (CONTROL + C) to force quit the crawler.\n\n![](https://raw.githubusercontent.com/rivermont/spidy/master/media/keyboardinterrupt.png)\n\n\n# How Can I Support This?\nThe easiest thing you can do is Star spidy if you think it's cool, or Watch it if you would like to get updates.\u003cbr\u003e\nIf you have a suggestion, [create an Issue](https://github.com/rivermont/spidy/issues/new) or Fork the `master` branch and open a Pull Request.\n\n\n# Contributors\nSee [`CONTRIBUTING.md`](https://github.com/rivermont/spidy/blob/master/spidy/docs/CONTRIBUTING.md).\n\n* The logo was designed by [Cutwell](https://github.com/Cutwell)\n\n* [3onyc](https://github.com/3onyc) - PEP8 Compliance.\n* [DeKaN](https://github.com/DeKaN) - Getting PyPI packaging to work.\n* [esouthren](https://github.com/esouthren) - Unit testing.\n* [Hrily](https://github.com/Hrily) - Multithreading.\n* [j-setiawan](https://github.com/j-setiawan) - Paths that work on all OS's.\n* [michellemorales](https://github.com/michellemorales) - Confirmed OS/X support.\n* [petermbenjamin](https://github.com/petermbenjamin) - Docker support.\n* [quatroka](https://github.com/quatroka) - Fixed testing bugs.\n* [stevelle](https://github.com/stevelle) - Respect robots.txt.\n* [thatguywiththatname](https://github.com/thatguywiththatname) - README link corrections.\n\n# License\nWe used the [GNU General Public License](https://www.gnu.org/licenses/gpl-3.0.en.html) (see [`LICENSE`](https://github.com/rivermont/spidy/blob/master/LICENSE)) as it was the license that best suited our needs.\u003cbr\u003e\nHonestly, if you 
link to this repo and credit `rivermont` and `FalconWarriorr`, and you aren't selling spidy in any way, then we would love for you to distribute it.\u003cbr\u003e\nThanks!\n\n***\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frivermont%2Fspidy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frivermont%2Fspidy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frivermont%2Fspidy/lists"}