{"id":21962728,"url":"https://github.com/icio/crul","last_synced_at":"2025-07-22T13:32:16.338Z","repository":{"id":143094701,"uuid":"51214421","full_name":"icio/crul","owner":"icio","description":"Python website-scraper","archived":false,"fork":false,"pushed_at":"2016-02-15T13:49:03.000Z","size":17,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2023-03-16T12:21:55.094Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/icio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-02-06T18:08:17.000Z","updated_at":"2023-05-02T09:02:17.722Z","dependencies_parsed_at":"2023-04-27T21:46:57.881Z","dependency_job_id":null,"html_url":"https://github.com/icio/crul","commit_stats":null,"previous_names":[],"tags_count":null,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/icio%2Fcrul","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/icio%2Fcrul/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/icio%2Fcrul/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/icio%2Fcrul/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/icio","download_url":"https://codeload.github.com/icio/crul/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227105396,"owners_count":17731746,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-29T10:54:22.686Z","updated_at":"2024-11-29T10:54:23.362Z","avatar_url":"https://github.com/icio.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Crul :see_no_evil: :hear_no_evil: :speak_no_evil: \n\nA silly little web crawler, not dissimilar to [`gergle`](https://github.com/icio/gergle).\n\n## Installation\n\n```bash\n# Prepare a virtual environment (optional)\nvirtualenv env\n. env/bin/activate\n\n# Install crul\npip install git+git://github.com/icio/crul\\#egg=crul\n```\n\n## Usage\n\n```text\n$ crul\nCrul website scraper.\n\nExamples:\n\n    A human-readable summary of the (rather outdated) pages on my website:\n\n        $ crul http://www.paul-scott.com/\n\n    Record the results of scraping into a file:\n\n        $ crul http://www.paul-scott.com/ --json \u003e me.scrape\n\n    Render a sitemap of what we just scraped:\n\n        $ crul --replay me.scrape --sitemap \u003e sitemap.xml\n\n    Scrape only 3 pages deep with a single worker, leaving 2 seconds between\n    each subsequent request to the site, w/o anything under /developer/flash\n    or /forum/:\n\n        $ crul -d 3 -w 1 -t 2 https://www.kirupa.com/ \\\n            -i developer/flash -i forum/\n\n    Ignore robots.txt and grab everything we can, as fast as we can, with 5\n    workers, from \u003curl\u003e:\n\n        $ crul --yolo -w 5 \u003curl\u003e\n\nUsage:\n    crul (\u003curl\u003e [options] | --replay=\u003cfile\u003e)\n            [--disallow=\u003cpath\u003e]...\n            [--dot | --sitemap | --text | --json]\n            [-v|-q] [--log-file=\u003clog-file\u003e]\n    crul [--help | --version]\n\nOptions:\n       --dot              Output in dot format.\n       --sitemap          Output in XML sitemap format.\n       --json             Output in JSON format. [default]\n       --text             Output in human-readable text format.\n    -A --user-agent       The user-agent sent from the client.\n                          [default: Crul/1.0 (+https://github.com/icio/crul)]\n    -d --depth=\u003cn\u003e        Traverse n pages deep from the starting point.\n                          [default: 100]\n    -h --help             Print this help.\n    -i --disallow=\u003cpath\u003e  Ignore/disallow file paths from being scraped.\n    -l --log-file=\u003cfile\u003e  Log to the given file.\n    -q --quiet            Quiet logging.\n    -r --replay=\u003cfile\u003e    Load responses from a JSON file, instead of scraping.\n    -t --delay=\u003cn\u003e        Wait n seconds between requests to the site.\n    -v --verbose          Verbose logging.\n       --version          Print the version number.\n    -w --workers=\u003cn\u003e      Use n worker threads to make requests in parallel.\n                          [default: 4]\n       --yolo             Don't bother checking robots.txt.\n```\n\n## Implementation\n\n`crul` is implemented as a threaded set of workers processing a queue of URLs (`crul.scrape.site_crawl`). When new pages have been collected the linked pages (`crul.parse.PageParser`) are appended to the queue (`crul.traverse.PageTraverser`) for the workers to continue processing.\n\n```text\n                  v site_crawl(init_url)\n                  |\n    +----------------------------------------------------+      +--- worker_sentinel [thread x 1] ------------+\n    |             |                                      | \u003e-------\u003e Wait for all tasks to be completed.      |\n    |  +---\u003c  \u003c---+   pending (Task queue)        \u003c---+  |      |                                             |\n    |  |                                              |  | \u003c-------* Append kill signals to all queues.       |\n    +----------------------------------------------------+      +---------------------------------------------+\n       |                                              |            |\n    +----+ worker [thread x num_workers] +---------------+         |\n    |  |                                              |  |         |\n    |  |   + worker_request +---------------------+   |  |         |\n    |  |   |                                      |   |  |         |\n    |  +-\u003e | page = page_parser(session.get(url)) |   |  |         |\n    |      | page_traverser.follow(page)          | \u003e-+  |         |\n    |  +-\u003c | return page                          |      |         |\n    |  |   |                                      |      |         |\n    |  |   +--------------------------------------+      |         |\n    |  |                                                 |         |\n    +----------------------------------------------------+         |\n       |                                                           |\n    +----------------------------------------------------+         |\n    |  |                                                 |         |\n    |  +---\u003e  \u003e---+   completed (Page queue)             | \u003c-------+\n    |             |                                      |\n    +----------------------------------------------------+\n                  |\n                  v iter([Page, ...])\n```\n\n## Limitations\n\n* Query-strings being used to uniquely identify a page means that links to pages with junk query strings risk causing a lot of duplication.\n* None of the output formats conflate pages based on their canonical-url.\n* Python's GIL.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ficio%2Fcrul","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ficio%2Fcrul","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ficio%2Fcrul/lists"}