{"id":15373757,"url":"https://github.com/debugtalk/webcrawler","last_synced_at":"2025-04-15T11:32:57.643Z","repository":{"id":57461361,"uuid":"86071591","full_name":"debugtalk/WebCrawler","owner":"debugtalk","description":"A web crawler based on requests-html, mainly targets for url validation test.","archived":false,"fork":false,"pushed_at":"2020-03-23T08:00:39.000Z","size":83,"stargazers_count":32,"open_issues_count":1,"forks_count":12,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-28T20:51:25.944Z","etag":null,"topics":["crawler","requests-html","web-crawler","weblink"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/debugtalk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-03-24T13:34:26.000Z","updated_at":"2023-09-08T17:22:42.000Z","dependencies_parsed_at":"2022-09-19T08:51:21.941Z","dependency_job_id":null,"html_url":"https://github.com/debugtalk/WebCrawler","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debugtalk%2FWebCrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debugtalk%2FWebCrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debugtalk%2FWebCrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/debugtalk%2FWebCrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/debugtalk","download_url":"https://codeload.github.com/debugtalk/WebCrawler/tar.gz/refs/heads/maste
r","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248695439,"owners_count":21146954,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","requests-html","web-crawler","weblink"],"created_at":"2024-10-01T13:56:15.204Z","updated_at":"2025-04-15T11:32:57.373Z","avatar_url":"https://github.com/debugtalk.png","language":"Python","readme":"# WebCrawler\n\nA simple web crawler, mainly targets for link validation test.\n\n## Features\n\n- running in BFS or DFS mode\n- specify concurrent running workers in BFS mode\n- crawl seeds can be set to more than one urls\n- support crawl with cookies\n- configure hyper links regex, including match type and ignore type\n- group visited urls by HTTP status code\n- flexible configuration in YAML\n- send test result by mail, through SMTP protocol or mailgun service\n- cancel jobs\n\n## Installation/Upgrade\n\n```bash\n$ pip install -U git+https://github.com/debugtalk/WebCrawler.git#egg=requests-crawler --process-dependency-links\n```\n\nTo ensure the installation or upgrade is successful, you can execute command `webcrawler -V` to see if you can get the correct version number.\n\n```bash\n$ webcrawler -V\njenkins-mail-py version: 0.2.4\nWebCrawler version: 0.3.0\n```\n\n## Usage\n\n```text\n$ webcrawler -h\nusage: webcrawler [-h] [-V] [--log-level LOG_LEVEL]\n                  [--config-file CONFIG_FILE] [--seeds SEEDS]\n                  [--include-hosts INCLUDE_HOSTS] [--cookies COOKIES]\n                  [--crawl-mode CRAWL_MODE] [--max-depth MAX_DEPTH]\n                  [--concurrency 
CONCURRENCY] [--save-results SAVE_RESULTS]\n                  [--grey-user-agent GREY_USER_AGENT]\n                  [--grey-traceid GREY_TRACEID]\n                  [--grey-view-grey GREY_VIEW_GREY]\n                  [--mailgun-api-id MAILGUN_API_ID]\n                  [--mailgun-api-key MAILGUN_API_KEY]\n                  [--mail-sender MAIL_SENDER]\n                  [--mail-recepients [MAIL_RECEPIENTS [MAIL_RECEPIENTS ...]]]\n                  [--mail-subject MAIL_SUBJECT] [--mail-content MAIL_CONTENT]\n                  [--jenkins-job-name JENKINS_JOB_NAME]\n                  [--jenkins-job-url JENKINS_JOB_URL]\n                  [--jenkins-build-number JENKINS_BUILD_NUMBER]\n\nA web crawler for testing website links validation.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -V, --version         show version\n  --log-level LOG_LEVEL\n                        Specify logging level, default is INFO.\n  --config-file CONFIG_FILE\n                        Specify config file path.\n  --seeds SEEDS         Specify crawl seed url(s), several urls can be\n                        specified with pipe; if auth needed, seeds can be\n                        specified like user1:pwd1@url1|user2:pwd2@url2\n  --include-hosts INCLUDE_HOSTS\n                        Specify extra hosts to be crawled.\n  --cookies COOKIES     Specify cookies, several cookies can be joined by '|'.\n                        e.g. 
'lang:en,country:us|lang:zh,country:cn'\n  --crawl-mode CRAWL_MODE\n                        Specify crawl mode, BFS or DFS.\n  --max-depth MAX_DEPTH\n                        Specify max crawl depth.\n  --concurrency CONCURRENCY\n                        Specify concurrent workers number.\n  --save-results SAVE_RESULTS\n                        Specify if save results, default is NO.\n  --grey-user-agent GREY_USER_AGENT\n                        Specify grey environment header User-Agent.\n  --grey-traceid GREY_TRACEID\n                        Specify grey environment cookie traceid.\n  --grey-view-grey GREY_VIEW_GREY\n                        Specify grey environment cookie view_gray.\n  --mailgun-api-id MAILGUN_API_ID\n                        Specify mailgun api id.\n  --mailgun-api-key MAILGUN_API_KEY\n                        Specify mailgun api key.\n  --mail-sender MAIL_SENDER\n                        Specify email sender.\n  --mail-recepients [MAIL_RECEPIENTS [MAIL_RECEPIENTS ...]]\n                        Specify email recepients.\n  --mail-subject MAIL_SUBJECT\n                        Specify email subject.\n  --mail-content MAIL_CONTENT\n                        Specify email content.\n  --jenkins-job-name JENKINS_JOB_NAME\n                        Specify jenkins job name.\n  --jenkins-job-url JENKINS_JOB_URL\n                        Specify jenkins job url.\n  --jenkins-build-number JENKINS_BUILD_NUMBER\n                        Specify jenkins build number.\n```\n\n## Examples\n\nSpecify a config file.\n\n```bash\n$ webcrawler --seeds http://debugtalk.com --crawl-mode bfs --max-depth 5 --config-file path/to/config.yml\n```\n\nCrawl in BFS mode with 20 concurrent workers, and set maximum depth to 5.\n\n```bash\n$ webcrawler --seeds http://debugtalk.com --crawl-mode bfs --max-depth 5 --concurrency 20\n```\n\nCrawl in DFS mode, and set maximum depth to 10.\n\n```bash\n$ webcrawler --seeds http://debugtalk.com --crawl-mode dfs --max-depth 10\n```\n\nCrawl several 
websites in BFS mode with 20 concurrent workers, and set maximum depth to 10.\n\n```bash\n$ webcrawler --seeds http://debugtalk.com,http://blog.debugtalk.com --crawl-mode bfs --max-depth 10 --concurrency 20\n```\n\nCrawl with different cookies.\n\n```bash\n$ webcrawler --seeds http://debugtalk.com --crawl-mode BFS --max-depth 10 --concurrency 50 --cookies 'lang:en,country:us|lang:zh,country:cn'\n```\n\n## Supported Python Versions\n\nWebCrawler supports Python 2.7, 3.3, 3.4, 3.5, and 3.6.\n\n## License\n\nOpen source licensed under the MIT license (see LICENSE file for details).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdebugtalk%2Fwebcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdebugtalk%2Fwebcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdebugtalk%2Fwebcrawler/lists"}