{"id":13845834,"url":"https://github.com/bartdag/pylinkvalidator","last_synced_at":"2025-04-07T19:13:43.041Z","repository":{"id":18108649,"uuid":"21178373","full_name":"bartdag/pylinkvalidator","owner":"bartdag","description":"pylinkvalidator is a standalone and pure python link validator and crawler that traverses a web site and reports errors (e.g., 500 and 404 errors) encountered.","archived":false,"fork":false,"pushed_at":"2019-05-17T11:00:35.000Z","size":126,"stargazers_count":143,"open_issues_count":21,"forks_count":37,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-03-31T18:22:08.215Z","etag":null,"topics":["crawler","link-checker","networking","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bartdag.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-06-24T19:43:32.000Z","updated_at":"2025-01-19T11:27:59.000Z","dependencies_parsed_at":"2022-09-22T18:44:27.526Z","dependency_job_id":null,"html_url":"https://github.com/bartdag/pylinkvalidator","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bartdag%2Fpylinkvalidator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bartdag%2Fpylinkvalidator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bartdag%2Fpylinkvalidator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bartdag%2Fpylinkvalidator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/bartdag","download_url":"https://codeload.github.com/bartdag/pylinkvalidator/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247713258,"owners_count":20983683,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","link-checker","networking","python"],"created_at":"2024-08-04T17:03:37.769Z","updated_at":"2025-04-07T19:13:42.510Z","avatar_url":"https://github.com/bartdag.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"pylinkvalidator\n===============\n\n:Version: 0.3\n\npylinkvalidator is a standalone and pure python link validator and crawler that\ntraverses a web site and reports errors (e.g., 500 and 404 errors) encountered.\nThe crawler can also download resources such as images, scripts and\nstylesheets.\n\npylinkvalidator's performance can be improved by installing additional libraries\nthat require a C compiler, but these libraries are optional.\n\nWe created pylinkvalidator so that it could be executed in environments without\naccess to a compiler (e.g., Microsoft Windows, some posix production\nenvironments) or with an old version of python (e.g., Centos).\n\npylinkvalidator is highly modular and has many configuration options, but the\nonly required parameter is the starting url: pylinkvalidate.py\nhttp://www.example.com/\n\npylinkvalidator can also be used programmatically by calling one of the functions\nin ``pylinkvalidator.api``\n\n.. 
image:: https://api.travis-ci.org/bartdag/pylinkvalidator.png\n\n\nQuick Start\n-----------\n\nInstall pylinkvalidator with pip or easy_install:\n\n::\n\n  pip install pylinkvalidator\n\n\nCrawl all pages from a site and show progress:\n\n::\n\n  pylinkvalidate.py -P http://www.example.com/\n\n\nRequirements\n------------\n\npylinkvalidator does not require external libraries if executed with python 2.x.\nIt requires beautifulsoup4 if executed with python 3.x. It has been tested on\npython 2.6, python 2.7, and python 3.6.\n\nFor production use, it is strongly recommended to use lxml or html5lib because\nthe default HTML parser provided by python is not very lenient.\n\n\nOptional Requirements\n---------------------\n\nThese libraries can be installed to enable certain modes in pylinkvalidator:\n\nlxml\n  beautifulsoup can use lxml to speed up the parsing of HTML pages. Because\n  lxml requires C libraries, this is only an optional requirement.\n\nhtml5lib\n  beautifulsoup can use html5lib to process incorrect or strange markup. It is\n  slower than lxml, but believed to be more lenient.\n\ngevent\n  this non-blocking io library enables pylinkvalidator to use green threads\n  instead of processes or threads. gevent could potentially speed up the\n  crawling speed on web sites with many small pages.\n\ncchardet\n  this library speeds up the detection of document encoding.\n\n\nUsage\n-----\n\nThis is a list of all available options. 
See the end of the README file for\nusage examples.\n\n::\n\n  Usage: pylinkvalidate.py [options] URL ...\n\n  Options:\n    --version             Show program's version number and exit\n    -h, --help            Show this help message and exit\n    -V VERBOSE, --verbose=VERBOSE\n                          Display debugging info\n                            None:  --verbose=0 (default)\n                            Quiet: --verbose=1\n                            Info:  --verbose=2\n\n    Crawler Options:\n      These options modify the way the crawler traverses the site.\n\n      -O, --test-outside  Fetch resources from other domains without crawling\n                          them\n      -H ACCEPTED_HOSTS, --accepted-hosts=ACCEPTED_HOSTS\n                          Comma-separated list of additional hosts to crawl\n                          (e.g., example.com,subdomain.another.com)\n      -i IGNORED_PREFIXES, --ignore=IGNORED_PREFIXES\n                          Comma-separated list of host/path prefixes to ignore\n                          (e.g., www.example.com/ignore_this_and_after/)\n      -b, --ignore-bad-tel-urls\n                          ignore badly formed tel URLs missing the leading +\n                          sign, e.g., tel:1234567890 - only necessary for Python\n                          \u003e 2.6\n      -u USERNAME, --username=USERNAME\n                          Username to use with basic HTTP authentication\n      -p PASSWORD, --password=PASSWORD\n                          Password to use with basic HTTP authentication\n      -M, --multi         each argument is considered to be a different site\n      -D HEADER, --header=HEADER\n                          custom header of the form Header: Value (repeat for\n                          multiple headers)\n      --url-file-path=URL_FILE_PATH\n                          get starting URLs from a line-separated file\n      -t TYPES, --types=TYPES\n                          Comma-separated values of tags to 
look for when\n                          crawling a site. Default (and supported types):\n                          a,img,link,script\n      -T TIMEOUT, --timeout=TIMEOUT\n                          Seconds to wait before considering that a page timed\n                          out (default = 10)\n      -C, --strict        Does not strip href and src attributes from\n                          whitespaces\n      -P, --progress      Prints crawler progress in the console\n      -N, --run-once      Only crawl the first page (eq. to depth=0)\n      -d DEPTH, --depth=DEPTH\n                          Maximum crawl depth (default = 1)\n      -e, --prefer-server-encoding\n                          Prefer server encoding if specified. Else detect\n                          encoding\n      --check-presence=CONTENT_PRESENCE\n                          Check presence of raw or HTML content on all pages.\n                          e.g., \u003ctag attr1=\"val\"\u003eregex:content\u003c/tag\u003e. Content\n                          can be either regex:pattern or plain content\n      --check-absence=CONTENT_ABSENCE\n                          Check absence of raw or HTML content on all pages.\n                          e.g., \u003ctag attr1=\"val\"\u003eregex:content\u003c/tag\u003e. Content\n                          can be either regex:pattern or plain content\n      --check-presence-once=CONTENT_PRESENCE_ONCE\n                          Check presence of raw or HTML content for one page:\n                          path,content, e.g.,: /path,\u003ctag\n                          attr1=\"val\"\u003eregex:content\u003c/tag\u003e. Content can be either\n                          regex:pattern or plain content. 
Path can be either\n                          relative or absolute with domain.\n      --check-absence-once=CONTENT_ABSENCE_ONCE\n                          Check absence of raw or HTML content for one page:\n                          path,content, e.g.,path,\u003ctag\n                          attr1=\"val\"\u003eregex:content\u003c/tag\u003e. Content can be either\n                          regex:pattern or plain content. Path can be either\n                          relative or absolute with domain.\n      -S, --show-source   Show source of links (html) in the report.\n      --allow-insecure-content\n                          Allow insecure content for HTTPS sites with\n                          certificate errors\n\n    Performance Options:\n      These options can impact the performance of the crawler.\n\n      -w WORKERS, --workers=WORKERS\n                          Number of workers to spawn (default = 1)\n      -m MODE, --mode=MODE\n                          Types of workers: thread (default), process, or green\n      -R PARSER, --parser=PARSER\n                          Types of HTML parse: html.parser (default) or lxml\n\n    Output Options:\n      These options change the output of the crawler.\n\n      -f FORMAT, --format=FORMAT\n                          Format of the report: plain (default)\n      -o OUTPUT, --output=OUTPUT\n                          Path of the file where the report will be printed.\n      -W WHEN, --when=WHEN\n                          When to print the report. 
error (only if a\n                          crawling error occurs) or always (default)\n      -E REPORT_TYPE, --report-type=REPORT_TYPE\n                          Type of report to print: errors (default, summary and\n                          erroneous links), summary, all (summary and all links)\n      -c, --console       Prints report to the console in addition to other\n                          output options such as file or email.\n\n    Email Options:\n      These options allows the crawler to send a report by email.\n\n      -a ADDRESS, --address=ADDRESS\n                          Comma-separated list of email addresses used to send a\n                          report\n      --from=FROM_ADDRESS\n                          Email address to use in the from field of the email\n                          (optional)\n      -s SMTP, --smtp=SMTP\n                          Host of the smtp server\n      --port=PORT         Port of the smtp server (optional)\n      --tls               Use TLS with the email server.\n      --subject=SUBJECT   Subject of the email (optional)\n      --smtp-username=SMTP_USERNAME\n                          Username to use with the smtp server (optional)\n      --smtp-password=SMTP_PASSWORD\n                          Password to use with the smtp server (optional)\n\nUsage Example\n-------------\n\nCrawl a site and show progress\n  ``pylinkvalidate.py --progress http://example.com/``\n\nCrawl a site starting from 2 URLs\n  ``pylinkvalidate.py http://example.com/ http://example2.com/``\n\nCrawl a site (example.com) and all pages belonging to another host\n  ``pylinkvalidate.py -H additionalhost.com http://example.com/``\n\nReport status of all links (even successful ones)\n  ``pylinkvalidate.py --report-type=all http://example.com/``\n\nReport status of all links and HTML show source of these links\n  ``pylinkvalidate.py --report-type=all --show-source http://example.com/``\n\nOnly crawl starting URLs and access all linked resources\n  
``pylinkvalidate.py --run-once http://example.com/``\n\nCrawl two levels (one more than run-once) and access all linked resources\n  ``pylinkvalidate.py --depth=1 http://example.com/``\n\nOnly access links (a href) and ignore images, stylesheets and scripts\n  ``pylinkvalidate.py --types=a http://example.com/``\n\nCrawl a site with 4 threads (default is one thread)\n  ``pylinkvalidate.py --workers=4 http://example.com/``\n\nCrawl a site with 4 processes (default is one thread)\n  ``pylinkvalidate.py --mode=process --workers=4 http://example.com/``\n\nCrawl a site and use lxml to parse HTML (faster, must be installed)\n  ``pylinkvalidate.py --parser=lxml http://example.com/``\n\nPrint debugging info\n  ``pylinkvalidate.py --verbose=2 http://example.com/``\n\nChange User-Agent request header\n  ``pylinkvalidate.py --header=\"User-Agent: Mozilla/5.0\" http://example.com/``\n\nCrawl multiple sites and report results per site\n  ``pylinkvalidate.py --multi http://example.com/ http://www.example2.net/``\n\nCheck that all HTML pages have a body tag with a specific class:\n  ``pylinkvalidate.py --check-presence '\u003cbody class=\"test\"\u003e\u003c/body\u003e' http://example.com/``\n\nCheck that no HTML pages have a paragraph tag with a pattern:\n  ``pylinkvalidate.py --check-absence '\u003cp\u003eregex:Hello\\s+World\u003c/p\u003e' http://example.com/``\n\nCheck that robots.txt has an empty Disallow rule:\n  ``pylinkvalidate.py --check-presence-once '/robots.txt,regex:^Disallow:\\s*$' http://example.com/``\n\nAllow insecure content for HTTPS sites with certificate errors [SSL: CERTIFICATE_VERIFY_FAILED]\n  ``pylinkvalidate.py --allow-insecure-content https://self-signed.example.com/``\n\n\nAPI Usage\n---------\n\nTo crawl a site from a single URL:\n\n.. 
code-block:: python\n\n  from pylinkvalidator.api import crawl\n  crawled_site = crawl(\"http://www.example.com/\")\n  number_of_crawled_pages = len(crawled_site.pages)\n  number_of_errors = len(crawled_site.error_pages)\n\n\nTo crawl a site and pass some configuration options (the same ones supported by\nthe command line interface):\n\n\n.. code-block:: python\n\n  from pylinkvalidator.api import crawl_with_options\n  crawled_site = crawl_with_options([\"http://www.example.com/\"], {\"run-once\":\n      True, \"workers\": 10})\n  number_of_crawled_pages = len(crawled_site.pages)\n  number_of_errors = len(crawled_site.error_pages)\n\n\nFAQ and Troubleshooting\n-----------------------\n\nI cannot find pylinkvalidate.py on Windows with virtualenv\n  This is a known problem with virtualenv on Windows. The interpreter that runs\n  the script is not the one used by the virtualenv. Prefix pylinkvalidate.py with the\n  full path: ``python c:\\myvirtualenv\\Scripts\\pylinkvalidate.py``\n\nI see Exception KeyError ... module 'threading' when using --mode=green\n  This output is generally harmless and is generated by gevent patching the\n  python thread module. If someone knows how to make it go away, patches are\n  more than welcome :-)\n\n\nLicense\n-------\n\nThis software is licensed under the `New BSD License`. See the `LICENSE` file\nfor the full license text. It includes the beautifulsoup library which\nis licensed under the MIT license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbartdag%2Fpylinkvalidator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbartdag%2Fpylinkvalidator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbartdag%2Fpylinkvalidator/lists"}