{"id":20052989,"url":"https://github.com/eghuro/crawlcheck","last_synced_at":"2026-04-12T23:46:16.198Z","repository":{"id":27490697,"uuid":"30970817","full_name":"eghuro/crawlcheck","owner":"eghuro","description":"Extensible web crawler","archived":false,"fork":false,"pushed_at":"2024-04-02T20:28:16.000Z","size":3151,"stargazers_count":0,"open_issues_count":6,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-04-02T21:36:05.655Z","etag":null,"topics":["configuration","crawler","http","plugin","python","robots-txt","sitemap"],"latest_commit_sha":null,"homepage":"https://eghuro.github.io/crawlcheck/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eghuro.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2015-02-18T15:16:08.000Z","updated_at":"2024-04-14T19:27:41.204Z","dependencies_parsed_at":"2023-01-16T23:00:12.084Z","dependency_job_id":"e50c03bd-59d3-49a6-b4a1-10d63dd8cc9e","html_url":"https://github.com/eghuro/crawlcheck","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eghuro%2Fcrawlcheck","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eghuro%2Fcrawlcheck/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eghuro%2Fcrawlcheck/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eghuro%2Fcrawlcheck/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eghuro","download_url":"https://c
odeload.github.com/eghuro/crawlcheck/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241482069,"owners_count":19969847,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["configuration","crawler","http","plugin","python","robots-txt","sitemap"],"created_at":"2024-11-13T12:20:39.202Z","updated_at":"2025-10-15T23:17:02.548Z","avatar_url":"https://github.com/eghuro.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Crawlcheck\n\nCrawlcheck is a web crawler that invokes plugins on the content it receives.\nIt is intended for verifying websites prior to deployment.\nThe verification process is customized through a configuration file that\nspecifies which plugins should check particular URIs and\ncontent types.\n\n## Version\n\n1.1.0\n\n## Dependencies\n\nCrawlcheck's engine currently runs on Python 3.5 and 3.6 and uses SQLite3 as a\ndatabase backend. Crawlcheck uses a number of open source projects to work\nproperly. Python dependencies are listed in\n[requirements.txt](https://github.com/eghuro/crawlcheck/blob/master/requirements.txt).\n\nFor a web report, there's a [separate project](https://github.com/eghuro/crawlcheck-report).\n\n## Installation\n\n0) You will need python3, python-pip, sqlite3, virtualenv, libmagic, libtidy,\nlibxml2 and libxslt installed. 
Install the dev (or devel) versions of these packages.\n\n1) Fetch sources\n\n```sh\ngit clone https://github.com/eghuro/crawlcheck crawlcheck\n```\n\n2) Install the Python dependencies\n\n```sh\ncd crawlcheck\npip install -r requirements.txt\n```\n\n## Configuration\n\nA configuration file is a YAML file defined as follows:\n\n```yaml\n---\nversion: 1.6                    # configuration format version\ndatabase: crawlcheck.sqlite     # SQLite database file\nmaxDepth: 10                    # max number of links followed from any entry point (default: 0 meaning unlimited)\nagent: \"Crawlcheck/1.1\"         # user agent used (default: Crawlcheck/1.1)\nlogfile: cc.log                 # where to store logs\nmaxContentLength: 2000000       # max file size to download\npluginDir: plugin               # where to look for plugins (including subfolders, default: 'plugin')\ntimeout: 1                      # timeout for networking (default: 1)\ncleandb: True                   # clean the database before execution\ninitdb: True                    # initialize the database\nreport: \"http://localhost:5000\" # report REST API\ncleanreport: True               # clean entries in the report before sending the current ones\nmaxVolume: 100000000            # max 100 MB of temp files (default: sys.maxsize)\nmaxAttempts: 2                  # attempts to download a web page (default: 3)\ntmpPrefix: \"Crawlcheck\"         # prefix for temporary file names with downloaded content (default: Crawlcheck)\ntmpSuffix: \"content\"            # suffix for temporary file names with downloaded content (default: content)\ntmpDir: \"/tmp/\"                 # where to store temporary files (default: /tmp/)\ndbCacheLimit: 100000            # the number of cached database queries (default: sys.maxsize)\nurlLimit: 10000000              # limit on seen URIs\nverifyHttps: True               # verify HTTPS? (default: False)\ncores: 2                        # the number of cores available (e.g. 
for parallel report payload generation)\nrecordParams: False             # record request data or URL params? (default: True)\nrecordHeaders: False            # record response headers? (default: True)\nsitemap-file: \"sitemap.xml\"     # where to store generated sitemap.xml\nsitemap-regex: \"https?://ksp.mff.cuni.cz(/.*)?\" # regex for sitemap generator\nyaml-out-file: \"cc.yml\"         # where to write YAML report\nreport-file: \"report\"           # where to write PDF report (.pdf will be added automatically)\n\n# other parameters used by plugins are written as key: value\n# these parameters can also be specified on the command line using --param key=value\n# command line parameters override configuration ones\n\nregexes:\n -\n    regex: \"http://ksp.mff.cuni.cz/(?!sksp|profil|forum|auth).*\"\n    plugins: # which plugins are allowed for the given URL\n       - linksFinder\n       - tidyHtmlValidator\n       - tinycss\n       - css_scraper\n       - formChecker\n       - seoimg\n       - seometa\n       - dupdetect\n       - non_semantic_html\n -\n    regex: \"https?://(?!ksp.mff.cuni.cz/(sksp|profil|forum|auth)).+\" # test links (HEAD request) only\n    plugins:\n\nfilters: # filters (plugins of category header and filter) that can be used\n - depth\n - robots\n - contentLength\n - canonical\n - acceptedType\n - acceptedUri\n - uri_normalizer\n - expectedType\n\n# filters: True\n# alternative option to allow all available filters\n# can be passed on the command line using --param\n\npostprocess:\n - sitemap_generator\n - report_exporter\n - yaml_exporter\n - TexReporter\n\n# postprocess: True\n# alternative option to allow all available postprocessors\n# can be passed on the command line using --param\n\nentryPoints: # where to start\n# Note that once a URI gets into the database, it is no longer requested\n# (beware of repeated starts: if an entry point remains in the database, execution won't\n# start from this entry point)\n#\n# Entry points can also be specified via 
the command line parameter --entry=url\n - \"http://ksp.mff.cuni.cz/\"\n\n# additional content-type rules can still be specified and take precedence over plugin-defined rules\ncontent-types:\n -\n    \"content-type\": \"text/html\"\n    plugins: # plugins to use for the given content-type\n       - linksFinder\n       - tidyHtmlValidator\n       - css_scraper\n       - formChecker\n       - seoimg\n       - seometa\n       - dupdetect\n       - non_semantic_html\n -\n    \"content-type\": \"text/css\"\n    plugins:\n       - tinycss\n       - dupdetect\n -\n    \"content-type\": \"application/gzip\"\n    plugins:\n       - sitemapScanner\n -\n    \"content-type\": \"application/xml\"\n    plugins:\n       - sitemapScanner\n       - dupdetect\n\n```\n\n## Running crawlcheck\n\nOnce setup and configuration are complete, run the checker:\n\n```sh\ncd [root]/crawlcheck/src/\npython checker/ [config.yml]\n```\n\nNote: ``[root]/crawlcheck`` is where the repository was cloned to,\n``[config.yml]`` stands for the configuration file path.\n\n## Plugins\n\nThere are currently five types of plugins: crawlers, checkers, headers, filters,\nand postprocessors. The crawlers specialize in discovering new links.\nThe checkers check the syntax of various files. 
The headers check HTTP headers\nand, together with the filters, customize the crawling process itself.\nThe postprocessors are used to generate reports or other outputs from\nthe application.\n\nCrawlcheck is currently extended with the following plugins:\n\n* linksFinder (crawler)\n* sitemapScanner (crawler)\n* tidyHtmlValidator (checker)\n* tinycss (checker)\n* css_scraper (checker)\n* seoimg (checker)\n* seometa (checker)\n* dupdetect (checker)\n* non_semantic_html (checker)\n* noscript (checker)\n* mailer (checker)\n* contentLength (header)\n* expectedType (header)\n* canonical (header)\n* acceptedType (header)\n* acceptedUri (header)\n* uri_normalizer (header)\n* depth (filter)\n* robots (filter)\n* report_exporter (postprocessor)\n* yaml_exporter (postprocessor)\n* sitemap_generator (postprocessor)\n\n## How to write a plugin\n\nGo to ``crawlcheck/src/checker/plugin/`` and create the ``my_new_plugin.py`` and\n``my_new_plugin.yapsy-plugin`` files there.\nFill out the .yapsy-plugin file:\n\n```ini\n[Core]\nName = Human readable plugin name\nModule = my_new_plugin\n\n[Documentation]\nAuthor = Your Name\nVersion = 0.0\nDescription = My New Plugin\n```\n\nFor the plugin itself, you need to implement the following:\n\n```python\nfrom yapsy.IPlugin import IPlugin\nfrom common import PluginType\nfrom filter import FilterException  # for headers and filters\n\n\nclass MyPlugin(IPlugin):\n\n    category = PluginType.CHECKER  # pick the appropriate type\n    id = \"myPlugin\"\n    contentTypes = [\"text/html\"]  # accepted content types (checkers \u0026 crawlers)\n\n    def acceptType(self, ctype):  # alternatively, a method resolving more complex content-type rules\n        return True\n\n    def setJournal(self, journal):\n        # record journal somewhere - all categories\n        self.journal = journal\n\n    def setQueue(self, queue):\n        # record queue somewhere - if needed\n        self.queue = queue\n\n    def setConf(self, conf):\n        # record configuration - only headers and filters\n        self.conf = conf\n\n    def check(self, transaction):\n        # implement the 
checking logic here for crawlers and checkers\n        pass\n\n    def filter(self, transaction):\n        # implement the filtering logic here for filters and headers\n        # raise FilterException to filter the transaction out\n        pass\n\n    def setDb(self, db):\n        # record DB somewhere - only postprocessors\n        self.db = db\n\n    def process(self):\n        # implement the postprocessing logic here for postprocessors\n        pass\n```\n\nSee \u003chttp://yapsy.sourceforge.net/IPlugin.html\u003e and\n\u003chttp://yapsy.sourceforge.net/PluginManager.html#plugin-info-file-format\u003e for\nmore details.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feghuro%2Fcrawlcheck","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feghuro%2Fcrawlcheck","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feghuro%2Fcrawlcheck/lists"}