{"id":13478685,"url":"https://github.com/JarryShaw/darc","last_synced_at":"2025-03-27T08:30:56.329Z","repository":{"id":37955393,"uuid":"211512930","full_name":"JarryShaw/darc","owner":"JarryShaw","description":"Darkweb Crawler Project","archived":false,"fork":false,"pushed_at":"2025-03-22T10:03:29.000Z","size":290029,"stargazers_count":160,"open_issues_count":5,"forks_count":25,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-03-22T11:19:13.558Z","etag":null,"topics":["crawler","darkweb"],"latest_commit_sha":null,"homepage":"https://jarryshaw.github.io/darc/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JarryShaw.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":null,"patreon":"jarryshaw","open_collective":null,"ko_fi":null,"tidelift":"pypi/python-darc","community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"custom":null}},"created_at":"2019-09-28T14:28:57.000Z","updated_at":"2025-03-22T10:03:13.000Z","dependencies_parsed_at":"2023-10-15T04:08:24.613Z","dependency_job_id":"9b4ca9e4-aec1-4889-9e2c-bd0876575f6d","html_url":"https://github.com/JarryShaw/darc","commit_stats":{"total_commits":742,"total_committers":5,"mean_commits":148.4,"dds":"0.42048517520215634","last_synced_commit":"0b02f29e10f4c932bbb79575be65e43165e0aae5"},"previous_names":[],"tags_count":87,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JarryShaw%2Fdarc","tags_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JarryShaw%2Fdarc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JarryShaw%2Fdarc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JarryShaw%2Fdarc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JarryShaw","download_url":"https://codeload.github.com/JarryShaw/darc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245809655,"owners_count":20676028,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","darkweb"],"created_at":"2024-07-31T16:02:00.745Z","updated_at":"2025-03-27T08:30:51.320Z","avatar_url":"https://github.com/JarryShaw.png","language":"Python","funding_links":["https://patreon.com/jarryshaw","https://tidelift.com/funding/github/pypi/python-darc"],"categories":["[↑](#-content) 🛠️ Tools","Python"],"sub_categories":[],"readme":"``darc`` - Darkweb Crawler Project\n==================================\n\n   For any technical and/or maintenance information,\n   please kindly refer to the |docs|_.\n\n.. |docs| replace:: **Official Documentation**\n.. _docs: https://darc.jarryshaw.me\n\n**NB**: Starting from version ``1.0.0``, new features of the project will not be\ndeveloped into this public repository. 
Only bug fixes and security patches will be\napplied in updates and new releases.\n\n``darc`` is designed as a Swiss Army knife for darkweb crawling.\nIt integrates ``requests`` to collect HTTP request and response\ninformation, such as cookies, header fields, etc. It also bundles\n``selenium`` to provide a fully rendered web page and a screenshot\nof that view.\n\n.. image:: https://darc.jarryshaw.me/en/latest/_images/darc.jpeg\n\nThe general process of ``darc`` can be described as follows:\n\nThere are two types of *workers*:\n\n* ``crawler`` -- runs the ``darc.crawl.crawler`` to provide a\n  fresh view of a link and test its connectability\n\n* ``loader`` -- runs the ``darc.crawl.loader`` to provide an\n  in-depth view of a link and provide more visual information\n\nThe general process can be described as follows for *workers* of ``crawler`` type:\n\n1. ``darc.process.process_crawler``: obtain URLs from the ``requests``\n   link database (c.f. ``darc.db.load_requests``), and feed such URLs to\n   ``darc.crawl.crawler``.\n\n   **NOTE:**\n\n      If ``darc.const.FLAG_MP`` is ``True``, the function will be\n      called with *multiprocessing* support; if ``darc.const.FLAG_TH``\n      is ``True``, the function will be called with *multithreading*\n      support; if neither is set, the function will be called in a single thread.\n\n2. ``darc.crawl.crawler``: parse the URL using\n   ``darc.link.parse_link``, and check whether the URL needs to be\n   crawled (c.f. ``darc.const.PROXY_WHITE_LIST``, ``darc.const.PROXY_BLACK_LIST``,\n   ``darc.const.LINK_WHITE_LIST`` and ``darc.const.LINK_BLACK_LIST``);\n   if so, crawl the URL with ``requests``.\n\n   If the URL is from a brand new host, ``darc`` will first try\n   to fetch and save ``robots.txt`` and sitemaps of the host\n   (c.f. ``darc.proxy.null.save_robots`` and ``darc.proxy.null.save_sitemap``),\n   then extract and save the links from the sitemaps (c.f. 
``darc.proxy.null.read_sitemap``)\n   into the link database for future crawling (c.f. ``darc.db.save_requests``).\n   Also, if the submission API is provided, ``darc.submit.submit_new_host``\n   will be called to submit the documents just fetched.\n\n   If ``robots.txt`` is present and ``darc.const.FORCE`` is\n   ``False``, ``darc`` will check whether it is allowed to crawl the URL.\n\n   **NOTE:**\n\n      The root path (e.g. ``/`` in https://www.example.com/) will always\n      be crawled, ignoring ``robots.txt``.\n\n   At this point, ``darc`` will call the customised hook function\n   from ``darc.sites`` to crawl and get the final response object.\n   ``darc`` will save the session cookies and header information,\n   using ``darc.save.save_headers``.\n\n   **NOTE:**\n\n      If ``requests.exceptions.InvalidSchema`` is raised, the link\n      will be saved by ``darc.proxy.null.save_invalid``. Further\n      processing is skipped.\n\n   If the content type of the response document is not ignored (c.f.\n   ``darc.const.MIME_WHITE_LIST`` and ``darc.const.MIME_BLACK_LIST``),\n   ``darc.submit.submit_requests`` will be called to submit the document\n   just fetched.\n\n   If the response document is HTML (``text/html`` or ``application/xhtml+xml``),\n   ``darc.parse.extract_links`` will then be called to extract all possible\n   links from the HTML document and save such links into the database\n   (c.f. ``darc.db.save_requests``).\n\n   If the response status code is between ``400`` and ``600``,\n   the URL will be saved back to the link database\n   (c.f. ``darc.db.save_requests``). Otherwise, the URL will\n   be saved into the ``selenium`` link database to proceed to the next steps\n   (c.f. ``darc.db.save_selenium``).\n\nThe general process can be described as follows for *workers* of ``loader`` type:\n\n1. ``darc.process.process_loader``: in the meantime, ``darc`` will\n   obtain URLs from the ``selenium`` link database (c.f. 
``darc.db.load_selenium``),\n   and feed such URLs to ``darc.crawl.loader``.\n\n   **NOTE:**\n\n      If ``darc.const.FLAG_MP`` is ``True``, the function will be\n      called with *multiprocessing* support; if ``darc.const.FLAG_TH``\n      is ``True``, the function will be called with *multithreading*\n      support; if neither is set, the function will be called in a single thread.\n\n2. ``darc.crawl.loader``: parse the URL using\n   ``darc.link.parse_link`` and start loading the URL using\n   ``selenium`` with Google Chrome.\n\n   At this point, ``darc`` will call the customised hook function\n   from ``darc.sites`` to load and return the original\n   ``selenium.webdriver.chrome.webdriver.WebDriver`` object.\n\n   If successful, the rendered source HTML document will be saved, and a\n   full-page screenshot will be taken and saved.\n\n   If the submission API is provided, ``darc.submit.submit_selenium``\n   will be called to submit the document just loaded.\n\n   Later, ``darc.parse.extract_links`` will be called to\n   extract all possible links from the HTML document and save such\n   links into the ``requests`` database (c.f. ``darc.db.save_requests``).\n\n------------\nInstallation\n------------\n\n**NOTE:**\n\n   ``darc`` supports all Python versions from **3.6** onward.\n   Currently, it only supports and is tested on Linux (*Ubuntu 18.04*)\n   and macOS (*Catalina*).\n\n   When installing in Python versions below **3.8**, ``darc`` will\n   use |walrus|_ to compile itself for backport compatibility.\n\n   .. |walrus| replace:: ``walrus``\n   .. _walrus: https://github.com/pybpc/walrus\n\n.. code-block:: shell\n\n   pip install python-darc\n\nPlease make sure you have Google Chrome and the corresponding version of\nChromeDriver installed on your system.\n\n   Starting from version **0.3.0**, we introduced `Redis`_ for the task\n   queue database backend.\n\n   .. 
_Redis: https://redis.io\n\n   Since version **0.6.0**, we introduced relational database storage\n   (e.g. `MySQL`_, `SQLite`_, `PostgreSQL`_, etc.) for the task queue database\n   backend, besides the `Redis`_ database, since Redis can consume too much memory\n   when the task queue becomes very large.\n\n   .. _MySQL: https://mysql.com/\n   .. _SQLite: https://www.sqlite.org/\n   .. _PostgreSQL: https://www.postgresql.org/\n\n   Please make sure you have one of the backend databases installed, configured,\n   and running when using the ``darc`` project.\n\nAlternatively, the ``darc`` project ships with Docker and Compose support.\nPlease see the project root for relevant files and more information.\n\nOr, you may refer to and/or install from the `Docker Hub`_ repository:\n\n.. code-block:: shell\n\n   docker pull jsnbzh/darc[:TAGNAME]\n\n.. _Docker Hub: https://hub.docker.com/r/jsnbzh/darc\n\nor from the GitHub Container Registry, which hosts more up-to-date and comprehensive images:\n\n.. code-block:: shell\n\n   docker pull ghcr.io/jarryshaw/darc[:TAGNAME]\n   # or the debug image\n   docker pull ghcr.io/jarryshaw/darc-debug[:TAGNAME]\n\n-----\nUsage\n-----\n\nThe ``darc`` project provides a simple CLI::\n\n   usage: darc [-h] [-v] -t {crawler,loader} [-f FILE] ...\n\n   the darkweb crawling swiss army knife\n\n   positional arguments:\n     link                  links to craw\n\n   optional arguments:\n     -h, --help            show this help message and exit\n     -v, --version         show program's version number and exit\n     -t {crawler,loader}, --type {crawler,loader}\n                           type of worker process\n     -f FILE, --file FILE  read links from file\n\nIt can also be called through the module entry point::\n\n   python -m darc ...\n\n**NOTE:**\n\n   The link files can contain **comment** lines, which should start with ``#``.\n   Empty lines and comment lines will be ignored when 
loading.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJarryShaw%2Fdarc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FJarryShaw%2Fdarc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJarryShaw%2Fdarc/lists"}