{"id":14989979,"url":"https://github.com/a3onn/mapptth","last_synced_at":"2025-04-12T01:50:46.273Z","repository":{"id":154900485,"uuid":"607732188","full_name":"A3onn/mapptth","owner":"A3onn","description":"A simple to use multi-threaded web-crawler written in C with libcURL and Lexbor.","archived":false,"fork":false,"pushed_at":"2024-02-13T17:09:19.000Z","size":210,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-25T21:35:25.198Z","etag":null,"topics":["c","cmake","gplv3","graphviz","lexbor","libcurl","multi-threading","robots-txt","sitemap","web-crawler"],"latest_commit_sha":null,"homepage":"https://github.com/A3onn/mapptth","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/A3onn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-02-28T15:11:27.000Z","updated_at":"2024-02-14T13:04:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"cb46aaeb-8d66-4dc7-9ff7-55634095732a","html_url":"https://github.com/A3onn/mapptth","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/A3onn%2Fmapptth","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/A3onn%2Fmapptth/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/A3onn%2Fmapptth/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/A3onn%2Fmapptth/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/A3onn","download_url":"https://codeload.github.com/A3onn/mapptth/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248505871,"owners_count":21115354,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c","cmake","gplv3","graphviz","lexbor","libcurl","multi-threading","robots-txt","sitemap","web-crawler"],"created_at":"2024-09-24T14:19:16.246Z","updated_at":"2025-04-12T01:50:46.251Z","avatar_url":"https://github.com/A3onn.png","language":"C","readme":"[MapPTTH](https://github.com/A3onn/mapptth \"MapPTTH github\")\n====\n\nA simple to use multi-threaded web-crawler written in C with libcURL and Lexbor.\n---\n\n## Dependencies\n\nMapPTTH uses:\n\n- libcURL (\u003e= 7.62.0)\n- Lexbor (see [Installation](#installation) if you don't want to install it on your system)\n- libxml2\n- libPCRE\n- CMake (\u003e= 3.1.0)\n\n#### Optional\n\n- GraphViz (_libgvc_ and _libcgraph_): generate graphs\n- libcheck: unit tests\n\n## Installation\n\n### Dependencies\n\nOn Ubuntu (with GraphViz support):\n\n`\nsudo apt install cmake libpcre3-dev libcurl4-openssl-dev libxml2-dev libgraphviz-dev\n`\n\n### Cloning and building\n\nIf you don't have Lexbor installed and don't want to install it, you can clone Lexbor while cloning MapPTTH and compile without any installation:\n\n```\ngit clone --recurse-submodules https://github.com/A3onn/mapptth/\ncd mapptth/\nmkdir build/ \u0026\u0026 cd build/\ncmake .. 
\u0026\u0026 make -j5\n```\n\nIf you have all dependencies installed on your system:\n\n```\ngit clone https://github.com/A3onn/mapptth/\ncd mapptth/\nmkdir build/ \u0026\u0026 cd build/\ncmake .. \u0026\u0026 make -j5\n```\n\n#### Generate tests\n\nIf you want to generate unit tests, make sure libcheck is installed on your system before running CMake.\n\n### GraphViz support\n\nIf GraphViz is found on the system when running CMake, you will be able to generate graphs.\n\nIf you want to disable it, you can run `cmake -DMAPPTTH_NO_GRAPHVIZ=1 ..` instead of `cmake ..`.\n\n## How to use\n\n### Parameters\n\nThe only required argument is a URL. This URL specifies where the crawler will start crawling.\n\nHere is the list of available parameters, grouped by category:\n\n#### Connection\n\n| Name | Argument |\n| --- | --- |\n| URL where to start crawling; the last one specified will be used. __(REQUIRED)__ | \\\u003cURL\u003e |\n| String that will be used as the user-agent. You can disable sending the user-agent header by giving an empty string. (default='MAPPTTH/\u003cversion\u003e') | -U \\\u003cuser-agent\u003e |\n| Timeout in seconds for each connection. If a connection times out, an error will be printed to standard error, but no information about the URL. (default=3) | -m \\\u003ctimeout\u003e |\n| Only resolve to IPv4 addresses. | -4 |\n| Only resolve to IPv6 addresses. | -6 |\n| Add headers to the HTTP request, in the form \"\\\u003ckey\u003e:\\\u003cvalue\u003e;\"; the ':' and the value are optional, and each header has to end with a ';'. | -Q \\\u003cheader\u003e |\n| Allow insecure connections when using SSL/TLS. | -i |\n| Add cookies to the HTTP request, in the form \"\\\u003ckey\u003e:\\\u003cvalue\u003e;\"; you can specify multiple cookies at once by separating them with a ';'. Note that they won't be modified during the crawl. | -C \\\u003ccookies\u003e |\n\n\n#### Controlling where the crawler goes\n\n| Name | Argument |\n| --- | --- |\n| Allow the crawler to go into subdomains of the initial URL and allowed domains. 
(default=false) | -s |\n| Allow the crawler to go to these domains. | -a \\\u003cdomain\u003e |\n| Disallow the crawler to go to these domains. | -d \\\u003cdomain\u003e |\n| Only allow the crawler to fetch URLs starting with these paths. Can be a regex (extended and case-sensitive). | -p \\\u003cpath or regex\u003e |\n| Disallow the crawler to fetch URLs starting with these paths. Can be a regex (extended and case-sensitive). | -P \\\u003cpath or regex\u003e |\n| Maximum depth of paths. If a path is deeper than this, it won't be fetched. | -D \\\u003cdepth\u003e |\n| Only fetch URLs with HTTP as scheme (don't forget to add '-r 80' if you start with an 'https://' URL). | -f |\n| Only fetch URLs with HTTPS as scheme (don't forget to add '-r 443' if you start with an 'http://' URL). | -F |\n| Only allow the crawler to fetch files with these extensions. If no extension is found, this filter won't apply. | -x .\\\u003cextension\u003e |\n| Disallow the crawler to fetch files with these extensions. If no extension is found, this filter won't apply. | -X .\\\u003cextension\u003e |\n| Allow the crawler to go to these ports. | -r \\\u003cport\u003e |\n| Keep the query part of the URL. Note that if the same URL is found with two different queries, both will be fetched. | -q |\n\n\n#### Parsing\n\n| Name | Argument |\n| --- | --- |\n| Only parse the \\\u003chead\u003e part. | -H |\n| Only parse the \\\u003cbody\u003e part. | -B |\n\n\n#### Output\n\n| Name | Argument |\n| --- | --- |\n| Don't print with colors. | -c |\n| Print the title of the page, if there is one, when displaying a URL. | -T |\n| File to write output into (without colors). | -o \\\u003cfile name\u003e |\n| Print a summary of what was found as a directory structure. | -O |\n| Print when encountering tel: and mailto: URLs. | -I |\n\n#### Graph\n\n_MapPTTH must be compiled with GraphViz support._\n\n| Name | Argument |\n| --- | --- |\n| Create a graph. | -g |\n| Change the layout of the graph. 
(default='sfdp') | -L \\\u003clayout\u003e |\n| Change the output graph file format. (default='png') | -G \\\u003cformat\u003e |\n\n#### Other\n\n| Name | Argument |\n| --- | --- |\n| Number of threads that will fetch URLs. (default=5) | -t \\\u003cnumber of threads\u003e |\n| Parse the sitemap of the site; this should speed up the crawler and may provide URLs that couldn't be found without it. | -S \\\u003cURL of the sitemap\u003e |\n| Parse the robots.txt of the site; paths found in 'allowed' and 'disallowed' directives are added to the list of found URLs. Other directives are ignored. | -R \\\u003cURL of the robots.txt file\u003e |\n| URL of the proxy to use. | -z \\\u003cURL of the proxy\u003e |\n| Print the help. | -h |\n| Print the version. | -V |\n\nYou can stop the crawler with _CTRL-C_ at any moment; it will stop gracefully and finish as normal.\n\n\n### Examples\n\nSimple crawl:\n\n```\nmapptth https://google.com\n```\n\nStart crawling at a certain URL:\n\n```\nmapptth https://google.com/some/url/file.html\n```\n\nMore threads:\n\n```\nmapptth https://google.com -t 10\n```\n\nAllow crawling into subdomains (e.g. www.google.com, mail.google.com, www.mail.google.com):\n\n```\nmapptth https://google.com -s\n```\n\nAllow crawling into certain domains and their subdomains (e.g. www.google.com, mail.gitlab.com, www.mail.github.com):\n\n```\nmapptth http://google.com -s -a gitlab.com -a github.com -r 443\n```\n\nDisallow some paths:\n\n```\nmapptth https://google.com -P /path -P /some-path\n```\n\nDisallow a path and only fetch .html and .php files:\n\n```\nmapptth https://google.com -P /some-path -x .html -x .php\n```\n\nOnly crawl in the /path directory:\n\n```\nmapptth https://google.com -p /path\n```\n\nA more complete and complex example:\n\n```\nmapptth https://google.com/mail -x .html -P /some-path -t 10 -m 5 -s -q -D 6 -T -o output.txt -H -S http://www.google.com/sitemap.xml\n```\n\n## TODO\n\nASAP:\n\n- [X] Handling the 
\\\u003cbase\\\u003e tag\n\n\nWithout any priority:\n\n- [ ] Add a parameter to control the connection rate\n\n- [ ] Create logo (maybe)\n\n- [X] Print when encountering mailto: or tel:\n\n- [X] Add robots.txt parser\n\n- [X] Add proxy support\n\n- [X] Use regex in filters (disallowed paths, allowed paths, etc.)\n\n- [X] Add examples in readme\n\n- [X] More unit tests\n\n- [X] Use only getopt to parse arguments\n\n- [X] GraphViz support to generate graphs\n\n- [X] Output to file\n\n- [X] Add parameters to control: disallowed domains, only allowed paths and disallowed extensions\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fa3onn%2Fmapptth","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fa3onn%2Fmapptth","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fa3onn%2Fmapptth/lists"}