A simple to use multi-threaded web-crawler written in C with libcURL and Lexbor.
- Host: GitHub
- URL: https://github.com/a3onn/mapptth
- Owner: A3onn
- License: gpl-3.0
- Created: 2023-02-28T15:11:27.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2024-02-13T17:09:19.000Z (almost 2 years ago)
- Last Synced: 2025-03-25T21:35:25.198Z (11 months ago)
- Topics: c, cmake, gplv3, graphviz, lexbor, libcurl, multi-threading, robots-txt, sitemap, web-crawler
- Language: C
- Homepage: https://github.com/A3onn/mapptth
- Size: 205 KB
- Stars: 6
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
[MapPTTH](https://github.com/A3onn/mapptth "MapPTTH github")
====
A simple to use multi-threaded web-crawler written in C with libcURL and Lexbor.
---
## Dependencies
MapPTTH uses:
- libcURL (>= 7.62.0)
- Lexbor (see [Installation](#installation) if you don't want to install it on your system)
- libxml2
- libPCRE
- CMake (>= 3.1.0)
#### Optional
- GraphViz (_libgvc_ and _libcgraph_): generate graphs
- libcheck: unit tests
## Installation
### Dependencies
On Ubuntu (with GraphViz support):
```
sudo apt install cmake libpcre3-dev libcurl4-openssl-dev libxml2-dev libgraphviz-dev
```
### Cloning and building
If you don't have Lexbor installed and don't want to install it system-wide, you can clone it as a submodule along with MapPTTH and build without installing anything:
```
git clone --recurse-submodules https://github.com/A3onn/mapptth/
cd mapptth/
mkdir build/ && cd build/
cmake .. && make -j5
```
If you have all dependencies installed on your system:
```
git clone https://github.com/A3onn/mapptth/
cd mapptth/
mkdir build/ && cd build/
cmake .. && make -j5
```
#### Generate tests
If you want to build the unit tests, libcheck must be installed on your system (see the optional dependencies above).
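The exact command is not documented here; the following is only a sketch, assuming the test binaries are registered with CMake's standard CTest driver, which is not confirmed above:
```
# From the build/ directory created earlier (assumes libcheck was found by CMake)
make -j5                    # build MapPTTH and the test binaries
ctest --output-on-failure   # run the registered unit tests (CTest registration is assumed)
```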
### GraphViz support
If GraphViz is found on the system when running CMake, you will be able to generate graphs.
If you want to disable it, you can run `cmake -DMAPPTTH_NO_GRAPHVIZ=1 ..` instead of `cmake ..`.
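For example, to force a build without GraphViz support even when the library is installed (same build layout as above):
```
cd mapptth/build/
cmake -DMAPPTTH_NO_GRAPHVIZ=1 .. && make -j5
```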
## How to use
### Parameters
The only required argument is a URL, which specifies where the crawler will start crawling.
Here is the list of available parameters grouped by category:
#### Connection
| Name | Argument |
| --- | --- |
| URL where to start crawling; if several are given, the last one is used. __(REQUIRED)__ | \<url\> |
| String that will be used as user-agent. You can disable sending the user-agent header by giving an empty string. (default='MAPPTTH/') | -U \<user-agent\> |
| Timeout in seconds for each connection. If a connection times out, an error is printed to standard error, but without any information about the URL. (default=3) | -m \<seconds\> |
| Only resolve to IPv4 addresses. | -4 |
| Only resolve to IPv6 addresses. | -6 |
| Add headers to the HTTP request. They look like this: "\<name\>:\<value\>;"; the ':' and the value are optional, and each header has to end with a ';'. | -Q \<header\> |
| Allow insecure connections when using SSL/TLS. | -i |
| Add cookies to the HTTP request. They look like this: "\<name\>:\<value\>;"; you can specify multiple cookies at once by separating them with a ';'. Note that they won't be modified during the crawl. | -C \<cookies\> |
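As an illustration of the connection options above (only a sketch; the host, user-agent, header and cookie values are made up):
```
mapptth https://example.com -U "MyCrawler/1.0" -m 5 -4 -Q "Accept-Language:en;" -C "session:abcd1234;"
```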
#### Controlling where the crawler goes
| Name | Argument |
| --- | --- |
| Allow the crawler to go into subdomains of the initial URL and allowed domains. (default=false) | -s |
| Allow the crawler to go to these domains. | -a \<domain\> |
| Disallow the crawler to go to these domains. | -d \<domain\> |
| Allow the crawler to only fetch URLs starting with these paths. Can be a regex (extended and case-sensitive). | -p \<path\> |
| Disallow the crawler to fetch URLs starting with these paths. Can be a regex (extended and case-sensitive). | -P \<path\> |
| Maximum depth of paths. If a path is deeper than this, it won't be fetched. | -D \<depth\> |
| Only fetch URLs with HTTP as scheme (Don't forget to add '-r 80' if you start with an 'https://' URL). | -f |
| Only fetch URLs with HTTPS as scheme (Don't forget to add '-r 443' if you start with an 'http://' URL). | -F |
| Allow the crawler to only fetch files with these extensions. If no extension is found, this filter won't apply. | -x .\<extension\> |
| Disallow the crawler to fetch files with these extensions. If no extension is found, this filter won't apply. | -X .\<extension\> |
| Allow the crawler to go to these ports. | -r \<port\> |
| Keep the query part of the URL. Note that if the same URL is found with two different queries, both will be fetched. | -q |
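For example, the scope filters above might be combined like this (a sketch; example.com and the paths are illustrative). The crawler stays on HTTPS, explores subdomains, only follows paths under /blog while avoiding /admin, goes at most 4 levels deep and only fetches .html files:
```
mapptth https://example.com -s -F -p /blog -P /admin -D 4 -x .html
```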
#### Parsing
| Name | Argument |
| --- | --- |
| Only parse the \<head\> part. | -H |
| Only parse the \<body\> part. | -B |
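A minimal sketch (example.com is illustrative), assuming you only care about links declared in the \<head\> of each page:
```
mapptth https://example.com -H
```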
#### Output
| Name | Argument |
| --- | --- |
| Don't print with colors. | -c |
| Print the title of the page if there is one when displaying an URL. | -T |
| File to write output into (without colors). | -o \<file\> |
| Print a summary of what was found as a directory structure. | -O |
| Print when encountering tel: and mailto: URLs. | -I |
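Combining the output options above (the host and file name are illustrative): print page titles, write a color-free copy of the output to a file and end with a directory-style summary:
```
mapptth https://example.com -T -o results.txt -O
```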
#### Graph
_MapPTTH must be compiled with GraphViz support._
| Name | Argument |
| --- | --- |
| Create a graph. | -g |
| Change the layout of the graph. (default='sfdp') | -L \<layout\> |
| Change the output graph file format. (default='png') | -G \<format\> |
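Assuming MapPTTH was built with GraphViz support, a sketch that writes the graph with the 'dot' layout as an SVG file ('dot' and 'svg' are just examples of values GraphViz accepts):
```
mapptth https://example.com -g -L dot -G svg
```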
#### Other
| Name | Argument |
| --- | --- |
| Number of threads that will fetch URLs. (default=5) | -t \<number\> |
| Parse the sitemap of the site; this should speed up the crawler and may provide URLs that couldn't be found without the sitemap. | -S \<url\> |
| Parse the robots.txt of the site; paths found in 'Allow' and 'Disallow' directives are added to the list of found URLs. Other directives are ignored. | -R \<url\> |
| URL of the proxy to use. | -z \<url\> |
| Print the help. | -h |
| Print the version. | -V |
You can stop the crawler with _CTRL-C_ at any moment; it will stop gracefully and finish as normal.
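A sketch combining the options above; the sitemap, robots.txt and proxy URLs are made-up values:
```
mapptth https://example.com -t 10 -S https://example.com/sitemap.xml -R https://example.com/robots.txt -z http://127.0.0.1:8080
```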
### Examples
Simple crawl:
```
mapptth https://google.com
```
Start crawling at a certain URL:
```
mapptth https://google.com/some/url/file.html
```
More threads:
```
mapptth https://google.com -t 10
```
Allow crawling into subdomains (ex: www.google.com, mail.google.com, ww.mail.google.com):
```
mapptth https://google.com -s
```
Allow crawling certain domains and their subdomains (ex: www.google.com, mail.gitlab.com, www.mail.github.com):
```
mapptth http://google.com -s -a gitlab.com -a github.com -r 443
```
Disallow some paths:
```
mapptth https://google.com -P /path -P /some-path
```
Disallow a path and only fetch .html and .php files:
```
mapptth https://google.com -P /some-path -x .html -x .php
```
Only crawl in the /path directory:
```
mapptth https://google.com -p /path
```
A more complete and complicated one:
```
mapptth https://google.com/mail -x .html -P /some-path -t 10 -m 5 -s -q -D 6 -T -o output.txt -H -S http://www.google.com/sitemap.xml
```
## TODO
ASAP:
- [X] Handling the \ tag
Without any priority:
- [ ] Add a parameter to control the connection rate
- [ ] Create logo (maybe)
- [X] Print when encountering mailto: or tel:
- [X] Add robots.txt parser
- [X] Add proxy support
- [X] Use regex in filters (disallowed paths, allowed paths, etc.)
- [X] Add examples in readme
- [X] More unit tests
- [X] Use only getopt to parse arguments
- [X] GraphViz support to generate graphs
- [X] Output to file
- [X] Add parameters to control: disallowed domains, only allowed paths and disallowed extensions