https://github.com/9dl/robotssniffer
Tool to analyze and parse website robots.txt for crawler rules.
- Host: GitHub
- URL: https://github.com/9dl/robotssniffer
- Owner: 9dl
- License: cc0-1.0
- Created: 2024-11-25T20:00:58.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-25T20:09:09.000Z (over 1 year ago)
- Last Synced: 2025-02-07T22:20:46.010Z (about 1 year ago)
- Topics: csharp, parallel-processing, robots-txt, web-crawling, web-scraping
- Language: C#
- Homepage:
- Size: 48.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE
# RobotsSniffer

**RobotsSniffer** is a command-line tool written in C# for analyzing the `robots.txt` files of websites. The tool
retrieves and parses `robots.txt` files to determine which paths are allowed or disallowed for web crawlers, helping
users understand site restrictions and accessibility rules.
## Features
- Retrieve and analyze `robots.txt` files from a single URL or a list of URLs.
- Parse the `robots.txt` file to display **allowed** and **disallowed** paths.
- Optionally save the results to an output file.
- Multi-threaded processing for improved performance when working with multiple URLs.
- Configurable timeout for HTTP requests.
- Extract `Sitemap` URLs from `robots.txt`.
---
## Requirements
- .NET 9 (for Compiling/Debugging)
- Brain (optional)
---
## Usage
### Syntax
```bash
RobotsSniffer -u <url> | -l <file> [-o <output-file>] [-timeout <milliseconds>]
```
### Arguments
| Argument                  | Description                                                                 |
|---------------------------|-----------------------------------------------------------------------------|
| `-u <url>`                | Analyze the `robots.txt` file of a single URL.                              |
| `-l <file>`               | Provide a file containing multiple URLs (one per line) to analyze in batch. |
| `-o <output-file>`        | Save the results to the specified file. Optional.                           |
| `-timeout <milliseconds>` | Set the HTTP request timeout in milliseconds (default: 5000).               |
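
The `-timeout` flag maps naturally onto `HttpClient.Timeout`. A minimal sketch of how such a flag could be wired up (the parsing loop and the `BuildClient` helper are illustrative assumptions, not the tool's actual code):

```csharp
using System;
using System.Net.Http;

class TimeoutSketch
{
    // Build an HttpClient whose timeout comes from a "-timeout <ms>" argument,
    // falling back to the 5000 ms default documented in the table above.
    internal static HttpClient BuildClient(string[] args)
    {
        var timeoutMs = 5000;
        for (var i = 0; i < args.Length - 1; i++)
            if (args[i] == "-timeout" && int.TryParse(args[i + 1], out var ms))
                timeoutMs = ms;
        return new HttpClient { Timeout = TimeSpan.FromMilliseconds(timeoutMs) };
    }

    static void Main(string[] args)
    {
        using var client = BuildClient(args);
        Console.WriteLine($"[?] Timeout: {client.Timeout.TotalMilliseconds} ms");
    }
}
```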
---
### Examples
#### Analyze a Single URL
```bash
RobotsSniffer -u https://example.com
```
Output:
```plaintext
[>] Url: https://example.com
[+] Checking url...
[+] Robots.txt found.
[?] Robots.txt content:
[?] Allowed:
[+] /
[?] Disallowed:
[-] /admin
[-] /private
```
#### Analyze a List of URLs
```bash
RobotsSniffer -l urls.txt -o output.txt
```
Where `urls.txt` contains:
```plaintext
https://example.com
https://another-site.com
```
Output:
- Results are printed to the console and saved in `output.txt`.
---
## How It Works
1. **Argument Parsing**:
The tool validates and processes the command-line arguments to determine the mode of operation:
- Single URL (`-u`).
- Multiple URLs from a file (`-l`).
2. **Fetching `robots.txt`**:
For each URL, the tool attempts to fetch the `robots.txt` file by appending `/robots.txt` to the base URL.
3. **Parsing the Content**:
The `robots.txt` content is parsed to extract **allowed** (`Allow`) and **disallowed** (`Disallow`) paths.
4. **Output**:
Results are displayed in the console and optionally written to the specified output file.
5. **Parallel Processing**:
When analyzing multiple URLs, the tool uses multithreading (`Parallel.ForEach`) to process URLs concurrently for
better performance.
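
The steps above can be sketched roughly as follows. The method names (`FetchRobots`, `ParseRules`) and the sample data in `Main` are assumptions for illustration, not the tool's actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

class RobotsSketch
{
    static readonly HttpClient Client = new HttpClient { Timeout = TimeSpan.FromMilliseconds(5000) };

    // Step 2: append /robots.txt to the base URL and fetch it.
    internal static string FetchRobots(string baseUrl) =>
        Client.GetStringAsync(baseUrl.TrimEnd('/') + "/robots.txt").GetAwaiter().GetResult();

    // Step 3: collect Allow/Disallow paths line by line.
    internal static (List<string> Allowed, List<string> Disallowed) ParseRules(string content)
    {
        var allowed = new List<string>();
        var disallowed = new List<string>();
        foreach (var raw in content.Split('\n'))
        {
            var line = raw.Trim();
            if (line.StartsWith("Allow:", StringComparison.OrdinalIgnoreCase))
                allowed.Add(line.Substring("Allow:".Length).Trim());
            else if (line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
                disallowed.Add(line.Substring("Disallow:".Length).Trim());
        }
        return (allowed, disallowed);
    }

    static void Main()
    {
        // Step 5: with multiple inputs, Parallel.ForEach processes them concurrently.
        // Hardcoded sample content stands in for a real fetch here.
        var samples = new Dictionary<string, string>
        {
            ["https://example.com"] = "Allow: /\nDisallow: /admin\nDisallow: /private",
        };
        Parallel.ForEach(samples, pair =>
        {
            var (allowed, disallowed) = ParseRules(pair.Value);
            Console.WriteLine($"[>] {pair.Key}: {allowed.Count} allowed, {disallowed.Count} disallowed");
        });
    }
}
```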
## Future Improvements
- Enhanced error reporting and logging.
- Option to customize the number of concurrent threads for URL processing.
- HTTP header customization (e.g., user-agent string).
## Contributing
Contributions are welcome! If you'd like to add features, improve performance, or fix issues, feel free to submit a pull
request.
---
### Author
RobotsSniffer was created as a utility for web analysis, helping users understand how websites interact with web
crawlers. The author takes no responsibility for misuse of this tool.