# RobotsSniffer

![image.png](image.png)

**RobotsSniffer** is a command-line tool written in C# for analyzing the `robots.txt` files of websites. The tool retrieves and parses `robots.txt` files to determine which paths are allowed or disallowed for web crawlers, helping users understand site restrictions and accessibility rules.

## Features

- Retrieve and analyze `robots.txt` files from a single URL or a list of URLs.
- Parse the `robots.txt` file to display **allowed** and **disallowed** paths.
- Optionally save the results to an output file.
- Multi-threaded processing for improved performance when working with multiple URLs.
- Configurable timeout for HTTP requests.
- Extracts `Sitemap` URLs from `robots.txt`.

---

## Requirements

- .NET 9 (for compiling/debugging)
- Brain (optional)

---

## Usage

### Syntax

```bash
RobotsSniffer -u <url> | -l <url-list> [-o <output-file>] [-timeout <ms>]
```

### Arguments

| Argument        | Description                                                                 |
|-----------------|-----------------------------------------------------------------------------|
| `-u <url>`      | Analyze the `robots.txt` file of a single URL.                              |
| `-l <url-list>` | Provide a file containing multiple URLs (one per line) to analyze in batch. |
| `-o <output>`   | Save the results to the specified file. Optional.                           |
| `-timeout <ms>` | Set the HTTP request timeout in milliseconds (default: 5000).               |

---

### Examples

#### Analyze a Single URL

```bash
RobotsSniffer -u https://example.com
```

Output:

```plaintext
[>] Url: https://example.com
[+] Checking url...
[+] Robots.txt found.
[?] Robots.txt content:
[?] Allowed:
[+] /
[?] Disallowed:
[-] /admin
[-] /private
```

#### Analyze a List of URLs

```bash
RobotsSniffer -l urls.txt -o output.txt
```

Where `urls.txt` contains:

```plaintext
https://example.com
https://another-site.com
```

Output:

- Results are printed to the console and saved in `output.txt`.

---

## How It Works

1. **Argument Parsing**:
   The tool validates and processes the command-line arguments to determine the mode of operation:
    - Single URL (`-u`).
    - Multiple URLs from a file (`-l`).

2. **Fetching `robots.txt`**:
   For each URL, the tool attempts to fetch the `robots.txt` file by appending `/robots.txt` to the base URL.

3. **Parsing the Content**:
   The `robots.txt` content is parsed to extract **allowed** (`Allow`) and **disallowed** (`Disallow`) paths.

4. **Output**:
   Results are displayed in the console and optionally written to the specified output file.

5. **Parallel Processing**:
   When analyzing multiple URLs, the tool uses multithreading (`Parallel.ForEach`) to process URLs concurrently for better performance.

## Future Improvements

- Enhanced error reporting and logging.
- Option to customize the number of concurrent threads for URL processing.
- HTTP header customization (e.g., user-agent string).

## Contributing

Contributions are welcome! If you'd like to add features, improve performance, or fix issues, feel free to submit a pull request.

---

### Author

RobotsSniffer was created as a utility tool for web analysis, helping users understand how websites interact with web crawlers. The author takes no responsibility for misuse of this tool.
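---

The parsing step described under "How It Works" can be sketched as a small helper. This is an illustrative sketch only, not the project's actual implementation; `RobotsParser` and its tuple return type are hypothetical names chosen for this example.

```csharp
// Minimal sketch of "Parsing the Content": extract Allow, Disallow,
// and Sitemap values from a robots.txt body, one directive per line.
using System;
using System.Collections.Generic;

static class RobotsParser
{
    public static (List<string> Allowed, List<string> Disallowed, List<string> Sitemaps)
        Parse(string robotsTxt)
    {
        var allowed = new List<string>();
        var disallowed = new List<string>();
        var sitemaps = new List<string>();

        foreach (var rawLine in robotsTxt.Split('\n'))
        {
            // Strip trailing comments and surrounding whitespace.
            var line = rawLine.Split('#')[0].Trim();
            if (line.Length == 0) continue;

            var colon = line.IndexOf(':');
            if (colon < 0) continue;

            var field = line[..colon].Trim();
            var value = line[(colon + 1)..].Trim();

            if (field.Equals("Allow", StringComparison.OrdinalIgnoreCase))
                allowed.Add(value);
            else if (field.Equals("Disallow", StringComparison.OrdinalIgnoreCase))
                disallowed.Add(value);
            else if (field.Equals("Sitemap", StringComparison.OrdinalIgnoreCase))
                sitemaps.Add(value);
        }
        return (allowed, disallowed, sitemaps);
    }
}
```

Field names are matched case-insensitively because real-world `robots.txt` files mix `Disallow:` and `disallow:` freely.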
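The fetching and parallel-processing steps, combined with the `-timeout` option, could look roughly like the following. This is an assumed sketch, not the tool's real code; `RobotsFetcher` and `BuildRobotsUrl` are hypothetical names, and the actual console output format may differ.

```csharp
// Sketch of fetching /robots.txt for many sites concurrently.
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class RobotsFetcher
{
    // "Fetching robots.txt": derive the file location by appending
    // /robots.txt to the base URL.
    public static string BuildRobotsUrl(string baseUrl) =>
        baseUrl.TrimEnd('/') + "/robots.txt";

    // "Parallel Processing": one shared HttpClient carries the configurable
    // timeout; Parallel.ForEach fans the requests out across threads.
    public static void FetchAll(string[] baseUrls, int timeoutMs = 5000)
    {
        using var client = new HttpClient { Timeout = TimeSpan.FromMilliseconds(timeoutMs) };
        Parallel.ForEach(baseUrls, url =>
        {
            try
            {
                var body = client.GetStringAsync(BuildRobotsUrl(url)).GetAwaiter().GetResult();
                Console.WriteLine($"[+] {url}: fetched {body.Length} bytes");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"[-] {url}: {ex.Message}");
            }
        });
    }
}
```

A single `HttpClient` is reused across all workers (it is thread-safe for concurrent requests), so the timeout is configured once rather than per URL.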