{"id":13840115,"url":"https://github.com/ameenmaali/urldedupe","last_synced_at":"2025-07-11T07:32:32.136Z","repository":{"id":41534044,"uuid":"268668624","full_name":"ameenmaali/urldedupe","owner":"ameenmaali","description":"Pass in a list of URLs with query strings, get back a unique list of URLs and query string combinations","archived":false,"fork":false,"pushed_at":"2020-06-17T23:15:40.000Z","size":282,"stargazers_count":313,"open_issues_count":4,"forks_count":53,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-08-05T17:24:44.628Z","etag":null,"topics":["bugbounty","cpp","hacking","infosec","penetration-testing","url-parser"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ameenmaali.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-06-02T01:06:44.000Z","updated_at":"2024-08-05T07:48:44.000Z","dependencies_parsed_at":"2022-07-15T07:47:42.869Z","dependency_job_id":null,"html_url":"https://github.com/ameenmaali/urldedupe","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ameenmaali%2Furldedupe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ameenmaali%2Furldedupe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ameenmaali%2Furldedupe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ameenmaali%2Furldedupe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ameenmaali","download_url":"https://codeload.g
ithub.com/ameenmaali/urldedupe/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225705232,"owners_count":17511250,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bugbounty","cpp","hacking","infosec","penetration-testing","url-parser"],"created_at":"2024-08-04T17:00:42.161Z","updated_at":"2024-11-21T09:30:58.625Z","avatar_url":"https://github.com/ameenmaali.png","language":"C++","funding_links":[],"categories":["C++","C++ (225)"],"sub_categories":[],"readme":"# urldedupe\n\nurldedupe is a tool to quickly pass in a list of URLs, and get back a list of deduplicated (unique)\nURL and query string combinations. This is useful to ensure you don't have a URL list with hundreds of duplicated parameters\nwith differing qs values. For an example run, take the following URL list passed in:\n\n```\nhttps://google.com\nhttps://google.com/home?qs=value\nhttps://google.com/home?qs=secondValue\nhttps://google.com/home?qs=newValue\u0026secondQs=anotherValue\nhttps://google.com/home?qs=asd\u0026secondQs=das\n```\n\nPassing through `urldedupe` will only maintain the non-duplicate URL \u0026 query string (ignoring values) combinations:\n\n```\n$ cat urls.txt | urldedupe\nhttps://google.com\nhttps://google.com/home?qs=value\nhttps://google.com/home?qs=newValue\u0026secondQs=anotherValue\n```\n\nIt's also possible to deduplicate similar URLs. 
This is done with `-s|--similar` flag, to deduplicate endpoints such as API endpoints with different IDs, or assets:\n\n```\n$ cat urls.txt\nhttps://site.com/api/users/123\nhttps://site.com/api/users/222\nhttps://site.com/api/users/412/profile\nhttps://site.com/users/photos/photo.jpg\nhttps://site.com/users/photos/myPhoto.jpg\nhttps://site.com/users/photos/photo.png\n```\n\nBecomes:\n\n```\n$ cat urls.txt | urldedupe -s\nhttps://site.com/api/users/123\nhttps://site.com/api/users/412/profile\nhttps://site.com/users/photos/photo.jpg\n```\n\nWhy C++? Because it's super fast?!?! No not really, I'm working on my C++ skills and mostly just wanted to create a real-world C++ project as opposed to educational related work.\n\n## Installation\nUse the binary already compiled within the repository...Or better yet to not run a random binary from myself who can be very shady, compile from source:\n\nYou'll need `cmake` installed and C++ 17 or higher.\n\nClone the repository \u0026 navigate to it:\n```\ngit clone https://github.com/ameenmaali/urldedupe.git\ncd urldedupe\n```\n\nIn the `urldedupe` directory\n```\ncmake CMakeLists.txt\n```\n\nIf you don't have `cmake` installed, do that. On Mac OS X it is:\n```\nbrew install cmake\n```\n\nRun make:\n```\nmake\n```\n\nThe `urldedupe` binary should now be created in the same directory. 
For easy use, you can move it to your `bin` directory.\n\n## Usage\n`urldedupe` takes URLs from stdin, or from a file with the `-u` flag. You will most likely want them in a file such as:\n```\n$ cat urls.txt\nhttps://google.com/home/?q=2\u0026d=asd\nhttps://my.site/profile?param1=1\u0026param2=2\nhttps://my.site/profile?param3=3\n```\n\n## Help\n```\n$ ./urldedupe -h\n(-h|--help) - Usage/help info for urldedupe\n(-u|--urls) - Filename containing urls (use this if you don't pipe urls via stdin)\n(-V|--version) - Get current version for urldedupe\n(-r|--regex-parse) - This is significantly slower than normal parsing, but may be more thorough or accurate\n(-s|--similar) - Remove similar URLs (based on integers and image/font files) - i.e. /api/user/1 \u0026 /api/user/2 deduplicated\n(-qs|--query-strings-only) - Only include URLs if they have query strings\n(-ne|--no-extensions) - Do not include URLs if they have an extension (i.e. .png, .jpg, .woff, .js, .html)\n(-m|--mode) - The mode/filters to be enabled (can be 1 or more, comma separated). 
Default is none, available options are the other flags (--mode \"r,s,qs,ne\")\n```\n\n## Examples\n\nSimply pass URLs from stdin or with the `-u` flag:\n\n`./urldedupe -u urls.txt`\n\nAfter moving the `urldedupe` binary to your `bin` dir, pass in a list from stdin and save to a file:\n\n`cat urls.txt | urldedupe \u003e deduped_urls.txt`\n\nDeduplicate similar URLs with `-s|--similar` flag, such as API endpoints with different IDs, or assets:\n\n`cat urls.txt | urldedupe -s`\n\n```\nhttps://site.com/api/users/123\nhttps://site.com/api/users/222\nhttps://site.com/api/users/412/profile\nhttps://site.com/users/photos/photo.jpg\nhttps://site.com/users/photos/myPhoto.jpg\nhttps://site.com/users/photos/photo.png\n```\n\nBecomes:\n\n```\nhttps://site.com/api/users/123\nhttps://site.com/api/users/412/profile\nhttps://site.com/users/photos/photo.jpg\n```\n\nFor all the bug bounty hunters, I recommend chaining with tools such as `waybackurls` or `gau` to get back only unique URLs, as those sources are prone to have many similar/duplicated URLs:\n\n`cat waybackurls | urldedupe \u003e deduped_urls.txt`\n\nFor max thoroughness (usually not necessary), you can use an RFC compliant regex for URL parsing, but it is significantly slower for large data sets:\n\n`cat urls.txt | urldedupe -r \u003e deduped_urls_regex.txt`\n\nAlternatively, use `-m|--mode` with the flag values you'd like to run with. For example, if you want\nto get URLs deduped based on similarity, include only URLs that have query strings, and do not have extensions...\n\nInstead of:\n\n`urldedupe -u urls.txt -s -qs -ne`\n\nYou can also do:\n\n`urldedupe -u urls.txt -m \"s,qs,ne\"`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fameenmaali%2Furldedupe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fameenmaali%2Furldedupe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fameenmaali%2Furldedupe/lists"}