Projects in Awesome Lists tagged with url-normalization

https://github.com/sindresorhus/normalize-url

Normalize a URL

compare-urls npm-package sanitize-url url-normalization

Last synced: 14 May 2025

https://github.com/patternhelloworld/url-knife

Extract and decompose (fuzzy) URLs, including email addresses (which are conceptually a part of URLs), from text, with Area-Pattern-based modularity

email-extractor email-parser email-parsing pre-processing uri-template url-extractor url-normalization url-normalizer url-parser url-parsing url-validation

Last synced: 09 Jul 2025

https://github.com/xojoc/cleanurl

Remove clutter from URLs and return a canonicalized version

url url-normalization

Last synced: 05 Apr 2026

https://github.com/hanover-computing/canonicize-url

Get a stable, canonical version of any URL, with DNS and HTTPS checks, redirects, tracker stripping, and canonical link extraction!

amp canonical canonical-urls compare-urls javascript normalize-url npm-package privacy sanitize-url ssrf tracker tracking url-normalization

Last synced: 28 Jul 2025

https://github.com/vladkens/url-normalize

🔗🧹 Normalize URLs to a standardized form. HTTPS by default, flexible configuration, custom protocols, domain extraction, URL humanizing, and punycode support. Both CJS & ESM modules available.

cjs esm normalization normalizer npm-package punycode typescript url url-normalization url-normalizer

Last synced: 24 Apr 2025

https://github.com/seroperson/urlopt4s

Allows you to remove ad/tracking query params from a given URL in Scala

adguard graaljs js query-params-filtering scala url-canonicalization url-normalization url-query

Last synced: 07 Mar 2026

https://github.com/opensite-ai/domain_extractor

🔗 Lightweight Ruby library for parsing URLs and extracting domain components with accurate multi-part TLD support. Handles nested subdomains, query parameters, and URL normalization. Perfect for web scraping, analytics, and URL manipulation. Built on URI and public_suffix gem.

analytics domain-analysis domain-extraction domain-parser public-suffix ruby ruby-library rubygem subdomain-parser tld-parser url-manipulation url-normalization url-parser url-parsing web-scraping

Last synced: 12 Dec 2025

https://github.com/chipslays/php-url-fingerprint

🔗 Pathor is a PHP library for normalizing, analyzing, and comparing URLs.

fingerprint url url-fingerprint url-normalization url-normalizer

Last synced: 09 Feb 2026

https://github.com/manu-sh/http_normalizer

HTTP URL normalization for web crawlers

crawler http spider url-normalization

Last synced: 12 Jun 2025

https://github.com/simonpierreboucher/crawler

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

concurrent-crawling content-extraction data-collection data-extraction-pipeline data-preservation-and-recovery data-scraping error-handling html-parsing http-requests metadata-storage modular-design pdf-text-extraction python-crawler rate-limiting structured-data-storage text-processing url-normalization web-crawling yaml-configuration