{"id":19092800,"url":"https://github.com/eon01/urlpy2","last_synced_at":"2025-06-14T03:41:51.657Z","repository":{"id":41195534,"uuid":"508706529","full_name":"eon01/urlpy2","owner":"eon01","description":"URL parsing, cleanup, canonicalization, equivalence and tracking remover","archived":false,"fork":false,"pushed_at":"2022-07-02T08:25:16.000Z","size":46,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-30T16:11:17.329Z","etag":null,"topics":["privacy","privacy-tools","seo","seotools","tracking","url","urlpy","urls"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eon01.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-06-29T13:38:27.000Z","updated_at":"2024-02-14T14:39:52.000Z","dependencies_parsed_at":"2022-08-25T18:03:03.395Z","dependency_job_id":null,"html_url":"https://github.com/eon01/urlpy2","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eon01%2Furlpy2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eon01%2Furlpy2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eon01%2Furlpy2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eon01%2Furlpy2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eon01","download_url":"https://codeload.github.com/eon01/urlpy2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind
":"github","repositories_count":248933368,"owners_count":21185464,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["privacy","privacy-tools","seo","seotools","tracking","url","urlpy","urls"],"created_at":"2024-11-09T03:22:01.062Z","updated_at":"2025-04-18T13:34:47.223Z","avatar_url":"https://github.com/eon01.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# URLPY2\n\nurlpy2 is a small library for URL parsing, cleanup, canonicalization and equivalence.\n \nAt the heart of the `urlpy` package is the `URL` object. You can get one by\npassing in a unicode or string object into the top-level `parse` method. 
All\nstrings are assumed to be Unicode:\n\n```python\nimport urlpy2 as urlpy\nmyurl = urlpy.parse('http://foo.com')\n```\n\nThe workflow is that you'll chain a number of transformations together to get the type\nof URL you're after:\n\n```python\n# Defrag, remove some parameters and give me a string\nstr(urlpy.parse(...).defrag().deparam(['utm_source']))\n\n# Escape the path, and punycode the host, and give me a string\nstr(urlpy.parse(...).escape().punycode())\n\n# Give me the absolute path url as some encoding\nstr(urlpy.parse(...).abspath()).encode('some encoding')\n```\n\n## Installation\n\n```\npip install urlpy2\n```\n\n## URL Equivalence\n\nURL objects compared with `==` are interpreted very strictly, but for a more\nlax interpretation, consider using `equiv` to test if two urls are functionally\nequivalent:\n\n```python\na = urlpy.parse(u'https://föo.com:443/a/../b/.?b=2\u0026\u0026\u0026\u0026\u0026\u0026a=1')\nb = urlpy.parse(u'https://xn--fo-fka.COM/b/?a=1\u0026b=2')\n\n# These urls are not equal\nassert(a != b)\n\n# But they are equivalent\nassert(a.equiv(b))\nassert(b.equiv(a))\n```\n\nThis equivalence test takes default ports for common schemes into account (so\ntwo urls with the same scheme still match when only one of them explicitly\nspecifies the default port), as well as punycoding, case of the host name, and\nparameter order.\n\n\n## Absolute URLs\n\nYou can perform many operations on relative urls (those without a hostname),\nbut punycoding and unpunycoding are not among them. 
You can also tell whether\nor not a url is absolute:\n\n```python\na = urlpy.parse('foo/bar.html')\nassert(not a.absolute())\n```\n\n## Chaining\n\nMany of the methods on the `URL` class can be chained to produce a number of\neffects in sequence:\n\n```python\nimport urlpy2 as urlpy\n\n# Create a url object\nmyurl = urlpy.URL.parse('http://www.FOO.com/bar?utm_source=foo#what')\n# Remove some parameters and the fragment\nprint(myurl.defrag().deparam(['utm_source']))\n```\n\nIn fact, any method that does not explicitly return a string may\nbe chained.\n\n\n### `canonical`\n\nAccording to the RFC, the order of parameters is not supposed to matter. In\npractice it can matter (depending on how the server matches URL routes), so it's\nhelpful to be able to put parameters in a canonical ordering. This\nordering is alphabetical:\n\n```python\n\u003e\u003e\u003e str(urlpy.parse('http://foo.com/?b=2\u0026a=1\u0026d=3').canonical())\n'http://foo.com/?a=1\u0026b=2\u0026d=3'\n```\n\n\n### `defrag`\n\nRemove any fragment identifier from the url. The fragment isn't part of the request\nthat gets sent to an HTTP server, and so it's often useful to remove the\nfragment when doing url comparisons:\n\n```python\n\u003e\u003e\u003e str(urlpy.parse('http://foo.com/#foo').defrag())\n'http://foo.com/'\n```\n\n### `deparam`\n\nSome parameters that are commonly added to urls may not be of interest, or\nmay even be misleading. Common examples include referring pages, `utm_source`\nand session ids. 
To strip out all such parameters from your url:\n\n```python\n\u003e\u003e\u003e str(urlpy.parse('http://foo.com/?do=1\u0026not=2\u0026want=3\u0026this=4').deparam(['do', 'not', 'want']))\n'http://foo.com/?this=4'\n```\n\n### `r_deparam`\n\nSame as `deparam`, but matches parameter names against regular expressions:\n\n\n```python\n\u003e\u003e\u003e str(urlpy.parse('http://foo.com/?utm_a=1\u0026utm_b=2\u0026utm_c=3\u0026utm_d=4').r_deparam(['utm_*',]))\n'http://foo.com/'\n```\n\n### `remove_tracking`\n\nRemoves all tracking and referral marketing parameters from the URL based on the [ClearURLs list](https://gitlab.com/ClearURLs/rules/-/raw/master/data.min.json).\n\n```python\n\u003e\u003e\u003e str(urlpy.parse('https://www.google.com/search?q=python\u0026oq=python\u0026aqs=chrome..69i57j0l5.8984j0j7\u0026sourceid=chrome\u0026ie=UTF-8').remove_tracking())\n'https://www.google.com/search?q=python'\n```\n\nTo keep the referral marketing parameters, use `remove_tracking(remove_referall_marketing=False)`.\n\n### `abspath`\n\nLike its `os.path` namesake, this makes sure that the path of the url is\nabsolute. This includes removing redundant forward slashes, `.` and `..`:\n\n```python\n\u003e\u003e\u003e str(urlpy.parse('http://foo.com/foo/./bar/../a/b/c/../../d').abspath())\n'http://foo.com/foo/a/d'\n```\n\n### `escape`\n\nNon-ASCII characters in the path are typically encoded as UTF-8 and then\nescaped as `%HH`, where each `H` is a hexadecimal digit. 
It's important to note that\nthe `escape` function is idempotent, and can be called repeatedly:\n\n```python\n\u003e\u003e\u003e str(urlpy.parse(u'http://foo.com/ümlaut').escape())\n'http://foo.com/%C3%BCmlaut'\n\u003e\u003e\u003e str(urlpy.parse(u'http://foo.com/ümlaut').escape().escape())\n'http://foo.com/%C3%BCmlaut'\n```\n\n### `unescape`\n\nIf you have a URL that might have been escaped before it was given to you, but\nyou'd like to display something a little more meaningful than `%C3%BCmlaut`,\nyou can unescape the path:\n\n```python\n\u003e\u003e\u003e print(urlpy.parse('http://foo.com/%C3%BCmlaut').unescape())\nhttp://foo.com/ümlaut\n```\n\n## Properties\n\nMany attributes are available on URL objects:\n\n- `scheme` -- empty string if URL is relative\n- `host` -- `None` if URL is relative\n- `hostname` -- like `host`, but empty string if URL is relative\n- `port` -- `None` if absent (or removed)\n- `path` -- always with a leading `/`\n- `params` -- string of params following the `;` (with extra `;`'s removed)\n- `query` -- string of queries following the `?` (with extra `?`'s and `\u0026`'s removed)\n- `fragment` -- empty string if absent\n- `absolute` -- a `bool` indicating whether the URL is absolute\n- `unicode` -- a unicode version of the URL\n\n\n## Running tests\n\n```bash\n./configure\npytest\n```\n\n\n## Credits and License\n\n- urlpy2 was originally forked from [nexB/urlpy](https://github.com/nexB/urlpy), which is derived from Moz's [url.py v0.2.0](https://github.com/seomoz/url-py) and has been simplified to run on Python 2 and Python 3 using a pure Python library. (Newer versions of Moz's url.py use a C++ extension.)\n- urlpy2 uses [ClearURLs rules data](https://gitlab.com/ClearURLs/rules) licensed under the GNU Lesser General Public License. Refer to the original author and license if you'd like to update, distribute or copy their work. 
\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feon01%2Furlpy2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feon01%2Furlpy2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feon01%2Furlpy2/lists"}