https://github.com/xojoc/cleanurl

Remove clutter from URLs and return a canonicalized version
https://github.com/xojoc/cleanurl

url url-normalization

Last synced: 3 months ago
JSON representation

Remove clutter from URLs and return a canonicalized version

Host: GitHub
URL: https://github.com/xojoc/cleanurl
Owner: xojoc
License: agpl-3.0
Created: 2022-02-06T11:03:26.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2024-06-03T10:08:59.000Z (about 2 years ago)
Last Synced: 2026-01-05T11:18:24.906Z (6 months ago)
Topics: url, url-normalization
Language: Python
Homepage: https://pypi.org/project/cleanurl/
Size: 77.1 KB
Stars: 21
Watchers: 2
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # cleanurl

Remove clutter from URLs and return a canonicalized version

# Install

```

pip install cleanurl

```

or if you're using poetry:

```

poetry add cleanurl

```

# Usage

By default *cleanurl* retuns a cleaned URL without respecting semantics.

For example:

```

>>> import cleanurl

>>> r = cleanurl.cleanurl('https://www.xojoc.pw/blog/focus.html?utm_content=buffercf3b2&utm_medium=social&utm_source=snapchat.com&utm_campaign=buffe')

>>> r.url

'https://xojoc.pw/blog/focus'

>>> r.parsed_url

ParseResult(scheme='https', netloc='xojoc.pw', path='/blog/focus', params='', query='', fragment='')

```

The default parameters are useful if you want to get a *canonical* URL without caring if the resulting URL is still valid.

If you want to get a clean URL which is still valid call it like this:

```

>>> r = cleanurl.cleanurl('https://www.xojoc.pw/blog/////focus.html', respect_semantics=True)

>>> r.url

'https://www.xojoc.pw/blog/focus.html'

```

```celeanurl.cleanurl``` parameters:

 - ```generic``` -> if True don't use site specific rules

 - ```respect_semantics``` -> if True make sure the returned URL is still valid, altough it may still contain some superfluous elements

 - ```host_remap``` -> whether to remap hosts. Example:

```

>>> import cleanurl

>>> cleanurl.cleanurl('https://threadreaderapp.com/thread/1453753924960219145', host_remap=True).url

'https://twitter.com/i/status/1453753924960219145'

>>> cleanurl.cleanurl('https://threadreaderapp.com/thread/1453753924960219145', host_remap=False).url

'https://threadreaderapp.com/thread/1453753924960219145'

```

For more examples see the [unit tests](https://github.com/xojoc/cleanurl/blob/main/src/test_cleanurl.py).

# Why?

While there are some libraries that handle general cases, this library has website specific rules that more aggresivly normalize urls.

# Users

Initially used for [discu.eu](https://discu.eu).

[Discussions around the web](https://discu.eu/q/https://github.com/xojoc/cleanurl)

# Who?

*cleanurl* was written by [Alexandru Cojocaru](https://xojoc.pw).

# License

*cleanurl* is [Free Software](https://www.gnu.org/philosophy/free-sw.html) and is released as [AGPLv3](https://github.com/xojoc/cleanurl/blob/main/LICENSE)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/xojoc/cleanurl

Awesome Lists containing this project

README