https://github.com/commonsearch/urlparse4
Faster replacement for Python's urlparse module
- Host: GitHub
- URL: https://github.com/commonsearch/urlparse4
- Owner: commonsearch
- License: apache-2.0
- Created: 2016-07-07T06:57:23.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-09-30T00:11:39.000Z (over 6 years ago)
- Last Synced: 2024-09-26T10:06:06.726Z (4 months ago)
- Language: Python
- Homepage:
- Size: 1.78 MB
- Stars: 46
- Watchers: 6
- Forks: 6
- Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- starred-awesome - urlparse4 - Faster replacement for Python's urlparse module (Python)
README
# urlparse4
`urlparse4` is a performance-focused replacement for Python's `urlparse` module, using C++ code from Chromium's own URL parser.
It is not production-ready yet.
Many credits go to [gurl-cython](https://github.com/Preetwinder/gurl-cython) for inspiration.
## Differences with Python's `urlparse`
`urlparse4` should be a transparent, drop-in replacement in almost all cases. Still, there are a few differences to be aware of:
- `urlparse4` is 2-7x faster for most operations (see benchmarks below)
- `urlparse4` currently doesn't pass CPython's `test_urlparse.py` suite because of edge cases that Chromium's parser handles differently (usually in accordance with the RFCs, which `urlparse` doesn't follow entirely).
- `urlparse4` only supports Python 2.7 for now

## How to install
```
pip install urlparse4
```

## How to use
The most straightforward way to use `urlparse4` is to replace your imports of `urlparse` with this:
```
import urlparse4 as urlparse
```

You could also monkey-patch its methods in the global `urlparse` module like this:
```
import urlparse4
import urlparse

for attr in dir(urlparse4):
    setattr(urlparse, attr, getattr(urlparse4, attr))
```
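As a rough illustration of the drop-in usage described above, the sketch below calls `urlsplit`, `.hostname` and `urljoin` through whichever module is available. The fallback to the standard library is an assumption added only so the snippet runs where `urlparse4` is not installed, and the example URL is made up. It also assumes `urlparse4` exposes the same names as the standard module, which the "drop-in replacement" claim suggests but this snippet does not verify.

```python
# Prefer urlparse4 when it is installed; otherwise fall back to the standard
# library so the same calls still work (the fallback is an assumption made
# for this sketch, not something the project prescribes).
try:
    import urlparse4 as urlparse          # Python 2.7 with urlparse4 installed
except ImportError:
    try:
        import urlparse                   # Python 2 standard library
    except ImportError:
        import urllib.parse as urlparse   # Python 3 standard library

url = "http://www.example.com/a/b?x=1#frag"   # hypothetical example URL

# Basic split into scheme, netloc, path, query and fragment.
parts = urlparse.urlsplit(url)

# Hostname extraction from the split result.
hostname = parts.hostname

# Relative link resolution against the base URL.
sibling = urlparse.urljoin(url, "c")

print("%s %s %s" % (parts.scheme, hostname, parts.path))
print(sibling)
```

Because of the edge-case differences mentioned earlier, results for unusual URLs may not match between `urlparse4` and the standard module.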
## How to test
You must have Docker installed and running. You can run CPython's test suite for `urlparse` like this:
```
make docker_build
make docker_test
```

## Benchmarks
We are testing the following libraries on a sample of 100k URLs from Blink and DMOZ:
- urlparse4 ;-)
- [CPython's urlparse](https://github.com/python/cpython/blob/2.7/Lib/urlparse.py)
- [urlparse2](https://github.com/mwhooker/urlparse2)
- [YURL](http://github.com/homm/yurl/)
- [uritools](https://github.com/tkem/uritools)
- [pygurl / gurl-cython](https://github.com/Preetwinder/gurl-cython)
- [cyuri](https://github.com/mitghi/cyuri)

Each of them is tested on a few different types of operations (basic urlsplit, relative link resolution, hostname extraction).
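To give an idea of what timing these three operation types can look like, here is a minimal sketch that is not the project's actual benchmark code. It uses the standard library parser as a stand-in, two made-up URLs instead of the real sample, and hypothetical helper names (`time_calls`, `summarize`). The `urljoin_sibling` operation is a guess at what the benchmark name means (resolving a sibling-relative link), and the percentile uses a simple nearest-rank rule.

```python
import time

try:
    from urllib.parse import urlsplit, urljoin   # Python 3
except ImportError:
    from urlparse import urlsplit, urljoin       # Python 2


def time_calls(func, urls, repeat=10):
    """Return one duration (in seconds) per call of func over urls."""
    timer = getattr(time, "perf_counter", time.time)
    durations = []
    for _ in range(repeat):
        for url in urls:
            start = timer()
            func(url)
            durations.append(timer() - start)
    return durations


def summarize(durations):
    """Compute sum, mean, median and 90th percentile (nearest rank)."""
    ordered = sorted(durations)
    n = len(ordered)
    return {
        "sum": sum(ordered),
        "mean": sum(ordered) / n,
        "median": ordered[n // 2],
        "p90": ordered[min(n - 1, int(n * 0.9))],
    }


# Hypothetical stand-ins for the three operation types.
OPERATIONS = {
    "urlsplit": lambda url: urlsplit(url),
    "urljoin_sibling": lambda url: urljoin(url, "sibling.html"),
    "hostname": lambda url: urlsplit(url).hostname,
}

# Made-up sample; the real benchmark uses about 100k URLs.
sample_urls = ["http://www.example.com/a/b?x=1", "https://example.org/index.html"]

results = {}
for name, operation in OPERATIONS.items():
    results[name] = summarize(time_calls(operation, sample_urls))

for name, stats in sorted(results.items()):
    print("%-16s %.6f %.3e %.3e %.3e"
          % (name, stats["sum"], stats["mean"], stats["median"], stats["p90"]))
```

Timing each call individually adds timer overhead to every sample, so absolute numbers from a sketch like this are only useful for comparing libraries measured the same way.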
Here is how to launch the tests:
```
make docker_build
make docker_benchmark
```

Current results on a 2.2GHz Intel Core i7 MBP (in seconds):
```
Benchmark results on 104300 URLs x 10 times, in seconds:

Name              Sum            Mean               Median             90%
----------------  -------------  -----------------  -----------------  -----------------

urlsplit:
----              ----           ----               ----               ----
urlparse4         1.681858       1.61251965484e-06  1.99999999984e-06  2.00000000006e-06
pygurl            2.031712       1.94795014382e-06  1.99999999984e-06  2.00000000028e-06
uritools          2.638991       2.53019271333e-06  2.00000000028e-06  3.00000000042e-06
yurl              3.910247       3.74903835091e-06  3.00000000131e-06  4.99999999981e-06
urlparse2         3.756782       3.60190028763e-06  2.99999999953e-06  4.00000000056e-06
urlparse          3.862006       3.70278619367e-06  3.00000000308e-06  4.99999999803e-06
cyuri             9.912275       9.50361936721e-06  8.00000000112e-06  1.30000000027e-05

urljoin_sibling:
----              ----           ----               ----               ----
urlparse4         2.008453       1.92565004794e-06  2.00000000206e-06  2.00000000206e-06
pygurl            2.193427       2.10299808245e-06  2.00000000206e-06  2.99999999953e-06
uritools          10.575344      1.01393518696e-05  9.99999999607e-06  1.20000000052e-05
yurl              13.213052      1.26683144775e-05  1.19999999981e-05  1.60000000022e-05
urlparse2         14.239327      1.36522790029e-05  1.19999999981e-05  1.69999999997e-05
urlparse          9.25991500001  8.87815436242e-06  8.00000000822e-06  1.10000000006e-05
cyuri             5.742724       5.50596740172e-06  5.00000000159e-06  7.00000001075e-06

hostname:
----              ----           ----               ----               ----
urlparse4         1.883982       1.80631064237e-06  1.99999999495e-06  2.00000000916e-06
pygurl            1.67332099999  1.60433461169e-06  1.99999999495e-06  2.00000000916e-06
uritools          3.31632199999  3.17959923297e-06  3.00000000664e-06  4.00000000411e-06
yurl              3.853319       3.69445733461e-06  3.00000000664e-06  4.00000000411e-06
urlparse2         4.641513       4.45015627996e-06  4.00000000411e-06  5.99999999906e-06
urlparse          5.122682       4.91148801534e-06  4.00000000411e-06  5.99999999906e-06
cyuri             11.108649      1.06506701822e-05  9.0000000057e-06   1.5999999988e-05
```

Some libraries are included in the benchmark code but disabled for various reasons:
- [urlparse3](https://pypi.python.org/pypi/urlparse3/) (Raises on valid URLs)
- [slimurl](https://github.com/mosquito/slimurl) (Too slow)

Feel free to submit pull requests to add new ones!
## Feedback
We'd love to hear your feedback! Feel free to look at the issues on GitHub and open new ones if needed :)