Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/riimu/kit-urlparser

RFC 3986 compliant url parsing library with PSR-7 Uri component
https://github.com/riimu/kit-urlparser

php php-library psr-7 rfc-3986 uri url-parsing

Last synced: about 2 months ago
JSON representation

RFC 3986 compliant url parsing library with PSR-7 Uri component

Awesome Lists containing this project

README

        

# RFC 3986 URL Parser #

*UrlParser* is PHP library that provides a [RFC 3986](https://tools.ietf.org/html/rfc3986)
compliant URL parser and a [PSR-7](http://www.php-fig.org/psr/psr-7/) compatible
URI component. The purpose of this library is to provide a parser that
accurately implements the RFC specification unlike the built in function
`parse_url()`, which differs from the specification in some subtle ways.

This library has two main purposes. The first to provide information from the
parsed URLs. To achieve this, the library implements the standard URI handling
interface from the PSR-7 and also provides additional methods that make it
easier to retrieve commonly used information from the URLs. The second purpose
is to also permit the modification of said URLs using the interface from the
PSR-7 standard in addition to few extra methods that make some tasks more
straightforward.

While this library is mainly intended for parsing URLs, the parsing is simply
based on the generic URI syntax. Thus, it is possible to use this library to
validate and parse any other types of URIs against the generic syntax. The
library does not perform any scheme specific validation for the URLs.

In addition to the default RFC 3986 compliant mode, the library also offers
options that allow parsing of URLs that contain UTF-8 characters in different
components of the URL while converting them to the appropriate percent encoded
and IDN ascii formats.

The API documentation is available at: http://kit.riimu.net/api/urlparser/

[![CI](https://img.shields.io/github/workflow/status/Riimu/Kit-UrlParser/CI/main?style=flat-square)](https://github.com/Riimu/Kit-UrlParser/actions)
[![Scrutinizer](https://img.shields.io/scrutinizer/quality/g/Riimu/Kit-UrlParser/main?style=flat-square)](https://scrutinizer-ci.com/g/Riimu/Kit-UrlParser/)
[![codecov](https://img.shields.io/codecov/c/github/Riimu/Kit-UrlParser/main?style=flat-square)](https://codecov.io/gh/Riimu/Kit-UrlParser)
[![Packagist](https://img.shields.io/packagist/v/riimu/kit-urlparser.svg?style=flat-square)](https://packagist.org/packages/riimu/kit-urlparser)

## Requirements ##

* The minimum supported PHP version is 5.6
* The library depends on the following external PHP libraries:
* [psr/http-message](https://packagist.org/packages/psr/http-message) (`^1.0`)
* The library depends on the following PHP Extensions
* [`intl`](http://php.net/manual/en/book.intl.php) (only required IDN support)

## Installation ##

### Installation with Composer ###

The easiest way to install this library is to use Composer to handle your
dependencies. In order to install this library via Composer, simply follow
these two steps:

1. Acquire the `composer.phar` by running the Composer
[Command-line installation](https://getcomposer.org/download/)
in your project root.

2. Once you have run the installation script, you should have the `composer.phar`
file in you project root and you can run the following command:

```
php composer.phar require "riimu/kit-urlparser:^2.1"
```

After installing this library via Composer, you can load the library by
including the `vendor/autoload.php` file that was generated by Composer during
the installation.

### Adding the library as a dependency ###

If you are already familiar with how to use Composer, you may alternatively add
the library as a dependency by adding the following `composer.json` file to your
project and running the `composer install` command:

```json
{
"require": {
"riimu/kit-urlparser": "^2.1"
}
}
```

### Manual installation ###

If you do not wish to use Composer to load the library, you may also download
the library manually by downloading the [latest release](https://github.com/Riimu/Kit-UrlParser/releases/latest)
and extracting the `src` folder to your project. You may then include the
provided `src/autoload.php` file to load the library classes.

Please note that using Composer will also automatically download the other
required PHP libraries. If you install this library manually, you will also need
to make those other required libraries available.

## Usage ##

Using this library is relatively straightforward. The library provides a URL
parsing class `UriParser` and an immutable value object class `Uri` that
represents the URL. To parse an URL, you could simply provide the URL as a
string to the `parse()` method in `UriParser` which returns an instance of `Uri`
that has been generated from the parsed URL.

For example:

```php
parse('http://www.example.com');

echo $uri->getHost(); // Outputs 'www.example.com'
```

Alternatively, you can just skip using the `UriParser` completely and simply
provide the URL as a constructor parameter to the `Uri`:

```php
getHost(); // Outputs 'www.example.com'
```

The main difference between using the `parse()` method and the constructor is
that the `parse()` method will return a `null` if the provided URL is not a
valid url, while the constructor will throw an `InvalidArgumentException`.

To retrieve different types of information from the URL, the `Uri` class
provides various different methods to help you. Here is a simple example as an
overview of the different available methods:

```php
parse('http://jane:[email protected]:8080/site/index.php?action=login&prev=index#form');

echo $uri->getScheme() . PHP_EOL; // outputs: http
echo $uri->getUsername() . PHP_EOL; // outputs: jane
echo $uri->getPassword() . PHP_EOL; // outputs: pass123
echo $uri->getHost() . PHP_EOL; // outputs: www.example.com
echo $uri->getTopLevelDomain() . PHP_EOL; // outputs: com
echo $uri->getPort() . PHP_EOL; // outputs: 8080
echo $uri->getStandardPort() . PHP_EOL; // outputs: 80
echo $uri->getPath() . PHP_EOL; // outputs: /site/index.php
echo $uri->getPathExtension() . PHP_EOL; // outputs: php
echo $uri->getQuery() . PHP_EOL; // outputs: action=login&prev=index
echo $uri->getFragment() . PHP_EOL; // outputs: form

print_r($uri->getPathSegments()); // [0 => 'site', 1 => 'index.php']
print_r($uri->getQueryParameters()); // ['action' => 'login', 'prev' => 'index']
```

The `Uri` component also provides various methods for modifying the URL, which
allows you to construct new URLs from separate components or modify existing
ones. Note that the `Uri` component is an immutable value object, which means
that each of the modifying methods return a new `Uri` instance instead of
modifying the existing one. Here is a simple example of constructing an URL
from it's components:

```php
withScheme('http')
->withUserInfo('jane', 'pass123')
->withHost('www.example.com')
->withPort(8080)
->withPath('/site/index.php')
->withQueryParameters(['action' => 'login', 'prev' => 'index'])
->withFragment('form');

// Outputs: http://jane:[email protected]:8080/site/index.php?action=login&prev=index#form
echo $uri;
```

As can be seen from the previous example, the `Uri` component also provides a
`__toString()` method that provides the URL as a string.

### Retrieving Information ###

Here is the list of methods that the `Uri` component provides for retrieving
information from the URL:

* `getScheme()` returns the scheme from the URL or an empty string if the URL
has no scheme.

* `getAuthority()` returns the component from the URL that consists of the
username, password, hostname and port in the format `user-info@hostname:port`

* `getUserInfo()` returns the component from the URL that contains the
username and password separated by a colon.

* `getUsername()` returns the *decoded* username from the URL or an empty
string if there is no username present in the URL.

* `getPassword()` returns the *decoded* password from the URL or an empty
string if there is no password present in the URL.

* `getHost()` return the hostname from the URL or an empty string if the URL
has no host.

* `getIpAddress()` returns the IP address from the host, if the host is an
IP address. Otherwise this method will return `null`. If an IPv6 address
was provided, the address is returned without the surrounding braces.

* `getTopLevelDomain()` returns the top level domain from the host. If there
is no host or the host is an IP address, an empty string will be returned
instead.

* `getPort()` returns the port from the URL or a `null` if there is no port
present in the url. This method will also return a `null` if the port is the
standard port for the current scheme (e.g. 80 for http).

* `getStandardPort()` returns the standard port for the current scheme. If
there is no scheme or the standard port for the scheme is not known, a
`null` will be returned instead.

* `getPath()` returns the path from the URL or an empty string if the URL has
no path.

* `getPathSegments()` returns an array of *decoded* path segments (i.e. the
path split by each forward slash). Empty path segments are discarded and not
included in the returned array.

* `getPathExtension()` returns the file extension from the path or an empty
string if the URL has no path.

* `getQuery()` returns the query string from the URL or an empty string if the
URL has no query string.

* `getQueryParameters()` parses the query string from the URL using the
`parse_str()` function and returns the array of parsed values.

* `getFragment()` returns the fragment from the URL or an empty string if the
URL has no fragment.

* `__toString()` returns the URL as a string.

### Modifying the URL ###

The `Uri` component provides various methods that can be used to modify URLs
and construct new ones. Note that since the `Uri` class is an immutable value
object, each method returns a new instance of `Uri` rather than modifying the
existing one.

* `withScheme($scheme)` returns a new instance with the given scheme. An empty
scheme can be used to remove the scheme from the URL. Note that any provided
scheme is normalized to lowercase.

* `withUserInfo($user, $password = null)` returns a new instance with the
given username and password. Note that the password is ignored unless an
username is provided. Empty username can be used to remove the username and
password from the URL. Any character that cannot be inserted in the URL by
itself will be percent encoded.

* `withHost($host)` returns a new instance with the given host. An empty host
can be used to remove the host from the URL. Note that this method does not
accept international domain names. Note that this method will also normalize
the host to lowercase.

* `withPort($port)` returns a new instance with the given port. A `null` can
be used to remove the port from the URL.

* `withPath($path)` returns a new instance with the given path. An empty path
can be used to remove the path from the URL. Note that any character that is
not a valid path character will be percent encoded in the URL. Existing
percent encoded characters will not be double encoded, however.

* `withPathSegments(array $segments)` returns a new instance with the path
constructed from the array of path segments. All invalid path characters in
the segments will be percent encoded, including the forward slash and
existing percent encoded characters.

* `withQuery($query)` returns a new instance with the given query string. An
empty query string can be used to remove the path from the URL. Note that
any character that is not a valid query string character will be percent
encoded in the URL. Existing percent encoded characters will not be double
encoded, however.

* `withQueryParameters(array $parameters)` returns a new instance with the
query string constructed from the provided parameters using the
`http_build_query()` function. All invalid query string characters in the
parameters will be percent encoded, including the ampersand, equal sign and
existing percent encoded characters.

* `withFragment($fragment)` returns a new instance with the given fragment. An
empty string can be used to remove the fragment from the URL. Note that any
character that is not a valid fragment character will be percent encoded in
the URL. Existing percent encoded characters will not be double encoded,
however.

### UTF-8 and International Domains Names ###

By default, this library provides a parser that is RFC 3986 compliant. The RFC
specification does not permit the use of UTF-8 characters in the domain name or
any other parts of the URL. The correct representation for these in the URL is
to use the an IDN standard for domain names and percent encoding the UTF-8
characters in other parts.

However, to help you deal with UTF-8 encoded characters, many of the methods in
the `Uri` component will automatically percent encode any characters that cannot
be inserted in the URL on their own, including UTF-8 characters. Due to
complexities involved, however, the `withHost()` method does not allow UTF-8
encoded characters.

By default, the parser also does not parse any URLs that include UTF-8 encoded
characters because that would be against the RFC specification. However, the
parser does provide two additional parsing modes that allows these characters
whenever possible.

If you wish to parse URLs that may contain UTF-8 characters in the user
information (i.e. the username or password), path, query or fragment components
of the URL, you can simply use the UTF-8 parsing mode. For example:

```php
setMode(\Riimu\Kit\UrlParser\UriParser::MODE_UTF8);

$uri = $parser->parse('http://www.example.com/föö/bär.html');
echo $uri->getPath(); // Outputs: /f%C3%B6%C3%B6/b%C3%A4r.html
```

UTF-8 characters in the domain name, however, are a bit more complex issue. The
parser, however, does provide a rudimentary support for parsing these domain
names using the IDNA mode. For example:

```php
setMode(\Riimu\Kit\UrlParser\UriParser::MODE_IDNA);

$uri = $parser->parse('http://www.fööbär.com');
echo $uri->getHost(); // Outputs: www.xn--fbr-rla2ga.com
```

Note that using this parsing mode requires the PHP extension `intl` to be
enabled. The appropriate parsing mode can also be provided to the constructor of
the `Uri` component using the second constructor parameter.

While support for parsing these UTF-8 characters is available, this library does
not provide any methods for the reverse operations since the purpose of this
library is to deal with RFC 3986 compliant URIs.

### URL Normalization ###

Due to the fact that the RFC 3986 specification defines some URLs as equivalent
despite having some slight differences, this library does some minimal
normalization to the provided values. You may encounter these instances when,
for example, parsing URLs provided by users. The most notable normalizations you
may encounter are as follows:

* The `scheme` and `host` components are considered case insensitive. Thus,
these components will always be normalized to lower case.
* The port number will not be included in the strings returned by
`getAuthority()` and `__toString()` if the port is the standard port for the
current scheme.
* Percent encodings are treated in case insensitive manner. Thus, this library
will normalize the hexadecimal characters to upper case.
* The number of forward slashes in the beginning of the path in the string
returned by `__toString()` may change depending on whether the URL has an
`authority` component or not.
* Percent encoded characters in parsed and generated URIs in the `userinfo`
component may differ due to the fact that the `UriParser` works with the
PSR-7 specification which does not provide a way to provide encoded username
or password.

## Credits ##

This library is Copyright (c) 2013-2022 Riikka Kalliomäki.

See LICENSE for license and copying information.