https://github.com/patternhelloworld/url-knife

Extract and decompose (fuzzy) URLs (including emails, which are conceptually a part of URLs) in texts with Area-Pattern-based modularity
email-extractor email-parser email-parsing pre-processing uri-template url-extractor url-normalization url-normalizer url-parser url-parsing url-validation


# Url-knife [![NPM version](https://img.shields.io/npm/v/url-knife.svg)](https://www.npmjs.com/package/url-knife) [![](https://data.jsdelivr.com/v1/package/gh/patternknife/url-knife/badge)](https://www.jsdelivr.com/package/gh/patternknife/url-knife) [![](https://badgen.net/bundlephobia/minzip/url-knife)](https://bundlephobia.com/result?p=url-knife)
## Overview
Extract and decompose (fuzzy) URLs (including emails, which are conceptually a part of URLs) in texts with ``Area-Pattern-based modularity``.
- This library is currently being refactored into TypeScript, as it was originally developed in JavaScript.

#### URL knife
LIVE DEMO

## Area-Pattern-Based Modularity

The **Area** represents a designated section of content, such as general text, XML (HTML) areas, URL areas, or EMAIL areas. Each **Area** is associated with a specific set of **Patterns** (regular expressions) tailored to its context.

### Example:

1. In a **TextArea** (general plain text), the system applies a URL-specific regular expression to extract potential URLs.
2. Once the area is narrowed down to contain URLs, **UrlArea** logic is used, applying URL-specific patterns to decompose the URL into its components (e.g., protocol, domain, path, query parameters).
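
The two steps above can be sketched in plain JavaScript. This is only an illustration of the idea: the names `URL_CANDIDATE`, `URL_PARTS`, `extractCandidates`, and `decompose` are hypothetical, and the regular expressions are deliberately simple stand-ins, not the patterns url-knife actually ships.

```javascript
// Step 1 (TextArea-like): a loose pattern that only locates URL candidates in plain text.
// Hypothetical stand-in, much stricter than url-knife's fuzzy matching.
const URL_CANDIDATE =
  /\b(?:[a-z][a-z0-9+.-]*:\/\/)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?::\d+)?(?:[\/?#][^\s]*)?/gi;

// Step 2 (UrlArea-like): a pattern that splits one candidate into its components.
const URL_PARTS =
  /^(?:([a-z][a-z0-9+.-]*):\/\/)?([^\/?#:]+)(?::(\d+))?([^?#]*)(\?[^#]*)?/i;

// Narrow the text down to the substrings that look like URLs.
function extractCandidates(text) {
  return Array.from(text.matchAll(URL_CANDIDATE), (m) => m[0]);
}

// Decompose a single candidate; missing components become null.
function decompose(url) {
  const m = URL_PARTS.exec(url);
  if (!m) return null;
  return {
    protocol: m[1] || null,
    domain: m[2],
    port: m[3] || null,
    path: m[4] || null,
    params: m[5] || null,
  };
}

const text =
  'Docs live at https://example.com:8080/guide?lang=en and example.org/faq today.';
const parts = extractCandidates(text).map(decompose);
console.log(parts);
// [
//   { protocol: 'https', domain: 'example.com', port: '8080', path: '/guide', params: '?lang=en' },
//   { protocol: null, domain: 'example.org', port: null, path: '/faq', params: null }
// ]
```

Keeping the two patterns separate means the locating pattern can stay permissive while the decomposing pattern only ever sees text that has already been narrowed down to a single URL.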

### Enhanced Accuracy with Regular Expression Indexes:
To further improve accuracy, the system leverages the **index** (or **offset**) values from regular expressions. These indexes help pinpoint exact locations of matches within the text, ensuring precise extraction and minimizing false positives.

For example:
- If a **CommentArea** is processed using its specific patterns, the system identifies indexes for matches within that area.
- These indexes can then be used to exclude matched URLs from a broader **TextArea**, ensuring only relevant URLs are processed and avoiding redundant or incorrect extractions.
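
A minimal sketch of this index-based exclusion, assuming an HTML comment as the **CommentArea**. The patterns `COMMENT` and `URL` and the helpers `matchRanges` and `isInside` are hypothetical names chosen for illustration; they are not part of the url-knife API.

```javascript
// Hypothetical patterns for illustration; not the ones url-knife uses.
const COMMENT = /<!--[\s\S]*?-->/g;
const URL = /(?:https?:\/\/)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:\/[^\s<>"']*)?/gi;

// Collect the [start, end) offsets of every match of a global pattern.
function matchRanges(pattern, text) {
  return Array.from(text.matchAll(pattern), (m) => ({
    value: m[0],
    start: m.index,
    end: m.index + m[0].length,
  }));
}

// True when `inner` lies completely inside one of `ranges`.
function isInside(inner, ranges) {
  return ranges.some((r) => inner.start >= r.start && inner.end <= r.end);
}

const html =
  '<p>visit example.com/docs</p><!-- old: legacy.example.net -->' +
  '<a href="https://example.org">x</a>';

const commentRanges = matchRanges(COMMENT, html);
const allUrls = matchRanges(URL, html);

// URLs whose offsets fall inside a comment are assigned to the comment area
// and excluded from the general text results.
const textUrls = allUrls.filter((u) => !isInside(u, commentRanges));
const commentUrls = allUrls.filter((u) => isInside(u, commentRanges));

console.log(textUrls.map((u) => u.value));    // [ 'example.com/docs', 'https://example.org' ]
console.log(commentUrls.map((u) => u.value)); // [ 'legacy.example.net' ]
```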

### Key Benefits:
This modular approach ensures that each **Area** is processed efficiently with the most relevant and optimized regular expressions. By incorporating index-based matching, it enables robust, scalable, and highly accurate parsing for various content types while preventing conflicts between overlapping patterns.

## Installation

For ES5 users, refer to ``public/index.html``.

``` html



<!-- OR -->

```

For ES6 npm users, run ``npm install --save url-knife`` in the console.
(**Requires Node v18.20.4**)
``` javascript
import {TextArea, UrlArea, XmlArea} from 'url-knife';
```
For ES5, prefix each call with ``Pattern``:
```javascript
Pattern.UrlArea...
```

## Syntax & Usage

[Chapter 1. Normalize or parse one URL](#chapter-1-normalize-or-parse-one-url)

[Chapter 2. Extract all URLs or emails](#chapter-2-extract-all-urls-or-emails)

[Chapter 3. Extract URIs with certain names](#chapter-3-extract-uris-with-certain-names)

[Chapter 4. Extract all URLs in raw HTML or XML](#chapter-4-extract-all-urls-in-raw-html-or-xml)

#### Chapter 1. Normalize or parse one URL
The following two methods should be used for processing a single URL, not for multiple URLs within a text.
(For handling multiple URLs, refer to Chapters 2 and 4.)

##### normalizeUrl vs parseUrl
If you need to parse a standard URL without any typos, it is safe to use ``parseUrl``. However, ``normalizeUrl`` is designed to handle URLs that may contain human errors.

* ##### Run ``normalizeUrl``

``` javascript
/**
 * @brief
 * Normalize a URL that may contain human errors (intranet URLs are not allowed.)
 */
var sample1 = Pattern.UrlArea.normalizeUrl("htp/:/abcgermany.,def;:9094 #park//noon??abc=retry")
var sample2 = Pattern.UrlArea.normalizeUrl("'://abc.jppp:9091 /park/noon'")
var sample3 = Pattern.UrlArea.normalizeUrl("ss hd : /university,.acd. ;jpkp로 접속")

```
* ##### Results
``` javascript
// sample1
{
    "url": "htp/:/abcgermany.,def;:9094 #park//noon??abc=retry",
    "normalizedUrl": "http://abcgermany.de:9094#park/noon?abc=retry",
    "removedTailOnUrl": "",
    "protocol": "http",
    "onlyDomain": "abcgermany.de",
    "onlyParams": "?abc=retry",
    "onlyUri": "#park/noon",
    "onlyUriWithParams": "#park/noon?abc=retry",
    "onlyParamsJsn": {
        "abc": "retry"
    },
    "type": "domain",
    "port": "9094"
}
// sample2
{
    "url": "'://abc.jppp:9091 /park/noon'",
    "normalizedUrl": "abc.jp:9091/park/noon",
    "removedTailOnUrl": "'",
    "protocol": null,
    "onlyDomain": "abc.jp",
    "onlyParams": null,
    "onlyUri": "/park/noon'",
    "onlyUriWithParams": "/park/noon'",
    "onlyParamsJsn": null,
    "type": "domain",
    "port": "9091"
}
// sample3
{
    "url": "ss hd : /university,.acd. ;jpkp로 접속",
    "normalizedUrl": "ssh://university.ac.jp",
    "removedTailOnUrl": "",
    "protocol": "ssh",
    "onlyDomain": "university.ac.jp",
    "onlyParams": null,
    "onlyUri": null,
    "onlyUriWithParams": null,
    "onlyParamsJsn": null,
    "type": "domain",
    "port": null
}
```

* ##### Run ``parseUrl``

``` javascript
/**
* @brief
* Parse a URL that contains no human errors
*/
var url = Pattern.UrlArea.parseUrl("xtp://gooppalgo.com/park/tree/?abc=1")
```
###### console.log()
``` javascript
{
    "url": "xtp://gooppalgo.com/park/tree/?abc=1",
    "removedTailOnUrl": "",
    "protocol": "xtp (unknown protocol)",
    "onlyDomain": "gooppalgo.com",
    "onlyParams": "?abc=1",
    "onlyUri": "/park/tree/",
    "onlyUriWithParams": "/park/tree/?abc=1",
    "onlyParamsJsn": {
        "abc": "1"
    },
    "type": "domain",
    "port": null
}
```

#### Chapter 2. Extract all URLs or emails

##### The following methods are recommended for most cases.

* ##### extractAllUrls

``` javascript
var textStr = 'http://[::1]:8000에서 http ://www.example.com/wpstyle/?p=364 is ok \n' +
'HTTP://foo.com/blah_blah_(wikipedia) https://www.google.com/maps/place/USA/@36.2218457,... tnae1ver.com:8000on the internet Asterisk\n ' +
'the packed1book.net. fakeshouldnotbedetected.url?abc=fake s5houl7十七日dbedetected.jp?japan=go&html=가나다@pacbook.net; abc.com/ad/fg/?kk=5 abc@daum.net' +
'Have you visited http://goasidaio.ac.kr?abd=5안녕하세요?5...,.&kkk=5rk.,, ' +
'http://✪df.ws/123\n' +
'http://142.42.1.1:8080/\n' +
'http://-.~_!$&\'()*+,;=:%40:80%2f::::::@example.com ' +
'Have you visited goasidaio.ac.kr?abd=5hell0?5...&kkk=5rk.,. ';

/**
 * @brief
 * Distill all URLs from normal text
 * @author Andrew Kang
 * @param textStr string required
 * @param noProtocolJsn object
 *   default : {
 *     'ipV4' : false,
 *     'ipV6' : false,
 *     'localhost' : false,
 *     'intranet' : false
 *   }
 */

var urls = Pattern.TextArea.extractAllUrls(textStr, {
'ipV4' : true,
'ipV6' : false,
'localhost' : false,
'intranet' : true
})
```

* ##### extractAllEmails

``` javascript
/**
* @brief
* Distill all emails from normal text
* @author Andrew Kang
* @param textStr string required
* @param prefixSanitizer boolean (default : false)
* @return array
*/

var emails = Pattern.TextArea.extractAllEmails(textStr, true)

```
###### console.log()
##### The ``pass`` property below is ``true`` only when the email strictly follows the RFC rules.
```json
[
    {
        "value": {
            "email": "가나다@apacbook.ac.kr",
            "removedTailOnEmail": null,
            "type": "domain"
        },
        "area": "text",
        "index": {
            "start": 222,
            "end": 240
        },
        "pass": false
    },
    {
        "value": {
            "email": "adssd@asdasd.ac.jp",
            "removedTailOnEmail": null,
            "type": "domain",
            "removedTailOnUrl": "..."
        },
        "area": "text",
        "index": {
            "start": 242,
            "end": 263
        },
        "pass": true
    }
]
```
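
As a usage sketch, results of this shape can be split into strictly valid and fuzzy matches by checking ``pass``. The `results` array below is copied from the output above rather than produced by calling the library, and `splitByPass` is a hypothetical helper, not part of url-knife.

```javascript
// Result objects shaped like the documented output of extractAllEmails.
const results = [
  {
    value: { email: '가나다@apacbook.ac.kr', removedTailOnEmail: null, type: 'domain' },
    area: 'text',
    index: { start: 222, end: 240 },
    pass: false,
  },
  {
    value: { email: 'adssd@asdasd.ac.jp', removedTailOnEmail: null, type: 'domain' },
    area: 'text',
    index: { start: 242, end: 263 },
    pass: true,
  },
];

// Separate emails that passed the strict check from those matched only fuzzily.
function splitByPass(items) {
  const strict = [];
  const fuzzy = [];
  for (const item of items) {
    (item.pass ? strict : fuzzy).push(item.value.email);
  }
  return { strict, fuzzy };
}

const { strict, fuzzy } = splitByPass(results);
console.log(strict); // [ 'adssd@asdasd.ac.jp' ]
console.log(fuzzy);  // [ '가나다@apacbook.ac.kr' ]
```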

#### Chapter 3. Extract URIs with certain names

``` javascript

var sampleText = 'https://google.com/abc/777?a=5&b=7 abc/def 333/kak abc/55에서 abc/53 abc/533/ka abc/53a/ka /123a/abc/556/dd /abc/123?a=5&b=tkt /xyj/asff' +
'a333/kak nice/guy/ bad/or/nice/guy ssh://nice.guy.com/?a=dkdfl';

/**
* @brief
* Distill uris with certain names from normal text
* @author Andrew Kang
* @param textStr string required
* @param uris array required
* for example, [['a','b'], ['c','d']]
* '{number}' matches a numeric-only segment, e.g. [['a','{number}'], ['c','d']]
* @param endBoundary boolean (default : false)
* @return array
*/

var uris = Pattern.TextArea.extractCertainUris(sampleText,
[['{number}', 'kak'], ['nice','guy'],['abc', '{number}']], true)

// If endBoundary is set to false, more URIs are detected.
// This detects all URIs containing '{number}/kak', 'nice/guy' or 'abc/{number}'.
```
###### console.log()
``` javascript
[
{
"uriDetected": {
"value": {
"url": "/abc/777?a=5&b=7",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": "",
"onlyParams": "?a=5&b=7",
"onlyUri": "/abc/777",
"onlyUriWithParams": "/abc/777?a=5&b=7",
"onlyParamsJsn": {
"a": "5",
"b": "7"
},
"type": "domain",
"port": null
},
"area": "text",
"index": {
"start": 18,
"end": 34
}
},
"inWhatUrl": {
"value": {
"url": "https://google.com/abc/777?a=5&b=7",
"removedTailOnUrl": "",
"protocol": "https",
"onlyDomain": "google.com",
"onlyParams": "?a=5&b=7",
"onlyUri": "/abc/777",
"onlyUriWithParams": "/abc/777?a=5&b=7",
"onlyParamsJsn": {
"a": "5",
"b": "7"
},
"type": "domain",
"port": null
},
"area": "text",
"index": {
"start": 0,
"end": 34
}
}
},
{
"uriDetected": {
"value": {
"url": "333/kak",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": null,
"onlyParams": null,
"onlyUri": "333/kak",
"onlyUriWithParams": "333/kak",
"onlyParamsJsn": null,
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 43,
"end": 51
}
},
"inWhatUrl": undefined
},
{
"uriDetected": {
"value": {
"url": "abc/53",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": null,
"onlyParams": null,
"onlyUri": "abc/53",
"onlyUriWithParams": "abc/53",
"onlyParamsJsn": null,
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 60,
"end": 67
}
},
"inWhatUrl": undefined
},
{
"uriDetected": {
"value": {
"url": "abc/533/ka",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": null,
"onlyParams": null,
"onlyUri": "abc/533/ka",
"onlyUriWithParams": "abc/533/ka",
"onlyParamsJsn": null,
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 67,
"end": 77
}
},
"inWhatUrl": undefined
},
{
"uriDetected": {
"value": {
"url": "/123a/abc/556/dd",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": null,
"onlyParams": null,
"onlyUri": "/123a/abc/556/dd",
"onlyUriWithParams": "/123a/abc/556/dd",
"onlyParamsJsn": null,
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 89,
"end": 105
}
},
"inWhatUrl": undefined
},
{
"uriDetected": {
"value": {
"url": "/abc/123?a=5&b=tkt",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": null,
"onlyParams": "?a=5&b=tkt",
"onlyUri": "/abc/123",
"onlyUriWithParams": "/abc/123?a=5&b=tkt",
"onlyParamsJsn": {
"a": "5",
"b": "tkt"
},
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 106,
"end": 124
}
},
"inWhatUrl": undefined
},
{
"uriDetected": {
"value": {
"url": "nice/guy",
"removedTailOnUrl": "/",
"protocol": null,
"onlyDomain": null,
"onlyParams": null,
"onlyUri": "nice/guy",
"onlyUriWithParams": "nice/guy",
"onlyParamsJsn": null,
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 144,
"end": 153
}
},
"inWhatUrl": undefined
},
{
"uriDetected": {
"value": {
"url": "/or/nice/guy",
"removedTailOnUrl": "",
"protocol": null,
"onlyDomain": null,
"onlyParams": null,
"onlyUri": "/or/nice/guy",
"onlyUriWithParams": "/or/nice/guy",
"onlyParamsJsn": null,
"type": "uri",
"port": null
},
"area": "text",
"index": {
"start": 157,
"end": 170
}
},
"inWhatUrl": null
}
]
```
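
To illustrate how a name list with the ``{number}`` placeholder might be turned into a single regular expression, here is a rough sketch. It is an assumption about one possible approach, not url-knife's implementation: `escapeRegExp` and `buildUriPattern` are hypothetical helpers, the sketch returns only the named segments rather than the whole enclosing URI, and it does not model the ``endBoundary`` option.

```javascript
// Escape regex metacharacters in a literal path segment.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Turn [['abc', '{number}'], ['nice', 'guy']] into one alternation pattern.
// '{number}' becomes \d+; every other segment is matched literally.
function buildUriPattern(uris) {
  const alternatives = uris.map((segments) =>
    segments
      .map((seg) => (seg === '{number}' ? '\\d+' : escapeRegExp(seg)))
      .join('/')
  );
  // Require a segment boundary on both sides: start of text, whitespace or '/'
  // before the match, and end of text, whitespace, '/', '?' or '#' after it.
  return new RegExp(
    '(?<![^\\s/])(?:' + alternatives.join('|') + ')(?=$|[\\s/?#])',
    'g'
  );
}

const pattern = buildUriPattern([['{number}', 'kak'], ['nice', 'guy'], ['abc', '{number}']]);
const text = 'abc/def 333/kak abc/53 abc/53a/ka /abc/123?a=5 bad/or/nice/guy';
const found = text.match(pattern);
console.log(found); // [ '333/kak', 'abc/53', 'abc/123', 'nice/guy' ]
```

Note that `abc/53a/ka` is skipped here because `53a` is not a numeric-only segment.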

#### Chapter 4. Extract all URLs in raw HTML or XML

``` javascript
// The sample of 'XML (HTML)'
var xmlStr =
    'en.wikipedia.org/wiki/Wikipedia:About\n' +
    'packed1book.net?user[name][first]=tj&user[name][last]=holowaychuk\n' +
    'fakeshouldnotbedetected.url?abc=fake -s5houl7十七日dbedetected.jp?japan=go- ' +
    'plus.google.co.kr0에서.., \n' +
    'https://plus.google.com/+google\n' +
    'https://www.google.com/maps/place/USA/@36.2218457,...' +
    ' float : none ; height: 200px;max-width: 50%;margin-top : 3%\' alt="undefined" src="http://www.aaa가가.com/image/showWorkOrderImg?fileName=12345.png"/>\n' +
    ' "abc@daum.net"로 보내주세요. ' +
    '-gigi.dau.ac.kr?mac=10 -dau.ac.kr?mac=10 ' +
    'abcd@daum.co.kr에서 가나다@pacbook.netPlease align the paper to the left. \n' +
    '구루.com undefined\n' +
    'http: //ne1ver.com:8000?abc=1&dd=5 localhost:80 estonia.ee/ estonia.ee? ' +
    'https://flaviocopes.com/how-to-inspect-javascript-object/ ※Please ask 203.35.33.555:8000 if you have any issues! ※ ' +
    'Have you visited goasidaioaaa.ac.kr';

var urls = Pattern.XmlArea.extractAllUrls(xmlStr);
```
###### console.log()
``` javascript
[
    // Not all listed
    {
        "value": {
            "url": "packed1book.net?user[name][first]=tj&user[name][last]=holowaychuk",
            "removedTailOnUrl": "",
            "protocol": null,
            "onlyDomain": "packed1book.net",
            "onlyParams": "?user[name][first]=tj&user[name][last]=holowaychuk",
            "onlyUri": null,
            "onlyUriWithParams": "?user[name][first]=tj&user[name][last]=holowaychuk",
            "onlyParamsJsn": {
                "user": {
                    "name": {
                        "first": "tj",
                        "last": "holowaychuk"
                    }
                }
            },
            "type": "domain",
            "port": null
        },
        "area": "text"
    },
    {
        "value": {
            "url": "adackedbooked.co.kr",
            "removedTailOnUrl": "",
            "protocol": null,
            "onlyDomain": "adackedbooked.co.kr",
            "onlyParams": null,
            "onlyUri": null,
            "onlyUriWithParams": null,
            "onlyParamsJsn": null,
            "type": "domain",
            "port": null
        },
        "area": "comment"
    }
    // .....
]
```
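
The nested ``onlyParamsJsn`` object in the first result shows bracketed query keys such as ``user[name][first]`` expanded into nested objects. The sketch below shows one way such an expansion could work; `parseNestedParams` is a hypothetical helper written for illustration, it is not the library's code, and it performs no percent-decoding.

```javascript
// Expand a query string with bracketed keys into a nested object.
function parseNestedParams(query) {
  const result = {};
  const pairs = query.replace(/^\?/, '').split('&').filter(Boolean);

  for (const pair of pairs) {
    const eq = pair.indexOf('=');
    const rawKey = eq === -1 ? pair : pair.slice(0, eq);
    const value = eq === -1 ? '' : pair.slice(eq + 1);

    // "user[name][first]" -> ["user", "name", "first"]
    const path = rawKey.split(/[\[\]]+/).filter(Boolean);
    if (path.length === 0) continue;

    // Walk (and create) the intermediate objects, then assign the leaf value.
    let node = result;
    for (const key of path.slice(0, -1)) {
      if (typeof node[key] !== 'object' || node[key] === null) node[key] = {};
      node = node[key];
    }
    node[path[path.length - 1]] = value;
  }
  return result;
}

const nested = parseNestedParams('?user[name][first]=tj&user[name][last]=holowaychuk');
console.log(JSON.stringify(nested));
// {"user":{"name":{"first":"tj","last":"holowaychuk"}}}

const flat = parseNestedParams('?a=5&b=7');
console.log(JSON.stringify(flat));
// {"a":"5","b":"7"}
```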