https://github.com/tudborg/whatthefield
- Host: GitHub
- URL: https://github.com/tudborg/whatthefield
- Owner: tudborg
- Created: 2015-06-23T19:39:19.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-01-06T09:42:03.000Z (about 9 years ago)
- Last Synced: 2024-12-07T23:04:02.398Z (about 2 months ago)
- Language: PHP
- Homepage:
- Size: 4.36 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# whatthefield
_Detect structures of collections in XML (and other) documents_
## What
WhatTheField is a library for searching a DOM (feed) for values (nodes) that match a customizable configuration of value types.
WhatTheField takes a scoring approach to value type discovery: you express the importance of different features, and the best-scoring nodes win.
The scoring can be compared to the Elasticsearch approach to composable queries.

Here is a short config example:
```php
// Take the best score among the supported date formats.
$isDatetime = new Score\Max([
    new Score\IsDateTimeFormat(DateTime::ISO8601),
    new Score\IsDateTimeFormat(DateTime::ATOM),
    new Score\IsDateTimeFormat(DateTime::RSS),
    new Score\IsDateTimeFormat(DateTime::W3C),
    new Score\IsDateTimeFormat(DateTime::RFC3339),
    new Score\IsDateTimeFormat('Y-m-d H:i:s'),
    // the formats below are boosted by 0.5
    new Score\Boost(0.5, [
        new Score\IsDateTimeFormat(DateTime::RFC822),
        new Score\IsDateTimeFormat(DateTime::RFC850),
        new Score\IsDateTimeFormat(DateTime::RFC1036),
        new Score\IsDateTimeFormat(DateTime::RFC1123),
        new Score\IsDateTimeFormat(DateTime::RFC2822),
        new Score\IsDateTimeFormat(DateTime::COOKIE),
    ]),
]);

return [
    'datetime' => $isDatetime,
    'id' => new Score\Sum([
        // is unique
        new Score\IsUnique(),
        // is not (boost of -1 == IS NOT)
        new Score\Boost(-1, [
            // a URL
            new Score\IsFilterVar(FILTER_VALIDATE_URL),
            // a decimal number ("." separated)
            new Score\IsDecimal(),
        ]),
        // not a date
        new Score\Boost(-1, [
            $isDatetime,
        ]),
        // tie breaker on ancestor level
        new Score\Boost(-0.001, [
            new Score\AncestorCount(),
        ]),
        // tie breaker by word count: more words == less likely to be the id
        new Score\Boost(-0.05, [
            new Score\MatchCount('/\s+/S'),
        ]),
        // tie breaker, not a number
        new Score\Boost(-0.25, [
            new Score\IsMatch('/[^\d]+/S'),
        ]),
    ]),
];
```

This config describes two fields to look for:
- datetime
- id

## How
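
As a rough illustration of how composable scores like the ones in the config could evaluate a single value, here is a minimal sketch built from plain closures. The helper names (`isDateTimeFormat`, `maxOf`, `sumOf`, `boost`) are hypothetical and are not the library's classes. The sketch also assumes that a boost sums its children and multiplies the total by the factor, which the README does not confirm.

```php
<?php
// Hypothetical helpers for illustration; not the WhatTheField API.
// Each helper returns a callable(string): float.

// 1.0 when the value round-trips exactly through the given date format.
function isDateTimeFormat(string $format): callable
{
    return function (string $value) use ($format): float {
        $parsed = DateTime::createFromFormat($format, $value);
        $ok = $parsed !== false && $parsed->format($format) === $value;
        return $ok ? 1.0 : 0.0;
    };
}

// Best score among the children (0.0 when there are none).
function maxOf(array $scorers): callable
{
    return function (string $value) use ($scorers): float {
        $best = 0.0;
        foreach ($scorers as $index => $scorer) {
            $score = $scorer($value);
            if ($index === 0 || $score > $best) {
                $best = $score;
            }
        }
        return $best;
    };
}

// Sum of all child scores.
function sumOf(array $scorers): callable
{
    return function (string $value) use ($scorers): float {
        $total = 0.0;
        foreach ($scorers as $scorer) {
            $total += $scorer($value);
        }
        return $total;
    };
}

// Assumption: a boost multiplies the summed child scores by $factor,
// so a negative factor turns a match into a penalty.
function boost(float $factor, array $scorers): callable
{
    $inner = sumOf($scorers);
    return function (string $value) use ($factor, $inner): float {
        return $factor * $inner($value);
    };
}

$isDatetime = maxOf([
    isDateTimeFormat(DateTime::ATOM),
    isDateTimeFormat('Y-m-d H:i:s'),
    boost(0.5, [isDateTimeFormat(DateTime::COOKIE)]),
]);
$notDatetime = boost(-1.0, [$isDatetime]);

var_dump($isDatetime('2016-01-06 09:42:03'));  // float(1)
var_dump($notDatetime('2016-01-06 09:42:03')); // float(-1)
```

Because every scorer has the same shape, a composed scorer such as `$isDatetime` can be reused inside another one, which is what the `id` field in the config does when it penalizes date-like values.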
WhatTheField first looks for the primary collection in the provided document.
The primary collection is the outermost, biggest collection of similar children.
For example:
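```xml
<things>
    <thing><!-- ... --></thing>
    <thing><!-- ... --></thing>
    <thing><!-- ... --></thing>
</things>
```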
Here the collection is `<thing>` in `<things>`, or XPath: `/things/thing`.

From the collection of items, WhatTheField searches for the fields that best match your config
across the entire feed.
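
The primary-collection step can be sketched with PHP's DOM extension. This is only one plausible reading of "outermost, biggest": it assumes "biggest" means the largest number of same-named sibling elements, with ties going to the shallower group. `findPrimaryCollection` is a hypothetical helper, not a function the library exposes.

```php
<?php
// Hypothetical helper for illustration; not part of the WhatTheField API.
// Assumption: "biggest" = most same-named sibling elements, and
// "outermost" = the shallowest such group when counts tie.
function findPrimaryCollection(DOMDocument $doc): ?string
{
    $root = $doc->documentElement;
    if ($root === null) {
        return null; // empty document
    }

    $best = null;
    $stack = [[$root, '/' . $root->nodeName, 1]];

    while ($stack !== []) {
        [$node, $path, $depth] = array_pop($stack);

        // Count element children of this node by tag name.
        $counts = [];
        foreach ($node->childNodes as $child) {
            if (!$child instanceof DOMElement) {
                continue; // skip text nodes, comments, etc.
            }
            $counts[$child->nodeName] = ($counts[$child->nodeName] ?? 0) + 1;
            $stack[] = [$child, $path . '/' . $child->nodeName, $depth + 1];
        }

        foreach ($counts as $name => $count) {
            if ($count < 2) {
                continue; // a single child is not a collection
            }
            $better = $best === null
                || $count > $best['count']
                || ($count === $best['count'] && $depth < $best['depth']);
            if ($better) {
                $best = ['path' => $path . '/' . $name, 'count' => $count, 'depth' => $depth];
            }
        }
    }

    return $best === null ? null : $best['path'];
}

$doc = new DOMDocument();
$doc->loadXML(
    '<things>'
    . '<thing><id>a1</id></thing>'
    . '<thing><id>b2</id></thing>'
    . '<thing><id>c3</id></thing>'
    . '</things>'
);
echo findPrimaryCollection($doc), "\n"; // /things/thing
```

The sketch compares siblings by tag name only. Judging "similar children" by their inner structure as well would need a richer comparison.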
The result is an array keyed by configuration name (e.g. `datetime` or `id`). Each value
is an array mapping XPath to score, sorted by score:
```php
$result = [
    'datetime' => [
        '/things/thing/datetime' => 2.2,
        '/things/thing/looks_a_little_like_datetime' => 0.1,
    ],
    /* ... */
];
```
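
A result of this shape can be reduced to the single best XPath per field. The sketch below uses made-up scores and a hypothetical helper, `bestPaths`. It re-sorts each candidate list defensively instead of relying on the order in which candidates arrive.

```php
<?php
// Illustrative data in the same shape as the result above; the scores are made up.
$result = [
    'datetime' => [
        '/things/thing/looks_a_little_like_datetime' => 0.1,
        '/things/thing/datetime' => 2.2,
    ],
    'id' => [
        '/things/thing/id' => 1.0,
    ],
];

// Hypothetical helper: map each field name to its highest-scoring XPath.
function bestPaths(array $result): array
{
    $best = [];
    foreach ($result as $field => $scores) {
        if ($scores === []) {
            continue; // no candidates for this field
        }
        arsort($scores); // highest score first, keys preserved
        $best[$field] = array_key_first($scores); // requires PHP 7.3+
    }
    return $best;
}

print_r(bestPaths($result));
```

The returned paths can then be passed to `DOMXPath::query()` to read the matching values from each item in the feed.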