https://github.com/apajo/php-data-miner
Train your Miner with previously entered data and start collecting data automatically. Supports semi-structured data (such as PDF invoices) and unstructured data (free text such as emails).
annotations dataextraction nltk rubix-ml
- Host: GitHub
- URL: https://github.com/apajo/php-data-miner
- Owner: apajo
- License: apache-2.0
- Created: 2022-03-01T00:39:52.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2022-09-05T23:31:48.000Z (over 3 years ago)
- Last Synced: 2025-03-04T12:24:37.397Z (11 months ago)
- Topics: annotations, dataextraction, nltk, rubix-ml
- Language: PHP
- Homepage:
- Size: 221 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
PHP Data Miner
============
### Introduction
[php-data-miner](https://github.com/apajo/php-data-miner) extracts data
from semi-structured documents (such as PDF invoices) and unstructured text (such as emails).
### Prerequisites
* Unix-like OS
* GNU Make
* PHP (>=7.4)
* NodeJS (>=8)
### Installation
```bash
$ sudo apt-get install make gcc gfortran php-dev libopenblas-dev liblapacke-dev re2c build-essential
```
### Usage
#### Annotation
Annotate your model class with `@Model()` and its properties with `@Property()`.
```php
use PhpDataMiner\Model\Annotation\Model;
use PhpDataMiner\Model\Annotation\Property;

/**
 * @Model()
 */
class Invoice
{
    /**
     * @var string
     * @Property()
     */
    protected string $number;
}
```
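How the annotations are discovered is not described here. As a rough illustration only, the sketch below uses plain PHP reflection and a substring check to list the properties whose docblock mentions `@Property(`. The function `findAnnotatedProperties` and the `$internalNote` property are hypothetical, nothing in it calls php-data-miner, and the library itself may well rely on a proper annotation reader instead.

```php
// Hypothetical stand-in for the annotated model; no php-data-miner classes are used.
class Invoice
{
    /**
     * @var string
     * @Property()
     */
    protected string $number = '';

    /** @var string Not annotated, so it is skipped. */
    protected string $internalNote = '';
}

/**
 * Return the names of properties whose docblock contains "@Property(".
 * This is a plain substring check, not a real annotation parser.
 */
function findAnnotatedProperties(string $class): array
{
    $names = [];
    foreach ((new ReflectionClass($class))->getProperties() as $property) {
        $doc = $property->getDocComment(); // false when there is no docblock
        if ($doc !== false && strpos($doc, '@Property(') !== false) {
            $names[] = $property->getName();
        }
    }
    return $names;
}

print_r(findAnnotatedProperties(Invoice::class));
```

Only `number` is reported, because `$internalNote` carries no `@Property()` annotation.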
#### Create your miner
```php
// The entity whose annotated properties will be trained or predicted
$entity = new Invoice();

$miner = $this->miner->create($entity, [
    'storage' => new CustomStorage(),
    'property_types' => [
        new FloatProperty(),
        new IntegerProperty(),
        new DateProperty(),
        new Property(),
    ]
]);

// Convert the PDF to plain text, preserving its layout
$pdfContents = shell_exec('pdftotext -layout invoice.pdf -');

// Normalize the text with a list of filters
$doc = $miner->normalize($pdfContents, [
    'filters' => [
        DateFilter::class,
        ColonFilter::class,
        Section::class,
        WordTree::class,
    ]
]);
```
> You need to have __pdftotext__ installed to read PDF contents as shown above.
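Since the PDF text comes from an external program, it can help to quote the file name and to check for missing output. The following is a minimal sketch under those assumptions; `buildPdfToTextCommand` and `readPdfText` are hypothetical helper names, not part of php-data-miner.

```php
// Build the pdftotext command; escapeshellarg() protects spaces and shell metacharacters.
function buildPdfToTextCommand(string $pdfPath): string
{
    // "-layout" keeps the physical layout; the trailing "-" writes the text to stdout.
    return 'pdftotext -layout ' . escapeshellarg($pdfPath) . ' -';
}

// Return the extracted text, or null if the file is missing or no output was produced.
function readPdfText(string $pdfPath): ?string
{
    if (!is_file($pdfPath)) {
        return null;
    }
    $output = shell_exec(buildPdfToTextCommand($pdfPath));
    return is_string($output) && $output !== '' ? $output : null;
}

$pdfContents = readPdfText('invoice.pdf');
if ($pdfContents === null) {
    echo "Could not read invoice.pdf; is pdftotext installed?\n";
}
```

The returned string can be passed to the miner's normalization step in place of the raw `shell_exec()` result.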
* Filters (or transformers) transform and normalize the content.
* The __WordTree__ filter is a special kind of tokenizer that nests and groups the content (by rows, columns, sentences, etc.).
> It's recommended to place tokenizers last in the filters list.
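To make the ordering advice concrete, here is a small standalone sketch of a filter chain. It assumes that text filters map a string to a string while a tokenizer turns the string into a nested array, which is one reason a tokenizer would have to come last. All function names (`sketchNormalizeDates`, `sketchTightenColons`, `sketchTokenize`, `sketchNormalize`) are hypothetical and do not reproduce the library's `DateFilter`, `ColonFilter`, or `WordTree` classes.

```php
// Text filter: rewrite dd.mm.yyyy dates as ISO yyyy-mm-dd.
function sketchNormalizeDates(string $text): string
{
    return preg_replace('/\b(\d{2})\.(\d{2})\.(\d{4})\b/', '$3-$2-$1', $text);
}

// Text filter: remove whitespace before a colon ("number :" becomes "number:").
function sketchTightenColons(string $text): string
{
    return preg_replace('/[ \t]+:/', ':', $text);
}

// Tokenizer: split text into rows, rows into columns (2+ spaces), columns into words.
function sketchTokenize(string $text): array
{
    $tree = [];
    foreach (preg_split('/\R/', trim($text)) as $row) {
        $columns = preg_split('/\s{2,}/', trim($row));
        $tree[] = array_map(function (string $column): array {
            return preg_split('/\s+/', $column);
        }, $columns);
    }
    return $tree;
}

// Run the string filters first and the tokenizer last, since it changes the data type.
function sketchNormalize(string $text): array
{
    foreach (['sketchNormalizeDates', 'sketchTightenColons'] as $filter) {
        $text = $filter($text);
    }
    return sketchTokenize($text);
}

$tree = sketchNormalize("Invoice number :  A-1001\nDate: 05.09.2022    Total: 120.50");
print_r($tree);
```

The first row becomes two columns (`Invoice number:` and `A-1001`), and the date in the second row is already in ISO form by the time the tokenizer sees it.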
#### Training
Train your model with data you've already entered (supervised learning):
```php
...
$trainedProperties = $miner->train($entity, $doc);
```
#### Mining (or predicting)
Apply predicted data to your model:
```php
...
$predictedProperties = $miner->predict($entity, $doc);
```
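The train-then-predict cycle can be pictured with a deliberately simple toy model. This is not the algorithm php-data-miner uses; it is only a sketch of supervised learning from previously entered values, assuming a flat list of tokens and a rule of the form "the value follows the label seen most often before it". The names `trainToy` and `predictToy` are hypothetical.

```php
// Toy training: for each property, count which token precedes the known value.
function trainToy(array $model, array $knownValues, array $tokens): array
{
    foreach ($knownValues as $property => $value) {
        $index = array_search($value, $tokens, true);
        if ($index === false || $index === 0) {
            continue; // value not found, or nothing precedes it
        }
        $label = $tokens[$index - 1];
        $model[$property][$label] = ($model[$property][$label] ?? 0) + 1;
    }
    return $model;
}

// Toy prediction: take the token that follows the most frequently seen label.
function predictToy(array $model, array $tokens): array
{
    $predicted = [];
    foreach ($model as $property => $labelCounts) {
        arsort($labelCounts); // most frequent label first
        foreach (array_keys($labelCounts) as $label) {
            // Array keys that look numeric become integers, so cast back to string.
            $index = array_search((string) $label, $tokens, true);
            if ($index !== false && isset($tokens[$index + 1])) {
                $predicted[$property] = $tokens[$index + 1];
                break;
            }
        }
    }
    return $predicted;
}

// Two documents whose invoice numbers were entered by hand.
$model = trainToy([], ['number' => 'A-1001'], ['Invoice', 'No:', 'A-1001', 'Total:', '120.50']);
$model = trainToy($model, ['number' => 'A-1002'], ['Invoice', 'No:', 'A-1002', 'Total:', '80.00']);

// A new document with no manually entered data.
$predicted = predictToy($model, ['Invoice', 'No:', 'B-2001', 'Total:', '15.00']);
print_r($predicted);
```

Here the model learns that the invoice number follows `No:` and applies that to the unseen document, which is the same division of labour as `train()` on known entities and `predict()` on new ones.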
#### Entry discrimination (filtering)
Override the `createEntryDiscriminator()` method of your storage model (a subclass of `PhpDataMiner\Storage\Model\Model`) to set the entry filter:
```php
use PhpDataMiner\Storage\Model\Model;
use PhpDataMiner\Storage\Model\ModelInterface;
// Discriminator and DiscriminatorInterface must be imported as well (namespaces not shown here)

class InvoiceModel extends Model implements ModelInterface
{
    public static function createEntryDiscriminator($invoice): DiscriminatorInterface
    {
        return new Discriminator([
            // Client ID, or null when the invoice has no client
            $invoice->getClient() ? $invoice->getClient()->getId() : null,
            $invoice->getId(),
        ]);
    }
}
```
### Versioning
Version numbers follow [semantic versioning](https://semver.org/).
### TODO
* Natural language toolkit (NLTK) support
* Feature vectors for properties
### Testing
```bash
$ make tests [test_name]
```