{"id":20330259,"url":"https://github.com/apajo/php-data-miner","last_synced_at":"2025-07-30T23:17:03.770Z","repository":{"id":85844588,"uuid":"464692175","full_name":"apajo/php-data-miner","owner":"apajo","description":"Train your Miner with previously entered data. Start collecting data from them automatically. Supports semi-structured data-sctructures (such as PDF invoices) and unstuctured data (free text like emails)","archived":false,"fork":false,"pushed_at":"2022-09-05T23:31:48.000Z","size":226,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-04T12:24:37.397Z","etag":null,"topics":["annotations","dataextraction","nltk","rubix-ml"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apajo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-01T00:39:52.000Z","updated_at":"2022-05-16T16:25:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"316d8d33-9ffa-45b6-8dca-e4be640e0c42","html_url":"https://github.com/apajo/php-data-miner","commit_stats":null,"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"purl":"pkg:github/apajo/php-data-miner","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apajo%2Fphp-data-miner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apajo%2Fphp-data-miner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apajo%2Fphp-data-miner/releases","manife
sts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apajo%2Fphp-data-miner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apajo","download_url":"https://codeload.github.com/apajo/php-data-miner/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apajo%2Fphp-data-miner/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262362048,"owners_count":23299119,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotations","dataextraction","nltk","rubix-ml"],"created_at":"2024-11-14T20:15:43.696Z","updated_at":"2025-06-28T02:03:24.590Z","avatar_url":"https://github.com/apajo.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"PHP Data Miner\n============\n\n### Introduction\n\n[php-data-miner](https://github.com/apajo/php-data-miner) extracts data\nfrom structured data formats (such as PDF documents). 
\n\n### Prerequisites\n\n* Unix-like OS\n* GNU Make\n* PHP (\u003e=7.4)\n* NodeJS (\u003e=8)\n\n### Installation\n\n```bash \n$ sudo apt-get install make gcc gfortran php-dev libopenblas-dev liblapacke-dev re2c build-essential\n```\n\n### Usage\n\n#### Annotation\n\nAnnotate your model with `@Model()` and its properties with `@Property()`.\n\n```php\nuse PhpDataMiner\\Model\\Annotation\\Model;\nuse PhpDataMiner\\Model\\Annotation\\Property;\n\n/**\n * @Model()\n */\nclass Invoice\n{\n    /**\n     * @var string\n     * @Property()\n     */\n    protected string $number;\n}\n```\n\n\n#### Create your miner\n\n```php\n$miner = $this-\u003eminer-\u003ecreate($entity, [\n    'storage' =\u003e new CustomStorage(),\n    'property_types' =\u003e [\n        new FloatProperty(),\n        new IntegerProperty(),\n        new DateProperty(),\n        new Property(),\n    ]\n]);\n\n$pdfContents = shell_exec('pdftotext -layout invoice.pdf -');\n\n$doc = $miner-\u003enormalize($pdfContents, [\n    'filters' =\u003e [\n        DateFilter::class,\n        ColonFilter::class,\n        Section::class,\n        WordTree::class,\n    ]\n]);\n\n$entity = new Invoice();\n```\n\n\u003e You need to have __pdftotext__ installed to read PDF contents as shown above.\n\n* Filters (or transformers) transform and normalize the content\n* The __WordTree__ filter is a special kind of tokenizer that nests and groups the contents (by rows, columns, sentences, etc.)\n\n\u003e It's recommended that you place your tokenizers last in the filters list\n\n#### Training\n\nTrain your model with data you've already entered (supervised learning):\n\n```php\n...\n\n$trainedProperties = $miner-\u003etrain($entity, $doc);\n```\n\n#### Mining (or predicting)\n\nApply predicted data to your model:\n\n```php\n...\n\n$predictedProperties = $miner-\u003epredict($entity, $doc);\n```\n\n#### Entry discrimination (filtering)\n\nEdit your storage model 
`PhpDataMiner\\Storage\\Model\\Model::createEntryDiscriminator()` method to set the entry filter:\n\n```php\n\nuse PhpDataMiner\\Storage\\Model\\Model;\nuse PhpDataMiner\\Storage\\Model\\ModelInterface;\n\nclass InvoiceModel extends Model implements ModelInterface\n{\n    public static function createEntryDiscriminator($invoice): DiscriminatorInterface\n    {\n        // Discriminate entries by client ID (when a client is set) and invoice ID\n        return new Discriminator([\n            $invoice-\u003egetClient() ? $invoice-\u003egetClient()-\u003egetId() : null,\n            $invoice-\u003egetId(),\n        ]);\n    }\n}\n```\n\n### Versioning\n\nVersion numbers follow [semantic versioning](https://semver.org/).\n\n### TODO\n\n* Natural language toolkit (NLTK) support \n* Feature vectors for properties\n\n### Testing\n\n```bash \n$ make tests [test_name]\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapajo%2Fphp-data-miner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapajo%2Fphp-data-miner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapajo%2Fphp-data-miner/lists"}