{"id":26716478,"url":"https://gitlab.com/sean-c/pdf_rules","last_synced_at":"2025-04-14T01:25:51.897Z","repository":{"id":57451469,"uuid":"19287805","full_name":"sean-c/pdf_rules","owner":"sean-c","description":"Turn PDFs into CSVs by defining rules","archived":false,"fork":false,"pushed_at":null,"size":null,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":null,"default_branch":"master","last_synced_at":"2025-03-27T15:32:51.252Z","etag":null,"topics":["Data Cleaning","automation","data","data parsing"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-2.1","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":null,"metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-06-09T20:38:28.786Z","updated_at":"2021-12-24T12:03:53.905Z","dependencies_parsed_at":"2022-09-10T08:01:06.925Z","dependency_job_id":null,"html_url":"https://gitlab.com/sean-c/pdf_rules","commit_stats":null,"previous_names":[],"tags_count":1,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/repositories/sean-c%2Fpdf_rules","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/repositories/sean-c%2Fpdf_rules/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/repositories/sean-c%2Fpdf_rules/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/repositories/sean-c%2Fpdf_rules/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/owners/sean-c","download_url":"https://gitlab.com/sean-c/pdf_rules/-/archive/master/pdf_rules-master.zip","host":{"name":"gitlab.com","url":"https://gitlab.com","kind":"gitlab","repositories_count":45186
57,"owners_count":6933,"icon_url":"https://github.com/gitlab.png","version":null,"created_at":"2022-05-30T11:31:42.605Z","updated_at":"2024-07-18T11:24:13.055Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/owners"}},"keywords":["Data Cleaning","automation","data","data parsing"],"created_at":"2025-03-27T15:27:43.349Z","updated_at":"2025-04-14T01:25:51.869Z","avatar_url":null,"language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# pdf_rules\n\nThis library is used to extract information from PDFs and output a CSV.\n\nThe library was designed to automate the extraction of data from invoices and other such documents, which hold data in hierarchical structures. The user must first define this structure with the `add_level` and `add_field` functions; the library is designed to allow maximum user control.\n\n## Installation\n\n### From PyPI\n\n\tpython -m pip install pdf_rules\n\n### From Source\n\n\tgit clone https://gitlab.com/sean-c/pdf_rules\n\tcd pdf_rules\n\tpython setup.py install\n\n### Dependencies\n\nThe module `pdftotext` is currently used to read the pdf files into a txt format before applying the rules, but this is currently not installed by default on Windows.  
On Windows, you will have to find another way to convert the pdfs into txt and then pass the txt into the `PDF` class.\n\nPlease see [wsl-wrapper](https://gitlab.com/sean-c/wsl-wrapper) for a way to convert pdfs to good quality txts on Windows.\n\n## Tutorial\n\nFor the purpose of this example, please refer to `tests/test_pdf.pdf` (or `tests/test_txt.txt` if you couldn't install `pdftotext`). The example assumes the working directory to be the location of this document.\n\n### The Basics (no levels)\n\nBefore we extract data, we need to create an instance of the `pdf_rules.PDF` class, passing `tests/test_pdf.pdf` as an argument.\n\n\timport pdf_rules\n\n\tpdf = pdf_rules.PDF('tests/test_pdf.pdf')\n\n\tprint(pdf)\n\nThis creates the object `pdf` and reads the file `tests/test_pdf.pdf` into it as a list of strings; you can use `tests/test_txt.txt` instead by passing it in place of the pdf version.  The `print` function will print the file as held by the `pdf` object (with line numbers). Next, data can be found with the `add_field` method:\n\n\tpdf.add_field(\n\t\t'Account',\n\t\tlambda rd, i, l: 'Account' in l,\n\t\tlambda rd, i, l: l[-7:])\n\nHere 'Account' is the field heading and the two `lambda` functions are the 'trigger' and the 'rule'. The trigger and rule have to follow the format `lambda rd, i, l: \u003cexpression\u003e`, where `rd` is the whole document, `i` is the line number, and `l` is the line. Failure to pass `rd`, `i`, and `l` in that order will result in an exception.\n\nThe library reads the document and stops when the 'trigger' returns `True`; the data is then extracted by the 'rule' function. The data is kept in `pdf.hierarchy`. 
To create a csv:\n\n\tcsv = pdf_rules.CSV(pdf)\n\tprint(csv)\n\tcsv.write()\n\nThis should output:\n\n\t['Account']\n\t['ABC1234']\n\nand the same output should be written to `tests/test_pdf_pdfrules.csv`.\n\n### Creating Levels\n\nIn order to extract the invoice data, we must create 'levels'. These levels allow the extraction of recurring similar data, like the addresses in the example pdf.\n\n\tpdf.add_level(\n                lambda rd, i, l: 'Charges' in l,\n                lambda rd, i, l: l == '.')\n\tpdf.add_field(\n                'Address',\n                lambda rd, i, l: 'Charges' in l,\n                lambda rd, i, l: l.split(' - ')[1],\n\n                # IMPORTANT: always remember to pass the appropriate level!\n                level=1)\n\nHere, an extra level is added, which starts on every line containing 'Charges', and ends on every line containing just the '.' character. The tables each contain some unique data, which we can collect by creating another field, 'Address'.\n\n\u003e Note that all `pdf` objects have a level 0, covering the whole document, so the `add_field` function needs the level stated if it is not intended for level 0.\n\nThe output should be:\n\n\t['Account', 'Address']\n\t['ABC1234', '1234 Fake Street, London W15 6GH']\n\t['ABC1234', '5678 Fake Ave., Glasgow G3 6HJ']\n\n\u003e Note that 'Address' items found in level 1 inherited the 'Account' from level 0.\n\n### PDF.show\n\nYou can use `pdf.show()` to display a view of the pdf with the rows highlighted according to the levels picked up by all the `add_level` calls.  
This can help with troubleshooting and fine-tuning the `add_level` trigger arguments.\n\nAs of v1.3, `pdf.show()` has `T` and `D` in the left margin to indicate when an `add_field` trigger or rule callback is triggered.\n\n`PDF.show` currently relies on curses, so it will only work from a terminal.\n\n### Levels Within Levels\n\nThe `add_level` function can act within levels, where all data found in sub-levels will inherit data from higher ones.\n\nAdding another level, we can extract more data:\n\n\timport re\n\n\tpdf.add_level(\n                lambda rd, i, l: re.search(r'[A-Z]{2}/\\d{4}/[A-Z]', l),\n                lambda rd, i, l: False)\n\tpdf.add_field(\n                'ID',\n                lambda rd, i, l: re.search(r'[A-Z]{2}/\\d{4}/[A-Z]', l),\n                lambda rd, i, l: l.strip(),\n                level=2)\n\n\u003e Notice that the end condition for level 2 (the second argument) always returns `False`; this will cut off the table just before the start of the next one.\n\nOur output so far:\n\n\t['Account', 'Address', 'ID']\n\t['ABC1234', '1234 Fake Street, London W15 6GH', 'IG/1234/H']\n\t['ABC1234', '1234 Fake Street, London W15 6GH', 'ID/5678/I']\n\t['ABC1234', '1234 Fake Street, London W15 6GH', 'RD/9012/P']\n\t['ABC1234', '1234 Fake Street, London W15 6GH', 'IN/5724/O']\n\t['ABC1234', '5678 Fake Ave., Glasgow G3 6HJ', 'DH/0471/U']\n\t['ABC1234', '5678 Fake Ave., Glasgow G3 6HJ', 'JF/8364/N']\n\t['ABC1234', '5678 Fake Ave., Glasgow G3 6HJ', 'HD/1684/Q']\n\n\nAdding yet another level to find each line with charges, we will trigger on every line containing a date. The line 'Invoice Date 01/10/2020' will not cause a trigger, as it is not in level 2.\n\n\tpdf.add_level(\n                lambda rd, i, l: re.search(r'\\d{2}/\\d{2}/\\d{4}', l),\n                lambda rd, i, l: False)\n\tpdf.add_field(\n                'Cost',\n                lambda rd, i, l: True,\n                lambda rd, i, l: pdf_rules.listify(l)[-1],\n                level=3)\n\n\u003e Note the use of 
`pdf_rules.listify`, a helper function that crudely converts the line into a list, delimited by two or more spaces.\n\u003e Also note the 'trigger' for 'Cost' is set to `True`; this means it will trigger on every line, so use with caution.\n\nWe now have:\n\n\t['Account', 'Address', 'ID', 'Cost']\n\t['ABC1234', '1234 Fake Street, London W15 6GH', 'IG/1234/H', '£100.00']\n\t['ABC1234', '1234 Fake Street, London W15 6GH', 'IG/1234/H', '-$23.00']\n\t['ABC1234', '1234 Fake Street, London W15 6GH', 'IG/1234/H', '£50.00']\n\t['ABC1234', '1234 Fake Street, London W15 6GH', 'ID/5678/I', '£52.00']\n\t['ABC1234', '1234 Fake Street, London W15 6GH', 'RD/9012/P', '£48.00']\n\t['ABC1234', '1234 Fake Street, London W15 6GH', 'IN/5724/O', '-£324.00']\n\t['ABC1234', '5678 Fake Ave., Glasgow G3 6HJ', 'DH/0471/U', '£64.00']\n\t['ABC1234', '5678 Fake Ave., Glasgow G3 6HJ', 'JF/8364/N', '£83.00']\n\t['ABC1234', '5678 Fake Ave., Glasgow G3 6HJ', 'HD/1684/Q', '£45.00']\n\nFrom here we could keep adding rules in order to catch all the data, just as we did with the cost field.\n\n### Helpful Things\n\nThere are some very useful features that I won't go into here, but you can see `tutorial.py` for examples of how to use them.\n\n#### Fallbacks\n\nAn optional argument of the `PDF.add_field` function is the `fallback`.  This is `None` by default, but if you pass `fallback='1234'`, then pdf_rules will use '1234' for that field whenever the trigger doesn't trigger, or the rule returns `None` or throws an exception.\n\n#### Get last Entry\n\nYou can get the last entry found by pdf_rules for a given field with `pdf.last_entry('field')`.\n\n\n## To Do\n\n1. Highlight matches in `PDF.show`\n2. Offer alternatives to curses for `PDF.show`, maybe use `pillow` to create an image and show in popup.\n0. 
More tests\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/gitlab.com%2Fsean-c%2Fpdf_rules","html_url":"https://awesome.ecosyste.ms/projects/gitlab.com%2Fsean-c%2Fpdf_rules","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/gitlab.com%2Fsean-c%2Fpdf_rules/lists"}