{"id":37072803,"url":"https://github.com/leandroroser/prettyparser","last_synced_at":"2026-01-14T08:32:51.181Z","repository":{"id":57454747,"uuid":"426059336","full_name":"leandroroser/prettyparser","owner":"leandroroser","description":"Parallel processing and parsing PDF and TXT files, and Python objects with text (str, list) using rules (regular expressions). ","archived":false,"fork":false,"pushed_at":"2023-01-29T22:03:01.000Z","size":109,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-22T19:11:32.965Z","etag":null,"topics":["pdf-parser","regex"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/prettyparser/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/leandroroser.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-11-09T02:05:02.000Z","updated_at":"2023-03-08T00:25:13.000Z","dependencies_parsed_at":"2023-02-16T01:25:15.244Z","dependency_job_id":null,"html_url":"https://github.com/leandroroser/prettyparser","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/leandroroser/prettyparser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leandroroser%2Fprettyparser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leandroroser%2Fprettyparser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leandroroser%2Fprettyparser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leandroroser%2Fprettyparser/manifests","owner_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/leandroroser","download_url":"https://codeload.github.com/leandroroser/prettyparser/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leandroroser%2Fprettyparser/sbom","scorecard":{"id":581919,"data":{"date":"2025-08-11","repo":{"name":"github.com/leandroroser/prettyparser","commit":"479412e8b0986bdd1e2d0b8997f373115dd4555e"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":1.7,"checks":[{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged 
into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security 
policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE.txt:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE.txt:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'main'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":0,"reason":"16 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GHSA-3ww4-gg4f-jr7f","Warn: Project is vulnerable to: GHSA-5cpq-8wj7-hf2v","Warn: Project is vulnerable to: PYSEC-2024-225 / GHSA-6vqw-3v5j-54x4","Warn: Project is vulnerable to: GHSA-9v9h-cgj8-h64p","Warn: Project is vulnerable to: GHSA-h4gh-qq45-vh27","Warn: Project is vulnerable to: PYSEC-2023-254 / GHSA-jfhm-5ghh-2f97","Warn: 
Project is vulnerable to: GHSA-jm77-qphf-c4w8","Warn: Project is vulnerable to: GHSA-v8gr-m533-ghj9","Warn: Project is vulnerable to: GHSA-w7pp-m8wf-vj6r","Warn: Project is vulnerable to: GHSA-x4qr-2fvf-3mr5","Warn: Project is vulnerable to: GHSA-3f63-hfp8-52jq","Warn: Project is vulnerable to: GHSA-44wm-f244-xhp3","Warn: Project is vulnerable to: PYSEC-2023-227 / GHSA-8ghj-p4vj-mr35","Warn: Project is vulnerable to: GHSA-j7hp-h8jx-5ppr","Warn: Project is vulnerable to: PYSEC-2023-175","Warn: Project is vulnerable to: GHSA-g7vv-2v7x-gj9p"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-20T19:23:59.970Z","repository_id":57454747,"created_at":"2025-08-20T19:23:59.970Z","updated_at":"2025-08-20T19:23:59.970Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28414232,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:31:27.429Z","status":"ssl_error","status_checked_at":"2026-01-14T08:31:19.098Z","response_time":107,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pdf-parser","regex"],"created_at":"2026-01-14T08:32:50.567Z","updated_at":"2026-01-14T08:32:51.162Z","avatar_url":"https://github.com/leandroroser.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n![icon](https://user-images.githubusercontent.com/10769732/140857203-e0580717-52c3-4cdd-affc-00ad5bf0a526.png)\n\n\n\nprettyparser is a Python library for parallel processing and parsing PDF/TXT and Python objects with text (str, list) using rules (regular expressions). \nIn case of PDF files, the package reads the content using pdfplumber and then performs a series of\ndata manipulations to generate a higher quality output, removing the boilerplate code needed to read/process/write the content of multiple files with multiple pages. A custom processing function using pdfplumber that takes a page and returns a processed text is also allowed. 
Additional data processing steps can be added via custom regular expressions, which are compiled for improved speed.\n\n\n## Installation\n\n```\n$ git clone https://github.com/leandroroser/prettyparser\n$ cd prettyparser\n$ pip install -e .\n```\n\nor\n\n```\n$ pip install prettyparser\n```\n\n\n## Example: processing a series of PDF files\n\n\n```Python\nimport regex as re\nfrom prettyparser import PrettyParser\n\nfiles = [\"./BOOKS/PDF/PDF1.pdf\", \"./BOOKS/PDF/PDF2.pdf\"]\noutput = \"./BOOKS/TXT\"\nparser = PrettyParser(files, None, output, mode = 'pdf',\n                      args = [[r\"(\\n\\s*\\d+\\s*\\n)|(\\n\\s*\\d+\\s*$)\", r'\\n\\n'],\n                            [r\"\\n\\s*-\\d-\\s*\\n\", r'\\n\\n'], \n                            [r\"\\n\\s*(\\* *)+\\s*\\n\", r'\\n\\n'],\n                            [r\"__some_header_text\", r'\\n\\n', re.IGNORECASE]],\n                            remove_whitelines = True,\n                            paragraphs_spacing = 1,\n                            remove_hyphen_eol = True)\nparser.run()\n```\n\n\n## Example: processing a folder with multiple PDF files\n\n\n```Python\nimport regex as re\nfrom prettyparser import PrettyParser\n\ndirectory = \"./BOOKS/PDF\"\noutput = \"./BOOKS/TXT\"\nparser = PrettyParser(None, directory, output, mode = 'pdf',\n                      args = [[r\"(\\n\\s*\\d+\\s*\\n)|(\\n\\s*\\d+\\s*$)\", r'\\n\\n'],\n                            [r\"\\n\\s*-\\d-\\s*\\n\", r'\\n\\n'], \n                            [r\"\\n\\s*(\\* *)+\\s*\\n\", r'\\n\\n'],\n                            [r\"__some_header_text\", r'\\n\\n', re.IGNORECASE]],\n                            remove_whitelines = True,\n                            paragraphs_spacing = 1,\n                            remove_hyphen_eol = True)\nparser.run()\n```\n\n## Example: processing a folder with multiple TXT files\nLet's assume that the previous output isn't good enough and needs additional corrections. 
\nA quicker way to test additional corrections is to run the parser again on the previous TXT output:\n\n```Python\nimport regex as re\nfrom prettyparser import PrettyParser\n\ndirectory = \"./BOOKS/TXT\"\noutput = \"./BOOKS/TXT_REPARSED\"\nparser = PrettyParser(None, directory, output,  mode = 'txt', \n                        args=[[r\"some other header.*\\d+\", r''],\n                            [r\"^\\d+.*\", r'', re.MULTILINE], \n                            [r\"([A-Z]+)( *\\n)([A-Z]+)\", r'\\1\\3']],\n                            remove_whitelines = True,\n                            paragraphs_spacing = 1,\n                            remove_hyphen_eol = True)\nparser.run()\n```\n\n## Example: processing a Python str for a quick test of the app\n\n```Python\nimport regex as re\nfrom prettyparser import PrettyParser\n\n\ntxt = \"\"\"\nheader to remove\n\nThis is a text with multiple problems. For exam-\nple the latter word can be joined. \nThe portions of this line can be\njoined\nin a single line.\nHERE ALSO IS SOME\nUPPERCASE TEXT\nTO JOIN\nSome Other Ugly Stuff To Remove IGNORING Case. \n\nRemove the line below:\n\n* * * \n\nRemove empty lines and finally separate paragraphs with a blank line.\n\n\nBelow is the page number-\u003e.\n99\n\"\"\"\nparser = PrettyParser(txt, mode = \"pyobj\", args = [[r\"\\s*header to remove\\s*\\n\",r\"\"],\n                                                    [r\"(\\n\\s*\\d+\\s*\\n)\", r'\\n\\n'],\n                                                    [r\"\\n\\s*(\\* *)+\\s*\\n\", r'\\n\\n'],\n                                                    [r\"\\n.*some other ugly stuff.*\", \n                                                    r'\\n\\n', re.IGNORECASE]],\n                                                    remove_whitelines = True,\n                                                    paragraphs_spacing = 1,\n                                                    remove_hyphen_eol = True)\noutput = parser.run()\nprint(output[0])\n```\n\n```\nThis is a text with multiple problems. 
For example the latter word can be joined.\n\nThe portions of this line can be joined in a single line.\n\nHERE ALSO IS SOME UPPERCASE TEXT TO JOIN\n\nRemove the line below: \n\nRemove empty lines and finally separate paragraphs with a blank line.\n\nBelow is the page number-\u003e.\n```\n\n## Running from the command line\n\n```\n prettyparser --directories /home/BOOKS --output /home/BOOKS_PARSED --mode 'pdf'\n```\n\n\n\nArguments\n---------\n- **files (list or str)**: Path to parse for pdf/txt operations. If a string is passed, it will be treated as a directory when mode is 'pdf' or 'txt'. If a str or list is passed when mode is 'pyobj', it will be treated as a str/list of text files already loaded in memory in the corresponding object\n- **output (str)**: output directory\n- **args (list)**: list of tuples of the form (regex, replacement, flags). The flags element can be omitted\n- **mode (str)**: 'pdf', 'txt' or 'pyobj' (the latter for Python lists and strings)\n- **default (bool)**: if True, perform several default cleanup operations (default)\n- **remove_whitelines (bool)**: if True, remove whitespaces\n- **paragraphs_spacing (int)**: number of newlines between paragraphs\n- **page_spacing (str)**: string to insert between pages\n- **remove_hyphen_eol (bool)**: if True, remove end of line hyphens and merge subwords\n- **custom_pdf_fun (Callable)**: custom function to parse pdf files.\n  It must accept a pdfplumber page as argument and return a text to be joined with previous pages\n- **overwrite (bool)**: Overwrite file if exists. Default False\n- **n_jobs (int)**: Number of jobs. Default: number of cores - 1\n\nCurrent language support for the default parser\n------------------------------------------------\nEnglish, Spanish, German, French, Portuguese\n\nLicense\n-------\n© Leandro Roser, 2023. 
Licensed under an [Apache-2](https://github.com/leandroroser/prettyparser/blob/main/LICENSE.txt) license.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleandroroser%2Fprettyparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleandroroser%2Fprettyparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleandroroser%2Fprettyparser/lists"}