{"id":48574080,"url":"https://github.com/pd3f/dehyphen","last_synced_at":"2026-04-08T15:36:40.227Z","repository":{"id":43320921,"uuid":"275787240","full_name":"pd3f/dehyphen","owner":"pd3f","description":"📜 Dehyphenation of broken text (mainly German), i.e., extracted from a PDF","archived":false,"fork":false,"pushed_at":"2022-03-08T12:43:02.000Z","size":197,"stargazers_count":39,"open_issues_count":8,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-03-05T18:24:38.425Z","etag":null,"topics":["dehyphenation","flair","flair-embeddings","german","hyphen","hyphens","nlp","pd3f","pdf","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pd3f.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-06-29T11:02:55.000Z","updated_at":"2025-05-12T07:51:13.000Z","dependencies_parsed_at":"2022-08-03T06:30:22.025Z","dependency_job_id":null,"html_url":"https://github.com/pd3f/dehyphen","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/pd3f/dehyphen","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pd3f%2Fdehyphen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pd3f%2Fdehyphen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pd3f%2Fdehyphen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pd3f%2Fdehyphen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pd3f","download_url":"https://codeload.github.com/pd3f/dehyphen/tar.gz/re
fs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pd3f%2Fdehyphen/sbom","scorecard":{"id":725354,"data":{"date":"2025-08-11","repo":{"name":"github.com/pd3f/dehyphen","commit":"92306a6a2f0936695d0b203b04195e782477d713"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":1.7,"checks":[{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous 
patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security 
policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: GNU General Public License v3.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":0,"reason":"31 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2024-48 / GHSA-fj7x-q9j7-g6q6","Warn: Project is vulnerable to: PYSEC-2022-42986 / GHSA-43fp-rhv2-5gv8","Warn: Project is vulnerable to: PYSEC-2023-135 / GHSA-xqr8-7jwr-rhp7","Warn: Project is vulnerable to: PYSEC-2024-60 / GHSA-jjg7-2v4v-x38h","Warn: Project is vulnerable to: GHSA-6p56-wp2h-9hxr","Warn: Project is 
vulnerable to: GHSA-fpfv-jqm9-f5jm","Warn: Project is vulnerable to: PYSEC-2020-92 / GHSA-hj5v-574p-mj7c","Warn: Project is vulnerable to: PYSEC-2022-42969","Warn: Project is vulnerable to: GHSA-9hjg-9r4m-mvj7","Warn: Project is vulnerable to: GHSA-9wx4-h78v-vm56","Warn: Project is vulnerable to: PYSEC-2023-74 / GHSA-j8r2-6x86-q33q","Warn: Project is vulnerable to: GHSA-3749-ghw9-m3mg","Warn: Project is vulnerable to: PYSEC-2022-43015 / GHSA-47fc-vmwq-366v","Warn: Project is vulnerable to: PYSEC-2025-41 / GHSA-53q9-r3pm-6pq6","Warn: Project is vulnerable to: PYSEC-2024-252 / GHSA-5pcm-hx3q-hm94","Warn: Project is vulnerable to: GHSA-887c-mr87-cxwp","Warn: Project is vulnerable to: PYSEC-2024-251 / GHSA-pg7h-5qx3-wjr3","Warn: Project is vulnerable to: PYSEC-2024-250","Warn: Project is vulnerable to: PYSEC-2024-259","Warn: Project is vulnerable to: GHSA-g7vv-2v7x-gj9p","Warn: Project is vulnerable to: GHSA-34jh-p97f-mpxf","Warn: Project is vulnerable to: PYSEC-2023-212 / GHSA-g4mx-q9vg-27p4","Warn: Project is vulnerable to: PYSEC-2023-207 / GHSA-gwvm-45gx-3cf8","Warn: Project is vulnerable to: PYSEC-2019-133 / GHSA-mh33-7rrq-662w","Warn: Project is vulnerable to: GHSA-pq67-6m6q-mj2v","Warn: Project is vulnerable to: PYSEC-2019-132 / GHSA-r64q-w8jr-g9qp","Warn: Project is vulnerable to: PYSEC-2023-192 / GHSA-v845-jxx5-vc9f","Warn: Project is vulnerable to: PYSEC-2020-148 / GHSA-wqvq-5m8c-6g24","Warn: Project is vulnerable to: PYSEC-2018-32 / GHSA-www2-v7xj-xrc6","Warn: Project is vulnerable to: PYSEC-2021-108","Warn: Project is vulnerable to: GHSA-jfmj-5v4g-7637"],"documentation":{"short":"Determines if the project has open, known unfixed 
vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-22T12:37:49.597Z","repository_id":43320921,"created_at":"2025-08-22T12:37:49.597Z","updated_at":"2025-08-22T12:37:49.597Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31562695,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T14:31:17.711Z","status":"ssl_error","status_checked_at":"2026-04-08T14:31:17.202Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dehyphenation","flair","flair-embeddings","german","hyphen","hyphens","nlp","pd3f","pdf","python"],"created_at":"2026-04-08T15:36:40.088Z","updated_at":"2026-04-08T15:36:40.220Z","avatar_url":"https://github.com/pd3f.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# `dehyphen` [![PyPI](https://img.shields.io/pypi/v/dehyphen.svg)](https://pypi.org/project/dehyphen/) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/dehyphen.svg)](https://pypi.org/project/dehyphen/) [![PyPI - Downloads](https://img.shields.io/pypi/dm/dehyphen)](https://pypistats.org/packages/dehyphen)\n\n*Experimental, use with care.*\n\nPython package for **dehyphenation of broken text**, i.e., extracted from a PDF. 
Mainly for German, but it works for other languages as well.\n\n`dehyphen` tries to reconstruct the original continuous text by choosing the most probable way to join lines or paragraphs (and remove hyphens).\nSeveral options are scored by calculating the [perplexity](https://en.wikipedia.org/wiki/Perplexity#Perplexity_per_word) of the text, using [Flair](https://github.com/flairNLP/flair)'s character-based [language models](https://machinelearningmastery.com/statistical-language-modeling-and-neural-language-models/).\nBased on these scores, the best-fitting option is taken as the guess for the original text.\n\nCheck out the PDF text extraction pipeline [pd3f](https://github.com/pd3f/pd3f) that uses `dehyphen` internally.\n\n\n## An Example\n\nFor this input\n\n\u003e die Bedeutung der finan-\n\u003e\n\u003e ziellen Interessen der Union\n\n`dehyphen` joins the lines and removes the '-'.\n\n\u003e die Bedeutung der **finanziellen** Interessen der Union\n\nBut in this example\n\n\u003e Auch andere EU-\n\u003e\n\u003e Staaten, wie bspw. Polen,\n\nthe lines are also joined but the hyphen is kept (because it's part of the word).\n\n\u003e Auch andere **EU-Staaten**, wie bspw. Polen,\n\n\n## Installation\n\n```bash\npip install dehyphen\n```\n\nor\n\n```bash\npoetry add dehyphen\n```\n\n## Usage\n\n```python\nfrom dehyphen import FlairScorer\n\nscorer = FlairScorer(lang=\"de\")\n```\n\nYou need to set `lang` to `de` for German, `en` for English, `es` for Spanish, etc. Otherwise, a multi-language model will be chosen as the default. [See this section in the source code for more models](https://github.com/flairNLP/flair/blob/8c09e62d9a5a3c227b9ca0fb9f214de9620d4ca0/flair/embeddings/token.py#L431) (but omit the \"-backwards\" and \"-forwards\" as specified by Flair). 
[Some are described here](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/FLAIR_EMBEDDINGS.md) and [there is another repo with some more models](https://github.com/flairNLP/flair-lms).\n\nTo speed up computations, choose a `-fast` language model from Flair. However, there are currently only a few.\nThere is, for instance, a multi-language one named `multi-v0` that covers English, German, French and others.\nUnfortunately, there is no fast German-only model right now.\n\nUsing CUDA (with a GPU) dramatically improves performance.\n\n### 1. remove hyphens from the end of a line (within paragraphs)\n\n```python\n# returns cleaned paragraph\nscorer.dehyphen(special_format)\n```\n\nThe input text has to be in a special format. Paragraphs should be separated by two newline characters (`\\n\\n`). Each line should end with a single newline (`\\n`). Several helper functions exist to transform the data into the required format.\n\n### 2. join paragraphs, e.g., to reverse a page break\n\n```python\n# returns the joined paragraphs if the language model thinks they were split, otherwise `None`\nscorer.is_split_paragraph(paragraph_1, paragraph_2)\n```\n\n## Example\n\n```python\nfrom dehyphen import FlairScorer, text_to_format\n\nscorer = FlairScorer(lang=\"de\")\n\nsome_german_text = \"\"\"Zwar wird durch die Einführung eines eigenen Strafgesetzes die Bedeutung der finan-\nziellen Interessen der Union gewiss unterstrichen, dennoch erscheint die Aufspaltung\ndes strafrechtlichen Vermögensschutzes zweifelhaft, insbesondere soweit es densel-\nben Schutzgegenstand, nämlich die vermögensrelevanten Interessen der Union be-\ntrifft. Zum einen wird es den Normunterworfenen ohne Not erschwert, die zu befolgen-\nden Strafgesetze zu erfassen. Zum anderen ergeben sich potentielle Auslegungsdif-\n\nferenzen durch die Verwendung teilweise abweichender Terminologie (finanzielle In-\nteressen vs. Vermögen). 
Schließlich wird der Schutz besagter Interessen ohnedies\nbislang innerhalb des StGB gewährleistet. Daher empfiehlt es sich u.E., sämtliche Re-\ngelungen des RegE in das StGB zu integrieren, soweit entsprechende Neuregelungen\nüberhaupt erforderlich sind. Hierdurch wird sich auch eine klarere Trennung von Straf-\nrecht und Verwaltungsrecht erreichen lassen.\n\nDas Erfolgsverständnis entspricht daher eher dem wesentlich weiteren Betrugsbegriff\nbspw. des US-amerikanischen Rechts (Federal Law bspw. Fraud, Defraud, Wire-\nFraud, Bank-Fraud, 18.U.S.C. §1341 ff.(2016)) , die teilweise auch ganz auf einen\nSchaden verzichten. Fraud erfasst auch viele untreue- und unterschlagungsähnliche\nVerhaltensweisen sowie betrügerische Verfügungen als solche. Auch andere EU-\nStaaten, wie bspw. Polen, liegen im Hinblick auf den Erfolg näher bei der Richtlinie\nals bei der deutschen Schadensdogmatik.\n\"\"\"\n\nspecial_format = text_to_format(some_german_text)\nfixed_hyphens = scorer.dehyphen(special_format)\n\n# checks if the first two paragraphs can be joined, useful to, e.g., reverse page breaks.\n# is_split_paragraph takes the two paragraphs as separate arguments, so unpack the slice.\njoined_paragraph = scorer.is_split_paragraph(*fixed_hyphens[:2])\n\nprint(joined_paragraph)\n```\n**Output text**:\n\nZwar wird durch die Einführung eines eigenen Strafgesetzes die Bedeutung der finanziellen Interessen der Union gewiss unterstrichen, dennoch erscheint die Aufspaltung des strafrechtlichen Vermögensschutzes zweifelhaft, insbesondere soweit es denselben Schutzgegenstand, nämlich die vermögensrelevanten Interessen der Union betrifft. Zum einen wird es den Norm unterworfenen ohne Not erschwert, die zubefolgenden Strafgesetze zu erfassen. Zum anderen ergeben sich potentielle **Auslegungsdifferenzen** durch die Verwendung teilweise abweichender Terminologie (finanzielle Interessen vs. Vermögen). Schließlich wird der Schutz besagter Interessen ohnediesbislang innerhalb des StGB gewährleistet. 
Daher empfiehlt es sich u.E., sämtliche Regelungen des RegE in das StGB zu integrieren, soweit entsprechende Neuregelungenüberhaupt erforderlich sind. Hierdurch wird sich auch eine klarere Trennung von Strafrecht und Verwaltungsrecht erreichen lassen.\n\n*Hyphens are removed, and the paragraphs are joined at the word **Auslegungsdifferenzen**.*\n\n```python\nprint(fixed_hyphens[-1])\n```\n**Output text**:\n\nDas Erfolgsverständnis entspricht daher eher dem wesentlich weiteren Betrugsbegriff bspw. des US-amerikanischen Rechts (Federal Law bspw. Fraud, Defraud, **Wire-Fraud**, Bank-Fraud, 18.U.S.C. §1341 ff.(2016)), die teilweise auch ganz auf einen Schaden verzichten. Fraud erfasst auch viele untreue- und unterschlagungsähnliche Verhaltensweisen sowie betrügerische Verfügungen als solche. Auch andere **EU-Staaten**, wie bspw. Polen, liegen im Hinblick auf den Erfolg näher bei der Richtlinie als bei der deutschen Schadensdogmatik und Verwaltungsrecht erreichen lassen.\n\n***EU-Staaten** \u0026 **Wire-Fraud** are not dehyphenated.*\n\n\n## License\n\nGPLv3","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpd3f%2Fdehyphen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpd3f%2Fdehyphen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpd3f%2Fdehyphen/lists"}