{"id":34609606,"url":"https://github.com/amir-zeldes/hebpipe","last_synced_at":"2025-12-24T14:03:04.826Z","repository":{"id":42386529,"uuid":"148917890","full_name":"amir-zeldes/HebPipe","owner":"amir-zeldes","description":"An NLP pipeline for Hebrew","archived":false,"fork":false,"pushed_at":"2025-06-16T18:03:59.000Z","size":8836,"stargazers_count":38,"open_issues_count":4,"forks_count":14,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-08-25T01:38:11.589Z","etag":null,"topics":["hebrew","hebrew-nlp","lemmatization","morphological-analysis","nlp","part-of-speech-tagger","universal-dependencies"],"latest_commit_sha":null,"homepage":"","language":"Lex","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amir-zeldes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-09-15T16:10:05.000Z","updated_at":"2025-06-16T18:02:15.000Z","dependencies_parsed_at":"2025-06-16T19:33:31.363Z","dependency_job_id":null,"html_url":"https://github.com/amir-zeldes/HebPipe","commit_stats":{"total_commits":105,"total_committers":5,"mean_commits":21.0,"dds":0.2952380952380952,"last_synced_commit":"4f395349b58a055c778d4db46d8d9f54daa71794"},"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"purl":"pkg:github/amir-zeldes/HebPipe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amir-zeldes%2FHebPipe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amir-zeldes%2FHebPipe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/amir-zeldes%2FHebPipe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amir-zeldes%2FHebPipe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amir-zeldes","download_url":"https://codeload.github.com/amir-zeldes/HebPipe/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amir-zeldes%2FHebPipe/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28003721,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-24T02:00:07.193Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hebrew","hebrew-nlp","lemmatization","morphological-analysis","nlp","part-of-speech-tagger","universal-dependencies"],"created_at":"2025-12-24T14:02:40.035Z","updated_at":"2025-12-24T14:03:04.806Z","avatar_url":"https://github.com/amir-zeldes.png","language":"Lex","funding_links":[],"categories":[],"sub_categories":[],"readme":"# HebPipe Hebrew NLP Pipeline\n\nA simple NLP pipeline for Hebrew text in UTF-8 encoding, using standard components. 
Basic features:\n\n  * Performs end to end processing, optionally skipping steps as needed:\n    * whitespace tokenization\n    * morphological segmentation\n    * POS tagging\n    * morphological tagging\n    * dependency parsing\n    * named and non-named entity type recognition (**experimental**)\n    * coreference resolution (**experimental**)\n  * Does not alter the input string (text reconstructible from, and alignable to output)\n  * Compatible with Python 3.5+, Linux, Windows and OSX\n\nNote that entity recognition and coreference are still in beta and offer rudimentary accuracy.\n\nTo cite this tool in academic papers please refer to this paper:\n\nZeldes, Amir, Nick Howell, Noam Ordan and Yifat Ben Moshe (2022) [A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing](https://arxiv.org/abs/2210.07873). In: *Proceedings of EMNLP 2022*. Abu Dhabi, UAE.\n\n\n```\n@InProceedings{ZeldesHowellOrdanBenMoshe2022,\n  author    = {Amir Zeldes and Nick Howell and Noam Ordan and Yifat Ben Moshe},\n  booktitle = {Proceedings of {EMNLP} 2022},\n  title     = {A SecondWave of UD Hebrew Treebanking and Cross-Domain Parsing},\n  pages     = {4331--4344},\n  year      = {2022},\n  address   = {Abu Dhabi, UAE},\n}\n```\n\n## Performance\n\nCurrent scores on UD_Hebrew-HTB (IAHLT version tokenization) using the official conll scorer, end to end from plain text, trained jointly on UD Hebrew:\n\n```\nMetric     | Precision |    Recall |  F1 Score | AligndAcc\n-----------+-----------+-----------+-----------+-----------\nTokens     |     99.93 |     99.97 |     99.95 |\nSentences  |     98.39 |     99.39 |     98.89 |\nWords      |     99.14 |     99.09 |     99.11 |\nUPOS       |     96.17 |     96.12 |     96.15 |     97.01\nXPOS       |     96.17 |     96.12 |     96.15 |     97.01\nUFeats     |     90.25 |     90.21 |     90.23 |     91.04\nAllTags    |     89.61 |     89.57 |     89.59 |     90.39\nLemmas     |     95.26 |     95.21 |     95.23 |     96.09\nUAS       
 |     90.45 |     90.41 |     90.43 |     91.24\nLAS        |     87.64 |     87.60 |     87.62 |     88.41\nCLAS       |     82.82 |     82.33 |     82.57 |     83.39\nMLAS       |     69.68 |     69.27 |     69.47 |     70.16\nBLEX       |     78.01 |     77.55 |     77.78 |     78.55\n```\n\nCurrent scores on UD_Hebrew-IAHLTwiki using the official conll scorer, end to end from plain text, trained jointly on UD Hebrew:\n\n```\nMetric     | Precision |    Recall |  F1 Score | AligndAcc\n-----------+-----------+-----------+-----------+-----------\nTokens     |     99.71 |     99.89 |     99.80 |\nSentences  |     99.49 |     99.75 |     99.62 |\nWords      |     99.48 |     99.19 |     99.33 |\nUPOS       |     96.57 |     96.29 |     96.43 |     97.08\nXPOS       |     96.57 |     96.29 |     96.43 |     97.08\nUFeats     |     90.90 |     90.63 |     90.77 |     91.38\nAllTags    |     90.21 |     89.95 |     90.08 |     90.69\nLemmas     |     97.37 |     97.09 |     97.23 |     97.89\nUAS        |     92.44 |     92.17 |     92.31 |     92.93\nLAS        |     90.08 |     89.82 |     89.95 |     90.56\nCLAS       |     86.48 |     85.82 |     86.15 |     86.81\nMLAS       |     73.04 |     72.49 |     72.76 |     73.32\nBLEX       |     83.61 |     82.98 |     83.29 |     83.93\n```\n\n## Installation\n\nEither install from PyPI using pip:\n\n`pip install hebpipe`\n\nAnd run as a module:\n\n`python -m hebpipe example_in.txt`\n\nOr install manually: \n\n  * Clone this repository into the directory that the script should run in (git clone https://github.com/amir-zeldes/HebPipe)\n  * In that directory, install the dependencies under **Requirements**, e.g. 
by running `python setup.py install` or `pip install -r requirements.txt`\n  \nModels can be downloaded automatically by the script on its first run.\n  \n## Requirements\n\n### Python libraries\n\nRequired libraries:\n\n```\nrequests\ntransformers==4.35.2\ntorch==2.1.0\nxgboost==2.0.3\ngensim==4.3.2\nrftokenizer\u003e=2.2.0\nnumpy\nscipy\ndepedit\u003e=3.3.1\npandas==2.1.2\njoblib==1.3.2\nxmltodict==0.13.0\ndiaparser==1.1.2\nflair==0.13.0\nstanza==1.7.0\nconllu==4.5.3\nprotobuf==4.23.4\n```\n\nYou should be able to install these manually via pip if necessary (i.e. `pip install rftokenizer==2.2.0` etc.).\n\nNote that some older versions of Python + Windows do not install numpy correctly from pip, in which case you can download compiled binaries for your version of Python + Windows here: https://www.lfd.uci.edu/~gohlke/pythonlibs/\n\n\n### Model files\n\nModel files are too large to include in the standard GitHub repository. The software will offer to download them automatically. The latest models can also be downloaded manually at https://gucorpling.org/amir/download/heb_models_v4/. \n\n## Command line usage\n\n```\nusage: python heb_pipe.py [OPTIONS] files\n\npositional arguments:\n  files                 File name or pattern of files to process (e.g. *.txt)\n\noptions:\n  -h, --help            show this help message and exit\n\nstandard module options:\n  -w, --whitespace      Perform white-space based tokenization of large word forms\n  -t, --tokenize        Tokenize large word forms into smaller morphological segments\n  -p, --posmorph        Do POS and Morph tagging\n  -l, --lemma           Do lemmatization\n  -d, --dependencies    Parse with dependency parser\n  -e, --entities        Add entity spans and types\n  -c, --coref           Add coreference annotations\n  -s SENT, --sent SENT  XML tag to split sentences, e.g. \"s\" for \u003cs ..\u003e ... 
\u003c/s\u003e, or \"newline\" to use newlines, \"auto\" for automatic splitting, or \"both\" for both\n  -o {pipes,conllu,sgml}, --out {pipes,conllu,sgml}\n                        Output CoNLL format, SGML or just tokenize with pipes\n\nless common options:\n  -q, --quiet           Suppress verbose messages\n  -x EXTENSION, --extension EXTENSION\n                        Extension for output files (default: .conllu)\n  --cpu                 Use CPU instead of GPU (slower)\n  --disable_lex         Do not use lexicon during lemmatization\n  --dirout DIROUT       Optional output directory (default: this dir)\n  --from_pipes          Input contains subtoken segmentation with the pipe character (no automatic tokenization is performed)\n  --version             Print version number and quit\n```\n\n### Example usage\n\nWhitespace tokenize, tokenize morphemes, add pos, lemma, morph, dep parse with automatic sentence splitting,\nentity recognition and coref for one text file, output in default conllu format:\n\u003e python heb_pipe.py -wtpldec example_in.txt\n\nOR specify no processing options (automatically assumes you want all steps)\n\u003e python heb_pipe.py example_in.txt\n\nJust tokenize a file using pipes:\n\u003e python heb_pipe.py -wt -o pipes example_in.txt\n\nPOS tag, lemmatize, add morphology and parse a pre-tokenized file, splitting sentences by existing \u003csent\u003e tags:\n\u003e python heb_pipe.py -pld -s sent example_in.txt\n\nAdd full analyses to a whole directory of *.txt files, output to a specified directory:\n\u003e python heb_pipe.py -wtpldec --dirout /home/heb/out/ *.txt\n\nParse a tagged TT SGML file into CoNLL tabular format for treebanking, use existing tag \u003csent\u003e to recognize sentence borders:\n\u003e python heb_pipe.py -d -s sent example_in.tt\n\n## Input formats\n\nThe pipeline accepts the following kinds of input:\n\n  * Plain text, with normal Hebrew whitespace behavior. 
Newlines are assumed to indicate a sentence break, but longer paragraphs will receive automatic sentence splitting too (use: -s both).\n  * Gold super-tokenized: if whitespace tokenization is already done, you can leave out `-w`. The system expects one super-token per line in this case (e.g. בבית is on one line)\n  * Gold tokenized: if gold morphological segmentation is already done, you can input one gold token per line.\n  * Pipes: if morphological segmentation is already done, you can also input one super-token per line with sub-tokens separated by pipes - use `--from_pipes` for this option (allows running the segmenter, outputting pipes for manual correction, then continuing NLP processing from pipes)\n  * XML sentence tags in input: use -s TAGNAME to indicate an XML tag providing gold sentence boundaries.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famir-zeldes%2Fhebpipe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famir-zeldes%2Fhebpipe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famir-zeldes%2Fhebpipe/lists"}