{"id":21582407,"url":"https://github.com/ruanchaves/hashformers","last_synced_at":"2026-01-08T12:15:45.794Z","repository":{"id":42131804,"uuid":"265834928","full_name":"ruanchaves/hashformers","owner":"ruanchaves","description":"Hashformers is a framework for hashtag segmentation with Transformers and Large Language Models (LLMs).","archived":false,"fork":false,"pushed_at":"2024-08-21T18:04:56.000Z","size":24740,"stargazers_count":71,"open_issues_count":0,"forks_count":5,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-06-27T10:07:52.679Z","etag":null,"topics":["bert","deep-learning","hashtag-segmentor","large-language-models","llms","natural-language-processing","nlp","paper","segmentation","sentiment-analysis","sentiment-classification","sentiment-polarity","transformer","transformers","transformers-gpt2","tweet-analysis","tweets-classification","twitter","twitter-sentiment-analysis","word-segmentation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ruanchaves.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-21T11:48:18.000Z","updated_at":"2025-05-07T12:47:31.000Z","dependencies_parsed_at":"2024-08-21T20:12:11.995Z","dependency_job_id":"41292f50-b9e2-4097-9f93-693cfc8793b9","html_url":"https://github.com/ruanchaves/hashformers","commit_stats":{"total_commits":498,"total_committers":2,"mean_commits":249.0,"dds":0.002008032128514081,"last_synced_commit":"c73f5610f2bdbf6ef7b4fd0ff9f7db61c73c8df2"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/ruanchaves/hashformers","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruanchaves%2Fhashformers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruanchaves%2Fhashformers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruanchaves%2Fhashformers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruanchaves%2Fhashformers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ruanchaves","download_url":"https://codeload.github.com/ruanchaves/hashformers/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruanchaves%2Fhashformers/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262235783,"owners_count":23279567,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","deep-learning","hashtag-segmentor","large-language-models","llms","natural-language-processing","nlp","paper","segmentation","sentiment-analysis","sentiment-classification","sentiment-polarity","transformer","transformers","transformers-gpt2","tweet-analysis","tweets-classification","twitter","twitter-sentiment-analysis","word-segmentation"],"created_at":"2024-11-24T14:15:46.274Z","updated_at":"2026-01-08T12:15:45.787Z","avatar_url":"https://github.com/ruanchaves.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ✂️ hashformers\r\n\r\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ruanchaves/hashformers/blob/master/hashformers.ipynb) [![PyPi license](https://badgen.net/pypi/license/pip/)](https://github.com/ruanchaves/hashformers/blob/master/LICENSE) [![stars](https://img.shields.io/github/stars/ruanchaves/hashformers)](https://github.com/ruanchaves/hashformers)\r\n\r\n**Hashformers** is a word segmentation library that fills a gap in the NLP ecosystem between heuristic-based splitters and LLM prompt-based segmentation. It can be used with any language model from the [Hugging Face Model Hub](https://huggingface.co/models), from auto-regressive models like GPT-2 to recent large language models (LLMs).\r\n\r\n**Hashformers** uses language models and a beam search algorithm to segment text without spaces into words. Benchmarks show that it can outperform heuristic-based splitters and LLM prompt-based approaches on word segmentation tasks.\r\n\r\n\u003cp align=\"center\"\u003e\r\n\u003ch3\u003e \u003ca href=\"https://colab.research.google.com/github/ruanchaves/hashformers/blob/master/hashformers.ipynb\"\u003e ✂️ Google Colab Tutorial \u003c/a\u003e \u003c/h3\u003e\r\n\u003c/p\u003e\r\n\r\n\u003cp align=\"center\"\u003e\r\n\u003ch3\u003e \u003ca href=\"https://github.com/ruanchaves/hashformers/blob/master/tutorials/EVALUATION-January_2026.md\"\u003e ✂️ Evaluation Report \u003c/a\u003e \u003c/h3\u003e\r\n\u003c/p\u003e\r\n\r\n---\r\n\r\n## 🚀 Quick Start\r\n\r\n### Installation\r\n\r\n```bash\r\npip install hashformers\r\n```\r\n\r\n### Basic Usage\r\n\r\n```python\r\nfrom hashformers import TransformerWordSegmenter as WordSegmenter\r\n\r\nws = WordSegmenter(\r\n    segmenter_model_name_or_path=\"distilgpt2\"\r\n) # You can use any model from the Hugging Face Model Hub\r\n\r\nsegmentations = ws.segment([\r\n    \"#weneedanationalpark\",\r\n    \"#icecold\"\r\n])\r\n\r\nprint(segmentations)\r\n# ['we need a national park', 'ice cold']\r\n```\r\n\r\n### Using Language-Specific Models\r\n\r\n```python\r\n# Russian hashtags with RuGPT3\r\nws = WordSegmenter(\r\n    segmenter_model_name_or_path=\"ai-forever/rugpt3small_based_on_gpt2\"\r\n)\r\n\r\nsegmentations = ws.segment([\"#москвасити\"])\r\n\r\nprint(segmentations)\r\n# ['москва сити']\r\n```\r\n\r\n### spaCy Integration\r\n\r\nHashformers can be used as a spaCy pipeline component:\r\n\r\n```python\r\nimport spacy\r\nimport hashformers.spacy  # registers the \"hashformers\" component\r\n\r\nnlp = spacy.blank(\"en\")\r\nnlp.add_pipe(\"hashformers\", config={\"model\": \"distilgpt2\"})\r\n\r\ndoc = nlp(\"#weneedanationalpark\")\r\nprint(doc._.segmented)  # \"we need a national park\"\r\n```\r\n\r\nInstall with spaCy support:\r\n\r\n```bash\r\npip install hashformers[spacy]\r\n```\r\n\r\n## When to Use Hashformers?\r\n\r\nThe table below outlines when to use **Hashformers** versus other approaches like heuristic-based splitters (e.g., SymSpell, WordNinja) or large LLMs.\r\n\r\n| Approach | Examples | Recommended When... | Notes |\r\n|----------|----------|---------------------|-------|\r\n| **Heuristic-based** | [SymSpell](https://github.com/wolfgarbe/SymSpell), [Ekphrasis](https://github.com/cbaziotis/ekphrasis), [WordNinja](https://github.com/keredson/wordninja), [Spiral (Ronin)](https://github.com/casics/spiral) | • **Scalability** is a primary requirement.\u003cbr\u003e\u003cbr\u003e• The segmentation domain works well with a standard pre-built vocabulary. | Fast and efficient, but requires a pre-built vocabulary which can be limiting for niche domains or languages. |\r\n| **Hashformers** | [Hashformers](https://github.com/ruanchaves/hashformers) | • **Scalability** is needed.\u003cbr\u003e\u003cbr\u003e• You are working in a domain or language where a Language Model is readily available, but compiling a manual vocabulary for your task is too burdensome. | Evidence shows Hashformers can be superior to LLMs of similar scale (0.5B parameters). |\r\n| **Large LLMs** | [OpenAI](https://openai.com/), Local LLM Deployment | • **Cost, latency, and scalability** are not concerns.\u003cbr\u003e\u003cbr\u003e• You are segmenting a **low volume** of items. | To gain an accuracy advantage over Hashformers, you generally need to use significantly larger LLMs. |\r\n\r\n---\r\n\r\n## 📚 Research \u0026 Citations\r\n\r\nHashformers was recognized as **state-of-the-art** for hashtag segmentation at [LREC 2022](https://aclanthology.org/2022.lrec-1.782.pdf).\r\n\r\n### Papers Using Hashformers\r\n\r\n- [Zero-shot hashtag segmentation for multilingual sentiment analysis](https://arxiv.org/abs/2112.03213)\r\n\r\n- [HashSet -- A Dataset For Hashtag Segmentation (LREC 2022)](https://aclanthology.org/2022.lrec-1.782/)\r\n\r\n- [Generalizability of Abusive Language Detection Models on Homogeneous German Datasets](https://link.springer.com/article/10.1007/s13222-023-00438-1#Fn3) \r\n\r\n- [The problem of varying annotations to identify abusive language in social media content](https://www.cambridge.org/core/journals/natural-language-engineering/article/problem-of-varying-annotations-to-identify-abusive-language-in-social-media-content/B47FCCCEBF6EDF9C628DCC69EC5E0826)\r\n\r\n- [NUSS: An R package for mixed N-grams and unigram sequence segmentation](https://www.sciencedirect.com/science/article/pii/S2352711025002754#bbib0017)\r\n\r\n### Citation\r\n\r\nIf you find **Hashformers** useful, please consider citing our paper:\r\n\r\n```bibtex\r\n@misc{rodrigues2021zeroshot,\r\n      title={Zero-shot hashtag segmentation for multilingual sentiment analysis}, \r\n      author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},\r\n      year={2021},\r\n      eprint={2112.03213},\r\n      archivePrefix={arXiv},\r\n      primaryClass={cs.CL}\r\n}\r\n```\r\n\r\n---\r\n\r\n## 🤝 Contributing\r\n\r\nPull requests are welcome! [Read our paper](https://arxiv.org/abs/2112.03213) for details on the framework architecture.\r\n\r\n```bash\r\ngit clone https://github.com/ruanchaves/hashformers.git\r\ncd hashformers\r\npip install -e .\r\n```\r\n\r\n---\r\n\r\n## 📖 Resources\r\n\r\n- [15 Datasets for Word Segmentation on the Hugging Face Hub](https://medium.com/@ruanchaves/15-datasets-for-word-segmentation-on-the-hugging-face-hub-4f24cb971e48)\r\n- [Benchmark Scripts](scripts/)\r\n- [Evaluation Report (January 2026)](tutorials/EVALUATION-January_2026.md)\r\n- [Evaluation Report (February 2022)](tutorials/EVALUATION-February_2022.md)\r\n\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fruanchaves%2Fhashformers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fruanchaves%2Fhashformers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fruanchaves%2Fhashformers/lists"}