https://github.com/ruanchaves/hashformers
Hashformers is a framework for hashtag segmentation with Transformers and Large Language Models (LLMs).
- Host: GitHub
- URL: https://github.com/ruanchaves/hashformers
- Owner: ruanchaves
- License: mit
- Created: 2020-05-21T11:48:18.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2024-08-21T18:04:56.000Z (almost 2 years ago)
- Last Synced: 2025-06-27T10:07:52.679Z (12 months ago)
- Topics: bert, deep-learning, hashtag-segmentor, large-language-models, llms, natural-language-processing, nlp, paper, segmentation, sentiment-analysis, sentiment-classification, sentiment-polarity, transformer, transformers, transformers-gpt2, tweet-analysis, tweets-classification, twitter, twitter-sentiment-analysis, word-segmentation
- Language: Python
- Homepage:
- Size: 23.6 MB
- Stars: 71
- Watchers: 4
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# ✂️ hashformers
[Open in Colab](https://colab.research.google.com/github/ruanchaves/hashformers/blob/master/hashformers.ipynb) · [License](https://github.com/ruanchaves/hashformers/blob/master/LICENSE) · [GitHub](https://github.com/ruanchaves/hashformers)
**Hashformers** is a word segmentation library that fills a gap in the NLP ecosystem between heuristic-based splitters and LLM prompt-based segmentation. It can be used with any language model from the [Hugging Face Model Hub](https://huggingface.co/models), from auto-regressive models like GPT-2 to recent large language models (LLMs).
**Hashformers** uses language models and a beam search algorithm to segment text without spaces into words. Benchmarks show that it can outperform heuristic-based splitters and LLM prompt-based approaches on word segmentation tasks.
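- ✂️ [Google Colab Tutorial](https://colab.research.google.com/github/ruanchaves/hashformers/blob/master/hashformers.ipynb)
- ✂️ Evaluation Report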
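
To make the "language model plus beam search" idea concrete, here is a minimal, self-contained sketch. It is not the Hashformers implementation: the toy unigram table (`TOY_UNIGRAMS`), the scoring function (`toy_score`), and the helper `beam_segment` are all hypothetical stand-ins, and the probabilities are made up. In a real setup the scoring function would be replaced by a language model that scores each candidate segmentation.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical toy vocabulary with made-up probabilities. A real setup would
# score candidates with a language model instead of a lookup table.
TOY_UNIGRAMS = {
    "we": 0.05, "need": 0.02, "a": 0.08, "an": 0.03,
    "national": 0.005, "nation": 0.004, "park": 0.005,
    "ice": 0.005, "cold": 0.005,
}


def toy_score(words: List[str]) -> float:
    """Sum of word log-probabilities; unknown words get a length-scaled penalty."""
    total = 0.0
    for word in words:
        if word in TOY_UNIGRAMS:
            total += math.log(TOY_UNIGRAMS[word])
        else:
            total += -10.0 * len(word)
    return total


def beam_segment(
    text: str,
    score_fn: Callable[[List[str]], float] = toy_score,
    beam_size: int = 4,
    max_word_len: int = 12,
) -> str:
    """Segment unspaced text, keeping only the best `beam_size` hypotheses per prefix."""
    text = text.lstrip("#")
    # beams[i] holds the best (score, words) hypotheses covering text[:i].
    beams: List[List[Tuple[float, List[str]]]] = [[] for _ in range(len(text) + 1)]
    beams[0] = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            for _, words in beams[start]:
                new_words = words + [text[start:end]]
                candidates.append((score_fn(new_words), new_words))
        # Prune: keep only the highest-scoring hypotheses for this prefix.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams[end] = candidates[:beam_size]
    return " ".join(beams[-1][0][1])


print(beam_segment("#weneedanationalpark"))  # we need a national park
print(beam_segment("#icecold"))              # ice cold
```

Because `score_fn` receives the whole candidate word list, a context-sensitive scorer can be swapped in without changing the search loop; the beam size then trades accuracy against the number of scoring calls.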
---
## 🚀 Quick Start
### Installation
```bash
pip install hashformers
```
### Basic Usage
```python
from hashformers import TransformerWordSegmenter as WordSegmenter

# You can use any model from the Hugging Face Model Hub
ws = WordSegmenter(
    segmenter_model_name_or_path="distilgpt2"
)

segmentations = ws.segment([
    "#weneedanationalpark",
    "#icecold",
])

print(segmentations)
# ['we need a national park', 'ice cold']
```
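
A common follow-up is to replace hashtags inside full tweets with their segmented form, for example before sentiment analysis. The sketch below is one way to do that. The helper `expand_hashtags` and the stub `fake_segment` are hypothetical and not part of Hashformers; the only assumption about the library is the one shown in the example above, namely that `segment` takes a list of hashtags and returns a list of strings in the same order.

```python
import re
from typing import Callable, List

# Matches a '#' followed by word characters and captures the hashtag body.
HASHTAG_RE = re.compile(r"#(\w+)")


def expand_hashtags(text: str, segment_fn: Callable[[List[str]], List[str]]) -> str:
    """Replace every hashtag in `text` with its segmented form."""
    tags = HASHTAG_RE.findall(text)
    if not tags:
        return text
    # Segment each distinct hashtag once, preserving first-seen order.
    unique = list(dict.fromkeys(tags))
    segmented = dict(zip(unique, segment_fn(["#" + tag for tag in unique])))
    return HASHTAG_RE.sub(lambda match: segmented[match.group(1)], text)


def fake_segment(hashtags: List[str]) -> List[str]:
    """Stub segmenter so the example runs without downloading a model."""
    lookup = {
        "#icecold": "ice cold",
        "#weneedanationalpark": "we need a national park",
    }
    return [lookup[hashtag] for hashtag in hashtags]


print(expand_hashtags("Brr, #icecold out here. #weneedanationalpark", fake_segment))
# Brr, ice cold out here. we need a national park
```

In practice you would pass `ws.segment` instead of `fake_segment`. Batching all distinct hashtags into a single call keeps the number of model invocations low.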
### Using Language-Specific Models
```python
from hashformers import TransformerWordSegmenter as WordSegmenter

# Russian hashtags with RuGPT3
ws = WordSegmenter(
    segmenter_model_name_or_path="ai-forever/rugpt3small_based_on_gpt2"
)

segmentations = ws.segment(["#москвасити"])

print(segmentations)
# ['москва сити']
```
### spaCy Integration
Hashformers can be used as a spaCy pipeline component:
```python
import spacy
import hashformers.spacy # registers the "hashformers" component
nlp = spacy.blank("en")
nlp.add_pipe("hashformers", config={"model": "distilgpt2"})
doc = nlp("#weneedanationalpark")
print(doc._.segmented) # "we need a national park"
```
Install with spaCy support:
```bash
# Quotes keep shells such as zsh from interpreting the square brackets
pip install "hashformers[spacy]"
```
## When to Use Hashformers?
The table below outlines when to use **Hashformers** versus other approaches like heuristic-based splitters (e.g., SymSpell, WordNinja) or large LLMs.
| Approach | Examples | Recommended When... | Notes |
|----------|----------|---------------------|-------|
| **Heuristic-based** | [SymSpell](https://github.com/wolfgarbe/SymSpell), [Ekphrasis](https://github.com/cbaziotis/ekphrasis), [WordNinja](https://github.com/keredson/wordninja), [Spiral (Ronin)](https://github.com/casics/spiral) | • **Scalability** is a primary requirement.<br>• The segmentation domain works well with a standard pre-built vocabulary. | Fast and efficient, but requires a pre-built vocabulary which can be limiting for niche domains or languages. |
| **Hashformers** | [Hashformers](https://github.com/ruanchaves/hashformers) | • **Scalability** is needed.<br>• You are working in a domain or language where a Language Model is readily available, but compiling a manual vocabulary for your task is too burdensome. | Evidence shows Hashformers can be superior to LLMs of similar scale (0.5B parameters). |
| **Large LLMs** | [OpenAI](https://openai.com/), Local LLM Deployment | • **Cost, latency, and scalability** are not concerns.<br>• You are segmenting a **low volume** of items. | To gain an accuracy advantage over Hashformers, you generally need to use significantly larger LLMs. |
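
To decide which row applies to your own data, one option is to score each approach on a small labelled sample. The sketch below computes exact-match accuracy and boundary-level precision, recall, and F1. The helpers `boundary_positions` and `evaluate` are hypothetical, and the project's own benchmark scripts may use different metrics.

```python
from typing import Dict, List, Set


def boundary_positions(segmentation: str) -> Set[int]:
    """Character offsets in the unspaced string after which a space was inserted."""
    positions: Set[int] = set()
    offset = 0
    for word in segmentation.split()[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions


def evaluate(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Exact-match accuracy plus micro-averaged boundary precision, recall, and F1."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    exact = tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        exact += pred.split() == ref.split()
        pred_b, ref_b = boundary_positions(pred), boundary_positions(ref)
        tp += len(pred_b & ref_b)   # boundaries placed correctly
        fp += len(pred_b - ref_b)   # spurious boundaries
        fn += len(ref_b - pred_b)   # missed boundaries
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = exact / len(references) if references else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


scores = evaluate(
    predictions=["we need a national park", "icecold"],
    references=["we need a national park", "ice cold"],
)
print(scores)
```

Exact-match accuracy is strict (one wrong boundary fails the whole hashtag), while boundary F1 gives partial credit, so reporting both gives a fuller picture.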
---
## 📚 Research & Citations
Hashformers was recognized as **state-of-the-art** for hashtag segmentation at [LREC 2022](https://aclanthology.org/2022.lrec-1.782.pdf).
### Papers Using Hashformers
- [Zero-shot hashtag segmentation for multilingual sentiment analysis](https://arxiv.org/abs/2112.03213)
- [HashSet -- A Dataset For Hashtag Segmentation (LREC 2022)](https://aclanthology.org/2022.lrec-1.782/)
- [Generalizability of Abusive Language Detection Models on Homogeneous German Datasets](https://link.springer.com/article/10.1007/s13222-023-00438-1#Fn3)
- [The problem of varying annotations to identify abusive language in social media content](https://www.cambridge.org/core/journals/natural-language-engineering/article/problem-of-varying-annotations-to-identify-abusive-language-in-social-media-content/B47FCCCEBF6EDF9C628DCC69EC5E0826)
- [NUSS: An R package for mixed N-grams and unigram sequence segmentation](https://www.sciencedirect.com/science/article/pii/S2352711025002754#bbib0017)
### Citation
If you find **Hashformers** useful, please consider citing our paper:
```bibtex
@misc{rodrigues2021zeroshot,
title={Zero-shot hashtag segmentation for multilingual sentiment analysis},
author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},
year={2021},
eprint={2112.03213},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
---
## 🤝 Contributing
Pull requests are welcome! [Read our paper](https://arxiv.org/abs/2112.03213) for details on the framework architecture.
```bash
git clone https://github.com/ruanchaves/hashformers.git
cd hashformers
pip install -e .
```
---
## 📖 Resources
- [15 Datasets for Word Segmentation on the Hugging Face Hub](https://medium.com/@ruanchaves/15-datasets-for-word-segmentation-on-the-hugging-face-hub-4f24cb971e48)
- [Benchmark Scripts](scripts/)
- [Evaluation Report (January 2026)](tutorials/EVALUATION-January_2026.md)
- [Evaluation Report (February 2022)](tutorials/EVALUATION-February_2022.md)