https://github.com/ruanchaves/hashformers

Hashformers is a framework for hashtag segmentation with Transformers and Large Language Models (LLMs).
https://github.com/ruanchaves/hashformers

bert deep-learning hashtag-segmentor large-language-models llms natural-language-processing nlp paper segmentation sentiment-analysis sentiment-classification sentiment-polarity transformer transformers transformers-gpt2 tweet-analysis tweets-classification twitter twitter-sentiment-analysis word-segmentation

Last synced: 2 months ago
JSON representation

Hashformers is a framework for hashtag segmentation with Transformers and Large Language Models (LLMs).

Host: GitHub
URL: https://github.com/ruanchaves/hashformers
Owner: ruanchaves
License: mit
Created: 2020-05-21T11:48:18.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2024-08-21T18:04:56.000Z (8 months ago)
Last Synced: 2025-02-07T23:34:23.968Z (3 months ago)
Topics: bert, deep-learning, hashtag-segmentor, large-language-models, llms, natural-language-processing, nlp, paper, segmentation, sentiment-analysis, sentiment-classification, sentiment-polarity, transformer, transformers, transformers-gpt2, tweet-analysis, tweets-classification, twitter, twitter-sentiment-analysis, word-segmentation
Language: Python
Homepage:
Size: 23.6 MB
Stars: 70
Watchers: 5
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # ✂️ hashformers

[![HF Spaces](https://raw.githubusercontent.com/obss/sahi/main/resources/hf_spaces_badge.svg)](https://ruanchaves-hashtag-segmentation.hf.space/) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ruanchaves/hashformers/blob/master/hashformers.ipynb) [![PyPi license](https://badgen.net/pypi/license/pip/)](https://github.com/ruanchaves/hashformers/blob/master/LICENSE) [![stars](https://img.shields.io/github/stars/ruanchaves/hashformers)](https://github.com/ruanchaves/hashformers) [![tweet](https://img.shields.io/twitter/url?style=social&url=https%3A%2F%2Fgithub.com%2Fruanchaves%2Fhashformers)](https://www.twitter.com/share?url=https://github.com/ruanchaves/hashformers)

Hashtag segmentation is the task of automatically adding spaces between the words on a hashtag. 

[Hashformers](https://github.com/ruanchaves/hashformers) is the current **state-of-the-art** for hashtag segmentation, as demonstrated on [this paper accepted at LREC 2022](https://aclanthology.org/2022.lrec-1.782.pdf). 

Hashformers is also **language-agnostic**: you can use it to segment hashtags not just with English models, but also using any language model available on the [Hugging Face Model Hub](https://huggingface.co/models).



    

  ✂️ Segment hashtags on Hugging Face Spaces  


  ✂️ Get started - Google Colab tutorial  


  ✂️ Read the Docs  


## Basic usage

```python

from hashformers import TransformerWordSegmenter as WordSegmenter

ws = WordSegmenter(

    segmenter_model_name_or_path="gpt2",

    segmenter_model_type="incremental",

    reranker_model_name_or_path="google/flan-t5-base",

    reranker_model_type="seq2seq"

)

segmentations = ws.segment([

    "#weneedanationalpark",

    "#icecold"

])

print(segmentations)

# [ 'we need a national park',

# 'ice cold' ]

```

It is also possible to use hashformers without a reranker by setting the `reranker_model_name_or_path` and the `reranker_model_type` to `None`. 

## Installation

```

pip install hashformers

```

**Important**: Hashformers is designed to work with `Python 3.10.12`, the version currently used on Google Colab.

## What models can I use?

Visit the [HuggingFace Model Hub](https://huggingface.co/models) and choose your models for the `WordSegmenter` class.

You can use any model supported by the [minicons](https://github.com/kanishkamisra/minicons) library. Currently `hashformers` supports the following model types as the `segmenter_model_type` or `reranker_model_type`:

### `incremental`

Auto-regressive models like GPT-2 and XLNet, or any model that can be loaded with `AutoModelForCausalLM`. This includes large language models (LLMs) such as Alpaca-LoRA ( `chainyo/alpaca-lora-7b` ) and GPT-J ( `EleutherAI/gpt-j-6b` ).

```python

ws = WordSegmenter(

    segmenter_model_name_or_path="EleutherAI/gpt-j-6b",

    segmenter_model_type="incremental",

    reranker_model_name_or_path=None,

    reranker_model_type=None

)

```

### `masked`

Masked language models like BERT, or any model that can be loaded with `AutoModelForMaskedLM`.

### `seq2seq`

Seq2Seq models like FLAN-T5 ( `google/flan-t5-base` ), or any model that can be loaded with `AutoModelForSeq2SeqLM`.

Best results are usually achieved by using an `incremental` model as the `segmenter_model_name_or_path` and a `masked` or `seq2seq` model as the `reranker_model_name_or_path`. 

A segmenter is always required, however a reranker is optional. 

## Contributing 

Pull requests are welcome!  [Read our paper](https://arxiv.org/abs/2112.03213) for more details on the inner workings of our framework.

If you want to develop the library, you can install **hashformers** directly from this repository ( or your fork ):

```

git clone https://github.com/ruanchaves/hashformers.git

cd hashformers

pip install -e .

```

## Relevant Papers 

This is a collection of papers that have utilized the *hashformers* library as a tool in their research.

### hashformers v1.3

These papers have utilized `hashformers` version 1.3 or below.

* [Zero-shot hashtag segmentation for multilingual sentiment analysis](https://arxiv.org/abs/2112.03213)

* [HashSet -- A Dataset For Hashtag Segmentation (LREC 2022)](https://aclanthology.org/2022.lrec-1.782/)

* [Generalizability of Abusive Language Detection Models on Homogeneous German Datasets](https://link.springer.com/article/10.1007/s13222-023-00438-1#Fn3) 

* [The problem of varying annotations to identify abusive language in social media content](https://www.cambridge.org/core/journals/natural-language-engineering/article/problem-of-varying-annotations-to-identify-abusive-language-in-social-media-content/B47FCCCEBF6EDF9C628DCC69EC5E0826)

## Blog Posts

* [15 Datasets for Word Segmentation on the Hugging Face Hub](https://ruanchaves.medium.com/15-datasets-for-word-segmentation-on-the-hugging-face-hub-4f24cb971e48)

## Citation

```

@misc{rodrigues2021zeroshot,

      title={Zero-shot hashtag segmentation for multilingual sentiment analysis}, 

      author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},

      year={2021},

      eprint={2112.03213},

      archivePrefix={arXiv},

      primaryClass={cs.CL}

}

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/ruanchaves/hashformers

Awesome Lists containing this project

README

✂️ Segment hashtags on Hugging Face Spaces

✂️ Get started - Google Colab tutorial

✂️ Read the Docs