https://github.com/hetpandya/paraphrase-datasets-pretrained-models
A collection of preprocessed datasets and pretrained models for generating paraphrases.
- Host: GitHub
- URL: https://github.com/hetpandya/paraphrase-datasets-pretrained-models
- Owner: hetpandya
- License: apache-2.0
- Created: 2021-06-28T08:31:49.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2021-07-13T12:44:52.000Z (almost 4 years ago)
- Last Synced: 2025-01-15T21:30:36.875Z (4 months ago)
- Homepage:
- Size: 5.21 MB
- Stars: 29
- Watchers: 1
- Forks: 4
- Open Issues: 1
- Metadata Files:
- Readme: README.md
- License: LICENSE
# Paraphrase datasets and pretrained models
This repository is a collection of preprocessed datasets for training paraphrase models, along with pretrained models for generating paraphrases.

## Datasets
Each dataset has a `README` that describes the dataset, its source, and its preprocessed format. The datasets are stored in the [datasets](https://github.com/hetpandya/paraphrase-datasets-pretrained-models/tree/main/datasets) directory.

Dataset Type | File Name
--- | ---
TaPaCo Original | [tapaco_huggingface.csv](https://github.com/hetpandya/paraphrase-datasets-pretrained-models/blob/main/datasets/tapaco/tapaco_huggingface.csv)
TaPaCo Preprocessed | [tapaco_paraphrases_dataset.csv](https://github.com/hetpandya/paraphrase-datasets-pretrained-models/blob/main/datasets/tapaco/tapaco_paraphrases_dataset.csv)
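The preprocessed CSV can be loaded directly with pandas for a quick look before training. A minimal sketch; the column layout is documented in the dataset's `README`, so inspect the file rather than assuming a schema:

```python
import pandas as pd

# Path assumes the repository has been cloned locally.
df = pd.read_csv("datasets/tapaco/tapaco_paraphrases_dataset.csv")

# Inspect the schema instead of assuming column names.
print(df.shape)
print(df.columns.tolist())
print(df.head())
```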
## Pretrained models

List of models trained on various datasets for paraphrase generation.

Model | Dataset | Location
--- | --- | ---
t5-small| [tapaco](https://github.com/hetpandya/paraphrase-datasets-pretrained-models/tree/main/datasets/tapaco) | [huggingface](https://huggingface.co/hetpandya/t5-small-tapaco)
t5-small| [Quora Question Pairs](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) | [huggingface](https://huggingface.co/hetpandya/t5-small-quora)
t5-base| [tapaco](https://github.com/hetpandya/paraphrase-datasets-pretrained-models/tree/main/datasets/tapaco) | [huggingface](https://huggingface.co/hetpandya/t5-base-tapaco)

### Model Training
Examples for training models on the datasets can be found in the [examples](https://github.com/hetpandya/paraphrase-datasets-pretrained-models/tree/main/examples) directory; a minimal hand-rolled sketch follows below.
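For orientation, here is a minimal fine-tuning loop for `t5-small` on the preprocessed TaPaCo CSV. This is a sketch, not the repository's examples script: the column names `Text` and `Paraphrase` are placeholder assumptions, so swap in the real ones from the dataset's `README`.

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Placeholder column names -- check the dataset README for the real schema.
df = pd.read_csv("datasets/tapaco/tapaco_paraphrases_dataset.csv")
pairs = list(zip(df["Text"], df["Paraphrase"]))

def collate(batch):
    # Prepend the "paraphrase: " task prefix to each source sentence.
    sources = ["paraphrase: " + src for src, _ in batch]
    targets = [tgt for _, tgt in batch]
    enc = tokenizer(sources, padding=True, truncation=True,
                    max_length=256, return_tensors="pt")
    labels = tokenizer(targets, padding=True, truncation=True,
                       max_length=256, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # mask padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(pairs, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for batch in loader:  # one epoch shown
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

After training, `model.save_pretrained(...)` and `tokenizer.save_pretrained(...)` produce a directory that `from_pretrained` can load, which is how the models listed above are published on Hugging Face.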
### T5 model usage example

**Install dependencies using:**
```
pip install transformers sentencepiece
```

**Usage**
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("hetpandya/t5-small-tapaco")
model = T5ForConditionalGeneration.from_pretrained("hetpandya/t5-small-tapaco")

def get_paraphrases(sentence, prefix="paraphrase: ", n_predictions=5,
                    top_k=120, max_length=256, device="cpu"):
    # Prepend the task prefix the model was trained with.
    text = prefix + sentence + " "
    # Note: pad_to_max_length is deprecated in newer transformers;
    # padding="max_length" is the modern equivalent.
    encoding = tokenizer.encode_plus(
        text, pad_to_max_length=True, return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_masks = encoding["attention_mask"].to(device)

    # Sample several candidates with top-k / top-p (nucleus) sampling.
    model_output = model.generate(
        input_ids=input_ids,
        attention_mask=attention_masks,
        do_sample=True,
        max_length=max_length,
        top_k=top_k,
        top_p=0.98,
        early_stopping=True,
        num_return_sequences=n_predictions,
    )

    # Keep unique outputs that differ from the input sentence.
    outputs = []
    for output in model_output:
        generated_sent = tokenizer.decode(
            output, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        if (
            generated_sent.lower() != sentence.lower()
            and generated_sent not in outputs
        ):
            outputs.append(generated_sent)
    return outputs

paraphrases = get_paraphrases("The house will be cleaned by me every Saturday.")
for sent in paraphrases:
    print(sent)
```

**Output**
```
The house is cleaned every Saturday by me.
The house will be cleaned on Saturday.
I will clean the house every Saturday.
I get the house cleaned every Saturday.
I will clean this house every Saturday.
```
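Since `get_paraphrases` only moves the input tensors to `device`, the model itself must also be moved when running on a GPU:

```python
model = model.to("cuda")
paraphrases = get_paraphrases(
    "The house will be cleaned by me every Saturday.", device="cuda"
)
```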