https://github.com/tikquuss/lm_hf
Causal and Mask Language Modeling with 🤗 Transformers
- Host: GitHub
- URL: https://github.com/tikquuss/lm_hf
- Owner: Tikquuss
- License: gpl-3.0
- Created: 2022-03-01T22:30:42.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-02-10T05:13:20.000Z (over 1 year ago)
- Last Synced: 2025-01-18T13:41:10.165Z (5 months ago)
- Topics: clm, mlm, transformer
- Language: Python
- Homepage:
- Size: 85.9 KB
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## 1. Setup
```bash
git clone https://github.com/Tikquuss/lm_hf
cd lm_hf
python3 -m pip install pip
pip install -r requirements.txt
```

## 2. Build a tokenizer from scratch if you are not going to use a pre-trained model (supports txt and csv)
See [tokenizing.py](src/tokenizing.py) for all other parameters (and descriptions).
```bash
st=my/save/path
mkdir -p $st
datapath=/path/to/data
text_column=text
python -m src.tokenizing -fe gpt2 -p ${datapath}/data_train.csv,${datapath}/data_val.csv,${datapath}/data_test.csv -vs 25000 -mf 2 -st $st -tc $text_column
# python -m src.tokenizing -fe bert-base-uncased -p wikitext -dn wikitext-2-raw-v1 --vocab_size 25000 -st $st
# ...
```

The tokenizer will be saved in ```${save_to}/tokenizer.pt```.
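Since the file has a ```.pt``` extension, the saved tokenizer can presumably be reloaded with ```torch.load``` once built. A minimal sketch under that assumption (the path is the ```save_to``` directory used above; this is not documented in the repository itself):

```python
import torch

# Assumes the tokenizer was serialized with torch.save, as the .pt extension suggests.
# On recent PyTorch versions you may need torch.load(..., weights_only=False)
# to unpickle a full Python object.
tokenizer = torch.load("my/save/path/tokenizer.pt")

# If this is a 🤗 tokenizer, it can be called directly on raw text.
print(tokenizer("Hello world"))
```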
## 3. Dictionary (works, but deprecated for the moment)
Instead of pre-training a tokenizer, you can build a simple vocabulary (by splitting sentences on whitespace, the default option, or by splitting them into phonemes, ...), then build the tokenizer from this vocabulary during training/evaluation (```tokenizer_params="vocab_file=str(${save_to}/word_to_id.txt),t_class=str(bert_tokenizer),..."```).

```bash
st=my/save/path
mkdir -p $st
datapath=/path/to/data
text_column=text
python -m src.utils -p ${datapath}/data_train.csv,${datapath}/data_val.csv,${datapath}/data_test.csv -st $st -tc $text_column
```

However, this option is not recommended for the moment (no deep sanity checks have been done on it so far).
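For intuition, the whitespace vocabulary described above can be sketched as follows. This is only an illustration of the idea, not the actual ```src/utils.py``` implementation, and the output format of ```word_to_id.txt``` is an assumption:

```python
from collections import Counter

import pandas as pd

# Count whitespace-separated tokens in the training data (illustration only).
texts = pd.read_csv("/path/to/data/data_train.csv")["text"]
counts = Counter(word for line in texts for word in str(line).split())

# Write one "word id" pair per line, most frequent words first (assumed format).
with open("my/save/path/word_to_id.txt", "w", encoding="utf-8") as f:
    for idx, (word, _) in enumerate(counts.most_common()):
        f.write(f"{word} {idx}\n")
```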
## 4. Train and/or evaluate a model (from scratch or from a pre-trained model and/or tokenizer)
See [trainer.py](src/trainer.py) and [train.sh](train.sh) for all other parameters (and descriptions).
```bash
. train.sh
```

## 5. TensorBoard (visualize the evolution of the loss/acc/... per step/epoch/...)
See https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html
```
%load_ext tensorboard
%tensorboard --logdir ${log_dir}/${task}/lightning_logs
```

## 6. Prediction
To generate text or fill in masks, you have to use the ```predict_params``` parameter.
By default, this will be done on the test dataset (or the dataset specified with the ```split``` parameter), but it is better to put your examples in a text, csv, or json file (...) and use it instead of the test dataset (```test_data_files``` parameter).
In this case, don't forget to set the ```group_texts``` parameter to ```False```, and make sure that the length of the prompts or sentences (and the value of the ```max_length``` parameter) does not exceed the value of the ```max_position_embeddings```/```n_positions```/... parameter of your model (see the sketch after the examples below).
- For example, for text generation, the file can have the following form:
* for a text file:
```
prompt 1
prompt 2
...
```
* for a csv file:
```
text_column | ...
prompt 1 | ...
prompt 2 | ...
... | ...
```
- For mask filling, replace the prompts above with the sentences on which to do the MLM.
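As a complement to the formats above, here is a minimal sketch (not part of the repository) that writes a small prompts csv with a ```text``` column and checks that each prompt fits within the model's position limit. The file name and the ```gpt2``` checkpoint are placeholders; replace them with your own data and model:

```python
import pandas as pd
from transformers import AutoConfig, AutoTokenizer

# Hypothetical prompt file; the column name matches text_column=text used above.
prompts = ["prompt 1", "prompt 2"]
pd.DataFrame({"text": prompts}).to_csv("my_prompts.csv", index=False)

# Placeholder model: GPT-2 style configs expose n_positions,
# BERT style configs expose max_position_embeddings.
config = AutoConfig.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
max_positions = getattr(config, "max_position_embeddings", None) or getattr(config, "n_positions", None)

for prompt in prompts:
    n_tokens = len(tokenizer(prompt)["input_ids"])
    assert n_tokens <= max_positions, f"'{prompt}': {n_tokens} tokens > limit {max_positions}"
```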
The result will be stored by default in the ```${log_dir}/${task}/predict.txt``` file, but you can change this path by adding an output file to the ```predict_params``` parameter:
```bash
predict_params="...,output_file=str(my_path/file.txt),..."
```