
https://github.com/ayoolaolafenwa/chatlm

The github repository of ChatLM.

artificial-intelligence chat chatgpt gpt largelanguagemodel llm machine-learning natural-language-processing nlp



Host: GitHub
URL: https://github.com/ayoolaolafenwa/chatlm
Owner: ayoolaolafenwa
License: apache-2.0
Created: 2023-06-28T10:33:50.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-03-16T12:03:25.000Z (11 months ago)
Last Synced: 2024-03-16T14:23:29.645Z (11 months ago)
Topics: artificial-intelligence, chat, chatgpt, gpt, largelanguagemodel, llm, machine-learning, natural-language-processing, nlp
Language: Python
Homepage: https://huggingface.co/spaces/ayoolaolafenwa/ChatLM
Size: 39.1 KB
Stars: 10
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


README

        
## ChatLM 

ChatLM is a chat large language model fine-tuned from the pretrained [Falcon-1B model](https://huggingface.co/tiiuae/falcon-rw-1b) on the [chat-bot-instructions prompts dataset](https://huggingface.co/datasets/ayoolaolafenwa/sft-data).

The training data consists of ordinary day-to-day human conversations. Because that data is limited, the model does not generalize well to tasks such as coding or current affairs, and hallucinations may occur.

# Have a live chat with ChatLM on its Hugging Face Space: https://huggingface.co/spaces/ayoolaolafenwa/ChatLM

### Install Required Packages

```
pip install transformers
pip install accelerate
pip install einops
pip install bitsandbytes
```
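Before loading the model, it can help to confirm that the packages listed above are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is illustrative and not part of ChatLM.

```python
from importlib.util import find_spec

# Import names of the packages installed above.
REQUIRED = ["transformers", "accelerate", "einops", "bitsandbytes"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```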

## Load Model in bfloat16

``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"

# Load the tokenizer and the model weights in bfloat16 on the GPU
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda")

prompt = ": Give me a financial advise on investing in stocks. : "

# Tokenize the prompt and move the tensors to the model's device
tokens = tokenizer(prompt, return_tensors="pt")
token_ids = tokens.input_ids.to(model.device)
attention_mask = tokens.attention_mask.to(model.device)

# Sample a single continuation
outputs = model.generate(
    input_ids=token_ids,
    attention_mask=attention_mask,
    max_length=2048,
    do_sample=True,
    num_return_sequences=1,
    top_k=10,
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode the sequence and strip the end-of-text marker
output_text = tokenizer.decode(outputs[0])
output_text = output_text.replace("<|endoftext|>", "")
print(output_text)
```
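The decoded text printed above contains the prompt followed by the model's continuation. If only the reply is wanted, a small post-processing step can remove the prompt echo. This is a hedged sketch: `extract_reply` is a hypothetical helper, the sample string is made up, and a tokenizer may not reproduce the prompt character for character, which is why the prefix is checked before it is removed.

```python
EOS_TOKEN = "<|endoftext|>"  # end-of-text marker removed in the snippet above


def extract_reply(decoded: str, prompt: str, eos_token: str = EOS_TOKEN) -> str:
    """Return only the newly generated text from a decoded sequence."""
    text = decoded.replace(eos_token, "")
    # The generated sequence holds the prompt tokens followed by the
    # continuation, so the decoded string normally starts with the prompt.
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Made-up decoded output, for illustration only.
decoded = ": Hello. : Hi, how can I help?<|endoftext|>"
print(extract_reply(decoded, ": Hello. : "))  # Hi, how can I help?
```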

## Load Model in bfloat16 and int8

``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"

# Load the tokenizer, then load the model with 8-bit weights (requires bitsandbytes)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16, load_in_8bit=True
)

prompt = ": Give me a financial advise on investing in stocks. : "

# Tokenize the prompt and move the tensors to the model's device
tokens = tokenizer(prompt, return_tensors="pt")
token_ids = tokens.input_ids.to(model.device)
attention_mask = tokens.attention_mask.to(model.device)

# Sample a single continuation
outputs = model.generate(
    input_ids=token_ids,
    attention_mask=attention_mask,
    max_length=2048,
    do_sample=True,
    num_return_sequences=1,
    top_k=10,
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode the sequence and strip the end-of-text marker
output_text = tokenizer.decode(outputs[0])
output_text = output_text.replace("<|endoftext|>", "")
print(output_text)
```
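The practical difference between the two loading modes is memory: bfloat16 stores each weight in 2 bytes and int8 in 1 byte. The back-of-the-envelope sketch below assumes a round figure of one billion parameters, which is not an exact count for this model, and it covers weights only. Activations and the generation cache are ignored, and 8-bit loading may keep some layers at higher precision, so real savings can be smaller.

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 1e9  # assumption: round figure for a "1B" model

bf16 = weight_memory_gib(N_PARAMS, 2)  # bfloat16: 2 bytes per weight
int8 = weight_memory_gib(N_PARAMS, 1)  # int8: 1 byte per weight
print(f"bfloat16: {bf16:.2f} GiB, int8: {int8:.2f} GiB")
# bfloat16: 1.86 GiB, int8: 0.93 GiB
```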

# Training procedure for Supervised Finetuning

## Dataset Preparation

The [Chatbot Instruction Prompts dataset](https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts/viewer/alespalla--chatbot_instruction_prompts) was processed into a supervised fine-tuning format in which each example pairs a user prompt with its corresponding response.

##### Download Data

``` python
from datasets import load_dataset

# Download the training split of the dataset
dataset = load_dataset("alespalla/chatbot_instruction_prompts", split="train")

# Save a local copy and export it to CSV for processing
dataset.save_to_disk("ChatBotInsP")
dataset.to_csv("CIPtrain.csv")
```

##### Code to process dataset into Supervised finetuning format

``` python
import os

import pandas as pd

# Read the downloaded dataset from the CSV file
text_data = pd.read_csv("CIPtrain.csv")

# Create empty lists for prompts and responses
prompts = []
responses = []

# Loop through the rows of the dataset
for i in range(len(text_data)):
    # Get the prompt and response of the current row as strings
    prompt = str(text_data["prompt"][i])
    response = str(text_data["response"][i])

    # Add the prompt to the prompts list with its tag
    prompts.append(": " + prompt)

    # Add the response to the responses list with its tag
    responses.append(": " + response)

# Create a new dataframe with prompts and responses columns
new_data = pd.DataFrame({"prompt": prompts, "response": responses})

# Make sure the output directory exists, then write the dataframe to a CSV file
os.makedirs("MyData", exist_ok=True)
new_data.to_csv("MyData/chatbot_instruction_prompts_train.csv", index=False)
```

Each user prompt in the dataset is prefixed with a tag marking the user turn, and each corresponding response is prefixed with a tag marking the chatbot turn.

See the modified dataset at https://huggingface.co/datasets/ayoolaolafenwa/sft-data.
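The tagging step can also be written as a small function that takes the tags as parameters, which makes the format easy to try on a single pair. This is an illustrative sketch: `tag_pair` is a hypothetical helper, and `USER` and `BOT` are placeholder tags chosen for readability, not the tags used in the dataset. With the default empty tags it produces the same `": " + text` strings as the processing code above.

```python
def tag_pair(prompt, response, user_tag="", bot_tag=""):
    """Prefix a prompt and its response with their speaker tags."""
    return f"{user_tag}: {prompt}", f"{bot_tag}: {response}"


# Placeholder tags, for illustration only.
tagged_prompt, tagged_response = tag_pair(
    "What is the capital of France?", "Paris.", user_tag="USER", bot_tag="BOT"
)
print(tagged_prompt)    # USER: What is the capital of France?
print(tagged_response)  # BOT: Paris.
```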

### Training 

ChatLM was produced by supervised fine-tuning of the pretrained [Falcon 1-billion-parameter model](https://huggingface.co/tiiuae/falcon-rw-1b), which was trained on 350 billion tokens of RefinedWeb. Fine-tuning ran on a single H100 GPU for 1 epoch and reached a perplexity of *1.738*.

The full supervised fine-tuning code is [here](https://github.com/ayoolaolafenwa/ChatLM/blob/main/trainSFT.py).

The training config is [here](https://github.com/ayoolaolafenwa/ChatLM/blob/main/trainConf.conf).
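To put the perplexity figure in context: under the standard definition, perplexity is the exponential of the mean per-token negative log-likelihood. The README does not say how or on which split the value was measured, so the sketch below only illustrates that general relationship. The per-token losses are made up.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Mean loss that corresponds to the reported perplexity of 1.738.
mean_loss = math.log(1.738)
print(f"mean loss: {mean_loss:.4f}")

# Made-up per-token losses, for illustration only.
print(f"perplexity: {perplexity([0.4, 0.6, 0.7]):.3f}")
```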

### Run Training with accelerate

```
accelerate launch --config_file trainConf.conf trainSFT.py
```