https://github.com/swamikannan/creating-llamaindex-embeddingqafinetunedataset-compatible-files
Guidance on using LlamaIndex's finetuning() functions requires you to first create synthetic question-answer pairs using generate_qa_embedding_pairs(). Here's how you can create an EmbeddingQAFinetuneDataset object from your own QA dataset.
- Host: GitHub
- URL: https://github.com/swamikannan/creating-llamaindex-embeddingqafinetunedataset-compatible-files
- Owner: SwamiKannan
- License: mit
- Created: 2024-02-21T16:22:17.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-06T11:14:11.000Z (over 1 year ago)
- Last Synced: 2025-03-03T16:48:22.374Z (7 months ago)
- Language: Python
- Size: 2.86 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Creating LlamaIndex EmbeddingQAFinetuneDataset compatible inputs
## Introduction
An important lever for improving RAG performance is to finetune the embeddings model itself. The embeddings model is used to "embed" the input text into vectors that are stored in the vectorstore. Hence, the better your embeddings model aligns with the data and terminology of your domain, the better the RAG pipeline will be at retrieving the correct documents to pass to your LLM.
LlamaIndex provides a robust library, llama_index.finetuning, to accomplish this. To use it, we have to instantiate an EmbeddingQAFinetuneDataset object from our data. One way to instantiate this object is from a JSON file in a specific format.
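For reference, here is a hedged sketch of the JSON layout that EmbeddingQAFinetuneDataset expects and how it can be loaded. The queries / corpus / relevant_docs keys follow the format described in the LlamaIndex docs; the example ids, texts, and the exact import path (which varies across LlamaIndex versions) are assumptions:
```
import json

# Assumed EmbeddingQAFinetuneDataset layout (per the LlamaIndex docs):
#   "queries":       query_id -> question text
#   "corpus":        node_id  -> reference / answer text
#   "relevant_docs": query_id -> list of node_ids containing the answer
dataset_dict = {
    "queries": {"q1": "What does the embeddings model do in a RAG pipeline?"},
    "corpus": {"n1": "The embeddings model converts input text into vectors stored in the vector store."},
    "relevant_docs": {"q1": ["n1"]},
}

with open("qa_dataset.json", "w") as f:
    json.dump(dataset_dict, f)

# Import path differs between LlamaIndex versions (e.g. llama_index.finetuning
# vs llama_index.core.evaluation); adjust to your installed version.
from llama_index.finetuning import EmbeddingQAFinetuneDataset

dataset = EmbeddingQAFinetuneDataset.from_json("qa_dataset.json")
print(len(dataset.queries), "queries loaded")
```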
The data in the JSON file is typically a question-answer dataset or a question-reference dataset (where the answer is not explicitly provided, but the reference text in which the answer resides is; this is an even more powerful paradigm for RAG models, since it is the basis on which they work). All guides (including the LlamaIndex documentation) create this data (question-reference sets) from a link or HTML content using generate_qa_embedding_pairs(), i.e. the input is a text file of all the relevant content and the LLM itself creates the question-reference datasets.
However, this approach is not always suitable, for a couple of reasons:
- It is expensive (OpenAI or Cohere) or computationally intensive (multi-billion-parameter open-source models) for large amounts of content, e.g. 50K pages of Wikipedia
- The generated question-reference pairs may not be coherent, depending on the capability of the model or the nuance in the data
Hence, this repo seeks to leverage external question-answer and question-reference datasets already available on HuggingFace, or any other data. The repo takes as input a simple dictionary or a json file in a pre-defined format containing question / answer or question / reference details, and transforms it into an EmbeddingQAFinetuneDataset-compatible json file that can, as a downstream activity, be used to finetune your embedding model.
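To make that downstream step concrete, here is a hedged sketch of finetuning an embedding model with the generated file, following the SentenceTransformersFinetuneEngine flow from the LlamaIndex finetuning docs. The file name, model id, and output path are placeholders, and import paths may differ by LlamaIndex version:
```
# Sketch only: names follow the LlamaIndex finetuning docs; adjust imports and
# paths to your installed version and to the json file produced by this repo.
from llama_index.finetuning import (
    EmbeddingQAFinetuneDataset,
    SentenceTransformersFinetuneEngine,
)

train_dataset = EmbeddingQAFinetuneDataset.from_json("processed_data/templated_json/train.json")

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en-v1.5",    # base embedding model (placeholder)
    model_output_path="finetuned_model",  # where the tuned model is written
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()
```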
## Usage
1. In the main.py file, review the create_dataset() function to understand the structure of the file that needs to be provided as input.
2. Create your json input file in the format shown in create_dataset(). You can also refer to the structure below in the "JSON template" section.
3. Run the following from a command prompt:
```
python main.py
```
4. The code also prints out the number of items in your json file to confirm processing.
### Addendum
Added code to create the json file referred to in steps 1 and 2 above. You are not required to use this code; rather, you can create your own json file as per the template structure mentioned below.
#### 1. Using a HuggingFace dataset for finetuning
* Write a transform function that will create the columns 'question' and 'answer' in your dataset (as a Pandas dataframe). Illustratively, this will look like:
```
def transform_data(df):
    df['question'] = <command to process the data that gives you the text for your question>
    df['answer'] = <command to process the data that gives you the text for your response / answer / context>
```
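For instance, a hedged sketch of such a transform for the HuggingFace 'sciq' dataset, assuming its 'question' column already holds the question and its 'support' column holds the reference passage (column names are from the public dataset card, not from this repo):
```
def transform_data(df):
    # 'question' already exists in sciq; derive 'answer' from the supporting
    # passage ('correct_answer' could be used instead for a pure QA dataset)
    df['answer'] = df['support']
```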
* Create an HFJSONCreator object:
```
from process_data import HFJSONCreator

hfjsonobject = HFJSONCreator(source, transform_data, split='validation', test_ratio=0, to_disk=True)
```
where:
- source is the HuggingFace dataset name / link, as shown in the image below
- transform_data is the transform function defined in the step above
- split is the split you want ('train', 'test', 'validation', etc.)
- test_ratio - if you want to split the dataset into train and test, state the ratio of the test set; else set it to 0
- to_disk - set to True if you want to save the downloaded dataset to the local drive
Refer to the image below for the 'source' and 'split' parameters

* Run the following lines of code to create and write the json file to disk:
```
hfjsonobject.create_all_dicts()
hfjsonobject.write_dict()
```
* Save all this code in a .py file in the src folder
* On running this code:
- The final json file will be created in /processed_data/templated_json
- If to_disk is True, the dataset will be saved in /datasets
#### 2. Using your own local data for finetuning
* Write a data loader function. This function has to load the data into a dataframe, e.g. if the base file is a csv, load the data using pd.read_csv(); if the base file is an Excel file, use pd.read_excel(). Ensure that this function returns a pandas dataframe (see the sketch below).
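A minimal sketch of such a loader, assuming the base file is a csv and that the loader receives the path you pass as data_path_name (how LocalJSONCreator actually invokes it is defined in this repo's process_data module):
```
import pandas as pd

def load_data(path):
    # Read the raw file into a dataframe; swap in pd.read_excel(), pd.read_json(),
    # etc. depending on your source format.
    df = pd.read_csv(path)
    return df
```
You would then pass load_data as the load_data_fn argument in the step below.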
* Write a transform function that will create the columns 'question' and 'answer' in your dataset (as a Pandas dataframe). Illustratively, this will look like:
```
def transform_data(df):
    df['question'] = <code to process the data that gives you the text for your question>
    df['answer'] = <code to process the data that gives you the text for your response / answer / context>
```
* Create a LocalJSONCreator object:
```
from process_data import LocalJSONCreator

localjsonobject = LocalJSONCreator(data_path_name, load_data_fn, transform_data, split='train', test_ratio=0)
```
where:
- data_path_name is the path to the local data file to be loaded
- load_data_fn is the data loader function created above
- transform_data is the transform function defined in the step above
- split is the split you want ('train', 'test', 'validation', etc.)
- test_ratio - if you want to split the dataset into train and test, state the ratio of the test set; else set it to 0
* Run the following lines of code to create and write the json file to disk:
```
localjsonobject.create_all_dicts()
localjsonobject.write_dict()
```
* Save all this code in a .py file in the src folder
* On running this code:
- The final json file will be created in /processed_data/templated_json
- If to_disk is True, the dataset will be saved in /datasets
### JSON template

## Image credits:
Base image for the cover generated using Segmind's Stable Diffusion XL 1.0 model. Additional image editing by me.
Prompt: cinematic film still, 4k, realistic, of a man casting spells on documents, Fujifilm XT3, long shot, ((low light:1.4)), landscape , very wide angle shot, somber, vignette, highly detailed, high budget Hollywood movie, bokeh, cinemascope, moody, epic, neon, gorgeous, film grain, grainy