https://github.com/swamikannan/creating-llamaindex-embeddingqafinetunedataset-compatible-files
Guidance on using LlamaIndex's finetuning() functions requires you to first create synthetic question-answer pairs using generate_qa_embedding_pairs(). Here's how you can create an EmbeddingQAFinetuneDataset object from your own QA dataset.
- Host: GitHub
- URL: https://github.com/swamikannan/creating-llamaindex-embeddingqafinetunedataset-compatible-files
- Owner: SwamiKannan
- License: mit
- Created: 2024-02-21T16:22:17.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-06T11:14:11.000Z (over 1 year ago)
- Last Synced: 2025-03-03T16:48:22.374Z (7 months ago)
- Language: Python
- Size: 2.86 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Creating LlamaIndex EmbeddingQAFinetuneDataset compatible inputs
## Introduction
An important lever for improving RAG performance is to finetune the embeddings model itself. The embeddings model is used to "embed" the input text into vectors that are stored in the vectorstore. Hence, the better your embeddings model aligns with the data and terminology of your domain, the better the RAG pipeline will be at retrieving the correct documents to pass to your LLM.
LlamaIndex provides a robust library, llama_index.finetuning, to accomplish this. To use it, we have to instantiate an EmbeddingQAFinetuneDataset object from our data. One way to instantiate this object is from a JSON file in a specific format.
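For reference, here is a hedged sketch of the JSON layout that EmbeddingQAFinetuneDataset expects and how it can be loaded. The queries / corpus / relevant_docs keys follow the format described in the LlamaIndex docs; the example ids, texts, and the exact import path (which varies across LlamaIndex versions) are assumptions:
```
import json

# Assumed EmbeddingQAFinetuneDataset layout (per the LlamaIndex docs):
#   "queries":       query_id -> question text
#   "corpus":        node_id  -> reference / answer text
#   "relevant_docs": query_id -> list of node_ids containing the answer
dataset_dict = {
    "queries": {"q1": "What does the embeddings model do in a RAG pipeline?"},
    "corpus": {"n1": "The embeddings model converts input text into vectors stored in the vector store."},
    "relevant_docs": {"q1": ["n1"]},
}

with open("qa_dataset.json", "w") as f:
    json.dump(dataset_dict, f)

# Import path differs between LlamaIndex versions (e.g. llama_index.finetuning
# vs llama_index.core.evaluation); adjust to your installed version.
from llama_index.finetuning import EmbeddingQAFinetuneDataset

dataset = EmbeddingQAFinetuneDataset.from_json("qa_dataset.json")
print(len(dataset.queries), "queries loaded")
```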
The data in the JSON file is typically a question-answer dataset or a question-reference dataset (where the answer is not explicitly provided, but the reference text in which the answer resides is; this is an even more powerful paradigm for RAG models, since it is the basis on which they work). All guides (including the LlamaIndex documentation) create this data (question-reference sets) from a link or HTML content using generate_qa_embedding_pairs(), i.e. the input is a text file of all the relevant content and the LLM itself creates the question-reference datasets.
However, this approach is not always suitable, for a couple of reasons:
- It is expensive (OpenAI or Cohere) or computationally intensive (multi-billion-parameter open-source models) for large amounts of content, e.g. 50K pages of Wikipedia
- The generated question-reference pairs may not be coherent, depending on the capability of the model or the nuance in the data
Hence, this repo seeks to leverage external question-answer and question-reference datasets already available on HuggingFace, or any other data. The repo takes as input a simple dictionary or a json file in a pre-defined format containing question / answer or question / reference details, and transforms it into an EmbeddingQAFinetuneDataset-compatible json file that can, as a downstream activity, be used to finetune your embedding model.
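To make that downstream step concrete, here is a hedged sketch of finetuning an embedding model with the generated file, following the SentenceTransformersFinetuneEngine flow from the LlamaIndex finetuning docs. The file name, model id, and output path are placeholders, and import paths may differ by LlamaIndex version:
```
# Sketch only: names follow the LlamaIndex finetuning docs; adjust imports and
# paths to your installed version and to the json file produced by this repo.
from llama_index.finetuning import (
    EmbeddingQAFinetuneDataset,
    SentenceTransformersFinetuneEngine,
)

train_dataset = EmbeddingQAFinetuneDataset.from_json("processed_data/templated_json/train.json")

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en-v1.5",    # base embedding model (placeholder)
    model_output_path="finetuned_model",  # where the tuned model is written
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()
```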
## Usage
1. In the main.py file, review the create_dataset() function to understand the structure of the file that needs to be provided as input.
2. Create your json input file in the format shown in create_dataset(). You can also refer to the structure below in the "JSON template" section.
3. Run the following from a command prompt:
```
python main.py
```
4. The code also prints out the number of items in your json file to confirm processing.
### Addendum
Added code to create the json file referred to in steps 1 and 2 above. You are not required to use this code; rather, you can create your own json file as per the template structure mentioned below.
#### 1. Using a HuggingFace dataset for finetuning
* Write a transform function that will create the columns 'question' and 'answer' in your dataset (as a Pandas dataframe). Illustratively, this will look like:
```
def transform_data(df):
    df['question'] = <command to process the data that gives you the text for your question>
    df['answer'] = <command to process the data that gives you the text for your response / answer / context>
```
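For instance, a hedged sketch of such a transform for the HuggingFace 'sciq' dataset, assuming its 'question' column already holds the question and its 'support' column holds the reference passage (column names are from the public dataset card, not from this repo):
```
def transform_data(df):
    # 'question' already exists in sciq; derive 'answer' from the supporting
    # passage ('correct_answer' could be used instead for a pure QA dataset)
    df['answer'] = df['support']
```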
* Create an HFJSONCreator object:
```
from process_data import HFJSONCreator

hfjsonobject = HFJSONCreator(source, transform_data, split='validation', test_ratio=0, to_disk=True)
```
where:
- source is the HuggingFace dataset name / link, as shown in the image below
- transform_data is the transform function defined in the step above
- split is the split you want ('train', 'test', 'validation', etc.)
- test_ratio - if you want to split the dataset into train and test, state the ratio of the test set; else set it to 0
- to_disk - set to True if you want to save the downloaded dataset to the local drive
Refer to the image below for the 'source' and 'split' parameters

* Run the following lines of code to create and write the json file to disk:
```
hfjsonobject.create_all_dicts()
hfjsonobject.write_dict()
```
* Save all this code in a .py file in the src folder
* On running this code:
- The final json file will be created in /processed_data/templated_json
- If to_disk is True, the dataset will be saved in /datasets
#### 2. Using your own local data for finetuning
* Write a data loader function. This function has to load the data into a dataframe, e.g. if the base file is a csv, load the data using pd.read_csv(); if the base file is an Excel file, use pd.read_excel(). Ensure that this function returns a pandas dataframe (see the sketch below).
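A minimal sketch of such a loader, assuming the base file is a csv and that the loader receives the path you pass as data_path_name (how LocalJSONCreator actually invokes it is defined in this repo's process_data module):
```
import pandas as pd

def load_data(path):
    # Read the raw file into a dataframe; swap in pd.read_excel(), pd.read_json(),
    # etc. depending on your source format.
    df = pd.read_csv(path)
    return df
```
You would then pass load_data as the load_data_fn argument in the step below.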
* Write a transform function that will create the columns 'question' and 'answer' in your dataset (as a Pandas dataframe). Illustratively, this will look like:
```
def transform_data(df):
    df['question'] = <code to process the data that gives you the text for your question>
    df['answer'] = <code to process the data that gives you the text for your response / answer / context>
```
* Create a LocalJSONCreator object:
```
from process_data import LocalJSONCreator

localjsonobject = LocalJSONCreator(data_path_name, load_data_fn, transform_data, split='train', test_ratio=0)
```
where:
- data_path_name is the path to the local data file to be loaded
- load_data_fn is the data loader function created above
- transform_data is the transform function defined in the step above
- split is the split you want ('train', 'test', 'validation', etc.)
- test_ratio - if you want to split the dataset into train and test, state the ratio of the test set; else set it to 0
* Run the following lines of code to create and write the json file to disk:
```
localjsonobject.create_all_dicts()
localjsonobject.write_dict()
```
* Save all this code in a .py file in the src folder
* On running this code:
- The final json file will be created in /processed_data/templated_json
- If to_disk is True, the dataset will be saved in /datasets
### JSON template

## Image credits:
Base image for the cover generated using Segmind's Stable Diffusion XL 1.0 model. Additional image editing by me.
Prompt: cinematic film still, 4k, realistic, of a man casting spells on documents, Fujifilm XT3, long shot, ((low light:1.4)), landscape , very wide angle shot, somber, vignette, highly detailed, high budget Hollywood movie, bokeh, cinemascope, moody, epic, neon, gorgeous, film grain, grainy