https://github.com/muneeb1030/finetune-tiny-llama
Fine-tuning the Tiny Llama model to mimic my professor's writing style using the Llama Factory. The project involves data collection, preprocessing, preparation, fine-tuning, and evaluation.
data data-preparation data-preprocessing finetuning llama-factory llm pymupdf selenium-python spacy tinyllama webscraping
- Host: GitHub
- URL: https://github.com/muneeb1030/finetune-tiny-llama
- Owner: Muneeb1030
- Created: 2024-07-05T21:30:28.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-07-17T09:39:25.000Z (5 months ago)
- Last Synced: 2024-10-14T04:02:54.894Z (2 months ago)
- Topics: data, data-preparation, data-preprocessing, finetuning, llama-factory, llm, pymupdf, selenium-python, spacy, tinyllama, webscraping
- Language: Jupyter Notebook
- Homepage:
- Size: 390 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
# Fine-Tuning Tiny Llama using Llama Factory To Mimic Professor's Writing Style
## Overview
This project aims to fine-tune the Tiny Llama model using the Llama Factory to mimic my professor's writing style. The process involves several phases, including data collection, preprocessing, preparation, model fine-tuning, and evaluation. The final goal is to create a model that can generate text in the style of my professor's academic writings.
## Phases of the Project
### Phase 1: Data Collection
The first step in this project was to collect data by scraping my professor's Google Scholar page. The objective was to gather a comprehensive set of research articles published by the professor.
- **Tool Used:** Selenium
- **Details:** Selenium was used to automate the process of accessing the Google Scholar page and downloading the available PDFs of the research articles.
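
A minimal sketch of the collection step, assuming the PDF links are plain `<a href>` elements on the page that Selenium loads. The function names, the Chrome driver, and the link-filtering rule are illustrative assumptions, not the notebook's actual code; the real Google Scholar layout may need extra navigation before the PDF links appear.

```python
from urllib.parse import urljoin, urlparse


def pick_pdf_links(hrefs, base_url):
    """Return absolute, de-duplicated links whose path ends in .pdf."""
    seen, result = set(), []
    for href in hrefs:
        if not href:  # anchors without an href yield None
            continue
        url = urljoin(base_url, href)
        if urlparse(url).path.lower().endswith(".pdf") and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def collect_pdf_links(profile_url):
    """Open the page in a browser and collect candidate PDF links (hypothetical helper)."""
    # Imported lazily so pick_pdf_links works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(profile_url)
        anchors = driver.find_elements(By.TAG_NAME, "a")
        hrefs = [a.get_attribute("href") for a in anchors]
    finally:
        driver.quit()
    return pick_pdf_links(hrefs, profile_url)
```

Keeping the filtering in a separate pure function makes it easy to check the rule on a handful of URLs before launching a browser.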
### Phase 2: Data Preprocessing
After collecting the PDFs, the next step was to preprocess these documents to ensure they were in a usable format for training the model.
- **Purpose:** Normalize the content while preserving the writing style.
- **Tools Used:** PyMuPDF
### Steps:
- Remove page headers, footers, images, and tables along with their captions.
- Convert the remaining content into paragraph format, as individual words and phrases are insufficient for capturing writing style.
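
The steps above can be sketched roughly as follows, assuming PyMuPDF's block-level extraction (`page.get_text("blocks")`). The 50-point margin, the caption prefixes, and the helper names are assumptions chosen for illustration. This sketch drops image blocks and caption lines only; removing table bodies would need additional logic.

```python
def is_body_block(block, page_height, margin=50):
    """Decide whether a PyMuPDF block tuple looks like body text."""
    x0, y0, x1, y1, text, _block_no, block_type = block
    if block_type != 0:  # 0 = text block, 1 = image block
        return False
    if y1 <= margin or y0 >= page_height - margin:  # header or footer band
        return False
    # Heuristic: treat blocks that start like a caption as non-body text.
    if text.lstrip().lower().startswith(("table", "figure", "fig.")):
        return False
    return True


def to_paragraph(text):
    """Collapse a block's lines into a single paragraph string."""
    text = text.replace("-\n", "")  # heuristic: rejoin words split at line ends
    return " ".join(text.split())


def extract_paragraphs(pdf_path):
    """Return cleaned body paragraphs from one PDF (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily so the helpers above stay dependency-free

    paragraphs = []
    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            for block in page.get_text("blocks"):
                if is_body_block(block, page.rect.height):
                    paragraph = to_paragraph(block[4])
                    if paragraph:
                        paragraphs.append(paragraph)
    finally:
        doc.close()
    return paragraphs
```

The hyphen rule will also join genuine hyphenated compounds that happen to break at a line end, so it is worth spot-checking the output on a few papers.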
### Phase 3: Data Preparation
The preprocessed data needed to be formatted according to the requirements of the Llama Factory model training process.
- **Initial Tools Tried:** spaCy, TF-IDF, BERT
- **Tool That Worked:** OpenAI API
### Process:
- Use the OpenAI API to generate the required data format.
- Ensure that the data is structured correctly for input into the Llama Factory model.
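
As an illustration of the target structure, the sketch below builds records in the Alpaca-style layout (`instruction`, `input`, `output`) that Llama Factory accepts, plus a `dataset_info.json` entry pointing at the data file. How the OpenAI API was prompted is not documented here, so the instruction generator is left as a pluggable function; `make_instruction`, `professor_style`, and the helper names are hypothetical.

```python
import json


def build_records(paragraphs, make_instruction):
    """Pair each paragraph with an instruction in Alpaca-style format.

    make_instruction is a callable (for example, a wrapper around an
    OpenAI API call) that returns an instruction string for a paragraph.
    """
    records = []
    for paragraph in paragraphs:
        if not paragraph.strip():
            continue  # skip empty paragraphs
        records.append(
            {
                "instruction": make_instruction(paragraph),
                "input": "",
                "output": paragraph,
            }
        )
    return records


def dataset_info_entry(name, file_name):
    """Build the entry that registers the dataset in dataset_info.json."""
    return {name: {"file_name": file_name}}


def to_json(obj):
    """Serialize records or config entries as readable UTF-8 JSON."""
    return json.dumps(obj, ensure_ascii=False, indent=2)
```

When adding the entry to an existing `dataset_info.json`, merge it into the file rather than overwriting it, so that other registered datasets are kept.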
### Phase 4: Model Fine-Tuning
With the data prepared, the next phase involved fine-tuning the Tiny Llama model.
- **Environment:** Google Colab
- **Tools Used:** Llama Factory
### Steps:
- Set up the Google Colab notebook and import necessary libraries.
- Load the Llama Factory UI and integrate the dataset.
- Define the prompt format and other configurations required by Llama Factory.
- Run the fine-tuning process to train the Tiny Llama model on the professor's writing style.
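
This project configured training through the Llama Factory UI. For reference, similar settings can be written to a config file for `llamafactory-cli train`. The sketch below only assembles such a config; every value in it (the checkpoint ID, LoRA, batch size, learning rate, epochs, template) is an illustrative assumption, not the configuration used in this project.

```python
import json


def build_train_config(dataset_name, output_dir):
    """Assemble an illustrative supervised fine-tuning config."""
    return {
        "stage": "sft",
        "do_train": True,
        # Assumed checkpoint; the README does not name the exact model ID.
        "model_name_or_path": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "finetuning_type": "lora",
        "dataset": dataset_name,  # must match a key in dataset_info.json
        # Placeholder; pick the template that matches the base model.
        "template": "default",
        "cutoff_len": 1024,
        "output_dir": output_dir,
        "per_device_train_batch_size": 2,
        "gradient_accumulation_steps": 8,
        "learning_rate": 1e-4,
        "num_train_epochs": 3.0,
        "fp16": True,
    }


def config_to_json(config):
    """Serialize the config so it can be saved to a .json file."""
    return json.dumps(config, indent=2)
```

Saving the output of `config_to_json` to a file keeps the run reproducible outside the UI, which is helpful when the Colab session resets.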
### Phase 5: Model Evaluation
The final phase focused on evaluating the performance of the fine-tuned model to ensure it accurately mimics the professor's writing style.
### Process:
- Generate sample texts using the fine-tuned model.
- Compare the generated texts with the original writings to assess similarity in style and content.
- Make any necessary adjustments and re-train if needed.
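
The comparison step could be supported with a rough numeric signal. The sketch below is one possible stylometric check, not the evaluation used in this project: it compares how often a small, assumed set of function words appears in a generated text and in a reference text, using cosine similarity. It says nothing about content accuracy and works best alongside manual reading.

```python
import math
import re
from collections import Counter

# Assumed function-word list; extend it for a more sensitive comparison.
FUNCTION_WORDS = (
    "the", "of", "and", "to", "in", "a", "that", "is",
    "we", "this", "for", "with", "as", "on", "are", "by",
)


def style_vector(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = max(len(words), 1)  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def style_similarity(generated, reference):
    """Score in [0, 1]; higher means more similar function-word usage."""
    return cosine(style_vector(generated), style_vector(reference))
```

Scores are only meaningful relative to each other, for example comparing the base model's output and the fine-tuned model's output against the same set of original paragraphs.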
## Getting Started
### Prerequisites
- Python 3.x
- Selenium
- PyMuPDF
- OpenAI API
### Installation
- Clone the repository
```bash
git clone https://github.com/muneeb1030/finetune-tiny-llama.git
```
- Install the necessary Python packages
```bash
pip install selenium pymupdf openai
```
## Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
## Contact
For any questions or suggestions, please contact me at muhammadmuneeburrehman.vercel.app