Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/brain-facens/autotrain
The motivation behind this project stems from one of the most repetitive and time-consuming tasks in Machine Learning, particularly in Computer Vision: data labeling. A complete pipeline has been developed, covering everything from data preparation and transformation into the appropriate formats for image segmentation or object detection.
- Host: GitHub
- URL: https://github.com/brain-facens/autotrain
- Owner: brain-facens
- Created: 2024-10-14T15:02:00.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-10-16T20:27:13.000Z (about 1 month ago)
- Last Synced: 2024-10-18T19:47:34.805Z (30 days ago)
- Language: Python
- Homepage:
- Size: 5.65 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Autotrain
## About the Project
The project provides a complete pipeline, covering everything from data preparation and transformation into the appropriate formats for image segmentation or object detection, to splitting the data into training and validation sets. Training starts from a model pre-trained on data that is similar or relevant to the same task. This allows the model to leverage knowledge gained from previous workloads, speeding up fine-tuning and improving both efficiency and accuracy when handling new data.
The motivation behind this project stems from one of the most repetitive and time-consuming tasks in Machine Learning, particularly in Computer Vision: data labeling. This process, which is often done manually, can take a lot of time. This project aims to reduce that burden by automating part of the work, such as creating bounding boxes in images, saving precious hours — even when it comes to identifying kittens in your photos.
---
### Semi-Supervised Learning
Semi-supervised learning is a technique that combines a small amount of labeled data with a large amount of unlabeled data to train computer vision models more efficiently. This approach is particularly useful when labeling large volumes of data is costly or time-consuming. During training, the model uses the labeled data to learn basic patterns and the unlabeled data to refine these representations, resulting in higher accuracy and better generalization in real-world scenarios.
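One common way to put unlabeled data to work is pseudo-labeling: a pre-trained model predicts on unlabeled images, and only confident predictions are kept as labels for the next round of training. The sketch below illustrates that idea in plain Python. It is not taken from this project's code: the `Detection` class, the `select_pseudo_labels` function, the 0.5 threshold, and the sample predictions are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One predicted box (hypothetical structure, not the project's own)."""
    class_id: int
    confidence: float
    box: tuple  # (x, y, width, height)


def select_pseudo_labels(predictions, threshold=0.5):
    """Split images into pseudo-labeled and still-unlabeled groups.

    `predictions` maps an image name to the detections a pre-trained model
    produced for it. Detections at or above `threshold` are kept as
    pseudo-labels; images with no confident detection stay unlabeled.
    """
    labeled, unlabeled = {}, []
    for image, detections in predictions.items():
        confident = [d for d in detections if d.confidence >= threshold]
        if confident:
            labeled[image] = confident
        else:
            unlabeled.append(image)
    return labeled, unlabeled


# Made-up model output for three images (assumption, for illustration only).
predictions = {
    "cat_01.jpg": [Detection(0, 0.91, (10, 20, 50, 40))],
    "cat_02.jpg": [Detection(0, 0.32, (5, 5, 30, 30))],
    "empty.jpg": [],
}
labeled, unlabeled = select_pseudo_labels(predictions, threshold=0.5)
print(sorted(labeled))  # ['cat_01.jpg']
print(unlabeled)        # ['cat_02.jpg', 'empty.jpg']
```

Raising the threshold trades fewer pseudo-labels for cleaner ones, while lowering it does the opposite.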
---
## Limitations
This project has some technical limitations:
- It uses the YOLOv8 architecture, which can be easily updated to newer versions. However, at the time of development, this model was the most consistent and reliable for generalization.
- The current task classes are limited to two: segmentation and object detection. Updating the project to support more types of data for automatic retraining is planned for future versions.
- The data split is internally set to 70% for training and 30% for testing. Future updates will allow users to manually input these values or use techniques to optimize the data splitting process.
## Installation
CLI
- Download the build for your Operating System from the "Releases" tab.
- Clone the repository:
```bash
cd
git clone https://github.com/brain-facens/autotrain.git
cd autotrain
# Install the necessary libraries
pip install -r requirements.txt
```
Note: The build you downloaded from the "Releases" tab should be placed inside the "autotrain" folder that you cloned.
Running:
- To format the dataset for segmentation:
```bash
./autotrain format segmentation --input_dir --output_positive_dir --output_negative_dir --model
```
- To format the dataset for object detection:
```bash
./autotrain format object_detection --input_dir --output_positive_dir --output_negative_dir --model
```
- To split the dataset:
```bash
./autotrain split_dataset --output_positive_dir
```
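As a rough illustration of the fixed 70% / 30% split mentioned under Limitations, the sketch below shuffles a list of file names and cuts it at the 70% mark. It is an assumption about how such a split could be done, not the project's actual implementation: the function name `split_train_test`, the seeded shuffle, and the example file names are all hypothetical.

```python
import random


def split_train_test(items, train_fraction=0.7, seed=0):
    """Shuffle items reproducibly and split them into train and test lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the shuffle repeatable for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names, for illustration only.
images = [f"img_{i:03d}.jpg" for i in range(10)]
train, test = split_train_test(images)
print(len(train), len(test))  # 7 3
```

Exposing `train_fraction` as a parameter is one way the split could later be made user-configurable.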
- To train the new model:
```bash
./autotrain train --model --dataset_yaml <.yaml with COCO format> --device --epochs --imgsz
```
Usage Help
Need help using the project? Use the following command to better understand all available commands:
```bash
./autotrain --help
```
Or, if you're having trouble with a specific command:
```bash
./autotrain <command> --help
```
## Errors
If you try to run it on Linux and get the following error:
```bash
./autotrain: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34` not found (required by ./autotrain)
./autotrain: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32` not found (required by ./autotrain)
```
run the libc6 library installation:
```bash
sudo apt update
sudo apt install libc6
```
## TODO
- [x] Create an option for the user to input the desired data split division
- [ ] Increase the activities that the package supports for automatic retraining
- [ ] Expand to more architectures and networks to be retrained, as well as open up for Data Science, NLP, and LLM activities
## Collaborators
We would like to thank the following people who contributed to this project: