https://github.com/neptun-software/neptun.data.generators

Send scraped data from neptun-scraper to CHATGPT to generate training data for NEPTUN.AI.
https://github.com/neptun-software/neptun.data.generators

data generator

Last synced: 2 months ago
JSON representation

Send scraped data from neptun-scraper to CHATGPT to generate training data for NEPTUN.AI.

Host: GitHub
URL: https://github.com/neptun-software/neptun.data.generators
Owner: neptun-software
License: apache-2.0
Created: 2025-01-09T00:14:58.000Z (5 months ago)
Default Branch: main
Last Pushed: 2025-01-15T11:10:42.000Z (5 months ago)
Last Synced: 2025-02-11T18:34:13.754Z (4 months ago)
Topics: data, generator
Language: Python
Homepage:
Size: 1.53 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Dockerfile Processor

## Overview
This repository contains a Python-based tool for analyzing and processing Dockerfiles. It is designed to generate user-friendly questions and outputs in JSONL format for training or other applications. The tool uses the Hugging Face Inference API to interact with a language model, providing meaningful outputs based on the content of the Dockerfiles.

---

## Features
- **Automated Dockerfile Analysis:** Parses and validates Dockerfiles for processing.
- **Hugging Face Integration:** Uses `mistralai/Mistral-7B-Instruct-v0.3` for generating prompts and responses.
- **Error Handling & Retry Mechanism:** Handles API failures with retry logic and logs failures for later review.
- **Logging:** Tracks success and failure statistics in both console output and log files.
- **JSONL Output:** Generates well-structured JSONL files with system-user interactions for each Dockerfile.

---

## File Structure

```
.
├── dockerfiles
│ └── sources-gold # Directory containing input Dockerfiles
├── data
│ └── dockerfiles.jsonl # Output file storing processed data in JSONL format
├── logs
│ ├── success.log # Logs filenames successfully processed
│ └── failure.log # Logs filenames that failed processing
├── .env # Environment variables (e.g., API_TOKEN)
├── main.py # Main Python script for processing Dockerfiles
├── README.md # Repository documentation (this file)
└── requirements.txt # Python dependencies
```

---

## How It Works
1. **Dockerfile Parsing:**
- The tool reads Dockerfiles from the `dockerfiles/sources-gold` directory.
- Validates each file using the `dockerfile` library to ensure compatibility.

2. **Prompt Generation:**
- Constructs a prompt based on the content of the Dockerfile.
- Sends the prompt to the Hugging Face Inference API for processing.

3. **Response Handling:**
- Validates and cleans the model's response.
- Retries up to a defined limit if the response is invalid or empty.

4. **Output Generation:**
- Creates a JSONL entry with the Dockerfile content and the generated user question.
- Logs each file's success or failure into separate log files.

---

## Usage

### Prerequisites
1. Python 3.8+
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up the `.env` file with your Hugging Face API token:
```
API_TOKEN=your_hugging_face_api_token
```

### Running the Script
Execute the main script to process Dockerfiles:
```bash
python main.py
```

### Outputs
- **Processed Data:**
- Saved in `data/dockerfiles.jsonl` as structured JSONL.
- **Logs:**
- Successful files: `logs/success.log`
- Failed files: `logs/failure.log`

---

## Example JSONL Entry
```json
{
"text": "System: You are a Dockerfile generator.\n\nUser: Create a Dockerfile using...\n\nAssistant: FROM alpine:3.10\nRUN ..."
}
```

---

## Contributing
1. Fork the repository.
2. Create a new branch:
```bash
git checkout -b feature-branch
```
3. Make your changes and commit them:
```bash
git commit -m "Add new feature"
```
4. Push to your branch:
```bash
git push origin feature-branch
```
5. Open a pull request.

---

## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

---

## Contact
For questions or feedback, please create an issue in this repository.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/neptun-software/neptun.data.generators

Awesome Lists containing this project

README