https://github.com/timothepearce/synda
A CLI for generating synthetic data
ai cli llm machine-learning synthetic-data
- Host: GitHub
- URL: https://github.com/timothepearce/synda
- Owner: timothepearce
- License: apache-2.0
- Created: 2025-01-10T16:34:41.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-14T07:30:26.000Z (about 1 year ago)
- Last Synced: 2025-12-05T14:36:36.742Z (7 months ago)
- Topics: ai, cli, llm, machine-learning, synthetic-data
- Language: Python
- Homepage:
- Size: 843 KB
- Stars: 42
- Watchers: 2
- Forks: 10
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Synda
> [!WARNING]
> This project is in its very early stages of development and should not be used in production environments.
> [!NOTE]
> PRs are more than welcome. Check the roadmap if you want to contribute, or create a discussion to submit a use case.
Synda (*synthetic data*) is a package that allows you to create synthetic data generation pipelines.
It is opinionated and fast by design, with plans to become highly configurable in the future.
## Installation
Synda requires Python 3.10 or higher.
You can install Synda using pipx:
```bash
pipx install synda
```
## Usage
1. Create a YAML configuration file (e.g., `config.yaml`) that defines your pipeline:
```yaml
input:
  type: csv
  properties:
    path: tests/stubs/simple_pipeline/source.csv
    target_column: content
    separator: "\t"

pipeline:
  - type: split
    method: chunk
    name: chunk_faq
    parameters:
      size: 500
      # overlap: 20
  - type: split
    method: separator
    name: sentence_chunk_faq
    parameters:
      separator: .
      keep_separator: true
  - type: generation
    method: llm
    parameters:
      provider: openai
      model: gpt-4o-mini
      template: |
        Ask a question regarding the sentence about the content.
        content: {chunk_faq}
        sentence: {sentence_chunk_faq}
        Instructions :
        1. Use english only
        2. Keep it short
        question:
  - type: clean
    method: deduplicate-tf-idf
    parameters:
      strategy: fuzzy
      similarity_threshold: 0.9
      keep: first
  - type: ablation
    method: llm-judge-binary
    parameters:
      provider: openai
      model: gpt-4o-mini
      consensus: all # any, majority
      criteria:
        - Is the question written in english?
        - Is the question consistent?

output:
  type: csv
  properties:
    path: tests/stubs/simple_pipeline/output.csv
    separator: "\t"
```
2. Add a model provider:
```bash
synda provider add openai --api-key [YOUR_API_KEY]
```
3. Generate some synthetic data:
```bash
synda generate config.yaml
```
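The `template` in the generation step refers to earlier steps by their `name` (`{chunk_faq}`, `{sentence_chunk_faq}`). As a rough illustration of that idea, and not Synda's actual implementation, the sketch below fills such a template with Python's built-in `str.format`. The `render_prompt` helper and the sample values are hypothetical.

```python
# Illustrative only: Synda's real templating code may differ.
TEMPLATE = (
    "Ask a question regarding the sentence about the content.\n"
    "content: {chunk_faq}\n"
    "sentence: {sentence_chunk_faq}\n"
    "question:"
)


def render_prompt(template: str, named_outputs: dict[str, str]) -> str:
    """Fill each {step_name} placeholder with that step's output (hypothetical helper)."""
    return template.format(**named_outputs)


# Sample values standing in for the outputs of the two named split steps.
prompt = render_prompt(
    TEMPLATE,
    {
        "chunk_faq": "Synda builds synthetic data pipelines. It reads CSV files.",
        "sentence_chunk_faq": "It reads CSV files.",
    },
)
print(prompt)
```

Each generated prompt therefore carries both the wider chunk and the specific sentence, which is why the two split steps in the example are given names.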
## Pipeline Structure
The Synda pipeline consists of three main parts:
- **Input**: Data source configuration
- **Pipeline**: Sequence of transformation and generation steps
- **Output**: Configuration for the generated data output
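To make the three-part layout concrete, here is a minimal sketch that checks whether a configuration contains all three top-level sections. It is an assumption-laden illustration: a plain dict stands in for the parsed YAML file, and `missing_sections` is a hypothetical helper, not part of Synda.

```python
# Illustrative only: a plain dict stands in for the parsed YAML file.
REQUIRED_SECTIONS = ("input", "pipeline", "output")


def missing_sections(config: dict) -> list[str]:
    """Return the top-level sections absent from the config (hypothetical helper)."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


config = {
    "input": {"type": "csv", "properties": {"path": "source.csv", "target_column": "content"}},
    "pipeline": [{"type": "split", "method": "chunk", "parameters": {"size": 500}}],
    "output": {"type": "csv", "properties": {"path": "output.csv"}},
}

print(missing_sections(config))            # []
print(missing_sections({"pipeline": []}))  # ['input', 'output']
```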
### Available Pipeline Steps
Currently, Synda supports five pipeline steps (the first four appear in the example above):
- **split**: Breaks down data (`method: chunk` or `method: separator`)
- **generation**: Generates content using LLMs (`method: llm`)
- **clean**: Removes duplicate data (`method: deduplicate-tf-idf`)
- **ablation**: Filters data based on defined criteria (`method: llm-judge-binary`)
- **metadata**: Adds metadata to text (`method: word-position`)
More steps will be added in future releases.
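For intuition about what some of these steps do, the sketch below mimics the `split` step's two methods from the example configuration (`chunk` and `separator`) and the `consensus` rule of the ablation step. These helpers are hypothetical and are not Synda's code. They assume that `size` counts characters and that `consensus` combines one yes/no verdict per criterion; neither assumption is confirmed by this README.

```python
# Illustrative only: these helpers mimic the configured behaviour, not Synda's code.

def split_chunk(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters (assumed unit)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_separator(text: str, separator: str, keep_separator: bool = True) -> list[str]:
    """Split text on `separator`, optionally re-attaching it to each piece."""
    parts = text.split(separator)
    pieces = []
    for index, part in enumerate(parts):
        is_last = index == len(parts) - 1
        if keep_separator and not is_last:
            part += separator
        if part.strip():  # skip empty fragments such as the tail after a final "."
            pieces.append(part.strip())
    return pieces


def passes(verdicts: list[bool], consensus: str = "all") -> bool:
    """Combine one yes/no verdict per criterion under the chosen consensus rule."""
    if consensus == "all":
        return all(verdicts)
    if consensus == "any":
        return any(verdicts)
    if consensus == "majority":
        return sum(verdicts) > len(verdicts) / 2
    raise ValueError(f"unknown consensus: {consensus}")


print(split_chunk("abcdefgh", 3))                    # ['abc', 'def', 'gh']
print(split_separator("First. Second. Third", "."))  # ['First.', 'Second.', 'Third']
print(passes([True, False], "all"))                  # False
print(passes([True, True, False], "majority"))       # True
```

Under these assumptions, `consensus: all` is the strictest setting: a single failed criterion drops the row, whereas `any` keeps it as long as one criterion passes.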
## Roadmap
The following features are planned for future releases.
### Core
- [x] Implement a Proof of Concept
- [x] Implement a common interface (Node) for input and output of each step
- [x] Add SQLite support
- [x] Add setter command for provider variable (openai, etc.)
- [x] Store each execution and step in DB
- [x] Add "split" -> "separator" step
- [x] Add named step
- [x] Store each Node in DB
- [x] Add "clean" -> "deduplicate" step
- [x] Allow injecting params from distant step into prompt
- [x] Add Ollama with structured generation output
- [x] Retry a failed run
- [ ] Add asynchronous behaviour for any CLI
- [ ] Add vLLM with structured generation output
- [ ] Batch processing logic (via a parameter) for LLM steps
- [ ] Move input into pipeline (step type: 'load')
- [ ] Move output into pipeline (step type: 'export')
- [ ] Allow pausing and resuming pipelines
- [ ] Trace each synthetic data item with its history
- [ ] Enable caching of each step's output
- [ ] Implement custom scriptable steps for developers
- [ ] Use Ray for large workloads
- [ ] Add a programmatic API
### Steps
- [x] input/output: .xls format
- [ ] input/output: Hugging Face datasets
- [ ] chunk: Semantic chunks
- [ ] clean: embedding deduplication
- [ ] ablation: LLMs as juries
- [ ] masking: NER (GliNER)
- [ ] masking: Regexp
- [ ] masking: PII
- [ ] metadata: Word position
- [ ] metadata: Regexp
### Ideas
- [ ] translations (SeamlessM4T)
- [ ] speech-to-text
- [ ] text-to-speech
- [ ] metadata extraction
- [ ] tSNE / PCA
- [ ] custom steps?
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.