https://github.com/msoedov/validex

Simplifies the retrieval, extraction, and training of structured data from various unstructured sources.
https://github.com/msoedov/validex

llm-extraction structured-data-extraction structured-output

Last synced: about 2 months ago
JSON representation

Simplifies the retrieval, extraction, and training of structured data from various unstructured sources.

Host: GitHub
URL: https://github.com/msoedov/validex
Owner: msoedov
License: mit
Created: 2024-07-23T22:12:20.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-03-26T23:15:52.000Z (2 months ago)
Last Synced: 2025-03-28T11:07:51.203Z (2 months ago)
Topics: llm-extraction, structured-data-extraction, structured-output
Language: Python
Homepage:
Size: 377 KB
Stars: 137
Watchers: 3
Forks: 12
Open Issues: 9
Metadata Files:
- Readme: Readme.md
- License: LICENSE

Awesome Lists containing this project

README

        # ValidEx

ValidEx is a Python library that simplifies retrieval, extraction and training of structured data from various unstructured sources.

















## 🏷 Features

- **Structured Data Extraction**: Parse and extract structured data from various unstructured sources including web pages, text files, PDFs, and more.

- **Heuristic data cleaning**  text normalization (case, whitespace, special characters), deduplication

- **Concurrency Support**: Efficiently process multiple data sources simultaneously.

- **Retry Mechanism**: Implement automatic retries for failed extraction attempts.

- **Hallucination check**: Implement strategies to detect and reduce LLM hallucinations in extracted data.

- **Fine-tuning Dataset Export**: Generate datasets in JSONL format for OpenAI chat fine-tuning.

- **Local Model Creation**: Build custom extraction models combining Named Entity Recognition (NER) and regular expressions.

## 📦 Installation

To get started with ValidEx, simply install the package using pip:

```shell

pip install validex

```

## ⛓️ Quick Start

```python

import validex

from pydantic import BaseModel

class Superhero(BaseModel):

    name: str

    age: int

    power: str

    enemies: list[str]

def main():

    app = validex.App()

    app.add("https://www.britannica.com/topic/list-of-superheroes-2024795")

    app.add("*.txt")

    app.add("*.pdf")

    app.add("*.md")

    superheroes = app.extract(Superhero)

    print(f"Extracted superheroes: {list(superheroes)}")

    first_hero = app.extract_first(Superhero)

    print(f"First extracted hero: {first_hero}")

    print(f"Total cost: ${app.cost()}")

    print(f"Total usage: {app.usage}")

if __name__ == "__main__":

    main()

```

```python

[

    (

        Superhero(

            name="Batman",

            age=81,

            power="Brilliant detective skills, martial arts",

            enemies=["Joker", "Penguin"],

        ),

        {"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},

    ),

    (

        Superhero(

            name="Wonder Woman",

            age=80,

            power="Superhuman strength, speed, agility",

            enemies=["Ares", "Cheetah"],

        ),

        {"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},

    ),

    (

        Superhero(

            name="Spider-Man",

            age=59,

            power="Wall-crawling, spider sense",

            enemies=["Green Goblin", "Venom"],

        ),

        {"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},

    ),

    (

        Superhero(

            name="Captain America",

            age=101,

            power="Super soldier serum, shield",

            enemies=["Red Skull", "Hydra"],

        ),

        {"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},

    ),

    (

        Superhero(

            name="Superman", age=35, power="Flight", enemies=["Lex Luthor", "Doomsday"]

        ),

        {"url": "https://www.britannica.com/robots.txt"},

    ),

    (

        Superhero(

            name="Wonder Woman",

            age=30,

            power="Super Strength",

            enemies=["Ares", "Cheetah"],

        ),

        {"url": "https://www.britannica.com/robots.txt"},

    ),

    (

        Superhero(

            name="Spider-Man",

            age=25,

            power="Wall-crawling",

            enemies=["Green Goblin", "Venom"],

        ),

        {"url": "https://www.britannica.com/robots.txt"},

    ),

]

```

### Hallucinations and autofix

```python

class Superhero(BaseModel):

    name: str

    age: int

    power: str

    enemies: list[str]

    def fix(self):

        # Logic to auto fix and normalize the generated data

        if self.age < 0:

            self.age = 0

    def check_hallucinations(self):

        # Check name

        if not re.match(r"^[A-Za-z\s-]+$", self.name):

            raise ValueError(f"Name '{self.name}' contains unusual characters")

        # Check age

        if self.age < 0 or self.age > 1000:

            raise ValueError(f"Age {self.age} seems unrealistic")

        # Check power

        if len(self.power) > 50:

            raise ValueError("Power description is unusually long")

        # Check enemies

        if len(self.enemies) > 10:

            raise ValueError("Unusually high number of enemies")

        for enemy in self.enemies:

            if not re.match(r"^[A-Za-z\s-]+$", enemy):

                raise ValueError(f"Enemy name '{enemy}' contains unusual characters")

```

### Experimental: Export and fine tunning

```python

# Use the OpenAI chat fine-tuning format to save data

app.export_jsonl("fine_tune.jsonl")

# Local model training

app.fit()

app.save("state.validex")

app.infer_extract("booob")

```

### Multi-model Extraction

ValidEx supports extracting multiple models at once

```python

class Superhero2(BaseModel):

    name: str

    age: int

    power: str

    enemies: list[str]

multi_results = app.multi_extract(Superhero, Superhero2)

print(f"Multi-extraction results: {multi_results}")

```

### Limitations

TBD

## 🛠️ Roadmap

## 👋 Contributing

Contributions to ValidEx are welcome! If you'd like to contribute, please follow these steps:

- Fork the repository on GitHub

- Create a new branch for your changes

- Commit your changes to the new branch

- Push your changes to the forked repository

- Open a pull request to the main ValidEx repository

Before contributing, please read the contributing guidelines.

## License

ValidEx is released under the MIT License.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/msoedov/validex

Awesome Lists containing this project

README